Research Article  Open Access
Erik Marchi, Fabio Vesperini, Stefano Squartini, Björn Schuller, "Deep Recurrent Neural NetworkBased Autoencoders for Acoustic Novelty Detection", Computational Intelligence and Neuroscience, vol. 2017, Article ID 4694860, 14 pages, 2017. https://doi.org/10.1155/2017/4694860
Deep Recurrent Neural NetworkBased Autoencoders for Acoustic Novelty Detection
Abstract
In the emerging field of acoustic novelty detection, most research efforts are devoted to probabilistic approaches such as mixture models or statespace models. Only recent studies introduced (pseudo)generative models for acoustic novelty detection with recurrent neural networks in the form of an autoencoder. In these approaches, auditory spectral features of the next short term frame are predicted from the previous frames by means of LongShort Term Memory recurrent denoising autoencoders. The reconstruction error between the input and the output of the autoencoder is used as activation signal to detect novel events. There is no evidence of studies focused on comparing previous efforts to automatically recognize novel events from audio signals and giving a broad and in depth evaluation of recurrent neural networkbased autoencoders. The present contribution aims to consistently evaluate our recent novel approaches to fill this white spot in the literature and provide insight by extensive evaluations carried out on three databases: A3Novelty, PASCAL CHiME, and PROMETHEUS. Besides providing an extensive analysis of novel and stateoftheart methods, the article shows how RNNbased autoencoders outperform statistical approaches up to an absolute improvement of 16.4% average measure over the three databases.
1. Introduction
Novelty detection aims at recognizing situations in which unusual events occur. The challenging task of novelty detection is usually considered as single class classification task. The “normal” data traditionally comprises a very big set which allows for an accurate modelling. The acoustic events not included in the “normal” data are treated as novel events. Novel patterns are tested by comparing them with the normal class model resulting in a novelty score. Then, the score is processed by a decision logic—typically a threshold—to decide whether the test sample is novel or normal.
A plethora of approaches have been proposed due to the practical relevance of novelty detection, especially for medical diagnosis [1–3], damage inspection [4, 5], physiological condition monitoring [6], electronic IT security [7], and video surveillance systems [8].
According to [9, 10], novelty detection techniques can be grouped into two macro categories: (i) statistical and (ii) neural networkbased approaches. Extensive studies have been made in the category of statistical and probabilistic approaches which are evidently the most widely used in the field of novelty detection. The approaches on this category are modelling data based on its statistical properties and exploiting this information to determine when an unknown test sample belongs to the learnt distribution or not. Statistical approaches have been applied to a number of applications [9] ranging from data stream mining [11], outlier detection of underwater targets [12], the recognition of cancer [1], nondestructive inspection for the analysis of mechanical components [13], and audio segmentation [14], to many others. In 1999, support vector machines (SVMs) were introduced in the field of novelty detection [15] and subsequently applied to timeseries [16, 17], jet engine vibration analysis [18], failure detection in jet engines [19], patient vitalsign monitoring [20], fMRI analysis [21], and damage detection of a gearbox wheel [22].
Neural networkbased approaches—also named reconstructionbased [23]—have gained interest in recent years along with the evident success of neural networks in several other fields. In the past decade, several works focused on the application of a neural network in the form of an autoencoder (AE) have been presented [10], given the huge impact and effectiveness of neural networks. The autoencoderbased approaches involve building a regression model using the “normal” data. The test data are processed by analysing the reconstruction error between the regression target and the encoded value. When the reconstruction error shows high score, the test data is considered novel. Examples of applications include such to detect abnormal CPU data usage [24, 25] and such to detect outliers [26–29] for damage classification under changing environmental conditions [30].
In these scenarios, very little studies have been conducted in the field of acoustic novelty detection. Recently, we observed a growing research interest in application domains involving surveillance and homeland security to monitor public places or supervise private environments where people may live alone. Driven by the increasing requirement of security, public places such as but not limited to stores, banks, subway, trains, and airports have been equipped with various sensors like cameras or microphones. As a consequence, unsupervised monitoring systems have gained much attention in the research community to investigate new and efficient signal processing approaches. The research in the area of surveillance systems mainly focusses on detecting abnormal events relying on video information [8]. However, it has to be noted that several advantages can be obtained by relying on acoustic information. In fact, acoustic signals—as opposed to video information—need low computational costs and are invariant to illumination conditions, possible occlusion, and abrupt events (e.g., a shotgun and explosions). Specifically in the field of acoustic novelty detection, studies focused only on statistical approaches by applying hidden Markov models (HMM) and Gaussian mixture models (GMM) to acoustic surveillance of abnormal situations [31–33] and to automatic space monitoring [34]. Despite the number of studies exploring statistical and probabilistic approaches, the use of neural networkbased approaches for acoustic novelty detection has only been introduced recently [35, 36].
Contribution. Only in the last two years the use of neural networks for acoustic novelty detection has gained interest in the research community. In fact, few recent studies proposed a (pseudo)generative model in the form of a denoising autoencoder with recurrent neural networks (RNNs). In particular, the use of LongShort Term Memory (LSTM) RNNs as generative model [37] was investigated in the field of text generation [38], handwriting [38], and music [39]. However, the use of LSTM as a model for audio generation was only introduced in our recent works [35, 36].
This article provides a broad and extensive evaluation of stateoftheart methods with a particular focus on novel and recent unsupervised approaches based on RNNbased autoencoders. We significantly extended the studies conducted in [35, 36] by evaluating further approaches such as oneclass SVMs (OCSVMs) and multilayer perceptrons (MLP), and most importantly we conducted a broad and in depth evaluation on three different datasets for a total number of 160 153 experiments, making this article the first to present such a complete evaluation in the field of acoustic novelty detection.
We evaluate and compare all these methods with three different databases: A3Novelty, PASCAL CHiME, and PROMETHEUS. We provide evidence that RNNbased autoencoders significantly outperform other methods by outperforming statistical approaches up to an absolute improvement of 16.4% average measure over the three databases.
The remainder of this contribution is structured as follows: First, a basic description of the different statistical methods is given in Section 2. Then, the feedforward and LSTM RNNs together with autoencoderbased schemes for acoustic novelty detection are described (Sections 3 and 4). Next the thresholding strategy and features employed in the experiments are given in Section 5. The used databases are introduced in Section 6 and the experimental setup is discussed in Section 7 before discussing the evaluation of obtained results in Section 8. Section 9 finally presents our conclusions.
2. Statistical Methods
In this section we introduce statistical approaches such as GMM, HMM, and OCSVM. We formally define the input vector , where is the number of acoustic features (cf. Section 5).
2.1. Gaussian Mixture Models
GMMs estimate the probability density of the “normal” class, given training data, using a number of Gaussian components. The training phase of a GMM exploits the means algorithm or other suited training algorithms and the ExpectationMaximisation (EM) algorithm [40]. The former initializes the parameters while iterations of EM algorithm lead to the final model. Given a predefined threshold (defined in Section 5), if the probability produced by the GMM with a test sample is lower than the threshold, the sample is detected as novel event.
2.2. Hidden Markov Models
A further statistical model is the HMM [41]. HMMs differ from GMMs in terms of input temporal evolution. Indeed, while a diagonal GMM tends to approximate the whole training data probability distribution by means of a number of Gaussian components a HMM models the variations of the input signal through its hidden states. The HMM topology employed in this work is leftright and it is trained by means of the BaumWelch algorithm [41] while regarding the novelty detection phase, the decision is based on the sequence paradigm. Considering a leftright HMM having hidden states, a sequence is a set of feature vectors: . The emission probabilities of these observable events are determined by a probability distribution, one for each state [9]. We trained an HMM on what we call “normal” material and exploited the loglikelihoods as novelty scores. In the testing phase, the unseen signal is segmented into a fixed length depending on the number of states of the HMM, and if the loglikelihood is higher than the defined threshold (cf. Section 5), the segment is detected as novel.
2.3. OneClass Support Vector Machines
A OCSVM [42] maps an input example onto a highdimensional feature space and iteratively searches for the hyperplane that maximises the distance between the training examples from the origin. In this constellation, the OCSVM can be seen as a twoclass SVM where the origin is the unique member of the second class, whereas the training examples belong to the first class. Given the training data , where is the number of observations, the class separation is performed by solving the following: where is the support vector, are slack variables, is the offset, and maps into a dot product space such that the dot product in the image of can be computed by evaluating a certain kernel function such as a linear or Gaussian radial base function:The parameter sets an upper bound on the fraction of the outliers defined to be the data being outside the estimated region of normality. Thus, the decision values are obtained with the following function:We trained a OCSVM on what we call “normal” material and used the decision values as novelty scores. During testing, the OCSVM provides a decision value for the unseen pattern, and if the decision value is higher than the defined threshold (cf. Section 5), the segment is detected as novel.
3. FeedForward and Recurrent Neural Networks
This section introduces the MLP and the LSTM RNNs employed in our acoustic novelty detectors.
The first neural network type we used is a multilayer perceptron [43]. In a MLP the units are arranged in layers, with feedforward connections from one layer to the next. Each node outputs an activation function applied over the weighted sum of its inputs. The activation function can be linear, a hyperbolic function () or the sigmoid function. Input examples are fed to the input layer, and the resulting output is propagated via the hidden layers towards the output layer. This process is known as the forward pass of the network. This type of neural networks only relies on the current input and not on any past or future inputs.
The second neural network type we employed is the LSTM RNN [44]. Compared to a conventional RNN, the hidden units are replaced by socalled memory blocks. These memory blocks can store information in the “cell variable” . In this way, the network can exploit longrange temporal context. Each memory block consists of a memory cell and three gates: the input gate, output gate, and forget gate, as depicted in Figure 1.
The memory cell is controlled by the input, output, and forget gates.
The stored cell variable can be reset by the forget gate, while the functions responsible for reading input from and writing output to are controlled by the input and output gates, respectively:where and stand for elementwise hyperbolic tangent and elementwise multiplication, respectively. The output of the input gates is denoted by the variable , while the output of the output and forget gates are indicated by and , respectively. The variable denotes a weight matrix, and indicates a bias term.
Each LSTM unit is a separate and independent block. In fact, the size of is the same as , , , and . The size corresponds to the number of LSTM units in the hidden layer. In order to have the gates being dependent uniquely from the memory cell within the same LSTM unit, the matrices of the weights from the cells to the gates are diagonal.
Furthermore, we employed bidirectional RNN (BRNN) [45], which are capable of learning the context in both temporal directions. In fact, a BRNN contains two distinct hidden layers, which are processing the input vector in each direction. The output layer is then connected to both hidden layers. A more complex architecture can be obtained by combining a LSTM unit with a BRNN, which is referred to as bidirectional LSTM (BLSTM) [46]. BLSTM exploits context from both temporal directions. Note that, in the case of BLSTM, it is not possible to perform online processing as a short buffer to look ahead is required.
When the layout of a neural network comprises more hidden layers, it is defined as deep neural network (DNN) [47]. An incrementally higher level representation of the input data is provided when multiple hidden layers are stacked on each other (deep learning).
In the case of multiple layers, the output of a BRNN is computed aswhere the forward and backward activation of the th (last) hidden layer are denoted by and , respectively. The reconstructed signal is generated by using an identity activation function at the output. The best network layout was obtained by conducting a number of preliminary evaluations. Several configurations were evaluated by changing the size and the number of hidden layers (i.e., the number of LSTM units for each layer).
The training procedure was iterated up to a maximum of 100 epochs. The standard gradient descent with backpropagation of the sum squared error was used to recursively update the network weights. Those were initialized with a random Gaussian distribution with mean 0 and standard deviation 0.1, as it usually provides an acceptable initialization in our experience.
4. Autoencoders for Acoustic Novelty Detection
This section introduces the concepts of autoencoders and describes the basic autoencoder, compression autoencoder, denoising autoencoder, and nonlinear predictive autoencoder [36].
4.1. Basic Autoencoder
A basic autoencoder is a neural network trained to set the target values equal to the inputs. Its structure typically consists of only one hidden layer, while the input and the output layers have the same size. The training set consists of background environmental sounds, while test set is composed of recordings containing abnormal sounds. It is used to find common data representation from the input [48, 49]. Formally, in response to an input example , the hidden representation iswhere is a nonlinear activation function, typically a logistic sigmoid function applied componentwisely, is a weight matrix, and is a bias vector.
The network output maps the hidden representation back to a reconstruction :where is a weight matrix and is a bias vector.
Given an input set of examples , AE training consists in finding parameters that minimise the reconstruction error, which corresponds to minimising the following objective function: A wellknown approach to minimise the objective function is the stochastic gradient descent with error backpropagation. The layout of the AE is shown in Figure 2(a).
4.2. Compression Autoencoder
The compression autoencoder (CAE) learns a compressed representation of the input when the number of hidden units is smaller than the number of input units . For example, if some of the input features are correlated, these correlations are learnt and reconstructed by the CAE. The structure of the CAE is given in Figure 2(b).
4.3. Denoising Autoencoder
In the denoising AE (DAE) [50] configuration the network is trained to reconstruct the original input from a corrupted version of it. The initial input is corrupted by means of additive isotropic Gaussian noise in order to obtain . The corrupted input is then mapped, as with the AE, to a hidden representationforcing the hidden layer to retrieve more robust features and prevent it from simply learning the identity. Thus, the original signal is reconstructed as follows:The structure of the denoising autoencoder is shown in Figure 2(c). In the training phase, the set of network weights and biases are updated in order to have as close as possible to the uncorrupted input . This procedure corresponds to minimising the reconstruction error objective function (8). In our approach, to corrupt the initial input we make use of additive isotropic Gaussian noise, in order to obtain .
4.4. Nonlinear Predictive Autoencoder
The basic idea of a nonlinear predictive (NP) AE is to train the AE in order to predict the current frame from a previous observation. Formally, the input up to a given time frame is mapped to a hidden representation :where and denote weights and bias, respectively. From this, we reconstruct an approximation of the original signal as follows:where is the prediction delay and . A prediction delay of corresponds to a shift of 10 ms in the audio signal in our setting (cf. Section 5). The training of the parameters is performed by minimising the objective function (8)—the difference is that is now based on nonlinear prediction according to (11) and (12). Thus, the parameters are trained to minimise the average reconstruction error over the training set, to have as close as possible to the prediction delay. The resulting structure of the nonlinear predictive denoising autoencoder (NPDAE) is similar to the one depicted in Figure 2(c), but with input and output updated as described above.
5. Thresholding and Features
This section describes the thresholding decision strategy and the features employed in our experiments.
5.1. Thresholding
Auditory spectral features (ASF) in Section 5.2 used in this work are composed by 54 coefficients, which means that the input and output layer of the network have 54 units each. The trained AE reconstructs each sample and novel events are identified by processing the reconstruction error signal with an adaptive threshold. The input audio signal is segmented into sequences of 30 seconds of length. In the testing phase, we compute—on a frame basis—the average Euclidean distance between the networks’ outputs and each standardized input feature value. In order to compress the reconstruction error to a single value, the distances are summed up and divided by the number of coefficients. Then we apply a threshold to obtain a binary signal, shifting from the median of the error signal of a sequence by a multiplicative coefficient . The coefficient ranges from to :
Figure 3 shows the reconstruction error for a given sequence. The figure clearly depicts a low reconstruction error in reproducing normal input such as talking, television sounds, and other normal environmental sounds.
(a)
(b)
(c)
5.2. Acoustic Features
An efficient representation of the audio signal can be achieved by extracting the auditory spectral features (ASF) [51]. The audio signal is split into frames with the size equal to 30 ms and a frame step of 10 ms, and then the ASF are obtained by applying Short Time Fourier Transform (STFT), which yields the power spectrogram of the frame. Mel spectrograms (with being the frame index and the frequency bin index) are calculated converting the power spectrogram to the Melfrequency scale using a filterbank with 26 triangular filters. A logarithmic scaling is chosen to match the human perception of loudness:In addition, the positive firstorder differences are calculated from each Mel spectrogram following
Furthermore, the frame energy and its derivative are also included as feature ending up in a total number of 54 coefficients. For better reproducibility, the features extraction process is computed with our opensource audio analysis toolkit openSMILE [52].
6. Databases
This section describes the three databases evaluated in our experiments: A3Novelty, PASCAL CHiME, and PROMETHEUS.
6.1. A3Novelty
The A3Novelty Corpus (http://www.a3lab.dii.univpm.it/research/a3novelty) includes around 56 hours of recording acquired in a laboratory of the Università Politecnica delle Marche. These recordings were performed during different day and night hours, so very different acoustic conditions are available. A variety of novel events were randomly played back by a speaker (e.g., scream, fall, alarm, or breakage of objects) during the recordings.
Eight microphones were used in the recording room for the acquisitions: four Behringer B5 microphones with cardioid pattern and an array of four AKG C400 BL microphones spaced by 4 cm, and then A MOTU 8pre sound card and the NUTech software were utilised to record the microphone signals. The sampling rate was equal to 48 kHz.
The abnormal event sounds (cf. Table 1) can be grouped into four categories and they are freely available to download from http://www.freesound.org/:(i)Sirens, three different types of sirens or alarm sounds.(ii)Falls, two occurrences of a person or an object falling to the ground.(iii)Breakage of objects, noise produced by the breakage of an object after the impact with the ground.(iv)Screams, four different human screams, both produced by a single person or by a group of people.

The A3Novelty Corpus is composed of two types of recordings: background, which contains only background sounds such as human speech, technical tools noise, and environmental sounds and background with novelty, which contains in addition to the background the artificially generated novelty events.
In the original A3Novelty database the recordings are segmented in sequences of 30 seconds. In order to limit the size of training data, we randomly selected 300 sequences from the background partition to compose training material (150 minutes) and 180 sequences from the background with novelty partition to compose the testing set (90 minutes). The test set contains 13 novelty occurrences.
For reproducibility, the list of randomly selected recordings and the train and test set are made available (http://www.a3lab.dii.univpm.it/research/a3novelty).
6.2. PASCAL CHiME
The original dataset is composed of around 7 hours of recordings of a home environment, taken from the PASCAL CHiME speech separation and recognition challenge [53]. It consists of a typical inhome scenario (a living room), recorded during different days and times, while the inhabitants (two adults and two children) perform common actions, such as talking, watching television, playing, or eating. The dataset was recorded with a binaural microphone and a samplerate of 16 kHz. In the original PASCAL CHiME database the recordings are segmented in sequences of 5 minutes’ duration. In order to limit the size of training data, we randomly selected sequences to compose 100 minutes of background for the training set and around 70 minutes for the testing set. For reproducibility, the list of randomly selected recordings and the train and test set are made available (http://a3lab.dii.univpm.it/webdav/audio/Novelty_Detection_Dataset.tar.gz). The test set was generated adding different types of sounds (taken from http://www.freesound.org/), such as screams, alarms, falls, and fractures (cf. Table 1), after their normalization to the volume of the background recordings. The events in the test set were added at random position; thus the distance between one event and another is not fixed.
6.3. PROMETHEUS
The PROMETHEUS database [31] contains recordings of various scenarios designed to serve a wide range of realworld applications. The database includes (1) a smartroom indoor home environment including phases where a user is interacting with an automated speechdriven home assistant and (2) an outdoor public space consisting of (a) interaction of people with an ATM, (b) an outdoor security scenario in which people are waiting in front of a counter, and (3) an indoor office corridor scenario for security monitoring in standard indoor space. These scenarios substantially differ in terms of acoustic environment. The indoor scenarios were recorded under quiet acoustic conditions, whereas the outdoor recordings were conducted in an openair public area and contain nonstationary background noise. The smarthome scenario contains recordings of five professional actors performing five singleperson and 14 multipleperson action scripts. The main activities include humanmachine interaction with a virtual home agent and a number of alternating normal and abnormal activities specifically designed to monitor and interpret human behaviour. The singleperson and multipleperson actions include abnormal events, such as falls, alarm followed by panic, atypical vocalic reactions (pain, fear, and anger), or fractures. Examples are walking to the couch, sitting, or interacting with the smart environment to turn the TV on, open the windows, or decrease the temperature. The scenarios were recorded three to five times, by changing the actors and their roles in the action scripts. Table 1 provides details on the number of abnormal events per scenario, including average time duration.
7. Experimental SetUp
The networks were trained with the gradient steepest descent algorithm on the sum of squared errors (SSE). In the case of all the LSTM and BLSTM networks, we used a constant value of learning rate since it showed better performances in our previous works [35], whereas different values of were used for MLP networks. Different noise sigma values were applied to the DAE. No Gaussian noise was applied to the basic AE and to the CAE following the architectures described in Section 4. The prediction delay was applied for different values: . The AEs were trained using our opensource CUDA RecurREnt Neural Network Toolkit (CURRENNT) [54] ensuring reproducibility. As evaluation metrics we used measure in order to compare the results with previous works [35, 36]. We evaluated several topologies for the nonlinear predictive DAE ranging from 5412854 to 216216216 and from 543054 to 545454 in the case of CAE and basic AE, respectively. Every network topology was evaluated for each of the 100 epochs of training. In order to compare our results with our previous studies we kept the same optimisation procedure as applied in [35, 36]. We employed further three stateoftheart approaches based on OCSVM, GMM, and HMM. In the case of OCSVM, we trained models at different complexity values . Radial basis function kernels were used with different gamma values , and we controlled the fraction of outliers in the training phase with different values . The OCSVM was trained via the LIBSVM library [55]. In the case of GMM, models were trained at different numbers of Gaussian components with , whereas leftright HMMs were trained with different numbers of states and Gaussian components with . GMMs and HMMs were trained using the Torch [56] toolkit. The decision values produced as output of the OCSVM and the probability estimates produced as output of the probabilistic models were postprocessed with a similar thresholding algorithm (cf. Section 5) in order to fairly compare the performance among the different methods. For all the experiments and settings we maintained the same feature set.
8. Results
In this section we present and comment on the results obtained in our evaluation across the three databases.
8.1. A3Novelty
Evaluations on the A3Novelty Corpus are reported in the second column of Table 2. In this dataset GMMs and HMMs perform similarly; however, they are outperformed by the OCSVM with a maximum improvement of 3.6% absolute measure. The autoencoderbased approaches are significantly boosting the performance up to 98.7%. We observe a vast absolute improvement by up to 6.9% against the probabilistic approaches. Among the three CAE, AE, and DAE, we observe that compression and denoising layouts with BLSTM units perform closely to each other at up to 98.7% in the case of the BLSTMCAE. This can be due to the fact that the dataset contains fewer variations in the background material used for training, and the feature selection operated internally by the AE increases the sensitivity of the reconstruction error.

The nonlinear predictive results are shown in the last part of Table 2. We provide performance in the three named configurations and with the three named unit types. In concordance to what we found in the PASCAL database, the NPBLSTMDAE method provided the best performance in terms of measure of up to 99.4%. A significant absolute improvement (onetailed ztest [57], (in the rest of the manuscript we reported as “significant” the improvements with at least under the onetailed ztest [57])) of 10.0% measure is observed against the GMMbased approach, while an absolute improvement of 7.6% measure is exhibited with respect to the OCSVM method. We observe an overall improvement of 1% between the “ordinary” and the “predictive” architectures.
The performance obtained by progressively increasing the prediction delay () values (from 0 up to 10) is reported in Figure 4. We evaluated the compression autoencoder (CAE), the basic autoencoder (AE), and the denoising autoencoder (DAE) with MLP, LSTM, and BLSTM units, and we applied different layouts (cf. Section 7) per network type. However, for the sake of brevity, we only show the best configurations. The best results across all the three unit types are 99.4% and 99.1% measure for the NPBLSTMDAE and NPLSTMDAE networks, respectively. These are obtained with a prediction delay of 5 frames, which translates into an overall delay of 50 ms. In general, the best performances are achieved with or . Increasing the prediction delay up to 10 frames produces a heavy decrease in performance down to 97.8% measure.
8.2. PASCAL CHiME
In the first column of Table 2 we report the performance obtained on the PASCAL dataset using different approaches. Parts of the results obtained on this database were also presented in [36]. Here, we conducted additional experiments to evaluate OCSVM and MLP approaches. The oneclass SVM shows lower performance compared to probabilistic approaches such as GMM and HMM, which seems to work reasonably well up to 91.4% measure. The OCSVM low performance can be due to the fact that the dataset was generated artificially and the abnormal sound dynamics were normalized with respect to the “normal” material making the margin maximisation more complex and less effective. Next, we evaluated AEbased approaches in the three configurations: compression (CAE), basic (AE), and denoising (DAE). We also evaluated MLP, LSTM, and BLSTM unit types. Among the three configurations we observe that denoising ones perform better than the others independently of the type of unit. In particular, the best performance is obtained with the denoising autoencoder realised as BLSTM RNN showing up to 93.4% measure. The last three groups of rows in Table 2 show results of the NP approach again in the three configurations and with the three unit types.
The NPBLSTMDAE achieved the best result of up to 94.4% measure. Significant absolute improvements of 4.0%, 3.0%, and 1% are observed over GMM, HMM, and “ordinary” BLSTMDAE approaches, respectively.
Interestingly, applying the nonlinear prediction scheme to the compression autoencoders NP(B)LSTMCAE (92.8% measure) also increased the performances in comparison with the (B)LSTMCAE (91.3% measure). In fact, in a previous work [35], the compression learning process alone showed scarce results. However, here the CAE with the nonlinear prediction encodes information on the input more effectively.
Figure 5 depicts results for increasing values of the prediction delay (), ranging from 0 to 10. We evaluated CAE, AE, and DAE with MLP, LSTM, and BLSTM neural networks with different layouts (cf. Section 7) per network type. However, due to space restrictions, we only report the best performances. Here, the best performances are obtained with a prediction delay of 3 frames (30 ms) for the NPBLSTMDAE network (94.4% measure) and of one frame in the case of NPLSTMDAE (94.2% measure). As in the A3Novelty database, we observe a similar decrease in performance down to 86.2% measure when the prediction delay increases up to 10, which corresponds to 100 ms. In fact, applying a higher prediction delay (e.g., 100 ms) induces higher values of the reconstruction error in the presence of fast periodic events, which subsequently leads to an increased false detection rate.
8.3. PROMETHEUS
This subsection elaborates on the results obtained on the four subsets present in the PROMETHEUS database.
8.3.1. ATM
The ATM scenario evaluations are shown in the third column of Table 2. The GMM and HMM perform similarly at chance level. In fact, we observe an measure of 50.2% and 52.0% for GMMs and HMMs, respectively. The oneclass SVM shows slightly better performance of up to 60.2%. On the other hand, AEbased approaches in the three configurations—compression (CAE), traditional (AE), and denoising (DAE)—show a significant improvement in performance up to 19.3% absolute measure against the OCSVM. Among the three configurations we observe that DAE performs better independently of the type of network. In particular, the best performance considering the ordinary (without nonlinear prediction) approach is obtained with the DAE with a LSTM network leading to an measure of 79.5%.
The last three groups of rows in Table 2 show results of the nonlinear predictive approach (NP). The nonlinear predictive denoising autoencoder performs best up to 81.6% measure. Surprisingly, the best performance is obtained using MLP units suggesting that for long events—as those contained in the ATM scenario (with an average duration of 6.0 s, cf. Table 1)—memoryenhanced units such as (B)LSTM are not as effective as for shorter events.
A significant absolute improvement of 21.4% measure is observed against the OCSVM approach, while an absolute improvement of 31.4% measure is exhibited with respect to the GMMbased method. Among the two autoencoderbased approaches we report an absolute improvement of 1.0% between the, namely, “ordinary” and “predictive” structures. It must be observed that the performance presented in [31] are higher than the one provided in this article since the tolerance window used in that study was set to 1 s whereas here we aimed at a higher temporal resolution with a tolerance window of 200 ms which is suitable also for abrupt events.
Figure 6 depicts performance for progressive values of the prediction delay () ranging from 0 to 10, applying a CAE, AE, and DAE with MLP, LSTM, and BLSTM networks. Several layouts (cf. Section 7) were evaluated per network type; however, we report only the best configurations. Setting a prediction delay of 1 frame, which corresponds to a total prediction delay of 10 ms, leads to the best performance of up to 81.6% measure in the NPMLPDAE network. In the case of the NPBLSTMDAE we observe better performance with a delay of 2 frames up to 80.7% measure. In general, we do not observe a consistent trend by increasing the prediction delay, corroborating the fact that, for long events, as those contained in the ATM scenario, memoryenhanced units and a nonlinear predictive approach are not as effective as for shorter events.
8.3.2. Corridor
The evaluations on the corridor subset are shown in the fourth column of Table 2. The GMM and HMM perform similarly at chance level. We observe an measure of 49.4% and 49.6% for GMM and HMM, respectively. The OCSVM shows better performance up to 65.3%. As observed in the ATM scenario, again a significant improvement in performance up to 16.5% absolute measure is observed using the autoencoderbased approaches in the three configurations (CAE, AE, and DAE) with respect to the OCSVM. Among the three configurations, we observe that the denoising autoencoder performs better than the others. The best performance is obtained with the denoising autoencoder with a BLSTM unit of up to 79.8% measure.
The “predictive” approach is reported in the last three groups of rows in Table 2. Interestingly, the nonlinear predictive autoencoders do not improve the performance as we have seen in the other scenarios. A plausible explanation can be found based on the nature of the novelty events present in the subset. In fact, the subset contains very long events with an average duration of up to 14.0 s per event. With such long events, the generative model does not introduce a more sensitive reconstruction error. However, the delta in performance between the BLSTMDAE and the NPBLSTMDAE is rather small (1.3% measure) in favour of the “ordinary” approach. The best performance (79.8%) is obtained using BLSTM units confirming that memoryenhanced units are more effective in the presence of short events. In fact this scenario—besides very long events—also contains fall and pain short events with an average duration of 1.0 s and 3.0 s, respectively.
A significant absolute improvement up to 16.5% measure is observed against the OCSVM approach, while being even higher with respect to the GMM and HMM.
8.3.3. Outdoor
The evaluations on the outdoor subset are shown in the fifth column of Table 2. The OCSVM, GMM, and HMM perform better in this scenario as opposed to ATM and corridor. We observe an measure of 57.3%, 56.4%, and 56.0% for OCSVM, GMM, and HMM, respectively. In this scenario, the improvement brought by the autoencoder is not as vast as in the previous subsets but still significant. We report an absolute improvement of 11.2% measure between OCSVM and BLSTMDAE. Again, the denoising autoencoder performs better than the other configurations. In particular, the best performance obtained with BLSTMDAE is 68.5% measure.
As observed in the corridor scenario, the nonlinear predictive autoencoders (last three groups of rows in Table 2) do not improve the performance. These results corroborate our previous explanation that the long duration nature of the novelty events present in the subset affects the sensitivity of the reconstruction error in the generative model. However, the delta in performance between the BLSTMDAE and NPBLSTMDAE is rather small (1.3% measure).
It must be observed that the performance in this scenario is rather low compared to the one obtained in the other datasets. We believe that the presence of anger novel sounds introduces a higher degree of complexity in our autoencoderbased approach. In fact, anger novel events may contain different level of aroused content which could be acoustically similar to neutral spoken content present in the training material. Under this condition, the generative model shows a low reconstruction error. This issue could be solved by setting the novel events to only contain the aroused segments or—considering anger as a longterm speaker state—increasing the temporal resolution of our system.
8.3.4. SmartRoom
The smartroom scenario evaluations are shown in the sixth column of Table 2. The OCSVM, GMM, and HMM perform better in this scenario as opposed to ATM, corridor, and outdoor. We observe an measure of 57.4%, 59.1%, and 59.1% for OCSVM, GMM, and HMM, respectively. In this scenario, the improvement brought about by the autoencoder is still significant. We report an absolute improvement of 6.0% measure between GMM/HMM and the BLSTMDAE. Again, the denoising autoencoder performs better than the other configurations. In particular, the best performance in the ordinary approach is obtained with the BLSTMDAE of up to 65.1% measure.
The last three groups of rows in Table 2 show results of the nonlinear predictive approach (NP). The NPBLSTMDAE performs best at up to 65.6% measure.
As in the outdoor subset, we report a low performance in the smartroom subset as well. In fact, the subset contains several long novel events related to spoken content expressing pain and fear. As commented in the outdoor scenario, under this condition the generative model may be able to reconstruct the novel event without producing a high reconstruction error.
8.4. Overall
Overall, the experimental results proved that the DAE methods achieved superior performances compared to the CAE/AE schemes. This is due to the combination of two leaning processes of a denoising autoencoder, such as the process of encoding of the input by preserving the information about the input itself and simultaneously reversing the effect of a corruption process applied to the input of the autoencoder.
In particular, the predictive approach with (B)LSTM units showed the best performance of up to 89.3% average measure among all the six different datasets weighted by the number of instances per database (cf. Table 2).
To better understand the improvement brought by the RNNbased approaches, we provide in Figure 7 the comparison between stateoftheart methods in terms of weighted average measure computed across the A3Novelty Corpus, PASCAL CHiME, and PROMETHEUS. In general, we observe that the recently proposed NPBLSTMDAE method provided the best performance in terms of average measure of up to 89.3%. A significant absolute improvement of 16.0% average measure is observed against the OCSVM approach, while an absolute improvement of 10.6% and 10.4% average measure is exhibited with respect to the GMM and HMMbased methods. An absolute improvement of 0.6% is observed over the “ordinary” BLSTMDAE. It has to be noted that the average measure is computed including the PROMETHEUS database for which the performance has been shown to be lower because it contains longterm events and a lower resolution in the labels (1 s).
The RNNbased schemes also bring an evident benefit when applied to the “normal” autoencoders (i.e., with no denoising or compression); in fact, the NPBLSTMAE achieves an measure of 88.5%. Furthermore, when we applied the nonlinear prediction scheme to a denoising autoencoder, the performance achieved with LSTM was in this case comparable with BLSTM units and also outperformed stateoftheart approaches.
In conclusion, the combination of the nonlinear prediction paradigm and the various (B)LSTM autoencoders proved to be effective, outperforming significantly other stateoftheart methods. Additionally, there is evidence that memoryenhanced units such as LSTM and BLSTM outperformed MLP without memory, showing that the knowledge of the temporal context can improve the novelty detector abilities.
9. Conclusions and Outlook
We presented a broad and extensive evaluation of stateoftheart methods with a particular focus on novel and recent unsupervised approaches based on RNNbased autoencoders. We significantly extended the studies conducted in [35, 36] by evaluating further approaches such as oneclass support vector machines (OCSVMs) and multilayer perceptron (MLP), and most importantly we conducted a broad evaluation on three different datasets for a total number of 160153 experiments, making this article the first to present such a complete evaluation in the field of acoustic novelty detection. We show evidently that RNNbased autoencoders significantly outperform other methods by achieving up to 89.3% weighted average measure on the three databases, with a significant absolute improvement of 10.4% against the best performance obtained with statistical approaches (HMM). Overall, a significant increase in performance was achieved by combining the (B)LSTM autoencoderbased architecture with the nonlinear prediction scheme.
Future works will focus on using multiresolution features [51, 58], likely more suitable to deal with different event durations in order to face the issues encountered in the PROMETHEUS database. Further studies will be conducted on other architectures of RNNbased autoencoders ranging from dynamic Bayesian networks [59] to convolutional neural networks [60].
Competing Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007–2013) under Grant Agreement no. 338164 (ERC Starting Grant iHEARu), no. 645378 (RIA ARIA VALUSPA), and no. 688835 (RIA DEENIGMA).
References
 L. Tarassenko, P. Hayton, N. Cerneaz, and M. Brady, “Novelty detection for the identification of masses in mammograms,” in Proceedings of the 4th International Conference on Artificial Neural Networks, pp. 442–447, June 1995. View at: Google Scholar
 J. Quinn and C. Williams, “Known unknowns: novelty detection in condition monitoring,” in Pattern Recognition and Image Analysis, J. Mart, J. Bened, A. Mendona, and J. Serrat, Eds., vol. 4477 of Lecture Notes in Computer Science, pp. 1–6, Springer, Berlin, Germany, 2007. View at: Google Scholar
 L. Clifton, D. A. Clifton, P. J. Watkinson, and L. Tarassenko, “Identification of patient deterioration in vitalsign data using oneclass support vector machines,” in Proceedings of the Federated Conference on Computer Science and Information Systems (FedCSIS '11), pp. 125–131, September 2011. View at: Google Scholar
 Y.H. Liu, Y.C. Liu, and Y.J. Chen, “Fast support vector data descriptions for novelty detection,” IEEE Transactions on Neural Networks, vol. 21, no. 8, pp. 1296–1313, 2010. View at: Publisher Site  Google Scholar
 C. Surace and K. Worden, “Novelty detection in a changing environment: a negative selection approach,” Mechanical Systems and Signal Processing, vol. 24, no. 4, pp. 1114–1128, 2010. View at: Publisher Site  Google Scholar
 J. A. Quinn, C. K. I. Williams, and N. McIntosh, “Factorial switching linear dynamical systems applied to physiological condition monitoring,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 9, pp. 1537–1551, 2009. View at: Publisher Site  Google Scholar
 A. Patcha and J.M. Park, “An overview of anomaly detection techniques: existing solutions and latest technological trends,” Computer Networks, vol. 51, no. 12, pp. 3448–3470, 2007. View at: Publisher Site  Google Scholar
 M. Markou and S. Singh, “A neural networkbased novelty detector for image sequence analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1664–1677, 2006. View at: Publisher Site  Google Scholar
 M. Markou and S. Singh, “Novelty detection: a review—part 1: statistical approaches,” Signal Processing, vol. 83, no. 12, pp. 2481–2497, 2003. View at: Publisher Site  Google Scholar
 M. Markou and S. Singh, “Novelty detection: a review—part 2: neural network based approaches,” Signal Processing, vol. 83, no. 12, pp. 2499–2521, 2003. View at: Publisher Site  Google Scholar
 E. R. de Faria, I. R. Goncalves, J. a. Gama, and A. C. Carvalho, “Evaluation of multiclass novelty detection algorithms for data streams,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 11, pp. 2961–2973, 2015. View at: Publisher Site  Google Scholar
 C. Satheesh Chandran, S. Kamal, A. Mujeeb, and M. H. Supriya, “Novel class detection of underwater targets using selforganizing neural networks,” in Proceedings of the IEEE Underwater Technology (UT '15), Chennai, India, February 2015. View at: Publisher Site  Google Scholar
 K. Worden, G. Manson, and D. Allman, “Experimental validation of a structural health monitoring methodology: part I. Novelty detection on a laboratory structure,” Journal of Sound and Vibration, vol. 259, no. 2, pp. 323–343, 2003. View at: Publisher Site  Google Scholar
 J. Foote, “Automatic audio segmentation using a measure of audio novelty,” in Proceedings of the IEEE Internatinal Conference on Multimedia and Expo (ICME '00), vol. 1, pp. 452–455, August 2000. View at: Google Scholar
 B. Schölkopf, R. C. Williamson, A. J. Smola, J. ShaweTaylor, and J. C. Platt, “Support vector method for novelty detection,” in Proceedings of the Neural Information Processing Systems (NIPS '99), vol. 12, pp. 582–588, Denver, Colo, USA, 1999. View at: Google Scholar
 J. Ma and S. Perkins, “Timeseries novelty detection using oneclass support vector machines,” in Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN '03), vol. 3, pp. 1741–1745, Portland, Ore, USA, July 2003. View at: Publisher Site  Google Scholar
 J. Ma and S. Perkins, “Online novelty detection on temporal sequences,” in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '03), pp. 613–618, ACM, Washington, DC, USA, August 2003. View at: Publisher Site  Google Scholar
 P. Hayton, B. Schölkopf, L. Tarassenko, and P. Anuzis, “Support vector novelty detection applied to jet engine vibration spectra,” in Proceedings of the 14th Annual Neural Information Processing Systems Conference (NIPS '00), pp. 946–952, December 2000. View at: Google Scholar
 L. Tarassenko, A. Nairac, N. Townsend, and P. Cowley, “Novelty detection in jet engines,” in Proceedings of the IEE Colloquium on Condition Monitoring: Machinery, External Structures and Health, pp. 4/1–4/5, Birmingham, UK, April 1999. View at: Publisher Site  Google Scholar
 L. Clifton, D. A. Clifton, Y. Zhang, P. Watkinson, L. Tarassenko, and H. Yin, “Probabilistic novelty detection with support vector machines,” IEEE Transactions on Reliability, vol. 63, no. 2, pp. 455–467, 2014. View at: Publisher Site  Google Scholar
 D. R. Hardoon and L. M. Manevitz, “Oneclass machine learning approach for fmri analysis,” in Proceedings of the Postgraduate Research Conference in Electronics, Photonics, Communications and Networks, and Computer Science, Lancaster, UK, April 2005. View at: Google Scholar
 M. Davy, F. Desobry, A. Gretton, and C. Doncarli, “An online support vector machine for abnormal events detection,” Signal Processing, vol. 86, no. 8, pp. 2009–2025, 2006. View at: Publisher Site  Google Scholar
 M. A. F. Pimentel, D. A. Clifton, L. Clifton, and L. Tarassenko, “A review of novelty detection,” Signal Processing, vol. 99, pp. 215–249, 2014. View at: Publisher Site  Google Scholar
 N. Japkowicz, C. Myers, M. Gluck et al., “A novelty detection approach to classification,” in Proceedings of the 14th international joint conference on Artificial intelligence (IJCAI '95), vol. 1, pp. 518–523, Montreal, Canada, 1995. View at: Google Scholar
 B. B. Thompson, R. J. Marks, J. J. Choi, M. A. ElSharkawi, M.Y. Huang, and C. Bunje, “Implicit learning in autoencoder novelty assessment,” in Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN '02), vol. 3, pp. 2878–2883, IEEE, Honolulu, Hawaii, USA, 2002. View at: Google Scholar
 S. Hawkins, H. He, G. Williams, and R. Baxter, “Outlier detection using replicator neural networks,” in Data Warehousing and Knowledge Discovery, pp. 170–180, Springer, Berlin, Germany, 2002. View at: Google Scholar
 N. Japkowicz, “Supervised versus unsupervised binarylearning by feedforward neural networks,” Machine Learning, vol. 42, no. 12, pp. 97–122, 2001. View at: Publisher Site  Google Scholar
 L. Manevitz and M. Yousef, “Oneclass document classification via Neural Networks,” Neurocomputing, vol. 70, no. 7–9, pp. 1466–1481, 2007. View at: Publisher Site  Google Scholar
 G. Williams, R. Baxter, H. He, S. Hawkins, and L. Gu, “A comparative study of RNN for outlier detection in data mining,” in Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM '02), pp. 709–712, IEEE Computer Society, December 2002. View at: Google Scholar
 H. Sohn, K. Worden, and C. R. Farrar, “Statistical damage classification under changing environmental and operational conditions,” Journal of Intelligent Material Systems and Structures, vol. 13, no. 9, pp. 561–574, 2002. View at: Publisher Site  Google Scholar
 S. Ntalampiras, I. Potamitis, and N. Fakotakis, “Probabilistic novelty detection for acoustic surveillance under realworld conditions,” IEEE Transactions on Multimedia, vol. 13, no. 4, pp. 713–719, 2011. View at: Publisher Site  Google Scholar
 E. Principi, S. Squartini, R. Bonfigli, G. Ferroni, and F. Piazza, “An integrated system for voice command recognition and emergency detection based on audio signals,” Expert Systems with Applications, vol. 42, no. 13, pp. 5668–5683, 2015. View at: Publisher Site  Google Scholar
 A. Ito, A. Aiba, M. Ito, and S. Makino, “Evaluation of abnormal sound detection using multistage GMM in various environments,” in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH '11), pp. 301–304, Florence, Italy, August 2011. View at: Google Scholar
 S. Ntalampiras, I. Potamitis, and N. Fakotakis, “On acoustic surveillance of hazardous situations,” in Proceedings of the 34th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '09), pp. 165–168, April 2009. View at: Publisher Site  Google Scholar
 E. Marchi, F. Vesperini, F. Eyben, S. Squartini, and B. Schuller, “A novel approach for automatic acoustic novelty detection using a denoising autoencoder with bidirectional LSTM neural networks,” in Proceedings of the 40th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '15), p. 5, IEEE, Brisbane, Australia, April 2015. View at: Google Scholar
 E. Marchi, F. Vesperini, F. Weninger, F. Eyben, S. Squartini, and B. Schuller, “Nonlinear prediction with LSTM recurrent neural networks for acoustic novelty detection,” in Proceedings of the 2015 International Joint Conference on Neural Networks (IJCNN '15), p. 5, IEEE, Killarney, Ireland, July 2015. View at: Google Scholar
 F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, “Learning precise timing with LSTM recurrent networks,” The Journal of Machine Learning Research, vol. 3, no. 1, pp. 115–143, 2003. View at: Publisher Site  Google Scholar  MathSciNet
 A. Graves, “Generating sequences with recurrent neural networks,” https://arxiv.org/abs/1308.0850. View at: Google Scholar
 D. Eck and J. Schmidhuber, “Finding temporal structure in music: blues improvisation with lstm recurrent networks,” in Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, pp. 747–756, IEEE, Martigny, Switzerland, September 2002. View at: Publisher Site  Google Scholar
 J. A. Bilmes, “A gentle tutorial of the EM algorithm and its application to parameter estimation for gaussian mixture and hidden Markov models,” Tech. Rep. 97021, International Computer Science Institute, Berkeley, Calif, USA, 1998. View at: Google Scholar
 L. R. Rabiner, “Tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989. View at: Publisher Site  Google Scholar
 B. Schölkopf, R. C. Williamson, A. J. Smola, J. ShaweTaylor, and J. C. Platt, “Support vector method for novelty detection,” in Advances in Neural Information Processing Systems 12, S. Solla, T. Leen, and K. Müller, Eds., pp. 582–588, MIT Press, 2000. View at: Google Scholar
 D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by backpropagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986. View at: Publisher Site  Google Scholar
 S. Hochreiter and J. Schmidhuber, “Long shortterm memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. View at: Publisher Site  Google Scholar
 M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997. View at: Publisher Site  Google Scholar
 A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional LSTM and other neural network architectures,” Neural Networks, vol. 18, no. 56, pp. 602–610, 2005. View at: Publisher Site  Google Scholar
 G. Hinton, L. Deng, D. Yu et al., “Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012. View at: Publisher Site  Google Scholar
 I. Goodfellow, H. Lee, Q. V. Le, A. Saxe, and A. Y. Ng, “Measuring invariances in deep networks,” in Advances in Neural Information Processing Systems, Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, Eds., vol. 22, pp. 646–654, Curran & Associates Inc, Red Hook, NY, USA, 2009. View at: Google Scholar
 Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layerwise training of deep networks,” in Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, Eds., pp. 153–160, MIT Press, 2007. View at: Google Scholar
 P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.A. Manzagol, “Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010. View at: Google Scholar  MathSciNet
 E. Marchi, G. Ferroni, F. Eyben, L. Gabrielli, S. Squartini, and B. Schuller, “Multiresolution linear prediction based features for audio onset detection with bidirectional LSTM neural networks,” in Proceedings of the 39th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '14), pp. 2183–2187, IEEE, Florence, Italy, May 2014. View at: Publisher Site  Google Scholar
 F. Eyben, F. Weninger, F. Gross, and B. Schuller, “Recent developments in openSMILE, the munich opensource multimedia feature extractor,” in Proceedings of the 21st ACM International Conference on Multimedia (MM '13), pp. 835–838, ACM, Barcelona, Spain, October 2013. View at: Publisher Site  Google Scholar
 J. Barker, E. Vincent, N. Ma, H. Christensen, and P. Green, “The pascal chime speech separation and recognition challenge,” Computer Speech & Language, vol. 27, no. 3, pp. 621–633, 2013. View at: Publisher Site  Google Scholar
 F. Weninger, J. Bergmann, and B. Schuller, “Introducing CURRENNT: the Munich opensource CUDA RecurREnt neural network toolkit,” Journal of Machine Learning Research, vol. 16, pp. 547–551, 2015. View at: Google Scholar  MathSciNet
 C.C. Chang and C.J. Lin, “LIBSVM: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, article 27, 2011. View at: Publisher Site  Google Scholar
 R. Collobert, K. Kavukcuoglu, and C. Farabet, “Torch7: a matlablike environment for machine learning,” BigLearn, NIPS Workshop, no. EPFLCONF192376, 2011. View at: Google Scholar
 M. D. Smucker, J. Allan, and B. Carterette, “A comparison of statistical significance tests for information retrieval evaluation,” in Proceedings of the 16th ACM Conference on Conference on Information and Knowledge Management (CIKM '07), pp. 623–632, ACM, Lisbon, Portugal, November 2007. View at: Google Scholar
 E. Marchi, G. Ferroni, F. Eyben, S. Squartini, and B. Schuller, “Audio onset detection: a wavelet packet based approach with recurrent neural networks,” in Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN '14), pp. 3585–3591, Beijing, China, July 2014. View at: Publisher Site  Google Scholar
 N. Friedman, K. Murphy, and S. Russell, “Learning the structure of dynamic probabilistic networks,” in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI '98), pp. 139–147, Morgan Kaufmann, Madison, Wisc, USA, July 1998. View at: Google Scholar
 S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, “Face recognition: a convolutional neuralnetwork approach,” IEEE Transactions on Neural Networks, vol. 8, no. 1, pp. 98–113, 1997. View at: Publisher Site  Google Scholar
Copyright
Copyright © 2017 Erik Marchi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.