Computational Intelligence and Neuroscience

Volume 2017, Article ID 4694860, 14 pages

https://doi.org/10.1155/2017/4694860

## Deep Recurrent Neural Network-Based Autoencoders for Acoustic Novelty Detection

^{1}Machine Intelligence & Signal Processing Group, Technische Universität München, Munich, Germany

^{2}audEERING GmbH, Gilching, Germany

^{3}Chair of Complex & Intelligent Systems, University of Passau, Passau, Germany

^{4}A3LAB, Department of Information Engineering, Università Politecnica delle Marche, Ancona, Italy

^{5}Department of Computing, Imperial College London, London, UK

Correspondence should be addressed to Erik Marchi; ed.mut@ihcram.kire

Received 12 July 2016; Accepted 25 September 2016; Published 15 January 2017

Academic Editor: Stefan Haufe

Copyright © 2017 Erik Marchi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

In the emerging field of acoustic novelty detection, most research efforts are devoted to probabilistic approaches such as mixture models or state-space models. Only recent studies have introduced (pseudo-)generative models for acoustic novelty detection with recurrent neural networks in the form of an autoencoder. In these approaches, auditory spectral features of the next short-term frame are predicted from the previous frames by means of Long Short-Term Memory recurrent denoising autoencoders. The reconstruction error between the input and the output of the autoencoder is used as an activation signal to detect novel events. There is no evidence of studies that compare previous efforts to automatically recognize novel events from audio signals and give a broad and in-depth evaluation of recurrent neural network-based autoencoders. The present contribution aims to consistently evaluate our recent novel approaches to fill this gap in the literature and to provide insight through extensive evaluations carried out on three databases: A3Novelty, PASCAL CHiME, and PROMETHEUS. Besides providing an extensive analysis of novel and state-of-the-art methods, the article shows how RNN-based autoencoders outperform statistical approaches by up to an absolute improvement of 16.4% average F-measure over the three databases.

#### 1. Introduction

Novelty detection aims at recognizing situations in which unusual events occur. The challenging task of novelty detection is usually treated as a single-class classification task. The “normal” data traditionally comprises a very large set, which allows for accurate modelling. The acoustic events not included in the “normal” data are treated as *novel* events. Novel patterns are tested by comparing them with the normal class model, resulting in a novelty score. The score is then processed by a decision logic, typically a threshold, to decide whether the test sample is novel or normal.

A plethora of approaches have been proposed due to the practical relevance of novelty detection, especially for medical diagnosis [1–3], damage inspection [4, 5], physiological condition monitoring [6], electronic IT security [7], and video surveillance systems [8].

According to [9, 10], novelty detection techniques can be grouped into two macro categories: (i) statistical and (ii) neural network-based approaches. Extensive studies have been made in the category of statistical and probabilistic approaches, which are evidently the most widely used in the field of novelty detection. The approaches in this category model data based on its statistical properties and exploit this information to determine whether or not an unknown test sample belongs to the learnt distribution. Statistical approaches have been applied to a number of applications [9], ranging from data stream mining [11], outlier detection of underwater targets [12], the recognition of cancer [1], nondestructive inspection for the analysis of mechanical components [13], and audio segmentation [14] to many others. In 1999, support vector machines (SVMs) were introduced in the field of novelty detection [15] and subsequently applied to time-series [16, 17], jet engine vibration analysis [18], failure detection in jet engines [19], patient vital-sign monitoring [20], fMRI analysis [21], and damage detection of a gearbox wheel [22].

Neural network-based approaches, also named reconstruction-based [23], have gained interest in recent years along with the evident success of neural networks in several other fields. In the past decade, several works focused on the application of a neural network in the form of an autoencoder (AE) have been presented [10], given the huge impact and effectiveness of neural networks. The autoencoder-based approaches involve building a regression model using the “normal” data. The test data are processed by analysing the reconstruction error between the regression target and the encoded value. When the reconstruction error is high, the test data are considered novel. Example applications include the detection of abnormal CPU data usage [24, 25] and the detection of outliers [26–29] for damage classification under changing environmental conditions [30].
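The reconstruction-based decision described above can be illustrated with a minimal sketch. It assumes that the reconstruction error is the Euclidean distance between each input frame and its reconstruction and that a fixed threshold is applied; the toy frames, the reconstructions, and the threshold value are hypothetical and stand in for the output of a trained autoencoder.

```python
import math

def reconstruction_error(frame, reconstruction):
    """Euclidean distance between an input frame and its reconstruction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(frame, reconstruction)))

def detect_novel_frames(frames, reconstructions, threshold):
    """Return (flags, errors); a flag is True where the error exceeds the threshold."""
    errors = [reconstruction_error(f, r) for f, r in zip(frames, reconstructions)]
    return [e > threshold for e in errors], errors

# Toy data (hypothetical): the model reproduces the first two frames well
# but fails on the third, which resembles nothing in the "normal" data.
frames = [[0.1, 0.2, 0.3], [0.2, 0.2, 0.4], [0.9, 0.1, 0.8]]
reconstructions = [[0.1, 0.2, 0.3], [0.2, 0.3, 0.4], [0.2, 0.2, 0.3]]

flags, errors = detect_novel_frames(frames, reconstructions, threshold=0.5)
print(flags)  # [False, False, True]
```

Only the third frame is flagged, because its reconstruction error is the only one above the assumed threshold.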

In these scenarios, very few studies have been conducted in the field of acoustic novelty detection. Recently, we observed a growing research interest in application domains involving surveillance and homeland security to monitor public places or supervise private environments where people may live alone. Driven by the increasing requirement of security, public places such as, but not limited to, stores, banks, subways, trains, and airports have been equipped with various sensors like cameras or microphones. As a consequence, unsupervised monitoring systems have gained much attention in the research community to investigate new and efficient signal processing approaches. The research in the area of surveillance systems mainly focuses on detecting abnormal events relying on video information [8]. However, it has to be noted that several advantages can be obtained by relying on acoustic information. In fact, acoustic signals, as opposed to video information, require low computational cost and are invariant to illumination conditions, possible occlusion, and abrupt events (e.g., a shotgun and explosions). Specifically in the field of acoustic novelty detection, studies focused only on statistical approaches by applying hidden Markov models (HMM) and Gaussian mixture models (GMM) to acoustic surveillance of abnormal situations [31–33] and to automatic space monitoring [34]. Despite the number of studies exploring statistical and probabilistic approaches, the use of neural network-based approaches for acoustic novelty detection has only been introduced recently [35, 36].

*Contribution*. Only in the last two years has the use of neural networks for acoustic novelty detection gained interest in the research community. In fact, a few recent studies proposed a (pseudo-)generative model in the form of a denoising autoencoder with recurrent neural networks (RNNs). In particular, the use of Long Short-Term Memory (LSTM) RNNs as a generative model [37] was investigated in the fields of text generation [38], handwriting [38], and music [39]. However, the use of LSTM as a model for audio generation was only introduced in our recent works [35, 36].

This article provides a broad and extensive evaluation of state-of-the-art methods with a particular focus on novel and recent unsupervised approaches based on RNN-based autoencoders. We significantly extend the studies conducted in [35, 36] by evaluating further approaches such as one-class SVMs (OCSVMs) and multilayer perceptrons (MLP). Most importantly, we conduct a broad and in-depth evaluation on three different datasets for a total of 160 153 experiments, making this article the first to present such a complete evaluation in the field of acoustic novelty detection.

We evaluate and compare all these methods on three different databases: A3Novelty, PASCAL CHiME, and PROMETHEUS. We provide evidence that RNN-based autoencoders significantly outperform the other methods, improving on statistical approaches by up to 16.4% absolute in average F-measure over the three databases.

The remainder of this contribution is structured as follows. First, a basic description of the different statistical methods is given in Section 2. Then, the feed-forward and LSTM RNNs together with autoencoder-based schemes for acoustic novelty detection are described (Sections 3 and 4). Next, the thresholding strategy and the features employed in the experiments are given in Section 5. The databases used are introduced in Section 6, and the experimental set-up is discussed in Section 7, before the obtained results are evaluated in Section 8. Finally, Section 9 presents our conclusions.

#### 2. Statistical Methods

In this section we introduce statistical approaches such as GMM, HMM, and OCSVM. We formally define the input vector $x \in \mathbb{R}^{n}$, where $n$ is the number of acoustic features (cf. Section 5).

##### 2.1. Gaussian Mixture Models

GMMs estimate the probability density of the “normal” class, given training data, using a number of Gaussian components. The training phase of a GMM exploits the $k$-means algorithm or another suitable training algorithm together with the Expectation-Maximisation (EM) algorithm [40]. The former initializes the parameters, while iterations of the EM algorithm lead to the final model. Given a predefined threshold (defined in Section 5), if the probability produced by the GMM for a test sample is lower than the threshold, the sample is detected as a novel event.
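As an illustration of the procedure above, the following sketch fits a GMM with a k-means initialisation followed by EM and flags a sample as novel when its density falls below a threshold. It is a simplified, one-dimensional toy version written for this purpose: the data, the number of components, the variance floor, and the threshold value are all assumptions, not the configuration used in the experiments.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def kmeans_1d(data, k, iters=10):
    """Crude 1-D k-means, used only to initialise the component means."""
    srt = sorted(data)
    means = [srt[(2 * j + 1) * len(srt) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: abs(x - means[j]))
            clusters[nearest].append(x)
        means = [sum(c) / len(c) if c else means[j] for j, c in enumerate(clusters)]
    return means

def fit_gmm_1d(data, k, em_iters=50, var_floor=1e-3):
    """Fit a 1-D GMM: k-means initialisation followed by EM iterations."""
    means = kmeans_1d(data, k)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(em_iters):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_floor, var)  # floor is an added numerical guard
    return weights, means, variances

def gmm_density(x, model):
    weights, means, variances = model
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

# Hypothetical "normal" data clustered around 0 and 5.
normal_data = [-0.4, -0.2, -0.1, 0.0, 0.1, 0.3, 0.4,
               4.6, 4.8, 4.9, 5.0, 5.1, 5.2, 5.4]
model = fit_gmm_1d(normal_data, k=2)
threshold = 1e-3  # assumed value; the actual strategy is described in Section 5

for sample in (0.05, 2.5):
    print(sample, "novel" if gmm_density(sample, model) < threshold else "normal")
```

The sample at 2.5 lies between the two clusters, so its density under the fitted model is far below the assumed threshold and it is reported as novel.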

##### 2.2. Hidden Markov Models

A further statistical model is the HMM [41]. HMMs differ from GMMs in how they treat the temporal evolution of the input. Indeed, while a diagonal GMM tends to approximate the whole training data probability distribution by means of a number of Gaussian components, an HMM models the variations of the input signal through its hidden states. The HMM topology employed in this work is *left-right*, and it is trained by means of the *Baum-Welch* algorithm [41]; in the novelty detection phase, the decision is based on the *sequence* paradigm. Considering a left-right HMM having $N$ hidden states, a *sequence* is a set of $N$ feature vectors: $\{x_1, \ldots, x_N\}$. The emission probabilities of these observable events are determined by a probability distribution, one for each state [9]. We trained an HMM on what we call “normal” material and exploited the log-likelihoods as novelty scores. In the testing phase, the unseen signal is split into segments of fixed length, depending on the number of states of the HMM, and if the log-likelihood is higher than the defined threshold (cf. Section 5), the segment is detected as novel.
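The sequence-level scoring can be sketched with the forward algorithm for a left-right HMM. The example below is a toy illustration only: it uses one-dimensional Gaussian emissions, a single shared self-transition probability, and hand-picked parameters in place of a model trained with Baum-Welch, all of which are assumptions. It computes the log-likelihood of a segment and leaves the comparison against the threshold to the strategy of Section 5.

```python
import math

NEG_INF = float("-inf")

def log_gauss(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def logsumexp(values):
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_log_likelihood(seq, means, variances, stay_prob):
    """Log-likelihood of a 1-D sequence under a left-right HMM.

    State j can only stay in j (probability stay_prob) or move to j + 1.
    The chain starts in state 0; the last state loops on itself.
    """
    n = len(means)
    log_stay = math.log(stay_prob)
    log_move = math.log(1.0 - stay_prob)
    alpha = [NEG_INF] * n
    alpha[0] = log_gauss(seq[0], means[0], variances[0])
    for x in seq[1:]:
        new_alpha = []
        for j in range(n):
            stay = alpha[j] + (0.0 if j == n - 1 else log_stay)
            move = alpha[j - 1] + log_move if j > 0 else NEG_INF
            new_alpha.append(logsumexp([stay, move]) + log_gauss(x, means[j], variances[j]))
        alpha = new_alpha
    return logsumexp(alpha)

# Hypothetical 3-state model; segments have as many frames as there are states.
means, variances, stay_prob = [0.0, 1.0, 2.0], [0.1, 0.1, 0.1], 0.5

ll_normal = forward_log_likelihood([0.0, 1.1, 1.9], means, variances, stay_prob)
ll_unusual = forward_log_likelihood([2.0, 0.0, 5.0], means, variances, stay_prob)
print(ll_normal, ll_unusual)
```

The segment that follows the left-to-right progression of the state means receives a much higher log-likelihood than the segment that does not.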

##### 2.3. One-Class Support Vector Machines

An OCSVM [42] maps an input example onto a high-dimensional feature space and iteratively searches for the hyperplane that maximises the distance of the training examples from the origin. In this constellation, the OCSVM can be seen as a two-class SVM where the origin is the unique member of the second class, whereas the training examples belong to the first class. Given the training data $x_1, \ldots, x_l \in X$, where $l$ is the number of observations, the class separation is performed by solving the following:

$$\min_{w,\, \xi,\, \rho} \; \frac{1}{2}\lVert w \rVert^{2} + \frac{1}{\nu l}\sum_{i=1}^{l} \xi_i - \rho \quad \text{subject to} \quad (w \cdot \Phi(x_i)) \ge \rho - \xi_i, \quad \xi_i \ge 0,$$

where $w$ is the weight vector, $\xi_i$ are slack variables, $\rho$ is the offset, and $\Phi$ maps $X$ into a dot product space $F$ such that the dot product in the image of $\Phi$ can be computed by evaluating a certain kernel function such as a linear or Gaussian radial basis function:

$$k(x, y) = \exp\left(-\gamma \lVert x - y \rVert^{2}\right),$$

where $\gamma > 0$ controls the kernel width. The parameter $\nu$ sets an upper bound on the fraction of outliers, defined as the data lying outside the estimated region of normality. Thus, the decision values are obtained with the following function:

$$f(x) = w \cdot \Phi(x) - \rho.$$

We trained an OCSVM on what we call “normal” material and used the decision values as novelty scores. During testing, the OCSVM provides a decision value for the unseen pattern, and if the decision value is higher than the defined threshold (cf. Section 5), the segment is detected as novel.
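To make the role of the kernel and the offset concrete, the sketch below evaluates an OCSVM decision value in its kernel form, f(x) = Σ_i α_i k(x_i, x) − ρ, which follows from expanding the weight vector over the training examples. The quadratic programme itself is not solved here: the support vectors, the dual coefficients `alphas`, the offset `rho`, and the kernel width `gamma` are hypothetical values chosen by hand for illustration.

```python
import math

def rbf_kernel(x, y, gamma):
    """Gaussian radial basis function kernel exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

def decision_value(x, support_vectors, alphas, rho, gamma):
    """Kernel form of w . Phi(x) - rho for a one-class SVM."""
    weighted = sum(a * rbf_kernel(sv, x, gamma) for a, sv in zip(alphas, support_vectors))
    return weighted - rho

# Hypothetical "trained" model: three support vectors near the origin.
support_vectors = [[0.0, 0.0], [0.2, 0.1], [-0.1, 0.2]]
alphas = [1.0 / 3.0] * 3   # assumed dual coefficients, summing to 1
rho = 0.5                  # assumed offset
gamma = 1.0                # assumed kernel width

near = decision_value([0.05, 0.1], support_vectors, alphas, rho, gamma)
far = decision_value([3.0, 3.0], support_vectors, alphas, rho, gamma)
print(near, far)
```

With these assumed parameters, a point close to the support vectors yields a positive decision value, while a distant point yields a value close to −ρ. How the decision value is mapped to a novelty score and compared with the threshold depends on the sign convention chosen and is not fixed by this sketch.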

#### 3. Feed-Forward and Recurrent Neural Networks

This section introduces the MLP and the LSTM RNNs employed in our acoustic novelty detectors.

The first neural network type we used is a multilayer perceptron [43]. In an MLP the units are arranged in layers, with feed-forward connections from one layer to the next. Each node outputs an activation function applied over the weighted sum of its inputs. The activation function can be linear, a hyperbolic tangent (tanh), or the sigmoid function. Input examples are fed to the input layer, and the resulting output is propagated via the hidden layers towards the output layer. This process is known as the forward pass of the network. This type of neural network relies only on the current input and not on any past or future inputs.
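The forward pass can be written in a few lines. The sketch below is a minimal illustration with hand-picked weights; the bias terms, the layer sizes, and the weight values are assumptions made for the example and do not reflect the networks used in the experiments.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

ACTIVATIONS = {"linear": lambda z: z, "tanh": math.tanh, "sigmoid": sigmoid}

def layer_forward(inputs, weights, biases, activation):
    """Apply the activation to the weighted sum of the inputs (plus bias) per unit."""
    act = ACTIVATIONS[activation]
    return [act(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def mlp_forward(inputs, layers):
    """Propagate the input through each (weights, biases, activation) layer in turn."""
    activations = inputs
    for weights, biases, activation in layers:
        activations = layer_forward(activations, weights, biases, activation)
    return activations

# Hypothetical network: 2 inputs -> 2 tanh hidden units -> 1 linear output.
layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, 0.0], "tanh"),
    ([[1.0, -1.0]], [0.0], "linear"),
]
output = mlp_forward([1.0, 1.0], layers)
print(output)
```

Because the output depends only on the vector passed to `mlp_forward`, the sketch also shows why such a network has no access to past or future inputs.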

The second neural network type we employed is the LSTM RNN [44]. Compared to a conventional RNN, the hidden units are replaced by so-called memory blocks. These memory blocks can store information in the “cell variable” $c_t$. In this way, the network can exploit long-range temporal context. Each memory block consists of a memory cell and three gates: the input gate, output gate, and forget gate, as depicted in Figure 1.
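How the three gates control the cell variable can be illustrated with a single time step of one memory block. The sketch assumes the common LSTM formulation without peephole connections and uses scalar inputs and hand-picked parameters for brevity; the parameter names are hypothetical and the formulation may differ in detail from the one used in this work.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single LSTM memory block with scalar input and state."""
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate
    g = math.tanh(p["w_c"] * x + p["u_c"] * h_prev + p["b_c"])  # candidate cell input
    c = f * c_prev + i * g    # keep part of the old cell state, write part of the new input
    h = o * math.tanh(c)      # expose the gated cell content as the block output
    return h, c

# Hypothetical parameters: forget gate held open (large positive bias) and
# input gate held closed (large negative bias), so the cell keeps its content.
params = {
    "w_i": 0.0, "u_i": 0.0, "b_i": -20.0,
    "w_f": 0.0, "u_f": 0.0, "b_f": 20.0,
    "w_o": 1.0, "u_o": 0.0, "b_o": 0.0,
    "w_c": 1.0, "u_c": 0.0, "b_c": 0.0,
}

h, c = 0.0, 1.0
for t in range(50):
    h, c = lstm_step(math.sin(t), h, c, params)
print(c)
```

With the forget gate open and the input gate closed, the cell value stays close to its initial value over all 50 steps regardless of the input, which is the mechanism that lets the network retain long-range temporal context.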