Abstract

Abnormal event detection has attracted widespread attention due to its importance in video surveillance scenarios. The lack of labeled abnormal samples makes this problem difficult to solve. This paper proposes a partially supervised learning method that uses only normal samples to train the detection model for video abnormal event detection and localization. Assuming that all normal samples follow a Gaussian distribution, abnormal samples appear with low probability under this distribution. The method is built on the variational autoencoder (VAE): through end-to-end deep learning, it constrains the hidden layer representation of normal samples to a Gaussian distribution. Given a test sample, its hidden layer representation is obtained through the VAE, the probability of this representation under the Gaussian distribution is computed, and the sample is judged abnormal or normal according to a detection threshold. Experiments are conducted on two publicly available datasets, the UCSD dataset and the Avenue dataset. The results show that the proposed method achieves 92.3% and 82.1% frame-level AUC, respectively, at an average speed of 571 frames per second, which demonstrates the effectiveness and efficiency of our framework compared with other state-of-the-art approaches.

1. Introduction

With the development of chip technology and the reduction of bandwidth and storage costs, network digital cameras have replaced traditional analog cameras and are widely deployed in museums, banks, airports, etc. To strengthen public safety and prevent crime, video surveillance data has grown explosively. According to IHS data [1], the new video surveillance cameras installed worldwide in 2016 were expected to produce approximately 566 GB of data per day, and by 2023 this amount is estimated to reach 3500 GB. The rapid growth of video data places higher requirements on video understanding. Intelligent surveillance technology has replaced traditional surveillance personnel to achieve real-time structured processing and analysis of massive video data. As one of its key technologies, abnormal event detection aims to detect, in real time from massive surveillance video, the small number of abnormal events that are inconsistent with the majority of normal events.

In recent years, abnormal event detection has gradually become a research hotspot in computer vision and pattern recognition. The main difficulty is that the scenes of abnormal events are diverse, and it is difficult to give a definition that covers the boundaries of all possible abnormal events. A common solution is to define an abnormal event as a low-probability event relative to normal events, which makes statistical treatment possible: events that deviate from expectation and are inconsistent with the normal samples are regarded as abnormal. As with most approaches in computer vision and pattern recognition, existing methods for detecting abnormal events can be roughly divided into two steps [2–4]: event representation and the anomaly detection model. Event representation extracts appropriate features from the video to represent an event. Due to the ambiguity of event definition, an event can be characterized by object-level features or pixel-level features. The former often uses object trajectory features [5] or object appearance features [6] (such as motion history images and motion energy images) to describe an event. However, object-level features rely on detecting and tracking objects, which is difficult in crowded scenes, especially when moving objects occlude each other. Pixel-level features are often extracted from two-dimensional image blocks or three-dimensional video cubes, such as spatiotemporal gradients (STG) [7], histograms of optical flow (HOF) [8, 9], and mixtures of dynamic textures (MDT) [4]. After obtaining features that represent the event, the next step is to build an anomaly detection model, which establishes rules or models for normal events; a test event that violates the rules or does not conform to the model is treated as an anomaly. Common models include cluster-based detection models [10], detection models based on state inference [11], and detection models based on sparse reconstruction [8, 12]. Among them, cluster-based detection models cluster similar normal events together, so samples far away from the cluster centers during the testing phase are regarded as abnormal events. State inference models assume that normal events undergo a fixed change over time and that abnormal events do not conform to this change. For detection models based on sparse reconstruction, the main principle is that the reconstruction error of normal events is small relative to that of abnormal events.

Although the above methods have achieved certain results, a problem remains: the event representation and the anomaly detection model are designed separately. This forces researchers to spend considerable effort designing each component, and the resulting methods often generalize poorly when the video scene changes. Recently, deep learning has achieved excellent results in computer vision, pattern recognition, and intelligent manufacturing, such as object recognition [6, 13], object detection [14], behavior recognition [15], and health diagnosis. The key to its success is that feature representation and pattern recognition are jointly optimized, which maximizes the benefit of their collaboration and further improves generalization across different scenarios. Driven by this success, researchers began to apply deep learning to abnormal event detection [16–18]. In [16], a three-channel architecture was proposed that uses an autoencoder [17] on each channel to learn features; a one-class support vector machine (SVM) is then employed to predict the anomaly score of each channel, and the scores of the three channels are merged as the final basis for judging abnormality. Sabokrou et al. introduced a cascaded anomaly detection method, which detects abnormal events based on the reconstruction error of an autoencoder and the sparsity of a sparse autoencoder. Based on handcrafted features and short video clips, Hasan et al. adopted a fully connected autoencoder and a fully convolutional autoencoder to learn the temporal regularity of normal events; abnormality is then detected from the regularity score computed from the reconstruction error. However, these reconstruction-based deep methods treat any sample that differs from the normal samples as an anomaly and ignore the low-probability nature of abnormal events, so many normal samples that did not appear in the training data are misjudged as abnormal, leading to false alarms. Unlike the methods above, in this paper we propose an end-to-end deep learning framework for abnormal event detection. The proposed method is based on the variational autoencoder (VAE) [19–22], which maps high-dimensional raw input data to a low-dimensional hidden layer representation through deep learning and constrains this representation to conform to a Gaussian distribution. Therefore, the hidden representation of a normal sample has a relatively large probability under the Gaussian distribution, while that of an abnormal sample has a relatively small probability. Obtaining the hidden representation and constraining it to a Gaussian distribution correspond, respectively, to the two main steps of anomaly detection: event representation and the anomaly detection model. In the proposed method, these two steps are jointly optimized in an end-to-end deep learning framework, which improves the generalization ability. Experimental results on two public datasets show that the proposed method has strong generalization ability and detection performance comparable to the state of the art.

2. VAE for Anomaly Detection

The overall process of the proposed method is as follows. During the training phase, space-time cubes are densely sampled from normal samples, and their raw pixels are directly used as the input of the VAE to learn a Gaussian distribution over the hidden layer representation of the input data. For a test sample, its hidden layer representation is obtained through the VAE, the probability that it belongs to the Gaussian distribution is calculated and used as the anomaly score, and samples with scores below a threshold are judged abnormal. In this section, we first briefly introduce the principle of the autoencoder and then elaborate the proposed VAE-based method for video abnormal event detection.

2.1. Principle of Autoencoder

The autoencoder [17] maps the input data into a hidden layer space to obtain its hidden representation, from which the original input data can be reconstructed. An autoencoder consists of an encoder and a decoder and can be expressed as

$$h = f(x; \theta), \qquad \hat{x} = g(h; \phi), \tag{1}$$

where $x$ and $\hat{x}$ represent the input of the autoencoder and the reconstructed input, respectively, $h$ is the hidden representation of $x$, and $\theta$ and $\phi$ are the parameters of the neural networks. In order to make the reconstruction $\hat{x}$ as close as possible to the input $x$, the reconstruction error between them is minimized:

$$L_{\mathrm{rec}}(\theta, \phi) = \|x - \hat{x}\|^{2}. \tag{2}$$
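As a concrete reference, the following is a minimal sketch of such an autoencoder in TensorFlow/Keras; the layer sizes and activations are illustrative assumptions, not the exact architecture used later in the paper.

```python
import tensorflow as tf

# Minimal fully connected autoencoder sketch for equations (1)-(2).
input_dim, hidden_dim = 500, 30

x_in = tf.keras.Input(shape=(input_dim,))
h = tf.keras.layers.Dense(hidden_dim, activation="relu")(x_in)      # encoder: h = f(x; theta)
x_hat = tf.keras.layers.Dense(input_dim, activation="sigmoid")(h)   # decoder: x_hat = g(h; phi)

autoencoder = tf.keras.Model(x_in, x_hat)
autoencoder.compile(optimizer="adam", loss="mse")   # reconstruction error ||x - x_hat||^2
```

Training this model with the mean-squared-error loss directly implements the reconstruction objective in (2).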

The hidden layer representation of the autoencoder is often used as an effective feature that is fed directly into a subsequent pattern recognition model. To improve the expressive ability of the hidden representation, the denoising autoencoder [23] and the sparse autoencoder [17] were developed by introducing noise and adding sparsity constraints, respectively, making the hidden representation robust to partial corruption of the data and sparse.

Using reconstruction error [18, 24, 25] or directly extracting the hidden layer representation as a feature [16], autoencoders have been successfully applied to anomaly detection tasks. However, these methods ignore the probabilistic model in which normal samples occur with high probability and abnormal samples occur with low probability. To address this, we assume that the hidden representation of normal samples conforms to a Gaussian distribution and propose a video abnormal event detection method based on the variational autoencoder.

2.2. Anomaly Detection Model Based on VAE

Given $N$ normal training samples $X = \{x_i\}_{i=1}^{N}$, where each sample $x_i \in \mathbb{R}^{D}$, the VAE [19–22] learns a Gaussian distribution in the hidden representation space $z \in \mathbb{R}^{J}$. In this space, the training samples are assumed to conform to a Gaussian distribution, which means that all training samples are gathered around a single cluster center; samples far from this center are abnormal samples.

Specifically, the hidden layer representation satisfies

$$p(z) = \mathcal{N}(z; 0, I), \tag{3}$$

where $I$ is the identity matrix. Similar to the reconstruction process of the autoencoder, the VAE makes the data generated by the model very similar to the input data. Similar to the architecture of the traditional autoencoder, the VAE also includes two neural networks:

(1) Inference network: a probabilistic encoder $q_{\phi}(z \mid x)$, which maps the input $x$ to a hidden representation $z$ whose distribution approximates the true posterior distribution $p(z \mid x)$.

(2) Generative network: a generative decoder $p_{\theta}(x \mid z)$, which reconstructs the original training data from the hidden representation $z$ without relying on any particular input.

Here, $\phi$ and $\theta$ represent the parameters of the two networks, respectively. The inference network $q_{\phi}(z \mid x)$ can be seen as the "encoder," mapping the training data $x$ to the hidden representation $z$; the generative network $p_{\theta}(x \mid z)$ can be seen as the "decoder," reconstructing the training data $x$ from the hidden representation $z$.
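A minimal sketch of the two networks in TensorFlow/Keras is given below. The layer sizes follow the 500-500-2000-30 structure reported in Section 3.2, but the exact arrangement, activations, and output parameterization (mean and log-variance) are our assumptions rather than the authors' released implementation.

```python
import tensorflow as tf

input_dim, latent_dim = 500, 30

# Inference network q_phi(z|x): outputs the mean and log-variance of the
# Gaussian posterior over the hidden representation z.
enc_in = tf.keras.Input(shape=(input_dim,))
h = tf.keras.layers.Dense(500, activation="relu")(enc_in)
h = tf.keras.layers.Dense(500, activation="relu")(h)
h = tf.keras.layers.Dense(2000, activation="relu")(h)
z_mean = tf.keras.layers.Dense(latent_dim)(h)
z_log_var = tf.keras.layers.Dense(latent_dim)(h)
encoder = tf.keras.Model(enc_in, [z_mean, z_log_var], name="inference_network")

# Generative network p_theta(x|z): mirrors the encoder and reconstructs the input.
dec_in = tf.keras.Input(shape=(latent_dim,))
g = tf.keras.layers.Dense(2000, activation="relu")(dec_in)
g = tf.keras.layers.Dense(500, activation="relu")(g)
g = tf.keras.layers.Dense(500, activation="relu")(g)
x_hat = tf.keras.layers.Dense(input_dim, activation="sigmoid")(g)
decoder = tf.keras.Model(dec_in, x_hat, name="generative_network")
```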

According to the theory of the VAE [20], the loss function can be expressed as

$$\mathcal{L}(\theta, \phi) = -\mathbb{E}_{q_{\phi}(z \mid x)}\left[\log p_{\theta}(x \mid z)\right] + D_{\mathrm{KL}}\left(q_{\phi}(z \mid x) \,\|\, p(z)\right). \tag{4}$$

In (4), the first term is the negative expected log-likelihood of the training data, which drives the decoder $p_{\theta}(x \mid z)$ to reconstruct the training data $x$; it can be regarded as the reconstruction error, and its value is small when the reconstruction is good. According to the principle of Monte Carlo sampling, for each training sample $x_i$, drawing $L$ samples $z_{i,l} \sim q_{\phi}(z \mid x_i)$, $l = 1, \dots, L$, the expectation can be approximated as

$$\mathbb{E}_{q_{\phi}(z \mid x_i)}\left[\log p_{\theta}(x_i \mid z)\right] \approx \frac{1}{L}\sum_{l=1}^{L} \log p_{\theta}\left(x_i \mid z_{i,l}\right), \tag{5}$$

where $z_{i,l}$ is the $l$-th hidden representation sampled for $x_i$.

The second term is the Kullback–Leibler divergence between $q_{\phi}(z \mid x)$ and $p(z)$ [9], that is, between the distribution the encoder learns and the prior distribution of the hidden representation. The Kullback–Leibler divergence measures the difference between two probability distributions and is very small for two similar distributions. Under our assumption, $q_{\phi}(z \mid x)$ is the normal distribution

$$q_{\phi}(z \mid x) = \mathcal{N}\left(z; \mu, \sigma^{2} I\right), \tag{6}$$

where $\mu$ and $\sigma^{2}$ are the mean and variance output by the inference network.

According to (3), $p(z)$ can be further expressed as

$$p(z) = (2\pi)^{-J/2} \exp\left(-\tfrac{1}{2} z^{\top} z\right), \tag{7}$$

where $J$ is the dimension of the hidden representation.

According to (6) and (7), the second term of (4) can be expressed in closed form as

$$D_{\mathrm{KL}}\left(q_{\phi}(z \mid x) \,\|\, p(z)\right) = \frac{1}{2} \sum_{j=1}^{J}\left(\mu_{j}^{2} + \sigma_{j}^{2} - \log \sigma_{j}^{2} - 1\right). \tag{8}$$

Through the reparameterization trick [24], the network parameters can be adjusted to minimize (4) using stochastic gradient descent (SGD) [26]. The VAE is thus essentially an autoencoder with an added Kullback–Leibler divergence term: the hidden representation obtained by the encoder not only reconstructs the input samples but also conforms to a Gaussian distribution. Therefore, abnormal events can be detected with the learned VAE.
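To make the objective concrete, the following sketch computes the loss in (4) with a single reparameterized Monte Carlo sample for the reconstruction term (5) and the closed-form KL term (8). The use of a squared-error reconstruction term in place of $-\log p_{\theta}(x \mid z)$ is our simplifying assumption, and `encoder`/`decoder` refer to the networks sketched above.

```python
import tensorflow as tf

def vae_loss(x, encoder, decoder):
    # Inference network outputs the Gaussian posterior parameters.
    z_mean, z_log_var = encoder(x)
    # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I).
    eps = tf.random.normal(tf.shape(z_mean))
    z = z_mean + tf.exp(0.5 * z_log_var) * eps
    x_hat = decoder(z)
    # Reconstruction term of (4), cf. (5); squared error stands in for -log p(x|z).
    recon = tf.reduce_sum(tf.square(x - x_hat), axis=-1)
    # Closed-form KL divergence between N(mu, sigma^2 I) and N(0, I), cf. (8).
    kl = 0.5 * tf.reduce_sum(
        tf.square(z_mean) + tf.exp(z_log_var) - z_log_var - 1.0, axis=-1)
    return tf.reduce_mean(recon + kl)
```

Minimizing this loss with a gradient-based optimizer jointly trains both networks end to end.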

2.3. Prediction

After learning the network weights of the VAE, the hidden layer representation $z$ of a test sample $x$ can be obtained from the inference network in (6). According to (7), the probability of $z$ belonging to the Gaussian distribution is

$$s(x) = p(z) = (2\pi)^{-J/2} \exp\left(-\tfrac{1}{2}\|z\|^{2}\right). \tag{9}$$

If the test sample is normal, its hidden representation falls in the high-probability region of the Gaussian distribution; in contrast, the hidden representation of an abnormal sample has a relatively small probability under the Gaussian distribution. Therefore, to infer whether a test sample is abnormal, a threshold can be set on the anomaly score $s(x)$ as follows:

$$x \text{ is } \begin{cases} \text{abnormal}, & s(x) < \gamma, \\ \text{normal}, & s(x) \geq \gamma, \end{cases} \tag{10}$$

where $\gamma$ determines the sensitivity of the detection method.
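A small numerical sketch of the score (9) and decision rule (10) in NumPy is given below; using the encoder mean as the test-time hidden representation and the particular threshold value are placeholders for illustration.

```python
import numpy as np

def anomaly_score(z):
    """Probability of the hidden representation z under the prior N(0, I), cf. (9)."""
    J = z.shape[-1]
    return (2.0 * np.pi) ** (-J / 2.0) * np.exp(-0.5 * np.sum(z ** 2, axis=-1))

def is_abnormal(z, gamma):
    """Decision rule (10): a score below the threshold gamma marks the sample abnormal."""
    return anomaly_score(z) < gamma

# A latent vector near the origin scores much higher than one far from it.
z_normal, z_far = np.zeros(30), 3.0 * np.ones(30)
print(anomaly_score(z_normal), anomaly_score(z_far))
```

In practice, the logarithm of (9) can be used to avoid numerical underflow; the ordering of the scores, and hence the decision in (10), is unchanged.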

3. Experiment

In order to verify the effectiveness of the proposed method, experiments were conducted on two datasets, i.e., the UCSD Ped1 dataset [8] and the Avenue dataset [26], and the results are compared with several existing methods. In the following, we introduce the experimental data, evaluation metrics, implementation details, and experimental results.

3.1. Experimental Data and Evaluation Indicators

UCSD Ped1 dataset: the dataset records scenes on a sidewalk through a fixed camera with a slightly tilted lens angle. It contains 34 normal and 36 abnormal video clips, each of 200 frames. Normal events are pedestrians walking on the sidewalk; abnormal events mainly include bicycles, skateboarders, small carts, and pedestrians walking on the lawn.

Avenue dataset: the dataset uses a fixed camera with a slightly tilted lens angle to record the scene in front of a school corridor. It contains 15 normal and 21 abnormal video clips, with a total of 30,652 frames. Normal events are pedestrians walking parallel to the camera; abnormal events include people running, throwing objects, and loitering. Figure 1 gives some examples of events in the two datasets, in which the upper pictures of each figure are normal events and those on the bottom are anomalies.

Frame-level and pixel-level evaluation metrics [11] are used to evaluate the performance of the detection method. For the frame-level metric, if a frame in the test sample contains at least one detected abnormal pixel, the frame is judged abnormal. For the pixel-level metric, a detection is counted as correct only if the detected anomalous region overlaps the ground-truth anomalous region by more than 40%. For both metrics, the true positive rate (TPR) and false positive rate (FPR) are calculated first; then, by varying the threshold in (10), the ROC curve is plotted and the area under the curve (AUC) is computed.
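As an illustration of how the frame-level AUC is computed, the snippet below varies the threshold implicitly via scikit-learn's ROC routine. The score and label arrays are random placeholders, not results from the paper, and the probability-style score from (9) is negated because roc_curve expects larger values to indicate anomalies.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
scores = rng.random(1000)            # per-frame anomaly scores from (9) (placeholder values)
labels = rng.integers(0, 2, 1000)    # ground truth: 1 = abnormal frame, 0 = normal

# Negate the scores so that larger values correspond to "more abnormal".
fpr, tpr, thresholds = roc_curve(labels, -scores)
print("frame-level AUC: %.3f" % auc(fpr, tpr))
```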

3.2. Experimental Setup

For both datasets, every frame is resized to 160 × 120. Each normal video clip is divided into nonoverlapping space-time cubes of size 10 × 10 × 5. These cubes are then converted into 500 × 1 vectors, normalized, and used as the network input to train the weights of the variational autoencoder. The proposed network has four hidden layers with 500, 500, 2000, and 30 neurons, respectively, and uses a completely symmetrical encoder-decoder structure. The Adam optimizer [8] is adopted with an initial learning rate of 0.001; after every 1000 iterations, the learning rate is reduced by a factor of 10, and training stops at 10,000 iterations. The batch size is 100. In the testing phase, the test video is also divided into nonoverlapping space-time cubes of size 10 × 10 × 5, which are input into the network to obtain their hidden layer representations; then, based on (10), whether each region is abnormal can be determined. The experimental hardware platform is an NVIDIA GTX 1070 Ti GPU with 8 GB of video memory, and the software environment is TensorFlow and Python. In order to fully evaluate the performance of the proposed method, several comparison methods are drawn from the literature, i.e., [7, 10, 17], and [22]; for simplicity, they are denoted as "Method 1," "Method 2," "Method 3," and "Method 4," respectively.
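The cube extraction step described above can be sketched as follows; the clip array is a random placeholder, frame resizing is omitted, and the simple division by 255 is our assumed normalization.

```python
import numpy as np

def extract_cubes(clip, ph=10, pw=10, pt=5):
    """Cut a (T, 120, 160) grayscale clip into nonoverlapping 10 x 10 x 5 cubes,
    flattened to 500-dimensional vectors."""
    t, h, w = clip.shape
    cubes = []
    for ti in range(0, t - t % pt, pt):
        for yi in range(0, h, ph):
            for xi in range(0, w, pw):
                cubes.append(clip[ti:ti + pt, yi:yi + ph, xi:xi + pw].reshape(-1))
    return np.asarray(cubes, dtype=np.float32) / 255.0   # assumed normalization

clip = np.random.randint(0, 256, size=(20, 120, 160))    # placeholder resized clip
print(extract_cubes(clip).shape)                          # (768, 500)
```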

3.3. Results and Discussion

Figure 2 gives the results on the UCSD Ped1 dataset, where Figures 2(a) and 2(b) show the frame-level and pixel-level ROC curves of the proposed method and the comparison methods. In the first three comparison methods, the two steps of event representation and anomaly detection model construction are carried out separately: Method 1 extracts mixture of dynamic textures features and then builds a statistical inference anomaly detection model; Method 2 extracts spatiotemporal gradient features and then adopts sparse reconstruction for anomaly detection; and Method 3 uses an autoencoder to extract features and a one-class support vector machine for anomaly detection. Method 4 is an end-to-end deep learning method. The results of these four methods are taken from the corresponding papers; Method 4 does not provide an ROC curve.

As can be seen from Figure 2, the proposed method achieves the best results under the frame-level evaluation criterion. Under the pixel-level criterion, the results of the proposed method are close to those of Method 2 and Method 3 but clearly better than that of Method 1. Table 1 shows the comparison of the different algorithms on the UCSD Ped1 dataset at the frame level and the pixel level. The proposed method achieves 92.3% frame-level AUC and 71.4% pixel-level AUC, which are better than all comparison methods. It is worth noting that learning temporal regularity is also an end-to-end deep learning method, yet its results are clearly lower than those of the proposed method; this is because that method uses each whole frame of the video as the input of the neural network, whereas the proposed method operates on local space-time cubes.

Table 2 shows the frame-level detection results on the Avenue dataset, where only Method 2 and Method 4 are evaluated (Method 4 does not give a corresponding ROC curve). The proposed method achieves 82.1% frame-level AUC, which is higher than the other two methods by 1.3% and 3.8%, respectively. The results show that the proposed method achieves high detection accuracy and good generalization on the Avenue dataset.

Figure 3 shows examples of correct detection results on the two datasets, where (a) and (b) are test results on the UCSD Ped1 dataset and (c) and (d) are test results on the Avenue dataset. It can be observed from Figure 3 that the proposed method can detect different types of abnormal events, including bicycles, trolleys, and skateboards, which further validates its anomaly detection performance.

Table 3 compares the detection speed of the proposed method with other methods on the UCSD Ped1 dataset; the results of the comparison methods come from their corresponding articles. The hardware environment is an Intel Core i7-8700K 3.7 GHz CPU, an NVIDIA GeForce GTX 1070 Ti GPU (8 GB video memory), and 16 GB of RAM. The computing platform is Python 3.7 and TensorFlow 1.7. As can be seen from Table 3, the detection speed of the proposed method is 571 fps, which clearly surpasses that of the other comparison methods.

4. Conclusion

In this paper, a method for video anomaly detection and localization based on the VAE is proposed within an end-to-end deep learning framework. The method assumes that all normal samples conform to a Gaussian distribution, so the probability of an abnormal sample under this distribution is relatively small. In the proposed method, the two steps of event representation and anomaly detection model construction are converted, respectively, into the hidden layer representation and the Gaussian distribution constraint of the VAE, and the two steps are jointly optimized to improve the accuracy and generalization ability of the method. Quantitative results on two public datasets show that the proposed method reaches the current state of the art. Future work will consider applying the proposed method to more complex datasets.

Data Availability

The datasets used in this paper are publicly available.

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was sponsored in part by the National Natural Science Foundation of China (61671309).