Abstract

With the rapid development of the aviation industry, ensuring the safe flight of aircraft has become particularly important. How to find potential hazards during aircraft flight has long been one of the important topics of civil aviation research. At present, the Quick Access Recorder (QAR) is the most widely used device for storing data recorded on aircraft. QAR data contain a large amount of valuable and largely unexplored information and record the true status of the aircraft in detail. Therefore, finding abnormal data in QAR records lays an important foundation for identifying the cause of anomalies and safeguarding flight. In this paper, in order to discover abnormal information in QAR data, we apply a VAE-LSTM model with a multihead self-attention mechanism. Compared with the VAE and LSTM models alone, our model performs much better in anomaly detection and prediction and detects all types of anomalies. We conducted extensive experiments on real-world QAR datasets to demonstrate the efficiency and accuracy of the proposed neural network model. The experimental results show that the proposed model outperforms state-of-the-art models under different experimental settings.

1. Introduction

With the continuous growth of civil aviation passenger traffic, aviation safety has become a significant issue worldwide. Safety is the prerequisite for the steady development of all industries and the basis for the survival of the transportation sector. Ensuring aviation safety has always been a major challenge for aviation activities [1].

Flight Operational Quality Assurance (FOQA) is an important scientific method of aviation safety management used to monitor the data generated by aircraft. Over the last few decades, with improved sensing capabilities, various recorders have been installed on aircraft to monitor aircraft systems and flight crew performance. Among these recorders, the Quick Access Recorder (QAR) is easier to install and configure than the Flight Data Recorder (FDR) and the Cockpit Voice Recorder (CVR). The QAR is a flash-memory recorder for aircraft data acquisition systems and a key data source for airlines to evaluate flight quality and aircraft engine operation [2]. It covers most of the parameters of aircraft flight, including aircraft attitude parameters and engine-related data. By analysing the flight parameters recorded by the QAR, anomalies can be detected to avoid safety hazards and improve flight quality.

Nowadays, aircraft failure detection and early warning based on QAR data have become one of the important fields of civil aviation research. However, many factors can affect the quality of QAR data, such as the working environment, signal transmission, data precision, and data decoding computation. Therefore, the original QAR data contain many anomalies and cannot be used directly without processing. To improve the quality of QAR data, it is necessary to perform anomaly detection on them; anomaly detection from QAR data is also one of the important strategies of FOQA. Finding anomalies in QAR data in time can prevent many unnecessary losses. Ensuring the safe flight of aircraft therefore calls for an efficient and accurate anomaly detection method based on advanced techniques.

QAR data record hundreds of parameters for each flight, including the N11 and N12 engine thrust parameters, altitude, and vertical acceleration, and thus form a type of multivariate time series. They are characterized by large data volume, strong temporal structure, and regular trends. Compared with classical time series data, QAR data have some peculiar features. Classical time series data are collected from stationary sensors, whereas QAR data are collected during flight, which is divided into multiple flight phases. In different phases, the aircraft operates under different statuses and environments, so the recorded parameters in QAR data present different distributions. Figure 1 takes the altitude parameter and the N12 engine thrust parameter recorded in the QAR as an example. In actual flight, QAR data record the whole flight of an aircraft from take-off to landing, comprising 11 phases in total: TAXI OUT, TAKE OFF, 2 SEGMENT, INI. CLIMB, CLIMB, CRUISE, DESCENT, APPROACH, FINAL, LANDING, and TAXI IN, as shown in Figure 1.

In recent years, many methods have been proposed for time series anomaly detection, among which deep learning methods have become increasingly popular. Because forecasting and anomaly detection involve large amounts of time series data, labeling enough data to train a supervised model is unrealistic. Therefore, unsupervised anomaly detection is the preferred solution for most researchers. Anomalies can be divided into three types according to their manifestations, namely, point anomalies, collective anomalies, and background anomalies. Point anomalies are the easiest to find and can usually be flagged by simple thresholding or clustering methods. In contrast, collective and background anomalies are the most common in practice and deserve more in-depth study. Dealing with background anomalies usually requires taking the relationship between adjacent data into account, and prediction-based models are very effective for detecting such anomalies. For example, Ergen and Kozat [3] used algorithms based on long short-term memory (LSTM) neural networks to identify anomalies by calculating the difference between predicted and actual values. Collective anomalies are usually anomalous subsequences within the entire sequence. The first step in detecting them is usually to divide the time series into equal-sized windows and treat the extracted subsequences as whole sequences. For example, methods based on the autoencoder (AE) [4] and the variational autoencoder (VAE) [5], which use reconstruction differences for anomaly detection, have been shown to be effective.

A VAE model alone only considers the temporal dependence within a window and cannot analyze information outside the window. This paper proposes a VAE-LSTM hybrid deep model based on a multihead self-attention mechanism, which integrates the VAE and LSTM into a single framework for unsupervised anomaly detection. Instead of feeding the raw data directly into the LSTM model as other methods do, we pretrain the VAE model first and then use the low-dimensional feature vectors produced by its encoder as the input of the LSTM model. Using the VAE model to capture the contextual information within each window enables the LSTM model to learn longer correlations in the time series. First, after pretraining the VAE model, the encoder divides the QAR data into windows of a specific shape, extracts features from the recorded parameters, and produces low-dimensional feature vectors that serve as the input of the LSTM model. Next, we use the LSTM model to learn the temporal dependencies of the sequence. We also improve the LSTM model by incorporating a multihead self-attention mechanism derived from the Transformer model [6]. Attention mechanisms are usually used in fields such as text classification and text translation; in this paper, we apply one to time series anomaly detection. The self-attention mechanism adjusts the weights of the data, which amounts to a feature extraction of the data themselves, and makes it easier to capture long-distance interdependent features. The multihead self-attention mechanism runs multiple self-attention operations in parallel and reduces the amount of computation by reducing the dimension. Finally, the feature vector generated by the multihead self-attention module is reconstructed by the decoder for anomaly detection. We use a multihead self-attention mechanism for deep feature extraction because, compared with traditional deep learning methods, it can explore hidden features without relying on complex neural network structures, achieves higher efficiency and performance, and captures long-distance interdependent features more easily. We believe that combining the multihead self-attention mechanism with the LSTM model allows the model to better focus on the long-term dependencies of the time series. In this way, our model can effectively detect both short-term and long-term anomalies.

In summary, the main contributions of this work are as follows:
(i) We propose a novel anomaly detection model for QAR data. We first pretrain the VAE model and optimize it by maximizing the ELBO for feature learning.
(ii) We improve the LSTM model by incorporating a multihead self-attention mechanism to capture long-term correlations in QAR data, enabling the model to detect all types of anomalies.
(iii) To further improve classification accuracy, we adopt a threshold selection method that maximizes the F1 metric, which effectively reduces false positives caused by improper threshold selection.
(iv) We conducted extensive experiments on a real QAR dataset to evaluate our model and compared it with other deep learning methods. The experiments show that our model achieves a significant improvement over other methods.

2. Related Work

QAR data are multivariate time series data with unique characteristics compared to classical time series data. Few works focus on the anomaly detection of QAR data. In this section, we first discuss existing state-of-the-art methods in the field of anomaly detection and analyze their strengths and weaknesses in order to justify our proposed method.

Anomaly detection in time series has always been a complex and challenging task in many disciplines and has been widely studied. In anomaly detection, temporal continuity is important: outliers are often defined as unusual because they break the continuity of their short- or long-term history. Therefore, anomalies in time series can be divided into two categories: short-term anomalies and long-term anomalies. Short-term anomalies occur when series values change abruptly within short time intervals. Long-term anomalies are entire time series or subsequences that are identified as anomalous. The field of anomaly detection has produced a large body of literature, and the proposed methods can be roughly classified into three categories: statistics-based methods, classical machine learning-based methods, and deep learning-based methods [7].

2.1. Statistical Methods

The most common methods are the autoregressive moving average (ARMA) model and one of its generalizations, the autoregressive integrated moving average (ARIMA) model [8]. They are classic prediction-based anomaly detection models suitable for univariate time series. The ARIMA model fits a linear equation to previous data for prediction, describes the relationship between current and historical values, and uses its own historical data to predict new data. It requires the sequence to be stationary, and nonstationary sequences must be made stationary by differencing. Ottosen and Kumar [9] used an ARIMA-based anomaly detection technique to detect short-term anomalies in low-cost air quality datasets by computing prediction errors based on the absolute value of the residuals. However, the disadvantage of this method is that it can only predict phenomena related to previous data, and the autoregressive order and the prediction-error parameters must be chosen appropriately.

2.2. Machine Learning Methods

Common machine learning algorithms include clustering methods such as K-means clustering [10]. The K-means algorithm is the basic and most widely used partitioning algorithm among clustering methods: the sample data are clustered into a specified number of categories K, and the corresponding cluster centroids are used to detect anomalies in the monitored data. Li et al. [11] proposed a cluster-based algorithm to detect excessive QAR events. It converts each flight's data into a high-dimensional vector and uses the DBSCAN algorithm to cluster the matrix row vectors, with the aim of identifying exceptions without knowing the normative standard. Zhao et al. [12] proposed an algorithm based on a Gaussian mixture model (GMM) that incrementally updates the clusters according to the data instead of reclustering and adapts to new data through an expectation-maximization algorithm to handle dynamically changing flight data. Zeng et al. [13] used the density-based DBSCAN clustering method to detect aircraft onboard and controller data that deviate from the normal range. Smart et al. [14] proposed a two-stage approach based on a support vector machine (SVM) classifier to detect anomalies in the descent stage of a specific flight: the first stage quantifies anomalies at specific altitudes during the flight, and the second stage ranks all flights to identify the most likely anomalies. Although the above algorithms can detect abnormal flights from QAR data, they do not take the temporal patterns in the data into account and cannot adequately explain why an abnormality occurs.

2.3. Deep Learning Methods

Compared with the above two classes of methods, deep learning-based anomaly detection models can capture more complex hidden features and temporal correlations in time series, so they have received extensive attention in recent years. Broadly speaking, they can be divided into two categories: predictive models and generative models. Predictive models detect anomalies by using the prediction error as an anomaly score. In particular, the convolutional neural network (CNN), the recurrent neural network (RNN), and its improved variant, the long short-term memory (LSTM) network [15], have achieved remarkable results. They all have a powerful ability to learn from data. In addition, LSTM networks can model longer sequences; they use gate structures to regulate stored memory and to learn and capture normal behavior. When encountering data that deviate significantly from normal data, the prediction error becomes large, indicating an anomaly. Hundman et al. [16] used an LSTM model to predict spacecraft telemetry data and applied dynamic thresholding of the errors to identify anomalies. Khorram et al. [17] combined CNN and LSTM into a novel convolutional long short-term memory recurrent neural network for fault detection, achieving high generalization accuracy and resistance to overfitting. However, such a standalone prediction model is not only computationally expensive but may also produce large deviations in the prediction results due to uncertain factors, and LSTM models are very sensitive to the choice of parameters. As a result, many advanced generative models have emerged, including the variational autoencoder (VAE) [18] and generative adversarial networks (GAN) [19]. At their core, they learn representations of normal patterns. Kishore et al. [20] proposed a deep autoencoder (DAE) applied to the raw time series data of multiple aircraft sensors and used the AE reconstruction error to determine whether the data were abnormal. Combining a convolutional neural network (CNN) with a VAE, Memarzadeh et al. [21] developed a convolutional variational autoencoder (CVAE) applied to data from anomalous commercial flight departures. Wang et al. [22] proposed a sequential parameter attention-based convolutional autoencoder (SPA-CAE) model for feature extraction from QAR data collected at Changshui Airport in Kunming. Provotar et al. [23] used LSTM layers in an autoencoder framework; the idea is that, compared with normal data, abnormal data are difficult to represent with low-dimensional feature vectors, so the data are fed into the LSTM autoencoder and the reconstruction error is used to judge whether they are abnormal. These generative models hold great promise in the field of anomaly detection. However, such reconstruction-based models struggle to capture long-range temporal dependencies and cannot explicitly model potential interactions between features. On the other hand, simply attaching a network such as an LSTM to a feedforward layer in an AE or VAE does not yield good detection performance.

In summary, both the information inside each window after windowing and the correlation between a window and the rest of the time series are essential for anomaly detection. Although many approaches have been proposed, they rarely account for both: the correlation between windows is ignored, and only one type of anomaly is detected. For these reasons, we propose a new VAE-LSTM hybrid deep model based on a multihead self-attention mechanism, which can effectively identify multiple types of anomalies without being limited by the window size.

3. Model

In this section, we introduce the overall workflow and internal structure of the VAE-based MHSA-LSTM hybrid model, as shown in Figure 2. We describe how the model is trained in an unsupervised way and explain the anomaly detection process on QAR data.

3.1. Problem Definition

A univariate time series is an ordered sequence of real-valued variables arranged in chronological order. It can be formalized as $X = \{x_1, x_2, \ldots, x_n\}$, where $n$ is the length of the time series. Anomalies are observations or sequences of observations that deviate significantly from the general distribution of the data. In this paper, our goal is to discover outliers in QAR data through anomaly detection. Our method is divided into two parts: model training and anomaly detection. A window $W_t$ given as the training input yields a reconstructed sample $\hat{W}_t$; we calculate the anomaly score between $W_t$ and $\hat{W}_t$ and compare it with a threshold to decide whether the window is anomalous. Given a binary variable $y_t \in \{0, 1\}$, $y_t = 1$ indicates that an anomaly occurred in the window at time $t$, and $y_t = 0$ indicates that no anomaly occurred.

3.2. Data Preprocessing

Data preprocessing is essential when building neural network models and can often determine the outcome of model training. First, we need to divide the given time series into a training set and a test set: a continuous data segment that contains no anomalies is used as training data, and the remainder, which contains abnormal data, is used as test data. Then, to improve the robustness of the model, we standardize the training set and the test set. We first standardize the training set and then use its standardization parameters (mean and variance) to standardize the test set. The data standardization formula can be expressed as the following equation:

$x' = \dfrac{x - \mu}{\sigma},$

where $\mu$ and $\sigma^2$ are the mean and variance of the training set, respectively.
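As a concrete illustration of this step, the following is a minimal sketch (the function name and the toy values are our own, not taken from the paper) of fitting the standardization parameters on the training split only and reusing them on the test split:

```python
import numpy as np

def standardize(train, test, eps=1e-8):
    """Z-score standardization: fit the mean and standard deviation on the
    anomaly-free training split only, then apply the same parameters to the
    test split so that no test-set statistics leak into training."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    train_std = (train - mu) / (sigma + eps)
    test_std = (test - mu) / (sigma + eps)   # reuse the training statistics
    return train_std, test_std, mu, sigma

# Toy example with a univariate series split into training and test segments.
train = np.array([85.1, 85.3, 84.9, 85.0, 85.2])
test = np.array([85.4, 96.7, 85.1])
train_std, test_std, mu, sigma = standardize(train, test)
```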

3.3. Training Model
3.3.1. Pretraining Using VAE Model

The VAE model is a typical generative model consisting of two parts: an encoder and a decoder. First, we preprocess the input data $x$ and feed it to the encoder, which encodes the high-dimensional data into a latent representation $z$ that is random and low-dimensional. The output mean $\mu$ and variance $\sigma^2$ generate the corresponding latent variable $z$, whose prior is a unit Gaussian distribution, so we can express the encoder as $q_\phi(z \mid x)$, where the parameter $\phi$ represents the network mapping from $x$ to $z$. The decoder of the VAE model decodes the latent variable $z$ into generated data $\hat{x}$ that is similar to the real data and obeys a normal distribution with mean $\mu'$ and variance $\sigma'^2$, that is, $\hat{x} \sim \mathcal{N}(\mu', \sigma'^2)$. Thus, we can express the decoder as $p_\theta(x \mid z)$, where the parameter $\theta$ represents the network reconstruction from $z$ to $\hat{x}$. Figure 3 shows the structure of the VAE network model.

In order to train the VAE model, we convert the training data into local windows as the model input, extract features through the encoder, compress them into the latent space, and then reconstruct them. Given a time series $X = \{x_1, x_2, \ldots, x_n\}$, where $x_t \in \mathbb{R}$, each data point is the result of a measurement at a particular time. To improve the accuracy of the model, we divide the entire time series into multiple subsequences represented by time windows. We define $W_t$, a time window of length $p$ at a given time $t$: $W_t = \{x_{t-p+1}, \ldots, x_{t-1}, x_t\}$. Because we use $p$ data points to predict the output, a total of $n - p + 1$ windows can be generated for training the VAE model. In this way, the time series $X$ can be represented by the training input window sequence $\mathcal{W} = \{W_p, W_{p+1}, \ldots, W_n\}$. After training, the model outputs the reconstructed window $\hat{W}_t$ produced from $W_t$ by the decoder.
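For illustration, a sliding-window split consistent with the description above (overlapping windows of length $p$, producing $n - p + 1$ training windows) could look like the following sketch; the function name is ours:

```python
import numpy as np

def sliding_windows(x, p):
    """Split a 1-D series x of length n into the n - p + 1 overlapping
    windows W_p, ..., W_n described above, each of length p."""
    n = len(x)
    return np.stack([x[t - p:t] for t in range(p, n + 1)])

x = np.arange(10, dtype=float)       # toy series with n = 10
windows = sliding_windows(x, p=4)    # shape (7, 4): 10 - 4 + 1 windows
```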

The loss function is the most basic and critical element used to measure the quality of a model. The loss function of the VAE model measures the information lost in the reconstruction process and is composed of the sum of the reconstruction error and a regularization term. Our VAE is trained with the loss function shown in the following equation:

$\mathcal{L}_{\mathrm{VAE}}(\theta, \phi; W_t) = -\mathbb{E}_{q_\phi(z \mid W_t)}\left[\log p_\theta(W_t \mid z)\right] + D_{\mathrm{KL}}\left(q_\phi(z \mid W_t) \,\|\, p(z)\right)$

The first term is the reconstruction negative log-likelihood loss, and the second term is the Kullback-Leibler (KL) divergence between $q_\phi(z \mid W_t)$ and the prior $p(z)$. Our goal in training the VAE model is to minimize the sum of this reconstruction loss and the KL divergence, which is equivalent to maximizing the evidence lower bound (ELBO) to find the most suitable parameters $\theta$ and $\phi$ [24]. The objective function is the following equation:

$(\theta^{*}, \phi^{*}) = \underset{\theta, \phi}{\arg\min}\; \mathcal{L}_{\mathrm{VAE}}(\theta, \phi; W_t) = \underset{\theta, \phi}{\arg\max}\; \mathrm{ELBO}(\theta, \phi; W_t)$

Through training, the model parameters are optimized as the loss improves; the network eventually converges, and a good generative model is obtained.
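To make the pretraining stage concrete, below is a minimal Keras sketch of a window-level VAE trained with the negative-ELBO loss described above. The dense-layer topology, variable names, and the use of squared error as a stand-in for the reconstruction log-likelihood are our assumptions; the paper does not specify the exact layer structure.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

WINDOW, HIDDEN, LATENT = 144, 512, 10   # sizes taken from Section 4.1.2

class Sampling(layers.Layer):
    """Reparameterization trick: z = mu + sigma * epsilon."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        eps = tf.random.normal(tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * eps

# Encoder q_phi(z | W_t): window -> (mu, log sigma^2, sampled z)
enc_in = layers.Input(shape=(WINDOW,))
h = layers.Dense(HIDDEN, activation="relu")(enc_in)
z_mean = layers.Dense(LATENT)(h)
z_log_var = layers.Dense(LATENT)(h)
z = Sampling()([z_mean, z_log_var])
encoder = Model(enc_in, [z_mean, z_log_var, z], name="encoder")

# Decoder p_theta(W_t | z): latent vector -> reconstructed window
dec_in = layers.Input(shape=(LATENT,))
h = layers.Dense(HIDDEN, activation="relu")(dec_in)
dec_out = layers.Dense(WINDOW)(h)
decoder = Model(dec_in, dec_out, name="decoder")

class VAE(Model):
    def __init__(self, encoder, decoder, **kwargs):
        super().__init__(**kwargs)
        self.encoder, self.decoder = encoder, decoder

    def train_step(self, data):
        if isinstance(data, tuple):
            data = data[0]
        with tf.GradientTape() as tape:
            z_mean, z_log_var, z = self.encoder(data)
            recon = self.decoder(z)
            # Reconstruction term (squared error stands in for -log p_theta)
            recon_loss = tf.reduce_mean(
                tf.reduce_sum(tf.square(data - recon), axis=1))
            # KL divergence between q_phi(z | W_t) and the unit-Gaussian prior
            kl_loss = -0.5 * tf.reduce_mean(tf.reduce_sum(
                1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1))
            loss = recon_loss + kl_loss          # negative ELBO
        grads = tape.gradient(loss, self.trainable_weights)
        self.optimizer.apply_gradients(zip(grads, self.trainable_weights))
        return {"loss": loss, "recon": recon_loss, "kl": kl_loss}

vae = VAE(encoder, decoder)
vae.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-4))
# vae.fit(train_windows, epochs=50, batch_size=64)   # train_windows: (num_windows, 144)
```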

3.3.2. LSTM Model Based on Multihead Self-Attention

A VAE model alone cannot forecast a time series because it cannot encode or decode data outside the time window. Therefore, we apply the LSTM model to the dimension-reduced output of the VAE model to extract temporal features and perform sequence prediction. We also introduce a multihead self-attention mechanism into the LSTM model to capture relevant information in different subspaces and highlight the importance of different features. The model structure is shown in Figure 4. Below, we describe the model in detail.

(1) LSTM. The LSTM model is a type of recurrent neural network (RNN) that can alleviate the vanishing or exploding gradient problems that RNNs may suffer from and can learn and remember long-term relationships. Therefore, the LSTM network has achieved great success in time series data analysis [25]. The LSTM model is composed of LSTM memory cells. Each memory cell contains three gates with different functions: the input gate, the output gate, and the forget gate. These gates determine whether to accumulate or discard the information in the memory cell and selectively retain the characteristics of the sequence. In this way, the network can determine the predicted output under this gating mechanism. LSTM has therefore become the basic framework for processing sequential data with temporal information.

After pretraining the VAE model, we start to train the LSTM model. To prevent our model from overfitting, we divide the given training data into a sequence of nonoverlapping rolling windows, which can be expressed as $\{W_1, W_2, \ldots, W_k\}$. Then, the window sequence is encoded into a lower dimension by the encoder of the pretrained VAE model, and the output embedding can be expressed as $E = \{e_1, e_2, \ldots, e_k\}$, where $e_i$ represents the embedding of the $i$-th window. We use the encoder's output as the input of the LSTM model and predict the next sequence based on the embedding of each window. Specifically, the LSTM model has $k$ memory units, and each unit has its own set of internal weight parameters, namely $W$ and $b$. Each unit takes two kinds of inputs: the output $h_{t-1}$ and cell state $c_{t-1}$ of the previous unit, and the input $e_t$ of the current unit. The hidden state output of the final unit can then be expressed as the following equation:

$h_t = \mathrm{LSTM}(h_{t-1}, c_{t-1}, e_t)$

We express the hidden states of the embedded sequence after passing through the LSTM units as the following equation:

$H = \{h_1, h_2, \ldots, h_k\}$

(2) Multihead Self-Attention. Multihead self-attention is the core component of the Transformer encoder-decoder model. It optimizes the traditional attention mechanism and greatly improves its performance. When performing feature extraction on a time series, attention can be focused on a window sequence, assigning a weight to each time point of the sequence so as to determine how much it influences the final prediction. An attention function is composed of a query vector $Q$, a key vector $K$, and a value vector $V$. In the common attention mechanism, $K$ and $V$ are set equal to the input, while $Q$ comes from outside. After the weight coefficients are calculated from the vectors $Q$ and $K$, a weighted summation with the vector $V$ yields the attention score. The self-attention mechanism obtains $Q$, $K$, and $V$ by applying linear transformations to the input itself; computing the associations within the data is itself a form of feature extraction. The self-attention mechanism is calculated as in the following equation:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V,$

where $Q$ is the query, $K$ is the key, $V$ is the value, and $d_k$ is the number of hidden units of the neural network. The multihead self-attention mechanism performs several such operations in parallel. Each head generates three vectors $Q_i$, $K_i$, and $V_i$ through linear transformations and then performs the self-attention calculation. One such calculation constitutes a head, and performing the calculation $h$ times gives the so-called multihead attention. Finally, the outputs of all heads are concatenated and projected back to the same dimension as the input sequence. The formulas are expressed as the following equations:

$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}),$

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}.$
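To make the per-head computation concrete, the following NumPy sketch mirrors the equations above; the random projection matrices and toy dimensions are purely illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def multi_head_self_attention(H, W_q, W_k, W_v, W_o):
    """Run each head on its own linear projection of H, concatenate the
    head outputs, and project back to the input dimension with W_o."""
    heads = [self_attention(H @ wq, H @ wk, H @ wv)
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

# Toy example: a sequence of 5 hidden states of size 12, with 6 heads of size 2.
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 12))
n_heads, d_head = 6, 2
W_q = [rng.normal(size=(12, d_head)) for _ in range(n_heads)]
W_k = [rng.normal(size=(12, d_head)) for _ in range(n_heads)]
W_v = [rng.normal(size=(12, d_head)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, 12))
out = multi_head_self_attention(H, W_q, W_k, W_v, W_o)   # shape (5, 12)
```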

We deploy it after the LSTM model. When each head is calculated, the linear transformation parameters for $Q$, $K$, and $V$ are different and need to be learned by the model; we denote them by $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$. The attention layer takes the entire hidden state $H$ as input and multiplies it by these parameters to calculate the self-attention value of each head. The calculation formulas are shown in the following equations:

$Q_i = HW_i^{Q},\quad K_i = HW_i^{K},\quad V_i = HW_i^{V},$

$\mathrm{head}_i = \mathrm{softmax}\left(\dfrac{Q_iK_i^{T}}{\sqrt{d_k}}\right)V_i.$

After $h$ such operations, we concatenate the results of all heads to obtain a feature representation $Z = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)$. Finally, the obtained feature representation vector is sent to the softmax layer for prediction, and the prediction result is as shown in the following equation:

$\hat{e}_{t+1} = \mathrm{softmax}(W_s Z + b_s),$

where $W_s$ and $b_s$ are the weight matrix and bias of the final linear layer. Finally, we train our model by minimizing the error between the original data and the predicted data.
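Putting the pieces together, a Keras sketch of the LSTM predictor with multihead self-attention over its hidden states might look as follows. The sequence length, the pooling step, the use of Keras's built-in MultiHeadAttention layer, and the linear (rather than softmax) output head are our assumptions for a runnable illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

SEQ_LEN, LATENT, HIDDEN, HEADS = 20, 10, 64, 6   # SEQ_LEN (windows per sequence) is assumed

# Input: a sequence of window embeddings e_1, ..., e_k from the VAE encoder.
inp = layers.Input(shape=(SEQ_LEN, LATENT))
h = layers.LSTM(HIDDEN, return_sequences=True)(inp)          # hidden states h_1, ..., h_k
# Multihead self-attention over the LSTM hidden states (Q = K = V = H).
attn = layers.MultiHeadAttention(num_heads=HEADS, key_dim=HIDDEN // HEADS)(h, h)
pooled = layers.GlobalAveragePooling1D()(attn)               # summarize the attended sequence
# Prediction head for the next window embedding; the paper describes a softmax
# layer here, but we keep a linear projection since the target is continuous.
out = layers.Dense(LATENT)(pooled)

predictor = Model(inp, out, name="mhsa_lstm")
predictor.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-4), loss="mse")
# predictor.fit(embedding_sequences, next_embeddings, epochs=50, batch_size=64)
```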

3.4. Anomaly Detection

Our anomaly detection method is divided into three stages: preprocessing, training, and detection. The training and detection stages share the same data preprocessing stage, in which the data are standardized and divided into time windows of length $p$. After training, our model can be used for anomaly detection. First, we input the preprocessed test window sequence $\{W_1, W_2, \ldots, W_t\}$ into the model, where $W_t$ represents the $p$ data points contained in the window at time $t$. Then, we use the pretrained VAE model to reduce the dimensionality of each window by extracting features, encoding it into a low-dimensional space to obtain the embedding sequence $\{e_1, e_2, \ldots, e_t\}$. The encoded representation is used in the prediction stage of the LSTM model. The LSTM model predicts the next embedding $\hat{e}_{t+1}$ by learning $\{e_1, e_2, \ldots, e_t\}$, as shown in the following equation:

$\hat{e}_{t+1} = f_{\mathrm{pred}}(e_1, e_2, \ldots, e_t),$

where $f_{\mathrm{pred}}$ denotes the MHSA-LSTM prediction model.

Finally, we use the decoder of the VAE model to perform feature restoration and reconstruct the predicted embedding $\hat{e}_{t+1}$ into a new window $\hat{W}_{t+1}$, as shown in the following equation:

$\hat{W}_{t+1} = \mathrm{Decoder}(\hat{e}_{t+1}),$

where $\hat{W}_{t+1} \in \mathbb{R}^{p}$.

In the anomaly detection stage, our model produces two results: the predicted value calculated by the prediction model and the reconstructed value obtained from the reconstruction model. We measure the degree of anomaly by calculating the root mean square error (RMSE) between the reconstructed window and the original window as the anomaly score of the window. The higher the anomaly score, the greater the possibility of an anomaly. The RMSE is calculated as in the following equation:

$\mathrm{RMSE}(W_t, \hat{W}_t) = \sqrt{\dfrac{1}{p}\sum_{i=1}^{p}\left(x_i - \hat{x}_i\right)^{2}}$

Among them, $x_i$ is the true value and $\hat{x}_i$ is the reconstructed value. The calculated result aggregates the reconstruction errors over all time steps of the window, and this aggregate error is used as the anomaly score of the entire window. In order to detect anomalies effectively, we also need to set a threshold on the anomaly score. If the anomaly score is higher than this threshold, we regard the window as one in which anomalies may occur.
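A small sketch of the window-level anomaly score described above (the array names are hypothetical):

```python
import numpy as np

def window_anomaly_score(w_true, w_pred):
    """RMSE between an original window and its reconstruction, used as the
    anomaly score of that window."""
    return np.sqrt(np.mean((w_true - w_pred) ** 2))

# test_windows and recon_windows would both have shape (num_windows, p).
# scores = np.array([window_anomaly_score(w, r)
#                    for w, r in zip(test_windows, recon_windows)])
# anomalous = scores > threshold   # flag windows whose score exceeds the threshold
```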

For this kind of binary classification problem, it is essential to choose an appropriate threshold that maximizes the performance of the classifier. One commonly used method of threshold selection is to manually set a fixed threshold: when the reconstructed value is greater than (or less than) the fixed threshold, the value is judged to be abnormal. Some models instead detect anomalies using the 3-sigma method. The standard deviation is a commonly used quantitative measure of data dispersion, and dispersion is a basic and important indicator for evaluating the quality of a method; thus, when a value deviates by more than 3 standard deviations, it can be regarded as an outlier. The advantage of these methods is their simplicity, but a fixed threshold set at deployment time is clearly not sufficient: it is prone to false positives and missed detections and adapts poorly to different scenarios. To avoid these situations and better demonstrate our model, we use a method that maximizes the F1 metric to automatically select the best threshold. The F1-score is the harmonic mean of precision and recall, so both the precision and the recall of the model are considered simultaneously during detection. It can be calculated by the following equation:

$F1 = \dfrac{2 \times P \times R}{P + R}$

In the formula, $P$ represents the precision of the detection model, and $R$ represents its recall. First, we compute the reconstruction error (RMSE) of each window as its anomaly score, and then we compute the F1-score for multiple candidate thresholds using an iterative grid search between the minimum and maximum reconstruction errors. We record the threshold at which the F1-score is highest and use it as the optimal threshold. Any window whose score exceeds this threshold is considered anomalous. Because the number of anomalies in QAR data is low, we mainly focus on continuous anomalies or anomalous segments. If any point in an anomalous segment is correctly detected, all points in that anomaly window are counted as true positives, and the others are considered normal.
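The threshold search can be sketched as follows (a simple grid search using scikit-learn metrics; the step count and array names are our assumptions):

```python
import numpy as np
from sklearn.metrics import f1_score

def select_threshold(scores, labels, n_steps=200):
    """Grid-search candidate thresholds between the smallest and largest window
    anomaly scores and keep the one that maximizes the F1-score.
    `labels` holds the ground-truth window labels (1 = anomalous window)."""
    best_t, best_f1 = None, -1.0
    for t in np.linspace(scores.min(), scores.max(), n_steps):
        pred = (scores > t).astype(int)
        f1 = f1_score(labels, pred, zero_division=0)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# threshold, best_f1 = select_threshold(scores, labels)
# predictions = scores > threshold
```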

4. Experimental Evaluations

In this section, we conduct several comparative experiments to demonstrate the effectiveness of our method from different perspectives. We first introduce the real-world QAR dataset used in the experiments, evaluate the performance of our model on it, and compare it with other state-of-the-art methods (Section 4.1). Second, we analyze how different parameter settings affect the performance of the method (Section 4.2). Then, we analyze the time performance of the model (Section 4.3), and finally, we evaluate our algorithm on several other real time series datasets (Section 4.4).

4.1. The Comparisons with Different Methods
4.1.1. Datasets

Each QAR data file records the whole process of an aircraft from take-off to landing, including 11 stages: TAXI OUT, TAKE OFF, 2 SEGMENT, INI. CLIMB, CLIMB, CRUISE, DESCENT, APPROACH, FINAL, LANDING, and TAXI IN. These stages can be broadly grouped into three processes: climb, cruise, and descent. Figure 5 shows the N1 engine thrust parameter recorded by one of an airline's aircraft during a single flight. It can be clearly seen that the aircraft climbs, stabilizes, and then descends over the course of the flight.

Since the amplitude and speed of the data changes differ among the flight stages of each flight, we adopt a segmentation method. We divide the entire data into three segments, climb, cruise, and descent, according to the flight stage parameter FLIGHT_PHASE, and within each stage a sliding window extracts local features for anomaly detection. Segmentation not only reduces the dimension of the data but also reduces the amount of computation and enhances the adaptability of the algorithm to QAR data. In this experiment, the N1 parameters generated by 100 normal flights of the same real-world aircraft are selected; after segmentation, the corresponding segments are concatenated into one file per stage for training and anomaly detection, and the resulting data exhibit obvious periodicity. There are 82,606 sampled values in the climb stage, 107,758 in the cruise stage, and 69,330 in the descent stage.
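As an illustration of the segmentation step, the following pandas sketch splits one flight's records by the FLIGHT_PHASE parameter; the exact grouping of the 11 phases into the three segments and the column names are our assumptions.

```python
import pandas as pd

# Hypothetical grouping of the 11 flight phases into climb / cruise / descent;
# TAXI OUT and TAXI IN are left out of the three segments in this sketch.
PHASE_GROUPS = {
    "climb":   ["TAKE OFF", "2 SEGMENT", "INI. CLIMB", "CLIMB"],
    "cruise":  ["CRUISE"],
    "descent": ["DESCENT", "APPROACH", "FINAL", "LANDING"],
}

def segment_by_phase(flight_df, value_col="N1"):
    """Split one flight's QAR records into climb / cruise / descent segments
    according to the FLIGHT_PHASE parameter."""
    return {name: flight_df.loc[flight_df["FLIGHT_PHASE"].isin(phases), value_col].to_numpy()
            for name, phases in PHASE_GROUPS.items()}

# segments = segment_by_phase(flight_df)   # flight_df: one flight's decoded QAR data
```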

4.1.2. Experimental Setup

Our proposed method is implemented mainly in the Python programming language. It uses the well-known TensorFlow and Keras deep learning frameworks together with several statistics and visualization packages, including Scikit-learn, Pandas, and NumPy. For the hyperparameters used in model training, we set the hidden size of the LSTM unit to 64 by default; the dimension of the hidden layer in the VAE model is set to 512, and the dimension of the latent variable is set to 10; the number of heads of the multihead self-attention is set to 6; and the batch size is set to 64. In the training details, the learning rate of both the VAE model and the LSTM model is set to 0.0002, and adaptive moment estimation (Adam) is used as the optimizer. The mean square error (MSE) reconstruction loss serves as the loss function of the LSTM model, while the loss function of the VAE consists of the MSE reconstruction loss and the Kullback-Leibler divergence between the learned latent distribution and the prior. The model is trained for 50 epochs. For the other comparison models, we use the same hyperparameters described above. All models that require sliding windows are compared with a default window length of 144. When testing each model, we retain the results with the highest F1-score.
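For reference, the settings listed above can be collected into a single configuration; the values come from this subsection, while grouping them into a dictionary is merely our convention.

```python
# Hyperparameters described in Section 4.1.2.
CONFIG = {
    "lstm_hidden_size": 64,
    "vae_hidden_dim": 512,
    "latent_dim": 10,
    "attention_heads": 6,
    "batch_size": 64,
    "learning_rate": 2e-4,
    "optimizer": "adam",
    "epochs": 50,
    "window_size": 144,
}
```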

4.1.3. Evaluation Metrics

The performance indicators used in the comparison experiments are precision, recall, and F1-score, which are commonly used evaluation metrics in anomaly detection. Equation (15) already gives the definition of the F1-score. Equations (16) and (17) give the definitions of precision and recall:

$\mathrm{Precision} = \dfrac{TP}{TP + FP}, \qquad \mathrm{Recall} = \dfrac{TP}{TP + FN},$

where $TP$ stands for true positives, $FP$ for false positives, and $FN$ for false negatives. When a window is detected and marked as abnormal, $TP$ is the number of correctly detected abnormal points, $FP$ is the number of normal points incorrectly predicted as abnormal, and $FN$ is the number of abnormal points incorrectly predicted as normal. Precision is the ratio of correctly predicted samples to all samples predicted as a particular class and measures the quality of the model's predictions. Recall is the ratio of correctly predicted samples to the total number of instances of the same type; the higher the recall, the easier it is for the model to detect anomalies, so high recall is very important. Precision is generally paired with recall to evaluate model performance, but the two can conflict. Therefore, for a more comprehensive evaluation of anomaly detection, we focus on the F1-score, the harmonic mean of precision and recall.

4.1.4. Results

To demonstrate the overall performance of our proposed method, we compared it with four other unsupervised anomaly detection models: isolation forest (IF) [26], long short-term memory (LSTM) [16], LSTM-AE [27], and LSTM-VAE [28]. For each method, we report the results associated with the highest F1-score. Table 1 details the performance of all methods on the climb, cruise, and descent datasets. The results show that our model significantly outperforms the other methods in precision, recall, and F1-score on all datasets, with precision improving by 0.5–0.7. It can also be observed that our model's performance is well balanced across the different stages of QAR data. Figures 6–8 show the visualization results of our method for anomaly detection on the climb, cruise, and descent datasets. From these figures, we can see that our method correctly finds the time windows in which abnormal events occur, which demonstrates its high recall. The very few false positives plotted in the figures arise because, historically, such spikes have been infrequent, so they were detected as anomalies by our model. More domain knowledge may be needed to resolve this issue in the future.

The methods we compare include the machine learning method IF, the traditional predictive model LSTM, and combinations of LSTM with autoencoders and generative models. It can be observed that the IF method performs the worst. IF builds a collection of iTrees for a given dataset, and instances are passed through all iTrees; the anomaly score is the average of all path lengths. It does not consider temporal information, yet in time series, temporal correlation is essential. The prediction model composed of LSTMs may produce large deviations in the results due to the uncertainty of the predictions, so its precision and recall are relatively low. The autoencoder reconstructs the time series through an encoder-decoder framework; on this basis, LSTM is combined with the autoencoder, and the encoder and decoder of the AE are composed of multiple LSTM units. The main role of the AE is to reduce the dimensionality of the data, form a low-dimensional latent vector, and combine with the LSTM to capture the long-term correlations of the time series. In contrast, as a generative model, the VAE can generate new data different from the training data while constraining the latent space to a standard normal distribution. The experimental results show that the combination of VAE and LSTM performs much better than that with the AE: detection performance improved, but significant performance fluctuations were observed across the different stage datasets. Our method adds a multihead self-attention mechanism on top of this, computing dependencies between distant windows separately through multiple heads, and the weighted representation is used in the reconstruction by the VAE decoder. Therefore, our model captures the long-term dependencies of time series more easily than the other methods. The results also show that our model achieves 100% recall on the cruise and descent datasets, meaning that no abnormal points are wrongly predicted as normal and both short-term and long-term anomalies are detected effectively. Overall, our method shows better performance than the other methods.

4.2. Effect of Parameters

In this section, we investigate the effects of different parameters and factors on the method's performance; all experiments are conducted on the three QAR datasets.

4.2.1. Effect of Different Window Sizes

The first factor is the window size used on each dataset. The window size affects the results of anomaly detection because it influences not only the speed and efficiency of detection but also the detection accuracy directly. Choosing an appropriate window size for each dataset is crucial for modeling the data within the window interval. We set the window size to 20, 48, and 144 for the experiments, with the other parameters unchanged. The results are shown in Figure 9. We observe that on all datasets, increasing the window size yields higher precision, recall, and F1-score. This means that if the window duration is too short, the model may fail to learn that long-term anomalies have occurred. In QAR data, anomalies that occur during flight are more likely to be continuous segments than isolated points. This shows that our model structure can detect abnormal events over longer periods; moreover, the data are relatively stable in the climb and cruise stages, which makes relatively large windows more suitable and improves detection efficiency.

4.2.2. Effect of Latent Variable Dimension

In addition to the window length, we also investigate the relationship between the dimension of the latent variable and detection performance. In a VAE, the dimension of the latent space is a crucial parameter: it determines how much of the important information in the original data can be represented and thus the representational ability of the latent space. The VAE uses a probability distribution over the latent space to sample new data that represent the characteristics of the original data. The embeddings obtained by sampling in different dimensions differ, and so do the reconstructed data. We set the dimension of the latent variable to 5, 10, 15, and 20 to observe its impact on the reconstruction process used for anomaly detection. Figure 10 shows the experimental results. The results show that if the latent variable has a very large dimension, unnecessary redundancy hinders the learning of the model, which may degrade the performance of the VAE on the training data. However, this does not mean that the smaller the latent space, the better: when the dimension is too small, the VAE loses a great deal of information in the encoding stage and cannot decode properly, the model cannot fully capture the temporal dependencies, and performance suffers. It can be seen from the figure that the F1-score is relatively stable when the dimension is moderate, which confirms the above discussion. A suitable latent space size makes the model more robust in anomaly detection.

4.2.3. Effect of Head Number in MHSA Mechanism

In order to explore the effect of the number of heads on the model performance in the multihead self-attention mechanism, we set different head numbers of 2, 4, 6, and 8 for experiments. The experimental results are shown in Table 2. The results showed that in the climb and descent stages, the F1-score was the highest when head = 6. In the cruise phase, the F1-score is the highest when head = 4. Overall, the performance of the model fluctuates. As the number of heads increases, each head captures different aspects of information, and the model can capture more temporal information. The model performs the worst when there are only 2 heads, but an excessive number of heads makes the information captured between each self-attention head redundant, which weakens the model’s ability to extract effective correlations. Combining the experimental results and efficiency, we set the number of heads to 6 in our implementation.

4.3. Analysis of Training Time

In this subsection, we record the per-epoch running time on each stage dataset and compare our method with several other deep learning hybrid models. All methods are compared on the same system. Table 3 shows the results. Our model is less time-consuming than the other models because we added a multihead self-attention mechanism to the LSTM: the parallel operation of multiple self-attention heads not only extracts hidden features at a deeper level but also reduces the dimension and the amount of computation. Therefore, we not only achieve good anomaly detection performance but also reduce training time and improve operating efficiency.

4.4. The Comparisons of Using Different Datasets

In this subsection, to verify the feasibility of our method, we conduct experiments on several public benchmark datasets: the KPI and NAB datasets, which are often used in time series anomaly detection experiments and in which normal and abnormal points are already labeled. The KPI dataset comes from the AIOps Challenge held by Tsinghua University in 2018 [29]; many Internet companies monitor various performance indicators, such as CPU usage and server health, to ensure the stability of their web services. We randomly selected two time series from the KPI dataset for the experiments. The NAB dataset, provided by Numenta, contains a variety of streaming data from real-time applications and consists of multiple labeled real-world and artificial time series files. We selected the CPU usage of Amazon Web Services (AWS) servers and AWS EC2 servers collected by the Amazon CloudWatch service as our datasets. Table 4 lists the size, mean, standard deviation, and anomaly ratio of the four datasets; it can be seen that these four datasets differ significantly. We divided each dataset into a training set and a test set. Because our model must be trained on normal data, we removed the anomalies from the training data, while the outliers in the test set were retained for testing.

Table 5 shows the experimental results. It can be clearly seen that our method outperforms the other methods on these four public datasets. The performance of our model varies across the datasets: the F1-score is above 0.9 on most of them, and most achieve a 100% recall rate, which indicates that the number of false negatives is low. The KPI and NAB datasets are diverse; some series are periodic, while others are unstable and fluctuating. This shows that our method performs well, can detect anomalies in different types of data, and has good generalization ability.

5. Conclusion

In this paper, we propose VAE-based MHSA-LSTM, an unsupervised deep learning method for anomaly detection in time series. The method can be divided into two stages. The first is the model training stage: the variational autoencoder is pretrained to learn the features of normal data, forming stable local features within each window. The second is the anomaly detection stage, which uses the temporal representation learning ability of the LSTM model and the feature extraction ability of the self-attention mechanism to identify anomalies based on the anomaly scores computed from the reconstructed sample windows. The VAE-based MHSA-LSTM combines an encoder-decoder, a generative model, and a multihead self-attention mechanism, so it can detect all types of anomalies more comprehensively, quickly, and accurately. In the experimental part, we apply VAE-based MHSA-LSTM to a QAR dataset generated by real-world flights. Compared with several classical reconstruction-based time series anomaly detection methods, the results show that our method is more effective. In addition, we also applied our method to other public datasets and obtained stable results.

Although our method achieves good performance and can accurately detect anomalies, some limitations remain. Our model needs to be trained before anomaly detection, and the training set must contain no abnormal data, which creates some difficulties in data collection and processing. In the future, we will explore further developments based on the ideas presented in this article.

Data Availability

The data used in this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest relevant to the content of this article.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Nos. 61402329 and 61972456) and the Natural Science Foundation of Tianjin (Nos. 19JCYBJC15400 and 21YDTPJC00440).