Abstract

Accurately detecting and identifying abnormal behaviors on the Internet is a challenging task. In this work, an anomaly detection scheme is proposed that employs a behavior attribute matrix and an adjacency matrix to characterize user behavior patterns and then detects anomalies by analyzing the residual matrix. By analyzing network traffic and anomaly characteristics, we construct the behavior attribute matrix, which incorporates seven features that characterize user behavior patterns. To include the effects of the network environment, we employ the similarity between IP addresses to form the adjacency matrix. Further, we employ CUR matrix decomposition to mine the changing trends of the matrices and obtain the residual pattern characteristics that are used to detect anomalies. To validate the effectiveness and accuracy of the proposed scheme, two datasets are used: (1) the public MAWI dataset, collected from the WIDE backbone network, which is used to validate accuracy; (2) a campus network dataset, collected from the northwest center of the China Education and Research Network (CERNET), which is used to verify practicability. The experimental results demonstrate that the proposed scheme can not only accurately detect and identify abnormal behaviors but also trace the source of anomalies.

1. Introduction

User behavior profiling, along with anomaly detection in network traffic, plays an important role in network management and helps keep the network under control. Normal user behavior patterns remain stable and routine for long periods, whereas abnormal behaviors cause unexpected changes to the normal patterns. Therefore, behavior pattern changes can be used for anomaly detection. Features extracted from raw traffic packets, such as the total number of packets or flows in a specific time window, are usually used to capture the dynamic changes in behavior patterns, and machine learning methods are then applied to mine abnormal changes [1–5]. These methods are effective for detecting the obvious changes caused by abnormal behaviors. However, attack techniques are becoming increasingly intelligent, and anomalies may cause only slight changes in traffic patterns. Meanwhile, the traffic volume continues to increase, so characterizing user behavior and accurately identifying anomalies in massive network traffic remain challenging tasks for network security monitoring.

Accurate behavior pattern characterization is the foundation of anomaly detection. Many techniques have been proposed in the past decade, such as statistical analysis [6–8], data mining [9, 10], and machine learning [11, 12]. Nearly all of these methods extract attributes only from network traffic, such as the number of packets in a specific time window, without considering the effect of the network environment; however, we find that the network environment is another important factor for smart attack detection. As we know, the botnet is one of the well-known smart attacks that has appeared in recent years, and one method of detecting botnets is to analyze the “co-occurrence” behavior patterns of the hosts in one subnet [13–16]: some hosts in one subnet have similar access patterns, such as always accessing the same URL at the same time, and those hosts may be infected by bots. In this study, we employ the adjacency matrix to capture this kind of pattern.

On the other hand, identifying anomalies in massive traffic patterns is a typical needle in a haystack (NIHA) problem. Matrix decomposition is an effective tool for anomaly detection [17, 18]: it can divide the massive patterns into two parts, one corresponding to the major pattern in the original data and the other to the abnormal changes, which makes it suitable for anomaly detection in today's networks. Thus, we can use a matrix decomposition technique to distinguish normal from abnormal traffic.

To this end, we propose an anomaly detection scheme in this study, which jointly employs the attribute matrix and the adjacency matrix to characterize user behavior patterns and network environment characteristics and uses matrix decomposition to identify anomalies. We extract seven statistical features from network traffic to construct an attribute matrix that captures user behavior characteristics in the time window T. We also employ the similarity of specific IP addresses to construct an adjacency matrix that captures the behavior characteristics related to the network environment. Jointly, the attribute matrix and the adjacency matrix form a model that describes user behavior characteristics accurately. Then, we employ CUR matrix decomposition [19] to mine the major behavior pattern from the joint model and obtain a residual matrix, which can be used to identify anomalies.

We employ two kinds of traffic datasets to verify the effectiveness and accuracy of our method. The first is the public MAWI dataset [20], which is collected from the WIDE backbone network, a trans-Pacific transit link between Japan and the USA. This dataset is labeled and used to evaluate the performance of our proposed method. The second dataset is collected from the northwest center of CERNET. The users in the monitoring network include students, faculty members, and contract personnel from service-providing companies. The behavior patterns contained are sufficiently complex and can be employed to measure the practicability of our method. The experimental results based on the two datasets show that the proposed method achieves an anomaly detection accuracy rate higher than 90% without any prior knowledge. Furthermore, our method can also trace anomalies for efficient network management.

Our contributions in this study can be summarized as follows:

(1) We Propose an Anomaly Detection Scheme Including Both the Features Extracted from Traffic Volumes and the Network Environment: We employ seven traffic features to capture the dynamic changes in the traffic volumes and employ similarity between IP addresses to construct an adjacency matrix to characterize the user behavior related to the network environment.

(2) We Formulate the Anomaly Detection Problem as a Matrix Decomposition Problem: We formulate the anomaly detection problem as a matrix decomposition problem and employ CUR matrix decomposition to perform anomaly detection. The experimental results verify that the developed method is a simple and effective means to monitor the security of an enterprise network.

(3) We Verify the Effectiveness and Accuracy of Our Method on Two Different Datasets: The first is the public MAWI dataset, and the second is the dataset collected from the northwest center of CERNET. The MAWI dataset is used to validate the accuracy of the proposed scheme, while the CERNET dataset is employed to verify its practicability.

The remainder of this study is organized as follows. Section 2 presents the related work; Section 3 outlines the motivations and design goals, after which the feature definitions and framework description are presented in Section 4. In Section 5, we put forward a detailed description of the anomaly detection model. The experimental results and analysis are presented in Sections 6 and 7, after which the conclusion follows in Section 8.

2. Related Work

The goal of anomaly detection is to find the rare occurrences that do not conform to the patterns of the majority in a dataset [21, 22]; it has been widely applied in many fields, including security, finance, health care, and social networks [23–27]. A variety of techniques have been proposed for identifying anomalies, which can be grouped into two categories: supervised and unsupervised anomaly detection methods [28]. The works most related to ours are summarized as follows.

Supervised anomaly detection techniques usually require a labeled dataset for model training. A support vector machine (SVM) can classify samples as normal or anomalous by maximizing the classification margin. Kong et al. [29] designed an abnormal traffic identification system (ATIS) based on SVM. Gu et al. [30] proposed an intrusion detection framework based on an SVM ensemble classifier with increasing feature selection. Naive Bayes is another simple and effective tool for detecting anomalies, and many algorithms have been proposed based on Bayes’ theorem. Swarnkar et al. [31] proposed a naive Bayesian classifier based on packet payload analysis to detect HTTP attacks. Han et al. [32] developed a naive Bayesian model for network intrusion detection based on principal component analysis (PCA). Nie et al. [33] designed a Bayesian network to model the causal relationships between network entries. Neural networks (NNs) are also widely used for anomaly detection, as they can increase the accuracy of anomaly detection systems. Hodo et al. [34] employed packet traces to train an artificial neural network to detect DDoS attacks. Kwon et al. [35] used a convolutional neural network (CNN) to detect anomalies, which can select traffic features automatically from the raw dataset. A recurrent neural network (RNN) was proposed in [36] to learn temporal behaviors in large-scale network traffic data. These methods are effective at identifying anomalies given accurately labeled datasets; however, high-quality labeled datasets are very difficult to construct in today's networks.

Unsupervised anomaly detection techniques have been widely used recently, as they do not require a labeled dataset for model training. K-means is one basic approach to unsupervised anomaly detection [37]. The authors in [38] used K-means to cluster network connections into normal and anomalous communities. However, it is difficult to select a suitable k, as it depends on the application and environment. Recently, Chen et al. [39] proposed a convolutional autoencoder (CAE)-based anomaly detection model. Said Elsayed et al. [40] proposed a hybrid approach based on a long short-term memory (LSTM) autoencoder and a one-class support vector machine (OC-SVM) to detect anomalies. Although these methods achieve high accuracy, it is difficult to obtain a clear explanation of their results; furthermore, it is hard to trace the anomalies and apply control policies. Principal component analysis (PCA) is another unsupervised method used for anomaly detection, which can capture the normal and abnormal behaviors of the data by projecting the data instances onto the principal components [41]. Wang and Battiti [42] proposed an intrusion detection method combining PCA with SVD, which identifies intrusions based on the error between the original data vector and its reconstruction. However, it is not efficient for interpretation, as the principal components are a linear combination of all original variables [17]. To interpret these results, the work in [43] introduced a new method, named sparse principal component analysis (SPCA), to produce modified principal components with sparse loadings. Although this method improves interpretability, a linear relationship still exists between the principal components and the original variables, whereas in practice the variables usually do not hold a linear relationship. Sample-based matrix decomposition methods were proposed to deal with these problems; they select rows or columns from the original matrix to form the low-rank matrices.
Kumar et al. [44] proposed CUR matrix decomposition to make the decomposition process interpretable. However, the decomposition process occupies a large amount of memory. Sun et al. [45] proposed a new method named compact matrix decomposition (CMD), which avoids repeated selection and, in turn, reduces the computational complexity. However, this method needs to seek a non-orthogonal basis by sampling the columns and/or rows of the original matrix, which produces over-complete bases. Tong et al. [46] proposed the Colibri method to deal with these challenges. This method can iteratively find a nonredundant basis and accordingly save space and time. However, it fails to improve accuracy compared with the CUR and CMD matrix decompositions.

Inspired by these related works, we propose an anomaly detection method based on matrix decomposition. By combining the advantages of CUR with network characteristics, the developed method can not only detect known and unknown anomalies but also trace the source of the anomalies.

3. Basic Assumption and Design Goals

3.1. Basic Assumption and Its Verification

To capture the characteristics of the network environment, we assume that users who hold IP addresses with the same prefix have similar behavior patterns, and we verify this assumption from the following three aspects.

Firstly, we analyze the general principle of IP address assignment. Generally, IP addresses have no relationship with user behavior patterns; however, for convenient management, network administrators usually assign IP addresses with the same prefix to users in one specific area. The IP address assignment process can be summarized as follows: (1) the Internet Assigned Numbers Authority (IANA) assigns IP address pools to the five regional Internet registry (RIR) organizations in the world. (2) The regional organizations assign the IP addresses to different Internet service providers (ISPs). (3) The ISPs assign IP address blocks in different countries. (4) Network administrators assign IP address blocks to different areas when they construct their local area networks (LANs). Based on the above analysis, we can find that IP addresses with the same prefix are often assigned to the same area. Furthermore, users in the same area often have similar behavior patterns, as they have similar network requirements. Thus, we can conclude that users who hold IP addresses with the same prefix may hold similar behavior patterns.

Secondly, several researchers working on traffic pattern profiling have also found that users who hold IP addresses with the same prefix have similar behavior patterns. Jiang found that traffic behavior with the same prefix often keeps stable over time, which can be used for anomaly detection [47]. Xu found that hosts with the same network prefixes have similar behavior across different Internet applications [48, 49]. Jiang found that the behavior similarity captured by aggregated flows with the same network prefixes can be used to construct an abnormal identification mechanism [50]. These works further verify the assumption.

Thirdly, we analyze the behavior patterns of the IP addresses with the same p-bit prefixes in the MAWI and CERNET datasets. We randomly select three IP blocks, and the results are shown in Figure 1, where (a)–(c) are the results for the MAWI dataset and (d)–(f) are those for the CERNET dataset; (a) and (d) are the results for p = 8, (b) and (e) for p = 16, and (c) and (f) for p = 24. From the figure, we can find that users who hold IP addresses with the same prefix have similar behavior patterns, especially in the CERNET dataset.

3.2. Design Goals

Based on the above analysis results, we mainly focus on developing a new anomaly detection method that is effective at mining the anomalies in today's networks. The design goals are listed as follows:

(1) Improve the Management Efficiency: To control anomalies, tracing them is important and necessary, so the IP address should be retained during the detection process. In our method, we regard each specific IP address as a column index of the attribute matrix, which makes tracing abnormal IP addresses easy.

(2) Improve the Detection Accuracy: To develop an accurate anomaly detection model that takes the network environment into consideration, we construct an adjacency matrix to describe the network environment. The adjacency matrix is made up of IP address similarity degrees, which are calculated from the binary similarity of the IP addresses.

(3) Improve the Practicality: The designed method should be deployable on most enterprise networks without new hardware components, and the features used should be easy to extract. Furthermore, the method should be sensitive to special abnormal behaviors that cause only slight changes, so as to detect new anomalies.

4. Feature Definition and Framework Description

4.1. Volume Feature Definition

Firstly, to remain applicable to traffic that adopts encryption technology, more and more anomaly detection methods extract features from the packet headers. The information contained in the packet headers is shown in Table 1.

Secondly, we analyze the characteristics of typical attacks, and the results are shown in Table 2. From the table, we can find that different attacks may lead to obvious changes in the statistics of the attributes in the packet headers; in turn, the extracted features will change, and those changes can be used to detect anomalies.

Finally, based on the analysis above, we can find that different network attack behaviors lead to changes in different attributes. Their definitions are shown in Table 3. We employ the port scan attack as an example to verify the efficiency of the extracted features. Port scanning is usually employed by hackers to find vulnerable hosts and ports, and one simple example of a port scan is shown in Figure 2. The attacking host sends connection requests to the ports of one destination host, and if a destination port is open and the host provides the corresponding service, the hacker receives a response. This behavior pattern is typical and representative of smart attacks, and it can be easily captured by the OD, NDDA, and NDDP features defined above.
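As an illustration of how such per-window statistics expose a scan, the sketch below counts distinct destination addresses and destination ports per source IP in one window. The names NDDA and NDDP follow our reading of Table 3 ("number of distinct destination addresses/ports") and should be treated as assumptions rather than the paper's exact feature definitions.

```python
from collections import defaultdict

def window_features(flows):
    """Per-source counts over one time window.

    `flows` is a list of (src_ip, dst_ip, dst_port) tuples observed in the
    window; returns {src: {"NDDA": ..., "NDDP": ...}} where the feature
    names are our illustrative reading of Table 3.
    """
    dst_addrs = defaultdict(set)
    dst_ports = defaultdict(set)
    for src, dst, dport in flows:
        dst_addrs[src].add(dst)
        dst_ports[src].add(dport)
    return {src: {"NDDA": len(dst_addrs[src]), "NDDP": len(dst_ports[src])}
            for src in dst_addrs}
```

For the port scan pattern of Figure 2, a scanning source shows one destination address (low NDDA) paired with many destination ports (high NDDP), which is exactly the signature described above.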

4.2. Network Environment Feature Definition

We employ the adjacency matrix, composed of the similarities between the IP addresses of the monitoring network, to capture the characteristics of the network environment. An example of the similarity calculation between two IP addresses is presented in Table 4. We first apply XOR to the binary forms of the two given IP addresses. In the XOR result, we select the first 1 from left to right as a label, set the label and all bits behind it to 0, and set the other bits to 1, which yields a mask m. The distance d is defined as the number of 1s in m, i.e., the length of the common prefix of the two addresses. The similarity s is defined as the distance d divided by 32; the maximum value of the similarity is 1, which means that the two IP addresses are identical, while the minimum value is 0, which denotes that the two IP addresses are completely different. If the similarity is larger than a specific threshold, the value for the two IP addresses in the adjacency matrix is set to 1, and the construction method is shown as follows:

A(i, j) = 1 if s(i, j) > thresh; A(i, j) = 0 otherwise,

where i and j represent any two IP addresses and thresh denotes the predefined threshold.
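A minimal sketch of this similarity and adjacency construction, assuming the XOR-and-mask procedure above reduces to (length of the common leading bits)/32; the threshold value is a placeholder:

```python
import ipaddress

def ip_similarity(ip1, ip2):
    """Prefix similarity of two IPv4 addresses: common leading bits / 32.

    A sketch of the XOR-based measure described in the text (assumption:
    the mask construction amounts to counting the matching prefix bits).
    """
    a = int(ipaddress.IPv4Address(ip1))
    b = int(ipaddress.IPv4Address(ip2))
    x = a ^ b
    if x == 0:
        return 1.0                      # identical addresses
    common = 32 - x.bit_length()        # leading bits that agree
    return common / 32

def adjacency(ips, thresh=0.9):
    """0/1 adjacency matrix: 1 when two distinct IPs exceed the threshold."""
    n = len(ips)
    return [[1 if i != j and ip_similarity(ips[i], ips[j]) > thresh else 0
             for j in range(n)] for i in range(n)]
```

For example, 10.0.0.1 and 10.0.0.2 share their first 30 bits, giving s = 30/32 ≈ 0.94, so with thresh = 0.9 they become adjacent, while hosts from unrelated prefixes do not.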

We employ a campus network to illustrate the features extracted to characterize the network environment. One simple topology of a typical campus network is shown in Figure 3. The users in the same area often have similar behavior characteristics as they have similar roles, such as the users in the office area often engage in teaching-related activities as they are teachers, while the users in the dormitory area often engage in entertainment-related activities after classes as they are students.

4.3. Anomaly Detection Framework Developed

The detailed description of the scheme is illustrated in Figure 4, which consists of four steps. The main symbols used in the study are summarized in Table 5.

Step 1. Traffic Matrix Construction. We employ IP addresses and their features to design an attribute matrix that describes the traffic patterns in time window T. Let {ip_1, ip_2, …, ip_n} denote a set of n IP addresses, and let x_i ∈ R^d denote the d-dimensional feature vector of ip_i. At the same time, we also construct an adjacency matrix using IP address similarity as a constraint to capture the pattern relationship between IP addresses. We combine the attribute matrix and the adjacency matrix to design an anomaly detection model.

Step 2. Matrix Decomposition. We employ CUR matrix decomposition to select several representative features and IP addresses from the attribute matrix X to build a reconstruction matrix X̂, attempting to keep the reconstruction X̂ as similar as possible to the attribute matrix X.

Step 3. Residual Calculation. The difference between the attribute matrix X and the reconstruction matrix X̂ is referred to as the residual matrix, which is defined as E = X − X̂.

Step 4. Anomaly Detection. The residual matrix presents the pattern changes in each feature of the IP addresses. We calculate the sum of each column of the residual matrix, which represents the total pattern changes in each IP address, and then, they are ranked in descending order. IP addresses with larger values are regarded as anomalies.
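The residual-and-ranking part of the pipeline (Steps 3 and 4) can be sketched as follows; X and X_hat stand in for the attribute matrix and the reconstruction produced in Step 2:

```python
import numpy as np

def rank_anomalies(X, X_hat):
    """Steps 3-4 sketch: residual matrix and per-IP anomaly ranking.

    X is the d x n attribute matrix (columns = IP addresses) and X_hat its
    reconstruction; both are placeholders for the outputs of Steps 1-2.
    """
    E = X - X_hat                       # Step 3: residual matrix
    scores = np.abs(E).sum(axis=0)      # Step 4: total pattern change per IP (column sum)
    order = np.argsort(scores)[::-1]    # IP indices ranked in descending order
    return scores, order
```

IP addresses at the front of `order` (largest residual mass) are the ones regarded as anomalous.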

5. Anomaly Detection Model

5.1. Model for Attribute Matrix Decomposition

The seven defined traffic features form an attribute matrix X ∈ R^{7×n}, as the following formula shows:

X = [x_1, x_2, …, x_n],

where the columns correspond to the specific IP addresses, the rows correspond to the seven features, and element X_{ij} is the statistical value of feature i for IP address ip_j in time window T.

We employ CUR matrix decomposition to obtain the normal user behavior patterns from the original attribute matrix X; it selects several rows and columns from X to build a reconstruction matrix, and the reconstructed matrix indicates the major pattern of the original attribute matrix. The process can be formulated as follows:

min_{W,E} ‖X − XWX − E‖_F^2 + α‖W‖_{2,1} + β‖W^T‖_{2,1} + γ‖E‖_{2,1},

where W = S_c U S_r ∈ R^{n×7}, so that C = XS_c, U, and R = S_rX are the three low-rank matrices of the CUR decomposition; E indicates the residual matrix; α and β control the row and column sparsity of the matrix W; and γ is used to control that of the matrix E. If S_c is a diagonal matrix in which k diagonal elements are 1 and the other elements are 0, XS_c keeps k columns of X unchanged and sets the other n − k columns to zero vectors. Similarly, WX can be regarded as a coefficient matrix, and ip_i is chosen as a representative source IP address when the i-th row of W is not a zero vector. Apparently, the ℓ_{2,1} regularization on W ensures that only a few source IP addresses are chosen.
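To make the C, U, R roles concrete, the sketch below builds a plain CUR approximation by uniformly sampling columns and rows and fitting the linking matrix by least squares. This is only the textbook form of CUR; the paper instead learns the selection through the sparsity-regularized model above, so the sampling here is an illustrative simplification.

```python
import numpy as np

def simple_cur(X, c, r, seed=0):
    """Plain CUR sketch: sample c columns and r rows of X uniformly and
    fit the linking matrix U by least squares (via pseudoinverses)."""
    rng = np.random.default_rng(seed)
    cols = rng.choice(X.shape[1], size=c, replace=False)
    rows = rng.choice(X.shape[0], size=r, replace=False)
    C = X[:, cols]                                   # selected columns (IP addresses)
    R = X[rows, :]                                   # selected rows (features)
    U = np.linalg.pinv(C) @ X @ np.linalg.pinv(R)    # best linking matrix for C U R ~ X
    return C, U, R
```

When X is close to low rank, the residual X − C U R is small, and its large columns point at the IP addresses whose behavior the major pattern cannot explain.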

5.2. Model for Adjacency Matrix Decomposition

We use the adjacency matrix to capture the characteristics of the network environment, and one simple example is shown as follows:

A = (a_{ij}) ∈ R^{n×n}, a_{ij} ∈ {0, 1},

where the rows and columns correspond to the specific IP addresses. If two users hold IP addresses with high similarity, they should have similar behavior patterns in the residual matrix E. The matrix decomposition model constructed between the residual matrix and the adjacency matrix is shown as follows:

min_E tr(E L E^T),

where A is the adjacency matrix and L is its Laplacian matrix.

5.3. Anomaly Detection Model Construction
5.3.1. Anomaly Detection Model

Based on the attribute matrix and adjacency matrix constructed, we develop an anomaly detection model, which is shown as follows:

min_{W,E} ‖X − XWX − E‖_F^2 + α‖W‖_{2,1} + β‖W^T‖_{2,1} + γ‖E‖_{2,1} + λ tr(E L E^T),  (6)

where λ is a parameter that indicates the importance of the adjacency matrix. Formula (6) may be non-convex if W and E change simultaneously, but we can get the optimal value of each by fixing the other. If we fix W, we can rewrite formula (6) to (7) by setting the derivative with respect to E to zero:

E = (X − XWX)(I + γG + λL)^{−1},  (7)

where I is the identity matrix and G is a diagonal matrix with the elements G_{ii} = 1/(2‖e_i‖_2), e_i being the i-th column of E. Similarly, if we fix E and set the derivative with respect to W to zero, we can obtain the following equation:

X^T X W X X^T + αD_1 W + βW D_2 = X^T (X − E) X^T,  (8)

where D_1 and D_2 are diagonal matrices with (D_1)_{ii} = 1/(2‖w^i‖_2) and (D_2)_{jj} = 1/(2‖w_j‖_2), respectively, w^i and w_j denoting the i-th row and the j-th column of W. We can use the two lemmas below to calculate W.

Lemma 1. For any matrix A ∈ R^{m×n}, if the matrix is right multiplied by a diagonal matrix B ∈ R^{n×n}, the product AB can be rewritten as AB = A ∘ (1_m b^T), where b = (B_{11}, …, B_{nn})^T, i.e., (AB)_{ij} = A_{ij}B_{jj}, i = 1, 2, …, m, j = 1, 2, …, n, and ∘ denotes the Hadamard product.

Lemma 2. For any formula such as (9) about a matrix H:

M H N + aΛ_1 H + bH Λ_2 = G,  (9)

where M and N are symmetric and positive semidefinite matrices, Λ_1 and Λ_2 are two diagonal matrices, and a and b are two parameters, formula (9) can be rewritten as follows:

Σ_1 H̃ Σ_2 + aΛ_1 H̃ + bH̃ Λ_2 = G̃,  (10)

where H̃ = C^T H R, G̃ = C^T G R, M = CΣ_1C^T, N = RΣ_2R^T, C and R are two orthogonal matrices, and Σ_1 and Σ_2 are two diagonal matrices composed of eigenvalues.

According to the two lemmas, we pre-multiply formula (8) by D_1^{−1/2} and post-multiply it by D_2^{−1/2}. Equation (8) can be reformulated as follows:

M̃ Ŵ Ñ + αŴ D_2^{−1} + βD_1^{−1} Ŵ = G̃,  (11)

where Ŵ = D_1^{1/2} W D_2^{1/2}, M̃ = D_1^{−1/2} X^T X D_1^{−1/2}, Ñ = D_2^{−1/2} X X^T D_2^{−1/2}, G̃ = D_1^{−1/2} X^T (X − E) X^T D_2^{−1/2}, and C and R are orthogonal matrices composed of the eigenvectors of M̃ and Ñ, respectively. Σ_1 and Σ_2 are two diagonal matrices composed of the eigenvalues of M̃ and Ñ, respectively. Then, we define U = C^T Ŵ R, Ĝ = C^T G̃ R, and Λ_1 = C^T D_1^{−1} C, Λ_2 = R^T D_2^{−1} R. According to Lemma 1, formula (11) can be expressed as follows:

Σ_1 U Σ_2 + αU Λ_2 + βΛ_1 U = Ĝ.  (12)

For each column vector, it can be denoted as follows:

τ_j Σ_1 u_j + α(Λ_2)_{jj} u_j + βΛ_1 u_j = ĝ_j,  (13)

u_j = (τ_j Σ_1 + α(Λ_2)_{jj} I + βΛ_1)^{−1} ĝ_j,  (14)

where τ_j = (Σ_2)_{jj}, and u_j and ĝ_j are the j-th columns of U and Ĝ, respectively. We can get the matrix U by calculating each column vector u_j, and the matrix W can then be obtained according to the formula W = D_1^{−1/2} C U R^T D_2^{−1/2}.

5.3.2. Running Flowcharts

The detailed running process of the developed model is illustrated in Algorithm 1. Firstly, the attribute matrix X and the adjacency matrix A are constructed and taken as input. After setting the other parameters, the algorithm iteratively selects some representative IP addresses and features by means of the variable matrix W to reconstruct the attribute matrix X. Lines 1 to 4 build the Laplacian matrix L and initialize the identity matrices D_1, D_2, and G, the residual matrix E, the orthogonal matrices C and R, and the diagonal matrices Σ_1 and Σ_2. As shown in lines 5 to 11, the optimal residual matrix E can be obtained, when the objective function in equation (6) converges, by iteratively updating W and E. Line 12 calculates the anomaly score for each IP address by means of its norm in the residual matrix E. Lines 13 to 17 judge whether the IP addresses are anomalies or not.

Input: attribute matrix X, adjacency matrix A, parameters α, β, γ, λ, δ
Output: top k source IP addresses satisfying the condition score_i > δ.
(1)Build Laplacian matrix L from the adjacency matrix A;
(2)Initialize D_1, D_2, and G to be the identity matrix;
(3)Initialize the residual matrix E;
(4)Build orthogonal matrices C, R by the eigenvectors of M̃ and Ñ, and diagonal matrices Σ_1, Σ_2 through the eigenvalues of M̃ and Ñ, respectively;
(5)while the objective function in equation (6) does not converge do
(6)Update U by equation (14);
(7)Update W by setting W = D_1^{−1/2} C U R^T D_2^{−1/2};
(8)Update D_1 and D_2 by setting (D_1)_{ii} = 1/(2‖w^i‖_2) and (D_2)_{jj} = 1/(2‖w_j‖_2);
(9)Update E by equation (7);
(10)Update G by setting G_{ii} = 1/(2‖e_i‖_2);
(11)end
(12)Compute the anomaly score for the source IP address ip_i as score_i = ‖e_i‖_2;
(13)for i = 1 to n do
(14)if the score of the source IP address ip_i > δ then
(15)Output the source IP address ip_i;
(16)end
(17)end
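The E-update on line 9 of the algorithm can be sketched in isolation. The closed form below follows our reconstruction of equation (7), E = (X − XWX)(I + γG + λL)^{−1}, with the IRLS diagonal G built from the previous residual's column norms; the exact formula and the small eps guard are assumptions, not verbatim from the paper.

```python
import numpy as np

def update_residual(X, W, L, gamma, lam, E_prev, eps=1e-8):
    """One residual update (algorithm line 9) under the reconstructed
    equation (7). X: d x n attribute matrix, W: n x d coefficient matrix,
    L: n x n Laplacian, E_prev: previous residual for the IRLS weights."""
    n = X.shape[1]
    col_norms = np.linalg.norm(E_prev, axis=0)
    G = np.diag(1.0 / (2.0 * col_norms + eps))       # IRLS diagonal from column norms
    M = X - X @ W @ X                                # fidelity part of the model
    return M @ np.linalg.inv(np.eye(n) + gamma * G + lam * L)
```

With γ = λ = 0 the update degenerates to the raw residual X − XWX, which is a convenient sanity check on the formula.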

6. Performance Evaluation with Public Data

6.1. Dataset and Evaluation Metrics
6.1.1. MAWI Dataset

It is important to find a public dataset with reliable ground truth to evaluate anomaly detection methods. However, some existing datasets are outdated due to the lack of new attack trends or traffic patterns, such as the 1998/99 DARPA dataset [51] and the KDD CUP 99 dataset [52]. Other datasets, such as DDoS 2016 [53], were created in a simulated network environment, and the LBNL dataset [54] is publicly available but does not provide attack labels. Although the UNB ISCX dataset [55] provides attack labels, it is not always publicly available. Due to the drawbacks of these datasets, we use the MAWI dataset to validate the performance of our proposed approach. The dataset is collected from the WIDE backbone network, a trans-Pacific transit link between Japan and the USA; it is updated daily to include new patterns of new applications or anomalies, and the payloads of the packets are removed due to privacy protection issues. Furthermore, a graph-based methodology that combines different anomaly detectors is used to label the dataset [20]. Firstly, it uses a similarity estimator to uncover the relations between the outputs of different anomaly detectors, then constructs an undirected graph from the anomalies detected by the different detectors, and mines different communities from the undirected graph. Secondly, it employs confidence scores and combination strategies to decide whether one specific community corresponds to an anomaly or not.

We employ the dataset collected from 2019-05-05 14:00:00 to 2019-05-05 14:15:00 to evaluate the proposed method. The statistical information about the dataset is presented in Table 6, where #anomalies represents the number of anomalies, #flows is the total number of flows, #diffsrcIP denotes the number of different source IP addresses, #diffsrcPort is the number of different source ports, #diffdstIP denotes the number of different destination IP addresses, and #diffdstPort is the number of different destination ports. GRE denotes generic routing encapsulation protocol, while encapsulating security payload (ESP) is a typical protocol in IPsec architecture.

We analyze the percentage of different anomalies in the dataset, and the results are shown in Figure 5. From the figure, we can find that the top 4 anomalies occupy more than 90% of the MAWI dataset, including ntscUDPUDPPrp (network_scan_udp_udp_response), mptp (multipoint_to_point), mptmp (multipoint_to_multipoint), and ptmp (point_to_multipoint). Furthermore, those anomalies are typical and representative; for example, mptmp usually employs many controlled hosts to attack several destination hosts in a coordinated way, which is similar to botnet and worm attacks. Thus, we mainly employ the anomalies with the ntscUDPUDPPrp, mptp, mptmp, and ptmp labels to evaluate our approach.

6.1.2. Evaluation Metrics

We use precision (P), recall (R), and the F1 score (F1), which are widely used to evaluate the accuracy of many approaches [56]. Their definitions are provided as follows.

P is the ratio between the number of true anomalies detected and the total number of anomalies detected, which is shown in the following formula, where TP is the number of true anomalies detected and PA is the total number of anomalies detected:

P = TP / PA.

R is the ratio between the number of true anomalies detected and the total number of anomalies in the dataset. Its definition is presented in the following formula, where TA represents the total number of anomalies in the dataset:

R = TP / TA.

F1 is the harmonic mean between P and R, which is defined as follows:

F1 = 2PR / (P + R).
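The three metrics above can be computed directly from the sets of detected and ground-truth anomalous IP addresses:

```python
def precision_recall_f1(detected, truth):
    """P, R, and F1 from sets of detected and ground-truth anomalies."""
    tp = len(set(detected) & set(truth))            # true anomalies detected
    p = tp / len(detected) if detected else 0.0     # P = TP / PA
    r = tp / len(truth) if truth else 0.0           # R = TP / TA
    f1 = 2 * p * r / (p + r) if p + r else 0.0      # harmonic mean of P and R
    return p, r, f1
```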

6.2. Parameter Selection

There are six parameters in the proposed algorithm. The parameter s represents the similarity of two IP addresses, and we employ it to ensure that two IP addresses are similar enough; thus, the similarity threshold thresh for the adjacency matrix is set to 0.9. The parameters α, β, and γ are used for controlling the sparsity of the attribute matrix and the residual matrix, the parameter λ is used for controlling the importance of the adjacency matrix, and their values are important for the model performance. We first keep one of the parameters increasing with the other three parameters fixed and then employ the changing trend of F1 to select the parameter values. The results are shown in Figure 6. We can find that F1 increases rapidly when the parameter λ becomes larger and then decreases quickly and tends to be stable, which means that the IP address structure plays an important role in anomaly detection. For the parameters α, β, and γ, F1 increases rapidly as the parameters increase and then tends to be stable. Therefore, we set the initial values of the four parameters in the algorithm according to the changing trends of F1: α = 0.017, β = 0.015, γ = 0.012, and λ = 0.018.

The parameter δ has more impact on the final accuracy. To select the optimal δ for different time windows, we calculate P, R, and F1 with different δ. The two evaluation metrics P and R constrain each other, and we employ F1 to make a trade-off between them. The results for a specific time window are presented in Table 7. From the table, we can find that R is larger than P when the threshold δ is smaller than 0.005; conversely, R is smaller than P when the threshold is larger than 0.005. Therefore, the optimal threshold is selected as 0.005, which yields the largest F1. We can obtain the optimal threshold for each time window based on the above analysis, as shown in Figure 7.

To achieve the goal of automatic optimal parameter selection, we use the exponentially weighted moving average (EWMA) [57] method to predict the optimal parameter δ. Assume δ_{t−1} and δ̂_{t−1} are the optimal threshold and the forecast for the time window t − 1, respectively. The EWMA method can be given as follows:

δ̂_t = bδ_{t−1} + (1 − b)δ̂_{t−1},

where 0 < b < 1 is the weight factor. We employ the mean squared error (MSE) to evaluate the effectiveness of the prediction of the optimal thresholds, and the definition of the MSE is shown as follows:

MSE = (1/m) Σ_{t=1}^{m} (δ_t − δ̂_t)^2,

where the parameter m is the number of windows, and δ_t and δ̂_t are the optimal threshold and the predicted threshold, respectively. We employ the manually selected optimal δ (the results of the first fifteen time points) to obtain the weight factor b used for EWMA, and then the EWMA is used to predict the threshold for the following time points. The analysis results are shown in Figure 7.

6.3. Performance Evaluation

Network traffic is time-series data, its volume is usually massive, and real-time processing is difficult. To improve practicality, we employ a time-window mechanism and set the window size to 6 seconds. This yields 150 time windows for the MAWI dataset, and we analyze the anomalies in each window using the different anomaly detection approaches. For the matrix decomposition methods, we use the proposed method together with the SVD and SPCA approaches to assign an anomaly value to the traffic in each time window and employ a threshold to identify anomalies: if the anomaly value is larger than the threshold, the corresponding network traffic is regarded as anomalous. Since the dataset is labeled, we can then calculate P, R, and F1 in each time window. Similarly, we apply the LOF, CBLOF, ROS, and COF approaches to assign anomaly values to the traffic in each window and obtain the F1 values of these methods in different time windows.
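The time-window mechanism amounts to bucketing timestamped records into fixed 6-second bins; a minimal sketch (the function name and record format are ours):

```python
from collections import defaultdict

def bucket_by_window(packets, window_size=6.0):
    """Group (timestamp, record) pairs into fixed-size time windows,
    indexed from the earliest timestamp in the trace."""
    if not packets:
        return {}
    windows = defaultdict(list)
    t0 = min(ts for ts, _ in packets)
    for ts, rec in packets:
        windows[int((ts - t0) // window_size)].append(rec)
    return dict(windows)
```

Each window's records are then summarized into the per-host feature vectors, and a detector assigns one anomaly value per host per window.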

6.3.1. Comparison with Matrix Decomposition Methods

We select two popular matrix-decomposition-based anomaly detection methods against which to evaluate our method. Wang employed singular value decomposition (SVD) to compute the eigenvalues and corresponding eigenvectors of the covariance matrix [42] and then selected the k eigenvectors with the largest eigenvalues to form a new matrix. For a new vector, SVD first projects it onto the k-dimensional subspace and then calculates the squared Euclidean distance between the vector and its reconstruction. If the distance exceeds a given threshold, the vector is identified as anomalous. In this study, we apply SVD to the attribute matrix and treat the features extracted in the coming time window as the given vector to perform anomaly detection.
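The SVD baseline described above can be sketched with NumPy: fit a k-dimensional subspace on the attribute matrix, then score a new feature vector by its squared reconstruction error (function names are illustrative):

```python
import numpy as np

def fit_subspace(X, k):
    """Top-k right singular vectors of the (mean-centered) attribute matrix."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]                    # (d,), (k, d) subspace basis

def reconstruction_error(x, mu, Vk):
    """Squared Euclidean distance between x and its k-dim reconstruction."""
    proj = (x - mu) @ Vk.T @ Vk + mu     # project, then map back to d dims
    return float(np.sum((x - proj) ** 2))
```

Vectors consistent with the dominant subspace reconstruct almost exactly; a vector off the subspace yields a large error and is flagged once the error exceeds the threshold.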

Erichson proposed the SPCA approach to improve the interpretability of low-rank matrix decomposition [43]. The method is formulated as a regression-type optimization problem of PCA: sparse loadings are obtained by imposing the lasso or elastic-net constraint on the regression coefficients, and the modified principal components are then obtained from the sparse loadings. In this study, we apply SPCA to the attribute matrix to obtain a new matrix composed of principal components. Then, the distance between the features extracted in the coming time window and their reconstruction is calculated to perform anomaly detection.
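A sketch of this baseline using scikit-learn's SparsePCA is shown below; the reconstruction is formed manually from the sparse components and the fitted mean, and the hyperparameter values are placeholders rather than the paper's settings:

```python
import numpy as np
from sklearn.decomposition import SparsePCA

def spca_detector(X, n_components=2, alpha=1.0):
    """Fit SPCA on the attribute matrix; return the model and a scorer."""
    model = SparsePCA(n_components=n_components, alpha=alpha, random_state=0)
    model.fit(X)

    def score(x):
        code = model.transform(x.reshape(1, -1))       # sparse-component code
        recon = code @ model.components_ + model.mean_  # map back to features
        return float(np.sum((x - recon) ** 2))          # squared distance

    return model, score
```

As with the SVD baseline, feature vectors whose reconstruction error exceeds the threshold are flagged; the sparse loadings make it easier to see which original features drive each component.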

The experimental results are presented in Figure 8. As the figure shows, our method outperforms the other two methods. The attribute matrix captures the dynamics of the traffic patterns, while the adjacency matrix characterizes the network environment; thus, our method achieves higher accuracy by incorporating more information into the matrix decomposition. Furthermore, the SVD method lacks interpretability, as the eigenvectors are usually linear combinations of all columns of the original matrix. As for the SPCA method, although it improves interpretability through sparse principal components, it cannot interpret the results completely. The proposed method can easily interpret the detection process by selecting representative rows and columns and can, in turn, trace the anomalies for efficient network management.

6.3.2. Comparison with Other Data Mining Methods

Tuan proposed an anomaly detection method based on the local outlier factor (LOF) [58]. The LOF is defined as the ratio between the local reachability densities of object p's k-nearest neighbors and the local reachability density of p. A larger LOF means that the local reachability density of p is smaller than those of its k-nearest neighbors, so p has a higher probability of being an anomaly. In this study, we set k to the square root of the number of data points, which has been shown to be an optimal choice [59].

He proposed the FindCBLOF algorithm for discovering outliers, which employs a clustering algorithm to divide the dataset into large and small clusters [60]. The cluster-based local outlier factor (CBLOF) is defined as the distance between the item under detection and its closest large cluster; the larger the CBLOF of an item, the more likely it is to be an anomaly.
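A simplified sketch of the CBLOF idea is given below. Note the assumptions: the original FindCBLOF uses a size-ordered α/β rule to separate large from small clusters and a squeezer-style clustering, whereas this sketch uses KMeans and a crude size-ratio split, so it illustrates the score rather than reproducing the algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

def cblof_scores(X, n_clusters=3, large_ratio=0.5):
    """Simplified CBLOF: distance from each point to the centroid of the
    nearest *large* cluster. Larger score means more anomalous."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    sizes = np.bincount(km.labels_, minlength=n_clusters)
    large = sizes >= large_ratio * sizes.max()      # crude large/small split
    centers = km.cluster_centers_[large]
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.min(axis=1)
```

Points inside a large cluster score near zero; points in small clusters or isolated points are far from every large centroid and score high.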

Tang proposed an outlier detection scheme based on the chaining distance [61]. The connectivity-based outlier factor (COF) indicates the probability that an object under detection is an anomaly; for an object p, it is defined as the ratio between the chaining distance of p's k-nearest neighbors and the average chaining distance of its neighbors' k-distance neighborhoods.

Pei proposed a reference-based outlier detection approach that reduces the number of distance calculations compared with the LOF method [62]. They first calculate the distance between each reference point and the data points, then find the k reference-based nearest neighbors for each data point and compute their average neighborhood density. The minimum of the neighborhood densities of a data point is then used to define the reference-based outlier score (ROS); data points with higher ROS are considered anomalies. In this study, we select the parameter k as in LOF and select the reference points from the grid vertices [62].
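A heavily simplified sketch of the ROS idea follows. Per reference point, each data point's neighborhood density is approximated from its k nearest neighbors in the one-dimensional space of distances to that reference; the minimum density over all references is kept and normalized into a score, per the description above. This is an illustration of the mechanism, not Pei's exact formulation:

```python
import numpy as np

def ros_scores(X, refs, k):
    """Simplified reference-based outlier score.
    Higher score (closer to 1) means more anomalous."""
    n = len(X)
    density = np.zeros((len(refs), n))
    for i, r in enumerate(refs):
        d = np.linalg.norm(X - r, axis=1)              # 1-D projection per ref
        for p in range(n):
            gaps = np.sort(np.abs(d - d[p]))[1:k + 1]  # k nearest 1-D gaps
            density[i, p] = 1.0 / (gaps.mean() + 1e-12)
    dmin = density.min(axis=0)        # worst-case density over references
    return 1.0 - dmin / dmin.max()
```

Because 1-D gaps never exceed true distances, cluster members stay dense under every reference, while an isolated point appears sparse under at least one reference and the minimum catches it.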

The experimental results are shown in Figure 9. As the figure shows, the F1 of the proposed method is larger than those of LOF, CBLOF, ROS, and COF. These four approaches employ only user behavior characteristics to mine anomalies, while the proposed method considers user behavior and the network environment simultaneously; thus, the proposed method is more practicable. Furthermore, the matrix decomposition can mine not only the abrupt changes caused by abnormal behavior patterns but also the slight ones.

6.4. Analysis of the Time Complexity

The time complexity of the proposed scheme mainly consists of two parts: feature matrix establishment and matrix decomposition. Two matrices are used to profile the traffic patterns: the attribute matrix and the adjacency matrix. Without any data structure to optimize the establishment process, the establishment time complexity of each is O(n²), where n is the total number of unique IP addresses in the monitored network. With a hash-based optimization, the time complexity of the attribute matrix establishment can be reduced to O(n). As for the adjacency matrix, since the IP addresses of the monitored network are fixed, the IP similarities can be obtained by offline calculation. The most time-consuming operation in the proposed algorithm is the matrix inversion, whose computational complexity is O(n³) when updating the residual matrix at each iteration; the computational complexity of updating W is O(n²d). The total time complexity of the developed method is therefore m · O(n³ + n²d), where m is the number of iterations, n is the number of unique IP addresses in the monitored network, and d is the number of extracted attributes. Note that the time complexity of the proposed method is independent of the traffic volume of the monitored network; only an increase in the number of IP addresses increases it, and for an enterprise network, the total number of IP addresses is usually fixed. Furthermore, the developed method requires no prior knowledge and can therefore be used for unknown anomaly detection. Another advantage is that it can trace anomalies easily, which greatly improves management efficiency. In conclusion, the proposed method is suitable for online security monitoring of medium-sized enterprise networks.

7. Application to Actual Network

7.1. Network Environment and Anomaly Mining Method
7.1.1. Campus Network Selected

We apply the proposed method to the campus network of Xi'an Jiaotong University, the northwest center of CERNET. The selected network contains thousands of users with self-governed IP addresses, including students, faculty members, and contract personnel from service-providing companies. The services used include HTTP, email, FTP, and VoIP. To evaluate our method, we collected a seven-day trace from the campus network, named the CERNET dataset. The statistical results of the dataset are presented in Figure 10; the x-axis represents time points (three-minute intervals), and the y-axis represents the byte volume and the number of packets, respectively. As the figure shows, the user behavior patterns change dynamically and exhibit obvious routine characteristics.

We select two different time intervals of one specific day, 2019-06-05 04:00:00 to 2019-06-05 04:59:00 and 2019-06-05 13:00:00 to 2019-06-05 13:59:00, to include different patterns. The basic statistical information on the datasets is shown in Table 8. We mainly employ this dataset to verify the practicality of the proposed methods.

7.1.2. Anomaly Mining Approach

The selected monitoring network contains 1000 hosts with public IP addresses. The obtained residual matrix and the residual values of a specific time window are shown in Figure 11. Figure 11(a) shows the residual matrix: the x-axis is IP addresses, the y-axis is the seven features, and the z-axis represents the range of the residual values. As the figure shows, most values in the residual matrix are very small, and it is difficult to identify abnormal hosts by analyzing the residual values of each feature separately. Therefore, we collapse the residual values of each host using the ℓ2-norm. The results are shown in Figure 11(b), where the x-axis is IP addresses and the y-axis is residual values. We analyze the IP addresses with larger residual values to identify anomalies, which not only determines whether an IP address is anomalous but also reveals the detailed patterns of the anomaly through its specific features.
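Collapsing the residual matrix into one value per host and ranking the hosts can be sketched as follows (function name and output format are ours):

```python
import numpy as np

def host_residuals(R, ips):
    """Collapse the residual matrix R (hosts x features) into one value per
    host via the row-wise l2-norm, then rank hosts from largest to smallest."""
    values = np.linalg.norm(R, axis=1)
    order = np.argsort(values)[::-1]
    return [(ips[i], float(values[i])) for i in order]
```

The hosts at the top of the ranking are the candidates for inspection; the per-feature entries of their rows then indicate which behavior pattern (e.g., OD/NDDA versus NDR/SNF) is driving the anomaly.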

We analyze the host with the largest residual value as an example to explain our method; its IP address is 115.154.XXX.XXX, and its residual value is 1.28. Its residual values of both the OD and the NDDA are the largest in the residual matrix, which means that the host sends massive numbers of connections to different destination hosts. We can infer that the host may be performing network scan attacks, as an attacking host sends a great number of scan packets to many different destination hosts. The host 202.117.XXX.XXX holds the second largest residual value; its NDR and SNF values are 0.84 and 0.61, respectively, which shows that many different hosts send a large number of packets to this host. DDoS attacks usually employ many hosts controlled by hackers to send a large number of packets to the destination host in a short time period, preventing the destination host from providing normal service because the attack consumes most of its resources [63, 64]. The NDR and SNF features in the residual matrix are larger than the others, which means that many different hosts may be sending massive numbers of packets to this specific host in a short time period; these characteristics match those of DDoS attacks. Therefore, we conclude that the host may be suffering a DDoS attack.

7.1.3. Threshold Analysis

User behaviors exhibit periodic day-and-night characteristics; thus, we should select different thresholds for daytime and nighttime. To set suitable thresholds, we select five different time windows in the daytime and nighttime and analyze their respective residual values. The results are presented in Figure 12, where the x-axis is IP addresses and the y-axis is residual values. As Figure 12(a) shows, the residual values of some IP addresses are larger than 0.2 in different time windows, accounting for about 0.1% of the hosts. These behaviors are regarded as anomalies, so we set the daytime threshold to 0.2. As Figure 12(b) shows, the residual values of several IP addresses are larger than 0.3 in different time windows, so we set the nighttime threshold to 0.3. The daytime threshold is smaller than the nighttime one because more people use the Internet during the day, which produces more complicated user behavior.

7.2. Performance Evaluation

We select five time windows from the daytime and nighttime, respectively. Taking the 13:00:00 time window as an example, eight IP addresses have residual values larger than 0.2, and seven of them are identified as anomalies based on the analysis of their residual matrices and residual values. For the IP address 202.117.XXX.XXX, the NDR and LT values in the residual matrix are large, which indicates that the host has been attacked by other hosts. For the IP address 115.154.XXX.XXX, the OD and NDDA values are large, which means that the host is performing a scan attack. We analyze the other IP addresses in the same way, and the results are presented in Table 9, where top k denotes the number of anomalies detected by our method and #Anomalies denotes the number of true anomalies among them. From the table, we determine that the precision of our method is around 90%.

8. Conclusion

Detecting and controlling network anomalies is one of the most important problems in network management. In this study, we propose a novel anomaly detection method based on matrix decomposition. By analyzing the behavior characteristics of attacks, we extract seven features from the network traffic to construct an attribute matrix that characterizes the difference between normal and abnormal user behavior patterns. We combine the attribute matrix and the adjacency matrix to construct an anomaly detection model and then employ CUR matrix decomposition to mine user behavior patterns and obtain a residual matrix for identifying anomalies. We use two datasets, MAWI and CERNET, to evaluate the performance of the proposed method. The experimental results show that the proposed method achieves a detection accuracy above 90% and outperforms other related methods. Moreover, the developed method can not only locate anomalies and interpret the anomaly detection process but also identify new anomalies without any prior knowledge. In future work, we will focus on reducing the computational complexity and improving the practicality of the algorithm.

Data Availability

The public dataset used in this study is available at http://www.fukuda-lab.org/mawilab/v1.1/2019/05/05/20190505.html, and the data collected from CERNET are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors would like to thank Liang Zhao for his data preprocessing efforts and Guodong Li, from the campus network center, for his efforts in traffic trace collection. This research was supported in part by the National Natural Science Foundation of China (62172324 and 62102310) and the China Postdoctoral Science Foundation (2020M683689XB).