Abstract

Due to the explosive growth of Internet data, more and more applications are being deployed on big data platforms. As the scale of data continues to increase, however, so does the probability of anomalies in the platform, and traditional anomaly detection techniques cannot effectively handle the massive amount of historical data or meet the security requirements of big data platforms. To solve these problems, this paper proposes a security anomaly detection method for big data platforms based on quantum-optimized clustering. Firstly, a framework for a big data platform anomaly detection system is designed on the basis of a distributed software architecture using the Hadoop and Spark open source technologies. The system achieves effective detection of network anomalies by collecting and analyzing big data platform server log data. Secondly, an offline anomaly detection algorithm based on quantum ant colony optimized affinity propagation clustering is designed for the various anomalies mined from historical data. The bias parameters of affinity propagation clustering are treated as individual ants to construct an ant colony, and the clustering accuracy is used as the fitness. Finally, to improve the accuracy of the ant colony's optimal path search, quantum bit encoding is applied to the ant colony positions to refine the granularity of individual position updates. The experimental results show that the proposed method can effectively perform anomaly clustering detection on massive data. With a reasonable threshold, the quantum ant colony–based affinity propagation clustering achieves high detection accuracy.

1. Introduction

With the rapid development of big data technologies, big data platform architectures are becoming more complex, and the security requirements of big data platforms in the face of new risks continue to increase. As the size of data and the functionality of modules grow, so does the probability of anomalies in the platform [1–4]. For example, DDoS attacks by hackers can bring down big data platform servers; bursts of access traffic generated by large numbers of users during holiday periods can cause big data applications to crash; and the proliferation of viruses and Trojan horses can lead to the leakage of personal information from applications [5, 6]. These security issues can cause incalculable financial losses to individuals and society. Therefore, accurate and timely detection of anomalies in big data platforms is of great practical importance.

Currently, the security of a big data platform is mainly provided by its infrastructure. However, due to the lack of necessary semantic interpretation and upper-layer security mechanisms, the lower-layer infrastructure is not capable of fully detecting anomalous events in the platform; traditional firewalls and other security devices, for example, are unable to effectively detect and prevent such events [7–9]. To achieve anomaly detection in big data platforms, we need to perform data mining on server logs and network traffic. Because these massive data streams are fast, unbounded, variable, and continuous, it is very difficult to analyze them manually. Over the years, research on automated anomaly detection has received much attention and has been widely applied in areas such as intrusion detection, fault diagnosis, identity recognition, and e-mail filtering [10–14]. Trojans, viruses, and system vulnerabilities are now handled well by software such as firewalls and security assistants. However, traditional detection methods still have limitations for network anomalies such as burst traffic anomalies and DDoS attacks: the data in the platform are highly variable, and expert rule-based detection methods can hardly detect both types of anomalies completely. In addition, software such as firewalls and security assistants consumes substantial computational resources when scanning large amounts of data. In summary, to ensure the security of big data platforms and to overcome the limitations of traditional anomaly detection algorithms, it is promising to study key technologies for anomaly detection in big data platforms.

An anomaly is defined as a special sample (pattern) that is inconsistent with most observed values in the observation set and is produced by a completely different mechanism. For example, in regression algorithms, attribute values with significant deviations from expectations are regarded as anomalies [15, 16]. In statistical models, sample data that are distant from the series and do not obey the assumed distribution are considered anomalies. Assuming that there are qualitatively (or quantitatively) describable differences between normal and abnormal patterns, anomaly detection is the process of identifying those differences in the observed set of samples using statistical methods, data mining, and other theories. Anomaly detection algorithms are widely applicable to various fields, such as cyber-attack detection, credit card fraud detection, financial loan approval, and medical drug research [17–20].

Early research into anomaly detection mainly used misuse detection. Misuse detection constructs detection features based on existing network attacks and matches them to the corresponding attacks. The key step in misuse detection is the filtering and labeling of logs, which requires the extraction of important information from a large number of files. Misuse detection therefore requires expert knowledge of the relevant Web attacks and can only detect predefined attacks; it is powerless against unknown anomalies. Anomaly detection methods, by contrast, mainly use normal Web log data for analysis and training, and then build detection models that can distinguish unknown anomalous behavior. The attack strategies currently faced by big data platforms mainly include DDoS attacks, protocol vulnerability attacks, application service vulnerability attacks, and Trojan horses [21–23], all of which are unknown anomalies. Therefore, intrusion detection systems (IDS) based on misuse detection cannot solve the security problems faced by big data platforms.

The anomaly detection data sources of a big data platform are mainly server hosts and network traffic. Server host-based detection uses user logs and server access logs as the data source and analyzes them online or offline using relevant anomaly detection algorithms. Server host-based detection does not require external equipment and is insensitive to traffic volume. The data source for network traffic-based anomaly detection is mainly network traffic from devices such as routers. Network traffic-based anomaly detection has no access to the real-time status of the host system, resulting in less accurate detection.

There are currently three main types of anomaly detection algorithms [24, 25]: (1) statistical analysis, (2) rule-based methods, and (3) data mining. Statistical analysis-based anomaly detection algorithms assume that the data to be tested obey a certain random distribution and identify anomalies through inconsistency detection. Said et al. [26] proposed a session anomaly degree calculation model based on information from the request URL, request time, and other fields. Aissaoui et al. [27] used log files to segment session attributes and applied Bayesian parameter estimation to determine the session anomaly level. In the low-dimensional case, statistical knowledge can be used to determine the distribution. However, when faced with large amounts of data, the data are usually high-dimensional and therefore cannot be analyzed statistically.

Rule-based anomaly detection algorithms use expert experience to build a rule base and perform pattern matching against the rule base to determine whether an anomaly is present. Rule-based anomaly detection algorithms are often used in misuse detection systems, such as the widely used Snort system, which has over 20,000 rules, each of which is a summary of expert experience. Metcalfe [28] constructs a complete rule set based on the common elements of the data sequence under normal conditions; sequences that do not match the rule set are flagged as anomalies during detection. Rule-based anomaly detection algorithms are unable to detect unknown anomalies and require a priori knowledge to detect intrusions. In a big data environment, where anomalous attacks occur every day and expert experience is limited, building and maintaining a comprehensive rule base is a very difficult task.

Data mining–based anomaly detection algorithms preprocess the data to be tested and then extract the appropriate patterns from these data. If the extracted patterns do not match normal behavior, they are categorized as anomalies. Data mining–based anomaly detection algorithms fall into three main categories: detection algorithms based on correlation sequences, detection algorithms based on classification analysis, and detection algorithms based on cluster analysis. The most popular of these is cluster analysis, which belongs to the field of unsupervised detection. Anomaly detection algorithms based on cluster analysis divide the data into multiple clusters and identify anomalous clusters based on the similarity within and between clusters. For example, Sunardi et al. [29] proposed a clustering-based anomaly detection algorithm for DDoS attacks that can automatically identify web crawlers. However, the method is difficult to train and has high computational complexity.

In summary, anomaly detection in big data platforms is a complex problem, especially for the anomalies present in massive historical data. Therefore, this paper proposes a quantum-optimized clustering-based anomaly detection method for big data platform security. The aim of the research was to accurately detect multiple anomalies present in the massive data of a big data platform in a reasonable time using an improved clustering analysis algorithm without the need for expert experience and rule bases.

The main innovations and contributions of this paper include:
(1) A framework for a big data platform anomaly detection system is designed on the basis of a distributed software architecture using the Hadoop and Spark big data open source technologies. The system achieves effective detection of network anomalies by collecting and analyzing the server log data of the big data platform.
(2) To improve the performance of affinity propagation (AP) clustering, a quantum ant colony–based bias parameter optimization strategy is introduced, thereby enhancing the applicability of clustering. The similarity matrix and random values of the bias parameters are calculated after sample initialization. The selection paths are continuously optimized through changes in the pheromone values in the quantum ant colony optimization algorithm, resulting in stable clustering results.

The rest of the paper is organized as follows: Section 2 studies the anomaly detection system framework for big data platforms in detail, while Section 3 presents the offline anomaly detection method based on quantum ant colony optimized AP clustering. Section 4 provides the experimental results and analysis. Finally, the paper is concluded in Section 5.

2. Design of an Anomaly Detection System Framework for Big Data Platforms

The development of new open source technologies such as Hadoop and Spark has made it easy for big data–related companies to process and analyze all kinds of big data. Traditional anomaly detection systems focus only on the detection step itself and are not organically integrated with big data frameworks; most of them rely on rule-based anomaly detection algorithms. As there is no mature solution yet, it is important to study a system framework for big data anomaly detection and analysis.

2.1. System Logical Architecture Design

The objective of this paper is to design an intelligent security analysis system for big data platforms based on Hadoop and Spark technologies. The logical architecture of the system is shown in Figure 1. The intelligent security analysis system, based on the Lambda architecture, consists of an offline processing layer, an online processing layer, and a service layer [30, 31]. The architecture combines the data detection functions of offline and online environments, and has the advantages of high fault tolerance, high scalability, and fast processing speed. In addition, the architecture supports various big data components such as Hadoop, Spark, and Kafka, and is suitable for deploying anomaly detection on big data platforms.

The offline processing layer is responsible for the storage and processing of large-scale data. This layer is mainly implemented using Hadoop and Spark. The collected historical data are stored on HDFS, while the data processing results are stored in HBase. The data are preprocessed and checked for anomalies in an offline environment using Spark's efficient computing capabilities. The online processing layer stores the incoming data streams with the distributed message cache Kafka. The service layer implements fast, user-interactive queries using Spark SQL.
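
As a concrete illustration of the offline path, the following minimal PySpark sketch loads historical logs from HDFS and runs a service-layer-style Spark SQL aggregation. The HDFS paths and log fields (src_ip, response_time) are illustrative assumptions, not details given in the paper.

```python
# Minimal sketch of the offline layer: batch-load logs from HDFS and
# aggregate them with Spark SQL. Paths and field names are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("offline-anomaly-detection")
         .getOrCreate())

# Historical server logs collected by Flume, one JSON record per line.
logs = spark.read.json("hdfs://namenode:9000/logs/web/")
logs.createOrReplaceTempView("web_logs")

# Per-source-IP features to be fed to the clustering detector.
features = spark.sql("""
    SELECT src_ip,
           COUNT(*)           AS request_count,
           AVG(response_time) AS avg_response_time
    FROM web_logs
    GROUP BY src_ip
""")
features.write.mode("overwrite").parquet("hdfs://namenode:9000/features/")
```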

2.2. System Physical Architecture Design

The physical architecture of the system is shown in Figure 2; it mainly accomplishes the tasks of log collection, log storage, and data processing. Firstly, standalone logs are collected by deploying a Flume agent on each server in the big data platform. Secondly, for the log storage task, online streaming data are cached in Kafka to ensure secure data transfer, while large batches of offline data are stored on HDFS. Finally, the Spark Streaming technology in the Spark framework is used to process the real-time data, and the offline data are detected and analyzed using clustering algorithms.
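
The real-time path can be sketched as follows. This uses Spark's Structured Streaming Kafka source, a close modern equivalent of the Spark Streaming API named above; the broker address, topic name, and output paths are assumptions for illustration.

```python
# Minimal sketch of the online layer: consume Flume-fed log events from
# Kafka and persist micro-batches to HDFS for later offline clustering.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("online-log-ingest").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")  # hypothetical broker
          .option("subscribe", "flume-web-logs")            # hypothetical topic
          .load())

# Kafka values arrive as bytes; cast to strings before parsing.
lines = stream.selectExpr("CAST(value AS STRING) AS line")

query = (lines.writeStream
         .format("parquet")
         .option("path", "hdfs://namenode:9000/logs/stream/")
         .option("checkpointLocation", "hdfs://namenode:9000/checkpoints/ingest/")
         .start())
query.awaitTermination()
```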

The key modules of the system include the log preprocessing module, the monitoring and alerting module, and the data presentation module. The log preprocessing module is responsible for the collection and normalization of logs. In designing the log management module, this paper uses the Apache Flume open source log collection system to collect data from the Web servers on the big data platform. The collected data include request logs from the Web server, user access logs from the application server, and bulk Web logs from the file server. The workflow of Apache Flume is shown in Figure 3.
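
A Flume agent of the kind described here could be configured roughly as below; the source command, topic name, and addresses are illustrative assumptions rather than the paper's actual deployment.

```properties
# Hypothetical Flume agent: tail a web server access log and forward
# events to the Kafka topic consumed by the online processing layer.
agent.sources  = web-log-source
agent.channels = mem-channel
agent.sinks    = kafka-sink

agent.sources.web-log-source.type = exec
agent.sources.web-log-source.command = tail -F /var/log/httpd/access_log
agent.sources.web-log-source.channels = mem-channel

agent.channels.mem-channel.type = memory
agent.channels.mem-channel.capacity = 10000

agent.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.kafka-sink.kafka.topic = flume-web-logs
agent.sinks.kafka-sink.kafka.bootstrap.servers = kafka:9092
agent.sinks.kafka-sink.channel = mem-channel
```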

The monitoring and alerting module is divided into two parts: anomaly rule management and anomaly detection, of which anomaly detection is the core and the focus of this paper. The proposed system uses the quantum-optimized clustering algorithm to perform anomaly detection on offline historical data. The anomaly rule management part uses Snort as a subsystem for rule-based anomaly detection of online data. When an anomaly is detected, the system flags it and sends a processing message to the data presentation module for an anomaly warning.

The data presentation module uses a basic C/S (client/server) framework and delivers data to the front-end interface in JSON format. The module uses statistical charts to visualize messages for the user. The data presentation module includes sub-modules for message push, data query, and data interface. WebSocket technology is used to implement active pushing of messages, taking into account the data transfer rate and bandwidth utilization, and Spark SQL is used for fast, interactive queries.

3. Offline Anomaly Detection Based on Quantum Ant Colony Optimized AP Clustering

As a technique that does not require manual labeling, cluster analysis occupies an important position in data mining. Through clustering, hidden connections in massive amounts of data can be effectively mined, thus enabling value processing of large-scale data. As a kind of unsupervised learning, clustering can group similar objects into the same cluster. The more similar the objects within a cluster, the more effective the clustering will be. The advantage of anomaly detection algorithms based on clustering analysis is that the data categories do not have to be manually labeled, thus reducing the cost of the training process of the anomaly detection algorithm. Without the aid of a priori knowledge, clustering often fails during data mining due to high dimensionality or heterogeneity.

The most commonly used anomaly detection algorithm based on cluster analysis is the K-Means algorithm [32]. However, the data sources in anomaly detection usually contain a large number of target objects, resulting in a large computational effort for K-Means–based anomaly detection. Moreover, the selection of the K value in the K-Means algorithm requires repeated trials, which further increases the complexity of detection.

3.1. Affinity Propagation Clustering

The affinity propagation (AP) clustering algorithm [33] is a relatively new clustering method. Compared with traditional clustering methods, the AP clustering algorithm does not require a predetermined number of clusters and offers better clustering performance and efficiency. However, the accuracy of AP clustering is often constrained by factors such as the number of samples, the degree of sample balance, and the number of cluster centers. Therefore, to obtain better clustering results on large-scale data, the clustering method must be continuously improved according to the actual clustering needs.

There have been many studies on optimized clustering algorithms. Shao et al. [34] used the density peak algorithm to complete clustering and the whale optimization algorithm to tune the core parameters of density peak clustering, enhancing the clustering performance. This study provides a new research direction for the improvement of clustering performance. Recently, various quantum population intelligence optimization algorithms have been proposed and have shown even better global and local optimization capabilities. So far, however, no research has emerged on using quantum population intelligence optimization algorithms to enhance the performance of AP clustering.

AP clustering first needs to calculate the degree of similarity between each pair of samples:

$$s(i,k) = -\left\| x_i - x_k \right\|^2, \quad i \neq k$$

where $x_i$ and $x_k$ represent the feature vectors of samples $i$ and $k$, respectively. The distance between any two of the samples is calculated for all samples and output as a matrix $S$. The values on the diagonal of the matrix, $s(k,k)$, are called the bias parameter $p$.

Two messages are passed between samples: the responsibility $r(i,k)$, which measures how well suited sample $k$ is to serve as the exemplar (class center) of sample $i$, and the availability $a(i,k)$, which measures how appropriate it is for sample $i$ to choose sample $k$ as its exemplar. Both can be represented in matrix form and are updated as

$$r(i,k) = s(i,k) - \max_{k' \neq k} \left\{ a(i,k') + s(i,k') \right\}$$

$$a(i,k) = \min \left\{ 0,\; r(k,k) + \sum_{i' \notin \{i,k\}} \max \left\{ 0,\, r(i',k) \right\} \right\}, \quad i \neq k$$

The exemplar decision of the AP clustering algorithm takes both $r(i,k)$ and $a(i,k)$ into account: the value of $r(i,k) + a(i,k)$ is used to measure the degree of match between sample $i$ and exemplar $k$.

Let $c_i = \arg\max_k \left\{ r(i,k) + a(i,k) \right\}$. The two messages are iteratively solved until the exemplar assignments $c_i$ stabilize, which completes the clustering.

A damping factor $\lambda$ ($0 < \lambda < 1$) is added to attenuate the oscillation effect in the updates of $r(i,k)$ and $a(i,k)$. The calculation of $r(i,k)$ and $a(i,k)$ at iteration $t+1$ is as follows:

$$r_{t+1}(i,k) = (1-\lambda)\, r_{t+1}(i,k) + \lambda\, r_t(i,k)$$

$$a_{t+1}(i,k) = (1-\lambda)\, a_{t+1}(i,k) + \lambda\, a_t(i,k)$$
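
For reference, plain AP clustering with these update rules is available in scikit-learn, whose `preference` argument plays the role of the bias parameter $p$ and whose `damping` argument is the factor $\lambda$ above. The sketch below is a minimal usage example with synthetic data standing in for log features.

```python
# Minimal AP clustering sketch; `preference` = bias parameter p,
# `damping` = the oscillation-attenuation factor lambda.
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))       # synthetic stand-in for log features

ap = AffinityPropagation(preference=-50.0, damping=0.7, random_state=0).fit(X)
print("clusters found:", len(ap.cluster_centers_indices_))
```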

3.2. Affinity Propagation Clustering for Quantum Ant Colony Optimization

In this paper, the AP clustering algorithm is used to implement data mining. To prevent the degradation of clustering performance caused by unreasonable bias parameter settings, we introduce the ant colony optimization (ACO) strategy [35]. The ant colony algorithm is used to optimize the bias parameters, and quantum bit coding is used to encode the individual positions of the ant colony to improve the applicability of AP clustering. The essence of the ant colony algorithm is to find the optimum through ant path selection: the next node on an ant's path is determined by the pheromone of the selectable nodes. When an ant is at node $i$ and the set of optional nodes is $J_i$, the probability of selecting node $j \in J_i$ as the next node is

$$p_{ij} = \frac{\left[\tau_{ij}\right]^{\alpha} \left[\eta_{ij}\right]^{\beta}}{\sum_{s \in J_i} \left[\tau_{is}\right]^{\alpha} \left[\eta_{is}\right]^{\beta}}$$

where $\tau_{ij}$ denotes the pheromone between node $i$ and node $j$, initialized to $\tau_0$; $\alpha$ is the coefficient of the pheromone; $\eta_{ij}$ is the inspired (heuristic) intensity; and $\beta$ is the coefficient of the inspired intensity. The pheromone needs to be updated after the ant moves to the next node:

$$\tau_{ij}(t+1) = (1-\rho)\, \tau_{ij}(t) + \Delta\tau_{ij}^{\text{best}}$$

where $\rho$ is the evaporation factor and $\Delta\tau_{ij}^{\text{best}}$ indicates the best path pheromone value.
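
The two rules above can be written directly in code. The following NumPy sketch, with $\tau$, $\eta$, $\alpha$, $\beta$, and $\rho$ named as in the text, shows one node-selection step and one pheromone update under the assumption of a dense pheromone matrix.

```python
# One ACO decision step and pheromone update, following the rules above.
import numpy as np

def choose_next_node(tau, eta, current, allowed, alpha=1.0, beta=2.0, rng=None):
    """Select the next node with probability proportional to tau^alpha * eta^beta."""
    rng = rng or np.random.default_rng()
    weights = (tau[current, allowed] ** alpha) * (eta[current, allowed] ** beta)
    return rng.choice(allowed, p=weights / weights.sum())

def update_pheromone(tau, best_path, rho=0.1, delta_best=1.0):
    """Evaporate all pheromone, then reinforce the edges of the best path."""
    tau *= (1.0 - rho)
    for i, j in zip(best_path[:-1], best_path[1:]):
        tau[i, j] += delta_best
    return tau
```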

The bit representation of a quantum is shown as follows:

$$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$$

where $|\psi\rangle$ represents the superposition state, and $|\alpha|^2$ and $|\beta|^2$ represent the probability of quantum collapse to the "0" and "1" states, respectively, with $|\alpha|^2 + |\beta|^2 = 1$.

Let $\alpha = \cos\theta$ and $\beta = \sin\theta$, so that the normalization condition holds automatically.

We need to code all individual positions of the colony:

$$X_i = \begin{bmatrix} \cos\theta_{i1} & \cos\theta_{i2} & \cdots & \cos\theta_{in} \\ \sin\theta_{i1} & \sin\theta_{i2} & \cdots & \sin\theta_{in} \end{bmatrix}$$

where $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, n$, $m$ is the total number of individuals in the colony, and $n$ represents the location dimension.

Let $\theta_{ij}$ be the phase of $X_i$ on the component $j$, where $\theta_{ij}$ satisfies $\theta_{ij} \in [0, 2\pi)$.
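
A minimal sketch of this encoding follows, assuming one-dimensional positions (the bias parameter is a scalar) and a simple rotation-style phase update toward the best individual; the step size and update rule are common choices in quantum-inspired optimization, not details specified in the paper.

```python
# Quantum bit encoding of colony positions: each position component is a
# phase theta, with amplitudes (cos theta, sin theta) for |0> and |1>.
import numpy as np

m, n = 20, 1                                  # colony size, position dimension
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2 * np.pi, size=(m, n))

alpha_amp, beta_amp = np.cos(theta), np.sin(theta)
assert np.allclose(alpha_amp**2 + beta_amp**2, 1.0)   # normalization holds

def rotate_towards(theta, theta_best, step=0.05 * np.pi):
    """Nudge every phase toward the best individual's phase, refining the
    granularity of each position update (hypothetical update rule)."""
    return theta + step * np.sign(theta_best - theta)
```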

3.3. Flow of Offline Anomaly Detection

The quantum ACO-AP clustering described above allows the dataset under test to be efficiently divided into clusters; an anomaly index is then constructed to give a quantitative description of each cluster. Clusters with a large anomaly index are regarded as anomalous, meaning that all objects in such a cluster are anomalous data. The dataset is first clustered using quantum ACO-AP clustering, and the anomaly index is calculated for each cluster. The clusters are then sorted by anomaly index and compared against a corresponding threshold to identify the anomalous clusters, completing the anomaly detection. The flow of offline anomaly detection is shown below, followed by a condensed code sketch.

Step 1. Solving the similarity matrix for the offline historical data to be tested in the big data platform in order to initialize the bias parameters.

Step 2. Building several individuals of the ant colony optimisation algorithm with random values of the bias parameters.

Step 3. Quantizing the position of individual ants.

Step 4. Performing ant colony optimization to solve for the ant colony individual with the highest fitness (optimal bias parameter).

Step 5. Completing the clustering using the best bias parameter AP algorithm.
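
The following condensed sketch ties Steps 1–5 together with the anomaly-index judgement, using scikit-learn's AffinityPropagation and the Silhouette value as fitness. The colony is simplified to a random sweep over candidate bias parameters (the quantum encoding and pheromone updates sketched earlier are omitted for brevity), and the anomaly index used here, a cluster's share of all samples, is an illustrative assumption.

```python
# Condensed offline detection flow: optimize the bias parameter, cluster,
# then flag small clusters as anomalous. Fitness and anomaly index are
# illustrative choices, not the paper's exact definitions.
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_score

def detect_anomalies(X, n_ants=15, threshold=0.05, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: similarity matrix (negative squared Euclidean distances).
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    s_off = -d2[~np.eye(len(X), dtype=bool)]
    # Step 2: one candidate bias parameter per colony individual.
    candidates = rng.uniform(s_off.min(), np.median(s_off), size=n_ants)
    # Steps 3-4: evaluate the fitness of each candidate (colony search simplified).
    best = (None, -np.inf, None)               # (p, fitness, labels)
    for p in candidates:
        labels = AffinityPropagation(preference=p, damping=0.7,
                                     random_state=0).fit_predict(X)
        if len(set(labels)) < 2:
            continue
        fit = silhouette_score(X, labels)
        if fit > best[1]:
            best = (p, fit, labels)
    # Step 5 + judgement: clusters holding a small share of samples are anomalous.
    _, _, labels = best
    cluster_share = np.bincount(labels) / len(X)
    return labels, cluster_share[labels] < threshold
```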

4. Experimental Results and Analysis

In this paper, two sets of experiments are designed. The aim of the first set is to demonstrate that the quantum ACO-AP algorithm has better clustering performance than the K-Means algorithm. The purpose of the second set is to verify the effectiveness of anomaly detection based on the quantum ACO-AP algorithm. The experimental machine is a Lenovo desktop computer with an Intel Core i7 3.30 GHz CPU, 16 GB of memory, and the CentOS 7.0 operating system. The commonly used clustering accuracy and Silhouette values are selected as the evaluation metrics for clustering performance.
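
The two metrics can be computed as follows; the Silhouette value comes directly from scikit-learn, and clustering accuracy is taken here in its usual sense, the best one-to-one matching between predicted clusters and true classes via the Hungarian algorithm (an assumption, since the paper does not spell out its definition).

```python
# Evaluation metrics: clustering accuracy (Hungarian matching) and Silhouette.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import silhouette_score  # silhouette_score(X, labels)

def clustering_accuracy(y_true, y_pred):
    """Fraction of samples correctly assigned under the best cluster-to-class map."""
    k = max(y_true.max(), y_pred.max()) + 1
    contingency = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        contingency[t, p] += 1
    rows, cols = linear_sum_assignment(-contingency)   # maximize matched samples
    return contingency[rows, cols].sum() / len(y_true)
```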

4.1. Clustering Performance Evaluation
4.1.1. Clustering Performance with Different Bias Parameters

The first set of experiments used the UCI dataset, as shown in Table 1.

In AP clustering, the common bias parameter selection methods are the median and the minimum of the similarity matrix. These two strategies and the proposed quantum ACO optimization were each used for clustering simulation, and the results are shown in Table 2.

It can be seen that the number of categories obtained by AP clustering when P = median is significantly greater than the other two methods. This is mainly because when P = median, too many class centres are obtained in the clustering process, which results in a significantly distorted number of categories. The number of classes obtained by AP clustering when P is equal to the minimum is also significantly more than the actual classes, which indicates that the conventional AP clustering algorithm is poorly adapted to the UCI dataset. Therefore, it becomes critical to optimize the bias parameters of AP.

The number of categories obtained by the quantum ACO-AP clustering algorithm is consistent with the actual categories, which indicates its strong adaptive capability. Next, the effects of the three bias parameter setting strategies on the AP clustering accuracy will be analyzed, and the results are shown in Figure 4.

It can be seen that different bias parameter setting strategies have a considerable impact on the Silhouette values. The Silhouette value of the quantum ACO-AP algorithm is significantly higher than that of the other two algorithms on all five datasets. A cross-sectional comparison reveals that each of the three algorithms obtained its highest Silhouette value on the Iris set. The setting of the bias parameter therefore has a large impact on the performance of AP clustering; when it cannot be set manually in a reasonable way, it is more appropriate to use the ACO algorithm to adaptively optimize the bias parameter.

4.1.2. Performance Verification of the Quantum ACO

To further verify the optimization performance of Quantum ACO for AP clustering, the clustering tests were carried out using the AP, ACO-AP, and Quantum ACO-AP algorithms, respectively, and the results are shown in Table 3 and Figure 5.

It can be seen that the clustering categories obtained using the AP algorithm are significantly more numerous than the actual categories, while the quantum ACO-AP and ACO-AP algorithms obtain clusters that are closer to the actual categories. This indicates that the clustering effect is significantly improved after introducing the ACO algorithm to optimize the bias parameters. The class counts obtained by quantum ACO-AP clustering are all equal to the actual class counts, whereas ACO-AP matches the actual values only on the Wine and Glass datasets. The bias parameter optimization performance of ACO is further improved by quantum bit encoding, resulting in higher accuracy for quantum ACO-AP clustering.

In terms of Silhouette performance, the AP algorithm stayed below 0.25 on all datasets, the quantum ACO-AP algorithm stayed above 0.4, and the ACO-AP algorithm stayed within [0.3, 0.36]. This is because the quantum ACO algorithm produces a more reasonable distribution among the different cluster classes in the dataset. A comparison of the clustering accuracy of the three algorithms is shown in Table 4.

For the 5-class sample set, Quantum ACO-AP has the highest clustering accuracy. After the introduction of QACO, the performance of AP clustering was more stable. This is mainly due to the excessive number of class centres when the bias parameters of AP are not set appropriately, which tends to cause oscillations in the clustering results. A comparison of the convergence performance of the three algorithms is shown in Figure 6.

In terms of the number of iterations, Quantum ACO-AP requires fewer iterations, mainly because the AP algorithm requires more iterations to solve for the highest clustering accuracy before the bias parameters are optimized. However, with the quantum ACO algorithm, the efficiency of AP clustering is significantly improved. In terms of convergence curves, both ACO-AP and AP briefly fall into local optima, while the quantum ACO-AP algorithm has a very smooth downward trend.

4.1.3. Clustering Performance of Commonly Used Algorithms

To further validate the clustering performance of the quantum ACO-AP algorithm, it was compared with the decision tree, K-medoids, and K-Means algorithms, and the results are shown in Figure 7.

The cross-sectional comparison revealed that all four algorithms had the highest clustering accuracy in the Seeds set and generally poorer clustering accuracy in the Iris set. It can be seen that for the same sample set, the quantum ACO-AP algorithm has the highest clustering accuracy. With a comprehensive analysis of the above results, we can see that Quantum ACO-AP has a more obvious advantage in terms of clustering accuracy.

4.2. Anomaly Detection Results

The second set of experiments uses the KDD CUP 99 dataset, which contains about 4,900,000 network protocol connection records. Each record consists of 42 fields, where the 42nd field indicates whether the record is an anomaly. To verify the effectiveness of the quantum ACO-AP algorithm, 40,000 records were randomly selected from the dataset to form the dataset to be examined, comprising 38,447 normal records and 1,553 anomalous records.

Detection accuracy is usually evaluated using the detection rate and the false alarm rate. The detection rate DR is calculated as follows:

$$DR = \frac{TP}{TP + FN}$$

The false alarm rate FAR is calculated as follows:

$$FAR = \frac{FP}{FP + TN}$$

where $TP$ indicates the number of correctly detected abnormal samples, $FN$ indicates the number of abnormal samples incorrectly detected as normal, $TN$ indicates the number of correctly detected normal samples, and $FP$ indicates the number of normal samples incorrectly detected as abnormal.
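
In code, the two rates are a direct transcription of the formulas; the sample counts in the usage comment are illustrative, chosen only to reproduce the 98.4% figure reported below.

```python
# Detection rate and false alarm rate, transcribed from the formulas above.
def detection_rate(tp: int, fn: int) -> float:
    """TP = abnormal samples detected correctly; FN = abnormal samples missed."""
    return tp / (tp + fn)

def false_alarm_rate(fp: int, tn: int) -> float:
    """FP = normal samples flagged as abnormal; TN = normal samples kept."""
    return fp / (fp + tn)

# Illustrative check: with 1,553 anomalies in the subset, detecting 1,528 of
# them gives detection_rate(1528, 25) ~= 0.984, i.e. the 98.4% figure.
```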

Firstly, different thresholds are used to detect anomalies in the KDD CUP 99 experimental subset, as shown in Table 5 and Figure 6.

It can be seen that the detection rate of the quantum ACO-AP algorithm decreases as the threshold value increases; however, the false alarm rate decreases at the same time. When the threshold is 0.05, the detection rate reaches 98.4%, but the false alarm rate is relatively high. This indicates that the threshold has a certain influence on the experimental results and should be chosen carefully. The best trade-off is obtained when the threshold lies within [0.05, 1].

Finally, experimental comparisons were performed on the KDD CUP99 dataset using the Quantum ACO-AP algorithm, the PLC algorithm, and the CE algorithm, respectively, and the comparison results are shown in Table 6.

It can be seen that the quantum ACO-AP algorithm outperforms the other two anomaly detection algorithms in terms of both detection rate and false alarm rate. In addition, the quantum ACO-AP algorithm shows a smaller range of variation in both metrics. Combining the results of the first and second sets of experiments, we conclude that the quantum ACO-AP algorithm achieves higher anomaly detection performance under reasonable threshold conditions.

5. Conclusion

Without the need for expert experience or rule bases, this paper uses an improved clustering analysis algorithm to accurately detect multiple anomalies in the massive data of a big data platform within a reasonable time. The quantum ACO algorithm is used to optimally solve the bias parameters of AP clustering, which improves the clustering accuracy. Reasonable settings of the main parameters of quantum ACO yield better bias parameter optimization and enhance the applicability of AP clustering. The experimental results show that, compared with other anomaly detection algorithms, the quantum ACO-AP algorithm shows clear advantages in terms of both detection rate and false alarm rate. The next step of the research will be to further optimize the core parameters of the quantum ACO algorithm in order to reduce its solution time. We will also attempt to improve the real-time performance of large-scale sample processing by improving the clustering efficiency of quantum ACO-AP.

Data Availability

The experimental data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest to report regarding the present study.