Abstract
As Internet services become increasingly abundant, all walks of life are inextricably linked with the Internet. At the same time, WEB attacks on the Internet have never stopped. Compared with other common WEB attacks, WEB DDoS (distributed denial of service) attacks can seriously damage the availability of the target network or system resources within a short period of time. At present, most research centers on machine-learning-based DDoS attack detection algorithms. According to previous studies, unsupervised methods generally have a high false positive rate, while supervised methods cannot handle large amounts of network traffic data, and their performance is often limited by noise and irrelevant data. Therefore, this paper proposes a semisupervised learning detection model that combines spectral clustering and random forest to detect DDoS attacks at the WEB application layer and compares it with existing detection schemes. The comparison shows that the proposed model improves the detection rate while maintaining a low false positive rate, which makes it more suitable for WEB application layer DDoS attack detection.
1. Introduction
With the rapid development of the Internet, online services are increasing, and all walks of life are inextricably linked with the Internet. People have become increasingly dependent on it, whether for online shopping or for travel. However, while the Internet develops comprehensively and rapidly, attacks on it persist and change constantly. WEB applications have become the focus of these attacks because of their wide range of uses. Common WEB attacks [1] include WEB DDoS attacks, cross-site scripting attacks, and request forgery attacks. With the development of distributed technologies and the proliferation of botnets, WEB DDoS attacks have become the most threatening type of attack: even a short attack can seriously damage the availability of target networks or system resources.
WEB DDoS attacks have three characteristics: they are distributed, they develop rapidly, and they are destructive [2]. Traditional attack detection methods cannot detect WEB DDoS attacks effectively and accurately, and with the development of machine learning, many researchers have applied it to this problem. Machine learning algorithms [3–7] fall into two types: unsupervised learning and supervised learning. However, unsupervised methods alone have a high false positive rate, while supervised methods alone cannot handle a large number of unknown attacks. For new types of attack in network traffic data, researchers have combined K-means and C4.5 for attack detection, which has been shown experimentally to achieve a higher detection rate than supervised or unsupervised algorithms used alone. However, K-means and C4.5 perform worse than other current machine learning algorithms, so the detection accuracy and false positive rate of this hybrid leave considerable room for improvement. Therefore, this paper studies and proposes a detection method for WEB DDoS attacks.
2. Related Work
The focus of this paper is on DDoS attacks at the WEB application layer. Research in this direction has never stopped at home and abroad. Moreover, with the development of machine learning technology, machine learning has become a mainstream approach in DDoS detection research. Two studies, including that of Kim et al. [4], use machine learning methods to identify network traffic: one concludes that DBSCAN is better suited to clustering, and the other shows that the support vector machine (SVM) performs better in detecting attacks. Calix and Rajesh [5] tested the SVM algorithm on the NSL-KDD data set and obtained an accuracy of less than 80%. The work in [8] clusters users with the K-means clustering algorithm, which achieves uniform clustering. Panda [9] compared several classification algorithms, among which a random-forest-based ensemble classifier achieves 99% accuracy. Muniyandi et al. [7] proposed a hybrid algorithm using K-means + C4.5 for attack detection, whose detection rate is higher than that of a supervised or an unsupervised algorithm used alone.
The DDoS detection methods in the literature are mainly divided into two categories: unsupervised methods and supervised methods. Depending on the benchmark data set used, three main problems arise:
(1) The false positive rate of unsupervised methods is often high.
(2) Supervised methods cannot handle large volumes or new types of attack network traffic data, and their performance is often limited by noise and irrelevant data.
(3) Because the K-means + C4.5 method relies on the K-means and C4.5 algorithms, which perform worse than other current machine learning algorithms, its detection accuracy and false positive rate leave considerable room for improvement.
Based on the above three problems, this paper proposes a semisupervised learning model combining spectral clustering and random forest to detect WEB DDoS attacks. Compared with existing schemes, it achieves a higher detection rate while keeping the false positive rate low, which makes it more suitable for current WEB DDoS attack detection.
3. Detection Methodology
In this paper, a semisupervised learning model [10–15] that combines unsupervised and supervised learning methods is used to detect WEB DDoS attacks, and the choice of learning methods has a great impact on the performance of this model.
First, the candidate unsupervised models [16–22] include DBSCAN, K-means, and spectral clustering. The DBSCAN algorithm [23] takes a long time to converge when the sample data set is very large, so it is not suitable for big-data network environments. Compared with K-means, spectral clustering is very effective at clustering sparse data, which K-means handles poorly. In addition, because spectral clustering applies dimensionality reduction when processing network traffic data, its complexity on high-dimensional data is lower than that of traditional clustering methods such as K-means. Therefore, this paper chooses spectral clustering as the unsupervised learning algorithm of the semisupervised learning model.
Second, for the supervised model [24, 25], the most commonly used algorithms include SVM, naive Bayes, C4.5, and random forest. Lee et al. [26] compared these classification algorithms and showed that random forest gives the best classification results among them. Panda et al. [6] also compared several supervised algorithms for two-class classification and found that the ensemble classifier based on random forest is optimal and can achieve 99% accuracy. Based on the above research, this paper chooses random forest as the supervised learning algorithm of the semisupervised learning model.
This section applies the semisupervised learning model that combines the spectral clustering algorithm and random forest to detect WEB DDoS attacks. First, the principle and characteristics of spectral clustering are introduced. Then, the principle and advantages of random forest, the classification algorithm used in the model, are introduced. Finally, the design of the WEB DDoS detection framework based on semisupervised learning with spectral clustering and random forest is presented.
3.1. Spectral Clustering Algorithm Model
The clustering algorithm used in this paper is spectral clustering, which is theoretically based on spectral graph theory. Compared with traditional clustering algorithms, spectral clustering can better divide the sample data into clusters with high similarity regardless of the shape of the sample space. The principle of the spectral clustering algorithm [27] is as follows. First, the sample data set is transformed into a similarity matrix that reflects the similarity between samples. Next, the eigenvalues and eigenvectors of the matrix are computed. Finally, the eigenvectors that cluster the data relatively well are selected. This algorithm can converge to the global optimal solution. Spectral clustering was initially studied little in computer applications; its strong clustering ability was first exploited in computer vision and VLSI design. It is now also applied to clustering problems in machine learning and, through the efforts of scholars at home and abroad, has become a popular clustering algorithm.
Spectral clustering algorithms are divided into two types according to the partition criterion: 2-way and k-way. The 2-way methods include the PF algorithm, the SM algorithm, and the Mcut algorithm. Earlier spectral clustering algorithms generally used the 2-way method to partition and cluster data samples. However, most current research finds that partitioning and clustering with more eigenvectors through the k-way method gives better results. Ng et al. [28] proposed the NJW algorithm, a k-way method that computes the k largest eigenvalues of the Laplacian matrix and their corresponding eigenvectors and orthogonalizes the k eigenvectors. This yields the sample space R^{k}, in which each original data point corresponds one-to-one to a point, and clustering is finally performed in R^{k}.
The general process of the spectral clustering algorithm based on the NJW algorithm is shown in Figure 1.
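In this process, memory consumption can be reduced when constructing the Laplacian matrix by writing the intermediate results to disk. When each row vector of the eigenvector matrix is converted into a unit vector, it is calculated by
$$Y_{ij} = \frac{X_{ij}}{\left(\sum_{l=1}^{k} X_{il}^{2}\right)^{1/2}}, \tag{1}$$
where X denotes the matrix whose columns are the k selected eigenvectors and Y is the resulting matrix with unit-length rows.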
When the final clustering step of spectral clustering is performed with K-means, the data sample y_{i} is assigned to cluster j if and only if row i of Y is assigned to cluster j.
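The matrix-construction steps of the NJW procedure described above can be sketched as follows. This is a minimal illustration and not the implementation used in the experiments: the Gaussian affinity, its `sigma` parameter, and all function names are assumptions, and the eigen-decomposition and the final K-means step are left out.

```python
import math

def gaussian_affinity(points, sigma=1.0):
    """Affinity A[i][j] = exp(-||p_i - p_j||^2 / (2 sigma^2)), with A[i][i] = 0.

    The Gaussian kernel and sigma are assumed choices for illustration.
    """
    n = len(points)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            A[i][j] = A[j][i] = math.exp(-d2 / (2.0 * sigma ** 2))
    return A

def normalized_laplacian(A):
    """NJW matrix L = D^{-1/2} A D^{-1/2}, where D is the diagonal degree matrix.

    Assumes every point has a nonzero degree (at least two points, no underflow).
    """
    deg = [sum(row) for row in A]
    n = len(A)
    return [[A[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def normalize_rows(X):
    """Scale each row of the eigenvector matrix X to unit length."""
    Y = []
    for row in X:
        norm = math.sqrt(sum(v * v for v in row))
        # An all-zero row is left unchanged to avoid dividing by zero.
        Y.append([v / norm for v in row] if norm > 0 else list(row))
    return Y
```

In a full pipeline, the k leading eigenvectors of the matrix returned by `normalized_laplacian` would be stacked as the columns of X before `normalize_rows` is applied, and the rows of the result would then be clustered with K-means.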
3.2. Random Forest Algorithm Model
The random forest [29] builds on the basic idea of bagging: it trains a series of decision trees and improves on bagging according to the characteristics of decision trees. During training, it adopts random attribute selection to increase the relative independence of the constructed decision trees and thereby improve performance. Assuming that a node has d candidate attributes, a traditional decision tree selects the best attribute from all d attributes of the node, whereas each node of a decision tree in the random forest first randomly selects a subset of k attributes and then chooses the best attribute from this subset. The value of k determines the degree of randomness and is usually set to k = log_{2} d. In addition, k can also be 1 or d, which correspond, respectively, to selecting one attribute at random and to the selection method of a conventional decision tree. The specific flow of the random forest algorithm is shown in Algorithm 1.
It can be seen from the training process that random forest makes only minor changes to bagging: it adds randomness in the feature attributes on top of the randomness in the samples, which further improves the generalization of the final ensemble. Because the random forest algorithm has low computational complexity, is easy to apply to classification problems, and often exhibits strong performance in practice, this paper uses random forest as the classifier in the model.
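The two sources of randomness mentioned above, random samples from bagging and random attribute selection at each node, can be illustrated with a short sketch. The function names are hypothetical, and the default k = log2(d), rounded down, is an assumed choice made for this example rather than a setting confirmed by the experiments.

```python
import math
import random

def bootstrap_sample(n_samples, rng):
    """Bagging step: draw n_samples row indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def candidate_attributes(d, rng, k=None):
    """Random attribute selection at one tree node: pick k of the d attributes.

    If k is not given, it defaults to log2(d) rounded down and at least 1
    (an assumption for this sketch). k = d reproduces a conventional
    decision tree, and k = 1 selects a single attribute at random.
    """
    if k is None:
        k = max(1, int(math.log2(d)))
    return rng.sample(range(d), k)
```

With the 41 features of NSL-KDD, `candidate_attributes(41, random.Random(0))` returns 5 distinct attribute indices; the best split would then be searched only among those attributes.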

3.3. Attack Detection Model Framework Based on Semisupervised Learning
The detection model proposed in this paper is based on semisupervised learning. The spectral clustering algorithm introduced in Section 3.1 is used as the unsupervised learning algorithm in the model [30–39], and the random forest algorithm introduced in Section 3.2 is used as the supervised learning algorithm. Through the cooperation of these two algorithms, this paper constructs a WEB DDoS attack detection framework based on semisupervised learning. The basic framework and process design are as follows. Because the framework is built on machine learning algorithms, it resembles traditional machine learning approaches [40–45] in comprising a training process and a detection process; the general flow of these two processes is shown in Figure 2.
For the training phase, the data set S is defined as (X_{i}, Y_{i}), i = 1, 2, …, N, where X_{i} represents an N-dimensional feature vector and Y_{i} ∈ {0, 1}, with 0 representing normal traffic and 1 representing abnormal traffic. In the training process, the training data set is first divided into k disjoint clusters by spectral clustering. A separate random forest is then trained for each cluster on the data in that cluster.
For the detection phase, spectral clustering is used to determine which of the k clusters a test sample belongs to, and the random forest classifier corresponding to that cluster then determines whether the sample is normal or abnormal.
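The control flow of the two phases, one classifier per cluster during training and routing by cluster during detection, can be sketched as follows. To keep the example self-contained, both learning components are replaced by simple stand-ins that are assumptions of this sketch: `nearest_centroid` takes the place of the spectral clustering assignment, and `MajorityClassifier` takes the place of a random forest. All names are hypothetical.

```python
from collections import Counter, defaultdict

class MajorityClassifier:
    """Stand-in for a per-cluster random forest: predicts the most common label."""

    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.label

def nearest_centroid(x, centroids):
    """Stand-in for the spectral clustering assignment of a test sample."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
    return dists.index(min(dists))

def train(X, y, cluster_ids, make_classifier=MajorityClassifier):
    """Training phase: fit one classifier per cluster on that cluster's samples."""
    groups = defaultdict(lambda: ([], []))
    for xi, yi, ci in zip(X, y, cluster_ids):
        groups[ci][0].append(xi)
        groups[ci][1].append(yi)
    return {ci: make_classifier().fit(Xc, yc) for ci, (Xc, yc) in groups.items()}

def detect(x, centroids, classifiers):
    """Detection phase: find the sample's cluster, then query its classifier."""
    return classifiers[nearest_centroid(x, centroids)].predict(x)
```

Any classifier object that offers `fit` and `predict` in this shape could be passed as `make_classifier`, so a real random forest implementation could replace the stand-in without changing the routing logic.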
4. Experiments
4.1. Experimental Environment
Table 1 lists the hardware and software environments used in this experiment.
4.2. Experimental Program
4.2.1. Extraction of Data Set
This experiment uses fivefold cross-validation. A total of 50,000 records are extracted from the NSL-KDD data set and divided into 5 equal parts. Each subdata set is divided into four categories according to the upper-layer service type: HTTP, SMTP, FTP, and others. Each category contains 40% attack data. The details of the data contained in each subdata set are shown in Table 2.
According to the k-fold cross-validation principle, each experiment selects one subdata set that was not selected in any previous experiment to test the trained model, and the remaining subdata sets are used for model training. A total of k experiments are performed in this way, and the performance of the model is given by the average over the k experiments. The principle of the fivefold cross-validation algorithm and the data set of this experiment are shown in Figures 3 and 4, respectively.
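The fold rotation described above can be sketched as an index split. This is a simplified illustration with hypothetical function names: it splits contiguous index ranges and does not reproduce the per-service-type composition of the subdata sets.

```python
def k_fold_splits(n_samples, k=5):
    """Yield (train_indices, test_indices) so that each fold is the test set once."""
    fold_size = n_samples // k
    indices = list(range(n_samples))
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any remainder when n_samples is not divisible by k.
        stop = start + fold_size if fold < k - 1 else n_samples
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test

def average(values):
    """Mean of the per-fold scores, reported as the model's performance."""
    return sum(values) / len(values)
```

For 50,000 records and k = 5, each round trains on 40,000 records and tests on the remaining 10,000.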
4.2.2. Data Preprocessing
The learning model's classification rules are learned from the labeled connections in the data set. A connection consists of the TCP data messages sent and received by the same IP address in a unit of time, and each connection is labeled as normal or abnormal. The features of the NSL-KDD data set are of discrete and continuous types, and their value ranges differ. Therefore, these features must be preprocessed. The preprocessing consists of converting the discrete feature variables to continuous ones and normalizing the data. The two processes are described as follows.
First, the discrete feature variables must be made continuous. The NSL-KDD data set contains both continuous and discrete variables, and the discrete feature variables cannot be used directly in numerical computation, so they must be converted before the data is applied to the model. According to statistics, NSL-KDD contains 7 discrete feature variables, 5 of which can be represented by the values 0 or 1, namely, is_guest_login, logged_in, land, flag, and is_host_login. The service and protocol_type feature variables require special conversion because they take several different values. The specific conversion methods are shown in Tables 3 and 4.
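As an illustration of converting a multi-valued discrete feature such as protocol_type into numeric form, the following sketch assigns an integer code to each distinct value. The codes produced here (consecutive integers in sorted order) and the function names are assumptions made for the example; they are not the mappings given in Tables 3 and 4.

```python
def build_codes(values):
    """Map each distinct category to a consecutive integer, in sorted order.

    This coding scheme is a placeholder, not the paper's conversion table.
    """
    return {v: i for i, v in enumerate(sorted(set(values)))}

def encode_column(values, codes):
    """Replace every categorical value in a column with its numeric code."""
    return [float(codes[v]) for v in values]
```

For example, a column containing the protocol values tcp, udp, and icmp would be coded as icmp = 0, tcp = 1, and udp = 2 under this placeholder scheme.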
This paper classifies data samples by computing the degree of similarity between them, and previous study of the data set shows that it contains many feature attributes whose ranges and units differ. Data normalization is therefore required so that the computed similarity better represents the differences between samples. Data normalization refers to scaling feature attribute data proportionally so that its values fall within a common range, e.g., [−1, 1] or [0, 1]. This experiment uses the z-score method, which rescales each feature to zero mean and unit standard deviation.
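A minimal sketch of z-score normalization for a single feature column is given below. The use of the population standard deviation and the handling of constant columns are assumptions of this example.

```python
import statistics

def z_score(values):
    """Rescale one feature column to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]
```

Applying the function column by column puts features with very different units, such as byte counts and connection durations, on a comparable scale before similarities are computed.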
4.2.3. Performance Criteria
The performance indicators used to evaluate the experimental results are calculated from the standard confusion matrix. For the sample data of this experiment, the confusion matrix is shown in Table 5. True positive (TP) refers to a record that is correctly classified as attack traffic, false positive (FP) refers to a record that is misclassified as attack traffic, true negative (TN) refers to a record that is correctly classified as normal traffic, and false negative (FN) refers to a record that is misclassified as normal traffic. The formulas for the performance indicators used are defined as follows:
$$\text{Detection rate} = \frac{TP + TN}{N}, \tag{2}$$
$$\text{Precision} = \frac{TP}{TP + FP}, \tag{3}$$
$$\text{TPR} = \frac{TP}{TP + FN}, \tag{4}$$
$$\text{FPR} = \frac{FP}{FP + TN}. \tag{5}$$
In these formulas, N refers to the total number of data samples. Formula (2) is the detection rate, which is the ratio of correctly classified normal and abnormal data to the total data. Formula (3) is the precision, which is the proportion of records classified as attacks that are actually attacks and reflects the ability of the model to identify attack data. Formula (4) is the true positive rate, which represents the proportion of correctly identified attack data instances among all attack data. The higher the values of these three indicators, the better the model. Formula (5) is the false positive rate, which is the proportion of normal data misclassified as attack data among all normal data. The lower the value, the better the model.
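The four indicators can be computed directly from the confusion-matrix counts. The sketch below uses the standard definitions of these quantities, which are assumed here to match formulas (2) to (5); the function name and the example counts are hypothetical and are not experimental results.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute the four indicators from confusion-matrix counts.

    Assumes every denominator is nonzero, i.e., the data contain both
    attack and normal records and at least one record is flagged as attack.
    """
    n = tp + fp + tn + fn  # total number of data samples
    return {
        "detection_rate": (tp + tn) / n,  # correctly classified / all samples
        "precision": tp / (tp + fp),      # true attacks / flagged as attack
        "tpr": tp / (tp + fn),            # detected attacks / all attacks
        "fpr": fp / (fp + tn),            # false alarms / all normal records
    }
```

As an arithmetic check with made-up counts, TP = 90, FP = 10, TN = 890, and FN = 10 give a detection rate of 0.98, a precision of 0.9, a true positive rate of 0.9, and a false positive rate of 10/900, or about 0.011.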
4.3. Experimental Results Analysis
After extraction and preprocessing, the NSL-KDD data set is applied to the semisupervised learning model proposed in this paper, and the performance of the proposed model is compared with that of the spectral clustering algorithm, the K-means algorithm, and K-means + C4.5. As shown in Figures 5–7, the spectral clustering algorithm performs better than the K-means algorithm in terms of detection rate, precision, and true positive rate. The K-means + C4.5 detection method is better than K-means or spectral clustering alone. Compared with the other methods, the semisupervised learning model based on spectral clustering and random forest proposed in this paper is the best in detection rate, precision, and true positive rate.
The false positive rate is the proportion of normal traffic that is misclassified as attack traffic. A low false positive rate is an important performance index for evaluating a detection algorithm. The results, averaged over the experiments for each of the above methods, are shown in Figure 8. The proposed semisupervised learning detection model has a low false positive rate, which is basically the same as that of K-means + C4.5. Because its detection rate, precision, and true positive rate are higher than those of K-means + C4.5, the semisupervised learning detection model is more advantageous.
The experimental results show that the semisupervised learning model proposed in this paper has high accuracy, a low false positive rate, and good overall performance. It is more suitable for detecting WEB DDoS attacks than the other detection models compared.
According to the experimental results, the proposed method maintains a relatively low false positive rate, which is superior to that of unsupervised methods, and it can effectively detect new types of attack in network traffic data. Additionally, the proposed method outperforms the hybrid method, K-means + C4.5, in terms of TPR, FPR, and precision.
5. Conclusion
To improve the detection rate of existing WEB DDoS attack detection models, this paper proposes a semisupervised learning model based on spectral clustering and random forest. First, because traffic features are important to the detection scheme, we focus on selecting better features to apply to the proposed detection model. Then, we analyze the spectral clustering algorithm and the random forest algorithm in detail. Based on their principles and advantages, spectral clustering and random forest are combined to form a semisupervised learning WEB DDoS attack detection model. Finally, the proposed model is compared experimentally with other existing detection schemes. The results verify that the proposed semisupervised learning model improves the detection rate while ensuring a low false positive rate and is more suitable for the detection of WEB DDoS attacks. In future work, we will improve the detection model and try other machine learning methods.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare no conflicts of interest.
Acknowledgments
This research was supported by the Guangdong Province Key Area R&D Program of China (Grant nos. 2019B010137004 and 2019B010136003), the National Natural Science Foundation of China (Grant nos. 61972108 and 62072131), the National Key Research and Development Plan (Grant no. 2018YFB0803504), Guangdong Province Universities and Colleges Pearl River Scholar Funded Scheme (2019), the Science and Technology Projects in Guangzhou (Grant no. 202102010442), and the Science and Technology Project of Taizhou (2003gy15 and 20ny13).