Towards Effective Network Intrusion Detection: A Hybrid Model Integrating Gini Index and GBDT with PSO
To protect computing systems from malicious attacks, network intrusion detection systems have become an important part of the security infrastructure. Recently, hybrid models that integrate several machine learning techniques have attracted increasing attention from researchers. In this paper, a novel hybrid model is proposed for detecting network intrusions effectively. In the proposed model, the Gini index is used to select the optimal subset of features, the gradient boosted decision tree (GBDT) algorithm is adopted to detect network attacks, and the particle swarm optimization (PSO) algorithm is utilized to optimize the parameters of GBDT. The performance of the proposed model is experimentally evaluated in terms of accuracy, detection rate, precision, F1-score, and false alarm rate using the NSL-KDD dataset. Experimental results show that the proposed model achieves higher accuracy, detection rate, and F1-score than the compared methods.
1. Introduction
With the rapid development of the Internet, a variety of network security problems have continued to occur. At the same time, many security mechanisms have been implemented to maintain the security of computer systems. Among these mechanisms, intrusion detection systems (IDSs) play a vital role in protecting computing infrastructures from attackers and intruders.
The primary task of an IDS is to discover and block abnormal connections in network traffic. In general, intrusion detection approaches can be divided into two categories: misuse-based detection and anomaly-based detection [1–3]. A misuse-based detection method stores signatures of already known intrusive behaviors in a database in advance and determines whether a network connection is an attack by matching its characteristics against the signatures. If a connection matches a signature, it is treated as an attack. These methods are effective in identifying well-known network attacks with a low false alarm rate. However, they fail when they meet unknown intrusions whose properties do not match any signature in the database. To address this problem, a misuse-based method needs to update its database regularly, but obtaining the signatures of new intrusive behaviors is usually expensive. Conversely, an anomaly-based detection method trains models of normal behavior and detects network intrusions by identifying traffic that deviates significantly from the normal profile. The basic hypothesis in anomaly-based detection methods is that the characteristics of an abnormal connection are far from those of normal connections. In an anomaly-based detection system, we no longer need to store signatures of attack patterns. Anomaly-based detection methods can recognize not only known intrusive behaviors but also unknown future attacks, which is an advantage over misuse-based methods. Nevertheless, they may misclassify some normal traffic that lies on the boundary between normal and abnormal behavior. Because new attacks keep emerging, anomaly-based methods have attracted more attention from researchers.
To achieve reliable detection results, researchers have made great efforts. In the early stages, rule-based expert systems and statistical methods were applied in intrusion detection systems. However, the poor performance of these approaches on large-scale network traffic restricts their application to small datasets. In essence, anomaly-based intrusion detection is a classification problem. Hence, researchers have sought to develop solutions for IDSs by utilizing various machine learning techniques, such as decision trees, support vector machines (SVMs), naive Bayes, and artificial neural networks (ANNs). When developing an IDS, the primary goal is to achieve the best possible accuracy. To this end, many hybrid methods have been proposed to further enhance the accuracy of intrusion detection compared with individual machine learning approaches [4, 9, 10]. The basic idea behind a hybrid model is to improve the detection performance by combining several machine learning techniques. In general, existing hybrid models mainly take the form of ensemble classifiers [9, 11, 12], clustering plus classification [13–15], and feature selection plus classification [3, 4, 16–21].
As mentioned above, hybrid models offer a new approach to network intrusion detection. In this paper, a novel hybrid model, namely, GINI-GBDT-PSO, is proposed as a high-accuracy solution for IDSs. The proposed method integrates the Gini index, the GBDT (gradient boosted decision trees) algorithm, and the PSO (particle swarm optimization) algorithm. GBDT is also known by other names, e.g., GBRT (gradient boosted regression tree), MART (multiple additive regression tree), and TreeNet. Feature selection, which extracts an optimal subset of features to represent the whole dataset, is a key step in intrusion detection because training a classifier on the optimal subset can not only reduce the learning time but also improve its accuracy. In the proposed method, we adopt the Gini index, which has been used for dimensionality reduction in text mining [24–26], to extract significant features from the original datasets. After feature selection, the proposed method uses the GBDT algorithm to train a prediction model on the optimal feature space. The GBDT algorithm is a powerful supervised learning method that integrates the gradient boosting framework and the decision tree technique into one ensemble model. Given its successful applications in many research fields, such as disease modeling [27, 28], web-search ranking [29, 30], and travel time prediction, we argue that the GBDT algorithm is well suited to intrusion detection. The parameters of GBDT, for instance, the learning rate ν, are optimized by the PSO algorithm. Finally, we experimentally study the performance of the GINI-GBDT-PSO method on the NSL-KDD dataset and compare it with several baselines, including single classification methods and hybrid models. Experimental results show that the proposed hybrid method improves the detection performance in comparison with the baselines.
The rest of the paper is organized as follows. Section 2 gives the criteria used to evaluate the performance of intrusion detection models. The details of the proposed hybrid method are described in Section 3. In Section 4, the performance of the proposed method is experimentally evaluated and compared with several baselines. Section 5 briefly reviews some hybrid intrusion detection models. Finally, Section 6 concludes this work.
2. Evaluation Criteria
Table 1 shows a confusion matrix that is used to represent the information related to the actual and predicted classifications performed by a detection model. In Table 1, TP is the number of abnormal connections that are correctly detected; TN is the number of normal connections that are correctly identified; FP is the number of normal connections that are incorrectly classified as abnormal; FN is the number of abnormal connections that are incorrectly judged as normal.
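In this paper, the performance of intrusion detection models is evaluated by five widely used measures: accuracy, precision, detection rate (DR), F1-score, and false alarm rate (FAR) [14, 33, 34]. These evaluation measures are defined as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{1}$$
$$\text{Precision} = \frac{TP}{TP + FP}, \tag{2}$$
$$\text{DR} = \frac{TP}{TP + FN}, \tag{3}$$
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{DR}}{\text{Precision} + \text{DR}}, \tag{4}$$
$$\text{FAR} = \frac{FP}{FP + TN}. \tag{5}$$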
The accuracy and detection rates evaluate the capability of an intrusion detection model to correctly predict connections and detect abnormal events, respectively. The precision reflects the proportion of real abnormal events in all returned abnormal connections. The F1-score is the harmonic mean of precision and DR, which balances the precision and DR of a detection model. The false alarm rate measures the ratio of normal connections incorrectly classified as abnormal to total normal connections. Therefore, higher values of accuracy, detection rate, precision, and F1-score and lower values of false alarm rate show better detection performance for network intrusion detection models.
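As a minimal sketch, the five measures can be computed directly from the confusion-matrix counts described above. The function name `detection_metrics` and the example counts are hypothetical and chosen only for illustration; the zero-denominator guards are an added convention, not something the paper specifies.

```python
def detection_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, DR, F1-score, and FAR from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators (assumed convention: report 0.0).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    detection_rate = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + detection_rate
    f1 = 2 * precision * detection_rate / denom if denom else 0.0
    far = fp / (fp + tn) if (fp + tn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "detection_rate": detection_rate,
        "f1": f1,
        "far": far,
    }


# Hypothetical counts: 100 abnormal and 100 normal connections.
metrics = detection_metrics(tp=80, tn=90, fp=10, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these made-up counts, a high precision coexists with a lower detection rate, which is the situation the F1-score is meant to balance.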
3. The Proposed Method
As mentioned in the Introduction, this work aims to improve the performance of network intrusion detection by constructing a hybrid model. To this end, we propose the GINI-GBDT-PSO method, which combines the Gini index, the GBDT algorithm, and the PSO method into one framework. In the proposed model, we use the Gini index to select important features from the original datasets and then adopt GBDT as the classifier to detect abnormal events on the feature-filtered datasets. Because manually configuring the parameters of GBDT is not easy, we use PSO as an optimizer to find high-quality parameters for GBDT.
For the sake of clarity, the stages of the system framework are listed as follows:
(1) Dataset preprocessing
(2) Feature selection using the Gini index
(3) Training a prediction model with GBDT optimized by PSO
(4) Identifying network intrusions with the prediction model
3.1. NSL-KDD Dataset
In this study, the NSL-KDD dataset is used as the benchmark. NSL-KDD is an improved version of the popular KDDCup99 dataset (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html), which solves some inherent problems of the KDDCup99 dataset. Each instance in the NSL-KDD dataset is a TCP/IP connection record described by 41 features and classified as one of the following classes: normal event, denial of service (DoS) attack, probe attack, user to root (U2R) attack, and remote to local (R2L) attack. A detailed description of the NSL-KDD dataset can be found in [35–37]. Table 2 lists the number of instances in the training and testing sets. As seen in Table 2, the number of instances in the training and testing sets of NSL-KDD is reasonable, which makes it affordable to conduct experiments on the whole dataset. In contrast, researchers have usually run experiments on a randomly selected small portion of the KDDCup99 dataset, which may cause inconsistent evaluation results. In addition, the NSL-KDD training set is free of redundant records and the testing set is free of duplicate records, so classifiers will not be biased towards more frequent records.
There are three symbolic, two binary, and 36 continuous features among the 41 features in the NSL-KDD dataset. To use the proposed method, symbolic features must be converted into numeric features. In this paper, a simple scheme from the literature is adopted to handle symbolic features. The scheme maps symbolic values to integer values in the range from 1 to M, where M is the number of distinct symbols for a feature. For class labels, normal is mapped to 1, DoS to 2, probe to 3, U2R to 4, and R2L to 5. Furthermore, we normalize the values of each feature into [0, 1].
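The two preprocessing steps described above can be sketched as follows. This is an illustration under stated assumptions: the text does not say in which order symbols receive their integers (order of first appearance is assumed here), nor which normalization is used (min-max scaling is assumed). The helper names `encode_symbolic` and `min_max_normalize` are hypothetical.

```python
def encode_symbolic(column):
    """Map each distinct symbol to an integer in 1..M.

    Assumption: integers are assigned in order of first appearance.
    """
    mapping = {}
    for value in column:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return [mapping[v] for v in column], mapping


def min_max_normalize(column):
    """Scale numeric values into [0, 1] (assumed min-max normalization)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature carries no information; map it to 0.0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Toy column of a symbolic feature with M = 3 distinct symbols.
protocol = ["tcp", "udp", "tcp", "icmp"]
codes, mapping = encode_symbolic(protocol)
scaled = min_max_normalize(codes)
print(codes)   # integer codes in 1..M
print(scaled)  # the same column scaled into [0, 1]
```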
3.2. Gini Index
Usually, the dataset for network intrusion detection contains many features. However, not every feature contributes to the task of detecting intrusion. Feature selection, which can remove redundant or irrelevant features, is a crucial step for intrusion detection. Based on the optimal feature space, we can not only enhance the speed of training a classifier for network intrusion detection but also improve its detection performance.
The goal of feature selection is to get a group of significant features from the whole dataset, such that these selected features are very important for training a classification model.
To this end, we use the Gini index for feature selection. The Gini index, which was developed in 1912 by Corrado Gini, an Italian statistician and sociologist, was originally used to measure the statistical dispersion of income distribution across different population sectors. In data mining research, the Gini index has been widely used in text mining [24–26].
Given a dataset D, which includes instances from l classes $C_1, C_2, \ldots, C_l$, let f be a feature that has k distinct values in D. According to the values of f, we divide D into k disjoint subsets $D_1, D_2, \ldots, D_k$. The score of feature f measured by the Gini index is defined as follows:
$$G(f) = \sum_{i=1}^{k} \frac{|D_i|}{|D|} \left(1 - \sum_{j=1}^{l} p_{i,j}^2\right), \tag{6}$$
where $p_{i,j}$ is the probability that an instance in $D_i$ belongs to class $C_j$.
The range of G(f) is [0, 1]. The smaller G(f) is, the more important feature f is. If, for every $D_i$, all instances in it belong to the same class, then $\sum_{j=1}^{l} p_{i,j}^2 = 1$ and G(f) = 0. In this case, feature f has the strongest discriminative power.
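A minimal sketch of the feature score follows, assuming the size-weighted form of the Gini index: the impurity of each subset, one minus the sum of squared class probabilities, is weighted by the fraction of instances that fall into that subset. The function name `gini_score` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def gini_score(feature_values, labels):
    """Size-weighted Gini score of one feature; lower means more discriminative."""
    # Split the dataset into disjoint subsets, one per distinct feature value.
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)

    n = len(labels)
    score = 0.0
    for subset_labels in groups.values():
        counts = Counter(subset_labels)
        size = len(subset_labels)
        # Impurity of the subset: 1 - sum_j p_{i,j}^2.
        impurity = 1.0 - sum((c / size) ** 2 for c in counts.values())
        score += size / n * impurity
    return score


labels = ["normal", "normal", "dos", "dos"]
# Each feature value maps to a single class: perfectly discriminative.
pure = gini_score(["a", "a", "b", "b"], labels)
# Each feature value is split evenly between the classes: uninformative.
mixed = gini_score(["a", "b", "a", "b"], labels)
print(pure, mixed)
```

The first toy feature separates the classes completely and scores 0, while the second leaves every subset evenly mixed and scores 0.5.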
3.3. GBDT Algorithm
Gradient boosting is a machine learning technique for classification and regression problems, which produces a strong ensemble model by combining a series of weak prediction models in an iterative fashion. The GBDT (gradient boosted decision trees) algorithm is a powerful ensemble learning algorithm, which extends and enhances the classification and regression tree model according to gradient boosting. The GBDT algorithm constructs decision trees iteratively. In each iteration, a decision tree is trained on the residuals left by the previous trees. The final result is the accumulation of the predictions of all trees.
Given the training samples $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is a sample and $y_i$ denotes the label of sample $x_i$, let $F(x)$ be a prediction model and $L(y, F(x))$ be a loss function. For any sample $x_i$, $F(x_i)$ is the prediction of $x_i$, and $L(y_i, F(x_i))$ is the loss between $y_i$ and $F(x_i)$. The goal of GBDT is to learn an optimal model F such that $\sum_{i=1}^{n} L(y_i, F(x_i))$ is minimized for a specified loss function L.
To do that, the GBDT algorithm first builds an initial decision tree $f_0$ and then iteratively constructs m new trees. In each iteration, a new tree is added to reduce the residuals, which are obtained from the given loss function L. Therefore, the optimal model F of GBDT can be defined as follows:
$$F(x) = f_0(x) + \nu \sum_{t=1}^{m} \rho_t f_t(x), \tag{7}$$
where m is the number of iterations and ν is the shrinkage parameter that controls the learning rate of GBDT. $f_t$ is the tree trained in the tth iteration, and $\rho_t$ is the weight of $f_t$.
The procedure of the GBDT algorithm is shown in Algorithm 1. In line 1, the initial tree $f_0$ is set; this paper uses a simple fixed initialization. Line 3 calculates the pseudoresiduals, which represent the difference between the real value and the predicted value. In GBDT, the loss function is applied to obtain the residuals at each iteration. Next, a new tree is trained on the pseudoresiduals in line 4, and the weight of this tree is determined in line 5. Then, the prediction model is updated in line 6.
In the GBDT algorithm, the least absolute deviation function, the least square function, and the Huber function are commonly used as the loss function. In this paper, we adopt the least square function, which is defined as follows:
$$L(y, F(x)) = \frac{1}{2}\left(y - F(x)\right)^2. \tag{8}$$
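The boosting loop can be illustrated with a deliberately small sketch. It is not the paper's Algorithm 1: it assumes a zero initial model, uses one-split regression trees (stumps) on a single feature as the weak learners, and uses the least square loss, for which the pseudoresiduals are simply the label minus the current prediction. The names `fit_stump` and `gbdt_fit` and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree on a single feature by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    if best is None:
        # All feature values identical: predict the mean residual.
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean


def gbdt_fit(xs, ys, m=100, nu=0.1):
    """Gradient boosting with least square loss; m iterations, shrinkage nu."""
    trees = []                   # (weight, tree) pairs
    preds = [0.0] * len(xs)      # assumed initial model: predict 0 everywhere
    for _ in range(m):
        # Pseudoresiduals for the least square loss: label minus prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        h = [tree(x) for x in xs]
        # Tree weight by a closed-form line search for the least square loss.
        denom = sum(v * v for v in h)
        rho = sum(r * v for r, v in zip(residuals, h)) / denom if denom else 0.0
        trees.append((rho, tree))
        preds = [p + nu * rho * v for p, v in zip(preds, h)]
    return lambda x: sum(nu * rho * tree(x) for rho, tree in trees)


# Toy one-dimensional data: label 0 for small values, 1 for large values.
xs = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = gbdt_fit(xs, ys)
print(model(0.1), model(0.9))
```

Because of the shrinkage factor, each iteration removes only a fraction of the remaining residual, so the prediction approaches the label gradually rather than in one step.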
3.4. PSO Algorithm
Inspired by the population behavior of bird flocking and fish schooling, Kennedy and Eberhart proposed the PSO (particle swarm optimization) algorithm in 1995. This algorithm is a derivative-free, zero-order method and thus can be applied to solve a variety of optimization problems by simulating social behavior. Compared to other optimization algorithms, the PSO technique is easy to implement, robust, and scalable and can quickly find approximately optimal solutions [34, 39]. In the research of IDSs, PSO is a popular optimization technique [12, 34, 40, 41].
At the outset, a group of particles, which represents a population of candidate solutions, is generated with random positions and velocities in the problem space. Then, each particle flies through the problem space by following the particle that occupies the best-known position. To measure how good a position is, a fitness value is assigned to a particle according to its position. Thus, the optimization problem becomes finding the position that has the best fitness value. In this work, we define the detection accuracy (see (1)) as the fitness function. To reach the optimization goal, the PSO algorithm iteratively updates the position and velocity of each particle in the problem space, guided by the best position the particle has identified so far as well as the best position found by the entire particle swarm.
Formally, in the n-dimensional problem space, the position and velocity of the ith particle at the tth iteration are defined as $x_i^t = (x_{i1}^t, \ldots, x_{in}^t)$ and $v_i^t = (v_{i1}^t, \ldots, v_{in}^t)$, respectively. The particle then changes its position and velocity according to its current position and velocity, the best-known position identified by itself so far (denoted as $p_i^t$), and the global best-known position identified by the entire particle swarm so far (denoted as $g^t$). Equations (9) and (10) show the calculation of the velocity and position, respectively:
$$v_i^{t+1} = \omega v_i^t + c_1 r_1 \left(p_i^t - x_i^t\right) + c_2 r_2 \left(g^t - x_i^t\right), \tag{9}$$
$$x_i^{t+1} = x_i^t + v_i^{t+1}, \tag{10}$$
where ω represents the inertia weight, c1 is the particle acceleration factor, and c2 is the population acceleration factor. r1 and r2 are two independent random numbers between 0 and 1.
The detailed process of the PSO algorithm is described in Algorithm 2. In Algorithm 2, N is the total number of particles in the population, and T is the total number of iterations. fitness(·) is the function that gives the fitness value of a position.
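The update loop can be sketched in a few lines, using the standard PSO velocity and position updates with a fitness that is maximized. This is an illustration rather than the paper's Algorithm 2: the parameter values, the clamping of positions to the search bounds, the initial velocity scale, and the per-dimension random draws are all assumptions, and the toy fitness stands in for the detection accuracy that the paper uses. The name `pso` and its signature are hypothetical.

```python
import random


def pso(fitness, bounds, n_particles=20, n_iters=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Maximize `fitness` over a box given as a list of (low, high) pairs."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial positions and small random initial velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.1 * rng.uniform(-(hi - lo), hi - lo) for lo, hi in bounds]
           for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + pull to own best + pull to swarm best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Position update, clamped to the search box (assumed convention).
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit


def toy_fitness(p):
    """Toy objective with its maximum (0.0) at the point (0.3, 0.7)."""
    return -((p[0] - 0.3) ** 2 + (p[1] - 0.7) ** 2)


best, best_fit = pso(toy_fitness, [(0.0, 1.0), (0.0, 1.0)])
print(best, best_fit)
```

To tune a classifier in this style, each particle position would encode one candidate parameter setting and the fitness function would train and score a model for it, which makes every fitness evaluation expensive.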
4. Experiments
This section experimentally studies the detection performance of the proposed hybrid model (i.e., GINI-GBDT-PSO) in comparison with six baselines. The six baselines comprise three individual classification algorithms, that is, SVM, random forests (RF), and C4.5, and three hybrid models, that is, FC-ANN, CFA-DT, and IGCR-ANN. The NSL-KDD dataset is the benchmark. All experiments are implemented in the MATLAB 7.0 environment.
4.1. Setting of GINI-GBDT-PSO
For the proposed method, before training an intrusion detection model, the first task is to extract the optimal subset of features. The Gini index was originally used to determine the degree of inequality of income distribution across different population sectors. The international community has usually used 0.4 as the warning line for the gap between rich and poor. In our experiments, we consider a feature whose Gini index score is less than or equal to 0.4 to be an important feature. As a result, 18 important features are selected from the NSL-KDD dataset, which are listed in Table 3.
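The selection rule above amounts to a simple threshold filter over per-feature scores. In the sketch below, the function name `select_features`, the feature names, and the scores are all made up for illustration; they are not the values behind Table 3.

```python
def select_features(gini_scores, threshold=0.4):
    """Keep the features whose Gini index score is at most the threshold."""
    return [name for name, score in gini_scores.items() if score <= threshold]


# Hypothetical scores, not taken from the paper.
scores = {"feature_a": 0.12, "feature_b": 0.40, "feature_c": 0.55}
selected = select_features(scores)
print(selected)
```

A feature that scores exactly 0.4 is kept, because the rule is stated as less than or equal to the threshold.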
To train the best possible prediction model, the parameters of the GBDT algorithm are optimized by PSO. However, the PSO algorithm also has several parameters (see (9)). Some papers discuss the values of ω, c1, and c2 [34, 46, 47]. In this paper, we set the values of ω, c1, and c2 following previously suggested settings. The details of the parameter setting for PSO are outlined in Table 4. With this parameter setting of PSO, the optimized parameters for GBDT are obtained, which are presented in Table 5. In Table 5, the last parameter (i.e., minimum number of leaf) is set for the decision trees used in GBDT.
4.2. Results and Discussion
First, we analyze the performance of SVM, random forest, C4.5, and FC-ANN with respect to different feature sets. These four methods do not integrate feature selection; thus, we run them with all 41 features and with the 18 selected features in Table 3. The results are listed in Table 6. For a method M, M41 and M18 in Table 6 denote M with the 41 and 18 features, respectively. It is evident from Table 6 that each method performs better with the 18 features than with all features. These results are easy to understand, since the 18 features are more discriminative than the others. In the remainder of this section, the results of these four methods are all obtained with the 18 features.
Next, the detection performance of the proposed method is evaluated in comparison with the six competing methods. Table 7 shows the overall performance of the different intrusion detection approaches in terms of various criteria. The best result for each criterion is highlighted in boldface. Our model yields the highest accuracy, detection rate, and F1-score, markedly outperforming its counterparts. It can be seen from Table 7 that all methods achieve very high precision. Although the precision of FC-ANN is the highest, its detection rate is the lowest. As a result, its F1-score is only 73.64%, which is very low compared with the others. For the false alarm rate, the proposed method does not perform well. In the future, we need to further reduce the false alarm rate of the proposed method.
After presenting the overall performance of the seven intrusion detection models, we show their detection performance for the four attack types in Figure 1. We can observe from Figure 1(a) that the proposed method outperforms all compared methods in terms of detection rate for all attack types. For the two low-frequency attack types, that is, R2L and U2R, the corresponding values of the proposed method are 34.44% and 14.66%, respectively. In contrast, all compared methods, especially SVM and FC-ANN, exhibit poor performance on U2R and R2L attacks. One reason is that the proportions of samples of these two attacks in the training set are very low (see Table 2). However, the main reason is that the capability of these models to detect low-frequency attacks is insufficient. Figure 1(b) reports the precision of each detection model for the four attack types. In Figure 1(b), the proposed method achieves the highest precision for R2L attacks; however, its precision in detecting DoS, probe, and U2R attacks is not good enough. The RF method yields the highest precision for DoS and probe attacks, but its precision for U2R and R2L is very low. To fairly evaluate the capabilities of all methods for individual attack types, Figure 1(c) depicts the F1-scores obtained by all methods. Figure 1(c) shows that the proposed model performs the best for DoS, U2R, and R2L attacks and achieves nearly the best F1-score for probe attacks. RF performs the best for probe attacks and the second best for DoS attacks, but its F1-scores for U2R and R2L attacks are very low.
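Figure 1: Detection performance of the seven models for the four attack types: (a) detection rate; (b) precision; (c) F1-score.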
In Table 8, we show the harmonic mean of each measure over the different attack types. Since some values are 0 (see Figure 1), we add a small constant, 0.01, to each value when computing the harmonic mean. It is clear from Table 8 that the proposed method ranks first in terms of both DR and F1-score and third with respect to precision. For DR and F1-score, our method outperforms the compared methods. Although the precision of our method is lower than that of C4.5 and CFA-DT, it is far better than that of the other four.
Finally, we compare our method using the 18 selected features with our method using all 41 features. The results for detecting abnormal connections are shown in Figure 2. Clearly, the performance with the 18 features is better than that with 41 features. This indicates that feature selection is an important technique in intrusion detection. Consequently, the combination of the Gini index, the GBDT algorithm, and the PSO technique is a competitive detection model.
In summary, the above experimental results demonstrate that our hybrid model gives better overall performance than single classification methods (i.e., SVM, C4.5, and RF) and some hybrid models (i.e., FC-ANN, CFA-DT, and IGCR-ANN) when detecting network intrusions. Specifically, for the low-frequency attacks, our model substantially outperforms the others. Consequently, the proposed model has strong applicability to network intrusion detection.
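The smoothing step can be sketched as follows. The text does not say whether the constant 0.01 is added on the percentage scale or the fraction scale, so the scale is left to the caller here; the function name `smoothed_harmonic_mean` and the example values are hypothetical.

```python
def smoothed_harmonic_mean(values, eps=0.01):
    """Harmonic mean after adding a small constant to every value.

    The shift keeps the mean defined when some values are 0.
    """
    shifted = [v + eps for v in values]
    return len(shifted) / sum(1.0 / v for v in shifted)


# Made-up per-attack-type scores; one of them is 0.
hm = smoothed_harmonic_mean([0.0, 0.5, 1.0])
print(hm)
```

Even after smoothing, a single zero still dominates the harmonic mean, so a model that misses one attack type entirely receives a very low aggregate score.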
5. A Brief Review of Hybrid Models
Recent years have witnessed a surge of research efforts about the application of hybrid techniques for intrusion detection. This section briefly lists some recent studies.
Aburomman and Reaz developed an ensemble method based on the SVM and k-nearest neighbors (k-NN) classifiers, as well as the PSO algorithm, to enhance the detection accuracy. Their approach trains six k-NN experts with different values of k and six SVM experts with different parameters on the same dataset and then generates two new ensembles by combining the opinions of the 12 experts using PSO and metaoptimized PSO, respectively. The difference between the two ensembles lies in the setting of the PSO behavioral parameters. In the first ensemble, the behavioral parameters are manually selected, and in the second one, they are optimized by using local unimodal sampling (LUS). Additionally, for the purpose of comparison, they gave an ensemble method that weighs each expert using the weighted majority algorithm (WMA). The experimental results on five randomly selected subsets of the KDDCup99 dataset show that the LUS-based ensemble achieves the best accuracy.
De la Hoz et al. applied statistical techniques and the self-organizing map (SOM) approach to detect network anomalies. Two techniques, principal component analysis (PCA) and Fisher discriminant ratio (FDR), were considered in their work to remove irrelevant features and noise. Then, the probabilistic SOM (PSOM), a fuzzy version of the classical SOM, is used to model and classify the filtered feature space. Experiments on the NSL-KDD dataset show that the PCA + FDR + PSOM model performs best among the hybrid models compared.
De la Hoz et al. also introduced another hybrid model for network intrusion detection with the purpose of reducing model complexity and improving detection performance. In that paper, feature selection is carried out by multiobjective optimization, and both anomaly detection and attack classification are performed by growing hierarchical self-organizing maps (GHSOMs). The multiobjective approach is implemented by the NSGA-II algorithm, in which the Jaccard coefficient is employed to measure the performance of a classifier for the corresponding class. Experimental results on the NSL-KDD dataset show that their proposed model is better than other approaches in terms of detection accuracy and detection rate.
Wang et al. proposed a hybrid model that unites fuzzy clustering and ANN. First, their model partitions the training set into several different groups using a fuzzy clustering technique. Then, different ANN models are trained on the different groups. Finally, a fuzzy aggregation module aggregates the results of all ANN models. Experimental results on the KDDCup99 dataset indicate that this model outperforms decision trees, naive Bayes, and the backward propagation neural network (BPNN).
Eesa et al. presented a hybrid method, which combines the cuttlefish optimization algorithm (CFA) and the decision tree classifier, to detect network intrusions. In their model, the CFA is used to select significant attributes, and the decision tree algorithm is utilized to identify the types of abnormal events. The performance of this hybrid model was evaluated on the KDDCup99 dataset. The results demonstrate that when the number of attributes is less than 25, the detection accuracy and detection rate improve noticeably.
Guo et al. proposed a two-level hybrid approach for intrusion detection by exploiting the strengths of misuse-based and anomaly-based detection methods. This approach is composed of two anomaly detection components (ADCs) and one misuse detection component (MDC). In stage 1, ADC 1 detects abnormal connections using the ADBCC method. Then, the declared abnormal and normal connections are sent to ADC 2 and the MDC, respectively, in parallel to be further evaluated by k-NN. This hybrid approach was experimentally tested on the KDDCup99 and the Kyoto University Benchmark datasets. The results show that this approach is effective in detecting abnormal connections with a low false alarm rate.
Akashdeep et al. developed an intelligent system by means of feature selection and an ANN classifier. The proposed system first ranks features according to information gain and correlation and then selects useful features by combining their ranks using a novel approach. Experimental results on the KDDCup99 dataset are encouraging.
6. Conclusions
Motivated by the successes of the Gini index and the GBDT algorithm in other fields, this paper proposed the GINI-GBDT-PSO method, a novel hybrid intrusion detection model to improve the performance of network intrusion detection systems. The proposed model first extracts the optimal subset of features from the whole dataset by using the Gini index. Then, the GBDT algorithm, which is a gradient boosting approach, is adopted to detect abnormal connections. In addition, the PSO algorithm is employed to optimize the parameters of the GBDT algorithm in the proposed model. This model can not only enhance the overall performance of network intrusion detection effectively but also improve the detection performance for each type of attack.
In order to validate the performance of our method, we performed experiments on the NSL-KDD dataset compared with six baselines. Five evaluation criteria are introduced to conduct fair comparisons, which are accuracy, detection rate, precision, F1-score, and false alarm rate. The experimental results demonstrated that the proposed model performs the best on the whole in comparison with baselines. These results indicate that the proposed model is an accurate and effective solution for network intrusion detection systems.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (no. 61602225) and the Fundamental Research Funds for the Central Universities (no. lzujbky-2017-192).
References
A. Mohan, Z. Chen, and K. Weinberger, “Web-search ranking with initialized gradient boosted regression trees,” Journal of Machine Learning Research, vol. 14, pp. 77–89, 2011.
L. Dhanabal and S. P. Shantharajah, “A study on NSL-KDD dataset for intrusion detection system based on classification algorithms,” International Journal of Advanced Research in Computer and Communication Engineering, vol. 4, no. 6, pp. 446–452, 2015.
V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA, USA, 1993.
Y. Tao, X. Wu, and C. Li, “Rawls’ fairness, income distribution and alarming level of Gini coefficient,” 2014, http://arxiv.org/abs/1409.3979.
N. R. Samal, A. Konar, S. Das, and A. Abraham, “A closed loop stability analysis and parameter selection of the particle swarm optimization dynamics for faster convergence,” in 2007 IEEE Congress on Evolutionary Computation, pp. 1769–1776, Singapore, 2007.
M. E. H. Pedersen and A. J. Chipperfield, “Local unimodal sampling,” Tech. Rep. HL0801, Hvass Laboratories, 2008.