Abstract

The global usage of increasingly sophisticated web-based application systems is growing rapidly, and much of that usage involves storing and transporting sensitive data over the Internet. This growth has consequently created a serious need for better network and application security protection. Security experts normally equip their databases with a large number of signatures to help detect known web-based threats. In reality, it is almost impossible to keep the database updated with every newly identified web vulnerability, so new attacks remain invisible to signature-based systems. This research presents a novel Intrusion Detection System (IDS) approach for detecting unknown attacks on web servers, the Unified Intrusion Anomaly Detection (UIAD) approach. The unified approach consists of three components (preprocessing, statistical analysis, and classification). Initially, the process starts with the removal of irrelevant and redundant features using a novel hybrid feature selection method. Thereafter, a statistical approach is applied to identify traffic abnormality: the Relative Percentage Ratio (RPR), coupled with Euclidean Distance Analysis (EDA) and the Chebyshev Inequality Theorem (CIT), is used to calculate the normality score and generate an optimal threshold. Finally, Logitboost (LB) is employed with Random Forest (RF) as its weak classifier, with the aim of minimising the final false alarm rate. The experiments demonstrate that our approach successfully identified unknown attacks with a detection rate above 95% and a false alarm rate below 1% on both the DARPA 1999 and ISCX 2012 datasets.

1. Introduction

The continuous growth of Internet usage, the development of faster Internet technology, and the availability of massive amounts of sensitive information have made servers the primary targets of malicious attacks. Lately, web-based applications and web servers have become popular targets, as most network communication serves client-server enquiry needs. These web applications are often accessible through ports that are open through firewalls [1]. Although the Internet provides convenient real-time information services to the public, the potential threats to confidentiality, integrity, and availability (CIA) need to be addressed more effectively and permanently [2]. To fortify the security of web-based servers and systems, Intrusion Detection Systems (IDSs) can be used as a complement to many existing security appliances such as password authentication, firewalls, access control, and vulnerability assessments.

The Intrusion Detection System (IDS) is an application or a device that identifies hostile activities or policy violations within a network. IDSs have been widely used in recent years as one of the core network security components. They play an active role in network surveillance, functioning as a network security guard. IDSs capture and analyse traffic movement and raise an alarm when an intrusive action is detected. The alarm alerts the security analyst to take the necessary action. In general, IDSs can be designed either as a network-based IDS (NIDS) or as a host-based IDS (HIDS) [3] to recognise signs of intrusion. The design is based on the placement of the IDS, which may capture traffic for the whole network or only for a specific host [4]. In a NIDS, the IDS is normally installed before and after the firewall to capture traffic for the whole network segment. In a HIDS, the IDS focuses on a specific host to examine packets, logs, and system calls. Such being the case, the HIDS is more suitable for identifying internal attacks than the NIDS [5].

According to [6], there are two types of IDS: the Signature Detection System (SDS) and the Anomaly Detection System (ADS). In an SDS, a set of previously defined rules is stored inside databases and used to identify known attacks. Because the SDS technique relies on consistent signature updates, it is unable to detect unknown or new attacks [7]; consequently, such attacks can pass through the system undetected. On the other hand, the ADS approach is based on the analysis of normal traffic behaviour as the baseline of general usage patterns. Fundamentally, ADS rests on the assumption that any traffic that deviates from normal patterns is identified as malicious [8]. The main advantage of this approach is its ability to identify new or unknown attacks. In spite of this advantage, ADSs are prone to triggering a large number of false detections [9]. A false detection occurs when the system misclassifies legitimate traffic as malicious traffic or vice versa. The key challenge in ADS is developing a system that produces high detection accuracy while maintaining a low false detection rate.

Therefore, this paper presents a novel Unified Intrusion Anomaly Detection (UIAD) approach that consists of three components (preprocessing, statistical analysis, and classification). The study contributes a new set of techniques using two-stage detection that aims to improve the outlier detection rate and minimise the false alarm rate in ADS environments. Initially, we perform hybrid feature selection (HFS) to filter out irrelevant and redundant features. The first-stage detection then applies statistical approaches, divided into two phases: the learning phase and the detection phase. In the second-stage detection, a data mining approach is employed, in particular ensemble learning classification, to improve the true detection rate (True Positives and True Negatives) and reduce the misclassification rate (False Positives and False Negatives) obtained in the first stage. Finally, we implemented the Logitboost algorithm as a metaclassifier with RF as its base classifier. The results demonstrate a significant improvement in attack detection accuracy and a reduction in the false alarm rate for both the DARPA 1999 and ISCX 2012 datasets.

The rest of this paper is organised as follows. Sections 2 and 3 review the related work and the datasets used by this study, while the proposed approaches are explained in Section 4. The experimental results are presented in Section 5. Section 6 concludes and outlines future work.

2. Related Work

In this section, we discuss related work on IDSs and existing work in the areas of feature selection, statistical analysis, data mining algorithms, and web-attack traffic.

2.1. Feature Selection

Feature selection is a foundation of machine learning and has been studied for many years [22]. It is the process of discovering the most prominent features for the learning algorithm, so that only the most useful data is analysed for better prediction. It is therefore imperative to remove redundant or irrelevant features in order to provide discriminative models for every classifier. As the effectiveness of the selected algorithm is highly dependent on the features selected, it is also crucial to choose the most significant features, which contribute to maximising classification performance. Selecting a feature selection algorithm often requires expert knowledge, as identifying a good set of features is not a straightforward task.

Currently, the two general methods used in this field are the filter and wrapper [23] approaches. Filter-based subset evaluation (FBSE) was introduced to overcome the redundant-feature issue in filter ranking [24]. It examines the whole subset in a multivariate way: it selects the relevant features and explores the degree of relationship between them. In addition, FBSE is heuristic-based and involves probabilistic and statistical measures to search for and evaluate the usefulness of all identified features. On the other hand, wrapper-based subset evaluation (WBSE) uses a classifier to estimate the worth of a feature subset. Usually, WBSE has better predictive accuracy than filters, because the selection is optimised by evaluating each feature subset with a particular classification algorithm.

However, because wrappers use a classification algorithm to evaluate each set of features, they are expensive to execute, and when dealing with a large database that consists of many features the wrapper can become unmanageable [25]. Wrappers are also tightly coupled to the classifier's algorithm, which makes shifting from one classifier to another difficult because the selection process must be restarted from scratch. Filters, in contrast, select features using distance measures and correlation functions [26] and do not require re-execution for different learning classifiers. As such, their execution is much faster than wrappers, and they are suitable for large databases that contain many features. Researchers have often used the filter method as an alternative to the wrapper method, since the latter is expensive and time-consuming to run.

2.2. Statistics-Based Approaches

Statistical methods in IDSs were first introduced by [27]. The detection approach primarily relies on a history of collected data to create a profile of normal behaviour. In this approach, only benign traffic data collected over a period of time is utilised to detect intrusion [27]. Some researchers have proposed statistical models in more specific areas, such as Packet Header Anomaly Detection (PHAD). In PHAD, packet characteristics and behaviours are used to recognise abnormal patterns. PHAD uses statistical measurements from the activity history [28] to construct a normal profile. Traffic that deviates from the normal profile and behaves abnormally is identified as intrusive by this method. Instead of using only IP addresses and port numbers, PHAD uses all the information inside a packet header [28]. The 33 attributes in a packet header carry the information of 3 layers of the OSI 7-layer model: the data link, network, and transport layers. The information in these attributes is used to measure the probability of each packet being normal or tending towards abnormal behaviour. An anomaly score is awarded whenever a dissimilarity is detected between the training data and the testing data. Finally, the anomaly scores of each packet are summed, and the packet is flagged as anomalous if the total surpasses the preset threshold.
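As an illustration of this scoring scheme, the sketch below implements a stationary PHAD-style scorer that assigns n/r to a header field whose test value was never seen in training (n observations, r distinct values), omitting the time factor of the original nonstationary model; the field names and threshold are toy values, not the 33 attributes used in [28].

```python
# Hedged sketch of PHAD-style header-field scoring (not the authors' code).
# During training, each field accumulates the number of observations n and the
# set of distinct values seen (size r). At test time, a value never seen in
# training contributes n/r to the packet's anomaly score; the packet is flagged
# when the summed score exceeds a preset threshold.
from collections import defaultdict

class PhadLikeScorer:
    def __init__(self, threshold=5.0):
        self.seen = defaultdict(set)      # field -> distinct training values
        self.count = defaultdict(int)     # field -> number of training observations
        self.threshold = threshold

    def train(self, packets):
        for pkt in packets:               # pkt: dict of header fields
            for field, value in pkt.items():
                self.seen[field].add(value)
                self.count[field] += 1

    def score(self, pkt):
        s = 0.0
        for field, value in pkt.items():
            if field in self.seen and value not in self.seen[field]:
                n, r = self.count[field], len(self.seen[field])
                s += n / r                # novel value -> anomalous contribution
        return s

    def is_anomalous(self, pkt):
        return self.score(pkt) > self.threshold

# Usage with toy header records (field names are illustrative only):
scorer = PhadLikeScorer(threshold=5.0)
scorer.train([{"ttl": 64, "proto": 6}, {"ttl": 64, "proto": 17}, {"ttl": 128, "proto": 6}])
print(scorer.score({"ttl": 1, "proto": 6}))   # novel ttl value -> nonzero score
```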

In contrast to conventional PHAD systems, [29] proposed Protocol-based Packet Header Anomaly Detection (PbPHAD) in two different environments: network-based and host-based. The proposed method uses three main protocols, the Transmission Control Protocol (TCP), User Datagram Protocol (UDP), and Internet Control Message Protocol (ICMP), to construct a normal profile that captures normal behaviour. Similar to the traditional PHAD system, this approach uses all 33 packet header attributes to produce an anomaly score, which individually rates the degree of abnormality of the incoming traffic. In spite of surpassing the results of PHAD and the DARPA best system [30] with a 57.83% detection rate, there is still room for further improvement.

To identify whether malicious packets exist inside Telnet traffic, [6] proposed the Lightweight Network Intrusion Detection System (LNID). In LNID, benign behaviour extracted from the training data is used to construct a normal profile, which is then used as the indicator to compute an anomaly score. The anomaly score is assigned during a matching process between the testing and training data, and packets are flagged as malicious when the score surpasses the preset threshold. Insignificant features are removed from the training data during the preprocessing phase to reduce computational cost. Although the scoring approach in LNID increased the detection rate for U2R and R2L attacks to 86.4%, there is still room for improvement: nearly 14% of attacks went undetected when the threshold was determined from anomaly scores alone, without considering effective features as additional input. Profile generation has also attracted [15] to propose catastrophe and equilibrium surface theory to extract common behaviours that exist within the network. The standard equilibrium surface is used to indicate changes in packet behaviour, which makes it suitable for inspecting incoming traffic. Despite the fact that the true positive rate increased to slightly over 86% for Telnet traffic, the real challenge is to obtain the best detection rate together with the lowest false alarm rate.

2.3. Data Mining Based Approaches

Data mining is the technique of discovering systematic data relationships and determining the fundamentals of data information [7]. Data mining is divided into two broad categories: unsupervised and supervised approaches. Clustering and classification are examples of unsupervised and supervised algorithms, respectively. In clustering, objects are grouped based on their characteristics, so that every data point in a cluster is similar to the other points within its cluster but dissimilar to points in other clusters. Grouping similar data into one or more clusters eases the identification of abnormality. However, this approach can potentially increase the false alarm rate. Given that IDS performance is highly dependent on achieving a low false alarm rate, its capability would be degraded if it continuously generated a high false alarm rate. For that reason, classification is the better approach for labelling data (i.e., benign or anomalous), especially for reducing the false alarm rate. Classification is a supervised approach that has the capability to differentiate unusual data patterns, making it suitable for identifying new attack patterns [31]. Furthermore, classification has been widely used due to its strong reliability in identifying normal structure accurately, which contributes towards reducing false detections [32].

The ensemble technique in classification has attracted researchers to combine several classifiers with the aim of obtaining better prediction accuracy [33]. The ensemble method is divided into 3 main approaches: (i) bagging, (ii) stack generalisation, and (iii) boosting. Bagging, often referred to as "bootstrap aggregating", improves detection accuracy by fusing the outputs of learned classifiers into a single prediction. For instance, the RF algorithm achieves high classification accuracy by fusing random decision trees using the bagging technique. Stack generalisation, or stacking, involves the combination of predictions from several learning algorithms; the prediction output from base-level classifiers is used to achieve high generalisation accuracy.

Boosting is mainly used to boost weak classifiers or weak learners into a higher-accuracy classifier. In other words, boosting can be considered a metalearning algorithm. The incorrectly classified instances from the previous model are used to build the ensemble. Weak classifiers such as decision stumps, which are decision trees with a root node and two leaf nodes, are usually used in the boosting technique [34]. Adaboost (Adaptive Boosting), first introduced by [35], is the most popular boosting algorithm. The high accuracy achieved with this algorithm has attracted researchers [36–38] to employ it in IDSs.

In [36], the author proposed Adaboost with a decision stump as the weak classifier. Noise and outliers in the dataset are first removed by training on the full data; sample data that receives a high weight is treated as noise or outliers. Although the detection rate achieved was almost 92%, the false alarm rate was still 8.9%. Similarly, in [20], the authors proposed CAGE (Cellular Genetic Programming), which evolves the combination function used in ensemble approaches. The approach was tested on the ISCX dataset and achieved a 91.37% attack detection rate. Although the approach achieved a high detection rate, the high false alarm rate recorded limits the system's capability.

In choosing the right weak classifier for Adaboost, [37] compared four classifiers, NNge (Nonnested Generalised Exemplars), JRip (Extended Repeated Incremental Pruning), RIDOR (Ripple-Down Rule), and Decision Tables, as base classifiers for Adaboost. The proposed combination of Adaboost with NNge achieved the highest detection rate for U2R and R2L attacks, while the combination of Adaboost with Decision Tables was found to be efficient in detecting DoS attacks. Paper [38] proposed a concept similar to [36], testing a Naïve Bayes algorithm as the weak classifier. Although the proposed algorithm achieved a 100% detection rate for DoS attacks, the overall performance (84% detection rate with a 4.2% false alarm rate) is still much lower than [36].

The logistic-regression-based boosting (Logitboost) algorithm [39] was introduced as an alternative that addresses the drawback of Adaboost in handling noise and outliers. The Logitboost algorithm uses a binomial log-likelihood loss that grows only linearly with the classification error, whereas Adaboost uses an exponential loss function that grows exponentially with the classification error. This is the reason why Logitboost turns out to be less sensitive to outliers and noise. To the best of our knowledge, no research to date has investigated the performance of the Logitboost algorithm in the ADS field.
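For reference, the two loss functions can be written side by side, following the standard formulation in [39], with $y \in \{-1, +1\}$ and committee output $F(x)$: Adaboost minimises the exponential loss $L_{\mathrm{Ada}}(y, F(x)) = e^{-yF(x)}$, which grows exponentially for confidently misclassified samples, whereas Logitboost minimises the negative binomial log-likelihood $L_{\mathrm{Logit}}(y, F(x)) = \log\bigl(1 + e^{-2yF(x)}\bigr)$, which grows only linearly in the same regime; this is why outliers and noisy labels carry far less weight under Logitboost.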

2.4. Attack on Web Traffic

The exposure of vulnerable web-based applications and their related sensitive information in the Internet environment has made network security an area of major concern. Incidents of attack are becoming more frequent and aggressive, causing serious damage to the targeted web-based information systems. As such, it is not surprising that more researchers are involved in this field. Paper [40] proposed a learning-based approach to secure web servers, focusing on detecting SQL and XPath injection attacks. The detection is based on the input query, where the attacker usually adds extra conditions to the original SQL commands. The proposed method examines the structure and type of the inputs as well as the outputs of the operations present in the XSD file. After the necessary information is collected, a workload generator is used to inspect the set of SQL/XPath commands present in the source code. In detection mode, the SQL query is compared with the attack-free SQL queries stored in the lookup map. If the SQL query is not found, execution of the query is stopped to avoid potentially hazardous requests. This approach is capable of alerting developers and service administrators to stop the XPath/SQL injection before the system is harmed.

The research proposed in [41] assumed that attack patterns commonly have a level of complexity that exceeds normal access requests, and this complexity is used as a benchmark for detecting attacks. The recorded request log is inspected with Shannon entropy analysis, which is used to calculate the complexity level. In defining the entropy level, normal log requests in the training sets are used as the benchmark of a legitimate profile. The detection boundaries (thresholds) are derived from the average and standard deviation of the entropy over each period. Log requests that surpass the predefined complexity threshold are flagged as potential intrusions. Although the proposed approach is able to detect attacks at a satisfactory rate, the false detection rate leaves room for further improvement.

On the other hand, the work done in [42] considered the analysis of HTTP log requests. Normal HTTP log requests were used as training sets to describe a model of normal user behaviour. This approach is similar to the work in [43], which makes use of query trees in detecting SQL attacks: the query trees are first converted into dimensional vectors for feature extraction and feature transformation. The work was carried out using data mining techniques and utilises the Support Vector Machine (SVM) algorithm for classification. The results demonstrated conspicuous performance improvements in terms of computational time reduction and attack detection accuracy. Although the detection rate presents a significant improvement, the proposed methods require a readable payload to extract the HTTP request.

Traditionally, IDSs work on the principle of "deep packet inspection", where packet payloads are inspected for the presence of malicious activity. As network communication becomes more frequent, the demand for more secure communication using cryptography also increases. In encrypted traffic environments, Secure Sockets Layer (SSL), Wired Equivalent Privacy (WEP), or Internet Protocol Security (IPsec) protocols are utilised to offer better privacy and confidentiality. Previous work on detecting web-based attacks mainly focused on investigating the log/payload content [44, 45]. When the traffic is encrypted, the payload (log) is unavailable, as its content is indecipherable. Unlike the payload, the information in the packet header is still accessible for extraction. Packet headers were used in this research due to the availability of this information even in encrypted traffic situations; thus, the approach is applicable to detecting malicious attacks within encrypted network traffic.

3. Dataset Description

The proposed method was evaluated using two different datasets: the DARPA 1999 [30] and the ISCX 2012 [46] datasets. We made use of publicly available labelled datasets to avoid the problems described in [47] with traffic recorded from real environments. Both datasets are available online and have been used extensively as standard benchmarks by many researchers in this field, for example, [6, 15, 16, 48]. The DARPA 1999 dataset is a traditional dataset commonly used in this field; it is an improved version of the Defence Advanced Research Projects Agency (DARPA) 1998 initiative, updated with additional types of attack. In contrast, the ISCX 2012 dataset is a modern, updated dataset, which is claimed to have rectified the weaknesses identified in DARPA 1999.

3.1. DARPA 1999

MIT Lincoln Laboratory provides the publicly available dataset called DARPA 1999. The dataset consists of traffic data spanning a total of 5 weeks: 3 weeks of training data and 2 weeks of testing data. Since the dataset is provided in multiple formats, we chose the tcpdump format, as it contains comprehensive TCP/IP information suitable for traffic analysis. In the training data, traffic from weeks 2 and 3 is defined as benign, as it is attack-free, and is therefore suitable for training an ADS. For the testing data, weeks 4 and 5 contain attacks generated in the middle of benign traffic. The attack distributions differ between week 4 and week 5: week 5 contains additional attacks that were not present in week 4. The different attack distributions give researchers an opportunity to seek methods that can detect new or novel attacks.

Figure 1 shows the dataset generation simulation, based on a scripting technique that generates live benign and attack traffic. The scenario is equivalent to traffic flowing between the internal Eyrie Air Force Base network and the Internet at large. The test bed generates rich background traffic, as if the traffic were initiated by thousands of hosts from hundreds of users. All attacks were set to launch automatically against the victim machines (UNIX OS) and the external host's router. Sensors known as "sniffers" were placed within the internal and external networks to capture all traffic transmitted through the network.

3.2. ISCX 2012

This dataset was generated by [46] at the University of New Brunswick (UNB) and aims to address issues in other existing datasets such as DARPA, CAIDA, and DEFCON. The 7-day simulated dataset consists of 3 days of attack-free traffic and 4 days of mixed benign and malign traffic. The dataset's effectiveness in terms of realism, evaluation capability, and malicious activity coverage rests on its profile-based traffic distribution model. Numerous multiphase attack events were introduced to create anomalous traces in the dataset, such as HTTP Denial of Service (DoS), Botnet, Distributed Denial of Service (DDoS), and Brute Force SSH. The simulation was created to mimic user behaviour: profile-based user behaviour was produced by executing user profiles that synthetically generate traffic at randomly synchronised times. The dataset comes with labelled traffic that assists researchers in testing, comparison, and evaluation.

Figure 2 shows the ISCX 2012 test bed network that contains 21 interconnected Windows workstations. Those workstations were equipped with Windows operating systems as a platform to launch attacks against the test bed environment. Out of 21, 17 workstations were installed with Windows XP SP1, 2 with SP2, 1 with SP3, and the rest with Windows 7. The network architecture divides the workstation into four distinct LANs. This configuration was expected to represent a real connectivity network environment. The servers located at the fifth LAN provide web, email, DNS, and Network Address Translation (NAT) services.

The NAT server (192.168.5.124) was placed at the entry point of the network so that the firewall would only allow authorised access. The primary main server (192.168.5.122) was responsible for email services, delivering the website, and acting as an internal name resolver. The secondary server (192.168.5.123) was responsible for handling internal ASP.NET applications running on a Windows Server 2003 machine. Both the main and NAT servers ran on Linux and were configured with Ubuntu 10.04. Our experiment focused on the specific host server addresses DARPA (172.016.114.050) and ISCX (192.168.5.122). These two hosts were chosen because they carry the highest amount of attack traffic.

4. Methodology

In this research, our anomaly detection approach consists of three parts: preprocessing (hybrid feature selection), statistical analysis (benign behaviour analysis), and data mining (boosting classification algorithm).

4.1. Preprocessing

In the preprocessing step, we adopted our previous HFS [49] approach to leverage the strengths of both the filter and wrapper approaches. In addition, the proposed filter-based subset evaluation (FBSE) was utilised to resolve the drawback in filter-ranking where redundant features exist.

Figure 3 shows the process flow for building the HFS, which comprises three stages, as follows.

In Stage 1, the process starts with the filter-subset evaluation. It processes the original set of M features and produces a reduced set of L features, where L < M. We chose the Correlation Feature Selection (CFS) approach due to its robustness in removing redundant and irrelevant features. CFS overcomes the presence of redundant features because the relationship between features is measured explicitly, as in (1), whereas in feature ranking the reduced features are usually selected without further examination (e.g., information gain, gain ratio). CFS is an intelligible filter algorithm that evaluates subsets of features based on a heuristic evaluation function. The evaluation is based on the hypothesis "A good feature subset is one that contains features highly correlated with the class, yet uncorrelated with each other" [25]:

$$\mathrm{Merit}_{s} = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}. \quad (1)$$

Equation (1) shows how the merit, $\mathrm{Merit}_{s}$, is used to select a subset s containing k features. Both redundant and irrelevant features are identified through $\overline{r_{cf}}$, which represents the mean of the correlation of each feature with its class, and $\overline{r_{ff}}$, the mean of the correlation among the features. An exhaustive search is not suitable for large datasets [25] due to its high complexity. As such, we used heuristic search techniques and chose a genetic algorithm as the search function, because our experiments reveal that the genetic algorithm gives a global optimum solution and is more robust than the best-first and greedy methods. Furthermore, this stage is crucial in truncating the computational effort of the subsequent wrapper approach, as the wrapper only deals with the reduced set of features rather than the original set.
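To make (1) concrete, the following sketch scores a candidate subset with the CFS merit, assuming Pearson correlation as the correlation measure (WEKA's CFS uses symmetrical uncertainty for nominal attributes, so this is illustrative only); the data and subsets are synthetic.

```python
# Minimal sketch of the CFS merit in (1), using absolute Pearson correlations
# as stand-ins for the feature-class and feature-feature correlation measures.
import numpy as np

def cfs_merit(X, y, subset):
    """Merit_s = k * mean(r_cf) / sqrt(k + k*(k-1) * mean(r_ff)) for subset s."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)

# Toy usage: score two candidate subsets of a random design matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
print(cfs_merit(X, y, [0, 1]), cfs_merit(X, y, [3, 4]))
```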

In Stage 2, the reduced feature set L obtained from the FBSE was further processed by WBSE to produce the final set of K optimal features, where K ≤ L. The proposed hybridisation of the filter and wrapper approaches leverages the strengths of both to produce a much better result in terms of accuracy, false alarm rate, and fewer redundant and irrelevant features. The filter approach alone cannot find the best available subset, as it is less dependent on the classifier; the wrapper approach is proven to be more effective and produces better accuracy, but it is computationally expensive when dealing with a large dataset. Thus, by leveraging the strengths of both methods, we combined them into a hybrid feature selection (HFS) approach. We use the Random Forest (RF) classifier in WBSE to evaluate the selected features using genetic search and produce the final K-feature subset, as sketched below. The search continues to train a new model for each subset and stops once the final optimum subset is found.
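A sketch of the wrapper step, under the assumption that a candidate subset is scored by the cross-validated accuracy of a Random Forest; the genetic search that proposes candidate subsets is omitted, and the data, fold counts, and tree counts are synthetic placeholders.

```python
# Sketch of the wrapper step (WBSE): a candidate subset is scored by the
# cross-validated accuracy of a Random Forest trained on those columns only.
# Any search strategy (here, the genetic search) can call wrapper_score()
# as its fitness function.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def wrapper_score(X, y, subset, folds=10):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    return cross_val_score(clf, X[:, subset], y, cv=folds).mean()

# Toy usage: compare two candidate subsets coming out of the filter stage.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
y = (X[:, 0] - X[:, 2] > 0).astype(int)
print(wrapper_score(X, y, [0, 2], folds=5), wrapper_score(X, y, [1, 3], folds=5))
```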

Stage 3 is the classification stage. In this stage, the final optimum subset K, produced by WBSE, was tested with the RF classifier using 10-fold cross-validation. RF consists of many decision tree classifiers, each constructed from a different sample of the original dataset. The output class is chosen based on the votes obtained from each tree, where each vote indicates the tree's decision concerning the class of the object, and the class with the most votes from the individual trees is selected.

The RF algorithm is widely used in data mining techniques for prediction, pattern recognition, and probability estimation, as in [5153].

Figure 4 presents the general architecture of RF. As RF is built from many decision trees, each tree of the RF is grown from a different bootstrap sample, using a decision tree as a weak classifier. A vote is given by each tree to represent the tree's decision regarding the class of the object, and the forest chooses the class with the majority vote over all the trees. The out-of-bag (OOB) error is used for validation during tree growth; it is the average classification error of each tree measured on the samples that were not included in its bootstrap sample. After constructing the forest, a new sample x is classified according to the following equation:

$$\hat{C}(x) = \operatorname{majority\ vote}\,\bigl\{\hat{C}_{b}(x)\bigr\}_{b=1}^{B}, \quad (2)$$

where $\hat{C}_{b}(x)$ is the class that is assigned by tree b and B is the number of trees in the forest.
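For illustration only, the voting and out-of-bag validation described above can be reproduced with scikit-learn; the data, tree count, and per-tree vote inspection below are synthetic and are not part of the paper's experiments.

```python
# Majority voting and out-of-bag validation as in Figure 4, sketched with
# scikit-learn (oob_score_ reports 1 - OOB error on the bootstrap samples).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, oob_score=True, bootstrap=True,
                                random_state=0).fit(X, y)
votes = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])  # per-tree votes
print(votes.mean(axis=0).round(2),   # fraction of trees voting class 1
      forest.predict(X[:5]),         # majority-vote decision of the forest
      forest.oob_score_)             # out-of-bag accuracy estimate
```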

4.2. Statistical Based Anomaly Detection (First-Stage Detection)

Although some work has been done in the past on detecting abnormalities using header traffic, for example, [6, 28], those works do not take into account the influence of packet size when analysing benign and abnormal traffic. In our research, through statistical based anomaly detection (SBAD), we compute attack detection performance by calculating traffic normality using a standard profile alongside the packet size.

4.2.1. Standard Profile

At bottom, the ADS is based on the analysis of a normal behaviour detection model. The model examines incoming traffic against its standard normal behaviour to divulge any significant irregular patterns. Using benign traffic rather than abnormal behaviour as a profile was found to be more convenient, as intruders tend to employ evasion techniques. In addition, to determine the difference between anomalous and benign traffic, the probability of discrepant traffic was calculated using statistical techniques that assign an anomaly score function. The basic idea of generating a normal profile was proposed by Mahoney and Chan [28] using a nonstationary model, in which the probability of an event depends on the time since it last happened. The study made by Chen et al. [6] concluded that the nonstationary model is not suitable for detecting attacks that occur on different time scales. For instance, consider two httptunnel attacks A1 and A2 that share the same traffic content, where A1 occurred 1 s after the previous attack and A2 occurred 30 mins after the previous attack. Because the packet content is identical, both attacks should share the same anomaly score. However, the difference in elapsed time between the two attacks (1 s versus 1800 s) results in different anomaly scores: the score for A2 is 1800 times greater than that for A1. As both attacks share the same content, they should conveniently have the same anomaly score. This becomes a problem when the threshold is set at a certain level, where A1 might be ignored after the system has detected A2. To address these issues, Chen et al. [6] introduced stationary models that ignore the time-dependent scheme.

We adopted the idea of extracting distinct values from attack-free traffic introduced by Mahoney and Chan [28] and the stationary models proposed by [6], since such approaches are able to demonstrate traffic characteristics efficiently. Nevertheless, our approach is different, in so far as it does not depend solely on normal profiles to determine malign traffic. Our model can be seen as a unified system, which consists of feature selection, statistical, and data mining approaches. Our research approach differs from [6, 28] in three ways. Firstly, we eliminate superfluous and irrelevant features using our proposed hybrid feature selection method. Secondly, we use a normal score in conjunction with packet size features to produce a better threshold mechanism. In our research, we propose measuring a normal score instead of calculating an anomaly score. The main reason for calculating a normal score as an alternative to the anomaly score proposed by [6, 28, 29] is that the latter is not sensitive in considering new values in an attribute. Our observation of [6] revealed that benign traffic is more likely to contain novel values than malign traffic; furthermore, in a real environment, there is more benign traffic than malign traffic. Thus, analysing the degree of normality of field values in the traffic is appropriate and easier than doing so for attack traffic. Thirdly, we propose a 2-stage detection strategy comprising a statistical approach alongside the Logitboost algorithm, with the aim of reducing the overall false detection rate. In our statistical approach, the practice of treating normal traffic behaviour as a standard profile limits the system's ability to recognise attack behaviour. Thus, by implementing a classification approach using data mining, additional features derived from the statistical procedures, along with varied samples of malicious traffic, can define attack behaviour more precisely. As a result, the detection accuracy and misclassification rate are greatly improved.

Figure 5 shows how the proposed ADS model is divided into two phases. In the learning phase, we created a standard profile as a benchmark of benign traffic characteristics. The purpose of creating the profile is to identify and calculate the degree of normality of the incoming web traffic (benign or malign). Normal scores were assigned to every associated feature based on the procedure shown in (3), while the standard profile is given in Tables 1 and 2.

We index the attributes as a_i, where i = 1, ..., n; v_i is the distinct accumulation of standard packet characteristic values for attribute a_i, while N_i is the total amount of traffic related to each attribute.

Tables 1 and 2 illustrate the generic model of the standard profile, listing the distinct values observed for each attribute. We use a log ratio in our model to calculate the score, as the values vary greatly. The normal score is calculated from the count of each distinct value divided by the total amount of traffic N_i, and the resulting proportion is multiplied by 100 to obtain a percentage.
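A minimal sketch of building such a standard profile from attack-free traffic, assuming the relative percentage ratio count/total × 100 per distinct value and a base-10 log ratio; the exact log scaling of (3) is not reproduced here, and the attribute names are toy examples rather than the selected header features.

```python
# Hedged sketch of the standard-profile construction: for every header
# attribute, each distinct value observed in benign training traffic receives
# a normal score derived from its relative percentage ratio, with an assumed
# log10 scaling standing in for the paper's log ratio.
import math
from collections import Counter, defaultdict

def build_standard_profile(records):
    """records: list of dicts {attribute: value} from benign training traffic."""
    counts = defaultdict(Counter)
    for rec in records:
        for attr, val in rec.items():
            counts[attr][val] += 1
    n = len(records)
    return {attr: {val: math.log10(c / n * 100) for val, c in counter.items()}
            for attr, counter in counts.items()}

benign = [{"proto": "tcp", "dst_port": 80}, {"proto": "tcp", "dst_port": 80},
          {"proto": "udp", "dst_port": 53}]
profile = build_standard_profile(benign)
print(profile["dst_port"])  # e.g. {80: log10(66.7), 53: log10(33.3)}
```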

In the detection phase, the testing data contains a mixture of benign and malign traffic. As can be seen in Figure 6, the web-based traffic within the testing dataset was matched with the standard profile. We incorporated the scores derived from the standard profile into the test dataset. All values within the test dataset were examined very closely. If their unique values are matched with the profile, a normal score will be awarded. However, if the test dataset values are absent in the standard profile, a zero score will be given to the particular attributes.

During the matching procedure, two scores, namely, the Passive Score (PS) and the Active Score (AS), are produced. PS is a fixed score obtained directly from the standard profile, while AS is generated during the matching procedure between the testing data and the standard profile.

Both scores collected during the detection phase (matching process) were then converted into data points that represent coordinates for distance measurement. The degree of normality is then defined by calculating the distance between the passive and active data points of the testing dataset. Figure 7 presents an example of normal traffic behaviour, where the passive and active points share the same coordinates, while Figure 8 presents an example of anomalous traffic behaviour, where the active points are separated from the passive points and outliers are produced. To measure the distance between these two points (benign and outliers), Euclidean Distance Analysis (EDA) was used, given its adequacy in computing basic distances. In this research, we assume that anomalies occur when there is a deviation between normal and abnormal behaviour. Based on that assumption, we flag possible intrusions by calculating the Euclidean distance between the two points derived from the training and testing data. As this assumption is rigid, the number of false detections is expected to be large; further analysis using the Chebyshev Inequality is therefore deployed to measure the upper bound for the threshold, which improves the detection performance. The EDA between the passive point P = (p_1, ..., p_n) and the active point A = (a_1, ..., a_n) is computed as

$$d(P, A) = \sqrt{\sum_{i=1}^{n} (p_{i} - a_{i})^{2}}.$$

Thus, the distance between a passive point and an active point can be simplified into

$$d(PS, AS) = \sqrt{(PS - AS)^{2}} = \lvert PS - AS \rvert.$$

In the next process, we considered packet size as an additional measure in conjunction with the standard profile. The justification for choosing this feature is briefly explained in the next subsection.
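The matching step can be sketched as follows; treating the Passive Score of an attribute as its fixed profile score and the Active Score as the matched score (zero for unseen values) is our reading of Figures 6-8, so the helper below is illustrative rather than the authors' exact procedure.

```python
# Hedged sketch of the detection-phase matching: the Passive Score vector comes
# from the standard profile, the Active Score vector is the matched score (or
# zero when a value was never seen in training), and the Euclidean distance
# between the two vectors is the deviation used to flag outliers.
import math

def match_and_distance(record, profile):
    passive, active = [], []
    for attr, val in record.items():
        attr_profile = profile.get(attr, {})
        ps = max(attr_profile.values(), default=0.0)  # assumed fixed reference score
        as_ = attr_profile.get(val, 0.0)               # zero when the value is unseen
        passive.append(ps)
        active.append(as_)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(passive, active)))

# Toy profile and test record (scores and field names are illustrative only):
profile = {"proto": {"tcp": 1.8, "udp": 1.5}, "dst_port": {80: 1.8, 53: 1.5}}
test = {"proto": "tcp", "dst_port": 4444}             # unseen port -> larger distance
print(match_and_distance(test, profile))
```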

4.2.2. Influence of Packet Size on Traffic Behaviour

Previous work [54–56] has proven that the packet size (bytes) can be used to measure traffic normality. This is explained by the nature of a client-server service request: typically, in client-server access, a client request comprises a small number of bytes, and in return the server responds with a large number of bytes. As such, a request carrying a large number of bytes can be considered or suspected to be abnormal. Normally, when a user makes requests from the same source address, the variation in the size of the extracted request string, such as a "get request", is minimal.

For that reason, an inconsistent input size can signal anomalous activity. This normally happens when malicious input is bundled together with legitimate traffic. For instance, XSS (one of the top web attacks) targets web pages in an attempt to add malicious scripts to the website; this activity requires additional data that significantly exceeds the average parameter size. With regard to the SQL injection type of attack, the attacker's input includes malicious code to misdirect the program execution. The code consists of special strings that can alter the SQL statement with the intention of compromising the targeted database files. Consequently, the malicious packets may contain up to several thousand bytes. We therefore statistically measure the packet size of the anomalous source traffic that was first flagged as anomalous using EDA.

4.2.3. Threshold

We deployed the Chebyshev Inequality method to find the right boundary and determine the optimal threshold in order to achieve a higher detection rate. We considered anomalous source traffic and packet size as the main features in defining the threshold. Figure 9 illustrates an example of measuring the upper bound of the training data from the mean using the Chebyshev Inequality.

From the previously flagged anomalous source IP addresses, we estimated the mean and the standard deviation of their packet size (bytes) distribution by determining the sample mean and variance of each parameter size for both datasets during the learning phase (normal traffic). The mean and the variance collected during the learning phase were used to assess regularity in the detection phase. We measured the probability of a packet being irregular using the Chebyshev Inequality, as follows:

$$P\bigl(\lvert X - \mu \rvert \geq t\bigr) \leq \frac{\sigma^{2}}{t^{2}}.$$

The advantage of using the Chebyshev Inequality is that it does not rely on knowledge of how the data is distributed, as in a real environment the traffic distribution can vary. It places an upper bound on the probability that the deviation between the value of the random variable $X$ and the mean $\mu$ is greater than the threshold $t$, for any distribution with variance $\sigma^{2}$ and mean $\mu$. We replace the threshold $t$ with the difference between the feature size S and the mean of the feature-size distribution. This defines an upper bound on the probability that the feature size of a particular source IP address deviates more from the mean than it does for normal traffic. The probability value for feature size S is calculated as follows:

$$P\bigl(\lvert X - \mu \rvert \geq \lvert S - \mu \rvert\bigr) \leq \frac{\sigma^{2}}{(S - \mu)^{2}}.$$
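A minimal sketch of this bound, assuming the per-source mean and variance are estimated from benign packet sizes during the learning phase; the 0.05 cut-off and the sample sizes below are illustrative, not values from the paper.

```python
# Sketch of the Chebyshev-inequality threshold test: the probability that a
# packet size S deviates at least |S - mean| from the benign mean is bounded
# by var / (S - mean)^2, and sizes whose bound falls below a chosen cut-off
# are flagged as anomalous.
import statistics

def chebyshev_bound(size, mean, var):
    dev = abs(size - mean)
    if dev == 0:
        return 1.0
    return min(1.0, var / dev ** 2)

benign_sizes = [310, 295, 330, 305, 320, 300, 315]        # learning phase (normal traffic)
mu, var = statistics.mean(benign_sizes), statistics.pvariance(benign_sizes)

for s in (318, 2900):                                     # detection phase
    p = chebyshev_bound(s, mu, var)
    print(s, round(p, 4), "anomalous" if p < 0.05 else "normal")
```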

4.3. Boosting Classification Algorithm (Second-Stage Detection)

In the previous stage, the combination of the EDA and CIT statistical approaches demonstrated some attack detection (true positive) ability. However, when the detection approach depends solely on normal behaviour as a benchmark, a large number of false detections is produced. To reduce the false detection rate, a data mining approach is proposed. The main intention is to re-examine the traffic, which has already been predicted in binary form as either anomalous or normal. Furthermore, additional features generated in the first stage, such as predicted_field, anomaly_field, and normal_fields, are fed into the data mining technique. These additional features improve the discriminative power of the classification algorithm, thereby improving the detection capability and reducing the false alarm rate. In the data mining approach, we propose to use an ensemble technique, namely, a boosting algorithm, which has the potential to improve the detection accuracy while minimising the false alarm rate, as it is proven to be more efficient than a single algorithm [57].

In this research we use the boosting algorithm named Logitboost as the metaclassifier for boosting classification. From the literature, we found that this algorithm is more suitable for handling noisy and outlier data than the well-known Adaboost algorithm. Consider a training dataset with $N$ samples divided into two classes (in this study the two classes are abnormal and normal). The two classes are defined as $y \in \{-1, +1\}$; that is, samples in class $+1$ are normal traffic while samples in class $-1$ are attack traffic. Let the set of training data be $T = \{(x_{1}, y_{1}), \ldots, (x_{N}, y_{N})\}$, where $x_{i}$ is the feature vector and $y_{i}$ is the target class. The Logitboost algorithm consists of the following steps [39]:

(1) Input the data set $T = \{(x_{i}, y_{i})\}$, where $i = 1, \ldots, N$ and $y_{i} \in \{-1, +1\}$. Input the number of iterations $M$.

(2) Initialise the weights $w_{i} = 1/N$, $i = 1, \ldots, N$; start with the committee function $F(x) = 0$ and the probability estimates $p(x_{i}) = 1/2$.

(3) Repeat for $m = 1, \ldots, M$:

(a) Calculate the weights and the working response
$$w_{i} = p(x_{i})\,\bigl(1 - p(x_{i})\bigr), \qquad z_{i} = \frac{y_{i}^{*} - p(x_{i})}{p(x_{i})\,\bigl(1 - p(x_{i})\bigr)},$$
where $y_{i}^{*} = (y_{i} + 1)/2 \in \{0, 1\}$.

(b) Fit the function $f_{m}(x)$ by a weighted least-squares regression of $z_{i}$ to $x_{i}$ using the weights $w_{i}$. In this research, we use Random Forest as the weak classifier to fit the data using the weights $w_{i}$.

(c) Update
$$F(x) \leftarrow F(x) + \tfrac{1}{2} f_{m}(x), \qquad p(x) \leftarrow \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}.$$

(4) Output the classifier:
$$\operatorname{sign}\bigl[F(x)\bigr] = \operatorname{sign}\Bigl[\sum_{m=1}^{M} f_{m}(x)\Bigr].$$

At this point, $\operatorname{sign}[F(x)]$ is a function that has two possible output classes, $+1$ (normal) and $-1$ (attack). One of the key factors exerting influence on the performance of the boosting algorithm is the construction of the weak classifier. The weak classifier fitted in step (3)(b) should be resistant to data overfitting and be able to manage data reweighting. Based on the successful performance of Random Forest (RF), we chose that algorithm as the weak classifier for Logitboost classification.
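A minimal sketch, not the authors' implementation, of these steps with scikit-learn's RandomForestRegressor standing in for the Random Forest weak learner (the paper's experiments use WEKA); the iteration count, tree count, and synthetic data are illustrative only.

```python
# Sketch of Logitboost with a Random Forest weak learner: at each round the
# weak learner is fit by weighted least squares to the working response z,
# following the steps listed above.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def logitboost_rf(X, y01, rounds=10):
    """y01 in {0, 1}: 1 = normal, 0 = attack (labelling here is an assumption)."""
    n = len(y01)
    F = np.zeros(n)                                    # committee function F(x)
    p = np.full(n, 0.5)                                # probability estimates p(x_i)
    learners = []
    for _ in range(rounds):
        w = np.clip(p * (1 - p), 1e-6, None)           # weights w_i (clipped for stability)
        z = (y01 - p) / w                              # working response z_i
        f = RandomForestRegressor(n_estimators=50, random_state=0)
        f.fit(X, z, sample_weight=w)                   # weighted least-squares fit
        learners.append(f)
        F += 0.5 * f.predict(X)                        # update committee
        p = 1.0 / (1.0 + np.exp(-2.0 * F))             # update probabilities
    return learners

def lb_predict(learners, X):
    F = sum(0.5 * f.predict(X) for f in learners)
    return (F > 0).astype(int)                         # sign of F(x)

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(int)
model = logitboost_rf(X, y, rounds=5)
print((lb_predict(model, X) == y).mean())              # training accuracy of the sketch
```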

5. Experiment and Results

The detection performance of the proposed unified approach when applied to both the DARPA 1999 and the ISCX 2012 datasets is presented in this section.

The experimental results were obtained using the Waikato Environment for Knowledge Analysis (WEKA) data mining tool, version 3.7 [58], and MySQL as the database management system. Three main performance metrics were used in this experiment to evaluate our proposed methods:

(a) False Alarm Rate (FAR): the proportion of benign traffic incorrectly detected as malicious traffic.
(b) Detection Rate (DR): the proportion of detected attacks among all attack data.
(c) Accuracy (ACC): the percentage of instances that are correctly predicted.

We use the publicly available DARPA 1999 and ISCX 2012 datasets, which represent traditional and modern intrusion datasets, to evaluate our methods. Details of these datasets can be found in [30, 46]. In the DARPA 1999 dataset, the week 4 training data and week 5 testing data consist of 136,962 instances of http traffic, as presented in Table 3. With regard to the ISCX 2012 dataset, 8,000 unique instances of http traffic were used as training data while a total of 34,957 instances of http traffic were used as testing data, as shown in Table 4.
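For reference, a small helper expressing the three metrics above in terms of confusion-matrix counts, with attack traffic taken as the positive class; the counts in the usage line are illustrative only, not results from the paper.

```python
# FAR, DR, and ACC written out with the usual confusion-matrix counts.
def ids_metrics(tp, tn, fp, fn):
    far = fp / (fp + tn) * 100                     # benign traffic raised as alarms
    dr = tp / (tp + fn) * 100                      # attacks actually detected
    acc = (tp + tn) / (tp + tn + fp + fn) * 100    # correctly predicted instances
    return far, dr, acc

# Illustrative counts only (not taken from Tables 3-13):
print(ids_metrics(tp=11786, tn=124000, fp=160, fn=512))
```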

In the preprocessing phase, we employed the HFS approach for both datasets to select the most prominent features. Through this process, the original 33 DARPA 1999 and the original 11 ISCX 2012 features were reduced to 10 and 4, respectively, as shown in Tables 5 and 6. This significant reduction of features has contributed to reducing the overall computational costs in this experiment.

Thereafter, the process continues to statistically measure the packet header with Euclidean Distance Analysis (EDA) and Chebyshev Inequalities. We used EDA to find the outliers in the testing data by calculating the distance between testing data and training data. In determining the finest threshold, the upper bound was computed using the CIT method. Table 7 shows the comparison results achieved in statistical analysis for both datasets.

Implementing SBAD alone was seen to generate a considerable number of false alarms. Upon closer investigation, we found that the false detections derived from masquerade traffic, where benign traffic shares the same behaviour as malign traffic and vice versa. Thus, a data mining technique, in particular a boosting classification algorithm, is employed as a complement to SBAD to reduce the misclassification rate.

Table 7 presents a performance comparison between SBAD and the proposed unified approach. The result shows that the proposed UIAD outperformed SBAD in terms of FAR, DR, and ACC, indicating that the anomaly detection component in the second stage is a good complement to the attack detection in the first stage. The implementation of 2-stage detection significantly reduced the false alarm rate from 5.1% and 3.5% to 0.13% and 0.08% for the two datasets, respectively. Although the detection rate of SBAD on the ISCX 2012 dataset is slightly better, by 0.15%, in terms of overall accuracy UIAD is slightly ahead, by 0.53%, along with a more than 43-fold reduction of the false alarm rate compared to SBAD.

To test the robustness of our proposed unified approach, we ensured that the attack traffic in the training and testing data was significantly different. In simple terms, this means that the sample attack traffic used in the training data is not itself part of the testing data. In addition, we made sure that the proportion of attack traffic in the training data was less than the attack traffic in the testing dataset. For example, in this research 2,432 and 3,714 instances of attack traffic were used in the training data to build the classification model, while 12,298 and 28,329 instances of attack traffic were available for detection in the DARPA 1999 and ISCX 2012 testing sets, respectively.

Table 8 lists the 6 types of attack available in weeks 4 and 5 of the DARPA 1999 dataset. The 4 attack types present in week 4 (the training dataset) were back, ipsweep, perl, and phf. Subsequently, in week 5 (the testing dataset), 5 attack types were identified: back, ipsweep, and perl, plus two new attacks named secret and tcpreset. Our unified approach successfully recognised 95.84% of the attack instances in the testing dataset. The attack categories with the highest detection rates are U2R (100.00%) and DATA (100%), followed by DoS (75.71%), and the lowest is PROBE (67.56%). Upon closer analysis, we noticed that the poor performance on PROBE was due to the nature of the attack itself, which shares similar characteristics with normal traffic behaviour. As the purpose of PROBE attacks is to gather system information and discover known vulnerabilities, the relevant traffic appears legitimate and is mostly classified as normal by the system. With regard to the DoS attack type, the low detection percentage of the "back" attack was caused by the lack of samples available in the training dataset; the sample was 52 times smaller than the corresponding attack traffic in the testing dataset. It is worth mentioning that our proposed unified approach successfully identified the 2 new attacks named "tcpreset" and "secret" that were only present in the testing dataset, which indicates that our proposed unified approach is capable of detecting unknown attacks.

In the ISCX 2012 dataset, the attack class is represented in binary form, as either normal or attack traffic; thus, analysis of specific attack types in this dataset is not possible. As shown in Table 9, with a limited number of attacks available in the training dataset, the system successfully recognised almost all the attacks in the testing dataset, with a 99.66% detection rate.

Tables 10 and 11 show the performance of our proposed unified model in terms of FAR, DR, and ACC compared to the previous methods tested on the DARPA 1999 and ISCX 2012 datasets. It should be noted that the comparisons are for reference only due to many researchers having used different proportions of traffic types, sampling methods, and preprocessing techniques. Although our proposed approach had achieved better performance in most of the cases, it cannot be claimed that the proposed method outperformed others. Nevertheless, our proposed approach has shown some detection ability with a robust performance in detecting unknown attack traffic.

In addition, we evaluated the performance of the proposed approach against several prominent state-of-the-art data mining algorithms used in IDSs. Tables 12 and 13 display a comparison of performance metrics between our proposed approach and seven other data mining algorithms previously used by researchers in IDSs, including Naïve Bayes [59], Support Vector Machine [60], Multilayer Perceptron [61], Decision Table [62], Decision Tree [63], Random Forest [64], and Adaboost [36].

To choose the best combination for the Logitboost classifier from a set of single classifiers in terms of accuracy, detection rate, and false alarm rate, five single classifiers were evaluated individually, as illustrated in Figures 10 and 11. This is a crucial aspect of our research because the chosen algorithm needs to be further reclassified with the ensemble approach for better detection performance. On the DARPA 1999 dataset, the accuracy, detection rate, and false alarm rate shown by RF are comparable with the other four classifiers. Although MLP showed slightly better performance than RF, the time taken by MLP to build a classification model is 84 times longer than that of RF. Meanwhile, on the ISCX 2012 dataset, RF outperformed every other single classifier by achieving 99.68% detection accuracy. Thus, in our unified detection approach, we chose RF as the base classifier for the Logitboost ensemble on both the DARPA 1999 and ISCX 2012 datasets.

To compare the performance of the Adaboost ensemble with RF against our unified approach, a further experiment was performed, as presented in Tables 12 and 13. Due to the higher complexity of our proposed method, Table 12 indicates that it took slightly longer than Adaboost + RF to build the classification model and to detect attacks: 0.82 seconds and 0.38 seconds longer for building and testing the classification model, respectively. Although our proposed method recorded higher computational complexity, the overall performance, including the detection rate and overall accuracy rate, shows improvements of 6.99% and 0.79%, respectively, over Adaboost + RF on DARPA 1999.

Table 13 presents the performance of our proposed unified approach on the ISCX 2012 dataset. A further comparison between Adaboost and our proposed unified approach shows that Adaboost performed slightly better in terms of accuracy and detection rate, with 27% less computational cost in building the classification model. However, the false alarm rate obtained by Adaboost is 6.5 times worse than that of our proposed model. From the aforementioned results, we conclude that our algorithm provides a comparable detection accuracy rate with a low false alarm rate, which is the most crucial property of IDSs in practice.

6. Conclusion and Future Work

Numerous anomaly intrusion detection studies have been made in the past. Nevertheless, achieving exceptionally low false alarm rates with high recognition capability for unseen attacks remains a major challenge. This paper presented the novel Unified Intrusion Anomaly Detection (UIAD) approach and its experimental results. The experiment synthesised statistical and data mining approaches to achieve better results. The model consists of three major parts: preprocessing, statistical measurement, and a boosting algorithm. The UIAD was evaluated using two publicly available labelled intrusion detection evaluation datasets (DARPA 1999 and ISCX 2012) to allow different integration testing environments. Initially, in the preprocessing phase, redundant and irrelevant features were filtered out by HFS to obtain the most prominent features. Following that, we deployed the EDA and Chebyshev Inequality methods to measure and determine the normality (benign or malicious) of the traffic characteristics. We then employed a data mining approach using the Logitboost classifier algorithm to improve the overall detection accuracy while reducing the false alarm rate. The combined detection of statistical analysis and data mining approaches demonstrated a promisingly reliable rate of anomaly-based intrusion detection. Individually, the statistical approach was capable of demonstrating some level of detection ability; however, the synergised combination of statistical and data mining approaches yielded better performance, particularly in reducing the false alarm rate to below 1%. The experimental results demonstrate that our proposed UIAD achieves comparable performance with other established state-of-the-art IDS algorithms. Moving forward, the final successful detection results will be transformed into signatures and stored in a blacklist database for future identification purposes. We believe that the detection time can be drastically reduced, since new incoming traffic can be matched against the benign/malicious signatures generated from previous detections. Moreover, the proposed unified approach can potentially be evaluated online using larger, as well as the latest, encrypted sets of traffic.

Conflicts of Interest

The authors declare that they have no conflicts of interest.