Abstract

In recent years, the number of malware samples and infected hosts has increased exponentially, causing great losses to governments, enterprises, and individuals. Traditional technologies struggle to detect, in a timely manner, malware that has been morphed, obfuscated, or modified, because they usually try to detect malware before the host is infected. Detecting hosts during malware infection can make up for this deficiency. Moreover, an infected host usually sends a connection request to the command and control (C&C) server using the HTTP protocol, which generates malicious external traffic. Thus, a host that generates malicious external traffic may be infected by malware. Against this background, this paper uses HTTP traffic combined with the eXtreme Gradient Boosting (XGBoost) algorithm to detect infected hosts in order to improve detection efficiency and accuracy. The proposed approach uses a template automatic generation algorithm to generate feature templates for HTTP headers and uses the XGBoost algorithm to distinguish malicious traffic from normal traffic. We conduct a performance analysis on a dataset that includes malware traffic from MALWARE-TRAFFIC-ANALYSIS.NET and normal traffic from UNSW-NB 15 to demonstrate that our approach is efficient. Experimental results show that the detection speed is about 1859 HTTP requests per second, the detection accuracy reaches 98.72%, and the false positive rate is less than 1%.

1. Introduction

With the rapid growth of the Internet and the popularity of computers, today’s computers face serious security problems, the biggest cause of which is the explosive growth of malicious code. Malicious code refers to computer code that is intentionally written by individuals or organizations to pose a security risk to a computer or network. It usually includes malicious shareware, adware, Trojans, viruses, worms, etc., each of which has different kinds of variants [1-5]. In the first half of 2018, China Internet Security News from 360 Internet Security Center shows that a total of 140 million new malicious programs were intercepted and an average of 795,000 new malicious programs were intercepted every day. Among them, the number of malicious programs on the PC side was 14,098,000, and an average of 779,000 new malicious programs were intercepted every day [6]. In the fourth quarter of 2017, McAfee Labs detected the highest number of new malware in history, with a total of 63.4 million new samples. McAfee Labs recorded an average of eight new malware samples per second, a significant increase from the four new samples per second recorded in the third quarter [7]. Malware not only causes huge economic losses to users; its rapid evolution also puts great pressure on anti-malware technology. With current technology, it has become difficult to detect malware before the host is infected.

Against this background, detecting malware-infected hosts in network traffic can make up for this shortcoming [8] because most malware communicates with externally hosted command and control (C&C) servers using the HTTP protocol after infecting the device. The C&C server is the control center that sends malware execution commands, and it is where malware collects data. After an attacker attacks the host with malware, the controlled host sends a connection request to the C&C server. The traffic generated by the connection is malicious external traffic. Currently, there are two main ways to detect malicious external traffic. One is to filter malicious domain names based on blacklists, and the other is to use rules to match malicious external traffic. Both solutions have limitations. The blacklist-based filtering scheme can only identify malicious external traffic to known malicious websites and cannot follow domain name changes. The rule-based scheme requires security practitioners to analyze samples one by one, which consumes considerable manpower, and it has difficulty detecting the malicious external traffic of variants.

As a supplement to the prior art, malicious traffic can be detected through machine learning. By using machine learning to discover what malicious traffic has in common and using that as the basis for detection, a good algorithm can greatly reduce the workload of security practitioners. Specifically, the contributions of this work are as follows:
(1) We propose an approach that combines machine learning with HTTP header templates to discover traffic involved in malware infection and develop it into the MalDetector system.
(2) We use a statistical technique to aggregate similar features of HTTP header fields from large-scale network traffic; the aggregated result is called an HTTP header template.
(3) We use the GridSearchCV function to tune the eXtreme Gradient Boosting (XGBoost) algorithm and verify its effectiveness on a dataset consisting of malicious external traffic, generated by running malicious samples from MALWARE-TRAFFIC-ANALYSIS.NET [9] in a sandbox, and the UNSW-NB 15 dataset [10].

The rest of this paper is organized as follows. We introduce the related work in Section 2. Section 3 presents an overview of the proposed approach. The process of template automatic generation from the HTTP header is described in Section 4. Section 5 presents the evaluation metrics and the experimental results. We conclude the paper in Section 6.

2. Related Work

At present, malware traffic identification approaches based on HTTP traffic mainly fall into two groups [11-24]: one is based on the statistical features of requests and responses [11-16], and the other is based on the content of the HTTP packets [17-26].

2.1. The Request and Response Statistical Features

This approach mainly analyzes behavioral characteristics such as the time interval, quantity, and packet size of HTTP requests and responses to model malicious behavior and identify malware traffic. Perdisci et al. [11] developed a novel network-level behavioral malware clustering system. They performed coarse-grained clustering through statistical features, such as the total number of HTTP requests, the number of GET requests, the number of POST requests, the average length of the URLs, the average number of parameters in the request, the average amount of data sent by POST requests, and the average response length. Then, they performed fine-grained clustering by calculating the difference in URL structure between two malware samples. Finally, they merged fine-grained clusters of malware variants that behave similarly enough. Their work can unveil similarities among malware samples that may not be captured by current system-level behavioral clustering systems. Ogawa et al. [12] extracted new features such as HTTP request interval, body size, and header bag-of-words from HTTP request/response pairs, calculated the cluster appearance ratio per communication host pair, and identified malware-originated communication host pairs. However, the identification approach based on the request and response statistical features is limited to malware samples that perform some interesting actions (i.e., malicious activities) during the execution time T. The identification approach based on the content of HTTP requests and responses can overcome this limitation.

2.2. The Content of HTTP Packets

This approach analyzes the content of HTTP requests and responses, extracts and processes the relevant field information, and combines it with a machine learning algorithm to identify malware traffic. Zhang et al. [17, 18] used a learning-based approach to discover network dependencies with the help of HTTP request features and thus detect malicious traffic. Srivastava et al. [19] developed a system called ExecScent, which is closest to this work. They used all the HTTP header fields to detect botnet traffic. They manually created templates for fields such as URL-Path, Query, and User-Agent and formatted them using regular expressions. Zhang et al. [20] proposed a method that used the User-Agent field to detect malicious external traffic generated by malware. They used regular expressions to format HTTP header information and used operating system fingerprinting to identify whether the user agent was fake and thereby infer whether there was a malware infection. Grill and Rehak [21] also used the User-Agent field to detect the presence of malicious external traffic. They found that all User-Agent field information can be divided into five categories: legitimate user browser information, null, specific, spoofed, and inconsistent. According to their findings, some malware deliberately forged requests to look as if they were sent from a web browser, making it difficult to detect malicious outbound traffic from the User-Agent field. Li et al. [22] proposed MalHunter based on behavior-related statistical characteristics. They detected malware communication patterns from three types of features: character distribution of the URL, HTTP header fields, and HTTP header sequence. However, these approaches are based either on a single field or on all fields, and their feature validity is low.

Moreover, Zhang et al. [23] presented a system SMASH that uses unsupervised data mining methods to detect various attack activities and malicious communication activities, focusing on detecting malicious HTTP activity from the perspective of server-side communication. Mekky et al. [24] put forward a method for identifying HTTP redirected malicious links. They built per-user chains from passively collected traffic and extracted new statistical features from them to capture the inherent characteristics of malicious redirect cases. The supervised decision tree classifier is then applied to identify malicious links. Liu et al. [25] proposed an identification approach by analyzing HTTP connections established by clients in a monitored network and combining stream classification with graph-based fractional propagation methods to identify previously undetected Internet Service Provider (ISP) networks.

3. HTTP-Based Infected Host Detection Approach

The proposed HTTP-based infected host detection system includes four modules: HTTP traffic filtering, header feature extraction, template automatic generation, and infected host detection. Figure 1 gives an overview of the framework of our proposed infected host detection approach using HTTP traffic.

3.1. HTTP Traffic Filtering and Header Feature Extraction

We save only the HTTP header to reduce the amount of stored traffic. We also select the important information from the HTTP header for further analysis. The number of distinct HTTP header fields can be roughly 10,000. Moreover, unrelated features may expose the machine learning model to the risk of overfitting. Rare fields do not generalize, so our selection criterion is to exclude fields that appear fewer than 10 times, or not at all, in the training data.
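The selection criterion can be sketched in a few lines of Python. This is a minimal illustration, not the MalDetector implementation: the function names `select_fields` and `project`, the dictionary representation of a request, and the toy field `X-Rare` are assumptions made for the example; only the threshold of 10 occurrences comes from the text.

```python
from collections import Counter

MIN_OCCURRENCES = 10  # threshold stated in the text


def select_fields(requests, min_occurrences=MIN_OCCURRENCES):
    """Return the header field names seen in at least min_occurrences requests."""
    counts = Counter()
    for headers in requests:
        # Count each field once per request, regardless of its value.
        counts.update(set(headers))
    return {name for name, seen in counts.items() if seen >= min_occurrences}


def project(headers, selected):
    """Keep only the selected fields of one request."""
    return {name: value for name, value in headers.items() if name in selected}


# Toy training data: "Host" is common, the hypothetical "X-Rare" appears 3 times.
sample = [{"Host": "example.com"} for _ in range(12)]
for headers in sample[:3]:
    headers["X-Rare"] = "1"

selected = select_fields(sample)
```

Fields that never occur in the training data are excluded automatically, because they never enter the counter.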

In addition, we mainly focus on the detection of malware that leverages HTTP as the primary channel to communicate with the C&C server or to launch attack activities. Thus, our approach mainly focuses on HTTP requests rather than responses. If the C&C server is temporarily offline or changes its response content, there is little impact on our detection capabilities. Therefore, the selected fields are URI, Host, User-Agent, Request-Method, Request-Version, Accept, Accept-Encoding, Connection, Content-type, Cache-Control, Content-length, and some identification fields like Frame-time, srcIP (source IP), srcPort (source port), dstIP (destination IP), and dstPort (destination port).

Table 1 lists the description of the selected fields. The reason for selecting them is that they are often used in HTTP traffic and may be helpful in distinguishing legitimate traffic and malicious traffic.

3.2. Template Automatic Generation

When malware communicates with externally hosted C&C servers, malware developers typically use custom formats to construct packets. The network traffic generated by malware belonging to the same family is usually similar. Therefore, we use statistical techniques to aggregate similar features of the HTTP header fields, that is, to generate templates for malicious traffic, and then use the templates to detect new malicious traffic. A template is a string in which the literal characters represent the part of an HTTP header field value that stays the same, and ∗ represents the parts of the value that differ. Templates are generated to capture the variability of the words constituting the HTTP header fields and aim to compress their information. The template automatic generation module consists of three steps: scoring, clustering, and generating templates [27], which are explained in detail in Section 4.
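To make the template notion concrete, the following sketch checks whether a header value fits a template in which an asterisk marks a variable part. It is an illustration only: the helper names and the sample User-Agent template are hypothetical, and the text does not say how MalDetector performs the comparison.

```python
import re


def template_to_regex(template):
    """Compile a template in which '*' stands for any variable part."""
    # Escape the literal pieces so that '.', '(' or '?' match themselves.
    literal_parts = [re.escape(part) for part in template.split("*")]
    return re.compile(".*".join(literal_parts))


def matches(template, value):
    """Return True if the whole value fits the template."""
    return template_to_regex(template).fullmatch(value) is not None


# Hypothetical template for a User-Agent field.
template = "Mozilla/4.0 (compatible; MSIE *; Windows NT *)"
hit = matches(template, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1)")
miss = matches(template, "curl/7.58.0")
```

Escaping the literal parts matters for URIs and User-Agent strings, which routinely contain characters that are special in regular expressions.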

3.3. Infected Host Detection

XGBoost [28] is popular among winners of Kaggle competitions because of its parallelization, distributed computing, out-of-core computing, and cache optimization of data structures and algorithms. Thus, we use the XGBoost algorithm to classify malicious traffic and normal traffic in this work.

4. Template Automatic Generation

This section describes how the template automatic generation algorithm works.

4.1. Scoring

We first calculate the score for each value of the selected HTTP header fields by using the score calculation method, and then sort each selected HTTP header field’s values according to their scores. Each field in the HTTP header is divided by the following four separators: space, “/”, “=”, and “,”. Thus, we split each selected HTTP header field at the separators and then calculate a score for each resulting value as a ratio of occurrence counts. For a value $v$ in the field F, its score $S(v)$ can be calculated using

$$S(v) = \frac{n(v_{pos})}{n(pos)},$$

where $pos$ is the position of the value in the field F and len(F) is the number of values in the field F. For example, if F = {foo, bar, baz, quz} and $v$ = bar, then $pos = 2$ and len(F) = 4. n(X) is the number of times that X appears in all the HTTP headers, $n(v_{pos})$ is the number of times that $v$ appears at position $pos$ of the field in all data, and $n(pos)$ is the number of times that position $pos$ of the field appears in all the HTTP headers. As shown in Figure 2, the score of “rv: 19.0” is 0.33.
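The scoring step can be sketched as follows, under the assumption that the score of a value is the number of times it occurs at its position divided by the number of times that position occurs at all, as the definitions above describe. The function names and the three sample User-Agent strings are made up for illustration.

```python
import re
from collections import Counter

SEPARATORS = re.compile(r"[ /=,]")  # the four separators named in the text


def tokenize(value):
    """Split one header field value into its non-empty parts."""
    return [part for part in SEPARATORS.split(value) if part]


def score_field(values):
    """Score every (position, token) pair observed in one header field.

    Score = occurrences of the token at that position
            / number of values that have that position.
    """
    token_counts = Counter()     # plays the role of n(v_pos)
    position_counts = Counter()  # plays the role of n(pos)
    for value in values:
        for pos, token in enumerate(tokenize(value), start=1):
            token_counts[(pos, token)] += 1
            position_counts[pos] += 1
    return {key: count / position_counts[key[0]]
            for key, count in token_counts.items()}


# Hypothetical values of one field across three requests.
user_agents = ["Mozilla/5.0 rv=19.0", "Mozilla/5.0 rv=24.0", "Mozilla/4.0 rv=31.0"]
scores = score_field(user_agents)
```

In this toy input, "Mozilla" occupies position 1 in every value and scores 1.0, while "19.0" occupies position 4 in one of three values and scores about 0.33. Stable tokens therefore rank high and variable tokens rank low.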

4.2. Clustering

We use the idea of the DBSCAN [29, 30] algorithm to cluster the values of the selected HTTP header fields. In the selected HTTP header field, when the score of the next value differs from the score of the previous value by less than δ, the next value is added to the current cluster; otherwise, the next value starts a new cluster. This process is repeated until all values have been assigned to a cluster. The DBSCAN algorithm itself requires two parameters: the scan radius (eps) and the minimum number of points (minPts). Its working process is as follows.

Start with an unvisited point and find all nearby points within distance eps (inclusive). If the number of nearby points is not smaller than minPts, the current point forms a cluster with its nearby points, and the starting point is marked as visited. Then, recursively, all the points in the cluster that are not marked as visited are processed in the same way, thereby expanding the cluster. If the number of nearby points is smaller than minPts, the point is temporarily marked as a noise point. Once the cluster is fully expanded, i.e., all points within the cluster are marked as visited, the same procedure is used to process the remaining unvisited points.
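The working process above can be sketched for one-dimensional data such as scores. This is a plain textbook-style DBSCAN written for illustration, with a brute-force neighbour search; the function name `dbscan_1d` and the sample points are assumptions, and it is not the clustering code used by the system.

```python
def dbscan_1d(points, eps, min_pts):
    """Label each 1-D point with a cluster id (0, 1, ...) or -1 for noise."""
    noise, unvisited = -1, None
    labels = [unvisited] * len(points)

    def neighbours(i):
        # All points within eps of points[i], including the point itself.
        return [j for j, p in enumerate(points) if abs(p - points[i]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = noise  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            nearby = neighbours(j)
            if len(nearby) >= min_pts:
                queue.extend(nearby)  # core point: keep expanding the cluster
    return labels


labels = dbscan_1d([0.10, 0.12, 0.15, 0.90, 0.92, 0.50], eps=0.06, min_pts=2)
```

With these inputs the first three points form one cluster, the next two form another, and the isolated point 0.50 stays marked as noise.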

Finally, we describe our clustering approach, which combines the scoring method with the idea of the DBSCAN algorithm, as follows.

First, we introduce two parameters: δ (δ ≥ 0) and β (0 < β < 1). δ is the minimum distance between two clusters, β × len(F) is the minimum number of points in a cluster, and len(F) is the number of values in a field. In this work, δ is set to 0.1 and β is set to 0.5.

Then, we sort the words in descending order of score. When the score of the next word differs from the mean score of the current cluster by less than δ, the next word is added to the current cluster. Otherwise, the next word starts a new cluster, which becomes the current cluster. This process is repeated until all words have been assigned to a cluster.
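A minimal sketch of this sequential clustering follows, assuming that each word arrives with a precomputed score. The function name `cluster_by_score` and the sample scores are hypothetical; δ = 0.1 is the value given in the text.

```python
def cluster_by_score(scored_words, delta=0.1):
    """Group (word, score) pairs into clusters of similar score."""
    ordered = sorted(scored_words, key=lambda pair: pair[1], reverse=True)
    clusters = []
    for word, score in ordered:
        if clusters:
            current = clusters[-1]
            mean = sum(s for _, s in current) / len(current)
            if abs(score - mean) < delta:
                current.append((word, score))  # close to the cluster mean
                continue
        clusters.append([(word, score)])  # start a new current cluster
    return clusters


# Hypothetical scores for the words of one header field.
scored = [("Mozilla", 1.0), ("5.0", 0.95), ("rv", 0.97), ("19.0", 0.33), ("24.0", 0.30)]
clusters = cluster_by_score(scored)
cluster_words = [[word for word, _ in cluster] for cluster in clusters]
```

Because the words are processed in descending score order, only the most recent cluster needs to be checked, which makes the procedure a single pass after sorting.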

4.3. Generating Templates

The results of the clustering are filtered to preserve only the clusters that contain more than β × len(F) values, and the values in the remaining clusters are replaced with “∗”. Here, δ (δ ≥ 0) is the minimum distance between two clusters, β × len(F) (0 < β < 1) is the minimum number of points in a cluster, and len(F) is the number of values in the field. The overall generation process is shown in Figure 2.
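The filtering step can be sketched as below. The sketch takes the words of one field value and an already computed clustering as input; joining the pieces with a space and collapsing consecutive variable words into a single asterisk are simplifying assumptions of this example, not details given in the text.

```python
def generate_template(words, clusters, beta=0.5):
    """Build a template for one field value.

    words    -- the words of the field value in their original order
    clusters -- lists of words that were grouped by similar score
    """
    threshold = beta * len(words)  # beta x len(F)
    kept = set()
    for cluster in clusters:
        if len(cluster) > threshold:
            kept.update(cluster)  # large cluster: its words stay literal
    template = []
    for word in words:
        piece = word if word in kept else "*"
        # Assumption: a run of variable words collapses into one '*'.
        if piece == "*" and template and template[-1] == "*":
            continue
        template.append(piece)
    return " ".join(template)


# Hypothetical field value and clustering.
words = ["Mozilla", "5.0", "rv", "19.0"]
clusters = [["Mozilla", "rv", "5.0"], ["19.0"]]
template = generate_template(words, clusters)
```

With β = 0.5 and four words, a cluster must hold more than two words to survive, so the three stable words are kept and the version number becomes a wildcard.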

The generated HTTP header field information and HTTP template are shown in Table 2.

We also performed statistics on the templates generated by the training data. The statistical results are shown in Figure 3.

As can be seen from Figure 3, the number of templates for malicious traffic is generally several times larger than the number of templates for normal traffic. The URI and User-Agent fields generate the most templates, so malicious traffic can probably be distinguished mainly by the templates of these fields. We also observe that some fields generate no malicious traffic templates at all, which suggests that the HTTP requests of malicious traffic are short and include only a few fields. A likely reason is that normal HTTP requests are usually made through a browser, which fills in many fields, whereas malicious traffic is a connection made to the C&C server by malware, whose data format is usually constructed by the malware developer, so the HTTP request message is shorter.

5. Experiments and Results

This section introduces the dataset, the experimental setup, the performance metrics, and the obtained results.

5.1. Dataset

The malware traffic used in this work is from MALWARE-TRAFFIC-ANALYSIS.NET [9]. We collect malicious external traffic by running malicious samples collected from June 2013 to December 2017 in the sandbox and use SecurityOnion (a tool for network security monitoring) to detect traffic and get the result. The normal traffic samples are from the UNSW-NB 15 dataset shared by the Cyber Range Lab of the Australian Centre for Cyber Security (ACCS) in 2015 [10]. They used the tcpdump tool to capture 100 GB of raw traffic (PCAP files) for evaluating network intrusion detection systems and provided a labeled dataset. The labeled file contains the time period, the source port, the source IP address, the destination port, the destination IP address, the protocol type, and other information of the threat traffic, as shown in Table 3. There are 373,864 HTTP request records and only 6,401 malicious traffic records in the 100 GB of raw traffic data. We remove malicious HTTP traffic based on source IP, destination IP, source port, destination port, and the time period (from the start time to the last time) in the given labeled file. When the protocol type is HTTP and the time period, source port, source IP, destination port, and destination IP address all match, the traffic is labeled as malicious traffic.
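The matching rule used to separate malicious from normal UNSW-NB 15 requests can be sketched as follows. The record layouts, field names, and addresses are hypothetical and do not reflect the actual column names of the labeled file; only the matching conditions come from the text.

```python
from typing import NamedTuple


class ThreatRecord(NamedTuple):
    start: float   # start time of the threat flow (epoch seconds)
    last: float    # last time of the threat flow (epoch seconds)
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


class HttpRequest(NamedTuple):
    time: float
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def is_malicious(request, threats):
    """True if the request falls inside a labeled HTTP threat flow."""
    endpoint = (request.src_ip, request.src_port, request.dst_ip, request.dst_port)
    for threat in threats:
        if (threat.protocol.lower() == "http"
                and threat.start <= request.time <= threat.last
                and endpoint == (threat.src_ip, threat.src_port,
                                 threat.dst_ip, threat.dst_port)):
            return True
    return False


# Hypothetical labeled entry and two requests.
threats = [ThreatRecord(100.0, 200.0, "10.0.0.5", 40000, "10.0.1.9", 80, "http")]
inside = HttpRequest(150.0, "10.0.0.5", 40000, "10.0.1.9", 80)
outside = HttpRequest(250.0, "10.0.0.5", 40000, "10.0.1.9", 80)
normal = [r for r in (inside, outside) if not is_malicious(r, threats)]
```

The linear scan is enough for a sketch; for hundreds of thousands of requests, indexing the threat records by the address and port tuple would avoid rescanning the whole list.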

We set the ratio of the training set to the testing data as 7 : 3. Thus, the dataset in the experiment is shown in Table 4, which consists of 34,239 malicious HTTP requests and 35,481 normal HTTP requests.

5.2. Experimental Setup

The system was implemented in Python 3.5, and all experiments were performed on an off-the-shelf server with 64 GB of RAM and a 6-core processor. To evaluate the true positive and false positive rates of our detection approach, we tune the model parameters on the training set. The initial key parameters of the XGBoost model are shown in Table 5.

Table 5 shows that the accuracy of cross-validation of the training set with the initial parameters is 99.5%, but the accuracy of the testing set is only 92.89% due to over-fitting. In order to further improve the accuracy of the prediction, we further adjust the parameters of the XGBoost algorithm.

We use the GridSearchCV function in the scikit-learn [31] package to tune the parameters; it traverses the value range of each parameter. We tune five key parameters in three steps:
(1) We first adjust the two parameters max_depth and min_child_weight, which play a decisive role in the model. The value range of max_depth is set to [4, 6, 8, 10, 12]. The value range of min_child_weight is very large and seriously affects the experimental results. If the model is over-fitting, the value of min_child_weight should be increased. Thus, its value range is set to [1, 10, 100, 1000]. The results of the parameter adjustment are shown in Table 6. The experimental results show that the model performs optimally when max_depth = 10 and min_child_weight = 1.
(2) Based on the adjusted max_depth and min_child_weight parameters, we adjust the parameter gamma, which takes part in the pruning of the decision tree. The larger the value of this parameter, the more conservative the model. Here, we set the value range of gamma to [0∼8]. The results of the parameter adjustment are shown in Table 7. The experimental results show that the model performs best when gamma = 0.
(3) Finally, we adjust the two parameters subsample and colsample_bytree, which control the proportion of samples and features used. If the sampling ratio is set too small, the accuracy may be reduced. Here, the value range of subsample is set to [0.7∼1], and the value range of colsample_bytree is also set to [0.7∼1]. The results of the parameter adjustment are shown in Table 8. The experimental results show that the model performs best when subsample = 0.8 and colsample_bytree = 0.8.
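The staged traversal can be illustrated without any machine learning library. In the sketch below, `grid_search` does the exhaustive enumeration that a grid search performs, while `toy_evaluate` is only a stand-in for cross-validated accuracy that is rigged to peak at the values reported above; it trains no model. The step sizes inside the gamma and sampling ranges are assumptions, since the text gives only the end points.

```python
from itertools import product


def grid_search(evaluate, fixed, grid):
    """Try every combination in grid on top of fixed; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(grid[name] for name in names)):
        params = {**fixed, **dict(zip(names, combo))}
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    # Stand-in for cross-validated accuracy; NOT a real model.
    # Parameters not yet tuned contribute nothing to the score.
    target = {"max_depth": 10, "min_child_weight": 1, "gamma": 0,
              "subsample": 0.8, "colsample_bytree": 0.8}
    return -sum(abs(params.get(key, value) - value) for key, value in target.items())


stages = [
    {"max_depth": [4, 6, 8, 10, 12], "min_child_weight": [1, 10, 100, 1000]},
    {"gamma": [0, 2, 4, 6, 8]},  # range from the text, step assumed
    {"subsample": [0.7, 0.8, 0.9, 1.0], "colsample_bytree": [0.7, 0.8, 0.9, 1.0]},  # step assumed
]

best = {}
for grid in stages:
    # Each stage keeps the winners of the previous stages fixed.
    best, _ = grid_search(toy_evaluate, best, grid)
```

Tuning in stages evaluates 20 + 5 + 16 = 41 combinations here instead of the 1,600 that a single joint grid over the same values would require, at the cost of possibly missing interactions between parameters from different stages.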

5.3. Evaluation Metrics

The evaluation metrics of our proposed infected host detection approach using malicious external HTTP traffic are expressed as follows. TP is the number of malicious HTTP requests that are recognized as malware HTTP requests, TN is the number of normal HTTP requests that are recognized as normal HTTP requests, FP is the number of normal HTTP requests that are mistaken for malware HTTP requests, and FN is the number of malicious HTTP requests that are incorrectly identified as normal HTTP requests. The higher the value of precision, recall, and F1, the better the recognition effect of the infected host detection approach.
(1) Accuracy: $$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$
(2) ROC curve, whose horizontal axis is FPR and vertical axis is TPR, where $$FPR = \frac{FP}{FP + TN}$$ and $$TPR = \frac{TP}{TP + FN}$$
(3) PRC curve, whose vertical axis is precision and horizontal axis is recall, where $$Precision = \frac{TP}{TP + FP}$$ and $$Recall = \frac{TP}{TP + FN}$$
(4) F1 score: $$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
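These metrics follow directly from the four confusion-matrix counts, using their standard definitions. The sketch below is illustrative; the counts in the example are made up and are not taken from the experiments.

```python
def metrics(tp, tn, fp, fn):
    """Compute the evaluation metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # identical to the true positive rate (TPR)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "fpr": fp / (fp + tn),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts: 100 malicious and 100 normal requests.
m = metrics(tp=90, tn=95, fp=5, fn=10)
```

The function assumes that every denominator is nonzero; a production version would need to handle the case in which no request is predicted malicious.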

5.4. Experimental Results

When the ratio of the number of HTTP requests in the training set and testing set is 7 : 3, the experimental results are shown in Table 9.

The accuracy on the testing set is 98.72%, and the false positive rate is less than 1%. The total testing time is about 7 s. Therefore, the proposed approach can quickly analyze the network traffic and conclude whether the host is infected by malware so that the user can respond as soon as possible. The PRC curve for the matching threshold is shown in Figure 4. It can be seen that the algorithm maintains a high precision as the recall rate increases. Finally, 0.8 is selected as the matching threshold. At this threshold, the precision of the algorithm is 93.56%, the recall rate is 97.14%, and the F1 value is 0.9532.

To better validate our proposed approach, we also compare our approach to the other two methods of Ogawa et al. [12] and Li et al. [22]. We reproduced these two comparison experiments using our own data set. The experimental results are shown in Table 10.

Table 10 shows that the ACC, P, R, and F1 of our proposed approach are the largest, and they are 0.9827, 0.9356, 0.9714, and 0.9532, respectively. Therefore, our proposed approach, which uses XGBoost with statistical HTTP header templates, detects HTTP malware traffic better than the methods that combine HTTP header features with machine learning. The main reason is that Ogawa et al.’s approach and Li et al.’s approach are based either on a single field or on all fields, so their feature validity is low. Our proposed approach uses statistical techniques to aggregate similar features of the malicious HTTP header fields. Thus, our approach characterizes malware traffic more effectively, which further improves the accuracy of malware HTTP traffic recognition.

In addition, we select 10%, 20%, 30%, …, 90% of the samples as the training set, set the matching threshold to 0.8, and test on the remaining samples. The correct detection rate and false positive rate of malicious traffic and normal traffic are measured separately, and the results are shown in Figure 5. It can be seen that the detection rate of normal HTTP requests stays above 99%. For malicious samples, the detection rate depends on the diversity of the model. Even when the training set is only 10% of the samples and the model has insufficient data, the algorithm can still detect 77.65% of malicious traffic, indicating that the algorithm generalizes well to malicious traffic variants.

We also change the malicious traffic and normal traffic ratio in our training set and testing set. The experimental results are shown in Table 11.

The accuracy under all the tested malicious traffic ratios remained above 90%. However, the model has high precision but low recall when malicious traffic accounts for 10% or 20% of the data. The main reason is that the proportion of malicious traffic is too small, resulting in insufficient training of the model. The results show that, to build a machine learning model that correctly identifies malicious traffic, the amounts of malicious traffic and normal traffic need to be kept roughly balanced. In real-world samples, malicious traffic accounts for less than 1% of the data. Thus, it is necessary to further process the samples, for example by subsampling or oversampling, to increase the proportion of malicious traffic and thereby improve detection accuracy.
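As one possible form of the oversampling mentioned above, the following sketch duplicates randomly chosen minority-class samples until both classes are the same size. It is a generic random oversampling routine written for illustration, not a step that the experiments in this paper performed; the function name and the toy data are assumptions.

```python
import random


def oversample(samples, labels, minority_label, seed=0):
    """Duplicate random minority samples until both classes have equal size."""
    minority = [s for s, y in zip(samples, labels) if y == minority_label]
    if not minority:
        raise ValueError("no samples carry the minority label")
    majority_count = len(labels) - len(minority)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    missing = max(0, majority_count - len(minority))
    extra = [rng.choice(minority) for _ in range(missing)]
    return samples + extra, labels + [minority_label] * len(extra)


# Toy data: one malicious request (label 1) among nine normal ones (label 0).
samples = ["bad"] + ["ok%d" % i for i in range(9)]
labels = [1] + [0] * 9
balanced_samples, balanced_labels = oversample(samples, labels, minority_label=1)
```

Oversampling should be applied to the training set only, so that duplicated requests cannot leak into the testing set and inflate the measured accuracy.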

5.5. MalDetector System Testing

We also use the malicious traffic samples that do not exist in the training data and testing data to verify if the system has the ability to detect new malware and its variants. The selected malicious traffic samples have the same source as the training data, both of which are MALWARE-TRAFFIC-ANALYSIS.NET.

5.5.1. Loki-Bot

Loki-Bot [32] uses a malicious website to push fake “Adobe Flash Player,” “APK Installer,” “System Update,” “Adblock,” “Security Certificate,” and other application updates to induce user installation. The Loki-Bot malware is a bank hijacking Trojan, a variant of the BankBot Trojan. The traffic sample of running Loki-Bot and the testing result using MalDetector are shown in Figure 6. The experimental results show that MalDetector detects all the malicious HTTP traffic of Loki-Bot.

5.5.2. Emotet

Emotet [33] is a new type of banking Trojan in Germany. The traffic sample comes from a new variant of Emotet that appeared in September 2017. It is able to evade security detection and cannot be recognized by antivirus software. The traffic sample of running Emotet and the testing result using MalDetector are shown in Figure 7. The experimental results show that MalDetector detects all the malicious HTTP traffic of Emotet.

6. Conclusion

The diversification of malware and the growing complexity of its technologies have brought new challenges to cybersecurity. Unfortunately, traditional rule-based malware traffic detection methods are unable to detect malware variants. Machine learning-based methods can make up for this defect, and most malware uses the HTTP protocol to send malicious external traffic to the C&C server. Thus, we propose an approach to detect infected hosts using HTTP traffic combined with a machine learning algorithm. Because we extract common templates from the HTTP traffic headers, the approach still works for traffic generated by obfuscated malware. We also use the popular XGBoost algorithm to detect infected hosts, which has the advantages of high efficiency and high accuracy. The experimental results show that the accuracy of the method reaches 98.72% and the false positive rate is less than 1%, where the experimental data is from MALWARE-TRAFFIC-ANALYSIS.NET and UNSW-NB 15. We also used two real samples, Loki-Bot and Emotet, to verify the effectiveness of the MalDetector system. In the future, we plan to combine the approach with malware dynamic analysis to further improve its detection accuracy. Furthermore, some malware uses HTTPS to hide its content from the analyzer, which further reduces the chance of detection. Because the header information of HTTPS traffic is encrypted, our method cannot be applied to it. In the future, we will consider new fields and combine them with DNS traffic to refine the templates for anomaly-based detection of malware infection.

Data Availability

The experimental data were collected and synthesized by ourselves. It has not been published online yet.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Key Research and Development Project (Grant no. 2016QY04W0800), the National Defense Innovation Special Zone Program of Science and Technology (Grant no. JG2019055), and the National Natural Science Foundation of China (Grant nos. 61902262 and 61572115).