Abstract

Mobile malware poses a great challenge to mobile devices and mobile communication. With the explosive growth of mobile networks, it is significant to detect mobile malware for mobile security. Since most mobile malware relies on the networks to coordinate operations, steal information, or launch attacks, evading network monitor is difficult for the mobile malware. In this paper, we present an N-gram, semantic-based neural modeling method to detect the network traffic generated by the mobile malware. In the proposed scheme, we segment the network traffic into flows and extract the application layer payload from each packet. Then, the generated flow payload data are converted into the text form as the input of the proposed model. Each flow text consists of several domains with 20 words. The proposed scheme models the domain representation using convolutional neural network with multiwidth kernels from each domain. Afterward, relationships of domains are adaptively encoded in flow representation using gated recurrent network and then the classification result is obtained from an attention layer. A series of experiments have been conducted to verify the effectiveness of our proposed scheme. In addition, to compare with the state-of-the-art methods, several comparative experiments also are conducted. The experiment results depict that our proposed scheme is better in terms of accuracy.

1. Introduction

Thanks to the advance of the mobile communication networks, recent years are able to witness the opening of the mobile era. And, the smart IoT (Internet of Things) devices and smart applications may continue increasing in the 5G\6G communication networks. However, continuous improvement in mobile device hardware and mobile communication technologies has not only led to a highly interconnected world but also a world grown highly vulnerable. The explosive growth of mobile communication brings substantial burden to the mobile security management. According to a recent report [1], the number of apps in the Google Play Store has risen from 16000 in December 2009 to more than 2 million in February 2016. And, the mobile traffic amount has reached 3.7 exabytes per month in 2015. However, the increase in mobile application is highly impaired by the prevalent malware. Among different operating systems, Android becomes the most popular platform due to its open architecture [2]. Unfortunately, mobile devices running with the Android system have gradually become the main target of attackers and are infected by malicious apps. This circumstance reveals the urgency of enforcing mobile communication network security.

To address the problem of the mobile malware, many researchers have to pay attention to the detection of malware apps. Evading traffic-based methods is difficult, as malware usually launches malicious behaviors through network connections, including receiving commands from servers, transferring the stolen data, and so on. In addition, detecting malware traffic has many advantages in terms of ease of deployment. Thus, detecting malware based on network traffic has the potential to greatly reduce the threat of malicious activity in the mobile communication networks. For this reason, researchers start exploring new solutions for mobile malware detection based on network traffic.

Network-traffic-based methods can be divided into three types [3], namely, network signature-based analysis, lexical-feature-based analysis, and statistical-feature-based analysis, respectively. Signature-based detection collects domains and signatures from known malware and compares them against suspicious pieces of code in order to determine whether they are malicious or benign. However, signature-based methods lack adaptability. It can detect known attacks but has limited ability to handle novel ones. The statistic-based methods detect the malware traffic by utilizing the header information of the packets, for example, IP layer information or TCP layer information. The commonly used statistical features contain packet size, duration of flow, the radio of the packets, and so on. And, these features are combined with classifiers, e.g., Random Forest, K-Nearest Neighbor, and Decision Tree. However, the performance of these methods highly depends on the manually engineered features and some private traffic information, hence dramatically limiting their accuracy and generalizability.

The lexical-based methods detect the malware traffic according to the text analysis of the network traffic. On the one hand, comparing with signature-based methods, it can adapt swiftly to detect new attacks. On the other hand, with specific fields of traffic, it has a fine-grained characterization of flows than statistical features, which greatly improves malware detection performance. However, these methods typically consider plaintext traffic such as HTTP traffic. In the recent research [3], the researchers have decrypted the encrypted TLS traffic, before classifying the traffic. In addition, manually analyzing and extracting protocol feature domains for mobile malware detection has limitations. The extracted features fail to detect mobile malware traffic when new protocols are employed by the malware. Because of the existence of various types of the network application protocols, there are no fixed domains for network traffic flows. It is difficult to manually extract meaningful semantic features for all protocols separately. As it requires a lot of manual resources, and when new protocols appear, existing semantic features failed to cover them. To address this problem, we employ the neural network model to detect mobile network traffic in an end-to-end way.

In this paper, we propose an N-gram, semantic-based neural method to detect mobile malware network traffic. In our scheme, we transfer each five-tuple-based network flow into the text file that contains several domains. Each domain in the text consists of 20 words and each word in the text is a byte content in the network flow. To analyze the network traffic semantic, the proposed neural model processes the network traffic flow representation in two stages. It first produces continuous domain vectors from word representations with domain composition. These domain representations are treated as inputs of network flow composition to get network flow representation. Network flow representations are then used as features for flow-level malware traffic detection. And, each flow has an associated label, which is required to train the algorithm as we employ supervised method.

We employ the deep learning method to process the generated texts to find the flows generated by the mobile malware, as deep learning methods can avoid manually analyzing and selecting features, and these methods have a higher learning capability compared to traditional machine learning methods like Random Forest and Support Vector Machine [4]. One of the main drivers of this work is to assess the applicability of deep learning advances to the lexical-based mobile malware detection problem. Therefore, we have studied the adequacy of different deep learning architectures. And, for the proposed model, we have conducted several experiments to find the best performance of the architecture and parameters. In addition, the size of the text also the impacts the model performance; thus, we have conducted several experiments on the text size.

This paper’s contributions can be summarized as follows:(1)We propose a novel network traffic process method that converts network flows into texts, and the words in the text consists of a byte content of the network flow payload. Each network flow is divided into several domains and each domain consists of several character-level words.(2)We propose an N-gram semantic neural model to detect mobile malware traffic. It first produces continuous domain vectors from word representations’ domain composition using convolutional neural network (CNN) model. Afterward, several domain vectors are processed as a domain representation. Then, these domain representations are treated as inputs of network flow composition to get network flow representation using gated recurrent unit (GRU). Network flow representations are then used as features for flow-level malware traffic detection.(3)We conducted a series experiment on real-world network traffic, which contains different types of protocols to adjust the architecture and parameters so as to obtain a better performance. Extensive experimental results show that the proposed model outperforms prior state of arts by achieving 92.6% malicious flow detection rate.

The remainder of this paper is organized as follows. In the next section, we summarize the related work on the existing mobile malware detection methods. In Section 3, the proposed N-gram, semantic-based neural method is introduced. In Section 4, we firstly introduce and describe the selected dataset in our research and then a series of experiments are conducted on the datasets. Finally, in Section 5, we conclude our research and the future work.

There are many proposed security mechanisms to detect mobile malware and protect users’ mobile devices from attacks. Existing typical methods can be classified into 3 categories [5], namely, static-based methods, dynamic-based methods, and hybrid-based methods. In addition to the typical methods, some researchers employ network traffic to detect mobile malware.

2.1. Typical Methods

Static-based methods detect malware by extracting statistic features from APPs, which unpack or disassemble them without running the APPs [610]. These statistic features can be extracted from application permission list [11, 12], sensitive Application Programming Interface (API) calling or critical code segments in the source code [1315]. Static-based methods are widely used for vetting apps. However, there exist several key challenges that static-based methods are facing, for example, the static-based methods cannot detect certain source code tampering operations.

Dynamic-based methods examine the features of APPs, which are running, and the features of the execution behaviors. In these methods [1620], the information, namely, memory utilization, system calls, network connections, and battery power, is employed to detect the mobile malware. However, there still remain some issues in dynamic-based methods. On the one hand, these methods cannot fully traverse the execution path of the software. And, on the other hand, these methods cannot detect certain malicious behaviors if an app is protected by runtime security mechanisms (e.g., DexGuard).

In order to avoid the limitations of both static-based and dynamic-based methods, hybrid-based methods, the combination of both the mechanisms, are employed to detect the mobile malware. These methods have a two-step process, wherein initially static analysis is performed before the dynamic one [2123]. However, the combination of the two methods consumes more resources and results in less improvement.

2.2. Network-Traffic-Based Methods

To address the limitation of the typical methods, the researchers have investigated the malware identification and private information tracking extensively. The researchers begin to analyze and identify malicious apps using network traffic, as almost all the attackers use mobile networks to obtain sensitive information of the user or interact with its malicious APPs. Network-behavior-based method can be divided into three categories [3], namely, network signature-based methods, lexical-feature-based methods and statistical-feature-based methods, respectively.

The signature-based methods detect the malware traffic according to the predetermined malware signatures. Griffin et al. [24] extracted 48 bytes’ code sequence as a string signature of malware. In addition, automatic generation of network signatures has been explored in various previous works [2527]. Most of these studies focused on worm fingerprinting. Perdisci et al. [28] focused on generating network signatures for mobile malware from their HTTP traffic. They analyzed the structural similarities among malicious HTTP traffic trace and then clustered the similar HTTP traffic. As for each HTTP traffic cluster, they automatically generated network signatures for each type of malware. In general, signature-based methods lack adaptability. They can detect known attacks but have limited ability to handle novel ones.

The statistic-based methods detect the malware traffic according to the header information of the packets instead of the payload information. Aresu et al. [29] showed how it is possible to group mobile botnets families by analyzing the HTTP traffic they generate. The authors create malware clusters by looking at specific statistical information that is related to the HTTP traffic. This approach also allows to extract signatures with which it is possible to precisely detect new malware that belong to the clustered families. Lashkar et al. [30] proposed a detection and characterization system for detecting meaningful deviations in the network behavior of a smart-phone application with 9 traffic statistic-feature measurements. The authors employed five classifiers to verify the performance of these features, namely, Random Forest, K-Nearest Neighbor, Decision Tree, Random Tree, and Regression. Arora et al. [31] compared malware’s traffic with benign network traffic and finally found the deviation of the malware on the network behavior. The statistical features they used contained average packet size, average duration of flow, the radio of incoming to outgoing bytes, and 13 other features.

In addition, there exist several researchers that employ deep-learning-based methods to detect mobile malware network traffic. Bendiab et al. [32] proposed a novel IoT malware traffic analysis approach using deep learning and visual representation for faster detection and classification of new malware (zero-day malware). The detection of malicious network traffic in the proposed approach works at the package level, reducing significantly the time of detection with promising results due to the deep learning technologies used. Feng et al. [33] proposed a two-layer method to detect malware in Android APPs. The first layer is permission, intent, and component-information-based static malware detection model. In the second layer, a new method CACNN, which cascades CNN and AutoEncoder, is used to detect malware through network traffic features of APPs.

The lexical-based methods detect the malware traffic according to the text analysis of the traffic. Android [34] detected stealthy behavior in Android app by identifying the disparity between UI textual semantics and program behaviors. However, it only used a few keywords to identify sensitive operations such as “send SMS” and “call phone.” WHYPER [35] used NLP techniques to identify sentences that described the need for a given permission in the app description. Nan et al. [36] proposed a framework called UIPicker for identifying personal user information on a large scale, and this framework was based on a novel combination of NLP, machine learning, and program analysis techniques. The N-gram model in NLP has been used in an automatic network protocol identification system designed for traffic analysis [37]. Ren et al. [38] proposed to reveal and control personal identifiable information (PII) leaks in mobile network traffic, in which the key/value pairs are used for identifying PII. Wang et al. [3] proposed an automatic malware detection method using the text semantics of network traffic. In particular, the authors considered each HTTP flow generated by mobile apps as a text document, which can be processed by natural language processing to extract text-level features. The text semantic features of network traffic were utilized to develop a malware detection model. In addition, the authors designed a detection system on encrypted traffic for bring-your-own-device enterprise network, home network, and 3G/4G mobile network. The detection model is integrated into the system to discover suspicious network behaviors.

Through literature research, using network traffic and lexical analysis to discover hidden malware is promising. In this research, our scheme utilizes deep learning algorithm to extract traffic lexical to discover malicious behaviors in mobile network traffic. Dee-learning-based methods are expected to learn from highly complicated domains to gain higher accuracy than machine-learning-based methods with more functionality. In addition, we process network flows into character-level text files. Each word in the text files is a byte content in the network flows. We examine the performance by a published dataset that contains the network traffic consisting of multitypes of protocols, including both encrypted and nonencrypted network traffic.

3. Proposed Approach

The following sections present the proposed mobile malware network traffic detection neural model, which computes continuous vector representations for network flows of variable types of protocols. An overview of the approach is displayed in Figure 1. The proposed approach processes the network traffic flow representation in two stages. It first produces continuous domain vectors from word representations with domain composition. These domain representations are treated as inputs of flow composition to get flow representation. The flow representations are then used as features for flow-level traffic classification.

3.1. Network Traffic Semantic Analysis

Network traffic is the data transmitted over networks, using a series of network protocols, e.g., IP, TCP, and HTTP. The network protocols are a set of rules that must be observed when exchanging information between peer entities that communicate with each other in the networks. Just as languages are employed by humans to communicate, there are also languages between the devices in networks, namely, the network protocols. The network protocol is composed of three elements: semantics, syntax, and temporal. Different network devices must use the same network protocol to communicate with each other. However, as the large number of protocols in networks, it is difficult to automatically analyze and extract the key information from the network traffic.

For a section of network traffic, it can be regarded as a domain set S, as shown in Figure 2. Decompose S into domains S1, S2, S3, , Sn. Let the total length of S be L, for a given domain Si, as long as the length of each segment Li is being determined; the network traffic can be described segment by segment according to the value of Li. The length of Li is not always fixed. In addition, depending on the value of Si, the optional domain Si−j may appear. Therefore, the main problem of network traffic description is the representation problem of the variable length of domains and the branching problem of optional domains.

In the following sections, we introduce the proposed neural model to model a network flow. We process the network flows as character-level texts, namely, each word in the texts is a one-byte character. Each text document of flow is divided into nd domains, and each domain consists of words. According to Figure 2, the length of the fields in the protocols is not fixed. Therefore, we employ multiwidth kernel CNN to learn the domain representations, which can represent the fields of different lengths in the domains. These domain representations are then considered as input of the flow composition using GRU, and finally the classification of flows is output through an attention layer.

3.2. Domain Composition

Before presenting a CNN with multiple kernels for domain composition, each word is represented as a low dimensional, continuous, and real-valued vector, namely, word embedding. All the word vectors are stacked in a word embedding matrix , where d denotes the dimension of word vector and is vocabulary size. These word vectors can be randomly initialized from a uniform distribution.

We utilize CNN to compute domain representations with a semantic composition. CNN is able to learn fixed-length vectors for domains, capture the word order in a domain, and does not depend on external dependency or constituency parse results. Therefore, we divide each text into nd domains, and each domain contains words. Specifically, we use CNN with multiple convolutional kernels of different widths to produce multi N-gram domain representations to adapt to different length value of fields in protocols. Figure 3 displays the domain composition method.

The convolutional kernels of different sizes can capture different N-gram semantics. For example, a convolutional kernel with a width of 2 essentially captures the semantics of bigram fields in a domain. A domain consisting of nd (padded where necessary) words is denoted aswhere ⊕ is the concatenation operator. Each word ci is mapped to its embedding representation . The input of a linear layer is the concatenation of word embeddings in a fixed-length window size lc, which is denoted aswhere lc be the width of a convolutional kernel. For the multiwidth kernel convolution, given a collection of convolution kernel sizes , after the convolution operation with the ith kernel ki is applied, the output of each kernel is as follows:where , , loc is the output length of the linear layer. To capture global semantics of a domain, we feed the outputs of linear layers to an average pooling layer, resulting in an output vector with fixed-length. We further add hyperbolic tangent (tanh) to incorporate pointwise nonlinearity. The output of each convolution operation becomeswhere N is the number of ci convolution kernel. The set of vector outputs by the multikernel CNN is denoted aswhere O is the domain representation and it is the input of the flow composition. In this work, we use five convolutional kernels whose widths are 1, 2, 3, 4, and 5 to encode the semantics of unigrams, bigrams, trigrams, 4-grams, and 5-grams in a domain. Each kernel consists of a list of linear layers with shared parameters.

3.3. Flow Composition

The obtained domain vectors are fed to a flow composition component to calculate the flow representation. We utilize a GRU approach for flow composition in this part. GRU is a variant of Recurrent Neural Network (RNN); it is simpler than Long Short-Term Memory (LSTM) RNN. Since GRU is a variant of LSTM, it can also solve the problem of long dependence in the RNN network. As displayed in Figure 4, a common GRU unit consists of an update gate () and a reset gate .

Specifically, the transition function of the GRU is calculated as follows:where σ is the sigmoid function, tanh is the tangent function, xt is one of the input vectors, and represents the product, namely, the product of corresponding elements. The subscript j represents the index of the node and the superscript t represents the time. represents the parameter matrix from the hidden layer to the output layer, hd and yd are the number of nodes in the hidden layer and output layer, respectively. and represent the connection matrix between the input and the last hidden layer to update gate (zt), respectively, and xd represents the dimension of the input data. and represent the connection matrix between the input and the previous hidden layer to reset gate (rt), respectively. and represent the connection matrix between the input and the hidden layer to the selected state () at the previous time, respectively.

In this research, since we prefer not to discard any part of the domain semantics to get a better flow representation, the GRU units in the model are always on. Given the domain vectors as input, flow composition produces a fixed-length flow vector as output.

3.4. Traffic Classification Based on Semantic

The composed flow representations can be naturally regarded as semantic features for flow for classification without feature engineering. In addition, we add a self-attention layer for the flow representations to improve the classifying performance. The self-attention layer architecture is shown in Figure 5.

The degree of influence between the output information obtained by the GRU at each time point is the same. In order to highlight the importance of some output results to classification, the attention of weighting is employed. The attention mechanism has been widely used in various NLP tasks for the past few years. The attention mechanism is essentially a weighted sum. The set of vectors input by the GRU layer is expressed as . The attention process can be described aswhere yt is the output of the GRU model. And, represent the weighted matrix during the training process, ba represent the biases. The at means the attention matrix, and p is the output of the attention layer.

Finally, we add a linear layer to transform flow vector to real-valued vector whose length is class number C. Afterward, we add a softmax layer to convert real values to conditional probabilities, which is calculated as follows:

We conduct experiments in a supervised learning method, where each network flow in the training data is accompanied with its label.

4. Experimental Evaluation

In this section, a series of experiments are conducted and the corresponding results are analyzed in detail. In the beginning, we introduce the experiment environment, including the selected dataset, data processing, and the basic evaluating metrics scheme. Then, we compare the model classification performance using different components in the proposed semantic neural model. In addition, a comparative experiment to verify malware detection performance is conducted between the proposed scheme and several influential algorithms proposed in recent years.

4.1. Selected Dataset and Data Processing
4.1.1. Selected Dataset

For this work, we have made use of real data from CICAndMal2017 [15] datasets, which collect 4,354 malware and 6,500 benign apps from VirusTotal [39], Contagio security blog [40], and previous researchers [15, 41, 42]. The collected network traffic data are described in Table 1. Benign apps were collected from Google play market published in 2015, 2016, and 2017. These APPs were collected based on their popularity and identified based on the detection results from VirusTotal [39]. Only those APPs that VirusTotal determined as benign are included to the benign APP set. The malware network traffic consists of 4 categories, namely, adware, ransomware, scareware, and SMS malware. Each category has different malware family. Adware has Dowgin, Ewind, Feiwo, Gooligan, Kemoge, koodous, Mobidash, Selfmite, Shuanet, and Youmi family. Ransomware has Charger, Jisut, Koler, LockerPin, Simplocker, Pletor, PornDroid, RansomBO, Svpeng, and WannaLocker family. Scareware has AndroidDefender, AndroidSpy.277, AV for Android, AVpass, FakeApp, FakeApp.AL, FakeAV, FakeJobOffer, FakeTaoBao, Penetho, and VirusShield family. SMSmalware has BeanBot, Biige, FakeInst, FakeMart, FakeNotify, Jifake, Mazarbot, Nandrobox, Plankton, SMSsniffer, and Zsone family.

Because of the sample errors and the problem of inconsistent labelling in different datasets, CICAndMal2017 retains 5000 (malware 429 and benign 5,065). In addition, as most of the advanced malware employs the evasion or transformation technique to evade detection (code permutation, register renaming, idle activation) [21], the malware behaviors are triggered only after connecting network update, or over time after the restart process. In order to trigger the malware network behaviors, the network traffic data capture occurs in 3 stages, namely, 3 minutes after the app is installed, 15 minutes before, and after restarting the smartphone.

4.1.2. Data Processing

As mobile software commonly employs the TCP and UDP protocol to execute the network activity, we handle TCP and UDP connection as the main interaction granularity between the apps and the networks. In this research, we have designed a tool with the use of C++ programme to process the network traffic, called PKTPT (packet processing tool), and the PKTPT architecture is shown in Figure 6. The network traffic data processing contains 3 stages, namely, traffic segmenting, data generation (including data extracting, trimming, and padding), and data transformation.

Traffic Segmenting. Traffic segmenting stage processes the network traffic into flows based on the 5-tuple (protocol, source IP, source port, destination IP, destination port). In this research, we use bidirectional flows, which have the same IP address and port. Further details of traffic segmenting are explained in Algorithm 1. PKTPT first separates packets in the specific packet queue according to source address, destination address, source port, destination port, and transport layer protocol. Then, each packet is taken out in order, and according to the 5-tuple, the same flow packets are processed as the same flow node in the flow table, which is implemented using hash_map (a data struct in the C++ program).

WHILE (packet reading a packet)
 IF (hash_search(packet 5-tuple information m_fe) = = FALSE)
  FUNC ( ) {
   creat flow: Flow New
   establish the hash mapping of flow: HM(m_fe)
   put the flow into hash_table
  }
 ELSE
  The packet belongs to Flow A
  IF (the packet arrival time is out of the set timeout value)
   Delete Flow A
   FUNC( )
  END IF
 END IF
 flow ← packet
END WHILE

Data Generation. In the previous stage, the network traffic is processed into TCP and UDP flows. In this stage, the packet in the flow is processed. Each flow consists of several frames of packets, and each frame is generally encapsulated in a fixed structure: Ethernet II header, Internet Protocol header, Transport header, and application payload. In our framework, we only use the application payload of the packets, as the application payload is the data that interact with each other on both sides of the communication, and the header data of each layer is used to establish communication channels, such as host IP addresses and ports. Truncation and zero-padding are necessary as different flows have different sizes of application payload. In this research, we extract the first n byte data in each flow and save them into a csv file, using spaces to separate byte data. We also take the flow data size as a parameter of the impactor of performance for deep learning model.

Data Transformation. In this research, we employ text-processing method to process the network traffic. Indeed, packet contents with different protocols can be treated as texts following different grammars. Due to different protocol structures, there are different domains in packets. We employ character-level word as vocabulary in text processing and establish models to learn the semantic relationships between them and protocol structures of a specific protocol. Then, the established models extract domain and flow feature according to the semantic to classify the malware and benign traffic.

Therefore, we transform the flow data obtained in the previous stage into texts. Each flow data is processed as a text file consisting of several words. We split the words using space, namely, the words in the text files is the character level. Then, the texts are fed into a deep learning model to be classified. In addition, some existing work converts traffic to images. In order to compare the effectiveness of the two processing methods, we have also developed the function to convert traffic to images in the stage of data transformation.

4.2. Experiment Environment and Evaluation Metrics

Our experiments are executed on Ubuntu 16.04 LTS with 64 GB of RAM and one GPU card (NVIDIA GTX 1080Ti 11 GB). For the experimental implementation, we used Tensorflow-gpu 1.12.0 and Keras 2.2.4 operated with Python 3.6.10. The deep learning models are constructed, trained, and tested by Keras using the Tensorflow-gpu backend.

We separate the flow-based dataset into learning and test data. Next, the learning data are separated into training and validation data. And, the ratio of training, validation, and test data is 8 : 1 : 1. Then, the model is trained using the training and validation data, and the model performance is measured using the test data. The description of dataset is shown in Table 2.

In our cases, we employ the mini-batch gradient decent method in the training process. In this method, the data are divided into several batches and the parameters are updated according to the batches. In this process, there are two parameters, namely, batch size and iterations, which represent the number of samples for each training and the number of iterations for completing a total sample training, respectively. A single training iteration for all batches propagated forward and backward can be denoted as an epoch. The number of learning epochs is set to 200, and the learning rate of optimizers (i.e., rmsprop and Adam) is set to 0.001.

To evaluate the detection effectiveness of the proposed scheme, the following terms are used for determining the quality of the classification models:Accuracy: it estimates the ratio of the correctly recognized network traffic flows to the entire test dataset.where TP represents True Positive, TN represents True Negative, FP represents False Positive, and FN represents False Negative.Precision: it estimates the ratio of the correctly identified malware traffic flows to the total number of samples classified to malware class. It is denoted asRecall: it estimates the ratio of the correctly identified malware traffic flows to the number of all malware traffic flows. It is denoted asF1-score: it is the harmonic mean of Precision and Recall. It is denoted as

4.3. Comparative Experiments of Different Model Architectures

As the proposed model is an N-gram, semantic-based neural model, we conduct several experiments on the main components, namely, domain composition and flow composition. In addition, to avoid the impact of the flow payload length, we select 6 types of lengths of flow payload for each experiment.

4.3.1. Domain Composition

We first analyze the performance of the mobile malware detection using different domain composition architectures of the proposed model. For domain composition, we conduct an experiment to compare the impact of the single width kernels and the multiwidth kernels, and for the single width kernels, we select 5 types of widths, namely, 1, 2, 3, 4, and 5. For the flow composition, we employ the same architecture, namely, GRU to learn the flow representation and employ an attention layer to classify the flows at last.

The detailed description of the different domain composition architectures is given in Table 3.

In Figure 7, we provide the mobile malware detecting accuracy with different domain composition architectures in the case of different text sizes. And, the proposed multiwidth kernel gives the best results for accuracy for the mobile malware traffic detection. The accuracy of the proposed model for malware traffic detection can reach 0.924 at a text length of 1000, which is better than the single-width kernels.

In addition, we provide a detailed description of classification performance metrics of different domain composition architectures in Table 4.

In the network flows with various types of protocols, the domain consists of several characters, which has an indeterminate amount. Even in a single-protocol flow, different domains can have different amounts of characters. Therefore, the single-type CNN kernels fail to represent all of these cases. The character-level domain can obtain an accuracy over 0.90, as all the domains can be split into the set of characters. However, character-level domains reduce the relationship between the multicharacter domains. Multitype kernel CNN is able to represent the domains better. With multiwidth kernel CNN, we can obtain various field representations carried by different N-gram representations, which can avoid the deficiency of single-width CNN. Type 6 and Type 7 have an accuracy of 0.906 and 0.901, respectively. However, the accuracy of these two multikernel CNNs is still lower than that of the Type 8 and Type 9, as the latter two types represent domains more comprehensively. The accuracy of the two types of CNN is basically the same, namely, 0.926. A kernel of width 6 in Type 9 is already represented by the previous five types of kernels and does not affect the causal relationship between domains. To obtain a better classification performance, we add weights for each multikernel representation. And, the parameters are updated by feedback, and the model is able to obtain a better domain expression by adjusting the weight.

4.3.2. Flow Composition

In this section, we analyze the performance of the mobile malware detection using different flow composition architectures of the proposed model. For flow composition, we conduct an experiment to compare the impact of different deep learning architectures, namely, one-layer GRU, tow-layer GRU, one-layer LSTM, and two-layer LSTM. For domain representation, we selected the multikernel CNN with 5 different kernels according to Section 4.3.1.

The detailed description of the different flow composition architectures is provided in Table 5.

In Figure 8, we provide the mobile malware detecting accuracy with different flow composition architectures in case of different text sizes. The accuracies of the four models for malware traffic detection are similar, which are all more than 0.92. Both LSTM and GRU architectures are able to process sequential information, i.e. the previous input is related to the subsequent input. And, they can solve the problem of the disappearance of the governor in the long sequence of recursive neural network. The detailed description of classification performance metrics is depicted in Table 6, and it can be seen that increasing the LSTM layers or GRU layers does not improve detection performance. However, increasing LSTM layers or GRU layers can increase the model complexity and the time consumed of the training process. And, since the detection effect of LSTM and GRU is similar, while the training time of GRU is shorter than that of LSTM, we opted to use one-layer GRU as the flow composition in our model.

4.4. Comparative Experiments
4.4.1. Different Models

In this section, we analyze the performance of the mobile malware detection in different cases, namely, the deep learning architectures and different text sizes. To verify the effectiveness of the proposed scheme, we have implemented different deep learning architecture models to detect malware traffic. In order to find the most suitable architecture model, a variety of deep learning architectures have been built for comparison purposes, namely, MLP, TextCNN, Bi-GRU, Bi-LSTM, and the proposed model. The detailed description of the different architecture models is provided in Table 7.

For Table 2, there are several supplemental instructions as follows: (1) Embedding (input_dimension, output_dimension) converts a positive integer (index value) to a dense vector of fixed size. (2) FC(n) denotes the fully connected layer with n output dimensions. (3) Conv1D(kernels, kernel_size) denotes the convolutional layer, where the first parameter denotes the number of the kernels and the second parameter represents the kernel size. In TextCNN architecture, the second layer contains 4 kinds of convolution kernels with different kernel sizes. (4) LSTM(n) denotes the LSTM layer and n denotes the output dimensions. Bi-LSTM stands for the Bidirectional LSTM, and its parameters are the same with LSTM. GRU has the same usage with LSTM.

We have provided the detailed description of classification performance metrics of different deep learning architectures with different text sizes in Table 8.

The MLP architecture consists of 3 fully connected layers that have the lowest classification performance in all the text sizes. The second type of deep learning architecture, TextCNN, has a convolutional layer that consists of 4 types of kernels with different kernel sizes. It has an improvement to MLP in detecting the malware traffic. However, compared to the other three types of deep learning architectures, the classification performance of TextCNN is slightly weaker. The reason may be that these kinds of architecture are different in the way of learning features. Both can learn the characteristics of malware and benign software. The 1st and 2nd types use CNN structure to build the model, while the 3rd and 4th types use RNN structure to build the model. Since RNNs are more advantageous in time series, Types 3 and 4 outperform Types 1 and 2. For the proposed N-gram semantic neural model, the performance is improved compared to Types 3 and 4, as we combine CNN and RNN structures to further improve the effectiveness. Thus, the proposed model that employs an N-gram semantic neural model to solve the mobile malware traffic detection problem in an end-to-end way is recommended.

Specially, in [33], a two-layer deep learning method is proposed to detect android malware. The first layer employs the static features about permission, intent, and component information. And, the second layer employs the network traffic features to receive the classification results from the first layer. In our research, we reimplement the network traffic detecting method and conduct a comparative experiment.

The implemented deep learning model parameters are shown in Table 9. In the research [33], each network flow was converted to a 2424 image, namely, 784 bytes of network flow data were utilized. In the comparison experiment, we used the first 700 bytes of data from the network flow in our proposed scheme.

The binary classification performance metrics is shown in Table 10.

According to the comparative experiments, the proposed scheme performs better than the scheme in [33]. The average accuracy of the [33] model for detecting the malware from benign traffic is 0.768, while the accuracy of our proposed model can reach 0.899. The difference in detection accuracy is mainly due to the different deep learning models. Scheme [33] uses CNN architecture while our scheme employs CNN combined GRU architecture. These two kinds of architectures are different in the way of learning features. Both can learn the characteristics of malware and benign software. However, the proposed model can acquire the features of the time dimension while CNN can only learn the spatial features.

4.4.2. Different Text Size

The text size is another important impactor for the classification results, and we use the text size as the parameter to develop several experiments. We extract and splice the application layer payload of the successive packets in each flow as flow payload data. And, the text size is the length of the flow payload data. Flow payload with different sizes may contain different amounts of information, which can affect the detection of malware. For testing this parameter, we selected 6 sizes of text, namely, 200 bytes, 500 bytes, 700 bytes, 1000 bytes, 1200 bytes, and 1500 bytes. In the case of flows more than the size of the texts, we have disregarded the following payload, and in the case of flows with less than the size of the texts, we have padded them with zeros. Figure 9 depicts the mobile malware traffic detection accuracy with different text sizes using different deep learning architectures.

Figure 9 depicts that the performance of the model increases with the size of the input text. When the input text size is below 1000 bytes, the performance of the model is poor, including precision, recall, and accuracy. When the input text size is 1000 bytes, the model performance is relatively high, with an accuracy of about 0.91 using the proposed model. When the input text size is 1200 bytes, the model performance curve starts to flatten out and the performance indexes no longer increase significantly. Even when the input text size is 1500 bytes, the performance of the model tends to decline. The larger the size of the text used, the more features can be captured, and the more accurate the detection of mobile malware traffic. However, when a certain length is reached, the content of traffic payload cannot continue to provide effective features to help improve the detection accuracy.

4.4.3. Binary-Classification in HTTP Scenario

In [3], each HTTP flow generated by mobile apps is converted into a text document, which can be processed by natural language processing to extract text-level features. They examine the traffic flow header using the N-gram method from the natural language processing (NLP). Then, an automatic feature selection algorithm based on chi-square test is utilized to identify meaningful features. These automatically selected features are used to build an SVM classifier for malware detection. In our research, we reimplement the network traffic detecting method and conduct a comparative experiment. In this process, we divided the HTTP dataset into training set and test set according to the ratio of 8 : 2. The HTTP dataset is described in Table 11.

The binary classification performance metrics is shown in Table 12. As HTTP protocol has more obvious features in semantic, both methods have high accuracy in detecting malicious traffic from HTTP protocol traffic. Both approaches make full use of the semantic information of HTTP traffic. The difference is that the approach in Ref [3] analyzes and extracts specific field contents as features manually, while our approach learns valid information automatically by deep learning models.

4.5. Discussion

With the proposed N-gram semantic based neural model, the essential purpose of this paper has been achieved, namely, detecting the different protocol types of mobile malware traffic in an end-to-end way by transferring the traffic flow into character-level text.

The existing detecting method needs manual analysis of traffic semantic information generated by the specific protocol, which failed to detect the mobile malware traffic hiding in other protocols. To address this problem, we proposed to employ deep learning methods to learn the text converted from network flows. We segment the network traffic into flows according to Algorithm 1. Then, the application payload of the packets in each flow is extracted and is waiting to be transferred into other forms that can be fed into deep learning algorithms. In our research, the flow payload data are transferred into the character-level text form. Application layer payload consists of the user behavior data expressed by different types of application protocols, i.e., HTTP and DNS. We utilize a multiwidth kernel CNN as the domain composition to learn the text semantic. And, different width kernels have different weights updated in the training progress. And, we proved the advantage of multikernel CNN through experiments. Afterward, we select the GRU architecture as the flow composition in the model by conducting comparative experiments on 4 types of deep learning architectures, namely, 1-layer LSTM, 1-layer GRU, 2-layer LSTM, and 2-layer GRU.

To verify the effectiveness of the proposed scheme, a series of experiments are conducted. In these experiments, we compare the performance of different deep learning architectures and a recommended architecture has been the proposed N-gram, semantic-based neural model according to the experiments. In addition, we also consider the impact of the size of the text, namely, the length of the used flow payload. Experiment results show that as the text size increases, the detection effect gets better. However, when the size exceeds 1000 bytes, the improvement of detection effect becomes less.

5. Conclusion

In this paper, we present a scheme to detect mobile malware traffic by extracting flow payload and converting it into character-level text. And we employ multi-filter CNN and GRU to model the generated text into domain representations and flow representations, respectively. As the text is character-level based, no manual analysis about specific protocol semantic information is needed. To accomplish the above work, we also designed a tool, called PKTPT, to process the network traffic into flows and extract the application layer payload from each packet. From the several experiments, our proposed scheme is proved to be effective. Compared with the state-of-the-art methods, several comparative experiments are also conducted. And, the experiment results depict that our proposed scheme is better in terms of accuracy and is suitable for mobile malware detection that is likely to use various protocols. However, as more mobile software uses encryption protocols, semantic-based malware traffic detection becomes less reliable. We will study methods for detecting mobile malware traffic in the scenario of encrypted traffic in the future.

Data Availability

The data are available from http://205.174.165.80/CICDataset/CICMalAnal2017/Dataset/.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant Nos. U1836104 and 61702235 and in part by the Fundamental Research Funds for the Central Universities under Grant No. 30918012204.