Abstract

The Internet of Things (IoT) provides various benefits and brings smart devices ever closer to daily life. As the number of smart devices in IoT grows, security is no longer a one-device affair: many attacks that target traditional computers in an IoT environment may also spread to other IoT devices. In this paper, we consider an approach to protect IoT devices from being attacked by local computers. To address this issue, we propose a novel behavior-based deep learning framework (BDLF), built on a cloud platform, for detecting malware in an IoT environment. In the proposed BDLF, we first construct behavior graphs from extracted API calls to capture malware behaviors. We then use a neural network, Stacked AutoEncoders (SAEs), to extract high-level features from the behavior graphs. The layers of the SAEs are stacked one after another, and the last layer is connected to added classifiers. The architecture of the SAEs is 6,000-2,000-500. The experiment results demonstrate that the proposed BDLF can learn the semantics of higher-level malicious behaviors from behavior graphs and increase the average detection precision by 1.5%.

1. Introduction

A large number of malware variants are automatically generated every day. A recent Symantec report [1] shows that the number of new pieces of malware in 2015 grew by 36 percent over the year before, with total samples exceeding 430 million. This exponential growth of malware poses a considerable threat to daily life.

Traditional computers expose an IoT environment to many attacks. Malware attacks computers and uses the infected computers to attack other connected devices in the IoT environment. For example, Trojan.Mirai.1, a variant of Mirai, can infect Windows hosts and use these hosts to infect other devices. The infected Windows hosts can steal confidential information and turn the affected devices into a botnet to launch new Distributed Denial of Service (DDoS) attacks. Many current malware attacks on traditional computers may also extend to other IoT devices. Unfortunately, there are no ideal solutions for avoiding Mirai and other IoT threats. One approach aims to weaken these threats by securing the traditional computers in the IoT environment.

The fast-growing number of samples creates a strong demand for malware detection in IoT environments [2–4]. Faced with so many sophisticated malware samples, a large body of research has proposed malware detection methods to mitigate the rapid growth of malware. Malware detection can be divided into two main methods: static malware detection and dynamic malware detection [5, 6]. Static malware detection, also referred to as signature-based malware detection, examines the content of a malicious binary without actually executing the sample. Signature-based malware detection is able to obtain the full execution path. However, it can be easily evaded by obfuscation techniques. In addition, signature-based malware detection requires prior knowledge of malware samples.

In response to the limitations of signature-based malware detection, various dynamic malware detection methods have been put forward [7]. Dynamic malware detection analyzes the behavior of a sample during execution and is generally called behavior-based malware detection. Behavior-based malware detection methods include virtual machine and function call monitoring, information flow tracking, and dynamic binary instrumentation. The Windows Application Programming Interface (API) call graph-based method has long been considered a promising approach to behavior-based malware detection [8, 9].

Machine learning algorithms such as Decision Tree (DT), K-Nearest Neighbor (KNN), Naïve Bayes (NB), and Support Vector Machine (SVM) are commonly used in malware detection [10, 11]. These traditional machine learning algorithms can potentially learn behavior features from malware. Unfortunately, the performance of most machine learning algorithms depends on the accuracy of the extracted features. In addition, it is often difficult to extract behavior features that are meaningful enough to improve malware detection performance. Moreover, feature processing requires expertise. Therefore, traditional machine learning algorithms remain somewhat unsatisfactory for malware detection.

Deep learning is a branch of machine learning that attempts to learn high-level features directly from the original data. In short, deep learning advocates an end-to-end solution, which removes the large and challenging manual feature-engineering phase. Deep learning learns high-level features of samples efficiently by means of a multilayer deep architecture, and it has been widely used in image processing, visual recognition, object detection, etc. [12–17].

This paper introduces a method to protect IoT devices from being attacked by local computers. We build a behavior-based deep learning framework (BDLF) which takes full advantage of Stacked AutoEncoders (SAEs) and traditional machine learning algorithms for malware detection. SAEs are a deep learning model that consists of multiple layers of sparse AutoEncoders [18, 19]. We use the SAEs model to extract high-level features from behavior graphs and then perform classification with the added classifiers (i.e., DT, KNN, NB, and SVM). The combinations of DT, KNN, NB, and SVM with the SAEs model are called SAE-DT, SAE-KNN, SAE-NB, and SAE-SVM, respectively. The proposed BDLF is implemented on a cloud platform.

In short, the main contributions are as follows:

(1) We construct a novel behavior-based deep learning framework called BDLF by combining the SAEs model with behavior graphs of API calls for malware detection. The proposed BDLF aims to obtain deeper semantics from behavior graphs than previous approaches based on API call sequences (e.g., n-gram).

(2) In the proposed BDLF, we investigate a deep learning model of SAEs to automatically acquire high-level representations of malware behaviors. Our experiment results demonstrate that our method can extract more meaningful abstract features and helps to improve the average precision in malware detection.

The remainder of this paper is organized as follows. Section 2 introduces related work. Section 3 describes the proposed behavior-based deep learning framework. The evaluation and experiment results are presented in Section 4, which is followed by the conclusion and future work in Section 5.

2. Related Work

With more and more malware attacks and smart device connections in IoT environments, security is not a separate event [20–22]. It is necessary to detect attacks on local computers in order to weaken the threats to other smart devices in the IoT environment.

Malware detection has proved to be an effective way of preventing IoT threats. Jiawei et al. present a method for detecting malware in IoT environments [23]. They first convert the extracted binaries into images and then use a Convolutional Neural Network (CNN) to detect malware. Their experiments demonstrate that the method performs well in malware detection. Pa et al. analyze IoT devices and identify four malware families in the IoT environment. They propose an IoT honeypot and sandbox for analyzing attacks.

Malware samples usually achieve their intentions by performing malicious actions on operating system resources. In ‎[24], the proposed behavior model captures the interactions between malware and operating system resources which consist of file, registry, process, and network. Sanjeev et al. ‎[25] observe the actions that are correlated with file system, process, network, and memory.

Behavior-based malware detection has witnessed a shift towards API calls [26]. The pattern of API calls is an expressive representation that helps to understand malware samples better, because API calls provide useful information about the runtime activities of a malware sample. Wu et al. [27] transform API calls into regular expressions and then use these rules to detect malware when a similar regular expression appears. Taejin et al. [28] convert API calls into formatted codes and group the API data using an n-gram. Pratiksha et al. [29] recognize malware by using API calls and their frequencies. Sanjeev et al. [25] propose a frequency-centric model for feature construction by employing API calls and OS resources of malware and benign samples.

Notably, deep learning has been applied to malware feature extraction and detection in recent years. Wenyi et al. [30] propose a deep learning architecture whose input is a sequence of API call events and null-terminated objects. Bojan et al. [31] use a convolutional and recurrent network to analyze API call sequences in malware classification. Razvan et al. [32] explore a few variants of Echo State Networks (ESNs) and Recurrent Neural Networks (RNNs) to predict the next API call. Omid E. et al. [33] extract unigram (1-gram) API calls and create an invariant compact representation of the malware behavior by using a Deep Belief Network (DBN). Wookhyun et al. [34] present a deep Recurrent Neural Network (RNN) to deal with the sequence of API calls. William et al. [12] design a deep learning architecture using the SAEs model. Their architecture is based on the API calls extracted from Portable Executable (PE) files.

Previous works have shown that different strategies can be used to build patterns of API calls. However, methods that use only API calls and their frequencies, or API call fragments, are limited. Ammar Ahmed E. et al. [35] demonstrate that combining API calls with their parameters yields higher malware detection accuracy than considering API calls alone. In their study, each malware sample is represented as an API call graph that integrates API calls and operating system resources. They first extract API calls and their parameters through preprocessing and then use the proposed API call construction algorithm to build the integrated API call graph. Finally, they calculate the similarity between different graphs to identify the input sample.

Different from the previous works, the proposed BDLF is a combined approach using behavior graphs of API calls and SAEs model. Our approach aims to capture the high-level malicious behaviors for improving malware detection in IoT environment.

3. Behavior-Based Deep Learning Framework

In this section, we elaborate on the proposed BDLF. The proposed BDLF consists of two modules: behavior graph construction and SAE-based malware detection.

3.1. System Overview

An overview of our proposed system is displayed in Figure 1. The proposed system is composed of an IoT environment (IoTE) module and a cloud platform (CP) module. The IoTE module consists of local computers and other smart devices. The proposed BDLF is implemented in CP's detectors, which form the main module for behavior construction and malware detection. In the proposed BDLF, each program is represented by a behavior graph which consists of many API call graphs. An API call graph integrates API calls with operating system resources. After the behavior graphs are constructed, CP transforms the behavior features into binary vectors and then uses these vectors as input to the SAEs. There are 3 hidden layers in the proposed SAEs model. The architecture of the SAEs is 6,000-2,000-500, and the last hidden layer's data are used as the input to the added classifiers (i.e., DT, KNN, NB, and SVM). The aim of the proposed BDLF is to learn the semantics of the high-level malicious behaviors and detect malware effectively. The purpose and functionality of each component are described as follows.

The IoTE module refers to an IoT environment. Each local computer contains an installed light agent which is responsible for collecting runtime activities. In this module, computers transfer scanning information or newly installed suspicious files to CP and receive responses from CP.

CP provides large, elastic storage space. Detectors in CP are responsible for analyzing the scanning data or files received from IoTE. For scanning information, CP constructs behavior graphs and then transforms the API call graphs into binary vectors which are used as input to the SAEs model for malware detection. For suspicious files, CP executes the samples in Cuckoo Sandbox and then extracts API calls from the sandbox's monitoring files. After that, CP processes the monitoring files in the same way as the scanning information. After the detection, CP gives feedback to IoTE.

3.2. Behavior Graph Construction

The actions in behavior-based malware detection must only include security-critical operations and related independent operations [36]. We consider the actions performed on operating system resources, which fall into seven types: service, process, file system, registry, synchronization, network, and system. An action contains a set of operations which correspond to a set of related API calls [37–39]. We list some relationships between operating system resource types and API calls in Table 1.

The API calls listed in Table 1 also occur frequently in benign samples. However, a carefully designed combination of these API calls may serve a malicious purpose. We therefore propose behavior graphs of API calls for malware. The proposed API call graphs are designed to learn malicious behaviors from the combination of API calls. Box 1 shows a code fragment of the malware, which performs the following operations:

(i) create nso1.tmp and then delete it.

(ii) create Trojan-Downloader.Win32.Zlob.bcl and obtain its information; after that, read and set the file information of Trojan-Downloader.Win32.Zlob.bcl (delete, rename, or change attributes).

(iii) create nsi2.tmp.

Figure 2 shows three features (API call graphs) extracted from the code fragment in Box 1. We construct the extracted features by grouping related API calls that belong to the same operating system resource type. For example, the feature sets {NtCreateFile, DeleteFile}, {NtCreateFile, NtQueryInformationFile, NtReadFile, NtSetInformationFile}, and {NtCreateFile} are performed on the file resources nso1.tmp, Trojan-Downloader.Win32.Zlob.bcl, and nsi2.tmp, respectively. The first API call graph contains NtCreateFile and DeleteFile. NtCreateFile has two arguments, with values 0x000000f8 and …\nso1.tmp. DeleteFile has one argument, with value …\nso1.tmp. The edge label between them denotes that the second argument of NtCreateFile has the same value as the argument of DeleteFile. The second API call graph contains NtCreateFile, NtQueryInformationFile, NtReadFile, and NtSetInformationFile. NtCreateFile has two arguments, with values 0x000000f8 and …\Trojan-Downloader.Win32.Zlob.bcl. NtQueryInformationFile, NtReadFile, and NtSetInformationFile each have one argument with value 0x000000f8. As with the other edge labels, the label between NtCreateFile and NtQueryInformationFile indicates that the first argument of NtCreateFile has the same value as the argument of NtQueryInformationFile. The third API call graph has only one node, NtCreateFile, which has two arguments with values 0x000000ec and …\nsi2.tmp.
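The grouping of API calls by file resource can be sketched as follows. This is a minimal illustration, not the authors' implementation: the trace format (API name, handle, path), the helper name `build_api_call_graphs`, and the treatment of the two argument values as a handle and a path are assumptions. The elided path prefix is kept as `...`.

```python
def build_api_call_graphs(trace):
    """Group API calls that act on the same file resource.

    trace: list of (api_name, handle_or_None, path_or_None) tuples
           (hypothetical format chosen for this sketch).
    Returns a dict {path: [api_name, ...]} in first-seen order.
    """
    handle_to_path = {}  # most recent path bound to each handle
    graphs = {}
    for api, handle, path in trace:
        if path is None:
            # Call identified only by its handle: look up the bound path.
            path = handle_to_path.get(handle)
            if path is None:
                continue  # unknown handle: ignored in this sketch
        elif handle is not None:
            # A create call binds (or rebinds) the handle to a path.
            handle_to_path[handle] = path
        graphs.setdefault(path, []).append(api)
    return graphs


# Trace mirroring the three operations described for Box 1.
trace = [
    ("NtCreateFile", 0xF8, r"...\nso1.tmp"),
    ("DeleteFile", None, r"...\nso1.tmp"),
    ("NtCreateFile", 0xF8, r"...\Trojan-Downloader.Win32.Zlob.bcl"),
    ("NtQueryInformationFile", 0xF8, None),
    ("NtReadFile", 0xF8, None),
    ("NtSetInformationFile", 0xF8, None),
    ("NtCreateFile", 0xEC, r"...\nsi2.tmp"),
]
graphs = build_api_call_graphs(trace)
for resource, calls in graphs.items():
    print(resource, calls)
```

Note that the handle value 0x000000f8 is reused for two different files, so the sketch rebinds a handle whenever a new create call supplies a path.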

Our proposed API call graph is a directed acyclic graph where nodes stand for either an API call or an operating system resource and edges represent some types of dependence. We define the proposed API call graph as

$$G = (V, E, \mu, \Sigma).$$

In the API call graph $G$, $V$ stands for a set of nodes, $E \subseteq V \times V$ represents a set of edges, and $\mu : V \to \Sigma$ is a function that maps nodes to API calls or operating system resources in the alphabet set $\Sigma$. Furthermore, each node $v \in V$ and each edge $e \in E$ has an attribute, which can be represented as $a_v$ and $a_e$, respectively.

The proposed system monitors the API calls and their parameters to recognize malicious behaviors and has the following rules to identify hostile attacks.

API calls and their extension functions perform the same operation, and this kind of "sibling manipulation" leads to identical features. For example, whenever the registry needs to be opened, the open operation can be expressed as RegOpenKeyExA, RegOpenKeyExW, or other variants; our proposed system treats these variants as the same operation. In addition, when the same API call, resulting in identical features, is performed two or more times in a row, we regard it as one operation. In Box 1, the API call NtReadFile is performed twice on Trojan-Downloader.Win32.Zlob.bcl, and the API call graph is built with a single occurrence of NtReadFile: {NtCreateFile, NtQueryInformationFile, NtReadFile, NtSetInformationFile}.
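Both rules can be illustrated with a short sketch. The suffix-stripping heuristic and the helper names `normalize_api` and `collapse_repeats` are assumptions made for illustration; a production system would more likely rely on an explicit mapping table of API variants.

```python
from itertools import groupby


def normalize_api(name):
    """Map ANSI/Unicode ('A'/'W') and 'Ex' variants to one base name.

    Heuristic only: an API whose real name ends in a lowercase letter
    followed by 'A' or 'W' would be shortened incorrectly.
    """
    if len(name) > 1 and name[-1] in "AW" and name[-2].islower():
        name = name[:-1]
    if name.endswith("Ex"):
        name = name[:-2]
    return name


def collapse_repeats(calls):
    """Keep a single entry for each run of consecutive identical calls."""
    return [name for name, _ in groupby(calls)]


calls = [
    "RegOpenKeyExA",
    "RegOpenKeyExW",
    "NtReadFile",
    "NtReadFile",
    "NtSetInformationFile",
]
normalized = collapse_repeats([normalize_api(c) for c in calls])
print(normalized)
```

Here the two registry variants map to the same base name and the repeated NtReadFile call is kept once.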

The proposed API call graphs do not consider the order of the behaviors. Malware may perform malicious behaviors in totally different orders. The behaviors described in Figure 2, {NtCreateFile, DeleteFile}, {NtCreateFile, NtQueryInformationFile, NtReadFile, NtSetInformationFile}, {NtCreateFile}, are considered identical to {NtCreateFile, DeleteFile}, {NtCreateFile}, {NtCreateFile, NtQueryInformationFile, NtReadFile, NtSetInformationFile}.
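One way to make this order-insensitivity concrete is to compare behaviors as sets of API call graphs. In the sketch below, each API call graph is simplified to a tuple of API names and `behavior_signature` is a hypothetical helper name; both are assumptions for illustration.

```python
def behavior_signature(api_call_graphs):
    """Return an order-independent signature of a sample's API call graphs.

    Each graph is a tuple of API names; the order of graphs within the
    sample is discarded, and duplicate graphs count once.
    """
    return frozenset(tuple(g) for g in api_call_graphs)


first_order = [
    ("NtCreateFile", "DeleteFile"),
    ("NtCreateFile", "NtQueryInformationFile", "NtReadFile", "NtSetInformationFile"),
    ("NtCreateFile",),
]
second_order = [
    ("NtCreateFile", "DeleteFile"),
    ("NtCreateFile",),
    ("NtCreateFile", "NtQueryInformationFile", "NtReadFile", "NtSetInformationFile"),
]
same_behavior = behavior_signature(first_order) == behavior_signature(second_order)
print(same_behavior)
```

Discarding duplicates is consistent with a presence/absence encoding of API call graphs.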

Moreover, we use the API calls and operating system resource instances to identify API call graphs. From the example in Figure 2, we can see that the operating system resource is used to identify related operations rather than to form the feature vectors. This is because malware samples tend to use random file names or other values each time they are executed.

3.3. SAE-Based Malware Detection

Before using the behavior graphs of API calls as input to the SAEs model, we transform these features into binary vectors. We employ one-hot encoding to identify the unique behavior of every API call graph $G$. Let $N$ be the number of the extracted API call graphs in the dataset $D$. The API call graphs constructed in dataset $D$ are denoted by

$$\{G_1, G_2, \ldots, G_N\},$$

where $G_i$ represents the $i$th API call graph in dataset $D$. In our proposed system, the behavior graph of sample $s$ can be represented as $B_s$. Sample $s$ is represented as $(B_s, y_s)$, where $y_s$ is the class label the sample belongs to. There are two designated class labels associated with the proposed BDLF, one representing the class of malware and the other representing the class of benign samples.

The API call graphs in a sample are then transformed into binary vectors, and the behavior graph of sample $s$ can be represented as

$$B_s = (b_1, b_2, \ldots, b_N).$$

Here $b_i \in \{0, 1\}$; if the sample $s$ contains the API call graph $G_i$, $b_i = 1$; otherwise $b_i = 0$.
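The binary encoding can be sketched in a few lines. API call graphs are simplified to tuples of API names, and `build_vocabulary` and `encode_sample` are hypothetical helper names; the toy dataset has three distinct graphs rather than the 11,164 reported later in the paper.

```python
def build_vocabulary(dataset_graphs):
    """Collect the distinct API call graphs of the dataset in a fixed order."""
    return sorted({g for sample in dataset_graphs for g in sample})


def encode_sample(sample_graphs, vocabulary):
    """Return a 0/1 vector with one position per API call graph in the vocabulary."""
    present = set(sample_graphs)
    return [1 if g in present else 0 for g in vocabulary]


# Toy dataset: each sample is a list of API call graphs (tuples of API names).
dataset_graphs = [
    [("NtCreateFile", "DeleteFile"), ("NtCreateFile",)],
    [("NtCreateFile", "NtQueryInformationFile", "NtReadFile", "NtSetInformationFile")],
]
vocabulary = build_vocabulary(dataset_graphs)
vectors = [encode_sample(sample, vocabulary) for sample in dataset_graphs]
print(vectors)
```

Each vector has as many entries as there are distinct API call graphs in the dataset, and an entry is 1 exactly when the sample contains the corresponding graph.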

In order to build a deep neural network, we apply the SAEs model, which consists of multiple layers of sparse AutoEncoders, to extract features [40, 41]. An AutoEncoder (AE) has three layers: an input layer, a hidden layer, and an output layer. The hidden layer is located between the input layer and the output layer. An AE uses the encoder to map the input data into the hidden layer and uses the decoder to map the hidden layer's data into the output layer, so that the output is similar to the input values. In short, an AE attempts to learn a sparse representation of the input and to reconstruct the input data.

Figure 3 depicts the proposed SAEs model. In our approach, the proposed SAEs model consists of 3 hidden layers with 6,000, 2,000, and 500 units, respectively.

The hidden layers are trained one by one from bottom to top. In the proposed SAEs model, the first layer receives the 11,164-sized original input data and is trained simply as an AE. After this AE has been trained, the 6,000-sized features generated in the first hidden layer are used as the input to a new AE which is added on top of the current AE. The new AE takes the current AE's hidden output as its input and is trained similarly. In general, the $k$th hidden layer's data are used as the input of the $(k+1)$th layer, which is trained simply as an AE. Finally, the last hidden layer's output is the output of the entire SAEs model.

When all layers have been trained, the SAEs model converts the 11,164-sized original features into 500-sized high-level features. These 500-sized high-level features are regarded as the new representation of an executable program file. The proposed SAEs model aims to reduce the number of features and to describe them in a compact high-level form.
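The greedy layer-wise procedure can be sketched with a deliberately tiny, pure-Python autoencoder. This is an illustration under several assumptions, not the authors' Keras model: the layer sizes are scaled down from 11,164-6,000-2,000-500 to 8-6-4-2, the sparsity penalty of a sparse AutoEncoder and the final fine-tuning step are omitted, and `TinyAE` and `pretrain_stack` are hypothetical names.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyAE:
    """One autoencoder: input -> hidden -> reconstruction (sigmoid units)."""

    def __init__(self, n_in, n_hidden, rng):
        self.enc_w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.enc_b = [0.0] * n_hidden
        self.dec_w = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
        self.dec_b = [0.0] * n_in

    @staticmethod
    def _layer(weights, biases, x):
        return [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
                for row, b in zip(weights, biases)]

    def encode(self, x):
        return self._layer(self.enc_w, self.enc_b, x)

    def decode(self, h):
        return self._layer(self.dec_w, self.dec_b, h)

    def train(self, data, epochs=50, lr=0.5):
        """Plain SGD on squared reconstruction error (no sparsity term)."""
        for _ in range(epochs):
            for x in data:
                h = self.encode(x)
                y = self.decode(h)
                # Gradients at the output and hidden pre-activations.
                d_out = [(yj - xj) * yj * (1.0 - yj) for yj, xj in zip(y, x)]
                d_hid = [hk * (1.0 - hk) * sum(d * row[k] for d, row in zip(d_out, self.dec_w))
                         for k, hk in enumerate(h)]
                for j, d in enumerate(d_out):
                    self.dec_b[j] -= lr * d
                    for k, hk in enumerate(h):
                        self.dec_w[j][k] -= lr * d * hk
                for k, d in enumerate(d_hid):
                    self.enc_b[k] -= lr * d
                    for i, xi in enumerate(x):
                        self.enc_w[k][i] -= lr * d * xi


def pretrain_stack(data, hidden_sizes, seed=0):
    """Train one AE per hidden layer; each AE's codes feed the next AE."""
    rng = random.Random(seed)
    layers, activation = [], data
    for n_hidden in hidden_sizes:
        ae = TinyAE(len(activation[0]), n_hidden, rng)
        ae.train(activation)
        layers.append(ae)
        activation = [ae.encode(x) for x in activation]
    return layers, activation


data_rng = random.Random(1)
data = [[data_rng.randint(0, 1) for _ in range(8)] for _ in range(10)]
layers, features = pretrain_stack(data, [6, 4, 2])
print(len(layers), len(features[0]))
```

The last hidden layer's activations (`features`) play the role of the compact high-level representation that is passed on to a classifier.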

Algorithm 1 describes a deep learning model which includes the SAEs and added classifiers for malware detection. The algorithm first constructs the binary feature vectors and takes them as the input activation for each sample in $D$. Once the first layer is pretrained, its hidden activation is used as the input to the next AE. We fine-tune the deep neural network after pretraining and pass the final layer $L$'s activation to the added classifier. In the classifier loop, the index $c$ selects the classifier: DT, KNN, NB, or SVM. The last steps train the added classifiers and output the class label (malware or benign sample). In our experiment, $L$ equals 3, $N$ equals 11,164, and $C$ equals 4.

Input: $D$: dataset including malware and benign samples
       $s$: sample under detection
Output: $y$ // the result of the detection
Begin
    Construct binary feature vectors
    Activation = binary feature vectors
    For $l = 1$ to $L$ do
        Train AE $l$: use the activation of AE $l-1$ as the input and train hidden layer $l$'s parameters
        Fine tune the neural network
    End
    For $c = 1$ to $C$ // $c$ represents different classifier
        Add the classifier $c$ to the top layer of the SAEs model
        Train the added classifier
    End
    Output the class label $y$
End

4. Evaluation and Experiment Results

In this section, we first explain the dataset and the evaluation method. To evaluate the effectiveness of our method, we then compare the proposed BDLF with shallow models, namely DT, KNN, NB, and SVM. Furthermore, we compare our method with other deep learning methods.

4.1. Dataset and Evaluation Method

We conduct the evaluation with a dataset containing 1760 samples, of which 880 are malware samples and the other 880 are benign samples. The malware samples are collected from VX Heaven. We analyze malware and benign samples in Cuckoo Sandbox. In our experiments, we use the $k$-fold cross-validation method [42] in malware detection. In $k$-fold cross-validation, the original dataset is randomly divided into $k$ equal-sized parts; each part is used once for testing while the remaining $k-1$ parts are used for training. We use 10-fold cross-validation, so each experiment uses 1584 samples for training and 176 samples for testing.
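The split sizes follow directly from the fold arithmetic, which the sketch below reproduces. The helper name `k_fold_indices` and the fixed seed are assumptions for illustration; the paper does not describe how the random partition was produced.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds.

    Assumes n_samples is divisible by k, as with 1760 samples and k = 10;
    otherwise the leftover samples would always stay in the training part.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    fold_size = n_samples // k
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test


folds = list(k_fold_indices(1760, 10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

Every sample appears in exactly one test fold, so each of the 1760 samples is used for testing once across the 10 experiments.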

We evaluate the proposed malware detection method by using the $F_1$-score. The $F_1$-score is the weighted harmonic mean of the precision and the recall. Given the notions of true positive ($TP$, a positive sample correctly identified as positive), true negative ($TN$, a negative sample correctly identified as negative), false positive ($FP$, a negative sample incorrectly identified as positive), and false negative ($FN$, a positive sample incorrectly identified as negative), the precision ($P$), recall ($R$), and $F_1$-score are defined as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot P \cdot R}{P + R}.$$
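As a worked illustration, the snippet below computes precision, recall, and their harmonic mean from confusion-matrix counts, assuming the standard definitions of these metrics. The counts are hypothetical and are not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 90 malware samples detected, 10 false alarms, 30 missed.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(round(p, 3), round(r, 3), round(f1, 3))
```

With these counts, precision is 0.9, recall is 0.75, and the harmonic mean lies between the two, closer to the lower value.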

4.2. Experiment and Evaluation Results

In this section, based on the dataset introduced in Section 4.1, we evaluate the experiments in two aspects: shallow models and deep learning models. We conduct eight experiments. The experiments include some shallow models and SAE-based deep learning models.

The shallow models select behavior features by Information Gain (IG) and then use these features to predict the labels of the samples. IG quantifies the information a feature contributes and is used to select features [43]. The measurement criterion in IG is how much information the feature brings to the system: the more information the feature brings, the more important it is. We describe IG in our previous work [11]. The shallow models include DT, KNN, NB, and SVM.
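For a binary feature, information gain can be computed as the reduction in label entropy after splitting on the feature. The sketch below assumes this standard entropy-based definition, which may differ in detail from the formulation in [11]; the function names and toy data are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of the labels minus their expected entropy given the feature."""
    total = len(labels)
    gain = entropy(labels)
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain


labels = [1, 1, 0, 0]          # 1 = malware, 0 = benign (toy data)
informative = [1, 1, 0, 0]     # feature present exactly in the malware samples
uninformative = [1, 0, 1, 0]   # feature unrelated to the label
print(information_gain(informative, labels), information_gain(uninformative, labels))
```

A feature that separates the classes perfectly gains one full bit here, while a feature that is independent of the label gains nothing and would be ranked last.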

We train deep learning models, which include the proposed SAE-DT, SAE-KNN, SAE-NB, and SAE-SVM. The deep learning model with 3 hidden layers is implemented in Keras. We feed our 11,164 features to the SAEs model and convert them into 500-sized features. The batch size for the deep learning models is 1,000. The SAE-based systems (SAE-DT, SAE-KNN, SAE-NB, and SAE-SVM) are trained for 100 epochs.

The average precision, recall, and $F_1$-score are shown in Table 2. It can be observed from Table 2 that the SAE-based methods outperform the shallow methods. The best detection performance is that of the SAE-DT model, whose precision is as high as 98.6%. The high performance obtained with the SAE-DT, SAE-KNN, SAE-NB, and SAE-SVM models indicates that the features learned by the SAEs model help to improve performance compared with traditional classification.

In addition, we compare our best result with previous works. William et al. [12] design a deep learning framework using SAEs for intelligent malware detection based on API calls. The experiment results on their testing dataset show that their deep learning method achieves 95.5% detection precision. Zhenlong et al. [44] build an Android malware detection engine (DroidDetector) based on Deep Belief Networks (DBN). DroidDetector achieves 96.8% detection precision by analyzing the features of required permissions, sensitive APIs, and dynamic behaviors (13 app actions). Toshiki et al. [45] study the similarity of data structure between malware communications and apply a Recursive Neural Network (RNN) to malware analysis. Their method achieves 97.6% detection precision. Our proposed BDLF, based on SAEs and behavior graphs, achieves 98.6% detection precision. It can be seen from Table 3 that our proposed SAE-DT improves the performance of malware detection, which suggests that mining the deep semantic relationships in behavior graphs is worthwhile.

5. Conclusion and Future Work

In this paper, we build a novel behavior-based deep learning framework for malware detection in IoT environments. By combining behavior graphs with Stacked AutoEncoders, we obtain improved detection performance. The experimental results in Section 4 demonstrate that SAE-based models can learn deeper abstract semantic features and help to improve the average detection precision by 1.5%. We hope that further work on the SAEs model can be applied to malware detection and classification.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant no. 61601041), the Fundamental Research Funds for the Central Universities (2018RC55), and the Beijing Talents Foundation (Grant no. 2017000020124G062).