Abstract

With the popularity of Android applications, Android malware is growing at an exponential rate. To detect Android malware effectively, this paper proposes a novel lightweight static detection model, TinyDroid, which combines instruction simplification with machine learning techniques. First, a symbol-based simplification method is proposed to abstract the opcode sequences decompiled from Android Dalvik Executable files. Then, N-gram is employed to extract features from the simplified opcode sequences, and a classifier is trained for the malware detection and classification tasks. To improve the efficiency and scalability of the proposed detection model, a compression procedure is also used to reduce features and select exemplars from the malware sample dataset. TinyDroid is compared against state-of-the-art real-world antivirus tools on the Drebin dataset. The experimental results show that TinyDroid achieves a higher accuracy rate and a lower false alarm rate with satisfactory efficiency.

1. Introduction

With the fast development of the mobile Internet, the popularity of mobile devices, and the rapid growth of mobile applications, smartphones have become the most popular tools for people to access the Internet. Statistics from Gartner show that more than 400 million smartphones were sold globally in the final quarter of 2015, with the Android operating system accounting for 80.7% [1]. As of February 2016, the number of Android apps in Google Play was approaching 2 million. Meanwhile, internal application vulnerabilities and malware are the primary threats to the security of Android applications. Since application developers often lack the necessary knowledge and skills for security code audits, vulnerable code can easily be introduced at every step of the software life cycle. Once attackers exploit these vulnerabilities, they can damage the confidentiality, integrity, and availability of applications. The creator of an Android malware usually inserts a small piece of malicious code into a popular application and spreads the malware through third-party app stores that lack security management [2, 3]. According to a report from the mobile department of Alibaba, 18% of Android devices have been infected with viruses, 95% of popular mobile applications have been counterfeited, and various kinds of malware have evolved quickly [4]. The malicious activities of malware typically include system failure, information leakage, and data corruption, so malicious mobile code directly harms users' interests.

The existing methods of malware detection include signature-based detection methods and behaviour-based detection methods. Signature-based detection methods detect malware by comparing the binary code of software with a database that stores the signatures of all known malware. Although signature-based detection methods have the advantages of simplicity and efficiency with high accuracy, they cannot detect unknown malware and must maintain a vast signature database.

On the other hand, behaviour-based detection methods detect malware by comparing the behaviour pattern of software with that of known malware. Although this approach leads to a higher false alarm rate, it can effectively detect unknown malware and thus compensates for the disadvantages of signature-based detection methods. Two types of analysis can be performed for behaviour-based detection: static analysis [5–8] and dynamic analysis [9–14]. Dynamic analysis can cope with object-code obfuscation to some extent by monitoring the run-time behaviour of an application running in a secure sandbox, but it requires a significant amount of computing resources and has low code coverage. In contrast, static analysis extracts features from sequences of instructions obtained through reverse engineering. In short, static analysis has the advantages of efficiency and high code coverage, and it is a relatively lightweight process compared to dynamic analysis.

In this paper, we first identify the problem of Android malware detection and summarize a model of the threats faced by developers and researchers. Then, we propose a detection method named TinyDroid using instruction simplification and machine learning techniques. The main contributions of this paper are twofold:

(i) Applying N-gram directly to the reduced symbolic Dalvik opcode sequences instead of the original full instructions.

A simple symbol set is introduced to simplify the original Dalvik instruction sequences, where a series of instructions with similar functions can be assigned to a single symbol. The N-gram technique is then employed on the symbolic sequences.

(ii) Reducing both the features and the number of samples to build a lightweight model.

To achieve high detection and classification efficiency, a further reduction scheme is proposed to cut down the number of N-gram items and training samples substantially. Specifically, information gain is employed for attribute reduction and affinity propagation is used for sample selection.

2. Related Work

In this section, we review related work in two areas: the usage of opcodes and N-gram to detect malware, and lightweight approaches to detect Android malware.

2.1. Usage of Opcodes and N-Gram to Detect Malware

Several studies have examined the effectiveness of opcodes for detecting or characterising malware. Rad et al. [15] use histograms of opcodes to classify metamorphic virus family variants. This method takes the frequency distribution of opcodes into account but ignores their sequential patterns.

References [16–18] have investigated the effectiveness of N-gram opcodes extracted from disassembled applications and represented programs as N-gram densities, which make a robust indicator of malware. However, these methods use a large number of features and do not consider efficiency.

Wolfe et al. [19] extracted Java bytecode from Android applications and computed the N-gram frequencies of the bytecode. They then applied dimensionality reduction to the N-gram frequencies using principal components analysis (PCA). This approach, in which the dimensionality of the feature set is very large and difficult to handle due to the unrestricted selection of features, goes to the opposite extreme compared with the permission-based approach.

2.2. Lightweight Approaches to Detect Android Malware

Some static and lightweight mechanisms for detecting Android malware have been introduced. For example, in 2012, Zhao et al. [20] proposed a lightweight framework named RobotDroid that uses active learning on Android applications to induce an accurate detection model with minimal labeled samples.

In 2013, Santos et al. [21] proposed an efficient model based on the frequency of appearance of opcode sequences. They also provided empirical validation showing that opcode sequences are feasible for detecting unknown malware.

In 2014, Arp et al. [22] designed a lightweight method for Android malware detection, applying linear-time static analysis and learning techniques for efficiency.

In 2017, DroidSieve [23] was proposed; it relies on several syntactic and resource-centric features that are robust and computationally inexpensive for detecting Android malware.

These methods have not considered the issue of scalability. This problem produces excessive storage requirements, increases time complexity, and impairs the overall accuracy of the models.

3. Threat Model

Before introducing our detection method, it is necessary to summarize the threats currently encountered in Android malware detection. Four threats arise in Android malware detection:

Repacking and Reflection. A recent trend we have observed is the increasing prevalence of Android malware leveraging packing technology. When an application is insufficiently hardened, an attacker can repackage it to modify its source code or add a small amount of malicious code. Java's reflection mechanism provides a way to obtain type information and invoke objects dynamically, and malware creators use it as an important way to hide malicious behaviour in software: to evade static analysis, a malicious application can deliver malicious code by invoking sensitive methods via reflection at runtime.

Malware in MultiDEX. Android programs are compiled into Dalvik Executable (DEX) files. A typical Android app has a single DEX file, but some complex applications contain multiple DEX files. Therefore, some malware includes malicious code across multiple DEX files. If only a single DEX file is analyzed, the app's maliciousness may go undetected. This simple step can serve as an evasion technique against static analysis.

Run-Based Malware. Instant Run is a feature of the Android development toolchain that allows developers to quickly deploy patches to a debug application by pushing a .zip file into the application. Some malware authors hide the malicious payload of their app in code fragments concealed in the zip file used by Instant Run. This evasion approach can only be used on Android Lollipop and later SDK levels, and although it cannot be used on apps in Google Play, it can still spread through third-party application markets.

Native Code. To evade static detectors at the bytecode level, malicious code relies on special native API functions and kernel system functions to infect, spread, and hide. Moreover, malware sometimes embeds its malicious payload within native content.

Each of these threats can evade some malware detection methods. To cope with some of the above threats, TinyDroid is proposed to detect Android malware using instruction simplification and machine learning techniques.

4. Design of TinyDroid

4.1. Overview

To solve the aforementioned problems, this paper focuses on lightweight machine learning-based detection of Android malware using Dalvik instruction simplification, exemplar selection, and optimization.

We develop a detection system, TinyDroid, which achieves high accuracy in the detection and classification of Android malware. TinyDroid comprises two procedures: the first creates the detection and classification models, and the second predicts Android malware. The main steps are shown in Figures 1 and 2, respectively.

The training set for the detection model is divided into two subsets: malware apps and benign apps. An APK file contains one or more DEX files, which are executed by the Dalvik Virtual Machine. The APK file can be disassembled into smali code by Apktool [24], where smali is a human-readable representation of the Dalvik bytecode that can be simplified into symbolic instructions by our method. The detection model is built from N-gram features, the exemplar selection method, and a machine learning algorithm. The classification model follows the same steps as in Figure 1, except that it divides the training set into different family subsets.
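As a concrete illustration of this preprocessing step, the following Python sketch drives Apktool from the command line and collects the smali files it emits. It assumes Apktool is installed on the PATH; the helper name decompile_apk is our own and not part of TinyDroid.

```python
# Minimal sketch: decompile an APK into smali sources with Apktool.
# Apktool writes one smali*/ directory per DEX file in the APK.
import subprocess
from pathlib import Path

def decompile_apk(apk_path: str, out_dir: str):
    """Run Apktool ('d' = decode) and return the smali files it produced."""
    subprocess.run(["apktool", "d", apk_path, "-o", out_dir, "-f"], check=True)
    return sorted(Path(out_dir).glob("smali*/**/*.smali"))
```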

As shown in Figure 2, the apps to be predicted are firstly preprocessed and the reduced N-gram sequences are extracted. The malware can be detected based on the detection model. Furthermore, the information about malware family can be obtained based on the classification model.

The threats mentioned in Section 3 are among the hardest problems for any malware detection method. Our solution mitigates some of these threats as follows:

Repacking and Reflection. Our study has shown that reflection APIs are used more frequently in the malware set than in the benign set, which makes them part of the feature vector for classification.

Malware in MultiDEX. In the feature extraction process, we analyze all the DEX files in an APK to extract the opcodes and generate the N-gram sequences (see the sketch after this list).

Run-Based Malware. This is a hard case for malware detection that cannot be handled by static analysis; the method in this paper may be insufficient for such threats.

Native Code. Since our detection tool works only at the Dalvik bytecode level, it cannot detect dangerous methods invoked in native code. However, the presence of invocation instructions such as invoke-direct is still used as a feature by our model.
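Because an APK is an ordinary ZIP archive whose DEX files follow the naming convention classes.dex, classes2.dex, and so on, all of them can be enumerated directly, as the Python sketch below illustrates. The function name and strict naming check are our own illustration, not TinyDroid code.

```python
# Sketch: enumerate every DEX file in an APK so that none is skipped
# during feature extraction (counters the MultiDEX evasion technique).
import re
import zipfile

def list_dex_entries(apk_path: str):
    with zipfile.ZipFile(apk_path) as apk:
        return [name for name in apk.namelist()
                if re.fullmatch(r"classes\d*\.dex", name)]
```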

4.2. Feature Extraction and Simplification

The Dalvik instruction set contains 230 instructions, including method call instructions, data manipulation instructions, and return instructions [25]. For malware detection, the full instruction set is too complicated, and some instructions share the same semantics. Condensing these instructions does not affect detection results and also improves detection efficiency. For example, Hang et al. [26] classified 218 Dalvik instructions into a simplified instruction set with only 13 types of instructions, called SDIL.

Thus, we have designed an even more simplified symbolic instruction set that is suitable for efficient machine learning. The detailed steps are as follows:

(1) We analyzed many smali source code files and found that many instructions have multiple versions due to different parameters. For example, consider two Dalvik move instructions (move-object/from16 vAA, vBBBB and move-object/16 vAAAA, vBBBB): the only difference between them is how many bits they use to represent register indices (vAA and vAAAA require 8 and 16 bits, respectively). Following this rule, we identified 107 representative Dalvik instructions that carry the core semantics and occur with high frequency.

(2) We then assigned them to 10 types of symbols according to the semantics of each instruction. For example, G stands for all the jump instructions, whether goto, goto/16, or goto/32, because they have almost the same semantics. Table 1 shows the full symbolic instruction set with the different semantics we propose.

(3) We use Apktool to decompile the APK to obtain the smali source code, extract instructions according to the reduced instruction set, and finally replace the ordered instructions with symbols.
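The following Python sketch illustrates step (3). Only the mapping of the jump instructions to G is taken from the text above; the other symbol assignments and the helper name simplify are illustrative placeholders for the 10-symbol set of Table 1.

```python
# Sketch of the instruction-simplification step: group opcode variants by
# their semantic root and map each root to a one-letter symbol.
SYMBOL_MAP = {
    "goto": "G",     # all jump variants (goto, goto/16, goto/32), per the paper
    "move": "M",     # all move variants (hypothetical assignment)
    "return": "R",   # all return variants (hypothetical assignment)
    "if": "I",       # all conditional branches (hypothetical assignment)
    "invoke": "V",   # all method calls (hypothetical assignment)
}

def simplify(opcodes):
    """Map each Dalvik opcode to its symbol; ignore unlisted opcodes."""
    out = []
    for op in opcodes:
        root = op.split("/")[0].split("-")[0]  # strip size/type variant suffixes
        if root in SYMBOL_MAP:
            out.append(SYMBOL_MAP[root])
    return "".join(out)

# e.g. simplify(["move-object/from16", "goto/32", "invoke-direct"]) == "MGV"
```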

N-gram [27] is widely used in natural language processing (NLP) and has also been applied to malware analysis [16, 28]. Once the N-gram sequences of Dalvik instructions are generated, they can easily be analyzed for the detection of Android malware. N-gram is a type of probabilistic model that assumes the appearance of the nth word correlates only with the previous n−1 words. For example, given the sequence of symbolic instructions "MRGITPV", the 3-gram features are {MRG, RGI, GIT, ITP, TPV}. Table 2 shows the occurrence rates of several 3-gram features in the selected APKs.
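A sliding window over the symbol string is all that N-gram extraction requires. The short Python sketch below reproduces the 3-gram example from the text and counts feature frequencies:

```python
# Sketch: extract length-n substrings of a symbol sequence and count them.
from collections import Counter

def ngrams(sequence, n):
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))

print(ngrams("MRGITPV", 3))
# Counter({'MRG': 1, 'RGI': 1, 'GIT': 1, 'ITP': 1, 'TPV': 1})
```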

Because the number of N-gram features is excessive, we use information gain (IG) for feature reduction, as it has been proven effective [29]. The information gain of a given feature $X$ with respect to the class attribute $Y$ is the reduction in uncertainty about the value of $Y$ after observing values of $X$; it is denoted as $IG(Y; X)$. The uncertainty about the value of $Y$ is measured by its entropy, defined as

$$H(Y) = -\sum_{y} P(y) \log_2 P(y),$$

where $P(y)$ is the prior probability of each value $y$ of $Y$. The uncertainty about the value of $Y$ after observing values of $X$ is given by the conditional entropy of $Y$ given $X$, defined as

$$H(Y \mid X) = -\sum_{x} P(x) \sum_{y} P(y \mid x) \log_2 P(y \mid x),$$

where $P(y \mid x)$ is the posterior probability of $Y$ given the value $x$ of $X$. The information gain is thus defined as $IG(Y; X) = H(Y) - H(Y \mid X)$. According to this measure, a feature $X$ is regarded as more correlated to the class $Y$ than a feature $Z$ if $IG(Y; X) > IG(Y; Z)$. By calculating information gain, we can rank the correlation of each feature to the class and select key features based on this ranking.
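A direct transcription of these definitions into Python is shown below; the variable names are ours, and the sketch assumes binary (present/absent) N-gram features and a binary class label.

```python
# Sketch: rank features by information gain IG(Y; X) = H(Y) - H(Y | X).
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y P(y) log2 P(y)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG for one feature column against the class labels."""
    total = len(labels)
    cond = 0.0
    for value in set(feature):                           # each observed value x of X
        subset = [y for x, y in zip(feature, labels) if x == value]
        cond += (len(subset) / total) * entropy(subset)  # P(x) * H(Y | X = x)
    return entropy(labels) - cond

# Feature selection: keep the top-k feature columns ranked by information_gain.
```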

4.3. Exemplar Selection and Optimization

Recent research shows that the affinity propagation clustering algorithm [30] can be used to generate good representative data for intrusion detection training [31]. Since intrusion detection training is very similar to malware detection training, we adopt this clustering algorithm in our paper. Affinity propagation is employed as a reduction method to select a smaller set of representative exemplars from the original large training data. With a smaller training dataset, the efficiency of the detection model improves after sample optimization.

Affinity propagation (AP) clusters instances by iteratively passing messages between data points. Unlike clustering algorithms such as k-means, AP does not require the number of clusters to be determined or estimated before running, and the only parameter that must be set is the preference. Because the number of clusters generated by AP is decided by this preference, we choose preference values between the minimum and the maximum of the similarities to generate the expected number of well-separated clusters.

The time complexity of AP is $O(N^2 T)$, where $N$ is the number of samples and $T$ is the number of iterations until convergence. Furthermore, the memory complexity is $O(N^2)$ if a dense similarity matrix is used, but it is reducible if a sparse similarity matrix is used.
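In practice, this step can be realized with scikit-learn's AffinityPropagation, as in the hedged sketch below; the preference setting shown (scikit-learn's default, the median similarity) is illustrative rather than the value used in the paper.

```python
# Sketch: select representative exemplars from the training set with AP.
import numpy as np
from sklearn.cluster import AffinityPropagation

def select_exemplars(X: np.ndarray) -> np.ndarray:
    ap = AffinityPropagation(preference=None, random_state=0)  # None -> median similarity
    ap.fit(X)
    return X[ap.cluster_centers_indices_]  # one exemplar per cluster
```

The reduced exemplar set then replaces the full training set, shrinking training time while keeping representative samples.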

5. Evaluation

5.1. Experimental Setup

Classification is a typical machine learning method. The most common basic counts are true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) [32]. These four counts make up a confusion matrix, as shown in Table 3.

From these basic counts, a series of common evaluation metrics can be derived as follows:

The true-positive rate (TPR), sometimes called the detection rate or recall, is defined as $TPR = TP/(TP + FN)$. The false-positive rate (FPR), sometimes called the false alarm rate, is defined as $FPR = FP/(FP + TN)$. The receiver operating characteristic (ROC) is a graphical plot that illustrates the performance of a binary classifier as its discrimination threshold is varied; the curve is created by plotting the TPR against the FPR at various threshold settings. The area under the curve (AUC) is the area under the ROC curve; the closer this value is to 1, the better the performance.
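These metrics follow mechanically from the confusion matrix, as the Python sketch below shows; the helper name tpr_fpr is ours, and roc_auc_score is scikit-learn's standard AUC routine.

```python
# Sketch: compute TPR and FPR from a binary confusion matrix.
from sklearn.metrics import confusion_matrix, roc_auc_score

def tpr_fpr(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tp / (tp + fn), fp / (fp + tn)  # (detection rate, false alarm rate)

# AUC is computed from scores rather than hard labels, e.g.:
# auc = roc_auc_score(y_true, clf.predict_proba(X_test)[:, 1])
```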

We used a dataset of real Android applications and malware to conduct the experiments. The dataset includes all the Android malware from Drebin [33]. Additionally, all the benign applications were randomly collected from Google Play [34]. Previous work [35] pointed out that only 2 in 52,208 applications from Google Play were malware, so these applications can be treated as benign.

We use the Waikato Environment for Knowledge Analysis (Weka) to train the data and create the models. Weka is open-source software that integrates many machine learning algorithms, so we choose it as our experimental tool.

5.2. Malware Detection Performance

We randomly choose 300 malware samples and 300 benign samples as experimental data. 2-gram, 3-gram, and 4-gram sequences are generated, and each is fed into Random Forest, Support Vector Machine (SVM), k-Nearest Neighbor (kNN), and Naive Bayes to build a classifier. The time complexity is shown in Table 4. 10-fold cross-validation is used to evaluate our method. The results are shown in Tables 5–7.
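The experiments themselves are run in Weka; the following scikit-learn sketch is merely an analogue of that protocol (a Random Forest on N-gram feature vectors, evaluated by 10-fold cross-validation), not the original setup.

```python
# Sketch: 10-fold cross-validated AUC for one classifier/N-gram combination.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def evaluate(X, y):
    clf = RandomForestClassifier(random_state=0)
    return cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean()
```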

AUC, TPR, FPR, precision, recall, and F-Measure are selected as performance indicators. In particular, AUC is an important indicator because it comprehensively reflects both TPR and FPR. As shown in Table 5, Random Forest has the best performance on AUC and FPR, and its TPR is only slightly lower than kNN's. As shown in Table 6, Random Forest is the best on every indicator. As shown in Table 7, Random Forest performs best on F-Measure and AUC. From the above analysis, Random Forest is the optimal algorithm across the 2-gram, 3-gram, and 4-gram sequences.

Next, we determine the optimal N-gram length for Random Forest. From Tables 5–7, Random Forest's AUC ranks, from best to worst, as 4-gram, 3-gram, 2-gram, while its FPR ranks as 3-gram, 4-gram, 2-gram. The performance of 2-gram sequences is the worst, so they are no longer considered. The AUC of 4-gram sequences is about 2% higher than that of 3-gram sequences, but the FPR of 4-gram sequences is 30% higher than that of 3-gram sequences. Through the above analysis, 3-gram sequences with Random Forest are the optimal choice.

The size of the sample dataset also affects the evaluation results. Three sample datasets of different sizes are randomly generated: 600, 1200, and 2400 samples. The 3-gram Random Forest method is used for the experiment, and the final model is tested by 10-fold cross-validation. As shown in Table 8 and Figure 3, performance improves as the number of samples increases.

5.3. Comparison of Training Time

To further improve malware detection performance, especially for large volumes of training data, affinity propagation is used to generate representative data for malware detection training. We randomly choose 2400 samples as training data and process them with the AP algorithm, generating 834 exemplars and 516 exemplars, respectively. The representative exemplars are used to train a detection model, which is tested by 10-fold cross-validation. As shown in Table 9, TPR and FPR change little, but the time cost declines dramatically; time performance is thus improved after exemplar selection.

We then compare the proposed method with several baselines. Drebin [22] is a lightweight method for Android malware detection that applies linear-time static analysis and learning techniques for efficiency. Canfora et al. [16] proposed a method that uses frequencies of N-grams of opcodes to detect Android malware effectively. We extract as close to the corresponding number of features as possible for each method and use the same dataset (2400 randomly chosen samples) to train the three models. The results are shown in Figure 4.

Here, we divide our method into two parts. In the first part, we use only opcode instruction symbolization and the N-gram technique to reduce the feature dimensions ("Our" in Figure 4). In the second part, on top of the first, we use the AP clustering algorithm for sample selection ("Our-AP800" in Figure 4).

We can see that the time spent on model training by our method (first part) is about 2.9 s, which is better than the two baselines. Subsequently, we applied the AP clustering algorithm to the samples and selected 800 center samples for retraining. The accuracy of the model decreases slightly, but the time cost declines dramatically, and our method still achieves higher accuracy than the other methods. The experimental results clearly show that TinyDroid outperforms the state-of-the-art methods.

5.4. Comparison with Real World AV Scanners

In this part, two sample datasets of different sizes are chosen to evaluate the detection rate of TinyDroid. For this experiment, we randomly split each dataset into a training set (60%) and a testing set (40%). To compare against common antivirus products in practice, we submit each sample to the VirusBook service [36] and collect the output of 9 common antivirus scanners (Avira, Dr.Web, AVG, Kaspersky, ESET, GDATA, Rising, MSE, and Avast). Finally, we obtain the detection rates and false-positive rates from statistics on this output.

The results of the experiments are shown in Tables 10 and 11. While most scanners detect over 90% of the malware, the detection rates of some scanners are below 50%; evidently, these antivirus scanners are not specialized in detecting mobile applications. On the 2000-sample dataset, TinyDroid provides the third best performance with a detection rate of 95.3%, outperforming 7 of the 9 scanners. On the 4000-sample dataset, TinyDroid provides the second best performance with a detection rate of 98.6%, outperforming 8 of the 9 scanners. This shows that TinyDroid achieves better detection rates as the number of samples increases. Note also that these samples have been public for a long period, so almost all antivirus scanners provide proper signatures for this dataset; a machine learning method has a much greater advantage over traditional techniques when the samples are unknown malware.

The false-positive rates of the antivirus scanners range from 0% to 1% on our dataset of benign applications. Although the false-positive rate of TinyDroid is higher than that of the antivirus scanners, Table 11 shows that it declines as the number of samples increases.

5.5. Malware Classification Performance

In addition to detection performance, we evaluate the malware classification performance. In this part, we select the 9 largest malware families as experimental data. The experimental data contain a total of 900 samples and are separated into a training set (60%) and a testing set (40%). The results are shown in Table 12: all AUC values are higher than 0.97 and all FPR values are lower than 0.04, so TinyDroid shows good performance on malware classification.

6. Conclusion

This paper proposes a novel, lightweight static detection system, TinyDroid. We first use reverse engineering to obtain the Dalvik instructions from DEX files and simplify them into a small symbol set. Then, a detection model integrating N-gram, the exemplar selection method, and a machine learning algorithm is built on the reduced symbolic sequences. We compare TinyDroid against real-world antivirus scanners, and the experimental results show that TinyDroid outperforms state-of-the-art tools on detection and classification.

There are still deficiencies and room for improvement in this study: the samples used in this paper mainly originate from academic sample sets and lack metamorphic malware samples. Metamorphic malware could escape the proposed method, which remains vulnerable to obfuscation and packing.

As a next step, given this limitation, dynamic analysis methods can be studied and combined with our approach to extract more useful features. Meanwhile, other optimization methods can be studied further to keep the model lightweight and improve detection efficiency.

Data Availability

Previously reported data (Android malware applications) were used to support this study and are available at https://doi.org/10.14722/ndss.2014.23247. These prior studies (and datasets) are cited at relevant places within the text as references [16, 24]. The benign applications were randomly collected from Google Play and are available at https://play.google.com/store. All samples included in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This paper was partially supported by the Key National Natural Science Foundation of China under Grant nos. 61772026, U1509214, and 61602412.