Abstract

With the rapid growth of Android devices and applications, the Android environment faces more security threats. Malicious applications that steal usersʼ privacy information, send text messages to trigger fee deductions, or exploit privilege escalation to control the system cause significant harm to end users. To detect Android malware, researchers have proposed various techniques, among which machine learning-based methods that take apps' static features as input vectors have clear advantages in code coverage, operational efficiency, and large-scale sample detection. In this paper, we investigated the structure of Android applications, analysed the various sources of static features, reviewed the machine learning methods for detecting Android malware, studied the advantages and limitations of these methods, and discussed future directions in this field. Our work will help researchers better understand the current state of research, the benefits and weaknesses of each approach, and future technology directions.

1. Introduction

The number of Android applications continues to grow. As of June 2020, there were more than 2.99 million apps in Google Play [1]. At the same time, the number of malicious apps is also increasing rapidly: experts at the AV-TEST Institute counted around 1.8 million new malicious apps in the first half of 2020 [2]. Malicious apps steal personal information from end users, automatically place calls and send text messages to trigger fee deductions, and exploit root privileges to control the system, causing massive harm to end users [3]. Various detection techniques for Android malware have been proposed one after another to cope with these security threats. Machine learning-based detection is one of the most efficient approaches. This type of approach has several advantages:
(i) Comprehensiveness of the detection: when detecting malware, traditional methods usually pay close attention to certain malicious functions, such as the framework proposed by Xu et al. [4] against privilege escalation and the technique offered by He et al. [5] for privacy leakage analysis. Most machine learning methods can detect various malicious behaviours at one time and carry out multifamily classification, so they can serve as a universal and comprehensive detection means.
(ii) Accuracy of the detection: benefiting from the rapid development of machine learning algorithms, the accuracy of identifying malicious applications with such methods keeps increasing. Detection frameworks based on SVM, Naïve Bayes, Perceptron, and deep neural network algorithms are continually being proposed, and they perform better and better in malicious application identification, multiclass classification, and malicious code localization.
(iii) Reduction of dependence on experts: traditional detection methods rely heavily on the rich experience of human experts. In contrast, machine learning methods can identify more malicious application patterns through feature extraction at many different levels with less expert knowledge, which is more conducive to discovering new malware and improving detection efficiency.

In the implementation of Android malware detection using machine learning, the two primary sources of features are static extraction and dynamic extraction [6]. Static features are extracted from the manifest, Dalvik bytecode, native code, sounds, images, and other files obtained by reversing the APK. Dynamic features are collected by running the APK in a monitored environment and recording log entries, code execution paths, variable value traces, sensitive function calls, and other behaviours during the applicationʼs execution.

Although detection methods based on static features have some limitations compared with those based on dynamic ones, such as the difficulty of combating code obfuscation, they also have distinct advantages:
(i) Full code coverage: static feature extraction can cover all code and all resource files by scanning the code or through symbolic pseudoexecution. In contrast, dynamic feature extraction can hardly cover all code execution paths. Many applications require users to provide login credentials before most features can be used, making it difficult to exercise all functions during dynamic execution and resulting in incomplete feature extraction.
(ii) Reliable detection efficiency: static feature extraction completes the detection task in the expected time because it does not need to run the application. In contrast, dynamic feature extraction requires triggering various functions during code execution, which consumes a lot of time. While the application is running, it takes time to simulate clicking through the interface, and the program may perform a very complex computation or enter an infinite loop. These conditions make it difficult for the detection task to be completed within the specified time frame.
(iii) Unperceived by malicious code: static detection does not require the codeʼs execution, so malicious applications cannot recognize that they are under inspection. Although some malware attempts to make static analysis more challenging by inserting interference code, this added code may itself be an identifier that helps to spot malicious applications.
(iv) Easier generation of generic fingerprints: static analysis of malicious samples tends to extract features that are invariant and universal, whereas dynamic analysis is very likely to be affected by the operating environment. Statically extracted features are suitable for fingerprinting and can be used for the rapid predetection of large-scale sets of malicious applications.

Several surveys about Android malware detection have been published in the past few years. The authors in [7–9] analysed the Android security mechanism and typical malware detection methods. The authors in [10–12] focused on applying deep learning algorithms such as Restricted Boltzmann Machines, Convolutional Neural Networks, Deep Belief Networks, Recurrent Neural Networks, and Deep Autoencoders to malware detection and analysed the advantages and the results achieved. The authors in [13] are mainly concerned with the analysis of detection methods for Android malware variants. The authors in [14] investigated Android malware detection and protection technology based on data mining algorithms. The studies in [15, 16] collected the literature of the past few years, systematically analysed static detection technology, and discussed datasets, features, algorithms, empirical experiments, and performance measures. The study in [17] introduced the Android architecture, security mechanism, malware classification, and the entire detection process, including sample collection, data preprocessing, feature selection, machine learning model construction, and experimental evaluation. The studies in [18–20] comprehensively discussed static, dynamic, and hybrid detection techniques. The study in [21] mainly focused on mobile malware detection techniques and analysed signature-based detection, anomaly-based detection, and other traditional detection methods. The study in [22] discussed various threats and the current security state of the Android platform, introduced three attack types, explained the factors contributing to the increase of malware, and analysed the defensive mechanisms of Android protection.

The above surveys have done excellent work, but some aspects can still be improved. For example, the sources of static features, the challenges that obfuscation technology poses to static analysis, and the deterioration issues of machine learning models are not investigated in detail. Our work aims to provide a comprehensive survey of machine learning-based static detection of Android malware. To this end, we searched IEEE, ACM, Springer, Wiley, Hindawi, and other databases and used Google Scholar and DBLP to find the related papers. It is worth noting that we only use research papers indexed in DBLP since 2016 for the statistics on static feature types, machine learning algorithms, dataset usage, number of papers, and evaluation metrics. The reason is that DBLP has become an authoritative database; the papers it indexes are of relatively high quality and relatively few in number, so we can manually check the detection technology, algorithm model, and evaluation method used in each article to form accurate statistical results. Based on the papers we collected on static detection of malicious Android applications, we completed this survey.

In our work, we analysed Android applicationsʼ static features and the typical obfuscation methods, discussed machine learning algorithms suitable for Android malware detection, explained the evaluation metrics and sustainability issues of machine learning models, investigated the technical routes, advantages, and disadvantages of the existing research, and made an outlook on possible future research directions in this field. The main contributions can be summarized as follows:
(i) We carried out a comprehensive review of the various machine learning-based static detection methods for Android malware. The basic principles, feature sources, datasets, performance metrics, contributions, and limitations of the methods were compared vertically.
(ii) We analysed the composition of Android applications, the sources of static feature extraction, and the feature vector generation methods in detail.
(iii) The limitations of current methods were discussed, and future development directions were outlined.

2. Android App’s Static Features’ Analysis

2.1. Android App’s Static Features

Android apps are composed of loosely coupled components bound together by the manifest file. The manifest file describes each component, application metadata, platform requirements, external libraries, required permissions, etc. Activity, service, content provider, and broadcast receiver [23] are the basic components that make up an Android app. Each of them performs different functions.

An Android app is released in the APK package form, a zip archive mainly composed of assets, lib, res, manifest, Dalvik bytecode, and resource files [24]. These files are commonly used as static feature sources. According to each fileʼs role, the feature vectorʼs extraction method and expression are different. As shown in Figure 1, there are mainly the following types of features.

2.1.1. Permission Features

These features come from AndroidManifest.xml, where the various permissions that an app requires during its runtime are declared. Extracting the permissions listed in this file can help determine whether the app is malicious. The Android system provides about 250 kinds of permissions [25], so the feature vector typically takes the form of a binary vector about 250 bits long, as sketched below.
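As an illustration, the following minimal Python sketch builds such a binary permission vector. It assumes the manifest has already been decoded from binary AXML to plain XML (e.g., with apktool), and PERMISSION_VOCAB is a hypothetical placeholder for the full list of official permission names.

```python
# Minimal sketch: binary permission vector from a decoded AndroidManifest.xml.
# Assumes the manifest was already converted from binary AXML to plain XML
# (e.g., by apktool); PERMISSION_VOCAB stands in for the ~250 official permissions.
import xml.etree.ElementTree as ET

ANDROID_NS = "{http://schemas.android.com/apk/res/android}"
PERMISSION_VOCAB = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
    # ... remaining official permission names
]

def permission_vector(manifest_path):
    root = ET.parse(manifest_path).getroot()
    requested = {
        elem.get(ANDROID_NS + "name")
        for elem in root.iter("uses-permission")
    }
    # One bit per known permission: 1 if requested by the app, 0 otherwise.
    return [1 if perm in requested else 0 for perm in PERMISSION_VOCAB]
```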

2.1.2. Component Features

These features come from AndroidManifest.xml and classes.dex. The four basic components must be registered in AndroidManifest.xml and are declared and created by calling the related system APIs in classes.dex. The types and quantities of the components used in an app are generally included in the formed feature vectors.

2.1.3. Intent Features

These features come from AndroidManifest.xml and classes.dex. Intents are used to pass messages between components: when an intent is delivered to a component, a predefined call-back function is executed to process it. When forming feature vectors, intents are often used together with components because they help analyse the association between two components.

2.1.4. Constant String Features

These features come from resource and dex files: the strings.xml resource file stores developer-defined strings, and the dex file stores the strings defined in the Smali code. Extracting the content and frequency of these strings can reflect the appʼs characteristics. Because there are many kinds of strings and some of them are very long, a hash operation is generally carried out before further processing when forming feature vectors, as sketched below.
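A minimal sketch of such a hashing step is shown below, using the hashing trick to map an arbitrary set of strings into a fixed-size count vector; the bucket count and the example strings are illustrative only.

```python
# Minimal sketch of the hashing trick for constant-string features: every string
# is hashed into one of n_buckets counters, so arbitrarily many (and arbitrarily
# long) strings map onto a fixed-size feature vector.
import hashlib

def string_hash_features(strings, n_buckets=1024):
    vector = [0] * n_buckets
    for s in strings:
        digest = hashlib.md5(s.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % n_buckets
        vector[bucket] += 1  # count how often each bucket is hit
    return vector

# Example: strings gathered from strings.xml and const-string instructions
features = string_hash_features(["http://track.example.com", "sendTextMessage", "su"])
```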

2.1.5. Resource File Features

These features come from the res and assets directories, which include sounds, images, and layout files. Many repackaged or cloned malicious applications do not modify resource files, so such files can be used as static features.

2.1.6. Opcode and API Features

These features come from dex files. The frequency of Dalvik opcodes and API calls reflects developersʼ programming habits and is well suited to generating detection features. In general, the number of occurrences of opcodes and APIs differs significantly between malware and benign apps. Feature vectors can be generated by measuring the frequency of sequences of N consecutive opcodes or APIs (n-grams), as sketched below.
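The following minimal sketch computes such relative n-gram frequencies from an opcode sequence; the toy sequence is purely illustrative, and in practice the sequence would come from the disassembled classes.dex.

```python
# Minimal sketch: relative frequencies of n consecutive opcodes (n-grams).
from collections import Counter

def opcode_ngram_frequencies(opcodes, n=2):
    ngrams = [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]
    counts = Counter(ngrams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

# Toy opcode sequence standing in for a disassembled method body
freqs = opcode_ngram_frequencies(
    ["move", "invoke-virtual", "add-int", "invoke-virtual", "return-void"], n=2
)
```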

2.1.7. Native Code Frequency Features

These features come from .so files. Much malware uses native instructions to perform malicious operations because the compiled native code is more difficult to decompile, which creates obstacles for detection. Extracting the ARM opcode frequency and system call invocation frequency from the .so files can significantly help the detection work.

2.1.8. Control Flow Graph and Data Flow Graph Features

These features come from dex files. The control flow graph and data flow graph can be obtained by analysing the instruction invocation relationships and data flow directions in the code. Transforming these graphs into vector representations through graph embedding and other methods can distinguish malicious from benign behaviours well.
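As a hedged illustration of how such graphs can be turned into numeric features, the sketch below derives a few simple structural statistics from a call graph given as caller-callee pairs; the edge list is hypothetical, and real systems typically use graph embedding methods (e.g., DeepWalk or SDNE, discussed in Section 4) rather than hand-picked statistics.

```python
# Minimal sketch: simple structural features from a call/control flow graph
# given as (caller, callee) edges. Real detectors usually replace these
# hand-picked statistics with learned graph embeddings.
import networkx as nx

def graph_features(edges):
    g = nx.DiGraph()
    g.add_edges_from(edges)
    return [
        g.number_of_nodes(),
        g.number_of_edges(),
        nx.density(g),
        max((deg for _, deg in g.out_degree()), default=0),  # fan-out hotspot
    ]

features = graph_features([("onCreate", "sendTextMessage"), ("onCreate", "getDeviceId")])
```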

The above static features are used differently according to the detection scenario. We counted the research papers on machine learning-based static detection of Android malware in the DBLP database between Jan. 2019 and Nov. 2020. After eliminating repetitive and irrelevant articles, we obtained 118 papers. The statistics of static feature usage are shown in Figure 2. In total, features were used 277 times across these papers, with API features adopted most frequently and native opcode features the least. This situation illustrates that these features play different roles in identifying malicious apps.

In recent years, the academic community has paid great attention to the sustainability and deterioration issues faced by learning-based detection models [26, 27]. A key to achieving high sustainability of a classifier lies in the underlying features being able to differentiate benign apps from malware for a long period. Zhang [28] built an API Graph based on API features to enhance malware classifier performance and slow down model aging with the similarity information among evolved Android malware. Xu [29] built and dynamically expanded the feature set based on APIs to train a variety of online learning models in a model pool that can determine the drift samples and update the aging model in a weighted voting style.

2.2. Android Malware Datasets with Static Features

To facilitate the analysis and utilization of static features, some researchers provided Android malware datasets with static features.

Drebin dataset [30] contains 5560 apps from 179 different malware families in the period of August 2010 to October 2012, providing eight types of features including hardware features, requested permissions, app components, filtered intents, restricted API calls, used permissions, suspicious API calls, and network addresses.

MalGenome dataset [31] provided 1260 Android malware samples in 49 different malware families from Aug 2010 to Oct 2011 and analysed the characteristics of malicious applications during installation and activation, including some static features, such as used permissions.

AMD dataset [32] contains 24,553 samples, categorized in 135 varieties among 71 malware families ranging from 2010 to 2016. The authors use the n-gram sequence of Dalvik bytecode to denote an Android app feature. The feature categories include Dalvik bytecode opcode sequences, Java VM-type signatures, string value of const-string instructions, and type signatures for invoked functions.

ISCX Android Botnet Dataset [33] includes 1929 botnet samples belonging to 14 different families collected between 2010 and 2014. The static features they analysed are mainly embedded and obfuscated URLs.

Mystique [34] generated over 10,000 samples of Android malware by combining different attack and evasion features. Their features include entry points for malicious attack behaviors, permissions, intent filters, and source and sink APIs.

PRAGuard dataset [35] contains 10,479 samples, obtained by obfuscating the MalGenome [31] and Contagio [36] datasets with seven different obfuscation techniques that obfuscate some static features to avoid inspection by analysis tools.

Datasets such as AndroZoo [37], Contagio [36], VirusShare [38], and VirusTotal [39] provide detailed classification descriptions, but do not provide clear static features, which need to be extracted by the researcher.

2.3. Static Feature’s Obfuscation Technology

To make the applications more difficult for reverse engineers to understand or be checked by detection tools, malicious application developers will obfuscate static features to a certain extent. The main methods are as follows.

2.3.1. Identifier Renaming

The name of the identifier in the code is usually meaningful, and developers generally follow similar naming rules. To make it difficult for reverse engineers to understand code logic and reduce potential information leakage, the malware developers may use random strings to replace the identifiersʼ names [40]. This kind of renaming is more effective for humans than for machine analysis. Since the identifier renaming method does not work with Android API functions, Ficco [41] bypassed this techniqueʼs obstacle by comparing the API call sequence.

2.3.2. String Confusion

Encoding and encrypting the strings in the resource and code files and decrypting and reading them at runtime can achieve the effect of bypassing the detection [42]. In this case, the encryption function and encoding function should be included in the investigation scope and examined carefully. For binary programs, previous works [43, 44] had proposed various approaches to identify cryptographic functions in a program, such as AES, DES, and RC4. For Android apps, Suarez-Tangil et al. [45] dealt with this kind of obfuscation by strengthening the checking of cryptographic system API calls.

2.3.3. Call Indirection

The malicious developer modifies the original methodʼs invocation entry, inserting a new randomly named method before invoking the original. Calling the original method through the newly inserted method will add many unrelated nodes to the control flow graph, rendering some detection methods based on these technologies ineffective. Garcia [46] constructed a detection framework that used sensitive API usage, data flows between APIs, intent action, and package API usage features to detect malicious apps that use various obfuscation techniques, including call indirections.

2.3.4. Junk Code Insertion

Malicious developers often add junk instructions to the code files, such as NOP, jump, and register operation instructions, to increase the codeʼs complexity and eliminate the original static characteristics. Some detection methods based on opcode statistics [47, 48] will be disturbed by this obfuscation technology. In comparison, the method based on the source-sink data flow [49] will normally work because the influence of the inserted junk instructions on the data flow is limited.

2.3.5. Dynamic Code Loading

An Android app can load native code and additional Dalvik bytecode from local resource files, other apps, or remote networks. Malicious developers often load dynamic code through the Ldalvik/system/DexClassLoader class or make reflection calls via Ljava/lang/reflect/Method, making it challenging to locate the malicious code. The typical detection approach is to check the appʼs sensitive function calls to judge whether it is malicious. Poeplau [50] constructed the super control flow graph (sCFG) to check sensitive function calls and parameter passing to judge whether dynamic calls are vulnerable or malicious.
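As a simple hedged illustration of such a check, the sketch below scans decompiled Smali files for references to dynamic class loading and reflective invocation; the marker strings and directory layout are assumptions, not a complete rule set.

```python
# Minimal sketch: flag decompiled Smali files that reference dynamic code loading
# or reflection, two common carriers of hidden behaviour.
from pathlib import Path

SUSPICIOUS_MARKERS = (
    "Ldalvik/system/DexClassLoader;",
    "Ljava/lang/reflect/Method;->invoke",
)

def find_dynamic_loading(smali_dir):
    hits = []
    for smali_file in Path(smali_dir).rglob("*.smali"):
        text = smali_file.read_text(errors="ignore")
        if any(marker in text for marker in SUSPICIOUS_MARKERS):
            hits.append(str(smali_file))
    return hits
```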

For the above obfuscation methods, various static detection techniques have different countermeasures. Generally speaking, it is difficult for malicious developers to obfuscate all static features. Static detection typically improves the detection accuracy by using multiple features comprehensively [45, 46, 51, 52].

3. Machine Learning Algorithms and Evaluation Metrics

Using machine learning algorithms for Android malware classification requires balancing efficiency and accuracy. Generally speaking, shallow machine learning algorithms are used to construct simple classification models; they are simple to implement and fast to run, but their precision is relatively low. Complex machine learning models achieve high detection accuracy, but their efficiency is less desirable. Many methods compromise between the two.

Logistic Regression, Naïve Bayes, Support Vector Machine, k-Nearest Neighbour, Decision Tree, and Random Forest are shallow machine learning algorithms suitable for detecting Android malware. (1) Logistic regression [53] is a generalized linear regression analysis model for estimating the probability of a particular event. Its purpose is to find a best-fit model that describes the relationship between the dependent variable and a set of independent variables. (2) Naïve Bayes [54] is based on the Bayesian theorem and assumes that the features are conditionally independent. The Bayesian network [55], also known as the reliability network, is suitable for expressing and analysing uncertain and probabilistic events and can make inferences from incomplete, inaccurate, or uncertain knowledge. (3) The Support Vector Machine (SVM) [56] is a generalized linear classifier that classifies data in a supervised learning manner; its decision boundary is the maximum-margin hyperplane solved from the learning samples. The SVM can perform nonlinear classification with the kernel method and is one of the common kernel learning methods. (4) The idea of the k-Nearest Neighbour (kNN) [57] algorithm is that if most of the k nearest neighbour samples of a sample belong to a specific category, the sample also belongs to that category and shares its attributes. (5) The Decision Tree [58] is a nonparametric supervised learning method that can summarize decision rules from a series of data with features and labels and use a tree structure to present these rules for solving classification and regression problems. (6) The Random Forest [59] is a classifier that contains multiple decision trees; its output category is determined by the majority vote of the individual treesʼ output categories.
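As a minimal illustration of how such shallow classifiers are applied to pre-extracted static feature vectors, the scikit-learn sketch below trains an SVM and a Random Forest on synthetic stand-in data; the synthetic features merely take the place of real permission/API vectors.

```python
# Minimal sketch: train two shallow classifiers on static feature vectors.
# make_classification() produces synthetic stand-in data for real
# permission/API feature vectors (1 = malware, 0 = benign).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=250, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

svm = SVC(kernel="rbf").fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)

print("SVM accuracy:          ", svm.score(X_test, y_test))
print("Random Forest accuracy:", forest.score(X_test, y_test))
```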

Deep neural network models suitable for detecting Android malware mainly include the Deep Belief Network, Convolutional Neural Network, Recurrent Neural Network, Generative Adversarial Network, Multimodal Machine Learning, Multiple Kernel Learning, graph embedding, and representation learning. (1) The Deep Belief Network (DBN) [60] is a probabilistic generative model that establishes a joint distribution between observation data and labels. It is composed of multiple restricted Boltzmann machines, and a layer-by-layer training method is used to solve the problem that traditional neural network training methods are not suitable for multilayer networks. (2) The Convolutional Neural Network (CNN) [61] is a feedforward neural network with convolutional computation and a deep structure. It has the ability of representation learning and can perform shift-invariant classification of input information according to its hierarchical structure. (3) The Recurrent Neural Network (RNN) [62] is a recursive neural network that takes sequence data as input and recurses along the sequenceʼs evolution direction. It is Turing-complete and has advantages when learning the nonlinear features of sequences. (4) The Generative Adversarial Network (GAN) [63] is one of the most promising methods for unsupervised learning on complex distributions. The model produces reasonably good output through mutual game learning between the frameworkʼs generative and discriminative models. (5) Multimodal Machine Learning (MMML) [64] aims to process multisource modal information and can perform various tasks, such as representation learning, collaborative learning, and modal conversion. (6) Multiple Kernel Learning [65] uses multiple kernel functions to map and combine various features so that the data can be expressed more reasonably in the new combined space. (7) Graph embedding [66] maps high-dimensional sparse graph data into low-dimensional dense vectors, which solves the problem that graph data are difficult to feed into machine learning algorithms efficiently. (8) Representation learning [67] is a technique for learning feature representations that transforms raw data into a form that machine learning can effectively exploit. It avoids the hassle of manually extracting features, allowing computers to learn to use features while also learning how to extract them.
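To make the deep-model option concrete, the following minimal Keras sketch builds a 1-D CNN over opcode/API index sequences, loosely in the spirit of the CNN-based detectors surveyed in Section 4; the vocabulary size, sequence length, layer widths, and random toy data are all placeholder assumptions.

```python
# Minimal sketch: a 1-D CNN over opcode/API index sequences.
# Vocabulary size, sequence length, and layer widths are arbitrary placeholders;
# the random arrays stand in for real opcode sequences and labels.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, SEQ_LEN = 300, 500   # e.g., 300 distinct opcodes, 500 opcodes per app

model = tf.keras.Sequential([
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=64),
    layers.Conv1D(filters=128, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # malware probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

X = np.random.randint(0, VOCAB_SIZE, size=(256, SEQ_LEN))
y = np.random.randint(0, 2, size=(256,))
model.fit(X, y, epochs=1, batch_size=32)
```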

In the DBLP database, 225 classifier algorithms are used in the 118 papers on static machine learning-based Android malware detection from Jan. 2019 to Nov. 2020. Figure 3 shows the distribution of these algorithms. Shallow learning algorithms are used more often in the research: the SVM algorithm ranks first, having been used 36 times. Among the deep neural network models, the CNN algorithm ranks first, having been used 16 times.

To evaluate the performance of machine learning algorithms, researchers developed various evaluation metrics. The traditional metrics are defined in the following equations:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP},$$
$$\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$

where TP is the number of true positive samples, FN is the number of false negative samples, FP is the number of false positive samples, and TN is the number of true negative samples. Another commonly used evaluation metric is AUC (Area Under Curve), defined as the area enclosed by the receiver operating characteristic curve and the coordinate axis. The higher the AUC value, the better the model performs.
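A minimal sketch of these computations follows, deriving the traditional metrics directly from the confusion-matrix counts defined above and obtaining AUC via scikit-learn; the numeric inputs are illustrative only.

```python
# Minimal sketch: traditional metrics from confusion-matrix counts, plus AUC.
from sklearn.metrics import roc_auc_score

def basic_metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

print(basic_metrics(tp=90, fp=5, tn=95, fn=10))

# AUC needs the classifier's scores, not just the counts:
print(roc_auc_score([1, 1, 0, 0], [0.9, 0.6, 0.4, 0.1]))
```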

ML-based Android malware detection models are prone to deterioration due to the rapid emergence of new malicious apps. Researchers have introduced a series of metrics to evaluate a modelʼs sustainability, including AUT (Area Under Time) [68], Stability [69], Algorithm Credibility [70], and Algorithm Confidence [70].

AUT is a metric proposed by Pendlebury [68], defined as the area under the performance curve over time to represent the model’s sustainability:

$$\mathrm{AUT}(f, N) = \frac{1}{N - 1}\sum_{k=1}^{N-1}\frac{f(k + 1) + f(k)}{2},$$

where f is the performance metric (e.g., F1-score, Precision, or Recall), N is the number of test slots, and f(k) is the performance metric evaluated at time slot k. A perfect classifier that is robust to time decay has an AUT closer to 1.
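A minimal sketch of this trapezoidal computation is shown below; the monthly F1 values are invented purely to illustrate a decaying classifier.

```python
# Minimal sketch: AUT as the normalized trapezoidal area under a per-slot
# performance curve (a constant perfect score of 1.0 yields AUT = 1.0).
def aut(scores):
    n = len(scores)                      # number of test slots N
    if n < 2:
        return scores[0] if scores else 0.0
    area = sum((scores[k] + scores[k + 1]) / 2 for k in range(n - 1))
    return area / (n - 1)

# Illustrative F1-scores measured on 12 monthly test slots after training
monthly_f1 = [0.95, 0.93, 0.90, 0.88, 0.85, 0.84, 0.80, 0.78, 0.75, 0.74, 0.72, 0.70]
print(aut(monthly_f1))
```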

The Stability metric proposed by Cai [69] indicates how stable a classifier is without retraining or any other model update. It is measured by a tuple <es, n>, where es is the classification accuracy the classifier achieves in the average case when trained on apps of year x and tested on apps of year x + n, n ≥ 1.

Jordaney [70] proposed the Algorithm Credibility and Algorithm Confidence metrics to assess the decisions of classifiers and identify aging classification models before their performance starts to degrade. They first defined a p-value for a test object $z^{*}$ with respect to a set of objects K, which is the proportion of objects in K that are at least as dissimilar to a set C as $z^{*}$:

$$p_{z^{*}} = \frac{\left|\left\{i : \alpha_i \ge \alpha_{z^{*}}\right\}\right|}{|K|},$$

where the nonconformity measure $\alpha$ tells how different an object z is from a set C. Based on the p-value, they then defined Algorithm Credibility and Algorithm Confidence as

$$\text{Credibility} = p_{z^{*}}^{\hat{y}}, \qquad \text{Confidence} = 1 - \max_{y \ne \hat{y}} p_{z^{*}}^{y},$$

where $p_{z^{*}}^{\hat{y}}$ is the p-value of the test object for the label $\hat{y}$ chosen by the algorithm under analysis, and the Confidence is 1.0 minus the maximum p-value among all p-values except the one for the chosen label. Through these two metrics, it is possible to understand whether the choices made by an algorithm are supported by statistical evidence, making it easier to discover concept drift and model aging issues.
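A minimal hedged sketch of these quantities is given below: the nonconformity scores are supplied by the caller (any measure of dissimilarity from a class will do), and the per-class p-values in the example are invented.

```python
# Minimal sketch: conformal-style p-values, Algorithm Credibility, and Algorithm
# Confidence. Nonconformity scores (how "strange" an object is for a class) are
# supplied by the caller; the example numbers are illustrative only.
def p_value(test_score, calibration_scores):
    # Proportion of calibration objects at least as strange as the test object.
    at_least_as_strange = sum(1 for s in calibration_scores if s >= test_score)
    return at_least_as_strange / len(calibration_scores)

def credibility_and_confidence(p_values, chosen_label):
    credibility = p_values[chosen_label]
    others = [p for label, p in p_values.items() if label != chosen_label]
    confidence = 1.0 - max(others) if others else 1.0
    return credibility, confidence

# Per-class p-values for one test app whose predicted label is "malware"
print(credibility_and_confidence({"malware": 0.82, "benign": 0.10}, chosen_label="malware"))
```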

4. Literature Review

It is worth noting that many papers use more than one machine learning algorithm. As a classification basis, we analysed in detail which algorithm the researchers focus on, which algorithm is discussed or improved, and which algorithm achieved the best results in the authorsʼ experiments.

4.1. Logistic Regression

Tiwari and Shukla [71] proposed a method to detect Android malware using permissions and APIs. The authors divided the detection of malicious applications into four steps: reverse engineering, feature extraction, feature vector generation, and classification. They reversed the APK file with reverse engineering tools and obtained AndroidManifest.xml and Smali files. Permissions from AndroidManifest.xml and APIs from the Smali files were extracted to generate combined feature vectors. They achieved 96.56% accuracy for the combined features using the logistic regression algorithm.

Milosevic et al. [72] presented two machine learning-aided approaches for static analysis of Android malware. The first approach is based on permission features. The Precision of 0.823, Recall of 0.822, and F-score of 0.821 are achieved using the logistic regression model as a classifier. The other approach extracts features from code files. Android apps are first reversed into multiple Java files; then, the natural language processing method is used to generate feature vectors through the bag-of-words model. SVM with SMO, logistic regression, simple logistic regression, and AdaBoostM1 with SVM algorithms are integrated into the framework, and they achieved Precision, Recall, and F-score values of 0.958, 0.957, and 0.956, respectively.

4.2. Naïve Bayes and Bayesian Network

Yerima [73] proposed an effective Bayesian classification method for Android malware. The authors developed three tools: API call detectors, command detectors, and permission detectors. They extracted features from API calls, resources, assets, libraries, and permissions, respectively. Through experimental data analysis, the top-n attributes with the most discriminative ability are selected to form effective features. Finally, a Bayesian classifier is trained to make decisions. Their experimental dataset contains 1000 malware samples and 1000 benign apps. Using 20 attributes as classification features, the method reaches an Accuracy of 0.921, a Precision of 0.935, and an AUC of 0.97223.

Sanz [74] proposed a method for categorizing Android apps through machine learning techniques. They extracted three different feature sets: the frequency of the printable strings, the various permissions of the app itself, and the appʼs permissions gathered from the Android market. They used Random Forest, J48, kNN, Bayesian Networks, Naïve Bayes, and SVM as classifiers to carry out experiments on 820 samples of seven different families and concluded that Bayes TAN was the best classifier obtaining an AUC of 0.93.

4.3. Support Vector Machine

Zhao [75] presented a Feature Extraction and Selection Tool (FEST) based on machine learning approaches for malware detection. According to the predefined rules, the authors first implemented a feature extraction tool named AppExtractor. Then, they proposed a feature selection algorithm named FrequenSel, which selects features by finding the difference of permission and API frequencies between malware and benign apps. In experiments, the authors tested various classification algorithms and found that the SVM algorithm was the best.

Nissim et al. [76] introduced a framework named ALDROID based on the active learning method. Their framework aimed to select only new informative applications (benign and especially malicious) to reduce security expertsʼ labelling efforts. They first extracted permissions from manifest files, counted the number of activities, services, receivers, and content providers as features, and finally used the SVM as a classification algorithm. The highest performance achieved an Accuracy of 98.8%, a TPR of 90%, and an FPR of 0.0008.

Xu et al. [77] analysed intercomponent communication- (ICC-) related characteristics and proposed a method of identifying malware called ICCDetector, which could capture interactions between components or across app boundaries. The ICCDetector outputs all ICC sources and sinks from the APK file. These sources-sinks and other ICC-related features can be extracted to form feature vectors. The authors experimented with a dataset of 5264 malware and 12,026 benign apps with the SVM algorithm as a classifier. They achieved an accuracy of 97.4%, with a lower FPR of 0.67%. Furthermore, they discovered 43 new malicious apps from the benign dataset using the ICCDetector tool.

4.4. k-Nearest Neighbour

Wu [78] developed a system called DroidMat, considering the static information, including permissions, component deployments, intent messages, and API calls for characterizing the Android app’s behaviour. Firstly, the DroidMat extracts the information from the manifest file and regards components as entry points for drilling down to trace API calls. Next, it applies the k-means algorithm to enhance malware modelling capability. The number of clusters is decided by the Singular Value Decomposition (SVD) method on the low-rank approximation. Finally, it uses the kNN algorithm to classify the app as benign or malicious. The model achieved 97.87% Accuracy, 87.39% Recall, 96.74% Precision, and 91.83% F1-score on the dataset from the “Contagio mobile” site.

Baldini and Geneiatakis [79] investigated the simple machine learning classifiersʼ performance. The authors performed an extensive comparison using various well-known distance measures over the Drebin dataset. Results show that the distance measureʼs proper choice can provide a significant enhancement to the classification accuracy. Specifically, Hamming and CityBlock can boost the classifiersʼ performance in mobile malware detection. For instance, CityBlock can improve the kNN algorithmʼs false positive rate by up to 33% compared to the Euclidean distance.

4.5. Decision Tree and Random Forest

Sanz [80] developed a method that extracts several features from the manifest file to build machine learning classifiers. These feature sets are the permissions required for the app and the features under the uses-features group. They generated an input vector for all possible permissions, then used Naïve Bayes, J48, Random Forest, and other classifiers to perform experiments, and achieved the best results on the Random Forest containing 100 trees with an AUC of 98% and an Accuracy of 94.83%.

Canfora [47] employed the probabilities of sequences of n consecutive opcodes as features. For example, with 2-opcode sequences [(move, invoke), (invoke, add), …], each parenthesized pair is a component of the feature vector; if the probability of (move, invoke) is 0.001 and the probability of (invoke, add) is 0.003, the feature vector becomes [0.001, 0.003, …]. The authors trained two classifiers, SVM and Random Forest, for binary classification. Results show that 97% accuracy can be obtained on average when 2-opcode sequences are used. Kang [48] did a similar job: they also used n-opcode features to test the Naïve Bayes, SVM, Partial Decision Tree, and Random Forest classification algorithms. For N = 3 and N = 4, the SVM shows the best F1-score of 98%, and Random Forest shows the best performance in terms of both training and prediction speed.

4.6. Deep Belief Network

Su [81] proposed DroidDeep, a malware detection approach for the Android platform based on the deep learning model. Requested permission, used permission, sensitive API call, action, and app components were used as static features. The authors built a Deep Belief Network model to learn the features. In experiments with 3986 benign apps and 3986 malware, DroidDeep achieved a 99.4% detection accuracy.

Li [82] proposed a weight-adjusted malware detection approach named DroidDeepLearner. The approach uses both dangerous API calls and risky permission combinations as features to build a Deep Belief Network model, which can automatically distinguish malware from benign ones. The results show that their approach achieves over 90% accuracy, with only 237 features on the Drebin dataset.

4.7. Convolutional Neural Network

Zhang et al. [83] proposed DeepClassifyDroid, which takes a three-step approach as follows: feature extraction, feature embedding, and detection, to discriminate malware from android apps based on the Convolutional Neural Network. They embedded permissions, intent-filters, API calls, and constant strings from the disassembled codes in a unified joint-vector space. Then, they trained a CNN model with two convolutional layers, a pooling layer, and a full connection layer to learn these vectors. Experiments show that the approach achieves an accuracy of 97.4% with few false alarms on a dataset of 5546 malware and 5224 benign apps.

Nix and Zhang [84] investigated the effectiveness of CNN and LSTM for Android apps’ classification using system API call sequences. They encoded each API call using a one-hot vector, and then, each segment is encoded by a matrix of size n × m, which serves as the input to a CNN model. They compared their CNN model with LSTM and other n-gram-based methods. Both CNN and LSTM significantly outperformed n-gram-based methods, and the performance of CNN is the best. The experiments show that the results achieve 99.4% Accuracy, 100% Precision, and 98.3% Recall on a dataset of 1016 APK files.

Ganesh [85] proposed a CNN-based deep learning model that can extract the patterns of malware. They demonstrated that CNN is appropriate for malware detection by using data transformation. The APK file is parsed and decompiled using Androguard and a Smali disassembler. Then, the extracted manifest file is converted into a 12 × 12 vector of permissions, which is fed into the trained CNN model. Their solution identifies malware with 93% accuracy on a dataset of 2500 Android apps, of which 2000 were malicious and 500 were benign.

4.8. Recurrent Neural Network

Amin et al. [86] proposed an end-to-end deep learning architecture that detects and attributes Android malware via opcodes extracted from bytecode files. They confirmed that bidirectional long short-term memory (BiLSTM) neural networks can be effectively applied to detect Android malwareʼs static behaviour without using handcrafted features. Experimental results report an accuracy of 99.9% and an F1-score of 99.6% on a large dataset of more than 1.8 million Android applications.

Lee et al. [87] proposed a stacked RNNs and CNNs-based classification model for learning the generalized correlation between obfuscated strings from the package’s and certificate owner’s name. The model uses the embedded method and the GRU unit to extract features and uses additional CNN units to optimize the extraction process. Their experiments demonstrate that the feature extraction process is robust to obfuscation and sufficiently lightweight for Android devices and that the CNN-RNN method improved classification performance by 16% against n-gram features and reduced training time by 50% against an RNN model.

Ma [88] proposed Droidetec, a deep learning-based method for Android malware detection and malicious code localization, which models an application program as a natural language sequence. Droidetec adopts a depth-first algorithm to extract API sequences from the Android app as features. Based on that, a BiLSTM network is utilized for malware detection. Each unit in the extracted behaviour sequence is represented as a vector, allowing Droidetec to automatically analyse the semantics of sequence segments and eventually discover the malicious code. Experiments with 9616 malicious and 11,982 benign programs show that Droidetec reaches an accuracy of 97.22% and an F1-score of 98.21%. Overall, Droidetec properly locates malicious code segments with a hit rate of 91%.

4.9. Generative Adversarial Network

Amin [89] proposed a Generative Adversarial Network-based model to detect Android malware, inspired by the well-known two-player game theory behind rock-paper-scissors. Inside the discriminator and generator, they incorporated LSTM as the deep learning architecture to learn the opcode-based binary sequential data on a large, unlabelled dataset. Test data sequences are passed through a context window that determines the bytecode sequences that differ from those previously recorded. If the sequences mismatch at one or more locations, this helps evaluate and characterize the behaviour of the APK. The technique achieved an F1-score of 99% with a receiver operating characteristic of 99%.

4.10. Multimodal Deep Learning and Multiple Kernel Learning

Kim [90] proposed a model based on Multimodal Deep Learning to detect Android malware. The model has five initial networks and a final network. The initial networks take five types of features as inputs and output intermediate vectors to train the final network. The features are refined using an existence-based or similarity-based extraction method, which can reflect Android appsʼ properties from various aspects. The authors achieved 98% and 99% accuracy on the VirusShare and MalGenome dataset, respectively.

Narayanan et al. [91] proposed MKLDroid, a unified framework that systematically integrates multiple views of apps for comprehensive malware detection and malicious code localization. MKLDroid uses a graph kernel to capture structural and contextual information from appsʼ dependency graphs and identify malicious code patterns in each view. Subsequently, it employs Multiple Kernel Learning (MKL) to find a weighted combination of the views that yields the best detection accuracy. Besides multiview learning, MKLDroid can locate fine-grained malicious code portions in dependency graphs. On benchmark datasets, MKLDroid achieves more than 97% F-measure. In the malicious code localization experiments on a dataset of repackaged malware, MKLDroid identifies all the malicious classes with 94% average recall.

4.11. Graph Embedding

Pektas and Acarman [92] used the API call graph to represent all possible execution paths that a malicious app can track during its runtime. The API call graph was embedded into a low-dimension numeric vector feature set, which was introduced to the deep neural network. The authors built a CNN model as the classifier, which contains two convolution layers, one pooling layer, one flatten layer, and one dense layer, to decide whether a given app was malicious or benign. They evaluate four different graph embedding methods, namely, DeepWalk, Node2vec, Structural Deep Network Embedding, and Higher-Order Proximity Preserved Embedding, and found that SDNE provides more discriminative features. The result reached 98.86% accuracy when graph embedding size is equal to 32.

4.12. Representation Learning

Narayanan [93] designed a semisupervised representation learning framework named apk2vec to automatically generate a compact representation for a given app. Apk2vec is an integration technology that draws on the idea of doc2vec to embed an app into a vector. It can encompass information from multiple semantic views, use labels associated with apps, and combine RL and feature hashing to build appsʼ profiles efficiently. The evaluations with more than 42,000 apps demonstrated that apk2vec’s app profiles perform well in malware detection, familial clustering, app clone detection, and app recommendation tasks.

All the methods that we have analysed are vertically compared, as shown in Table 1.

5. Limitations and Future Directions

5.1. Limitations of the Static Machine Learning-Based Detection Method

We have discussed the various algorithms used in past research works. These algorithms perform very well in some respects, but they also have some limitations, mainly as follows:
(1) Lack of standard benchmark datasets: 17 malware datasets are used to verify the practical effect in the 118 papers on static machine learning-based Android malware detection from Jan. 2019 to Nov. 2020; the statistics of these datasets are shown in Figure 4. These datasets only provide malicious samples, so researchers must collect benign samples from several app stores themselves. The lack of standard benchmark datasets makes it challenging to fairly judge which detection method is better or worse.
(2) There is no guarantee that a classifier model trained on existing datasets still detects new malicious applications well. Many algorithms may achieve good detection results on some datasets for a while; however, as time goes by, new malicious samples may no longer be classified well by the previous learning model, and the previously trained model may lead to poor results.
(3) The ability to resist obfuscation and other targeted attacks is generally weak. The article [87] addresses resistance to obfuscation, but it is limited to the package name and the ownerʼs name in the certificate and is ineffective against obfuscation of Smali and native code. Obfuscation attacks conceal many of the original features, causing some static machine learning methods to work poorly.

5.2. Future Directions

According to the papers published in DBLP from Jan. 2016 to Nov. 2020, Android malware detection has always been a hot research direction. As shown in Figure 5, the number of papers on machine learning-based Android malware detection has remained roughly constant in recent years, and detection methods using static features have always held an absolute advantage. It can be concluded that static machine learning detection methods will remain a hot spot in the foreseeable future. At the same time, new detection technologies are continually emerging, and they will be more lightweight, fast, stable, and robust.
(1) The performance of static deep learning methods will reach a higher level. Among the Android malware detection papers based on static machine learning in DBLP from Jan. 2019 to Nov. 2020, the Accuracy metric was used 88 times. As shown in Figure 6, most reported Accuracy values are over 90%. It can be predicted that detection methods will continue to improve in efficiency and speed while maintaining this crucial Accuracy metric.
(2) Detection methods will be more inclined to support large-scale detection. Many algorithms in past research works were evaluated over small datasets. Although excellent performance metrics were achieved, these algorithmsʼ scalability was not verified on large datasets. With the increasing number of applications in the future, there will be a growing need for fast detection methods that support massive numbers of apps.
(3) Detection models need the ability to identify zero-day attacks and new malware. Static detection based on machine learning has excellent classification ability for known malicious applications; however, it easily misreports new and unknown 0-day samples because new malware weakens the known features. Future detection technology will be developed towards improving 0-day detection ability.
(4) The antiattack capabilities of the machine learning models used in Android malware detection will be further enhanced. Many machine learning algorithms are vulnerable to poisoning attacks, spoofing attacks, impersonation attacks, and inversion attacks [94]. In the literature surveyed, no effective protection measures are proposed for these possible attacks, which will be improved in the future.

6. Conclusion

With the continuous growth of Android devices and applications, the security of Android apps has attracted more and more attention. This paper studied the composition of Android apps, analysed the sources of static features, reviewed machine learning-based static detection technology for Android malware, and discussed future development directions. We analysed the algorithm models, core ideas, datasets, and performance metrics of the existing methods through vertical comparison and pointed out their advantages and limitations. Compared with other types of Android malicious application detection technology, the machine learning-based static detection method has advantages in detection comprehensiveness, accuracy, and reduced dependence on experts, although it also has some weaknesses. This paperʼs work may provide Android application security researchers with a reference, helping them quickly grasp the various methods, master the key issues, and understand the development trends of the technology.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant no. 61572513.