Abstract

Android is the most widely used mobile operating system and is responsible for handling a wide variety of data, from simple messages to sensitive banking details. The explosive increase in malware targeting this platform has made it imperative to adopt machine learning approaches for effective malware detection and classification. Since its release in 2008, the Android platform has changed substantially and there has also been a significant increase in the number, complexity, and evolution of malware that target this platform. This rapid evolution quickly renders existing malware datasets out of date and has a degrading impact on machine learning-based detection models. Many studies have been carried out to explore the effectiveness of various machine learning models for Android malware detection. The majority of these studies use datasets compiled using the static or dynamic analysis of malware, but the use of hybrid analysis approaches has not been fully explored. Likewise, the impact of malware evolution has not been fully investigated. Although some of the models have achieved exceptional results, their performance deteriorated for evolving malware and they were also not effective against antidynamic malware. In this paper, we address both these limitations by creating an enhanced subset of the KronoDroid dataset and using it to develop a supervised machine learning model capable of detecting evolving and antidynamic malware. The original KronoDroid dataset contains malware samples from 2008 to 2020, making it effective for the detection of evolving malware and for handling concept drift. Also, the dynamic features are collected by executing the malware on a real device, making it effective for handling antidynamic malware. We create an enhanced subset of this dataset by adding malware category labels with the help of multiple online repositories. Then, we train multiple supervised machine learning models and use the ExtraTree classifier to select the top 50 features. Our results show that the random forest (RF) model has the highest accuracy of 98.03% for malware detection and 87.56% for malware category classification (for 15 malware categories).

1. Introduction

In most parts of the world, almost everyone uses a smartphone to access the Internet. These devices make it very convenient to perform a variety of tasks including sensitive banking transactions and exchanging private messages. They also store personal data such as credit card numbers and passwords.

The compromise of a smartphone and the sensitive data it handles can lead to very serious consequences for users and organisations. In fact, according to the 2023 Global Mobile Threat Report, there has been a dramatic increase in the number of mobile users and devices being targeted by phishing attacks and mobile malware [1]. The main motivation of this research is to help counter the threat of mobile malware by developing an effective and efficient Android malware detector.

Because the operating system of a smartphone is its main component, it must be taken into consideration when developing an effective malware detection solution. According to the latest statistics, Android dominates the global mobile OS scene with 67.56% of the market share [2]. The other OSes have significantly lower shares of the market (iOS has 31.6% and the remaining share of roughly 1% is composed of BlackBerry, Windows, and Symbian). The lower price and open source nature of Android are the main reasons behind this substantial market share. However, the downside of this popularity is that there are more than 400,000 new Android malware variants detected per month [3].

Malware usually enters a smartphone through the installation of an infected app. It can also enter from a third-party Website or through a phishing attack. To minimise the chances of an infection, repositories like Google Play Store implement security checks (such as Google Play Protect) during the publication stage of an app [4, 5]. However, some sophisticated types of malware are able to bypass these checks. There are also a few app repositories and third-party Websites that do not include sufficient security checks during the publication stage.

To evade security checks on an app repository, malware authors use techniques like repackaging. During repackaging, an existing legitimate app is reverse engineered and malicious code is inserted. The modified app is then repackaged and uploaded onto the app repository. Malware authors also use obfuscation, antidynamic techniques, update attacks, mimicry, and reflection attacks to evade detection. During an update attack, malware is embedded into a legitimate app when an update is pushed. In obfuscation, sensitive methods or APIs are disguised in order to thwart analysis of the malware. To frustrate dynamic analysis, antidynamic techniques (that avoid executing malicious actions in the presence of a virtual or emulated environment) are used.

The explosive increase in malware targeting Android devices has made it imperative for researchers to automate the process of malware detection by using machine learning (ML) models. These models are trained on datasets compiled from output generated by the static or dynamic analysis of malicious and benign apps. The models trained with datasets compiled using only static analysis techniques are not effective against behaviour-modifying malware [6]. If dynamic analysis techniques are used, they should be carried out on a real device. Using an emulated environment instead of a real environment would result in a dataset that is limited because malware that use antidynamic techniques would not execute their malicious intentions in an emulated environment [7].

Another complication is the impact of changes in the Android platform and apps on malware detection. The Android ecosystem has evolved to add new functionality, maintain compatibility with new hardware and devices, and incorporate improved security features. One negative consequence of such rapid transformation is the emergence of several forward and backward compatibility problems. The differences in the behaviour and characteristics of Android apps across different versions and years have been explored in [8]. These frequent changes have a significant impact on malware detectors because of a phenomenon known as concept drift [9]. The impact, in terms of performance, was investigated in detail in [10]. The researchers trained five state-of-the-art ML-based Android malware detectors on a dataset with samples from one year and tested them on samples from the following year. Their results show that the performance of all detectors deteriorated significantly (30%–90% in some cases) within the span of that single year.

The performance of malware detection models is also affected by the evolution of Android malware. A malware detector can detect malware from different periods only if it is trained on a dataset that contains samples spanning as many years as possible. While some datasets have increased their size (to a few hundred thousand samples), the majority have failed to even include a time frame for their malware samples. One unique dataset that includes a time frame for 17,697 malware samples from 2010 to 2019 is introduced in [11]. However, this dataset has two important limitations: it only includes certain types of dynamic features (run-time traces of system calls) and it has very few samples for some years [11].

It is clear that an effective and efficient malware detector requires the hybrid analysis of samples and the compilation of a balanced dataset with samples from the largest possible time span. Also, the dynamic features should be captured by executing the sample on a real device instead of an emulator or virtual environment. The Android malware detector developed in this research is trained on a very comprehensive dataset that meets all of the abovementioned requirements. Specifically, it is trained on a subset of the KronoDroid [12] dataset for malware detection and category classification.

The main contributions of this paper are:

(1) The creation of an enhanced subset of the KronoDroid dataset. This dataset is improved by adding a category label for each malware sample. The KronoDroid dataset was selected because:
(i) it is recent (2021);
(ii) it contains malware samples spanning almost all the years of the Android platform (2008–2020);
(iii) it has both static and dynamic features, with the dynamic features captured by running the samples on a real device.
These characteristics make our model effective for the detection of evolving and antidynamic malware.

(2) An effective and efficient supervised ML model for malware detection and category classification. Initially, four supervised ML models were trained using the enhanced subset of the KronoDroid dataset mentioned in contribution 1. The performance of these models was evaluated and the model with the best performance was selected. To minimise computational cost, only traditional ML algorithms were used and only the top 50 features were selected for training.

The rest of this paper is organised as follows. Section 2 discusses related work and Section 3 presents the research methodology. In Section 4, the results are analysed and discussed. A comparison of the developed malware detector with existing malware detectors is also included in this section. Section 5 provides the conclusion and future work.

2. Related Work

2.1. Approaches to Malware Detection

There are three main approaches for malware analysis. The static approach extracts features such as permissions, intent filters, strings, services, activities, Dalvik opcodes, and metadata from an app without executing it. The dynamic approach extracts features from an app by running it in an emulated environment or on a real device. Examples of features extracted during dynamic analysis include system calls, binder calls, API calls, network usage, and memory usage. The hybrid approach combines the use of static and dynamic analysis.

After analysis, the collected features are compiled into a dataset that is used to train and test a ML model. The rest of this section reviews existing research on ML-based malware detection and category classification.

2.2. Review of Static Analysis Techniques for Malware Detection

In [13], a model based on the CatBoost algorithm was developed for malware detection and family classification using static features (permissions and intents). For benign apps, the Drebin dataset was used, but for malicious apps, a new dataset was compiled. The developed model had an accuracy of 97.40% for malware detection and 97.38% for family classification. The limitations of this model include its failure to detect some advanced evolving malware and its low detection rate for the Linux-Looter ransomware family.

In [14], a malware detection and categorisation model based on deep image learning was proposed. This model used static features such as activities, services, broadcast receivers/providers, intent actions, permissions, and metadata. For feature selection, the ExtraTree classifier was used. The main contribution of this work was the generation of a large dataset (i.e., CCS-CICAndMal2020) with 200,000 malicious apps from 12 malware categories and 191 malware families. The reported results include a detection accuracy of 93%. However, like [13], this work also relied on static features only.

In [15], pairs of permissions were extracted from the manifest file of malicious and benign apps and used to construct a graph. The graph was used to train a ML model. The reported detection accuracy was 95.44% but any malware that did not utilise permissions was not detected.

In [16], permissions and APIs (along with their relationships) were extracted for around 30,000 benign apps and 15,000 malicious apps. The resulting dataset, though comprehensive, was imbalanced and limited to malware until 2015.

A probabilistic discriminative model was developed in [17] to detect malicious samples based on decompiled source code and permission data. The dataset contained almost 11,000 samples (9% malicious and 91% benign), some of which were very old. The downside of this model was its inability to detect malware that used byte code encryption or obfuscation.

2.3. Review of Dynamic Analysis Techniques for Malware Detection

In [7], a dynamic analysis technique called Entropylyzer was developed. This work analysed the behaviour of malware samples by running them in an emulator. Malware from 12 categories and 191 families was analysed using six classes of features, and the extracted data were used to compile the CCS-CICAndMal2020 dataset. Shannon’s entropy was employed for feature ranking and different ML models were trained. The overall accuracy reported for malware category classification and family identification was high. However, it is important to note that during dynamic analysis, some malware failed to run because it detected the presence of a virtual environment. This limitation can only be overcome if dynamic analysis is carried out on a real device.

Similarly, the authors in [18] also relied on an emulated dynamic malware analysis platform and used multiple types of features (e.g., system calls, binder calls, and composite behaviours). The proposed semisupervised deep neural network approach for malware category classification also achieved good results. Like [7], they state that some malware samples did not execute after detecting the presence of an emulated environment.

Droidbox [19], DroidMat [20], and AMAT [21] were developed after conducting dynamic analysis in an emulator or sandbox. The feature types obtained included permissions, intents, and API calls. Despite the excellent performance of these models, the studies failed to analyse malware with antiemulation mechanisms. In other words, their datasets did not include all the features that represent the true nature of a sample. It should be noted that the ability to detect the use of an emulator is not limited to malware. Other apps (e.g., Telephony Manager) are also able to detect the presence of an emulator using Android API methods like TelephonyManager.getDeviceId() [22].

2.4. Review of Hybrid Analysis Techniques for Malware Detection

The need to conduct dynamic analysis on a real device is also highlighted in the few malware detection studies that adopt a hybrid analysis approach. In [23], a malware classification technique based on pseudolabel stacked autoencoders is proposed. Malware from 5 categories was executed in a virtual machine introspection (VMI)-based system. The proposed model detected and classified malware with an accuracy of 98.28%. Although this study uses an emulated environment, it provides evidence that the hybrid analysis approach provides higher accuracy than static analysis or dynamic analysis alone. In [24, 25], the use of hybrid analysis is further supported by its use for the effective identification of resource misuse and malware detection.

2.5. Impact of Changes in Android Ecosystem on Malware Detection

The evolution of the Android platform and its app structure has been explored in several studies along with its impact on the detection of Android malware. Some of these studies also investigate how the evolution of Android malware affects the performance of malware detectors.

In [26], a toolkit was used to mine different app platforms and user characteristics with the aim of studying the health and sustainable security of different apps. Even though this study provides significant insight into the evolution of the Android platform, the approach used has its limitations in terms of practicality and feasibility. It requires continuous data mining, crawling, and autoupdating to achieve durable results.

In [27], the static and dynamic characterization of Android apps developed between the years 2010 and 2017 was carried out. The authors present important findings along with recommendations related to app structure, behaviour, and evolution. However, because the researchers only analysed benign apps, the recommendations cannot be generalized to malicious apps, which are very different from benign apps in terms of their metrics and behaviour.

In [28], a self-evolving detection solution, called Droidevolver, was developed (using a dataset consisting of API calls). The dataset consists of 34,722 malicious samples from 2011 to 2016 and the final model achieved an F-measure of 95.27%. This score decreased by 1.06% per year (on average) until it reached 89.97% in 2016. Because Droidevolver is based on static analysis, it inherits the limitations of this approach. Furthermore, the most recent samples in its dataset are more than 6 years old.

In [29], the evolution of malware was explored using the static analysis of code fragments (via a technique known as differential analysis). Most importantly, this work revealed that several malware characteristics are altered over time in order to evade detection. Because this study is based solely on static analysis, it inherits the limitations of this approach. It highlights the importance of including dynamic analysis and applying carefully crafted feature engineering processes to a balanced and recent dataset to effectively detect evolving malware.

In [30], the dynamic evolutionary behaviour of benign and malicious Android apps was evaluated at the code-execution level. The dataset used consists of 15,451 benign samples and 15,183 malware samples from 2010 to 2017. The main contribution of this research is the uncovering of several important metrics that can help differentiate between benign and malicious apps. These metrics could be used to develop a durable malware detection solution, but its effectiveness would depend on the stability of certain patterns across different versions of evolving malware. Malware patterns that are not represented by these metrics will not be detected by such a solution. In addition, the solution would be subject to the limitation of dynamic analysis being conducted in an emulated environment.

To summarise, it is clear that hybrid analysis is the most promising approach for ML-based malware detection and category classification. It is also evident that dynamic analysis should be conducted on a real device to capture features from malware capable of modifying its behaviour in the presence of an emulator. Furthermore, to be effective, a ML model should be trained using a dataset that includes malware dating back to the start of the Android platform. This is important for the detection of malware that evolves over time. Formally, concept drift refers to the change in the relationship between the input variable(s) and the target variable over time. This change has a negative impact on the accuracy of a trained model (as highlighted by Droidevolver), and because malware evolves very rapidly, it is important for a dataset to be extensive and to include both old and new samples. It is for these reasons that the malware detection and category classification solution developed in this paper uses the recent, comprehensive, and hybrid analysis-based KronoDroid dataset.

3. Methodology

The steps in the methodology are presented in Figure 1.

3.1. Acquisition of Dataset

Android apps use a variety of features or attributes for performing different actions. It is crucial to select the right set of features, and to this end, several Android datasets have been published. These datasets differ in the number of samples, the type of attributes, and the date of malware publication and collection. In this research, we use a subset of the KronoDroid dataset published in [31]. This dataset consists of 41,382 malicious apps and 36,755 benign apps. It also includes samples from almost the entire history of Android, starting from 2008 and ending in 2020. This is the first hybrid dataset to take the time variable into account in such detail. It contains a total of 200 static features and 289 dynamic features, with the dynamic features extracted by running each app on a real device. These features are mostly numeric (e.g., the number of times a system call is invoked by an app or the number of permissions requested by an app). Also, each sample in the dataset is given two labels. The first label identifies the sample as either benign or malware and is relevant for malware detection.

Because there are only two possible classes, malware detection is a binary classification task. The second label refers to the name of the malware family. Because there are more than two possible malware families, detecting the family is a multiclass classification task. Similarly, detecting the category to which a malicious app belongs is also a multiclass classification task. It should be noted that the original KronoDroid dataset does not contain a label for the category of malware.

3.2. Data Preprocessing

In the exploratory data analysis (EDA) step, the data are examined to understand their structure and important attributes. We used Python libraries to determine the relative weight of different features and to check for missing values, null values, and outliers. The KronoDroid dataset includes a few non-numerical features such as hash values and date of publication. These non-numerical features have no informative value and do not have any impact on the output. This was verified through feature selection as explained in the next section.

In the Data Cleaning and Data Integration stage, a Pandas DataFrame was used to remove unneeded columns (e.g., the non-numerical features), merge the data, and finally generate a clean version of the dataset [32].
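As an illustration, the following minimal Pandas sketch shows how such a cleaning step could look. The file name and column names are hypothetical placeholders rather than the actual KronoDroid identifiers, and the class label is assumed to be stored as a numeric column.

    import pandas as pd

    # Load the raw dataset (the file name is a placeholder).
    df = pd.read_csv("kronodroid_real_device.csv")

    # Drop non-numerical columns such as hash values and publication dates
    # (the column names here are illustrative).
    df = df.drop(columns=["sha256", "first_seen"], errors="ignore")

    # Retain only the numeric feature columns.
    df = df.select_dtypes(include="number")

    # Remove rows with missing values and duplicate rows.
    df = df.dropna().drop_duplicates()

    df.to_csv("kronodroid_clean.csv", index=False)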

In the Data Labelling step, each malicious sample was labelled with the name of its malware category. The malware category label was obtained from the VirusTotal [33], F-Secure [34], and FortiGuard [35] online repositories. Any discrepancy (e.g., malware labelled as Riskware by one repository and Adware by another repository) was resolved by cross-checking the label given to the sample in other datasets. If a sample was not found in any other dataset, it was given the label assigned to it by the largest number of repositories. In this way, up to 70% of the samples were labelled. The remaining 30% of the samples were not labelled and were removed from the dataset; their labelling is left for future work. The final modified and improved version of the dataset has been made publicly available on GitHub [36] for the research community. Table 1 presents the number of malware samples in each of the 14 categories (+1 unknown/blank category).
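The fallback majority-vote rule described above can be illustrated with the following minimal sketch; the category names and helper function are examples only and simplify the full cross-checking procedure.

    from collections import Counter

    def assign_category(labels):
        # `labels` holds the category reported by each repository for one
        # sample, e.g., ["Riskware", "Adware", "Riskware"]; empty entries
        # are ignored.
        labels = [l for l in labels if l]
        if not labels:
            return None  # the sample stays unlabelled and is later removed
        # Return the category reported by the largest number of repositories.
        return Counter(labels).most_common(1)[0][0]

    # Labels gathered from, e.g., VirusTotal, F-Secure, and FortiGuard.
    print(assign_category(["Riskware", "Adware", "Riskware"]))  # Riskware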

In the final Data Normalization/Scaling stage, the MinMax scaling technique was applied [37]. This technique is known to provide good results [38].
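A minimal sketch of this scaling step with scikit-learn is shown below; the file and label column names are assumptions carried over from the preprocessing sketch above.

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    df = pd.read_csv("kronodroid_clean.csv")  # cleaned dataset from the previous step
    X = df.drop(columns=["malware"])          # "malware" is an assumed label column name
    y = df["malware"]

    # MinMax scaling maps each feature to [0, 1]: x' = (x - min) / (max - min).
    X_scaled = MinMaxScaler().fit_transform(X)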

3.3. Feature Selection

Feature selection techniques can be divided into three main types: filter methods, wrapper methods, and embedded methods. Wrapper and embedded methods are computationally expensive and not recommended for large datasets with high dimensionality [39]. From the different methods available, the ExtraTree classifier and the mutual information algorithm were selected because of their effectiveness and relevance for classification tasks [14, 40]. Extra Trees builds multiple decision trees in parallel, where each tree is constructed using a random subset of the features and data points. During the tree construction process, it randomly selects feature splits, making it less prone to overfitting. By evaluating the importance of features across multiple trees, Extra Trees implicitly ranks the features based on their contribution to reducing impurity (Gini impurity for classification) and making accurate predictions. Mutual information quantifies the relationship between individual features (independent variables) and the target variable (dependent variable), helping to assess the relevance of each feature for predicting the target variable.
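The following sketch illustrates how the top 50 features could be ranked with both techniques in scikit-learn. It continues the scaling sketch above (reusing X_scaled and y), and the parameter values shown are illustrative assumptions, not the exact settings used in this study.

    import numpy as np
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.feature_selection import mutual_info_classif

    # Rank features by the impurity-based importances of an Extra Trees ensemble.
    et = ExtraTreesClassifier(n_estimators=100, random_state=42)
    et.fit(X_scaled, y)
    top50_et = np.argsort(et.feature_importances_)[::-1][:50]

    # Alternatively, rank features by mutual information with the target variable.
    mi = mutual_info_classif(X_scaled, y, random_state=42)
    top50_mi = np.argsort(mi)[::-1][:50]

    X_top50 = X_scaled[:, top50_et]  # e.g., the Top50 ExtraTree feature set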

We tried many different feature sets, starting with the top 5, top 10, and top 20 features; finally, the 12 feature sets with the best classification results were selected (see Table 2).

3.4. Model Training

We selected the following four supervised ML algorithms for this study: random forest (RF), decision tree (DT), K-nearest neighbour (KNN), and support vector machine (SVM). The dataset was split into two parts: 70% for training and 30% for testing. Each algorithm was trained on each of the 12 feature sets shown in Table 2.
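A minimal sketch of this training setup, continuing the earlier sketches (reusing X_top50 and y), is shown below; the models use scikit-learn defaults here, which were later replaced through hyperparameter tuning (Section 4.2).

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    # 70/30 train/test split, as described above.
    X_train, X_test, y_train, y_test = train_test_split(
        X_top50, y, test_size=0.3, random_state=42)

    models = {
        "RF": RandomForestClassifier(),
        "DT": DecisionTreeClassifier(),
        "KNN": KNeighborsClassifier(),
        "SVM": SVC(),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name, model.score(X_test, y_test))  # test-set accuracy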

3.5. Model Evaluation and Tuning

In this step, the performance of each model was evaluated using the metrics accuracy, precision, F1-score, and recall. The results are discussed in the next section.
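Continuing the sketch above, these metrics can be computed for a fitted model with scikit-learn; this is a sketch, not the exact evaluation script used in this study.

    from sklearn.metrics import classification_report

    # Per-class precision, recall, and F1-score (plus overall accuracy).
    print(classification_report(y_test, model.predict(X_test)))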

4. Results

4.1. Initial Results

For malware detection, the accuracy of each model is shown in Table 3. The Top50 MinMax ExtraTree classifier feature set has the highest accuracy (97.98%), achieved using RF. For malware category classification, the accuracy of each model is shown in Table 4. Again, the Top50 MinMax ExtraTree classifier feature set has the highest accuracy (87.24%), achieved using RF. In both tables, the values in bold are the highest values for each model. The Top50 features include dynamic features (e.g., system calls) and static features (e.g., permissions and intents).

The next highest accuracy for malware detection was obtained using the Top100 MinMax ExtraTree classifier feature set and RF. For malware category classification, it was the Top50 ExtraTree classifier feature set and RF.

From Tables 3 and 4, we can also note that the feature sets with the top 30 features have the lowest accuracy. Feature sets with the top 100 features have higher accuracy than those with the top 30 but lower accuracy than those with the top 50. This is because the top 30 feature sets are missing some important features that have an impact on the output. These features are included in the top 50 feature sets, which is the reason for their higher accuracy.

When top 100 feature sets are used (instead of the top 50 feature set), some features that have a low impact on the output are included. These features add noise and complexity to the classifier thereby reducing the accuracy. The heatmap in Figure 2 shows the importance of some top features for malware detection.

4.2. Model Tuning

The results of the RF model can be improved using hyperparameter tuning. Hyperparameters are classifier-specific parameters that control the learning process and are set before a model is trained. Initially, default parameters were used. Then, hyperparameter tuning was carried out using GridSearchCV [41], a utility in the scikit-learn Python library that automates the selection of the best parameters for a ML algorithm. There are also other techniques like random search, but GridSearchCV exhaustively evaluates every combination in the specified parameter grid. The best hyperparameters selected by GridSearchCV for the four models are as follows (a sketch of the search for RF is given after the list):

(1) RF: bootstrap = True, max_depth = 300, max_features = “log2”
(2) DT: criterion = “gini”, max_depth = 19
(3) KNN: n_neighbors = 1
(4) SVM: C = 20, kernel = “rbf”
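A minimal sketch of the grid search for the RF model is shown below; the grid includes the reported best values, while the other candidate values are illustrative assumptions.

    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestClassifier

    # Candidate values: the reported best values plus illustrative alternatives.
    param_grid = {
        "bootstrap": [True, False],
        "max_depth": [100, 200, 300],
        "max_features": ["sqrt", "log2"],
    }
    search = GridSearchCV(RandomForestClassifier(), param_grid,
                          cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    print(search.best_params_)
    # e.g., {'bootstrap': True, 'max_depth': 300, 'max_features': 'log2'}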

The results for malware detection and category classification after hyperparameter tuning are summarised in Table 5. It can be clearly seen that RF still provides the best performance for both tasks. Its values are highlighted in bold text. Specifically, the accuracy of malware detection has increased to 98.03% and the accuracy of malware category classification has increased to 87.56%. The confusion matrix for RF is presented in Figure 3. The false positive rate and false negative rate are less than 0.1%.

4.3. Validation

The abovementioned models are validated using 5-fold cross validation for malware detection and malware category classification. This validation process splits the dataset into 5 folds. In each iteration, 4 folds are used for training and the remaining fold is used for testing. Training the model on different chunks of the dataset and testing it on the remaining chunk validates the model's performance while guarding against overfitting and underfitting. The results are summarised in Table 6 and clearly demonstrate the validity of the models.
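A minimal sketch of this validation step, reusing the tuned RF hyperparameters and the earlier feature matrix (X_top50, y), is shown below.

    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestClassifier

    # Evaluate the tuned RF model with 5-fold cross validation.
    rf = RandomForestClassifier(bootstrap=True, max_depth=300,
                                max_features="log2")
    scores = cross_val_score(rf, X_top50, y, cv=5, scoring="accuracy")
    print(scores.mean(), scores.std())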

5. Discussion

Multiple ML and DL solutions have been proposed and they are briefly discussed here for the purpose of comparison. The comparison is also summarised in Table 7, where the values in bold emphasise aspects of this research that are important in comparison to existing research.

The selection of a high-quality dataset is essential for the effectiveness of any proposed solution, especially for handling concept drift. The sustainability of the performance (across years) of five different ML-based malware detectors was evaluated in [47]. The results show a drastic decrease in the performance of all malware detectors including DroidSieve [45]. DroidSieve had good detection and family classification results (99.82% and 99.26%, respectively), but after its 7-year evaluation period, it had the lowest sustainable performance (with an accuracy of 34.59%).

The popularity of emulators for dynamic analysis is also a notable trend and more than half of the studies shown in Table 7 use an emulator. In [46], a dynamic malware classification technique called Droidcats was developed. The dynamic features were captured using an emulator. As pointed out by several studies, antidynamic malware can detect an emulated environment and choose not to execute malicious behaviours or intents. Therefore, any dataset compiled using an emulator will not have a comprehensive profile of such sophisticated malware. Despite this limitation, the technique achieved good results with an accuracy of 97.4% for malware detection and 97.8% for malware categorisation. Another limitation of this research is that the malware samples used are more than 5 years old (samples date from 2009 to 2017).

DL-Droid [44] is one of the few models trained using data extracted from dynamic analysis carried out on a real device. The dataset used in this research consists of 30,000 samples and the developed DL model reported an accuracy of 98.5% for malware detection.

Interestingly, their list of top 20 hybrid features contains features that are also part of our top 50 feature set, e.g., RECEIVE_SMS, RECEIVE_BOOT_COMPLETED, SEND_SMS, and READ_PHONE_STATE. This suggests that such features persist despite malware evolution, making hybrid features well suited for malware detection.

One limitation of this research is that it does not take into account the timeframe of malware samples and is, therefore, not effective for detection of malware that evolves its behaviour with time.

Feature selection also plays a key role in training machine learning classifiers. In [48], the developed malware detection model achieved an accuracy of 96.3% when trained using only 27 features (permissions) selected by a regression-based feature selection technique. The Android malware detection solution developed in [49] achieved an F-measure of 95% with a random forest classifier using only 20 features (permissions). These features were selected using a filter-based feature selection technique. In another study [50], BFEDroid was proposed. The researchers attained an F-measure of 99% using a new embedded feature selection-based detection technique (BFEDroid) and LSSVM (with a radial basis function kernel and principal component analysis). Their solution used a dataset compiled from the Mendeley repository that consists of 5,000 subsets. It utilises dynamic features such as permissions and API calls. In a similar study, FSDroid, a supervised machine learning malware detection model, was developed by implementing LSSVM (least square support vector machine) with an RBF (radial basis function) kernel [51]. The features were extracted by performing dynamic analysis on an emulator. This model achieved a 98.8% detection rate using the RSA (rough set analysis) feature subset selection technique and PCA (principal component analysis) for feature ranking. Lastly, in [52], a supervised machine learning malware detection model (called DroidRL) that uses a wrapper-based feature selection method was developed. The model achieved 95.6% accuracy for malware detection using the random forest classifier and a reduced subset of only 24 features from a dataset of 5,000 benign and 5,560 malware samples. None of these five studies uses a hybrid approach, which makes it difficult to compare their detection rates with ours. The only recent study to use a hybrid approach is [53], which uses random forest (with a chi-squared feature selection method) to achieve its best detection rate of 95%. Overall, these studies demonstrate the significant role of suitable feature selection techniques in malware detection models.

The authors conclude that the following are essential requirements for the development of an effective ML-based malware detector [44]:
(i) a balanced dataset (with old and new samples),
(ii) dynamic analysis on real devices, and
(iii) carefully crafted feature sets.

A similar conclusion is reached in [29]. To meet the abovementioned requirements, the model developed in this paper uses the KronoDroid [31] dataset. This dataset is almost balanced (41,382 malware samples and 36,755 benign samples) and contains both dynamic and static features. The dynamic features were extracted using dynamic analysis conducted on both an emulated setup and a real device. The use of real devices means that antidynamic malware was successfully analysed. The dataset also includes samples from 2008 to 2020, making it effective against malware that changes its behaviour over time (concept drift). KronoDroid’s hybrid features and the large time span of its samples make it effective for the detection of evolving malware while requiring minimal retraining. Furthermore, using only 50 hybrid features lowers the computational cost of retraining the model.

To the best of our knowledge, no other malware detection solution has combined and achieved the following:
(i) the ability to detect malware that changes its behaviour when run in an emulated environment,
(ii) handling concept drift for the detection of malware that evolves with time, and
(iii) an accuracy of 98.03% for malware detection and 87.56% for malware category classification.

6. Conclusion

In this research, a malware detection and category classification model for advanced and evolving Android malware is developed. The model uses supervised ML and is trained using an enhanced subset of the KronoDroid dataset. The KronoDroid dataset includes malware samples from the entire history of Android and is ideal for a model that can handle concept drift. It also contains features extracted from both the static and dynamic analysis of malware. In addition, dynamic analysis is conducted by running the malware on a real device. The trained model is, therefore, effective for the detection of antidynamic malware capable of bypassing or modifying its behaviour in an emulated environment.

One shortcoming of the KronoDroid dataset is that it does not include labels for malware categories. We added malware category labels to a subset of this dataset with the help of multiple online antimalware repositories. This enhanced dataset was used to train random forest (RF), decision tree (DT), K-nearest neighbour (KNN), and support vector machine (SVM) classifiers. Also, the ExtraTree classifier and mutual information algorithms were used for feature selection. The performance of the trained models was evaluated using metrics such as accuracy, precision, F1-score, and recall.

The results show that the highest accuracy was obtained using RF (with the Top50 MinMax features selected using the ExtraTree classifier) for both malware detection (98.03%) and malware category classification (87.56%). MinMax scaling normalises each feature to a common range so that features with large numeric ranges do not dominate the model. Initially, multiple models were trained using different subsets of top features such as the top 30, top 50, and top 100. Because the top 50 provided the best results, it was validated using 5-fold cross validation. The selection of the optimal number of top features not only enhanced the results but also reduced the computational overhead. Compared to existing solutions, this makes our proposed model more suitable for adoption in a production environment.

To summarise, the main contributions of this paper are as follows:
(i) A subset of the KronoDroid dataset enhanced by adding malware category labels.
(ii) An effective supervised ML model (RF with the Top50 MinMax ExtraTree classifier features) with 98.03% accuracy for malware detection and 87.56% accuracy for malware category classification (for 15 categories).
(iii) A comparison with related work that shows the closest work to our RF model [38] has 4.63% lower accuracy for malware detection. For malware category classification, it only includes four categories compared to our 15 categories. Their model achieves 4.94% higher accuracy for static detection but 7.26% lower accuracy for dynamic detection. The comparison also shows that DL-based solutions like [44] have almost the same accuracy, but our proposed solution has the following advantages:
(1) It is effective for detecting antidynamic and evolving malware because the dataset includes hybrid features and malware samples from almost the entire timeline of the Android platform (2008–2020).
(2) It is effective for malware category classification because it uses a more comprehensive and reliable dataset with 15 categories.

In the future, our enhanced dataset could be improved by adding more samples to ensure each malware category is balanced (currently, the malware categories have unequal numbers of samples, as shown in Table 1). This should enhance the accuracy of the model for malware category classification. Also, additional steps could be taken to validate the malware category labels. Around 30% of the malware samples were assigned different labels by different repositories (e.g., Trojan by one antimalware repository and Riskware by another repository). We believe that the accuracy of malware category classification could be improved through this validation process.

Although our detection model does its best to minimise the retraining requirement and computational cost (by using samples from the longest possible time span and including hybrid features for the detection of evolving and emerging malware), its effectiveness, to some extent, still depends on the stability of patterns in the future. Therefore, the timeframe for retraining and the stability of the model with respect to evolving patterns should also be investigated.

Data Availability

The malware dataset (KronoDroid) used in this research work is from previously reported research, which has been cited. The modified and improved version of the data is publicly available on GitHub (semw/kronodroid_improved_hybrid_detection_v2) in the form of CSV files.

Disclosure

The authors conducted this research while affiliated with the School of Electrical Engineering and Computer Science (SEECS) at the National University of Sciences and Technology (NUST). This work was completed as part of a Master’s (MS) thesis at the National University of Sciences and Technology (NUST) and is not part of any funded research project.

Conflicts of Interest

The authors declare that they have no conflicts of interest.