Abstract
With the rapid development of smartphone technology and mobile applications, the mobile phone has become the most powerful tool for accessing the Internet and obtaining various services with one click. Meanwhile, application vulnerabilities are the primary hazard to the security of Android devices. By exploiting these weaknesses, an attacker can easily steal confidential data from the mobile phone. A malware application performs fraudulent activities on the mobile phone automatically, without the user's knowledge. Thus, these attacks are major threats to the security of mobile phones. To detect malicious applications installed on Android smartphones, we conducted a study that focuses on permission- and intent-based mechanisms. The study was done in three phases: in the first phase, the dataset was created by extracting intents and permissions from APK files; in the second phase, correlation-based feature selection (CFS) and best first search (BFS) were combined to select the most representative features from the feature space of the extracted dataset; and in the third phase, machine learning (ML) techniques were trained and tested against the preprocessed dataset obtained in the second phase. The accuracy, precision, recall, F1 score, and error metrics of seven machine learning techniques (REPTree, Rule PART, RF, SMO, SGD, MCC, and LMT) were evaluated on the Android dataset.
1. Introduction
Nowadays, mobile devices have become an essential part of people’s lives. According to Statista [1], the number of smartphone users worldwide is around 6.4 billion and is likely to increase by several hundred million in the next few years. The Google Play Store is the largest app store in the world, and it had 3.48 million apps as of the first quarter of 2021. Most daily activities such as online shopping, bill payment, and mobile banking are done through mobile applications. This has increased the risk of theft of confidential information such as bank details, credit and debit card data, and ATM PINs. Cybercriminals remain active in attacking various mobile payment systems, including brute force attacks to obtain users’ PINs, account manipulation, and redirecting fake traffic to mobile money servers [2]. Future mobile devices will support many applications, including AI-powered health care, mobile edge computing, and innovative industry applications; therefore, future smartphones must be equipped with the latest high-level security mechanisms [3].
Every app requests a list of permissions from the user at installation time. The requested permissions allow a user to infer the behavior of an application. Identifying the key permissions for an app's functionality and the permission requests that are expected helps flag unusual application behavior and provides a simple risk warning for users. In this way, the permission-based mechanism warns the user before installation. It gives the user the option to understand the risk of granting the application’s permissions on the user’s mobile. For example, if an app is allowed to use the Internet, then that app can access the Internet through your mobile and consume your daily data. However, due to a lack of knowledge, users cannot understand and recognize the permissions required by each application and often ignore pre-installation warnings, so this mechanism does not succeed in protecting the user from malware. According to Statista, 482,579 new malware samples are added monthly. Therefore, it is necessary to detect malware on mobile devices. If we can detect malware during installation, we can stop the installation of the malicious app. To combat this severe threat, we need a scalable malware detection approach that can detect malware applications effectively and efficiently. However, scaling the detection to a large number of apps is a challenging task. Finding a way to identify malware is a major problem that needs to be resolved immediately. This paper introduces a static analysis-based malware detection system to protect against Android malware.
The contributions of this research are as follows:
(i) This research creates an intent- and permission-based feature set using the APKTool and AAPT2 tools.
(ii) A new framework for identifying the optimal subset combines the CFS and Best First search techniques.
(iii) A total of 3000 app samples were evaluated using seven machine learning (ML) techniques.
(iv) The proposed method can detect malware in real-world apps in a shorter amount of time.
The remaining sections of the paper are arranged as follows: Section 2 reviews previous work on Android malware detection and its limitations. Section 3 describes the different machine learning techniques applied. Section 4 contains the proposed methodology. Section 5 describes the experimental setup and the performance evaluation metrics. Section 6 presents the obtained results. Section 7 concludes the work, followed by references.
2. Related Works
The explosive growth of Android malware, as well as its destructive nature, motivates researchers to conduct malware analysis. Malware analysis is classified into two types: signature-based analysis and behavior-based analysis. To identify malware, the signature-based detection method compares the binary code of an app with the binary code of known malware. As a result, this method requires massive databases of known malware. Signature-based techniques are simple, efficient, and accurate, but they cannot identify unknown malware [4].
On the other hand, behavior-based detection compares the behavior patterns of applications with the behavior patterns of known malware [5]. The behavior-based method is commonly used to detect unknown malware, but it yields a high false alarm rate. Typically, malware analysis is done in three ways: static, dynamic, and hybrid. Static analysis tests the application without actually running it in a real environment. In the dynamic approach, the application is tested by running it in a specific environment [6]. The hybrid approach combines the characteristics of the static and dynamic methods. This section reviews some of the efforts made in these three categories.
2.1. Static Analysis
Lou et al. [7] developed the TFDroid method to detect malware applications. They used topics and sensitive data flows as a feature set. Using machine learning (ML) techniques, they achieved a classification accuracy of 93.7%. However, this approach suffers from a high false alarm rate. In another paper, Taheri et al. [8] proposed a Hamming distance-based approach to computing similarity scores between malware and benign ware. They used three datasets for experimental purposes and reported an accuracy of 90% using API call features. This method also suffers from a high false alarm rate. Wu et al. [9] presented the MVIIDroid framework. They trained multiple kernel learning (MKL) classifiers, compared the performance with other methods such as RF, JS, and SVM, and reported an accuracy of 94.8% for Android family classification and 99% for simple malware detection. Feng et al. [10] proposed a two-layer deep learning (DL) model in which the first layer analyzes the static features and the second layer inspects network flow data for malware detection; the model achieves 99.3% accuracy. Li et al. [11] proposed the SIGPID malware detection approach, which selects significant features from the dataset and achieves 93.62% accuracy using SVM classifiers. This method suffers from high computational complexity. Taheri et al. [12] created the “CICAndMal2017” dataset using network flow and API call features. This approach achieved 95.3% precision in static classification at the first layer and 83.3% in dynamic classification. Santosh et al. [13] proposed a gain ratio method to select relevant features from the dataset and reported 94.2% accuracy by applying ML techniques such as J48, RC, MLP, SMO, and randomizable filters. Alzaylaee et al. [14] proposed DLDroid, a deep learning system for detecting malicious Android applications through dynamic analysis using stateful input generation.
The study shows that the detection rate of DLDroid with dynamic features is 97.8%, and the detection accuracy with dynamic and static attributes is 99.6%, which improves on traditional machine learning techniques. Chen et al. [15] presented Android malware identification based on traffic features using three supervised machine learning methods and reported 95% average accuracy. Jiang et al. [16] proposed a fine-grained dangerous permission (FDP) method that captures the difference between malicious apps and benign apps, performs classification, and achieves a 94.5% TP rate.
2.2. Dynamic Analysis
Dynamic analysis typically involves executing and testing applications in a safe environment and providing the necessary resources to identify malicious activities. Thangaveloo et al. [17] presented a dynamic DATDroid approach with 91.7 percent accuracy, 93.1 percent recall rate, and 90.0 percent accuracy. Ananya et al. [6] presented “sysDroid,” a system call trace-based dynamic approach that uses LR, CART, RF, XGBoost, and DNN-based evaluations to achieve 95 to 99 percent accuracy. Casolare et al. [6] demonstrated a trace-based dynamic analysis system with 89 percent accuracy. Yang et al. [18] presented a dynamic approach named DroidWard using SVM, RF, and DT, which attained an accuracy of 98.49%, a recall rate of 98.54%, and a false-positive rate of 1.55%.
2.3. Hybrid Analysis
Some authors have proposed hybrid methods for malware detection. Amin et al. [19] demonstrated the AndroShield hybrid approach, in which static analysis inspects the code without running the application and dynamic analysis examines the application while it runs. Zhang et al. [20] introduced DAMBA, a hybrid malware detection method using the ORGB analysis method. They extract static and dynamic attributes from apps, propose the TANMAD malware detection algorithm, and report very high accuracy. Ahmed et al. [21] designed StaDART, a hybrid malware detection method for dynamic code update features. In another work [22], Gajrani et al. presented a hybrid analysis approach, EspyDroid+, that incorporates a reflection-guided static slicing (RGSS) method, which helps handle C&C-controlled execution, logic bombs, time bombs, etc. In another study, Ali-Gombe et al. [23] designed a hybrid analysis technique named AspectDroid, in which 100 malware and 100 benign ware apps were evaluated, achieving 94.68% accuracy. Some other essential works and their limitations are summarized in Table 1.
From the brief literature review above and the literature in Table 1, it is clear that previous research had the following limitations: an imbalanced dataset, a long detection time, and a high FPR. To overcome these problems, we created a balanced dataset with 1500 malware and 1500 benign ware samples. Furthermore, to reduce the computational complexity and the detection time, we implemented a lightweight approach that combines CFS and Best First search; it provides distinctive features that help reduce computation time and FPR.
3. Description of Employed ML Techniques
This section describes the seven ML techniques employed to detect malware using the hypothesis that the developed model has minimum detection time, higher accuracy, and lower error rate.
3.1. Random Forest (RF) Classifier
Random forest is a classifier that builds a number of decision trees on different subsets of a given dataset and combines their predictions to improve predictive accuracy. For example, suppose a random forest has six decision trees and three classes, namely, A, B, and C. If three of these trees predict class A, two predict class B, and one predicts class C, then the scores of classes A, B, and C are three, two, and one, respectively. Class A has the highest score among the three classes, so class A is the predicted class. The random forest method determines the relative importance of each attribute, reducing the variance and the possibility of overfitting. It also reduces computational cost and training time [39]. Algorithm 1 shows the pseudocode of the random forest (RF) classifier.
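The voting example above can be sketched in a few lines of Python. This is only an illustration of the majority vote; the function name `majority_vote` and the list of tree outputs are hypothetical and are not taken from Algorithm 1.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the most trees, plus the per-class vote counts."""
    votes = Counter(tree_predictions)
    winner, _ = votes.most_common(1)[0]
    return winner, dict(votes)

# Six hypothetical trees voting over classes A, B, and C, as in the example above.
predictions = ["A", "A", "A", "B", "B", "C"]
predicted_class, scores = majority_vote(predictions)
print(predicted_class, scores)
```

With these votes, class A collects three votes, class B two, and class C one, so class A is returned.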

3.2. Reduced Error Pruning Tree (REPTree) Classifier
REPTree is a fast and straightforward decision/regression tree-based classification approach. It builds the tree using information gain (or variance, for regression) and prunes it using reduced-error pruning (with backfitting). The information gain value is used to create a node in the decision tree. Let D denote the training dataset, D = {X1, X2, X3, …, Xn, Y}, where X1, …, Xn are the attributes and Y is the class label [39]. Algorithm 2 shows the pseudocode of the REPTree classifier.
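To make the role of information gain concrete, the following sketch computes it for two binary attributes. It assumes the standard entropy-based definition; the attribute names and the six toy samples are hypothetical and do not come from the study's dataset.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: 1 means the app requests the permission.
send_sms = [1, 1, 1, 0, 0, 0]
internet = [1, 0, 1, 0, 1, 0]
labels = ["malware", "malware", "malware", "benign", "benign", "benign"]

gain_sms = information_gain(send_sms, labels)
gain_internet = information_gain(internet, labels)
print(gain_sms, gain_internet)
```

In this toy data the first attribute separates the classes perfectly and therefore has the larger gain, so a tree learner would choose it for the node.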

3.3. Rule PART Classifier
The Rule PART method combines a divide-and-conquer strategy with a separate-and-conquer strategy for classification. The quality of classification depends on the coverage function. A simple pseudocode of the Rule PART method is given in Algorithm 3.

3.4. Logistic Model Tree (LMT) Classifier
LMT is a classification tree that deals with binary and multiclass classification. It applies a logistic regression function at the leaves and works with numeric and nominal values. The LMT uses cross-validation to find a number of LogitBoost iterations that does not overfit the training data [40].
3.5. Sequential Minimal Optimization (SMO) Support Vector Classifier (Linear Kernel)
The SMO is a decomposition method in which a problem of multiple variables is decomposed into a series of subproblems, each of which optimizes an objective function. A decomposition method usually optimizes a few variables at a time while the other variables are treated as constants and remain unchanged. SMO solves SVM-QP (Support Vector Machine Quadratic Programming) problems by breaking them down into the smallest possible subproblems, each involving two Lagrange multipliers. These small quadratic programming problems are solved analytically, which avoids a time-consuming numerical quadratic programming optimization as an inner loop. SMO can handle large datasets and uses a heuristic to choose which Lagrange multipliers to optimize. It rapidly solves the SVM-QP problem without the need for additional matrix storage [41].
3.6. Meta Multiclass Classifier (MCC)
The meta multiclass classifier uses a one-against-all heuristic, which divides the multiclass dataset into multiple binary classification problems. It trains one binary classifier per class and predicts the class whose model produces the highest confidence score [42]. The pseudocode for the multiclass classifier is given in Algorithm 4.

Since the dataset used in this study only has binary labels, the multiclass classifier behaves like a binary classifier.
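The one-against-all idea can be sketched as follows. The binary scorer used here (distance to the class centroid versus distance to the centroid of all other samples) is a stand-in chosen only to keep the example short; it is an assumption and not the base classifier used in the study. All names and data points are hypothetical.

```python
def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_one_vs_rest(X, y):
    """Train one binary scorer per class: 'this class' versus 'all other classes'."""
    models = {}
    for cls in sorted(set(y)):
        pos = centroid([x for x, label in zip(X, y) if label == cls])
        neg = centroid([x for x, label in zip(X, y) if label != cls])
        models[cls] = (pos, neg)
    return models

def predict(models, x):
    """Return the class whose binary scorer reports the highest confidence."""
    def confidence(cls):
        pos, neg = models[cls]
        return sq_dist(x, neg) - sq_dist(x, pos)  # larger means closer to 'this class'
    return max(models, key=confidence)

# Hypothetical three-class toy data.
X = [[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]]
y = ["a", "a", "b", "b", "c", "c"]
models = train_one_vs_rest(X, y)
print(predict(models, [0.2, 0.5]), predict(models, [5, 5]), predict(models, [9.5, 0.5]))
```

With only two labels, the same procedure trains two mirrored scorers, which is why the multiclass classifier reduces to a binary classifier on this dataset.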
3.7. Stochastic Gradient Descent (SGD) Classifier
The term “stochastic” refers to a process associated with a random probability, and gradient descent is a popular optimization method used in deep learning and machine learning. The gradient descent algorithm finds the best possible values for the parameters of a given cost function. In each iteration of stochastic gradient descent, a few samples are chosen at random rather than the entire dataset. The term “batch” in gradient descent refers to the number of samples from a dataset that are used to calculate the gradient in each iteration. The gradient is a function's slope; it measures the degree of change of one variable with respect to the changes in another variable. For a convex cost function, gradient descent follows the partial derivatives of the cost with respect to the parameters toward the minimum. Stochastic gradient descent reduces the computational load, particularly in high-dimensional optimization problems, allowing faster iterations in exchange for a lower convergence rate. The stochastic gradient descent (SGD) classifier works as shown in Algorithm 5.
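A minimal sketch of the per-sample update described above is given below. It assumes a logistic (log loss) linear model, a fixed learning rate, and a toy dataset of binary feature vectors; these choices are illustrative assumptions and are not taken from Algorithm 5 or from the WEKA configuration used in the experiments.

```python
import math
import random

def sgd_logistic(X, y, lr=0.1, epochs=200, seed=0):
    """Fit a linear logistic model with one randomly ordered sample per update."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)                      # the "stochastic" part
        for i in order:
            z = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            p = 1.0 / (1.0 + math.exp(-z))      # predicted probability of class 1
            err = p - y[i]                      # gradient of the log loss w.r.t. z
            w = [wj - lr * err * xj for wj, xj in zip(w, X[i])]
            b -= lr * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

# Hypothetical binary feature vectors (1 = feature present); label 1 = malware.
X = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 1, 1], [0, 0, 0]]
y = [1, 1, 1, 0, 0, 0]
w, b = sgd_logistic(X, y)
print([predict(w, b, x) for x in X])
```

Each update touches a single sample, so the cost per step is independent of the dataset size; the price is a noisier path toward the minimum.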

4. Proposed Methodology
This section explains the overall proposed methodology. It is divided into three subsections: the first describes the proposed architecture, the second deals with feature extraction and dataset preparation, and the third explains how the best features are chosen using the CFS and Best First search (BFS) techniques.
4.1. Proposed Architecture
The overall architecture of the malware detection process is depicted in Figure 1. This architecture comprises a six-step method. The first step deals with the collection of benign and malicious APK (Android Package Kit) files from “CICAndMal2017.” The second step involves using APKTOOL to decompile the APK files and the AAPT2 tool to extract features. The dataset is created in CSV format in the third step, and data preparation is done in the fourth step. Classification is performed in the fifth step using a variety of machine learning algorithms. The outcomes are reviewed using different performance evaluation metrics in the final step.
4.2. Feature Extraction and Dataset Creation
This section describes the feature extraction and dataset preparation process. The Android Package Kit (APK) is the file format used by the Android operating system to distribute and install apps on Android devices. An APK package contains everything needed for an application to be correctly installed and operated on a mobile device. Malware and benign files are downloaded from “CICAndMal2017” [43], and the APK TOOL [44] is used to decompile the APK files to obtain the necessary information. The AAPT2 tool [45] is used for extracting permissions and intents from the AndroidManifest.xml. The process of feature extraction and dataset creation is illustrated in Algorithm 6 and Figure 2.
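As a rough illustration of this step, the sketch below reads permissions and intent actions from a manifest that has already been decoded to plain-text XML (for example, by a decompiler) and turns them into one binary dataset row. It uses Python's standard XML parser rather than AAPT2, so it is a substitute technique, not the study's pipeline; the sample manifest, the function names, and the vocabulary are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace that Android manifests use for attributes such as android:name.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"

def extract_features(manifest_xml):
    """Return the sets of requested permissions and declared intent actions."""
    root = ET.fromstring(manifest_xml)
    permissions = {el.get(ANDROID_NS + "name") for el in root.iter("uses-permission")}
    intents = {el.get(ANDROID_NS + "name") for el in root.iter("action")}
    permissions.discard(None)
    intents.discard(None)
    return permissions, intents

def to_row(permissions, intents, vocabulary):
    """Encode one app as a 0/1 vector over a fixed feature vocabulary."""
    present = permissions | intents
    return [1 if name in present else 0 for name in vocabulary]

# Hypothetical decoded manifest.
SAMPLE = """<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.app">
  <uses-permission android:name="android.permission.INTERNET"/>
  <uses-permission android:name="android.permission.SEND_SMS"/>
  <application>
    <receiver android:name=".BootReceiver">
      <intent-filter>
        <action android:name="android.intent.action.BOOT_COMPLETED"/>
      </intent-filter>
    </receiver>
  </application>
</manifest>"""

VOCABULARY = [
    "android.permission.INTERNET",
    "android.permission.READ_CONTACTS",
    "android.permission.SEND_SMS",
    "android.intent.action.BOOT_COMPLETED",
]

perms, intents = extract_features(SAMPLE)
row = to_row(perms, intents, VOCABULARY)
print(row)
```

Repeating this for every app and appending the class label to each row yields a table that can be written out in CSV format.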
4.3. Optimal Subset Selection
The combination of correlation-based feature selection (CFS) and Best First techniques is used to select an optimal subset of features from the dataset. Combining both methods selects 78 out of 300 attributes from the dataset.
4.3.1. CFS Subset Evaluation
Correlation-based feature selection assesses the value of a subset of attributes by considering each feature’s predictive ability and the degree of redundancy between them.
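The trade-off between predictive ability and redundancy is commonly expressed by the CFS merit of a subset of k features, k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-class correlation and r_ff is the mean feature-feature correlation. The sketch below evaluates this merit on toy data. The use of absolute Pearson correlation as the correlation measure is an assumption made for brevity, and the feature names and values are hypothetical.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb) if va and vb else 0.0

def cfs_merit(columns, labels, subset):
    """CFS merit: rewards correlation with the class, penalizes inter-feature correlation."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(abs(pearson(columns[f], labels)) for f in subset) / k
    pairs = list(combinations(subset, 2))
    r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Hypothetical toy data: READ_SMS duplicates SEND_SMS; INTERNET is uncorrelated with it.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "SEND_SMS": [1, 1, 1, 0, 0, 0, 0, 1],
    "READ_SMS": [1, 1, 1, 0, 0, 0, 0, 1],
    "INTERNET": [1, 1, 0, 1, 1, 0, 0, 0],
}
merit_redundant = cfs_merit(columns, labels, ["SEND_SMS", "READ_SMS"])
merit_complementary = cfs_merit(columns, labels, ["SEND_SMS", "INTERNET"])
print(merit_redundant, merit_complementary)
```

Adding a duplicate feature leaves the merit unchanged, whereas adding an equally predictive but uncorrelated feature raises it, which is the behavior the subset evaluator relies on.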
4.3.2. Best First
Best First searches the space of attribute subsets using greedy hill climbing augmented with a backtracking facility. The amount of backtracking done is controlled by the number of consecutive non-improving nodes allowed. Best First may begin with an empty set of attributes and search forward, begin with the complete set of features and search backward, or begin at any point and search in both directions (by considering all possible single-attribute additions and deletions at a given point). Algorithm 6 shows the steps of creating the feature set (dataset).
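The forward variant of the Best First search described above can be sketched as follows. The subset evaluator is passed in as a function; the `toy_merit` function, its relevance values, and the `max_stale` parameter name are hypothetical stand-ins for the CFS evaluator and the backtracking limit, and the sketch is not a reproduction of the WEKA implementation.

```python
def best_first_search(features, merit, max_stale=5):
    """Forward best-first search over feature subsets with limited backtracking."""
    start = frozenset()
    open_list = [(merit(start), start)]     # evaluated subsets not yet expanded
    visited = {start}
    best_score, best_subset = open_list[0]
    stale = 0                               # consecutive expansions without improvement
    while open_list and stale < max_stale:
        open_list.sort(key=lambda item: item[0], reverse=True)
        _, subset = open_list.pop(0)        # expand the most promising subset so far
        improved = False
        for f in features:
            child = subset | {f}
            if f in subset or child in visited:
                continue
            visited.add(child)
            score = merit(child)
            open_list.append((score, child))
            if score > best_score + 1e-12:
                best_score, best_subset = score, child
                improved = True
        stale = 0 if improved else stale + 1
    return set(best_subset), best_score

# Hypothetical evaluator: relevance minus a redundancy penalty and a small size penalty.
RELEVANCE = {"SEND_SMS": 0.5, "READ_SMS": 0.5, "INTERNET": 0.5, "VIBRATE": 0.0}
REDUNDANT_PAIRS = [{"SEND_SMS", "READ_SMS"}]

def toy_merit(subset):
    gain = sum(RELEVANCE[f] for f in subset)
    redundancy = sum(0.6 for pair in REDUNDANT_PAIRS if pair <= subset)
    return gain - redundancy - 0.05 * len(subset)

selected, score = best_first_search(list(RELEVANCE), toy_merit)
print(selected, score)
```

Because unexpanded subsets stay on the open list, the search can return to an earlier, less promising subset when the current branch stops improving; the stale counter bounds how long it keeps doing so.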

5. Experimental Environments and Performance Evaluation Matrix
The datasets produced by the previous process are preprocessed. All entries are reviewed during this phase, and data cleaning and filtering are performed. After preprocessing the dataset, the optimal subset was obtained using the CFS + Best First method. Following that, 70% of the dataset is used for training, 30% is used for testing, and 10-fold cross-validation is used to validate the given model. The Waikato Environment for Knowledge Analysis (WEKA) tools were used in all experiments. All tests are run on a machine with 8 GB of memory and a 1.80 GHz Intel(R) Core(TM) i7-8550U processor.
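The two evaluation protocols mentioned above can be illustrated with index-level sketches. How the 70/30 split and the 10-fold cross-validation were combined in WEKA is not specified here, so the two are shown separately; the function names and the fixed random seed are assumptions.

```python
import random

def holdout_split(n, train_fraction=0.7, seed=42):
    """Shuffle the indices 0..n-1 and split them into training and test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(n * train_fraction))
    return idx[:cut], idx[cut:]

def k_fold_indices(n, k=10, seed=42):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

train_idx, test_idx = holdout_split(3000)
folds = list(k_fold_indices(3000, k=10))
print(len(train_idx), len(test_idx), len(folds))
```

For the 3000-sample dataset, the holdout split gives 2100 training and 900 test samples, and each of the ten folds holds out 300 samples.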
5.1. Performance Evaluation Matrix
The confusion matrix is an N × N matrix, where N is the number of classes, used to evaluate the performance of ML techniques. It provides more insight into the predictive model's performance and shows which classes are classified correctly and which are not.
(1) A true positive (TP): an instance that belongs to the positive class and is correctly classified as positive.
(2) A false positive (FP): an instance that belongs to the negative class but is incorrectly classified as positive.
(3) A true negative (TN): an instance that belongs to the negative class and is correctly classified as negative.
(4) A false negative (FN): an instance that belongs to the positive class but is incorrectly classified as negative.
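The four counts can be obtained by comparing actual and predicted labels, as in the following sketch. Treating "malware" as the positive class is an assumption, and the label lists are hypothetical.

```python
def confusion_counts(actual, predicted, positive="malware"):
    """Count TP, FP, TN, and FN for a binary problem with the given positive class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1      # positive instance, predicted positive
        elif p == positive:
            fp += 1      # negative instance, predicted positive
        elif a == positive:
            fn += 1      # positive instance, predicted negative
        else:
            tn += 1      # negative instance, predicted negative
    return tp, fp, tn, fn

actual = ["malware", "malware", "malware", "benign", "benign", "benign"]
predicted = ["malware", "malware", "benign", "benign", "malware", "benign"]
tp, fp, tn, fn = confusion_counts(actual, predicted)
print(tp, fp, tn, fn)
```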
We calculate the following metrics to evaluate the effectiveness of the proposed method.
Accuracy: the classification accuracy is the proportion of all instances that the classifier labels correctly. The accuracy is calculated using the following equation:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Recall: recall is also known as sensitivity (SN) and true positive rate (TPR). The recall is the ratio of correctly predicted positive instances to the total number of actual positive instances (TP + FN). The recall is calculated using the following equation:
$$\text{Recall} = \frac{TP}{TP + FN}$$
Precision: the precision is the ratio of true-positive predictions to the total number of positive predictions (TP + FP). The precision is calculated using the following equation:
$$\text{Precision} = \frac{TP}{TP + FP}$$
F-measure: the F-measure is the harmonic mean of precision and recall and gives a better measure of the incorrectly classified cases than the accuracy metric. The F-measure uses the harmonic mean because it penalizes extreme values. The F-measure is calculated using the following equation:
$$F\text{-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
False positive rate (FPR) (also known as false alarm rate): the FPR is the proportion of negative instances that are incorrectly classified as positive. The false-positive rate is calculated using the following equation:
$$\text{FPR} = \frac{FP}{FP + TN}$$
AUC: the AUC is an acronym for “area under the ROC curve.” It is a metric that measures performance across all possible classification thresholds. The AUC is simply the area between the ROC curve and the x-axis. The area under the ROC curve is measured using the following equation:
$$\text{AUC} = \int_{0}^{1} \text{TPR} \; d(\text{FPR})$$
ROC: a ROC curve (receiver operating characteristic curve) is a graph formed by plotting the true-positive rate (TPR) against the false-positive rate (FPR) at various threshold values to represent the performance of a classification model across all classification thresholds.
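A ROC curve and its area can be computed from classifier scores by sweeping the threshold from high to low, as sketched below. The trapezoidal rule is used for the area; the scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the threshold; label 1 is positive."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with identical scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
area = auc(points)
print(points, area)
```

In this toy example, eight of the nine positive-negative pairs are ranked correctly, so the area is 8/9.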
MCC: the Matthews correlation coefficient (MCC) is used to measure the quality of binary classifications. Its value lies between −1 and +1, where +1 represents perfect classification and −1 represents total failure to classify. The Matthews correlation coefficient is calculated using the following equation:
$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
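The count-based metrics defined above can be computed together from the four confusion-matrix entries. The sketch below uses the standard definitions; the counts in the example are hypothetical and are not results from this study.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Compute the count-based metrics; assumes no denominator is zero."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "fpr": fpr,
        "mcc": mcc,
    }

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, tn=85, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```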
Mean absolute error (MAE): the MAE is used to measure the prediction error in the classification problem. It averages the absolute differences between predicted and actual values, so the sign of each error is ignored. It is not very sensitive to outliers. The MAE ranges from 0 to infinity; values near 0 represent the best performance. The MAE is used to calculate performance on continuous data. It is a linear score: all individual differences are weighted equally in the average. The mean absolute error is calculated using the following equation:
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{\theta}_i - \theta_i\right|$$
where $\hat{\theta}_i$ is the predicted value, $\theta_i$ is the actual value, and n is the total number of instances.
Root mean squared error (RMSE): the RMSE is the standard deviation of the residual errors. The errors are measured by subtracting the actual values from the predicted values, and the errors are squared before they are averaged. Values near 0 indicate better performance of the model. The RMSE is very sensitive to outliers, and large errors are penalized heavily. The RMSE is very useful when large errors are present and considerably influence the model's performance. The root mean squared error is calculated using the following equation:
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{\theta}_i - \theta_i\right)^2}$$
where $\hat{\theta}_i$ is the predicted value, $\theta_i$ is the actual value, and n is the total number of instances.
Relative absolute error (RAE): the RAE compares the model's total absolute error with the total absolute error produced by a naive model that always predicts the mean, and expresses the result as a ratio. A value below 100% indicates that the model performs better than the naive model. It is relative because the total absolute error is divided by the total absolute error of the naive model. A value of RAE closer to 0 represents better performance. The RAE is expressed as a percentage; the formula of the RAE is given in the following equation:
$$\text{RAE} = \frac{\sum_{i=1}^{n}\left|\hat{\theta}_i - \theta_i\right|}{\sum_{i=1}^{n}\left|\bar{\theta} - \theta_i\right|} \times 100\%$$
where $\hat{\theta}_i$ is the predicted value, $\theta_i$ is the actual value, n is the total number of instances, and $\bar{\theta}$ is the mean value of θ.
Root relative squared error (RRSE): the RRSE takes the total squared error and normalizes it by dividing by the total squared error of the naive model, and then takes the square root. The RRSE is expressed as a percentage. A lower value of RRSE indicates better performance of the model. The root relative squared error is calculated using the following equation:
$$\text{RRSE} = \sqrt{\frac{\sum_{i=1}^{n}\left(\hat{\theta}_i - \theta_i\right)^2}{\sum_{i=1}^{n}\left(\bar{\theta} - \theta_i\right)^2}} \times 100\%$$
where $\hat{\theta}_i$ is the predicted value, $\theta_i$ is the actual value, n is the total number of instances, and $\bar{\theta}$ is the mean value of θ.
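The four error measures can be computed side by side, as in the sketch below. It assumes the standard per-instance definitions, with the naive model predicting the mean of the actual values; how WEKA derives these errors from class probability estimates may differ in detail. The actual and predicted values are hypothetical.

```python
import math

def error_metrics(actual, predicted):
    """Return MAE, RMSE, RAE (%), and RRSE (%) for paired actual/predicted values."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # Errors of the naive model that always predicts the mean of the actual values.
    naive_abs = sum(abs(mean_actual - a) for a in actual)
    naive_sq = sum((mean_actual - a) ** 2 for a in actual)
    return {
        "MAE": abs_err / n,
        "RMSE": math.sqrt(sq_err / n),
        "RAE": 100 * abs_err / naive_abs,
        "RRSE": 100 * math.sqrt(sq_err / naive_sq),
    }

# Hypothetical values: actual labels encoded as 0/1, predictions as probabilities.
actual = [1, 1, 0, 0]
predicted = [0.9, 0.6, 0.2, 0.1]
errors = error_metrics(actual, predicted)
print(errors)
```

Because RAE and RRSE are normalized by the naive model, a value of 100% means the classifier is no better than always predicting the mean.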
6. Results and Discussion
This section evaluates the performance of all employed ML techniques. It is divided into five subsections. The first presents the confusion matrices obtained for all employed techniques. The second describes performance in terms of correctly and incorrectly classified instances, precision, and recall. The third evaluates performance based on time-to-build and accuracy. The fourth assesses performance using the FPR, ROC, and MCC criteria. The fifth deals with the MAE, RMSE, RAE, and RRSE error criteria.
6.1. Confusion Matrix
This subsection presents the derived confusion matrices from all employed ML techniques.
6.2. Evaluation Based on Correctly and Incorrectly Classified Instances, Precision, and Recall Rate
The confusion matrix generated by each model is given in Tables 2–8. Figure 3 depicts the correctly and incorrectly classified instances for each model. The precision and recall values for each classifier are shown in Table 9 and Figure 4. The LMT model correctly classifies 2998 instances, while two are incorrectly classified as malware. The LMT model also demonstrates precision and recall rates greater than 99 percent, indicating that this approach has the best performance based on these criteria.
6.3. Evaluation of Results Based on Time-to-Build and Accuracy
The REPTree method is the fastest in time-to-build, taking 0.99 seconds, while the LMT method is the slowest, taking 22.71 seconds. As shown in Figure 5, the LMT and multiclass classifiers perform best in terms of classification accuracy, with an accuracy of 99.9%. By contrast, REPTree performs worst, with an accuracy of 96.5%, which is the lowest of all applied approaches.
6.4. Evaluation Based on FPR, ROC, and MCC Criteria
The FPR values of the LMT and multiclass classifiers are 0.001, which is lower than those of all other applied approaches. The LMT classifier outperforms all other models in terms of ROC, while in terms of the Matthews correlation coefficient (MCC), the multiclass classifier and LMT outperform all other models. The FPR values are depicted in Figure 6, while the ROC and MCC values are described in Table 9 and Figure 7.
6.5. Evaluation Based on MAE, RMSE, RAE, and RRSE Error Criteria
This subsection evaluates the ML techniques based on the error values obtained from all applied classifiers. The errors obtained from all the classifiers are outlined in Figures 4 and 8, respectively. Figure 7 depicts MAE and RMSE, and Figure 8 shows RAE and RRSE. According to the error criteria, the performance of the meta multiclass classifier (MCC) and LMT is the best, as both have the lowest error rates. By contrast, REPTree has the poorest performance because its errors are higher than those of all the other classifiers.
7. Conclusion
This paper presented an Android malware detection approach using a lightweight algorithm (CFS + BFS) for optimal feature selection. Correlation-based feature selection (CFS) evaluates the value of a subset of attributes by considering each feature's predictive ability and the degree of redundancy between them, whereas Best First searches the space of attribute subsets using greedy hill climbing with a backtracking facility. The number of consecutive non-improving nodes allowed determines the amount of backtracking performed. Thus, this hybrid approach takes advantage of both CFS and BFS, and the results demonstrate the promising behavior of the proposed CFSBFDroid framework. The proposed algorithm resulted in high classification accuracy, low computational complexity, and quick convergence. The performance of CFSBFDroid is better than the results reported in the literature. The highest detection accuracy was 99.89%, and the highest obtained F1 score was 99.9%. In the evaluation of the precision, recall, and MCC metrics, the proposed approach achieved scores of more than 99%. The proposed model takes 2.34 seconds to build and gives a very low false alarm rate of 0.001. In future work, we will extend our work to implement some intelligent techniques for Android malware familial classification.
Data Availability
The data are available at https://205.174.165.80/CICDataset/CICMalAnal2017/Dataset/APKs/.
Conflicts of Interest
The authors declare no conflicts of interest regarding the publication of this paper.
Acknowledgments
The authors are very grateful to the Dept. of Computer Science and Applications, Makhanlal Chaturvedi University, Bhopal, M.P., India, for providing the resources to carry out this work.