Abstract

Data mining techniques have numerous applications in malware detection, and classification is one of the most popular of these techniques. In this paper we present a data mining classification approach to detect malware behavior. We apply several classification methods to detect malware based on the features and behavior of each malware program. A dynamic analysis method is used to identify the malware features, and a converter program is presented for transforming an XML file of malware behavior execution history into a suitable WEKA input. To illustrate the performance of the approach on training and test data, we apply the proposed methods to a real case study data set using the WEKA tool. The evaluation results demonstrate the applicability and efficiency of the proposed data mining approach for detecting malware, and they show that behavioral classification of malware can be useful for detecting malware in a behavioral antivirus.

1. Introduction

Malicious code, commonly called malware, is one of the most serious threats on the Internet [1]. Malware is a malicious application deliberately designed to damage networks and computers [2]. Traditional malware detection depends on a signature database [3, 4]. For example, a file can be examined by comparing its bytes against the signatures in the database; if the bytes match a signature, the suspicious file is recognized as malicious [5, 6]. Some studies consider signature-based malware detection less than entirely dependable, because it cannot handle dynamic modification of malware behavior and cannot identify hidden malware. In contrast, behavior-based malware detection can reveal the real behavior of a malicious file [7, 8].

The objectives of data mining include improving marketing capabilities, detecting irregular patterns, and predicting future outcomes from past experience [9]. These capabilities can be used to identify suspicious programs that carry destructive content for computer systems, such as Virus, Worm, and Trojan programs [10]. The term malware is used in [11, 12] to denote such a destructive file. Data mining techniques rely on data sets that contain characteristic patterns of malicious files and benign software in order to construct classification models for malware detection [13, 14].

Because malware continues to grow, protection against unknown malware is an essential topic in malware detection based on machine learning methods. Generally, data mining approaches take both malicious executables and benign software programs found in the wild as their input set [13, 15, 16]. Data mining algorithms can usually be divided into two categories: supervised and unsupervised learning. Supervised learning methods, which include classification algorithms, require a training data set [13, 17]. In contrast, unsupervised learning methods, which include clustering algorithms, attempt to organize data into different clusters [18, 19].

Malware programs are usually divided into categories such as Worm, Virus, Trojan, Spyware, Backdoor, and Rootkit [10, 20-22]. Traditional approaches to identifying malware are built on signature-based techniques. In recent years, the failure of these older methods to detect unknown or polymorphic malicious files has led researchers to propose more dependable detection approaches based on the behavior of the malware [23]. Malware is detected through two types of analysis: static analysis and dynamic analysis. Static analysis examines code without running it; it can detect malicious code and assign it to one of the available collections using different learning methods [24]. In static analysis, malicious files are detected from their binary code. The main disadvantage of static analysis is that the source code of the program is unavailable, and it is worth noting that extracting information from binary code is a relatively complex task.

In contrast, dynamic analysis detects malicious code according to its runtime behavior [10]. Dynamic analysis, also called behavior analysis, analyzes code at runtime by observing its behavior and its interaction with the system [23]. It requires executing the infected files in a virtual machine [21]. Dynamic analysis can be combined with classification and clustering methods to cope with the increasing volume and variety of malware. Malware classification methods help to assign unknown malware to recognized families [7, 20]. Malware classification is therefore used to filter unknown cases and thus decreases the cost of analysis [8, 25-29].

The contributions of this paper are as follows:
(i) Proposing a behavioral analysis mechanism for malware detection.
(ii) Presenting a converter program for transforming an XML file of malware behavior execution history into a suitable WEKA input.
(iii) Discussing several classification methods on a real case study of malware.
(iv) Comparing experimental results such as Correctly Classified Instances, mean absolute error, and true optimistic ratio on the real data set using the WEKA tool.
(v) Testing the best classification method, based on the important features in malware detection, in order to develop a behavioral antivirus.

The rest of this paper is organized as follows. Section 2 discusses the background and related work in malware detection and data mining techniques. Section 3 presents the malware behavioral analysis: we propose a new approach for analyzing malware behavior and translating malicious files into data mining files using a real case study. That section also describes the classification and prediction approaches on the data mining platform, and we apply some popular classification methods to our real case study using the WEKA tool. The evaluation and experimental results are reported in Section 4. Section 5 concludes the discussion and outlines future work.

2. Related Work

This section gives a brief background and reviews related work on malware detection with data mining methods. We first briefly review data mining approaches based on classification methods, both for malware and for other systems. Recently, several researchers have presented different approaches to malware analysis. Schultz et al. [30] proposed a data mining method to recognize new malicious files at runtime. Their method was based on three types of DLL information: the list of DLLs used by the binary, the list of DLL function calls, and the number of different system calls used within each DLL. They also examined byte sequences extracted from the hex-dump (a hexadecimal view of computer data) of an executable file using signature methods. The method is built mainly on the Naive-Bayes (NB) algorithm. They compared their experimental results with those of traditional signature-based methods.

Kolter and Maloof [31] presented a data mining approach with n-gram analysis to identify malicious executable files based on a signature approach. They used a hex-dump utility to translate each executable file into hexadecimal code in ASCII format. Their main data set consisted of clean programs and malicious programs. They evaluated the proposed approach with several popular classification methods: instance-based learner, TFIDF, Naive-Bayes, support vector machines, decision tree, boosted Naive-Bayes, and boosted decision tree. In other research, Siddiqui et al. [32] proposed data mining techniques for recognizing malware programs such as Worms. Their approach considered variable-length instruction sequences, and their main data set included Windows files and Worms. In their experiments, sequence reduction removed 97% of the sequences, and the random forest decision tree model performed slightly better than the others.

Other research has applied data mining methodologies in different domains. For example, the researchers in [33] presented various data mining methods developed for cancer diagnosis. That research focused on capturing clinical information that can be obtained without surgery as a substitute for the pathology report. Using data mining techniques, the authors sought to discover the association between the clinical information and the pathology report in order to support pathologic staging diagnosis of lung cancer. In other research [1, 34], the authors proposed a data mining approach to analyze students' careers. Their approach is based on clustering and sequential methods, with the aim of identifying strategies for improving exam scheduling and student performance. They analyzed a real case study using k-means clustering in the WEKA tool. Likewise, [26] presented a new data mining method for detecting phishing websites, using a developed associative classification method called multilabel classifier that generates rules with multiple labels. They analyzed the experimental results on various patterns in the WEKA software. The researchers in [35] analyzed several decision tree models to classify patients in hospital surveillance data as a real case study. Their experimental results showed that their approach yielded a more even distribution of instances in each class. Other related work [36] used a neurofuzzy data mining approach for classification that relies on generalized bell-shaped membership functions. They applied the proposed technique to ten standard real data sets from the UCI machine learning repository, evaluated the classification using the Kappa statistic, and simulated the technique in MATLAB.

Some research has also focused on other approaches, namely host behavior classification methods [37-40]. For example, [29] presented a novel managed discretization technique for analyzing multivariate time series, which uses frequent temporal patterns as features for time series classification and is geared toward improving classification accuracy. That paper used a temporal abstraction classification approach and time interval mining on the multivariate time series. Also, [38] presented a novel mechanism based on Artificial Neural Networks (ANN) for discovering computer Worms from behavioral computer events. Using estimates of different parameters of the infected computers, the ANN, decision tree, and k-nearest neighbors classification techniques were compared. In [41], the authors presented a mechanism that extracts computer measurements to identify unknown computer Worm activity in the operating system using support vector approaches. That paper reports a series of trials that test the technique under several computer configurations and activities.

To the best of our knowledge, no existing approach analyzes malware behavior directly in a data mining platform, and no existing approach converts an XML file of malware behavior execution history into a suitable WEKA input. To address this gap, we present a new approach that translates a malicious file to the data mining platform. We then evaluate our approach with several classification methods based on malware behavior. Our approach can serve as the basis of a behavioral antivirus.

3. Malware Behavior Analysis

In this section, we propose a malware behavioral analysis mechanism, shown in Figure 1. In this mechanism, an XML file of malware behavior execution history is converted to a nonsparse matrix by our suggested application, which is written in VB.Net. Figure 2 shows a snapshot of the conversion of an XML file to a nonsparse matrix with this application. The conversion of each XML file to a suitable WEKA input considers two elements: the number of library file calls made by the malware and their volume. For example, in Box 1 the library file ntdll.dll has been called 16 times by the malware. We then translate this matrix to a WEKA input data set. Training is carried out with several classification algorithms. The classification method with the best performance is chosen for testing on a new malware data set. Finally, this procedure can be used to develop a behavioral antivirus.

To describe the behavioral model of malware, we download the XML files available from PIL (http://dws.informatik.uni-mannheim.de) [38-40]. We use 7155 XML files, divided into data set 1 and data set 2: data set 1 contains 4024 XML files and data set 2 contains 3131 XML files. Data set 1 has 89 properties and data set 2 has 91 properties for each malware program.
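As a rough illustration of the counting step described above, the following Python sketch tallies library file calls in a behavior report. The authors' converter is written in VB.Net and is not reproduced here; the element name `load_dll`, the attribute `filename`, and the sample report are assumptions made for illustration only and may not match the schema of the real XML files.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical behavior report. The tag "load_dll" and the attribute
# "filename" are assumed names, not the schema of the real reports.
SAMPLE_XML = """
<analysis>
  <process name="sample.exe">
    <load_dll filename="ntdll.dll"/>
    <load_dll filename="NTDLL.DLL"/>
    <load_dll filename="kernel32.dll"/>
    <open_file filename="report.txt"/>
  </process>
</analysis>
"""


def count_library_calls(xml_text, tag="load_dll", attr="filename"):
    """Count how often each library file appears in the report."""
    root = ET.fromstring(xml_text.strip())
    counts = Counter()
    for element in root.iter(tag):
        name = element.get(attr)
        if name:
            # Library names are compared case-insensitively.
            counts[name.lower()] += 1
    return counts


calls = count_library_calls(SAMPLE_XML)
print(calls.most_common())
```

Each count can then become one numeric property of the malware program in the matrix.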

Then, we convert each XML file to a nonsparse matrix using our suggested program. Each entry of the matrix consists of two numbers: the first is the index of the property and the second is its importance. The first row of this matrix is as follows:

(0 1.068, 2 0.534, 8 0.534, 11 0.534, 12 0.534, 23 0.534, 32 0.534, 33 0.534, 35 0.534, 36 0.534, 40 0.534, 45 1.068, 46 1.603, 47 1.068, 48 1.068, 49 1.068, 50 1.068, 51 1.068, 52 1.068, 53 1.068, 54 2.137, 55 1.068, 56 0.534, 57 1.068, 58 2.137, 61 0.534, 62 0.534, 63 2.137, 65 0.534, 66 0.534, 73 1.603, 83 22, 84 16, 85 4, 86 8, 87 6, 88 T1).

The last entry of this row, 88 T1, gives the kind of malware.
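To make the row format concrete, here is a minimal Python sketch that reads such a row into index/value pairs and expands it into a full vector. It assumes that indices absent from the row have the value 0 and that index 88 holds the class label, as in the first row above; the function names and the shortened sample row are illustrative, not part of the authors' program.

```python
# Shortened copy of the first row shown above (a few entries only).
ROW = "0 1.068, 2 0.534, 8 0.534, 83 22, 84 16, 88 T1"


def parse_row(text, class_index=88):
    """Split a row into a {property index: value} map and a class label."""
    features = {}
    label = None
    for pair in text.split(","):
        index_text, value_text = pair.split()
        index = int(index_text)
        if index == class_index:
            label = value_text  # e.g. "T1", the kind of malware
        else:
            features[index] = float(value_text)
    return features, label


def to_dense(features, n_features=88):
    """Expand the pairs into a full vector; missing indices are assumed 0."""
    vector = [0.0] * n_features
    for index, value in features.items():
        vector[index] = value
    return vector


features, label = parse_row(ROW)
dense = to_dense(features)
print(label, dense[:3], dense[83], dense[84])
```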

Finally, we analyze the execution history of the malware in the WEKA environment. The malware execution history can be produced by applications such as the SandBox tool and a virtual machine, which execute malware safely on a computer system and prevent it from spreading [28, 38-41]. The XML file includes useful information such as system library file calls; creating, searching, and changing files; modifying the registry; main process information; creating a mutex (a mutex is an application object that permits multiple program threads to share the same resource); modifying virtual memory; sending email; registry operations; and switch communications. The suggested program reads all of this information and saves it as a nonsparse matrix.

The matrix is then converted to the standard WEKA input form, an .arff file, for data set 1 and data set 2. This standard form is shown in Box 2.
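For readers unfamiliar with the format, the sketch below writes a small .arff file in Python using WEKA's sparse notation, in which each data row lists index/value pairs inside braces. The relation name, the attribute names `f0`, `f1`, ..., the class values, and the two sample rows are assumptions for illustration; the actual layout used in the paper is the one in Box 2.

```python
def build_arff(relation, n_features, class_values, rows):
    """Return ARFF text with numeric features and a nominal class attribute."""
    lines = [f"@relation {relation}", ""]
    for i in range(n_features):
        # Attribute names f0, f1, ... are placeholders.
        lines.append(f"@attribute f{i} numeric")
    lines.append("@attribute class {" + ",".join(class_values) + "}")
    lines += ["", "@data"]
    for features, label in rows:
        # Sparse ARFF row: "{index value, index value, ...}".
        pairs = [f"{i} {v:g}" for i, v in sorted(features.items())]
        pairs.append(f"{n_features} {label}")  # class is the last attribute
        lines.append("{" + ", ".join(pairs) + "}")
    return "\n".join(lines) + "\n"


# Two made-up rows with three features each.
rows = [({0: 1.068, 2: 0.534}, "T1"), ({1: 2.137}, "T2")]
arff_text = build_arff("malware_behavior", 3, ["T1", "T2"], rows)
print(arff_text)
```

Saving `arff_text` to a file with the .arff extension gives an input that WEKA's loaders are expected to accept.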

3.1. Classification and Prediction Approaches

This section describes the classification methods applied to two real case studies, data set 1 and data set 2. First, we analyze the data mining results on both data sets using WEKA classification algorithms. To assess the performance of the classification methods in WEKA, we briefly describe the measures used [27]. Correctly Classified Instances (CCI) is the percentage of test cases that were correctly classified, and Incorrectly Classified Instances (ICI) is the percentage of test cases that were incorrectly classified.

The relative absolute error (RAE) is measured relative to the error of a simple predictor, namely the average of the actual values. In the RAE, the error is the total absolute error rather than the total squared error.

Definition 1. The relative absolute error $E_i$ of an individual program $i$ is given by formula (1), where $P_{(ij)}$ is the value predicted by the individual program $i$ for sample case $j$ (out of $n$ sample cases); $T_j$ is the target value for sample case $j$; and $\bar{T}$ is given by formula (2):

$$E_i = \frac{\sum_{j=1}^{n} \left| P_{(ij)} - T_j \right|}{\sum_{j=1}^{n} \left| T_j - \bar{T} \right|} \tag{1}$$

$$\bar{T} = \frac{1}{n} \sum_{j=1}^{n} T_j \tag{2}$$

The mean absolute error (MAE) measures the average magnitude of the errors in a set of predictions, without considering their direction. The MAE describes the accuracy of continuous variables in the prediction procedure. It is the average of the absolute differences between each forecast and the corresponding observation. The MAE is a linear score, which means that all the individual differences are weighted equally in the average [42-44].
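The two error measures can be checked on a small example. The Python sketch below uses their standard definitions: the RAE divides the total absolute error by the total absolute error of a predictor that always outputs the mean target, and the MAE averages the absolute errors. The function names and sample values are illustrative assumptions; WEKA reports the RAE as a percentage, whereas this sketch returns a fraction.

```python
def relative_absolute_error(predicted, target):
    """Total absolute error relative to that of the mean predictor."""
    mean_target = sum(target) / len(target)
    numerator = sum(abs(p - t) for p, t in zip(predicted, target))
    denominator = sum(abs(t - mean_target) for t in target)
    return numerator / denominator


def mean_absolute_error(predicted, target):
    """Average magnitude of the errors, ignoring their direction."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


# Made-up predictions and targets for four sample cases.
predicted = [1.0, 0.0, 1.0, 1.0]
target = [1.0, 0.0, 0.0, 1.0]

rae = relative_absolute_error(predicted, target)
mae = mean_absolute_error(predicted, target)
print(rae, mae)
```

Here one of four predictions is wrong by 1, so the MAE is 1/4, and the mean predictor has a total absolute error of 2, so the RAE is 1/2.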

Definition 2. The mean absolute error is given by formula (3), where $f_i$ is the predicted value and $y_i$ is the true value for case $i$ (out of $n$ cases). This measure specifies the average error in the classification procedure:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| f_i - y_i \right| \tag{3}$$

We can also measure classifier proficiency using the true optimistic ratio (TOR) in formula (4), where $N_c$ is the number of correctly detected malware programs and $N_e$ is the number of incorrectly detected malware programs. The TOR reflects the cost of the estimated classification, which is significant for setting the cost of malware classification [45]:

$$\mathrm{TOR} = \frac{N_c}{N_c + N_e} \tag{4}$$

There are also two error rates for measuring classification performance. The False Acceptance Rate (FAR) is the ratio of the number of test cases that are incorrectly accepted by a given model to the total number of cases; it gives the percentage of invalid inputs that are incorrectly accepted. The False Rejection Rate (FRR) is the ratio of the number of test cases that are incorrectly rejected by a given model to the total number of cases; it gives the percentage of valid inputs that are incorrectly rejected [46]. From these two rates we can calculate the Total Error Rate (TER) as follows [47]:

$$\mathrm{TER} = \mathrm{FAR} + \mathrm{FRR} \tag{5}$$

In the classification process, we use the NaiveBayes, BayesNet, IB1, J48, and classification via regression algorithms. NaiveBayes and BayesNet are probabilistic learning algorithms based on supervised learning, and they require only a small amount of training data to estimate their parameters. The IB1 algorithm is based on lazy approaches, and the J48 algorithm is based on decision tree methods. Finally, the classification via regression algorithm is based on a Meta approach, which is a newer approach among data mining methods. Regression analysis is a statistical method for data analysis that is usually applied together with correlation analysis, which evaluates the degree of association between two quantitative data sets [37].

For example, Figure 3 shows the classification result of the NaiveBayes algorithm in the WEKA tool. The following section describes the experimental results of the classification algorithms in WEKA. Measures such as Correctly Classified Instances, Incorrectly Classified Instances, mean absolute error, and relative absolute error are compared in order to find the best classification algorithm for developing a behavioral antivirus.
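The ratio-based measures described in this section can be sketched in a few lines of Python. The sketch assumes that the true optimistic ratio is the share of correctly detected malware programs among all detections, and that the Total Error Rate is the sum of FAR and FRR, both taken over the total number of cases as described above. The function names and the counts in the usage lines are made up for illustration and are not results from this paper.

```python
def true_optimistic_ratio(correct, incorrect):
    """Share of correctly detected malware programs (assumed form)."""
    return correct / (correct + incorrect)


def error_rates(false_accepts, false_rejects, total_cases):
    """Return (FAR, FRR, TER), each relative to the total number of cases."""
    far = false_accepts / total_cases
    frr = false_rejects / total_cases
    ter = far + frr  # assumption: TER is the sum of the two rates
    return far, frr, ter


# Hypothetical counts, chosen only to exercise the functions.
tor = true_optimistic_ratio(correct=90, incorrect=10)
far, frr, ter = error_rates(false_accepts=2, false_rejects=3, total_cases=100)
print(f"TOR={tor:.2%} FAR={far:.2%} FRR={frr:.2%} TER={ter:.2%}")
```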

4. Experimental Results and Discussion

In this section, we implement our approach using the WEKA tool. The classification methods were run on a system with an Intel Core i3 2.13 GHz CPU and 4 GB of RAM. The analysis uses classification algorithms such as NaiveBayes, BayesNet, IB1, J48, Support Vector Machine (SVM), and the logistic regression method. We compare the performance of the classification methods on two malware data sets.

Tables 1 and 2 give the statistical analysis of data sets 1 and 2 for the proposed classification methods. The factors compared across the classification methods are Correctly Classified Instances, Incorrectly Classified Instances, Kappa statistic, mean absolute error, relative absolute error, root mean squared error, and root relative squared error. This comparison shows that the classification via regression method has the best performance in malware detection. For example, in data set 1, 3051 of the 4024 malware programs are correctly classified, and in data set 2, 3069 of the 3131 malware programs are correctly classified.
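Taking the counts quoted above at face value, the corresponding percentages of Correctly Classified Instances follow by simple division (rounded to two decimal places; the exact figures reported in Tables 1 and 2 take precedence):

$$\frac{3051}{4024} \approx 0.7582 = 75.82\%, \qquad \frac{3069}{3131} \approx 0.9802 = 98.02\%$$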

According to Tables 1 and 2, the percentage of Correctly Classified Instances of the logistic regression algorithm is higher than that of the other classification methods in both data sets, and its percentage of Incorrectly Classified Instances is lower than that of the other classification methods in both data sets.

After the data mining process, we test new malware cases with the regression classification algorithm. We downloaded 100 binary malware programs from NetLux (http://vxheaven.org/), analyzed their behavior with the CW-Sandbox tool, and obtained their XML files [38]. We then added these 100 malware programs to the new data set and computed the quality of their classification as the true optimistic ratio. As expected, classification via regression detected 88 of the malware programs. We can therefore use classification via regression to develop a behavioral antivirus.

Figure 4 depicts the true optimistic ratio percentage for malware detection in the new data sets. The true optimistic ratio percentage of regression method is higher than the other classification methods in the new data set.

After testing our new case study with 100 malware programs, Table 3 reports the number of False Acceptance Rate (FAR) cases and the number of False Rejection Rate (FRR) cases. Platforms such as STAC (http://tec.citius.usc.es/stac/) [48] are available for statistical comparison of the tested algorithms, but we use the WEKA tool for the statistical and experimental results on our data sets.

According to Table 3, our approach with the regression method does not incorrectly reject any valid input, whereas the NaiveBayes method incorrectly rejected 6 valid inputs.

In this test, we also find one false acceptance, that is, one case incorrectly accepted as malware. Figure 5 shows the Total Error Rate (TER) for the new test case using our approach with the regression method.

5. Conclusion and Future Work

In this paper, we proposed a new data mining approach based on classification methodologies for detecting malware behavior. First, an XML file of malware behavior execution history is converted to a nonsparse matrix using our suggested application. This matrix is then translated to a WEKA input data set. To illustrate the performance of the approach, we applied the proposed methods to a real case study data set using the WEKA tool. Training was carried out with classification algorithms such as NaiveBayes, BayesNet, IB1, J48, and regression algorithms. The regression classification method had the best performance for malware detection, and we also analyzed the new data set with it. The evaluation results demonstrate the applicability and efficiency of the proposed data mining approach for detecting malware. Based on the experimental results, classification of malware behavioral features can be a convenient method for developing a behavioral antivirus. In future work, we will try to develop and analyze a real behavioral antivirus platform based on the classification via regression algorithm.

Competing Interests

The authors declare that they have no competing interests.