Abstract

Ransomware (RW) is a distinctive variety of malware that encrypts files or locks the user’s system, holding the files hostage and causing huge financial losses to users. In this article, we propose a new model that extracts novel features from an RW dataset and classifies files as RW or benign. The proposed model can detect a large number of RW samples from various families at runtime by monitoring network, registry, and file system activity throughout execution. API-call sequences are used to represent the behavior-based features of RW. The technique extracts a 14-feature vector at runtime and analyzes it with online machine learning algorithms to predict RW. To validate effectiveness and scalability, we test 78550 recent malicious and benign samples and compare the proposed model with random forest and AdaBoost; the testing accuracy reaches 99.56%.

1. Introduction

Computers have become an integral part of our daily life, and the world can hardly imagine life without them. The Internet and computer applications have made daily life easier, but this development has also brought several threats to computers, notably malware [1]. The term malware combines “mal,” for malicious, and “ware,” for software. Such malicious software is often delivered by e-mail as a link or file; when the user clicks the link or opens the file, the malware, such as a virus, ransomware (RW), or spyware, is executed [2]. Malicious software consists of code developed by cyber attackers and designed to extensively damage the victim’s data. There are numerous types of malware, but the most common are spyware, viruses, and scareware. Spyware is designed to spy on users’ activities. It is a hidden application that runs secretly in the background on the victim’s computer. This type of malicious software collects information such as credit card details, passwords, and other sensitive information without the user’s permission. A computer virus is a common type of malware that attaches itself to other files on the victim’s machine. It is downloaded or installs itself on the computer system and then spreads quickly. It also damages the main functionality of the computer system and corrupts or locks the victim’s system and files [3]. The third type of malware is scareware, also known as RW, which comes at a high price. It can lock or encrypt user data and deny users access to their data until the demanded ransom is paid.

RW has attacked some of the largest organizations in the world. It is the main type of malware associated with cybercriminals, and it is very common. The aim of this malware is to collect money as a ransom. RW encrypts files or locks the user’s system, holding the user’s files hostage for financial gain [4]. In today’s Internet market, RW tops the list as the most dangerous and significant security threat. The history of RW goes back to 1980 [5]. In the last few years, such attacks have made headlines around the world, and new families keep emerging. For example, Cryptowall 3.0 is known as a costly and effective RW family that caused around $325 million in damage to the industry. The Sony RW attack was also very dangerous and drew huge media attention; the US government confirmed that North Korea was behind the attack [6–9].

1.1. Types of Ransomware (RW)

Given the steady stream of new RW reports every week, it is difficult to identify the different strains, as each of them spreads differently. They generally follow similar strategies to exploit users’ security weaknesses and hold data hostage [10]. There are several forms of RW, some of which are discussed here in detail.

1.1.1. Bad Rabbit RW

Bad Rabbit is a type of scareware that is at the top of the list. It infected different organizations in Eastern Europe and Russia. The RW spreads by presenting itself as a fake Adobe Flash update on compromised websites. When this RW infects a system, the user is directed to a payment page stating that the system has been infected and that $285 must be paid [10].

1.1.2. Cerber RW

Cerber is a particularly dangerous and powerful RW because it works even when the infected machine is not connected to the Internet. Cerber encrypts the files of infected users, who must then pay to regain access to them. It arrives as an infected Microsoft Office document attached to an e-mail sent to the victim. Opening the attached file automatically encrypts the victim’s files with the Rivest Cipher 4 (RC4) and Rivest–Shamir–Adleman (RSA) algorithms and renames them with Cerber extensions [11].

1.1.3. CryptoLocker RW

CryptoLocker is also a special type of malware. It works like a Trojan horse and is used to extort money: it encrypts files on the infected system and asks the user to pay to decrypt them. These threats reach the user’s system through spam e-mails, ads, fake sites, or other malicious methods. Once the system is infected, the Trojan creates several registry entries that store the paths of the encrypted files and make it run when the system restarts. It encrypts records with specific extensions and creates additional files with instructions on how to find the decryption key. This family uses different techniques to persuade the user to pay the ransom in exchange for the key [12].

1.1.4. Cryptowall RW

Cryptowall is a Trojan horse type of malware that encrypts files on the infected computer and asks the user to pay for file decryption. These threats typically arrive on the affected PC through exploit kits, spam e-mails, malicious ads, compromised sites, or other malicious methods. Once the Trojan has entered the compromised system, it creates several registry entries to store the paths of the encrypted files and to run when the computer restarts. It encrypts records with specific extensions on the system and creates additional files with instructions on how to find the decryption key. This family attempts to convince the user to pay for the key that unlocks their documents and uses different techniques to do so [13].

RW is a specialized form of malware that encrypts files and renders them inaccessible until the victim pays a ransom. It is an extremely serious problem, and it is quickly getting worse. The statistics gathered by the FBI’s Internet Crime Complaint Center (IC3) for 2018 show that Internet-enabled theft, fraud, and exploitation remain pervasive and were responsible for a staggering $2.7 billion in financial losses [14]. The FBI reports that the IC3 received 351,936 complaints in 2018, an average of more than 900 every day. There has been a dramatic increase in extortion payments, with tens of thousands of ransomware victims paying several hundred dollars each to recover their encrypted files. In some instances, the ransom is larger: the South Korean web hosting company Nayana paid 397.6 Bitcoin (about $1 million) in June 2017, and Hollywood Presbyterian Medical Center paid $17,000 in Bitcoin in February 2016 [15].

This emerging issue needs the attention of the research community so that RW families can be detected and prevented, protecting users from huge losses. In this paper, we propose a robust solution that detects RW at runtime by monitoring network, registry, and file system activity. We use API-call sequences to represent the behavior-based features of malware. The proposed methodology extracts a 14-feature vector through runtime analysis and applies online machine learning algorithms to classify malware samples in a distributed and scalable architecture.

This paper is organized as follows: Section 2 reviews recent work on RW classification and detection. Section 3 presents our proposed methodology in detail. Section 4 describes the experiments, the dataset used, the running time of the proposed approach, the evaluation metrics, and the experimental results. Section 5 concludes the paper and outlines future directions.

2. Literature Review

In this section, existing research on the detection and classification of RW is analyzed. A summary of the literature on RW, with findings, is given in Table 1, and the existing computational models for detection and classification of RW are summarized in Tables 1 and 2.

Alhawi et al. [16] presented a machine learning- (ML-) based solution for the detection of RW. The dataset was collected from VirusTotal and contains 264 records, both malicious and benign, covering 9 RW families and 3 types of benign software. Wireshark was used to capture the data, and T-Shark was used to extract the features. The experiment was carried out in WEKA version 3.8.1, which was used to split the dataset into training (70 percent) and testing (30 percent) sets. The training dataset contained 75,618 samples, and the test dataset contained 48,526 samples. Six different machine learning algorithms were applied. Using network traffic features, the authors obtained a true positive detection rate of 97.1 percent; with a decision tree classifier, they achieved a zero false positive rate (FPR) and a true positive rate (TPR) of 96.3 percent.

Rhode et al. [17] carried out a study on the detection of RW and presented a novel approach to achieve high accuracy. The proposed algorithm detects RW files within the first 20 sec of execution. The dataset was collected from VirusTotal and VirusShare and contains 23,145 benign and 2,286 malicious records. A preprocessing step converted all alphabetic values into a numerical range. Recurrent neural networks (RNNs) were applied to predict RW. The accuracy is 94 percent at 5 sec and 96 percent at 10 sec. The minimum false negative rate (FNR) of the model was 4.5 percent, and the FPR was just 3 percent. The reported value for the model at 20 sec is 93 percent. The experiment was carried out in Python version 2.7 using Keras to implement the RNN model.

Carlin et al. [20] developed a dynamic analysis technique for detecting cryptomining. The dataset consists of 490 samples collected from VirusShare, of which 194 are benign and 296 are malicious cryptomining HTML files. The RF classifier was used, implemented in WEKA version 3.9, and evaluated with 10-fold cross-validation. The best accuracy of RF is 99.05 percent. The reported FPR is 99.7 percent, and the reported FNR is 98.6 percent.

Carlin et al. [21] emphasized the analysis of low-level opcodes, both dynamic and static, to detect malware at runtime, using 1,000 labeled samples to study the effect of traditional AV labels. The dataset was collected from VirusShare, which the authors selected for its size, modality, and facility. It holds 180,000 malware records, all named by their message digest (MD5) hash with no other metadata. Preprocessing was limited to 1,000 opcodes with a 1.0 percent margin. The dataset used contains 764 benign and 18,827 malicious samples. The count-based classifier uses RF and was implemented in WEKA version 3.8. The best accuracy of RF is 98.4 percent.

Takeuchi et al. [24] introduced RW detection using support vector machines (SVMs). The dataset consists of 588 samples, 312 benign and 276 RW, collected from VirusTotal. The authors map different sequences of API calls into the same vector symbols and then train and test an SVM classifier on the data. The standard accuracy of the vector symbols is 93.52 percent, and the best accuracy of the SVM is 97.48 percent.

3. Methodology

In this section, the new methodology is discussed. Its main objective is the detection of RW families at runtime. The dataset used in this paper was collected from the VirusTotal website [27]. VirusTotal is an online service that examines files and uniform resource locators (URLs) to help detect worms, viruses, and other kinds of malicious content using website scanners and antivirus engines. The dataset is used to distinguish benign files from malware. The proposed methodological model has different phases, as shown in Figure 1.

First, the selected dataset is preprocessed. In the second phase, useful features are extracted from the preprocessed dataset using API calls. In the third phase, the dataset is divided into training and testing subsets. Finally, for classification, three machine-learning algorithms are used: the modified decision tree, random forest, and AdaBoost.

3.1. Data Sets

The dataset was collected from VirusTotal. It consists of 78550 samples, of which 35369 are malware and 43191 are benign.

The dataset has a total of 18 features, from which we select the 14 that are most relevant for classifying a file as malware or benign. To improve the reliability of the results, the 10-fold cross-validation technique is applied to the data [27].
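The 10-fold procedure can be illustrated with a minimal sketch. The function name `k_fold_indices`, the seed, and the toy sample count of 25 are assumptions made for illustration and are not taken from the study; each sample lands in exactly one test fold, and the remaining nine folds form the training set for that round.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Toy run: 25 hypothetical samples split into 10 folds.
folds = list(k_fold_indices(n_samples=25, k=10))
```

In practice, a classifier would be trained on `train_idx` and scored on `test_idx` for each fold, and the ten scores would then be averaged.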

3.2. Feature Extraction

In this step, we extract 14 features from the dataset. The details of these features are given in Table 3. The file name and MD5 hash features are dropped from the dataset. The last feature is used as the class label, i.e., benign or malware.
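As a rough illustration of this step, the sketch below turns one raw record into a numeric feature vector and a class label. The column names (`file_name`, `md5`, `registry_writes`, `files_renamed`, `dns_queries`, `label`) and the values are invented for the example; the actual features are those listed in Table 3.

```python
def to_feature_vector(record, drop=("file_name", "md5"), label_key="label"):
    """Split one raw record into (feature_vector, class_label)."""
    features = [
        float(value)
        for key, value in record.items()
        if key not in drop and key != label_key
    ]
    return features, record[label_key]


# Toy record with hypothetical column names and placeholder values.
raw = {
    "file_name": "sample.exe",
    "md5": "d41d8cd98f00b204e9800998ecf8427e",
    "registry_writes": 12,
    "files_renamed": 340,
    "dns_queries": 3,
    "label": "malware",
}
vector, label = to_feature_vector(raw)
```

Identifier-like columns such as the file name and hash carry no behavioral information, which is why they are removed before training.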

3.3. Training and Testing

After extracting all feature vectors, we used them together with their class labels to train the model. The trained classifiers can then predict the labels of new instances presented as feature vectors. Finally, the performance of the proposed model is evaluated. In this research, we used three different machine-learning algorithms, namely, decision tree, random forest, and AdaBoost.

3.4. Classification

During classification, the dataset is split into training and test datasets. This split plays a key role in RW detection and in ML generally. The training set is used to train the model, and the test set is used to validate the model results.
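A simple hold-out split can be sketched as follows. The 30 percent test fraction, the seed, and the toy data are assumptions chosen for the example; the study does not state a split ratio for its own experiments beyond the use of 10-fold cross-validation.

```python
import random


def train_test_split(vectors, labels, test_fraction=0.3, seed=0):
    """Shuffle (vector, label) pairs and split them into train and test lists."""
    paired = list(zip(vectors, labels))
    random.Random(seed).shuffle(paired)
    n_test = int(round(len(paired) * test_fraction))
    # The first n_test shuffled pairs form the test set; the rest are for training.
    return paired[n_test:], paired[:n_test]


# Toy data: ten hypothetical two-feature vectors with alternating labels.
vectors = [[i, i % 3] for i in range(10)]
labels = ["malware" if i % 2 else "benign" for i in range(10)]
train, test = train_test_split(vectors, labels)
```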

3.4.1. Modified Decision Tree

Algorithm 1 is used to split a huge collection of records into continuously smaller subsets of records by applying a sequence of simple decision rules.

Input:
 Training samples = series of API calls
  C:
  c:
  large:
  attribute_list:
  test_attribute:
Output:
 Vector feature
Function
 1. Create a node N
 2. If N = c Then
 3. return(n)
 4. Else
 5. C = n
 6. End if
 7. If attribute_list = 0 Then
 8. return(n)
 9. Else
 10. C = large
 11. End if
 12. test_attribute = large
 13. For each a_i in test_attribute
 14. Sample_set = N is partitioned.
 15. If test_attribute = ai
 16.  n = test condition
 17. End if
 18. End For Loop
End Function

Algorithm 1 splits the feature space into subsets where each subset consists of a homogeneous group of samples [28]. The outcome is a tree with leaf nodes and decision nodes. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical variables and numerical data [29].

The decision tree uses the information gain theory to select the best partitioning attribute from the dataset. The info (X) is calculated using (1):
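For reference, the textbook form of these quantities is sketched below. It is an assumed standard formulation: X is a set of samples, p_i is the proportion of samples in X that belong to class i (with m = 2 classes here, RW and benign), and X_v is the subset of X in which attribute A takes the value v.

```latex
\mathrm{info}(X) = -\sum_{i=1}^{m} p_i \log_2 p_i ,
\qquad
\mathrm{gain}(A) = \mathrm{info}(X) - \sum_{v \in \mathrm{values}(A)} \frac{|X_v|}{|X|}\, \mathrm{info}(X_v)
```

The attribute with the largest gain is the one whose partition leaves the least class uncertainty, so it is chosen as the splitting attribute at the current node.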

The key advantage of the decision tree is its ease of implementation. Decision trees and the principle they work on are easier to interpret and understand than more complex machine-learning algorithms.
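To illustrate how little code the core idea needs, the following is a minimal sketch of information-gain split selection for numerical features. It is not the modified decision tree of Algorithm 1; the function names, the threshold-style binary split, and the toy data are assumptions made for the example.

```python
import math
from collections import Counter


def info(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature, threshold):
    """Entropy reduction from splitting on rows[i][feature] <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[feature] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty carries no information
    weighted = (len(left) * info(left) + len(right) * info(right)) / len(labels)
    return info(labels) - weighted


def best_split(rows, labels):
    """Return (gain, feature_index, threshold) with the highest information gain."""
    candidates = (
        (information_gain(rows, labels, f, row[f]), f, row[f])
        for f in range(len(rows[0]))
        for row in rows
    )
    return max(candidates)


# Toy data: feature 0 separates the classes, feature 1 does not.
rows = [[1, 50], [2, 40], [8, 45], [9, 55]]
labels = ["benign", "benign", "rw", "rw"]
gain, feature, threshold = best_split(rows, labels)
```

On this toy data the best split is on feature 0 at threshold 2, which separates the two classes completely and therefore removes the full one bit of uncertainty. A complete tree would apply `best_split` recursively to each resulting subset until the subsets are pure.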

3.4.2. Random Forest

Algorithm 2 is a combination of different decision trees, each with unique nodes and trained on different data, which leads to different leaves (Figure 2). It combines the decisions of multiple decision trees to find the best answer, which represents the averaged decision of the trees [4]. Random forest is a flexible, easy-to-use machine-learning algorithm that generally produces a good result even without hyperparameter tuning. It can be used for both regression and classification problems [30].
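The combination of trees described above can be sketched in a few lines. This is a deliberately simplified illustration and not the implementation used in the study: each tree is limited to depth one (a single split), and the function names, the number of trees, the seed, and the toy data are assumptions. The two ingredients that characterize a random forest are still present: every tree is trained on a bootstrap sample using a random subset of the features, and the forest predicts by majority vote.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, features):
    """Depth-1 tree: choose the (feature, threshold) with the fewest training errors."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no valid split: fall back to a constant prediction
        m = majority(labels)
        return (0, float("inf"), m, m)
    return best[1:]


def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    return l_lab if row[f] <= t else r_lab


def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Bootstrap: draw as many rows as the dataset has, with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Each tree sees only a random subset of the features.
        feats = rng.sample(range(len(rows[0])), n_features)
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Majority vote over the predictions of all trees."""
    return majority([predict_stump(s, row) for s in forest])


# Toy data with two hypothetical features.
rows = [[1, 10], [2, 12], [3, 11], [8, 50], [9, 55], [10, 52]]
labels = ["benign", "benign", "benign", "rw", "rw", "rw"]
forest = fit_forest(rows, labels)
```

Because the trees see different samples and features, their individual errors tend to differ, and the vote smooths them out.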

Input:
 Dataset = malicious and benign files
  C:
  c:
  N:
  attribute_list:
  test_attribute:
  k
Output:
 vector feature
Function selection of features (RW dataset)
 1. For i = 0 to N do
 2.  Replace Sample_set = c
 3.  Create node N
 4.  Call tree(N)
 5. End for
 6. Createtree(n)
 7. If N = attribute_list Then
 8.  Return(N)
 9. Else
 10.  Select from vector feature
 11.  Select vector feature F.
 12.  For i = 0 to k do
 13.   Set sample N to C, where C is features = match vectors; call tree(N)
 14.  End for
 15. End if
End Function

3.4.3. AdaBoost

AdaBoost stands for adaptive boosting and combines weak classifiers into a strong classifier. Adaptive boosting was the first practical boosting technique for building a strong classifier from a combination of weaker ones [31]. A tree with just one node and two leaves is called a decision stump (Figure 3).

Here, h(x) is a weak classifier. The strong classifier F(x) is computed as a weighted majority vote of the weak hypotheses h(x), where each hypothesis is assigned a weight. Each weak classifier learns by considering one simple feature, and h(x) selects the most useful feature for classification (Figure 4).
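The weighted vote can be made concrete with a minimal sketch of the standard AdaBoost procedure using decision stumps as weak classifiers. It is an illustration under stated assumptions rather than the code used in the study: labels are encoded as +1 (RW) and -1 (benign), the hypothesis weight is called `alpha`, and the function names and one-dimensional toy data are invented for the example.

```python
import math


def fit_weighted_stump(rows, labels, weights):
    """Find the stump (error, feature, threshold, polarity) with the lowest weighted error."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            for polarity in (1, -1):
                preds = [polarity if r[f] <= t else -polarity for r in rows]
                err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
                if best is None or err < best[0]:
                    best = (err, f, t, polarity)
    return best


def stump_predict(stump, row):
    _, f, t, polarity = stump
    return polarity if row[f] <= t else -polarity


def fit_adaboost(rows, labels, n_rounds=3):
    n = len(rows)
    weights = [1.0 / n] * n  # start with uniform sample weights
    ensemble = []            # list of (alpha, stump) pairs
    for _ in range(n_rounds):
        stump = fit_weighted_stump(rows, labels, weights)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # clamp to avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Raise the weights of misclassified samples and lower the others.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, r))
                   for w, r, y in zip(weights, rows, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict_adaboost(ensemble, row):
    """Sign of the alpha-weighted vote of all weak classifiers."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D data that no single stump can classify correctly.
rows = [[1], [2], [3], [4], [5], [6]]
labels = [1, 1, -1, -1, 1, 1]
ensemble = fit_adaboost(rows, labels, n_rounds=3)
predictions = [predict_adaboost(ensemble, r) for r in rows]
```

On this toy set every individual stump misclassifies at least two samples, while the weighted vote of three stumps classifies all six correctly, which is the sense in which weak classifiers are combined into a strong one.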

4. Experiments and Results

In this section, the experimental environment, experiments, and results are discussed. The datasets are statistically analyzed to understand the data. Then, different classification techniques are applied to classify the malicious and benign files, and finally, performance evaluation measures are used to assess the classifiers.

4.1. Datasets

The dataset used in this study consists of 78550 samples, where 35369 samples are malware and 43191 samples are benign. RW is a type of malware, and benign files are goodware. The dataset is nearly balanced; therefore, it does not need balancing techniques.

4.2. Experimental Environment

All the experiments for this study were conducted on a Core i5 machine with a 2.4 GHz CPU and 8 GB of memory. The decision tree, random forest, and AdaBoost were implemented in Python due to its simplicity and scalability.

4.2.1. Evaluation Metrics

In this study, different evaluation measures are used to compare the performance of the classifiers. These include accuracy, sensitivity, specificity, and f1. All these measures are based on the confusion matrix given in Table 4.

Accuracy is the most intuitive performance measure. It is the ratio of correctly predicted observations to total observations, and it is calculated using (2). Sensitivity (recall) is the proportion of correctly predicted positive observations among all positive observations in the actual class, and it is calculated using (3). The negative class prediction power of the classifier is called specificity, which can be calculated using (4). Finally, the f1 measure is calculated using (5), which is the harmonic mean of precision and sensitivity (recall).
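These measures can be computed directly from the four cells of a confusion matrix, as the following sketch shows. The function name and the counts in the example are illustrative assumptions and are not results from the study. The f1 measure is computed here in its standard form, as the harmonic mean of precision and sensitivity.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute standard measures from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # recall, or true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # share of predicted positives that are correct
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts: 90 true positives, 80 true negatives,
# 20 false positives, and 10 false negatives.
m = classification_metrics(tp=90, tn=80, fp=20, fn=10)
```

With these counts, accuracy is 170/200 = 0.85, sensitivity is 90/100 = 0.90, specificity is 80/100 = 0.80, and f1 is 180/210, or about 0.857.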

(1) Accuracy-Based Analysis. Table 5 presents the accuracy-based results of the experiments. According to the reported results, the performance of the decision tree is promising, with the highest accuracy of 99.34%. Random forest is very close to the decision tree, with 99.24% accuracy. AdaBoost performs worse than both the decision tree and random forest, with the lowest accuracy of 98.37%.

(2) Sensitivity-Based Analysis. The sensitivity-based comparison under 10-fold cross-validation is shown in Table 6. The experiments show that the decision tree has the highest sensitivity.

(3) Specificity-Based Analysis. Table 7 presents the specificity-based comparison of the different classifiers. The experiments show that the decision tree has the highest specificity, with a value of 99.62%.

Precision, by contrast, is the proportion of correctly classified positive observations over the total predicted positive observations.

(4) f1 Measure Based Analysis. Table 8 presents the f1-measure-based comparison of the classifiers. The experiments show that the decision tree has the highest f1 value, 99.55%.

4.3. Performance Comparison with State-of-the-Art Techniques

Comparing the performance of the different classifiers on the dataset, it is clear that the proposed technique achieved a higher accuracy than previously developed models. The results in Table 9 show that the modified decision tree has the highest accuracy of 99.56%. AdaBoost has the lowest accuracy of 98.37%. Random forest has an average accuracy of 99.38%.

Table 10 compares the proposed algorithm with other existing methods.

5. Conclusion and Future Direction

In this research, a scheme for RW detection at runtime is developed using a preprocessed dataset that comprises benign and RW files. Benign files are goodware, and RW is a special type of malware that keeps data encrypted until a ransom is paid to the attacker. In the experiments, three different algorithms, namely, decision tree, random forest, and AdaBoost, are used to distinguish RW from benign files. Among the three, the modified decision tree performed best in terms of accuracy, sensitivity, specificity, and f1-measure. Our experimental outcomes demonstrate that the testing and training accuracy of the presented malware classification reaches 99.56%. Researchers have described ways to shelter devices from attack and established parameters for protecting data from future attacks. Because RW is a Trojan-type malware attack, anomaly-based intrusion detection systems (IDS) may be used in the future to detect abnormal network behavior, and data mining techniques can be used to detect attack activity.

Data Availability

The data used to support the findings of the study are available upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest associated with the publication of this article.

Authors’ Contributions

Faizan Ullah and Qaisar Javaid conceptualized the study; Dilawar Shah was involved in the formal analysis; Abdu Salam was responsible for the methodologies and resources and wrote the original draft; Masood Ahmad and Muhammad Abrar were responsible for the software; Qaisar Javaid supervised the study; Nadeem Sarwar wrote, reviewed, and edited the article.