Abstract

Commercially available antivirus software is incapable of providing protection against malicious portable document format (PDF) files, which are therefore considered a threat to system security. To mitigate this threat to some extent, this paper introduces a new PDF malware classification system based on machine learning (ML). The novelty of the system is that it inspects a given PDF file both statically and dynamically, which increases the accuracy of determining the true nature of the document. The method is not signature-based and can therefore potentially detect obscure and zero-day malware. In the experiments, five different classifier algorithms are deployed to find the best fit for the system. The best fit is determined by calculating the true positive rate (TPR), precision, false positive rate (FPR), false negative rate (FNR), and F1-score for each classifier. The work is also compared with existing PDF classification systems. In addition, an attack on the proposed system is implemented that obfuscates the malicious code inside a PDF file by hiding it from the PDF parser during the parsing phase. The proposed approach achieved an F1-measure of 0.986 using the random forest (RF) classifier, compared with 0.978 for the state of the art. Thus, the approach is effective at identifying malware embedded in PDF files in comparison with existing systems.

1. Introduction

In today's digital world, most activities center on the use of the Internet, so it has become increasingly important to safeguard applications, data, and information against attackers who continually devise new malicious code and attacks to compromise resources. Malware analysis is therefore a prime concern, as attackers generate a wide variety of malware whose properties change rapidly. Modern malware changes its signature over time and is thus difficult to trace, so identification and classification of the latest malware is one of the most sought-after areas of research. There are two main approaches to malware identification: signature-based detection and behavior-based detection. The signature-based technique is quick and efficient, but only for identifying known malware. The behavior-based technique can identify unknown and complex malware to some extent using machine intelligence and other approaches, but it is more complex. Neither method can detect all kinds of malware, especially as the amount of malware increases day by day. In the signature-based approach, a unique signature is created from the attributes of the underlying object, and the presence of that signature is detected efficiently by scanning the object. In the behavior-based approach, the intended actions of an object are evaluated before they are carried out; that is, the potential behavior of the object is analyzed before actual execution.
Older malware was easy to detect because it was not able to hide its features, but current malware uses techniques such as obfuscation [1–4] to hide its identity for longer and can even bypass the firewall and other security checks present in the network or system. In addition, multiple types of malware are often used together to launch an attack, so the effects are more devastating.

There are two main ways of analyzing malware: static and dynamic. In the static approach, malware is inspected without running the code it embeds, whereas in the dynamic approach, it is inspected by running its code [5, 6]. Malware identification is one of the foremost steps in the malware analysis process. Static analysis does not require executing any malware samples and is relatively simple; there is no need to cover each phase of the process. Dynamic analysis is more detailed: the complete behavior of the sample is analyzed while the process is executing, which requires detailed process monitoring. Classification of malware is as important as identification. The main categories of malware are viruses, trojan horses, worms, rootkits, ransomware, and keyloggers. Classification can be carried out in various ways, such as feature-based techniques [7, 8] and image classification, in which binary values are transformed into an image. More information generally leads to better classification [9, 10], and accurate classification requires good classifiers built with recent machine intelligence techniques.

PDF is one of the most widely used document formats. Although the general public is largely unaware of it, PDF has quickly emerged as a critical attack vector. Attackers may take over a victim’s computer using any of dozens of flaws found in Adobe Reader. In addition, antivirus developers have a difficult time protecting against attacks carried in PDF files because of the format’s complex internal structure and the wide variety of obfuscation techniques already in use [11]. Most of us send attachments in the PDF format because it is known for its portability and small size. However, we rarely know what kinds of attacks these files may carry or propagate. The three primary forms of PDF malware are exploitation of vulnerabilities, phishing, and abuse of PDF features. Exploit kits take advantage of a vulnerability in the PDF reader’s API, allowing the attacker to run arbitrary code on the compromised machine; in most cases, JavaScript code is included in the file to do this. In phishing attacks, by contrast, an innocuous-looking file is used to trick the user into clicking on a malicious link. Such campaigns have been uncovered only recently and are significantly more difficult to recognize. Any of these attacks may lead to a malicious program being downloaded or a website’s login credentials being stolen.

Static analysis uses techniques such as file format inspection, string extraction, fingerprinting (primarily using hash values), antivirus scanning, and disassembly, in which machine code is converted to assembly language [12]. Static methods are time-consuming and rely on inferring the behavior of malware from its code. In the dynamic approach, by contrast, the behavior is monitored while the file is executing, which gives it more leverage to identify malware. Similarly, malware detection is primarily done in two ways: a signature-based approach, in which predefined signatures of malware, if any exist, are used for detection, and a heuristic-based approach, in which multiple factors that characterize malicious behavior are used. One challenge for the signature-based approach is that attackers change the signatures of their malware so often that it becomes hard to trace. Hence, the heuristic approach is more favorable owing to its capacity to identify polymorphic malware and some of the latest attacks.
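As a rough illustration of the static techniques listed above (fingerprinting by hash value and string extraction), the following minimal sketch summarizes a file from its raw bytes without executing anything. The keyword list, the function name `static_summary`, and the sample bytes are assumptions made for illustration and are not taken from the proposed system.

```python
import hashlib
import re

# Hypothetical keyword list chosen for illustration; not taken from the paper.
SUSPICIOUS_KEYWORDS = (b"/JavaScript", b"/JS", b"/OpenAction", b"/Launch")


def static_summary(data: bytes) -> dict:
    """Return a hash fingerprint, keyword counts, and printable strings."""
    return {
        # Fingerprint of the whole file, usable for lookups against known samples.
        "sha256": hashlib.sha256(data).hexdigest(),
        # Raw occurrence counts of each keyword in the file bytes.
        "keyword_counts": {kw.decode(): data.count(kw) for kw in SUSPICIOUS_KEYWORDS},
        # Printable ASCII runs of at least six characters.
        "strings": re.findall(rb"[\x20-\x7e]{6,}", data),
    }


# Tiny hand-written sample, not a complete PDF file.
sample = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
)
summary = static_summary(sample)
print(summary["keyword_counts"])
```

Such a summary never runs the embedded code, which is why it is cheap but can be defeated by obfuscation that hides the keywords.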

Using heuristics, code sequences, and string comparison, signature-based algorithms may determine whether a PDF is benign or malicious. However, this has not been shown to be effective against present-day stealth attacks. To find the hidden malicious behaviors of a particular file, dynamic approaches are more successful, since they run the file in a supported environment, analyze the process it goes through as well as the API calls it makes, and build a thorough record of its activities. One may learn a lot about a file’s characteristics by looking at the execution log. The attributes employed to identify malware vary with the approach: signature-based methods use byte sequences and dynamic link libraries (DLLs), behavior-based methods use API and system calls, heuristic methods use operation codes and context-free grammars, and newer mobile-oriented techniques use Android permissions and system calls [13].

The novelty of the proposed approach is that it is signature-less. The suggested model evaluates the API calls made inside the PDF file and thus looks for the activities that will be performed while the file is processed. Detection may depend on the system calls and JavaScript code found inside the PDF file. A data mining technique is used to collect information from the API calls. Based on the extracted features and statistics, a file can be categorized as either “Ordinary (O)” or “Potentially Malicious (PM).” Finally, these results are sent to the classification block, which maps the gathered information onto the algorithm’s findings and classifies the file as “correct” or “malicious.”

The main highlights of this paper are as follows:
(1) It provides a novel ML-based malware identification approach for PDF files
(2) It provides the training and testing implementation of the proposed model under various ML approaches
(3) It also highlights the efficiency of the proposed approach under a simulated malicious attack

The remainder of the paper is organized as follows. Existing work on malware identification using ML and on PDF-based malware is reviewed in Section 2. The proposed technique is described in Section 3, followed by the dataset details in Section 4. The results and the inferences drawn from them are presented in Section 5. The comparative analysis with existing models is given in Section 6, and the conclusion is given in Section 7.

2. Related Work

This section reviews the main research efforts in detecting and classifying malware using machine learning (ML) and other approaches. Work on the file types used for malware identification is also covered.

2.1. Malware Identification Related Work

Malware analysis focuses on finding the modus operandi of malware and how it affects programs and systems. Historically, signature-based identification approaches were widely used. This technique works quickly and effectively against known malware but does not work well against zero-day malware [14, 15]. A malware identification framework based on a genetic algorithm (GA) and signature generators was proposed in [16]. While the authors claim that this methodology may identify unknown malware, the paper does not include significant information about the proposed framework, such as testing results, the amount of malware studied, and a comparison with other current studies. Fukushima et al. [17] defined a behavior-based detection method that can detect new and encrypted malware on Windows OS. In [18], a supervised ML method is suggested; the model uses an SVM kernel that weighs the frequency of each library call to detect Mac OS X malware.

Recently, with the advent of intelligent techniques, ML has become a prominent tool in malware analysis. Deep learning is a subfield of ML that descends from artificial neural networks (ANNs). It is a relatively novel method that is widely used for image analysis and autonomous cars but is not sufficient on its own for virus detection. Although it significantly and effectively reduces the feature space, it does not prevent evasion attacks. Shabtai et al. [19] proposed a taxonomy for malware identification using ML methods by surveying the types of features and feature selection reported in the literature; they focus largely on feature selection. In [20], the authors provide a detailed survey of ML for malware analysis, describing the challenges posed by datasets and ways to overcome them. In [21], image transformation is combined with ML for malware identification using a convolutional neural network (CNN). Similarly, researchers have recently worked on tool usage and framework representation for malware analysis [22–25].

2.2. File-Based Malware Identification Related Work

In [26], the authors examined the PDF structure and the JavaScript content embedded in PDFs in depth. Covering both structure and metadata, they built an extensive feature set that includes the count of bytes per second, the encoding scheme, object names, keywords, and readable strings in JavaScript. Because the features are diverse, it is difficult to craft adversarial examples, since ML algorithms are robust to small changes. They built a classification model using detection-type models that combine structural and content features to limit the risk of adversarial attacks. To validate the proposed model, they constructed an adversarial attack. In [27], the authors presented an overview of the PDF structure and of the current attacks used to embed malware in PDF files, through concrete attack examples collected in the wild. They described how to perform a forensic analysis of a PDF file to find evidence of embedded malware using software tools. They examined some of the recent ML-based PDF malware detection tools that can support digital forensic investigations by recognizing suspicious files before a deeper, more detailed forensic analysis is launched. They discussed the limitations of these tools and other open issues, particularly regarding the exploitation of their vulnerabilities to mislead subsequent forensic analyses. Finally, they gave suggestions for improving the performance of such systems under attack and sketched promising research directions. In [28], the authors focused on malware embedded in PDF files as a representative case of modern cyber-attacks. They started by giving a taxonomy of the various approaches used to generate PDF malware. To analyze attacks against learning-based PDF malware classifiers, they used an established adversarial ML framework. This framework makes it possible to identify existing flaws in learning-based PDF malware detectors and to identify novel attacks that may threaten such systems, along with possible defensive measures. In [29], the authors designed and implemented a novel framework called AIMED, which uses genetic programming to evade malware classifiers. Their experiments showed that the time needed to obtain adversarial malware samples can be reduced by up to half compared with classic random approaches. They also used AIMED to create adversarial examples with individual malware scanners as targets and tested the resulting files against further classifiers from both research and industry. The generated examples achieved cross-evasion rates of up to 82% among the classifiers.

In [30], the authors showed how the worst-case behavior of a malware classifier with respect to specific robustness properties can be verified. They also found that training classifiers that satisfy formally verified robustness properties can increase the evasion cost for unbounded attackers by eliminating simple evasion attacks. They proposed a new distance metric that operates on the PDF tree structure and specified two classes of robustness properties covering subtree insertions and deletions. They used a state-of-the-art verifiably robust training method to build robust PDF malware classifiers. In [31], the authors used a PDF malware classifier, PDFrate, to evaluate their methods. Using data from a real network, they demonstrated that the majority of predictions can be made with high ensemble classifier agreement. However, the classifier cannot reliably predict the outcome for most evasion attempts, including nine targeted mimicry scenarios from two recent studies. Their evaluation covers over 100,000 PDF files and 100,000 Android apps. In [32], the authors presented “Hidost,” the first static ML-based malware detection system designed to operate on multiple file formats. Extending a previously published and highly effective method, it combines the logical structure of files with their content for better detection accuracy. Owing to its modular design and general feature set, it can be extended to other formats whose logical structure is organized as a hierarchy.

In [33], the authors presented a novel ML system for the automatic detection of malicious PDF files. Information is extracted from both the structure and the content of the PDF, and an advanced parsing mechanism is included. As a result, a broad range of malware can be detected, including parsing-based and non-JavaScript malware. Additionally, with a careful choice of the learning algorithm, their approach achieved significantly higher accuracy than other static analysis methods, particularly in the presence of adversarial malware manipulation.

To identify JavaScript-based malware, the authors of [34] employed ML algorithms to obtain a set of API references that characterize malicious code. An important application area was examined in this investigation, namely, the detection of malicious JavaScript code in PDF files. Although their training data contained instances of malware, they demonstrated that their approach was able to identify new malware even when it was introduced into an existing system that had not previously been exposed to such malicious code. In [35], the authors built a framework that uses various feature selection and ML techniques to establish the characteristics of normal JavaScript code.

3. The Proposed Approach

A PDF document consists of a header, a body, a cross-reference table (CRT), and a trailer. The header gives the PDF version of the document, while the objects in the body hold the content of the document itself. The CRT contains the tables used to locate objects. The trailer identifies the root object and gives the location of the table of objects in the body.
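The four parts described above can be located in the raw bytes of a simple PDF file. The following is a minimal sketch, assuming a classic uncompressed cross-reference table (files that use cross-reference streams are not handled); the function name `locate_pdf_sections` and the tiny sample file are illustrative and not part of the proposed system.

```python
import re


def locate_pdf_sections(data: bytes) -> dict:
    """Find the header version, CRT offset, and trailer position in raw PDF bytes."""
    header = re.match(rb"%PDF-(\d\.\d)", data)
    # The number after 'startxref' is the byte offset of the cross-reference table.
    startxref = re.search(rb"startxref\s+(\d+)\s+%%EOF", data)
    trailer_pos = data.rfind(b"trailer")
    return {
        "version": header.group(1).decode() if header else None,
        "xref_offset": int(startxref.group(1)) if startxref else None,
        "trailer_offset": trailer_pos if trailer_pos != -1 else None,
    }


# Build a tiny illustrative file: header + one body object, then CRT, then trailer.
body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
xref = b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
trailer = (
    b"trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n"
    + str(len(body)).encode()
    + b"\n%%EOF\n"
)
sample = body + xref + trailer

info = locate_pdf_sections(sample)
print(info)
```

In this sample, the trailer names object 1 as the root and `startxref` points at the CRT, which in turn lists the byte offset of object 1 in the body.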

The proposed ML-based malware classification technique is explained here. The system is designed to scan the PDF file under inspection, analyze its internal code, and determine whether it is benign or malicious. The mechanism has also been found to block attackers’ attempts to obfuscate file headers. The technique does not identify the malware family contained in a particular file, but it does classify the file’s type accurately [36]. The classification procedure of the proposed system is shown in Figure 1. Although this is a high-level system design, it gives a good idea of how the classifier is implemented. To begin the inspection, the PDF document is uploaded to the system. The submitted document is first analyzed for its information and structure. If it follows the pattern of known harmful files, it is tagged for additional assessment, which saves considerable time. Otherwise, when it does not fit the pattern of known malicious instances, the feature extraction module analyzes the whole file structure and derives features from it. Once the features have been extracted from the PDF, they are passed to the classifier component for evaluation. The classifier component contains the main ML algorithm, which is responsible for thoroughly examining the information provided by the feature extractor [37]. After the necessary data analysis, the classifier categorizes the PDF file as either “correct” or “infected.”

Some suspicious API references in the code can be discovered only through dynamic code assessment and would be missed by static analysis alone. Both the static and the dynamic code inspection employ the same monitoring method as PhoneyPDF to watch for API references. SpiderMonkey and Rhino are two open-source JavaScript engines that have been used to conduct dynamic analysis in the past. These interpreters implement the ECMA JavaScript standard, but they cannot interpret JavaScript references to the Acrobat PDF API unless the Adobe DOM is emulated [38].

A reference set is built by selecting a subset of API references that characterize malicious JavaScript code. Using a collection of PDF files labeled as clean or malicious, the system can build this set automatically. The Acrobat PDF API recognizes all JavaScript objects, methods, functions, and constants as part of the set “H.” This enables us to define “Φ” as the set of all JavaScript objects, methods, functions, and constants. The total number of malicious and nonmalicious files is M. The following equation gives the feature set derived from all of the references.

Also, note that the class label takes one of two values, −1 and +1. A value of +1 signifies a malicious PDF, and a value of −1 signifies that the file is safe.
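One possible way to turn API references into features with −1/+1 labels is sketched below. This is only an illustration under stated assumptions: the reference set is a hand-picked, hypothetical subset of Acrobat JavaScript API names (in the proposed system it would be selected automatically from labeled files), the binary presence encoding is an assumption, and the regular expression is a crude stand-in for proper JavaScript parsing.

```python
import re

# Hypothetical reference set for illustration only.
REFERENCE_SET = ("app.alert", "util.printf", "this.getAnnots", "Collab.collectEmailInfo")

MALICIOUS, BENIGN = +1, -1  # Class labels as described in the text.


def extract_api_references(js_code: str) -> set:
    """Collect dotted identifiers such as `util.printf` from JavaScript source."""
    return set(re.findall(r"\b[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)+", js_code))


def to_feature_vector(js_code: str, reference_set=REFERENCE_SET) -> list:
    """Binary vector: 1 if the reference occurs in the code, otherwise 0."""
    found = extract_api_references(js_code)
    return [1 if ref in found else 0 for ref in reference_set]


# Made-up labeled snippets standing in for JavaScript extracted from PDF files.
labeled_snippets = [
    ("var a = this.getAnnots(); util.printf('%45000f', 1.0);", MALICIOUS),
    ("var total = 1 + 2;", BENIGN),
]
dataset = [(to_feature_vector(code), label) for code, label in labeled_snippets]
print(dataset)
```

Each file is thus reduced to a fixed-length vector over the reference set, paired with its label, which is the form a standard classifier expects.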

4. Dataset Details

This section gives an overview of the dataset used in the proposed research. The benign set is presented first, followed by the malicious set that was analyzed. A total of 1200 PDF samples, both malicious and safe, were obtained for the investigation. An 800-sample training set was employed, and 400 samples were used for testing, as shown in Table 1. It was important to keep the ratio of benign to malicious files at 1 : 1 in both the training and testing sets. The majority of the samples are based on genuine cyber-attacks that have been made public. Samples were gathered from a variety of locations on the Internet.
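A split with these proportions can be sketched as follows. The per-class counts (600 benign and 600 malicious files) are an assumption derived from the stated totals and the 1 : 1 ratio, and the file names, random seed, and function name `balanced_split` are hypothetical.

```python
import random


def balanced_split(benign, malicious, train_per_class, seed=0):
    """Shuffle each class separately and split so both sets keep a 1 : 1 ratio."""
    rng = random.Random(seed)
    benign, malicious = list(benign), list(malicious)
    rng.shuffle(benign)
    rng.shuffle(malicious)
    # Label benign files -1 and malicious files +1.
    train = [(f, -1) for f in benign[:train_per_class]]
    train += [(f, +1) for f in malicious[:train_per_class]]
    test = [(f, -1) for f in benign[train_per_class:]]
    test += [(f, +1) for f in malicious[train_per_class:]]
    return train, test


# Placeholder file names; 600 per class is assumed from the 1200 total and 1 : 1 ratio.
benign_files = [f"benign_{i}.pdf" for i in range(600)]
malicious_files = [f"malicious_{i}.pdf" for i in range(600)]

train_set, test_set = balanced_split(benign_files, malicious_files, train_per_class=400)
print(len(train_set), len(test_set))
```

Splitting each class separately, rather than shuffling the pooled samples, guarantees that the class balance is exact in both sets.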

Because JavaScript code is embedded in many of these PDF files, some classifiers treat them all as malicious because of their large file size. The approach in this work, on the other hand, does not use file size as a criterion for determining whether a file is harmful. To demonstrate this, harmless JavaScript code was purposely inserted into benign PDF files so that they appear to contain malicious JavaScript code. All malicious and safe PDFs were analyzed independently, and the average sizes of safe and malicious files differed by only approximately 800 kB, as shown in Table 2.

5. Implementation and Results

This section describes the approaches used to analyze and implement the proposed model. Training and testing are carried out consecutively, and the results are discussed. The system was trained on 800 PDFs using several ML classification techniques. Five algorithms, namely stochastic gradient boosting (SGB), random forest (RF), decision tree (DT), support vector classifier (SVC), and logistic regression (LR), are compared to check how well the system performs under each.

The effectiveness of the proposed work is reflected by the confusion matrix parameters obtained after classification. The confusion matrix comprises training, testing, validation, and combined matrices that report the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) outcomes. These parameters are then used to calculate performance measures such as precision, recall, and F1-score using the following equations, respectively.
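The performance measures used in this section follow the standard definitions in terms of confusion matrix counts, with recall equal to the TPR. The sketch below computes them from the counts of true positives, true negatives, false positives, and false negatives; the counts in the usage lines are made up for illustration and are not results from this paper.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute TPR (recall), FPR, FNR, precision, and F1 from confusion counts."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0        # recall: TP / (TP + FN)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0        # FP / (FP + TN)
    fnr = fn / (fn + tp) if (fn + tp) else 0.0        # FN / (FN + TP) = 1 - TPR
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # TP / (TP + FP)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * tpr / (precision + tpr) if (precision + tpr) else 0.0
    return {"tpr": tpr, "fpr": fpr, "fnr": fnr, "precision": precision, "f1": f1}


# Made-up counts for a 400-sample test set; not taken from the experiments.
metrics = classification_metrics(tp=190, tn=195, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A classifier that fits well should score high on TPR, precision, and F1 and low on FPR and FNR, which is the criterion applied to the five classifiers below.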

Figure 2 shows that RF, LR, and DT take the least time to run and thus perform the classification fastest. In comparison, the efficiency of SGB is average, whereas that of SVC is poor.

The “True Positive Rate (TPR)” is computed for the various classifiers during testing, and the trend line is plotted. The best-fitting classifier should have a high TPR. Figure 3 shows that the RF approach has the highest TPR among all the options, while the SVC seems unsuitable for file testing since it has the lowest TPR.

The same holds for the “Precision Score (PS)” of the proposed system under the various classifiers. From Figure 4, it can be concluded that RF provides an average PS of approximately 96 percent, whereas SVC provides the lowest PS, at roughly 72 percent.

For the “False Positive Rate (FPR)” measured during testing, Figure 5 shows that RF has the lowest FPR, which is preferred, and SVC has the highest FPR, which demonstrates its ineffectiveness.

For the “False Negative Rate (FNR),” Figure 6 shows that RF, DT, and SGB have the lowest FNR scores and are therefore strongly recommended. SVC and LR, on the other hand, have high FNR scores.

In addition, RF showed the best overall F1-score on the dataset, ahead of SGB, while the F1-score of DT remained moderate. RF has thus proved to be the best fit for the proposed system, whereas SVC gave the worst results.

A number of other PDF malware classification techniques, developed by various authors, were also tested. Based on the preceding results, the proposed system fits best when using the RF classifier. Feature extraction relies on the API calls performed by the document as well as the JavaScript code contained in it. The F1-scores of the various classifiers were determined, and the F1-score of the proposed classification method is higher than those of the other classifiers, as shown in Table 3. The F1-scores in the table were obtained on the same dataset used for the earlier testing.

6. System Analysis under Attacks

After the system had been built, malicious samples were crafted to evade it, and a mechanism to recognize this type of attack was developed during the testing step. As a general rule, when parsing a PDF file, the parser first goes to the trailer and retrieves the location of the first object in the list of objects in the body. When the first object has been parsed completely, the parser returns to the cross-reference table (CRT) and retrieves the address of the second object. In this work, the references to the body objects that contain the malicious code were deleted, so that the malicious code is not processed or read when the file is opened by a PDF reader. In this way, the parser can be fooled into treating the file as safe even though it contains malicious code, and this method can be used to deceive the system into classifying a malicious file as safe. This holds even though the file is tested using dynamic classifiers, which inspect it during execution: the code does not execute, because nothing references the parts of the body that contain it. As a result, a PDF file carrying malicious code can also pass dynamic analysis. The classification of documents under the malicious attacks is given in Table 4.
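One way to recognize this kind of hiding is to compare the objects physically present in the file with the objects the CRT actually points to. The following is a minimal sketch of such a check, assuming a classic uncompressed cross-reference table and no object streams; the regular expressions can also produce false matches inside stream data. The function names and the generated sample files are illustrative and do not reproduce the scanner used in this work.

```python
import re

# 'N G obj' marks the start of an object in the body.
OBJ_DEF = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b")
# In-use CRT entry: 10-digit byte offset, 5-digit generation number, 'n'.
XREF_ENTRY = re.compile(rb"(\d{10}) (\d{5}) n")


def objects_missing_from_xref(data: bytes) -> list:
    """Return numbers of objects defined in the body whose offsets the CRT omits."""
    listed_offsets = {int(m.group(1)) for m in XREF_ENTRY.finditer(data)}
    return [
        int(m.group(1))
        for m in OBJ_DEF.finditer(data)
        if m.start() not in listed_offsets
    ]


def build_sample(hide_second: bool) -> bytes:
    """Build a tiny two-object file; optionally drop object 2 from the CRT."""
    header = b"%PDF-1.4\n"
    obj1 = b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    obj2 = b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
    off1 = len(header)
    off2 = off1 + len(obj1)
    entries = ["0000000000 65535 f \n", f"{off1:010d} 00000 n \n"]
    if not hide_second:
        entries.append(f"{off2:010d} 00000 n \n")
    body = header + obj1 + obj2
    xref = f"xref\n0 {len(entries)}\n{''.join(entries)}".encode()
    trailer = (
        f"trailer\n<< /Size {len(entries)} /Root 1 0 R >>\n"
        f"startxref\n{len(body)}\n%%EOF\n"
    ).encode()
    return body + xref + trailer


print(objects_missing_from_xref(build_sample(hide_second=False)))
print(objects_missing_from_xref(build_sample(hide_second=True)))
```

Objects reported by such a check are never reached by a parser that follows the CRT, so they are candidates for separate inspection.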

7. Conclusion and Future Scope

This paper presents an ML model that can identify attacks based on JavaScript and malicious API calls in PDF files. Several classifiers, including DT, RF, LR, SVC, and SGB, were evaluated on the dataset to see how they performed, and the RF classifier produced the best results. A comparison with other PDF classifiers showed that the proposed approach has a high F1-score of 0.986, making it 4 percent more efficient than other recent PDF classifiers. To further strengthen the system against malicious code obfuscation, functionality is included to run an object scanner on the PDF document to identify any objects that are not being processed. Unparsed objects containing malicious code can be identified and removed easily using this approach. Future plans include adding support for other file formats and using an advanced data mining approach to gain more detailed insight into documents. The use of ML in the detection and classification of malware is highly useful, but it fails against evasion attacks; this must therefore be explored in the future.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The author declares that there are no conflicts of interest.

Acknowledgments

This research was supported by Taif University Researchers supporting Project number (TURSP-2020/215), Taif University, Taif, Saudi Arabia.