Special Issue: Big Data Analytics for Cyber Security
Malware Detection on Byte Streams of PDF Files Using Convolutional Neural Networks
As the amount of data increases, the threat of malware keeps growing. Malicious actions embedded in nonexecutable documents (e.g., PDF files) can be especially dangerous, because they are difficult to detect and most users are not aware of this type of attack. In this paper, we design a convolutional neural network to tackle malware detection on PDF files. We collect malicious and benign PDF files and manually label the byte sequences within the files. We closely examine the structure of the input data and illustrate how we design the proposed network based on the characteristics of the data. The proposed network is designed to interpret high-level patterns among collectable spatial clues, thereby predicting whether a given byte sequence contains malicious actions. Through experiments, we demonstrate that the proposed network outperforms several representative machine-learning models as well as other networks with different settings.
With the exponentially increasing amount of data, deep neural networks are drawing much attention in various fields such as image processing, natural language processing, sensor data processing, and speech recognition [2–6]. One of the main benefits of deep neural networks is that features need not be defined by hand, because the networks extract or compute them automatically. In this work, we propose a novel approach using a convolutional neural network (CNN) to tackle malware detection. The contributions of this study can be summarized as follows: (1) we design a new CNN model well suited to malware detection on PDFs, (2) we demonstrate the performance of the proposed network by experiments using our manually labelled PDF dataset, and (3) we provide a detailed discussion of the experimental results.
The proposed approach does not require feature definition at all, but it is worth noting that it is still necessary to investigate or study the data structure in order to define better input data or design better network structures. As we target the PDFs in this work, we review the structure of the PDFs intensively and illustrate how we design the proposed network according to the characteristics of the input data (i.e., byte sequences). Although we conduct experiments only with the PDF documents, we expect that the proposed network is easily applicable to other formats (e.g., .rtf files). In the following section, we review previous studies and the structure of PDF documents.
2.1. Malware Detection on Stream Data
Malware is a program written to have an undesirable or harmful effect on a computer system. As attackers' techniques for generating malicious code become more sophisticated, much research has been conducted on the detection and analysis of malicious code. Malware can be divided into two categories: executables and nonexecutables. Many security programs (e.g., Norton, Kaspersky) detect malicious actions in the form of portable executable files. However, nonexecutables (e.g., malicious actions in PDF documents) can easily bypass some existing security programs, and there is a high risk of false positives. Such document-type malware is known to be more dangerous, as common users often consider it insignificant.
In order to detect malicious documents, many studies have focused on feature extraction based on PDF structure analysis, where the features can be seen as a summary of the logical structure. Šrndić and Laskov proposed a format-independent static analysis method based on hierarchical document structure (Hidost). They adopted a structural multimap, where structural paths are represented by keys and the leaves are indicated by values. Multiple values of the multimap are reduced to a median value to constitute feature vectors that are used to train machine-learning algorithms. Cuan et al. defined features using the PDFiD Python script, which examines the objects appearing in PDFs. Because a simple trick called a gradient-descent attack may bypass their approach, they proposed two countermeasures: using a threshold for each feature and reducing the number of noncritical features. Smutz and Stavrou extracted features from document metadata, such as the number of characters in titles and the size of images. Li et al. developed a robust and secure feature extractor called FEPDF, which parses and extracts features from PDF documents. FEPDF consists of a matching method and steps for detecting the PDF header, all objects, the cross-reference table, and the trailer. They emphasized that FEPDF can identify new malicious PDFs that have not been identified by existing feature extractors.
Note that these studies commonly require an intensive feature engineering process, because it largely determines the performance (i.e., accuracy) of malware detection. In this paper, we design a convolutional neural network to tackle malware detection on nonexecutables. Although the proposed network produces results by simply feeding it binary sequences without feature engineering, it is still important to investigate the structure of the target data. That is, neural networks capture features automatically, but the features are obtained from input data that must be defined by an expert. Furthermore, understanding the data structure helps to design better networks. In this paper, we target PDF files because PDF-based attacks have recently become one of the major types of attack. The following subsection provides a detailed explanation of the structure of PDFs.
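The structural-multimap idea described above for Hidost can be illustrated with a toy sketch. The paths, values, and helper names below are hypothetical and are not taken from the Hidost implementation; the sketch only shows how several values stored under one structural path might collapse to a median so that every document yields a vector of the same length.

```python
from collections import defaultdict
from statistics import median


def build_multimap(leaves):
    """Group leaf values by structural path (key -> list of values)."""
    multimap = defaultdict(list)
    for path, value in leaves:
        multimap[path].append(value)
    return multimap


def to_feature_vector(multimap, feature_paths):
    """Reduce each path's values to their median; absent paths become 0."""
    return [median(multimap[p]) if p in multimap else 0 for p in feature_paths]


# Hypothetical (path, numeric leaf value) pairs from one parsed document.
leaves = [
    ("/Pages/Count", 3),
    ("/Pages/Kids/Resources/Font/Length", 120),
    ("/Pages/Kids/Resources/Font/Length", 80),
    ("/Pages/Kids/Resources/Font/Length", 100),
]
# Hypothetical fixed list of paths that defines the feature space.
feature_paths = [
    "/Pages/Count",
    "/Pages/Kids/Resources/Font/Length",
    "/OpenAction/JS",
]

vector = to_feature_vector(build_multimap(leaves), feature_paths)
print(vector)  # [3, 100, 0]
```

Mapping absent paths to 0 is an assumption made here for simplicity; a real extractor may encode missing paths differently.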
2.2. Stream Data of PDF Files
2.3. Neural Networks for Malware Detection
There have been few studies thus far on applying neural networks to malicious software (malware) detection. Most of the recent works among them use features extracted through dynamic analysis, that is, features obtained while the binary runs in a virtualized environment. Kolosnjaji et al. proposed a combination of convolutions and long short-term memory (LSTM) to classify malware types based on features of the API call sequences. Huang and Stokes manually defined 114 high-level features from API calls as well as original function calls to predict malware types. This approach is essentially composed of two models, malware detection and malware type classification. The authors argued that the shared parameters of the two models contribute to improving the overall performance. These dynamic-analysis studies were performed in particular nonpublic emulation environments, which makes it difficult to reproduce them.
The other line of malware detection is static analysis, in which features are obtained from the files without running them. Raff et al. applied neural networks to the malware identification problem using features in the form of 300 raw bytes of the PE header of each file. This work showed that neural networks are capable of extracting an underlying high-level interpretation from the raw bytes, which in turn makes it possible to develop malware detectors without hand-crafted features. Saxe and Berlin utilized a histogram of byte entropy values from entire files and defined a fixed-length feature vector as the input of the neural networks. Le et al. designed a CNN-BiLSTM architecture, where the rationale for placing a bidirectional LSTM (BiLSTM) layer on top of the CNN layer is that the BiLSTM layer may interpret sequential dependencies between the different pieces generated by the CNN layer. They showed that the CNN layer is effective in representing local patterns of fixed length, and that the BiLSTM has the potential to capture arbitrary sequential dependencies of executables. Raff et al. derived a feature vector from the raw bytes and designed a shallow convolutional neural network with a gated convolutional layer, a global max-pooling layer, and a fully connected output layer. They claimed that their work is the first to define a feature vector from the entire binary, and that it was hard to develop deeper networks due to the extraordinarily long byte strings (e.g., 1-2M length). Their biggest contribution is that the feature vector is obtained from the entire binary, so that it may grasp the global context of the entire binary. That is, the contents of a binary may have a high degree of positional variation or can be rearranged in arbitrary order, so they adopted global-level feature vectors with a very large dimension. Although their network is designed to scale well with binary strings of variable length, it is essentially not applicable to much longer binary strings (e.g., 3-4M length). All of these studies commonly utilized raw bytes of executables.
The network proposed in this paper is designed to take a byte sequence within a nonexecutable as input and generate an output based on high-level patterns of collectable spatial clues, which implies that the proposed network is applicable to byte sequences of variable length. In the following section, we illustrate how we design the proposed network according to the characteristics of the input data (i.e., byte sequences).
3. Proposed Method
To detect malicious actions without heavy feature engineering, we designed a deep learning model that operates on stream objects and discriminates maliciousness at the object level. The stream objects have no size limitation, and a certain part of a stream exhibits malicious actions while the other parts do not. Because the malicious part may appear anywhere in a stream, it is difficult to detect the maliciousness of the object. Among deep learning models, convolutional neural networks (CNN) are known to be successful in detecting locally contextual patterns. CNN models have brought dramatic performance advances in the area of image processing [21, 22], and one of their benefits is that they work with a smaller amount of data than other deep learning models (e.g., recurrent neural networks, fully connected neural networks).
A graphical representation of our proposed network is depicted in Figure 7, where the network consists of an embedding layer, two convolutional layers, a pooling layer, a fully connected layer, and an output layer. Note that the figure shows only the structure of the network; the actual dimensions of the layers and the numbers of channels are much larger. E denotes the embedding size and S denotes the sequence length. Rather than directly submitting the raw byte values to the convolution (i.e., using a scaled version of a byte’s value from 0 to 255), we adopt an embedding layer to map each byte to an E-dimensional feature vector, because the byte values do not imply intensity but convey some contextual information. That is, given a byte sequence of length S, the real-valued embedding matrix is computed during the training process, so that the matrix will help to grasp a wider breadth of input patterns.
The embedding layer makes it possible to represent the meaning of each byte by incorporating all the byte sequences. As discussed in prior work, the raw values of byte sequences do not simply represent intensity, and it is better to find an alternative way to interpret the values. For example, a byte value of 160 does not imply a ‘better’ or ‘stronger’ intensity than 130, but the two values must convey different meanings. In the ‘word embedding’ concept of the natural language processing (NLP) field, similar words (e.g., ‘hi’ and ‘hello’) are close to each other in the embedding space, whereas opposite words are far from each other. Likewise, our embedding layer interprets the contextual meaning of byte values and represents them in the embedding space.
Several locally adjacent E-dimensional vectors generated by the embedding layer are then fed into the first convolutional layer, which is followed by another convolutional layer. The first convolutional layer is designed to take a C1 × E matrix that is supposed to carry spatial clues of malicious actions, in the hope that the collected clues will be enough for the entire network to make a sound decision. In the recent work of Raff et al., a convolutional neural network designed for analysing the entire sequence at once was proposed, but this network is not applicable to longer sequences. In contrast, our network collects simple local clues and generates a high-level representation, which means that the proposed network can be applied to sequences of variable length. Each convolutional layer takes one or more adjacent vectors (or values) from its previous layer as input and generates an output value by summing the element-wise products with a filter. This filter, also known as a kernel, is computed during the training process.
One may argue that stacking more convolutional layers might be better, because a deeper network is often known to be capable of extracting more complicated patterns. This might be true, but it should be noted that a deeper network is not always better than a shallow one. The depth of the network must be chosen carefully by investigating the data; a network that is too complicated will probably overfit, whereas a network that is too simple will underfit. After examining the sequence data, we conclude that two convolutional layers are enough to capture the variety of spatial patterns of malicious actions. We also show through experiments that this structure is indeed better than deeper networks.
The high-level representation obtained from the two consecutive convolutional layers is submitted to the pooling layer, which helps to focus on representative or primary patterns. Among several types of pooling layer, such as average-pooling and L2-norm pooling, we adopt the max-pooling layer as it is generally known to be effective in various tasks. Similar to the convolutional layer, the pooling layer slides from the top-left corner to the bottom-right corner with a chosen stride, resulting in output vectors of much smaller size. The output vectors are flattened or concatenated to form a one-dimensional vector that is passed to the F-dimensional fully connected (FC) layer. The FC layer collects the primary patterns from the pooled values, and the last output layer represents how likely the given byte sequence is to embed malicious actions.
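The embedding lookup described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the table is randomly initialised here (in the network its values are learned during training), the table size of 256 rows is an assumption, and the names `embedding` and `embed` are hypothetical.

```python
import random

E = 25        # embedding size reported for the best setting
VOCAB = 256   # assumption: one row per possible byte value

rng = random.Random(0)
# Randomly initialised table; in the real network these values are trained.
embedding = [[rng.gauss(0.0, 0.1) for _ in range(E)] for _ in range(VOCAB)]


def embed(byte_seq):
    """Map a byte sequence of length S to an S x E list of vectors."""
    return [embedding[b] for b in byte_seq]


stream = b"%PDF-1.4"
matrix = embed(stream)
print(len(matrix), len(matrix[0]))  # 8 25
```

Each byte acts purely as an index into the table, so the values 130 and 160 are treated as two unrelated symbols rather than as two points on an intensity scale.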
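The "summation of element-wise multiplication with a filter" can be made concrete with a toy example. The sketch below assumes a stride of 1 and no padding, and uses tiny hypothetical dimensions (S = 5, E = 2, filter height 3) instead of the real ones; the filter values are made up, whereas in the network they are learned.

```python
def conv_valid(x, kernel, bias=0.0):
    """Slide a C x E kernel down an S x E input (stride 1, no padding).

    Each output value is the sum of element-wise products between the
    kernel and the C adjacent input rows it currently covers.
    """
    c, e = len(kernel), len(kernel[0])
    out = []
    for i in range(len(x) - c + 1):
        total = bias
        for j in range(c):
            for k in range(e):
                total += x[i + j][k] * kernel[j][k]
        out.append(total)
    return out


# Toy input: S = 5 positions, E = 2 embedding dimensions.
x = [[1, 0], [0, 1], [1, 1], [2, 0], [0, 2]]
# One hypothetical filter covering 3 adjacent positions.
kernel = [[1, 0], [0, 1], [1, 1]]

print(conv_valid(x, kernel))  # [4.0, 3.0, 3.0]
```

A layer with K filters would repeat this computation K times with different kernels, producing K output channels.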
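As an illustration of the pooling and flattening steps, the following sketch applies non-overlapping max-pooling along the sequence axis and then concatenates the result. The toy sizes are hypothetical, and dropping trailing positions that do not fill a complete window is an assumption of this sketch.

```python
def max_pool(feature_map, size):
    """Non-overlapping max-pooling along the sequence axis, per channel.

    feature_map is a list of positions, each a list of channel values.
    Trailing positions that do not fill a whole window are dropped.
    """
    pooled = []
    for start in range(0, len(feature_map) - size + 1, size):
        window = feature_map[start:start + size]
        pooled.append([max(channel) for channel in zip(*window)])
    return pooled


def flatten(pooled):
    """Concatenate pooled rows into a single one-dimensional vector."""
    return [value for row in pooled for value in row]


# Toy feature map: 5 positions, 2 channels; pooling window of 2.
fmap = [[1, 9], [4, 2], [3, 3], [0, 8], [7, 7]]
pooled = max_pool(fmap, 2)

print(pooled)           # [[4, 9], [3, 8]]
print(flatten(pooled))  # [4, 9, 3, 8]
```

The flattened vector is what a fully connected layer would receive as input.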
The data statistics are summarized in Table 1, where each data instance is defined as a byte-level stream object. As shown in Figure 9, multiple byte streams are collected from each data file, and each stream is manually labelled with a positive (malicious) or a negative (benign) tag. The total number of originally collected byte sequences was 4,371, but we randomly downsampled the negative instances to 989.
Note that the byte streams have different lengths, as described in Figure 10, whereas the proposed network takes input of a fixed length. As the network is designed to grasp high-level patterns from collectable spatial clues of malicious actions, it is not necessary to use a whole sequence of arbitrary length at a time. Instead, we padded short sequences and truncated long sequences, so that all sequences have the same length of 1,000. Once the network is trained with these fixed-length sequences, it can be applied to the byte sequences of a target file divided into pieces of the same size, so that it predicts whether the target file contains malware.
We compared the proposed network with several representative classification methods: support vector machine, decision tree, naïve Bayes, and random forest. A brief description and the parameter settings of the methods are summarized in Table 2.
We also examined various settings for our network, and the best setting was found to be as follows: (1) the embedding dimension E = 25, (2) C1 and C2 are both set to 3, (3) pooling layer dimension P = 100, (4) K1 and K2 are 32 and 64, respectively, and (5) fully connected layer dimension F = 128, where the notations can be found in Figure 7. The two convolutional layers have strides (1, 25) and (1, 1), respectively, and the pooling layer has a nonoverlapping stride (100, 1). The sequence length S is fixed to 1,000. Every layer uses the rectified linear unit (ReLU) as its activation function, except for the output layer, which uses a softmax function. The cost function is defined as the cross entropy over all nodes of the output layer. Consequently, the total number of trainable parameters is 89,371.
We adopt the Adam optimizer with an initial learning rate of 0.001 to train our network. Our training recipe is as follows: (1) L2 gradient-clipping with 0.5, (2) drop-out with a keeping probability of 0.25 for the fully connected layer, and (3) batch normalization for the two convolutional layers. We also tried regularization methods such as L2 regularization and DeCov but observed no performance improvement, as batch normalization and drop-out are known to have a regularization effect themselves. The weight matrices of the convolutional layers, the FC layer, and the output layer are initialized by He’s method, and the bias vectors are initialized with zeros. In the following subsection, we demonstrate the performance of our network by experimental comparison with the other classifiers and different networks.
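The fixed-length preparation described above might look like the following sketch. The source does not say which value is used for padding; the index 256 (outside the 0-255 byte range) is an assumption of this example, and the function names are hypothetical.

```python
SEQ_LEN = 1000
PAD = 256  # assumption: a padding index outside the 0-255 byte range


def fix_length(stream, length=SEQ_LEN, pad=PAD):
    """Truncate a long byte stream or right-pad a short one to `length`."""
    ids = list(stream[:length])
    return ids + [pad] * (length - len(ids))


def split_chunks(stream, length=SEQ_LEN, pad=PAD):
    """Divide a stream into consecutive fixed-length pieces for inference."""
    return [fix_length(stream[i:i + length], length, pad)
            for i in range(0, len(stream), length)]


short = fix_length(b"\x01\x02\x03")
long_ = fix_length(bytes(2500))     # 2,500 zero bytes, truncated
chunks = split_chunks(bytes(2500))  # last chunk is padded

print(len(short), short[:5])    # 1000 [1, 2, 3, 256, 256]
print(len(long_), len(chunks))  # 1000 3
```

Training would use `fix_length` on each labelled stream, while `split_chunks` corresponds to applying the trained network to same-sized pieces of a target file.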
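As a sanity check on the reported parameter count, the arithmetic can be worked through layer by layer. The source does not state the embedding table size, the padding mode of the convolutions, or whether batch normalization parameters are counted; the sketch below adopts one set of assumptions (257 embedding rows, no convolution padding, a trainable scale and shift per batch-normalized channel, two output nodes) under which the total happens to match 89,371. Other configurations may also be possible.

```python
def conv_params(kernel_h, kernel_w, in_ch, out_ch):
    """Weights plus one bias per output channel."""
    return kernel_h * kernel_w * in_ch * out_ch + out_ch


def dense_params(n_in, n_out):
    """Weight matrix plus one bias per output node."""
    return n_in * n_out + n_out


# Settings reported in the text.
S, E, C1, C2, P, K1, K2, F = 1000, 25, 3, 3, 100, 32, 64, 128

# Assumptions not stated in the source.
VOCAB = 257     # 256 byte values plus one padding index
N_CLASSES = 2   # benign / malicious output nodes

len_conv1 = S - C1 + 1          # 998, convolution without padding
len_conv2 = len_conv1 - C2 + 1  # 996
len_pool = len_conv2 // P       # 9, incomplete last window dropped

counts = {
    "embedding": VOCAB * E,
    "conv1": conv_params(C1, E, 1, K1),
    "conv2": conv_params(C2, 1, K1, K2),
    "batch_norm": 2 * K1 + 2 * K2,  # scale and shift per channel
    "fc": dense_params(len_pool * K2, F),
    "output": dense_params(F, N_CLASSES),
}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name:>10}: {n}")
print(f"{'total':>10}: {total}")  # total: 89371
```

Most of the parameters (73,856) sit in the fully connected layer, which shows how strongly the pooling size controls the model size.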
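Two items of the training recipe, gradient clipping and He initialization, are simple enough to sketch directly. The code below is an illustration under assumptions: it clips a single flat gradient vector by its L2 norm (the source does not say whether clipping is applied per tensor or globally), and it draws He-initialized weights from a normal distribution with standard deviation sqrt(2 / fan_in). The function names are hypothetical.

```python
import math
import random


def clip_by_l2_norm(grad, max_norm=0.5):
    """Rescale grad so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def he_normal(fan_in, fan_out, rng):
    """He initialization: zero-mean normal with std sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]


grad = [3.0, 4.0]                # L2 norm is 5.0
clipped = clip_by_l2_norm(grad)  # direction kept, norm reduced to 0.5
clipped_norm = math.sqrt(sum(g * g for g in clipped))
print(round(clipped_norm, 6))    # 0.5

weights = he_normal(576, 128, random.Random(0))
print(len(weights), len(weights[0]))  # 576 128
```

Clipping preserves the direction of the update and only limits its size, which is why it tends to stabilize training without changing what the optimizer is trying to do.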
One unfortunate aspect of the field of malware detection is that, for various reasons, there is no widely available public dataset. Datasets that are easily obtainable from public sources are often not of sufficient quality, so previous studies could not compare performance (i.e., accuracy) across works because of different data characteristics and labelling procedures. For the same reason, it is inevitably hard to compare our performance with other state-of-the-art studies. We therefore compare our network with several representative machine-learning methods and CNN models with different settings. The measurement is conducted using 10-fold cross validation, where the performance values (i.e., F1 score, precision, and recall) are averages over three distinct trials.
The experimental results are described in Table 3, where the two values in each cell correspond to the ‘benign’ and ‘malicious’ classes, respectively. For example, the F1 scores of random forest (RF) are 96.4 for ‘benign’ and 96.1 for ‘malicious’. For the four traditional machine-learning models (i.e., DT, NB, SVM, and RF), the values of the input sequence are treated as nominal values. We tested five different CNN structures. The first network, Emb+Conv+Conv+Pool+FC, is the best structure; it has an embedding layer, two consecutive convolutional layers, a pooling layer, and a fully connected layer followed by an output layer. The number of epochs differs across networks according to the complexity (i.e., the number of layers and parameters) of the networks and the dataset size. For instance, the first network is trained for 30 epochs, whereas training of the second network requires 25 epochs.
Among the traditional machine-learning models, the support vector machine (SVM) exhibits the best F1 scores and random forest (RF) achieves comparable results. As also shown in Table 3, the five CNNs outperform the traditional machine-learning models. From the results of the five networks, we make two main observations. First, the embedding layer (Emb) seems to play a significant role in better representing byte sequences, as the second network, which has no embedding layer, exhibits much worse F1 scores than the other networks. Second, the stacked convolutional layers enable the interpretation of high-level patterns, and the optimal number of layers seems to be two. The third network, with a single convolutional layer, and the fifth network, with three convolutional layers, are worse than the first network. The fifth network in particular has the worst F1 score among the five networks. This indicates that more stacked layers are not always better than shallow networks.
The experimental results can be summarized in two aspects. First, the convolutional neural networks showed better performance than the traditional machine-learning models. The F1 score of the proposed network is almost 2% greater than that of the SVM, which can be explained by the convolutional neural networks having greater power to analyse the underlying spatial patterns of the byte sequences. Second, the embedding layer followed by two convolutional layers seems best suited to representing high-level patterns of malicious actions. Fewer or more stacked convolutional layers gave worse results.
Beyond these two aspects, we need to discuss the parameter settings for training. The results in Table 3 are obtained from the network trained with the dimensions and parameters described above. We found the optimal parameter setting by grid search, and some notable results with different settings and dimensions are summarized in Table 4. The first two rows correspond to the embedding size E, and the last two rows are associated with the pooling size P. The remaining rows are related to the training recipe, such as drop-out, gradient-clipping, and batch normalization. Drop-out together with batch normalization helped the network to generalize, and we observed no improvement with additional regularization methods. Gradient-clipping made the network more robust, as it prevented the optimizer from overshooting desirable points of the gradient space.
We also checked the training time of the compared methods. Table 5 shows the elapsed seconds for training the different methods. The four traditional methods are trained on a machine with a Xeon E5-2620 V4 and 128GB RAM, while the CNN models are trained on a machine with an i7-9700K, 64GB RAM, and two RTX 2080 Ti GPUs. The training time of the CNN models strongly depends on the performance of the GPUs. All methods are trained using the 1,978 instances, and the number of epochs is 30 for all CNN models. Note that the fourth CNN model, which has no pooling layer, takes much longer than the other three CNN models. The reason is that the pooling layer reduces the size of the feature maps, so without it the fourth CNN model has a greater number of parameters to train. Except for the SVM, the traditional models seem computationally more efficient than the CNN models. One reason for the slow training of the SVM is probably the use of the polynomial kernel function. We may use another kernel function (e.g., a linear kernel), of course, but it will probably degrade accuracy. As the training of random forest (RF) is much faster than that of the CNN models, the RF model might be preferred if we want efficiency with a small loss of effectiveness (i.e., accuracy).
The threat of malicious documents keeps growing, because it is difficult to detect the malicious actions within the documents. In this work, we proposed a new convolutional neural network designed to take a byte sequence of a nonexecutable as input and predict whether the given sequence contains malicious actions. We illustrated how we designed the network according to the characteristics of the input data and discussed the experiments using the manually labelled dataset. The experimental results showed that the proposed network outperforms several representative machine-learning models as well as other convolutional neural networks with different settings. Although we conducted experiments only with PDF files, we expect that this approach is applicable to other types of data if they contain byte streams. Therefore, as future work, we will collect data of other file types (e.g., .rtf files) and perform further investigation.
We make our dataset and code publicly available (https://sites.google.com/view/datasets-for-public).
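Because precision, recall, and F1 are reported separately for the ‘benign’ and ‘malicious’ classes, it may help to show how the per-class values are obtained. The sketch below uses made-up labels and predictions purely for illustration; the function name `prf` is hypothetical.

```python
def prf(y_true, y_pred, positive):
    """Precision, recall, and F1 when `positive` is the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up labels for five byte streams.
y_true = ["benign", "benign", "malicious", "malicious", "malicious"]
y_pred = ["benign", "malicious", "malicious", "malicious", "benign"]

for label in ("benign", "malicious"):
    p, r, f = prf(y_true, y_pred, label)
    print(f"{label}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
# benign: precision=0.500 recall=0.500 f1=0.500
# malicious: precision=0.667 recall=0.667 f1=0.667
```

In a 10-fold cross validation, these values would be computed on each held-out fold and then averaged.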
Conflicts of Interest
The authors declare that they have no conflicts of interest.
This work was supported by the Soonchunhyang University Research Fund (no. 20170265). This work was also supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2017R1D1A3B03036050).
T. T. Um, F. M. J. Pfister, D. Pichler et al., “Data augmentation of wearable sensor data for Parkinson’s disease monitoring using convolutional neural networks,” in Proceedings of the 19th ACM International Conference on Multimodal Interaction, pp. 216–220, ACM, Glasgow, Scotland, 2017.
Z. M. Kim, Y. S. Jeong, H. R. Oh et al., “Investigating the impact of possession-way of a smartphone on action recognition,” Sensors, vol. 16, no. 6, pp. 1–5, 2016.
Y. Kim, “Convolutional neural networks for sentence classification,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1746–1751, Association for Computational Linguistics, Doha, Qatar, October 2014.
A. Hannun, C. Case, and J. Casper, “Deep speech: scaling up end-to-end speech recognition,” Computing Research Repository, pp. 1–12, 2014.
J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, IEEE, Las Vegas Valley, NV, USA, July 2016.
N. Šrndić and P. Laskov, “Hidost: a static machine-learning-based detector of malicious files,” EURASIP Journal on Information Security, vol. 2016, no. 1, p. 22, 2016.
B. Cuan, A. Damien, C. Delaplace et al., Malware Detection in PDF Files Using Machine Learning [PhD. Thesis], REDOCS, 2018.
C. Smutz and A. Stavrou, “Malicious PDF detection using metadata and structural features,” in Proceedings of the 28th Annual Computer Security Applications Conference, pp. 239–248, Orlando, Fla, USA, December 2012.
M. Li, Y. Liu, M. Yu et al., “FEPDF: a robust feature extractor for malicious PDF detection,” in Proceedings of the 2017 IEEE Trustcom/BigDataSE/ICESS, pp. 218–224, IEEE, Sydney, Australia, August 2017.
J. C. Platt, “Sequential minimal optimization: a fast algorithm for training support vector machines,” in Advances in Kernel Methods – Support Vector Learning, MIT Press, Cambridge, Mass, USA, 1998.
S. J. Khitan, A. Hadi, and J. Atoum, “PDF forensic analysis system using YARA,” International Journal of Computer Science and Network Security, vol. 17, no. 5, pp. 77–85, 2017.
J. Zhang, “MLPdf: an effective machine learning based approach for PDF malware detection,” Security and Cryptography, 2018.
B. Kolosnjaji, A. Zarras, G. Webster et al., “Deep learning for classification of malware system call sequences,” Lecture Notes in Computer Science, vol. 9992, pp. 137–149, 2016.
S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
W. Huang and J. W. Stokes, “MtNet: a multi-task neural network for dynamic malware classification,” Lecture Notes in Computer Science, vol. 9721, pp. 399–418, 2016.
E. Raff, J. Sylvester, and C. Nicholas, “Learning the PE header, malware detection with minimal domain knowledge,” in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 121–132, ACM, Dallas, TX, USA, 2017.
J. Saxe and K. Berlin, “Deep neural network based malware detection using two dimensional binary program features,” in Proceedings of the 10th International Conference on Malicious and Unwanted Software, pp. 11–20, IEEE, Fajardo, PR, USA, October 2015.
Q. Le, O. Boydell, B. Mac Namee, and M. Scanlon, “Deep learning at the shallow end: Malware classification for non-domain experts,” Digital Investigation, vol. 26, pp. S118–S126, 2018.
E. Raff, J. B. Barker, J. Sylvester et al., “Malware detection by eating a whole EXE,” in Proceedings of the Workshops of the Thirty-Second AAAI Conference on Artificial Intelligence, pp. 268–276, New Orleans, LA, USA, 2018.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2323, 1998.
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proceedings of the Advances in Neural Information Processing Systems 25, Lake Tahoe, NV, USA, December 2012.
C. Szegedy, W. Liu, Y. Jia et al., “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, IEEE, Boston, MA, USA, June 2015.
G. Huang, Z. Liu, L. Maaten et al., “Densely connected convolutional networks,” in Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, pp. 2261–2269, IEEE, Honolulu, HI, USA, July 2017.
B. E. Boser, I. M. Guyon, and V. N. Vapnik, “Training algorithm for optimal margin classifiers,” in Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144–152, ACM, Pittsburgh, PA, USA, July 1992.
L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
D. P. Kingma and J. L. Ba, “Adam: a method for stochastic optimization,” in Proceedings of the 3rd International Conference for Learning Representations, San Diego, Calif, USA, 2015.
N. Srivastava, G. Hinton, A. Krizhevsky et al., “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
S. Ioffe and C. Szegedy, “Batch normalization: accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning, pp. 448–456, ACM, Lille, France, July 2015.
M. Cogswell, F. Ahmed, R. Girshick et al., “Reducing overfitting in deep networks by decorrelating representations,” in Proceedings of the 4th International Conference for Learning Representations, San Juan, PR, USA, 2016.
K. He, X. Zhang, S. Ren et al., “Delving deep into rectifiers: surpassing human-level performance on imagenet classification,” in Proceedings of the 15th IEEE International Conference on Computer Vision, pp. 1026–1034, Santiago, Chile, December 2015.