Abstract

With the rapid increase in the amount and variety of malware, traditional methods of malware detection and family classification for IoT applications based on static and dynamic analysis have been greatly challenged. In this paper, a new, simple, and effective attention module for Convolutional Neural Networks (CNNs), named the Depthwise Efficient Attention Module (DEAM), is proposed and combined with DenseNet to build a new malware detection and family classification model. Motivated by the strong performance of DenseNet in image classification and the visual similarity among malware samples of the same family, the gray-scale image transformed from a malware binary is fed into the model combining the DEAM and DenseNet for malware detection, and family classification is then carried out. The DEAM is a general lightweight attention module improved on the basis of the Convolutional Block Attention Module (CBAM), which strengthens the attention to the characteristics of malware and improves the model's effect. We use the MalImg dataset, the Microsoft malware classification challenge dataset (BIG 2015), and our own dataset constructed from the two above-mentioned datasets to verify the effectiveness of the proposed model in family classification and malware detection. Experimental results show that the proposed model achieves 99.3% accuracy for malware detection on our dataset and 98.5% and 97.3% accuracy for family classification on the MalImg and BIG 2015 datasets, respectively. The model can reliably detect IoT malware and classify its families.

1. Introduction

Malware is a kind of software program designed to access a computer system and perform useless or harmful operations. It includes viruses, worms, Trojan horses, adware, spyware, ransomware, and other types. Such software can obtain confidential data, steal identities, hijack traffic and operating systems, encrypt digital assets, and monitor users, posing threats to both users and operating systems. Malware constantly challenges the network security situation with its ever-increasing growth rate and endless family variants. According to the statistics of the "Malware Threat Situation Report 2020" [1] released by Malwarebytes Labs, the detection of Windows malware on business endpoints increased by 13% in 2019. Malware detection and family classification technology therefore remains a research direction that cannot be ignored. Similarly, Internet of Things (IoT) devices built on different processor architectures have increasingly become targets of attacks. Although there are many ways to detect malware on the Internet of Things [2, 3], further efforts are still needed in this field.

Traditional malware detection and family classification rely on two kinds of malware analysis techniques: static analysis and dynamic analysis. Static analysis disassembles executable programs and analyzes and extracts the characteristic information of the code without executing the malware. In [4], sequential pattern mining is used to detect the maximal frequent patterns (MFP) of opcode sequences for malware detection in the Internet of Things. In [5], behavior sequence chains of some malware families are generated, and the similarity between these chains and the sequence of the target process is calculated to detect and classify malware. In [6], malware is identified by combining normalized compression distance (NCD) with the compressibility rates of executables using decision forests. However, static analysis may waste a lot of time on useless code because the code analyzed is not necessarily the code that is finally executed. At the same time, the reliance of static analysis on disassembly means that malware can use various obfuscation techniques to hinder disassembly analysis; some malware makes reverse engineering more complex through encryption, packing, and so on, which increases the difficulty of static analysis. Dynamic analysis extracts feature information during code execution, so the analyzed code is the actually executed code. In [7], malicious artifacts are extracted from memory through memory forensics, and malware detection is performed by combining the extracted artifacts with the features obtained when executing malware files during dynamic analysis. In [8], obfuscated malware is detected through proper hook installation and accurate measurement of malware activity time in user and kernel mode. In [9], a graph repartitioning algorithm that uses the N-order subgraph (NSG) to convert API call graphs into fragment behaviors is proposed for malware detection and family classification; in addition, the term frequency-inverse document frequency (TF-IDF) and information gain (IG) are improved and used to extract the crucial N-order subgraph (CNSG). However, dynamic analysis of one execution can only capture a single path of behavior, while some malware has multiple execution paths. At the same time, dynamic analysis carries certain risks because the program is actually executed. With the development of neural networks in recent years, static and dynamic analysis are often combined with neural networks for malware detection and family classification. In [10], a bigram model is used to represent opcodes, a frequency vector is used to represent API calls, and then a convolutional neural network and a backpropagation neural network are used to embed the opcode- and API-based features for malware detection and family classification. In [11], a classification method based on malware type was proposed according to malware behavior, and LSTM networks were applied to a new API-call-based dataset developed for the Windows operating system.

The main purpose of this paper is to combine the attention module and convolutional neural network to better perform malware detection and family classification. In our framework, we convert the malware samples into gray-scale images and then apply DenseNet with Depthwise Efficient Attention Module to the images. In this process, DEAM can generate feature attention maps to strengthen the attention to malware features, to improve the effectiveness of detection and family classification. The main contributions of our work are as follows.

This paper proposes a new general lightweight attention module, DEAM, which can be widely used to improve the performance of CNNs without noticeably increasing the amount of computation. It consists of an Improved Efficient Channel Attention (IECA) mechanism and a new spatial attention mechanism, Depthwise Spatial Attention (DSA). We obtain the IECA by replacing the SENet used in the channel attention structure of CBAM with ECA-Net, and we construct the DSA using Depthwise Convolution. We combine the DEAM and DenseNet for malware detection and family classification.

The proposed model performs well on the MalImg dataset, the BIG 2015 dataset, and the dataset built by us and can effectively perform malware detection and family classification.

The rest of the paper is organized as follows: Section 2 discusses related work on techniques such as visualization, CNN structures, and attention mechanisms in malware detection and classification. Section 3 presents our proposed model in detail. The performance of our algorithm is evaluated in Section 4. Section 5 summarizes our work and puts forward some suggestions for future work.

2. Related Work

2.1. Malware Visualization

In this paper, we use the method proposed in [12] to convert malware into gray-scale images. Each byte (8 bits, i.e., 2 hexadecimal digits) of the PE file is converted into one pixel whose value lies in the range [0, 255] (0: black, 255: white). The height of the image is determined by the size of the PE file, as shown in Table 1.
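As a minimal sketch of this conversion, the snippet below reads a PE file byte by byte, picks an image width from the file size, and writes the bytes out as a gray-scale image. The width thresholds are illustrative placeholders in the style of Table 1, not the paper's exact values.

```python
import numpy as np
from PIL import Image

def bytes_to_grayscale(pe_path, out_path):
    """Convert a PE file to a gray-scale image, one byte per pixel (0: black, 255: white).

    The width is looked up from the file size (the thresholds below are illustrative,
    following the style of Table 1); the other dimension grows with the file size.
    """
    data = np.frombuffer(open(pe_path, "rb").read(), dtype=np.uint8)

    size_kb = len(data) / 1024
    if size_kb < 10:      width = 32
    elif size_kb < 30:    width = 64
    elif size_kb < 60:    width = 128
    elif size_kb < 100:   width = 256
    elif size_kb < 200:   width = 384
    elif size_kb < 500:   width = 512
    elif size_kb < 1000:  width = 768
    else:                 width = 1024

    height = len(data) // width
    img = data[: height * width].reshape(height, width)   # drop the trailing partial row
    Image.fromarray(img, mode="L").save(out_path)
```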

Through malware images, we can find that the images of malware from the same family are visually similar, but there are large visual differences between different malware families. Besides, the difference also exists between benign software and malware, as shown in Figure 1. Converting malware into images can help us perform malware analysis. After being converted into images, different parts of the file can be easily distinguished so that we can find the functional parts of the malware.

Converting malware into images for detection and family classification has become a common practice in recent years. In [13], the memory data dump file was converted into a gray-scale image, and histogram of oriented gradients (HOG) features were extracted to effectively classify the malware. In [14], a new hybrid model based on image analysis was proposed, which uses similarity mining and a deep learning architecture to accurately identify and classify obfuscated malware. Inspired by the visual similarity between malware samples of the same family, a file-agnostic deep learning method was proposed for malware classification [15]; through a set of discriminative patterns extracted from the visualized images of the malware, the malware is effectively divided into multiple families. In [16], based on the visual similarity between malware in the same family, directly performing binary texture analysis on gray-scale images of malware executable files was suggested. This technique derives a new combination of second-order statistical texture features, based on first-order statistics and the gray-level co-occurrence matrix (GLCM), from the visualized malware to classify obfuscated and unbalanced malware.

2.2. Structure of the CNNs

Convolutional Neural Networks (CNNs) have greatly promoted the development of image classification with their excellent performance. Recently, in order to improve the performance of CNNs, researchers have made many changes in three aspects: depth, width, and cardinality. Starting from LeNet [17], the pioneering CNN, and especially after the breakthrough of AlexNet [18], CNN architectures have become deeper and deeper in order to achieve richer representations. VGGNet [19] proved that increasing the depth of the network can affect its final performance to a certain extent. ResNet [20] introduced shortcut connections to give the network a certain identity-mapping ability and to strengthen the correlation of gradients between layers. GoogLeNet [21] proved that width is another important factor in improving model performance. DenseNet [22] further deepened the idea of ResNet, applied shortcuts throughout the entire network, realized dense connections, and strengthened the connections between the features of each layer. Image classification based on DenseNet has recently been applied in various fields [23-25]. However, as models continue to be expanded in depth, width, and cardinality, their computational cost also increases. In order to achieve a better balance between performance and cost, building a universal bionic mechanism into the deep learning model is more promising than piling up more nonlinear layers.

2.3. Attention Mechanism

The attention mechanism is a deep learning technology that originated from the study of human vision and has been widely used in natural language processing [26, 27], recommendation systems, and image classification [28, 29]. It mimics the characteristics of the human visual system that selectively focuses on the salient parts, and improves the efficiency of the model by dynamically selecting important features. It can be found from the development in recent years that the attention mechanism has become a common method to enhance the effect of CNNs. The attention map obtained by the attention mechanism from CNNs shows specific areas, which represent the features being focused on.

SENet [30] first proposed an effective channel attention learning block and achieved good performance, proving that attention can improve the expressiveness of a network by enhancing important features and suppressing unnecessary ones. In [31], malware is converted into gray-scale images and then fed into a model combining SENet [30] and a CNN for malware analysis and family classification. After that, attention modules were developed in two directions: stronger feature aggregation and the combination of channel and spatial attention. GSoP [32] introduced second-order pooling to achieve more effective feature aggregation. CBAM [33] proposed a general attention module for CNNs, which uses max pooling and average pooling to aggregate features and applies the aggregated features to sequential channel attention (using SENet [30]) and spatial attention mechanisms. ECA-Net [34] improved SENet [30] following the ideas of avoiding dimensionality reduction and staying lightweight, improving the effect while reducing the number of parameters. ADCM [35] integrates dropout into the attention mechanism in a lightweight manner and improves CBAM [33]. In addition, many works use improved attention mechanisms to enhance the effect of CNNs [36, 37].

Based on the CBAM [33] framework, this paper improves the channel attention mechanism inside and creates a new spatial attention mechanism. A new general lightweight attention module called Depthwise Efficient Attention Module is proposed.

3. Proposed Model

In order to better perform malware detection and family classification, we propose a new method based on DenseNet and the attention mechanism. In this section, we introduce the proposed model in detail. The proposed model is composed of DenseNet and DEAM. Based on DenseNet-121, we construct a DenseNet suitable for the proposed model. DEAM is composed of Improved Efficient Channel Attention (IECA) and Depthwise Spatial Attention (DSA). First, we introduce the architecture of DenseNet. Then the proposed IECA and DSA are described, respectively. Finally, the entire flowchart of our model for malware detection and family classification is presented.

3.1. Structure of the DenseNet

The DenseNet model is a deep learning model developed from ResNet. In recent years, DenseNet has achieved excellent results in the field of image classification. The basic ideas of ResNet and DenseNet are similar; however, DenseNet establishes dense connections between each layer and all of its preceding layers and realizes feature reuse by concatenating features along the channel dimension. These properties allow DenseNet to achieve better performance than ResNet with fewer parameters and lower computational cost and to alleviate the gradient vanishing problem.

The DenseNet is mainly composed of DenseBlock and Transition layer. DenseBlock adopts a radical dense connection mechanism; that is, all layers are connected to each other. Specifically, each layer accepts the output from all the previous layers as its additional input, as shown in Figure 2.

In DenseBlock, the feature maps of each layer have the same size, and each layer is concatenated with all previous layers in the channel dimension. For a DenseBlock with L layers, there are a total of L(L + 1)/2 connections. The input of layer L is as follows:

x_L = H_L([x_0, x_1, ..., x_(L-1)]),

where [x_0, x_1, ..., x_(L-1)] denotes the concatenation of the outputs of layers 0 to L - 1 and H_L(·) represents a nonlinear transformation, a composite of Batch Normalization (BN), ReLU, Pooling, and Conv operations. In this paper, the common DenseNet-B structure is utilized, and the bottleneck layer is used to reduce the amount of calculation; that is, the structure BN + ReLU + 1 × 1 Conv + BN + ReLU + 3 × 3 Conv is adopted. Each layer in a DenseBlock outputs k feature maps after its final convolution, where k is the number of convolution kernels and is called the growth rate. If the number of channels entering the DenseBlock is k0, then the number of input channels of layer L is k0 + k(L - 1). Thus, as the number of layers in a DenseBlock increases, the number of input channels grows larger and larger.
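The sketch below illustrates this composite function H_L and the dense connectivity in Keras. It is a minimal sketch, not the paper's exact configuration; the 4k-filter width of the 1 × 1 bottleneck convolution is the usual DenseNet-B choice and is an assumption here.

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_layer(x, growth_rate_k):
    """One H_L of DenseNet-B: BN + ReLU + 1x1 Conv + BN + ReLU + 3x3 Conv.
    The 1x1 convolution (4k filters, the common DenseNet-B choice) limits computation;
    the 3x3 convolution outputs k feature maps (k = growth rate)."""
    y = layers.BatchNormalization()(x)
    y = layers.ReLU()(y)
    y = layers.Conv2D(4 * growth_rate_k, 1, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(growth_rate_k, 3, padding="same", use_bias=False)(y)
    # Dense connectivity: concatenate the new features with all previous features.
    return layers.Concatenate()([x, y])

def dense_block(x, num_layers, growth_rate_k):
    """Stack num_layers bottleneck layers; the channel count grows by k per layer."""
    for _ in range(num_layers):
        x = bottleneck_layer(x, growth_rate_k)
    return x
```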

Since the spatial size of the feature map remains unchanged after passing through a DenseBlock while the channel dimension keeps increasing, dimension reduction is necessary to reduce computational complexity. The Transition layer consists of a 1 × 1 convolution followed by 2 × 2 average (or max) pooling; its structure is BN + ReLU + 1 × 1 Conv + 2 × 2 AvgPooling. It connects two adjacent DenseBlocks and reduces the dimensionality of the DenseBlock output.
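Continuing the Keras sketch above, a Transition layer could look as follows; the compression factor of 0.5 is the standard DenseNet choice and is an assumption, since the text does not state it.

```python
def transition_layer(x, compression=0.5):
    """Transition layer: BN + ReLU + 1x1 Conv + 2x2 AvgPooling.
    The 1x1 convolution reduces the channel dimension (compression=0.5 is the
    usual DenseNet choice); the pooling halves the spatial resolution."""
    channels = int(x.shape[-1] * compression)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv2D(channels, 1, padding="same", use_bias=False)(x)
    return layers.AveragePooling2D(pool_size=2, strides=2)(x)
```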

Now the commonly used DenseNet frameworks are DenseNet-121, DenseNet-169, DenseNet-201, and DenseNet-264. The DenseNet in our proposed model is based on DenseNet-121. Table 2 shows a comparison between DenseNet-121 and the DenseNet in our proposed model.

3.2. Depthwise Efficient Attention Module

The DEAM we propose follows the framework of CBAM [33] and consists of two parts, IECA and DSA. For an input feature map F ∈ R^(C×H×W) (where C denotes the number of channels, H the height, and W the width), DEAM first calculates the relationships between the channels of the feature map through IECA to obtain a 1-dimensional channel attention map M_c, which focuses on the important features in the image. Then, DEAM calculates a 3-dimensional spatial attention map M_s of the feature map through DSA, which attends to the positions of the features in the image. The calculation process of the DEAM is as follows:

F' = M_c(F) ⊗ F,
F'' = M_s(F') ⊗ F',

where ⊗ denotes elementwise multiplication, F' is the output of IECA, and F'' is the final output of DEAM.

Following [33], in the DEAM, we connect the two attention mechanisms serially and place IECA first to get the best effect. Experimental comparisons show that the best effect is achieved when the DEAM is placed after the last DenseBlock of DenseNet. Since DenseNet connects all layers, the input of each layer is the superposition of all previous layers; adding a DEAM after every DenseBlock would repeatedly recompute attention over features that have already been weighted and create a lot of useless overhead. In addition, each DEAM focuses on different features: if a DEAM is placed in front of or behind every DenseBlock, they may interfere with each other and reduce the effect of the model. Figure 3 describes the process of computing each attention map, and the details of each attention mechanism are described below.

3.2.1. Improved Efficient Channel Attention

In the channel attention mechanism, we consider which features we should pay attention to. Each channel of the feature map is regarded as a feature detector [38]; however, not every channel is useful for image recognition. By computing weights for the different channels, channel attention is focused on the main features of the image. Therefore, through the channel attention mechanism, we can better extract the representative features of malware images and improve the efficiency of malware detection and family classification.

The attention mechanism has been widely used to improve the performance of CNNs, among which the most representative ones are SENet [30] and CBAM [33]. However, most attention mechanisms are dedicated to complicating themselves to achieve better performance. ECA-Net [34] improved the SENet model in a lightweight way without dimensionality reduction and proved the important role of avoiding dimensionality reduction for the attention mechanism. ECA-Net uses a 1-dimensional convolution to realize a local cross-channel interaction strategy, which reduces model complexity while improving performance. The formula is as follows:

ω = σ(C1D_k(y)),

where C1D_k denotes a 1-dimensional convolution with kernel size k, y denotes the aggregated channel descriptor, ω denotes the channel attention map, and σ denotes the Sigmoid function. Meanwhile, a method of adaptively selecting the size of the 1-dimensional convolution kernel is proposed to determine the coverage of local cross-channel interaction. The formula is as follows:

k = ψ(C) = |log2(C)/γ + b/γ|_odd,

where |t|_odd denotes the odd number closest to t, γ and b are set to 2 and 1, respectively, and C denotes the number of channels of the input feature map. This improvement significantly reduces the parameters of the channel attention mechanism and enhances the computational efficiency of the model.
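A small helper makes the adaptive kernel-size rule concrete; this is a sketch of the ECA-Net rule with γ = 2 and b = 1 as given above.

```python
import math

def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1-D convolution kernel size of ECA-Net:
    k = |log2(C)/gamma + b/gamma|, rounded to the nearest odd number."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1

# Example: a feature map with 1024 channels yields k = 5.
print(eca_kernel_size(1024))
```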

In this paper, we use ECA-Net to replace the SENet in the channel attention structure of CBAM so as to avoid local dimensionality reduction, and thus we obtain the IECA. To compute channel attention efficiently, we compress the spatial dimensions of the input feature map with max pooling and average pooling. The average pooling gathers spatial information, and the max pooling gathers distinctive object features. Two spatial context descriptors, F_avg^c and F_max^c, which represent the average-pooled feature and the max-pooled feature, are produced by the average pooling and max pooling, respectively. The two descriptors are combined into one feature vector by element summation, and the combined feature vector is fed into a 1-dimensional convolution. The kernel size k of the 1-dimensional convolution is obtained by the adaptive selection described above, and the Sigmoid function is applied to the output of the 1-dimensional convolution to obtain the 1-dimensional channel attention map. The formula is as follows:

M_c(F) = σ(C1D_k(AvgPool(F) + MaxPool(F))) = σ(C1D_k(F_avg^c + F_max^c)),

where + denotes element summation.
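A minimal Keras sketch of such a module is shown below, reusing the eca_kernel_size helper from the previous snippet. It follows the description above but is not the authors' implementation; layer details beyond the text are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

class IECA(layers.Layer):
    """Sketch of the Improved Efficient Channel Attention described above:
    average-pooled and max-pooled channel descriptors are summed and passed
    through a 1-D convolution of adaptive kernel size k, then a Sigmoid."""

    def build(self, input_shape):
        self.channels = int(input_shape[-1])
        k = eca_kernel_size(self.channels)   # adaptive kernel size (gamma = 2, b = 1)
        self.conv1d = layers.Conv1D(1, kernel_size=k, padding="same", use_bias=False)

    def call(self, x):
        avg = tf.reduce_mean(x, axis=[1, 2])              # F_avg^c, shape (B, C)
        mx = tf.reduce_max(x, axis=[1, 2])                # F_max^c, shape (B, C)
        y = tf.expand_dims(avg + mx, -1)                  # element summation, (B, C, 1)
        attn = tf.sigmoid(self.conv1d(y))                 # channel attention map M_c
        attn = tf.reshape(attn, (-1, 1, 1, self.channels))
        return x * attn                                   # reweight the input features
```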

3.2.2. Depthwise Spatial Attention

Different from the channel attention mechanism, the spatial attention mechanism pays attention to which positions on the feature map are meaningful, and it is a complement to the channel attention mechanism. The spatial attention mechanism calculates the importance of different positions on the feature map and focuses on the meaningful ones. Therefore, through the spatial attention mechanism, we can better extract the representative features of malware images and improve the efficiency of malware detection and family classification.

The spatial attention mechanism of CBAM [33] first compresses the feature map with max pooling and average pooling along the channel axis [39] and then concatenates their outputs to generate an effective feature descriptor. A convolution is applied to the feature descriptor to generate a 2-dimensional spatial attention map. The DSA in this paper is a new spatial attention mechanism constructed based on the idea of avoiding dimensionality reduction in ECA-Net [34]. In ECA-Net, it has been proved that avoiding dimensionality reduction is very important for the attention mechanism. DSA uses Depthwise Convolution to calculate a 3-dimensional spatial attention map of the feature map without dimensionality reduction. Depthwise Convolution is a special form of Group Convolution in which the number of groups is equal to the number of channels: it divides the input features into groups according to the number of channels and convolves each group separately. In IECA, we only replace the SENet [30] part to avoid local dimensionality reduction in the channel attention mechanism. In DSA, we construct a new spatial attention mechanism through Depthwise Convolution, abandoning the dimensionality reduction of the max pooling and average pooling along the channel axis used in CBAM, and thus achieve the effect of no dimensionality reduction. Depthwise Convolution can obtain a prominent information area from each channel, which is more comprehensive than applying the pooling operation along the channel axis [39]. The detailed operation is described below.

We apply Depthwise Convolution to the input feature map and use the Sigmoid function on the output feature descriptor to obtain a 3-dimensional spatial attention map. The formula is as follows:

M_s(F) = σ(DepthwiseConv2D(F)),

where DepthwiseConv2D denotes Depthwise Convolution and σ denotes the Sigmoid function.
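The following is a minimal Keras sketch of such a module; the 3 × 3 depthwise kernel size is an assumption, since the text does not fix it.

```python
import tensorflow as tf
from tensorflow.keras import layers

class DSA(layers.Layer):
    """Sketch of the Depthwise Spatial Attention described above: a depthwise
    convolution produces one spatial map per channel (no channel-axis pooling,
    hence no dimensionality reduction), followed by a Sigmoid."""

    def __init__(self, kernel_size=3, **kwargs):
        super().__init__(**kwargs)
        # kernel_size = 3 is an assumption; the text does not state the kernel size.
        self.dwconv = layers.DepthwiseConv2D(kernel_size, padding="same", use_bias=False)

    def call(self, x):
        attn = tf.sigmoid(self.dwconv(x))   # 3-dimensional spatial attention map M_s
        return x * attn                     # elementwise reweighting of the input
```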

3.3. The Process of Detecting and Classifying Malware

The whole process of detecting and classifying malware is described as follows. First, the PE file is converted into a gray-scale image using the method in the literature [12]. Second, the converted gray-scale image is applied to a malware detection model, which consists of DEAM and DenseNet. The model is trained using the gray-scale images of known benign software samples and malware samples, as well as their corresponding labels. The trained detection model can effectively distinguish the malware from benign software. Then, the gray-scale image of malware is applied to the malware family classification model which consists of DEAM and DenseNet. The model is trained using gray-scale images of known malware samples and their labels representing the family of each malware sample. The trained family classification model can effectively identify malware families. The whole process is shown in Figure 4.
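To show how the pieces above could fit together, the sketch below assembles DenseBlocks, Transition layers, and the DEAM (IECA followed by DSA after the last DenseBlock) into a classifier. It reuses the dense_block, transition_layer, IECA, and DSA sketches from earlier; the stem, growth rate, and DenseNet-121-style block sizes (6, 12, 24, 16) are illustrative assumptions, since Table 2 is not reproduced here.

```python
def build_model(num_classes, input_shape=(192, 192, 1), growth_rate_k=12):
    """Minimal sketch of the overall model: DenseBlocks and Transition layers,
    DEAM after the last DenseBlock, then global pooling and a softmax classifier."""
    inputs = layers.Input(shape=input_shape)
    x = layers.Conv2D(2 * growth_rate_k, 7, strides=2, padding="same", use_bias=False)(inputs)
    x = layers.MaxPooling2D(3, strides=2, padding="same")(x)

    for num_layers in (6, 12, 24):                 # illustrative DenseBlock sizes
        x = dense_block(x, num_layers, growth_rate_k)
        x = transition_layer(x)
    x = dense_block(x, 16, growth_rate_k)          # last DenseBlock

    x = DSA()(IECA()(x))                           # DEAM: channel attention, then spatial
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

# num_classes = 2 gives a detection model; num_classes = 25 (MalImg) or 9 (BIG 2015)
# gives the corresponding family classification model.
detector = build_model(num_classes=2)
```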

4. Experiments

4.1. Datasets and Evaluation Criterion

The datasets used for the evaluation of the classification results of malware families in this article are the MalImg dataset from [12] and the BIG 2015 dataset provided by Microsoft for the Big Data Innovators Gathering Anti-Malware Prediction Challenge. The MalImg dataset is a large-scale unbalanced Windows malware gray-scale image dataset which contains 25 malware families and a total of 9339 malware gray-scale image samples, as shown in Table 3. The malware families in MalImg dataset include worm, Trojan horse, backdoor, and rogue software.

We only use the labeled training set of the BIG 2015 dataset, which contains 9 malware families and a total of 10868 malware samples, as shown in Table 4. Each sample of the dataset has a hexadecimal representation of its binary content and its corresponding assembly file. Both the MalImg dataset and the BIG 2015 dataset are benchmark datasets used in many recent works.

Due to the lack of public datasets for detection, we use our own constructed dataset for indirect comparison with the work of others. We merged the MalImg dataset and the BIG 2015 dataset and randomly selected the same number of malware samples from each of the 34 malware families in the merged set. Then, we constructed a 1 : 1 detection dataset with the 1087 extracted malware samples and 1087 collected benign software samples. The diversity of malware families in the dataset ensures the generalization ability of the detection model and avoids the loss of generalization performance caused by overfitting when training with unbalanced data.

In order to test the generalization performance of our model, we divided the dataset into a training set, a validation set, and a test set at a ratio of 6 : 2 : 2 and repeated each experiment 5 times to reduce experimental errors. The training set is used to train the model, the validation set is used to tune the model, and the test set is fed into the trained model to evaluate its performance. Our experiments are based on the TensorFlow 2.0 framework, with the Adam optimizer and the categorical_crossentropy loss. Besides accuracy, metrics such as precision, recall, and F1 score are also used as performance indicators to select the best model for detection and family classification. This is because, when there is an imbalance between classes, the accuracy only reflects the overall prediction level: it ignores the prediction ability for minority classes and can remain high even when minority or key classes are misclassified. The precision is relative to the prediction results and indicates how many of the samples predicted as positive are actually positive. The recall is relative to the samples, that is, how many positive samples are correctly predicted. The F1 score combines precision and recall.
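A minimal training sketch with this setup is shown below; x_train, y_train, x_val, y_val, x_test, and y_test are assumed to hold the gray-scale images and one-hot labels after the 6 : 2 : 2 split, and the epoch count and batch size are illustrative, as the text does not report them.

```python
model = build_model(num_classes=25)    # e.g., family classification on MalImg

model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Train on the 60% split, monitor on the 20% validation split.
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=50, batch_size=32)

# Evaluate once on the held-out 20% test split.
test_loss, test_accuracy = model.evaluate(x_test, y_test)
```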

The precision is calculated as follows:

Precision = TP / (TP + FP),

where TP is the number of true positives and FP is the number of false positives.

The recall is obtained as follows:

Recall = TP / (TP + FN),

where FN is the number of false negatives.

The F1 score is a weighted harmonic mean of precision and recall, as follows:

F1 = 2 × Precision × Recall / (Precision + Recall).

We derive these values directly for the binary classification task (detection). For the multiclass task (family classification), we obtain the precision, recall, and F1 score of each family separately. After that, the macro-precision, macro-recall, and macro-F1 are calculated by averaging the evaluation indicators over all families (the macro-average gives each family the same weight).
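The macro-averaging described here can be computed as in the sketch below, where y_true and y_pred are assumed to be integer class labels.

```python
import numpy as np

def macro_precision_recall_f1(y_true, y_pred, num_classes):
    """Per-family precision, recall, and F1, then macro-averaged
    (each family receives the same weight)."""
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p); recalls.append(r); f1s.append(f1)
    return np.mean(precisions), np.mean(recalls), np.mean(f1s)
```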

4.2. Malware Detection

We conduct malware detection experiments on the constructed dataset. Tables 5 and 6 show the obtained detection results in the form of a 2 × 2 confusion matrix, as well as the precision, recall, and F1 score values of each class. For the proposed model, the accuracy, precision, recall, and F1 score are all 99.3%. The DenseNet without DEAM achieves an accuracy of 99.0%, a precision of 99.1%, a recall of 99.0%, and an F1 score of 99.0% on our constructed dataset. It can be seen that DEAM brings almost no improvement in detection, but the numbers of wrongly predicted samples in the two experiments are only 3 and 4, respectively. Therefore, we believe that DenseNet itself already has high detection capability, and adding DEAM can hardly improve it further. Figure 5 gives an indirect comparison between our model and recent works [7, 10, 40, 41], which shows that the proposed model is superior in detection compared with existing methods.

4.3. Malware Family Classification
4.3.1. MalImg Dataset

Our model is trained by reducing the image size in the MalImg dataset to 192 × 192 pixels. A smaller image size cannot retain all the important information (that is, discriminative information about a family is lost), while a larger size only increases the calculation time without improving the overall accuracy. Tables 7 and 8 give the classification results in the form of a 25 × 25 confusion matrix, as well as the precision, recall, and F1 score values of each family. The proposed model achieves very good results: the accuracy is 98.5%, the precision is 96.9%, the recall is 96.6%, and the F1 score is 96.7%. These indicators show that the method achieves better classification and lower misclassification. On the MalImg dataset, DenseNet without DEAM has an accuracy of 97.9%, a precision of 95.5%, a recall of 94.7%, and an F1 score of 94.6%. Figure 6 shows the comparison between our model and recent work on the MalImg dataset. Experimental results show that our model has the same accuracy as that in [16], while the other performance indicators are slightly lower than those in [16]. Compared with other recent works [14, 15, 42], our model has improved performance in malware family classification and is robust to class imbalance.

The MalImg dataset contains many samples processed through obfuscation techniques such as packing and encryption. Among them, the Yuner.A, VB.AT, Malex.gen!J, Autorun.K, and Rbot!gen families use the same packing technique, UPX, which gives them similar structures and makes them difficult to distinguish. However, our model classifies Yuner.A with 100% accuracy; the F1 scores of Malex.gen!J and Rbot!gen are 97.4% and 99.9%, and the F1 scores of VB.AT and Autorun.K are 93.6% and 95.7%, respectively. Allaple encrypts the code section in several layers using a random key; our model classifies Allaple.A and Allaple.L with a rate of 100%. This shows that our model is robust to both packing and encryption. Meanwhile, Swizzor.gen!E and Swizzor.gen!I, which are variants of the same family, are also classified with an accuracy of 100%.

It can be seen from the comparison of Tables 7 and 8 that, compared with DenseNet without DEAM, our model slightly reduces the classification effect on a few categories; for example, the F1 score of Fakerean drops from 100% to 98.4%. However, it greatly improves the classification of Wintrim.BX, Lolyda.AA2, and Lolyda.AA3: the F1 score of Wintrim.BX rises from 75% to 88.1%, that of Lolyda.AA2 from 45.4% to 81.4%, and that of Lolyda.AA3 from 66.6% to 76%, thus improving the overall classification accuracy. This shows that the proposed DEAM improves the effect of CNNs. Besides, in the entire model, the original CBAM block has 4,935 parameters, while our DEAM has only 1,935, roughly one-third of the CBAM parameters. Relative to the 346,293 parameters of the entire DenseNet model, this adds almost no computational cost.

On the MalImg dataset, the model using CBAM misclassified Swizzor.gen!I as Obfuscator.AD at a rate of 100% across the 5 experiments, causing a significant drop in the classification effect and reducing the performance of DenseNet.

4.3.2. BIG 2015 Dataset

The BIG 2015 dataset is processed in a similar way to the MalImg dataset: we downsample the gray-scale images. Tables 9 and 10 give the classification results in the form of a 9 × 9 confusion matrix, as well as the precision, recall, and F1 score values of each family. The accuracy of our model on the BIG 2015 dataset is 97.3%, the precision is 95.3%, the recall is 95.4%, and the F1 score is 95.4%. DenseNet without DEAM has an accuracy of 96.3%, a precision of 94.2%, a recall of 91.4%, and an F1 score of 92.6% on this dataset. Figure 7 shows a comparison of our model with recent works [15, 43] on the BIG 2015 dataset. It can be seen from Figure 7 that our model improves the precision, recall, and F1 score compared with [15], which shows that our model can better classify malware families. Our DEAM and CBAM achieve basically the same classification effect on the BIG 2015 dataset; however, DEAM uses far fewer parameters than CBAM, which effectively improves computational efficiency. Based on the experimental results on the MalImg and BIG 2015 datasets, we conclude that the proposed DEAM has a better effect than CBAM. The comparison between Tables 9 and 10 shows that adding DEAM increases the classification effect of the model on multiple classes, with the F1 score on Simda in particular rising from 71.4% to 87.5%. This further verifies that DEAM improves the performance of CNNs. There is a certain gap between the results on the BIG 2015 dataset and those on the MalImg dataset; we attribute this to the larger texture differences between samples of the same family in BIG 2015. In this paper, we only use the global image and do not further process the gray-scale image of the malware.

5. Conclusion

This paper proposes a new lightweight and effective convolutional neural network attention module, DEAM, and combines it with DenseNet for malware detection and family classification. In the proposed method, executable files are first converted into gray-scale images and fed into the detection model, and the detected malware is then passed to the family classification model to distinguish different malware families. Experimental results show that the number of DEAM parameters is only about one-third of that of CBAM, so DEAM reduces the attention module parameters and improves the computational efficiency of the model. Besides, it outperforms CBAM, which helps to improve the performance of CNNs. The presented model performs well in malware detection and family classification and also shows robustness to code obfuscation and class imbalance problems.

Although the proposed method performs well in malware detection and family classification, it still needs improvement. For example, our method feeds the original gray-scale image of the malware directly into the model without any further processing of the image. In the future, we will explore these issues to further improve performance.

Data Availability

The MalImg dataset can be obtained from http://vision.ece.ucsb.edu/∼lakshman/malware_images/album/. The BIG 2015 dataset can be obtained from https://www.kaggle.com/c/malware-classification/data.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grants No. 61572170 and No. 61672206, Natural Science Foundation of Hebei Province of China under Grant No. F2019205163, Science Foundation of Returned Overseas of Hebei Province of China under Grant No. C2020342, Science Foundation of Department of Human Resources and Social Security of Hebei Province under Grant No. 201901028, and Natural Science Foundation of Hebei Normal University under Grant No. L072018Z10.