Abstract

Edge computing is a feasible solution for effectively collecting and processing data in industrial Internet of Things (IIoT) systems, and edge security is an important guarantee for edge computing. Fast and accurate classification of malicious code across the whole life cycle of edge computing is of great significance: it can effectively prevent malicious code from attacking wireless sensor networks and ensure the stable and secure transmission of data in smart devices. Because there is a large amount of code reuse within the same malicious code family, making the visual features of family members similar, many studies use visualization technology to assist malicious code classification. However, traditional malicious code visual classification schemes suffer from problems such as a single image source, weak deep-level feature extraction, and a lack of attention to key image details. Therefore, an innovative malicious code visual classification method based on a deep residual network and a hybrid attention mechanism for edge security is proposed in this study. Firstly, the malicious code visualization scheme integrates the bytecode file and assembly file of the malware and converts them into a four-channel RGBA image to fully represent malicious code feature information without increasing the computational complexity. Secondly, a hybrid attention mechanism is introduced into the deep residual network to construct an effective classification model, which extracts image texture features of malicious code along the channel and spatial dimensions to improve classification performance. Finally, the experimental results on the BIG2015 and Malimg datasets show that the proposed scheme is feasible and effective, can be widely applied to various malicious code classification problems, and achieves a higher classification accuracy than existing well-performing malicious code classification methods.

1. Introduction

In recent years, the rapid expansion of the Internet of Things (IoT) has given rise to the industrial IoT (IIoT). Edge computing, as a feasible solution for the efficient collection and processing of data in the IIoT, has received great attention from academia, industry, and government departments and has been widely used in industries such as power, transportation, manufacturing, and smart cities. With the continuous deepening of the industry's digital transformation, the evolution of the edge computing network architecture will inevitably lead to an increasing number of security attacks on edge computing nodes, and edge security issues have become one of the obstacles restricting the development of the edge computing industry [1]. Nowadays, a large number of smart devices and sensors constitute a vast wireless sensor network that can monitor, sense, and collect information from various monitored objects while computing and storing the resulting massive data. However, malware running on these ubiquitous sensors and smart devices can compromise data security and pose other potential threats to data and IIoT devices [2]. Currently, the rapid increase in the types and quantity of malicious code not only causes property and economic losses but also increasingly threatens national security [3]. For example, in May 2017, a ransomware program called WannaCry spread to more than 100 countries around the world; many universities were infected, and the attack severely affected large public service sectors such as airports, customs, and public security networks [4]. In view of the weak security protection mechanisms and limited computing resources of edge computing nodes, the detection and prevention of malicious code across the entire life cycle of edge computing is of great significance [5]. Malicious code classification is the key to preventing malicious code from running and improving information security, and it provides an important basis for malicious code detection, control, and removal.

Despite the continuous advancement of malicious code detection and classification technologies, malicious code has continually evolved to generate new variants that evade detection and are quickly copied and spread, resulting in frequent security incidents in recent years. In most cases, malicious code is generated or improved in an automated or semiautomated manner, and its core modules are reused during generation. Since the vast majority of new malicious codes derive from known malicious code mutations, there is generally less than a 2% code difference between malicious codes of the same family [6]. This gives information security researchers a basis for malicious code classification: different malicious code families can be detected and classified through visual similarity detection of malicious code core modules. The current mainstream detection and classification methods include static analysis [7] and dynamic analysis [8]. The former analyzes malicious code without executing the binary program and often fails to cope with packing and obfuscation technologies; the latter uses program debugging tools to track and observe malicious code as it executes and verifies the static analysis results according to the working process of the malware, but it is often inefficient and covers only a single execution path when dealing with large amounts of malicious code. Limited by computing power and resource consumption, traditional solutions perform poorly in variant similarity analysis of large-scale malicious code family samples. With the rapid development of deep learning technology and the increase in the types and quantity of malicious code, researchers have gradually begun to convert the malicious code classification problem into an image classification problem, and malicious code visualization schemes based on deep learning have become a current research hotspot [9–11].

Malicious codes of the same family share similar visual features, while those of different families differ, which can serve as the basis for malicious code detection and classification. Thus, a malicious code visualization scheme transforms the malicious code classification problem into an image classification problem and applies deep learning technology to solve it. Since different malicious code images reflect differences in code data structure and information volume, the generation method of malicious code images is very important for classification. Currently, most existing visualization schemes use only bytecode files or assembly files and convert them into grayscale or RGB images for classification [10]; some schemes additionally compute information entropy to enrich image information and improve classification accuracy [11]. However, these methods suffer from a single source of malicious code images and the high computational cost of the enhanced information, which increase classification difficulty and reduce classification accuracy to a certain extent. In addition, malicious code often resides in local regions of a program, manifesting as local image features. In malicious code visualization schemes, the commonly used convolutional neural network (CNN) attends more to global image features and does not consider the detailed features of key regions. Therefore, it is necessary to introduce an attention mechanism that assigns different weights to different regions of the image, so that the neural network can fully exploit the local detailed feature information of the malicious code image. In this way, key image features are extracted through the attention mechanism, improving the accuracy of subsequent malicious code detection and classification.

Based on the above analysis, this study proposes an innovative malicious code visualization classification method to further improve classification accuracy and efficiency, thereby supporting the detection and prevention of malicious code in edge computing. On the one hand, the scheme uses both the bytecode file and the assembly file of the malware to visualize the malicious code as an RGBA image without additionally calculating code information entropy, which remedies the defects of a single source of malicious code image information, insignificant image features, and excessive computation. On the other hand, a hybrid attention mechanism is combined with a deep residual network to build a more accurate classification model. The deep residual network improves classification accuracy while using shortcut connections to alleviate gradient disappearance, accelerate model convergence, and improve the model's discriminative ability. In particular, each residual unit adopts a hybrid attention mechanism to extract more critical deep features along the channel and spatial dimensions, further improving classification accuracy.

This study is organized as follows: Section 2 reviews the related work on malicious code classification. Section 3 introduces the core method, showing the detailed implementation of the proposed malicious code classification method from the malicious code visualization module and classification module, respectively. Section 4 presents the related experimental verification and performance analysis. The last section is the conclusion and future work.

2. Related Work

As mentioned above, edge security is an important guarantee for edge computing, and malicious code detection and prevention across the entire life cycle of edge computing is of great significance [1]. At present, malicious code visualization schemes have been developed on the basis of static and dynamic analysis, and researchers have conducted extensive exploration of classification methods based on malicious code visualization. The key to improving classification accuracy lies in extracting reasonable and effective feature images that represent the program features of the original malicious code as faithfully as possible.

Conti et al. [12] pointed out that malicious sample visualization can help security analysts quickly identify malicious code files. On this basis, Nataraj et al. [13] proposed a complete visual classification scheme for malicious samples, which mapped the malicious code to a grayscale image, extracted GIST features from it, and finally implemented malicious code classification with the k-nearest neighbor (KNN) algorithm. The same team [14] also pointed out that malicious code images of the same family have similar texture features, while those of different families differ considerably. Kornish et al. [15] found that appropriate improvements to the images can improve malware classification accuracy. Since then, malicious code visualization schemes have been enriched: the source of image information is no longer limited to bytecode files, and RGB images [16, 17] and RGBA images [18] have come into wide use. Wang et al. [16] divided the binary sequence of the malicious code file into RGB three-channel values and converted the malicious sample into an RGB image. Sun et al. [17] used ASCII character information and PE structure information to convert malicious samples into RGB images and used the VGG16 model to train on and predict malicious code images. Chen et al. [18] used the bytecode file and local information entropy to convert the malware into RGBA images with a larger information capacity, but this scheme increases the amount of calculation, and the image information source remains single. These image-based visualization schemes make up for the shortcomings of static analysis methods, which struggle with sample packing and obfuscation, and the long feature extraction time of dynamic analysis methods. However, most of them still follow Nataraj's grayscale scheme, using only bytecode files or assembly files, converting them into grayscale or RGB images for classification, or computing information entropy to enrich image information. There remain the problems of a single source of code images and high computational complexity: the feature information of malicious code images is not fully utilized, and classification accuracy and efficiency still need improvement.

As mentioned before, deep learning technology has powerful feature learning and expression abilities, giving it outstanding advantages in extracting global features and contextual information from images. It is widely used in classification and prediction problems across fields such as hyperspectral image classification, IIoT security, and malicious code classification and detection [19–22]. Cheng et al. [9] explored an ensemble interpretable framework for automatic and efficient malicious code detection based on a malware knowledge graph. Peng and Lu [23] proposed a discriminative extreme learning machine with supervised sparsity preserving (SPELM) and verified its effectiveness on four widely used image benchmark datasets. Pitolli et al. [24] proposed a novel approach for malware family identification based on an online clustering algorithm, which efficiently updates clusters as new samples arrive without rescanning the entire dataset. Cakir and Dogdu [25] used a shallow neural network based on the Word2Vec vector space model to represent malicious code and applied a gradient search algorithm to classify it. Turnip et al. [26] applied eXtreme Gradient Boosting (XGBoost) to identify Android malware types. Liu et al. [27] combined graph neural networks with expert knowledge to realize smart contract vulnerability detection. Choi [28] proposed a malicious PowerShell detection method using a GCN, which increased the detection rate of malicious PowerShell by approximately 8.2%. Wu et al. [29] proposed an attack-agnostic method based on cascaded self-supervised learning models [30] and achieved effective defense performance. With the development of IIoT technology [31, 32], more and more users are adopting smart mobile terminal devices. Jaigirdar et al. [33] proposed the Prov-IoT model to maintain the data security of IoT devices, and Zhou et al. [34] proposed a security defense system to protect the security of intelligent systems. However, the Android system is often attacked by malware due to its open-source nature; multimodal deep learning (MDL), which performs well in complex scenes, was chosen by Kim et al. [35] and Vasu and Pari [36] to detect Android malware. Ghouti and Imam [37] used principal component analysis (PCA) to extract the category and structural features of malicious code and then used an optimized SVM to achieve classification. However, due to the structural characteristics of deep neural networks, which focus on global features and ignore local details, emerging techniques need to be introduced to compensate for these structural defects, comprehensively extract malicious code features, and further improve classification accuracy.

Since attention modules in CNNs can focus on key information and improve the representation ability of convolution [38], more and more researchers [39] have introduced the attention mechanism into malicious code classification and detection. Yakura et al. [40] built an ACNN malicious code detection model combining a CNN and an attention mechanism to reduce the workload of analysts. Wang et al. [41] proposed a Depthwise Efficient Attention Module (DEAM) and combined it with a DenseNet to build a new malware detection and family classification model. However, these schemes did not study malicious code family classification in depth, and there is still considerable room for applying attention mechanisms in visualization-based malicious code classification schemes.

3. Malicious Code Visual Classification Method

3.1. Method Overview

In order to solve the above-mentioned problems in existing malicious code classification methods, this study proposes a malicious code visualization classification method based on a deep residual network and hybrid attention mechanism to achieve accurate and efficient classification of malicious code. The overall flowchart is shown in Figure 1, and the details are as follows:

(1) Malicious code visualization module: to compensate for the single source of malicious code image information, insignificant image features, and excessive computational complexity, a visualization scheme is proposed that combines malware bytecode files and assembly files into RGBA images to enrich image information. The bytecode file is converted into a grayscale image of a specified pixel size, and the assembly file is converted into an RGB image of the same size to facilitate subsequent fusion. The grayscale image values are then merged with the RGB image as the transparency channel to form an RGBA image, realizing the visualization of malicious code while enriching the effective information of the image without increasing the complexity of information calculation.

(2) Malicious code classification module: in order to fully consider the key features of malicious code images and further improve classification accuracy, a classification method combining a hybrid attention mechanism and a deep residual network is proposed. The method uses ResNet50 as the backbone network, since the residual network can raise accuracy through considerable network depth while its internal residual modules use shortcut connections to alleviate the gradient disappearance caused by increasing depth. A hybrid attention module, composed of a channel attention module and a spatial attention module, is added to the residual units of each convolution part of ResNet50 to improve the representation ability of the convolutional network. The two are combined to build a classification model, which is trained on the malicious code image dataset to finally realize effective classification of malicious code.

3.2. Malicious Code Visualization Module
3.2.1. Visual Problem Analysis

As the carrier of the malicious code file, an image encodes abundant code file information in each pixel, and different malicious code images reflect different malicious code data structures and amounts of information. For example, images of malicious code from the Kelihos_ver1, Vundo, and C2LOP.gen!g families are shown in Figure 2. It can be seen that the malicious code images corresponding to variants of the same family are visually similar, while those corresponding to variants of different families differ visibly. This difference in visual features shows that malicious code classification based on image similarity is feasible and effective; thus, the generation method of malicious code images is very important for malicious code classification.

Generally, bytecode is a compiled intermediate binary code that is independent of specific machine code and implementation platforms. Assembly code is a low-level, hardware-related instruction sequence compiled from the source code, with poor cross-platform portability but relatively high execution performance. Bytecode and assembly code reflect different information about the code, but there is a close correlation between them. As a low-level language, assembly code is highly expansive and verbose, so the assembly file of the same malicious code is longer and more informative than the bytecode file. Therefore, in order to comprehensively use image information to support effective and accurate classification, the malicious code visualization module innovatively uses both the bytecode file and the assembly file to convert the malicious code into an RGBA image for subsequent classification. Compared with a grayscale image with only one channel, an RGBA image has four channels, formed by adding a transparency channel to a three-channel RGB image. Consequently, the effective information amount of an RGBA image is 4 times that of a grayscale image, which can provide more comprehensive and accurate image features for subsequent detection. Furthermore, RGBA images not only carry more channel features but also effectively fuse bytecode and assembly files.

3.2.2. RGBA Image Generation

Considering the differences in the amount of information between the assembly file and the bytecode file and the composition of the RGBA image, the assembly file and the bytecode file are first converted into an RGB image and a grayscale image of the same size ( pixels), respectively. Then, the grayscale image values are used as the transparency channel and merged with the RGB image generated from the assembly file to form an RGBA image. The RGBA image contains 4 channels: the red, green, and blue color channels and the transparency channel. Each channel has 8 bits, giving 256 levels. The malicious code file is read as a binary data stream, and each 8-bit value ranges from 0 to 255, which exactly matches the range of each channel. In this way, the bytecode and assembly files are neither added to nor truncated, and after conversion and fusion, the bytecode and assembly features align in each local region of the RGBA image.

Based on the above analysis, suppose that the malware's .bytes file is .bytes = (..., 01101110, 10011100, 11010011, ...) = (..., 110, 156, 211, ...) and the .asm file is .asm = (..., 01101100, 10011101, 11010010, ...) = (..., R:108, G:157, B:210, ...). The RGBA image generation flowchart and algorithm are shown in Figure 3 and Algorithm 1, and the specific steps are as follows:

(1) Read the malware's .bytes file as a binary data stream, and convert every 8 bits into an unsigned integer. The value range of an 8-bit unsigned integer is 0~255, which exactly corresponds to the pixel gray value range 0~255. Generate a grayscale image with length : width = 1 : 1, and then scale the original grayscale image to the target-size gray image.

(2) Read the malware's .asm file as a binary data stream as well; every 8 bits corresponds to one of the R, G, and B values of a pixel. Generate an RGB image with length : width = 1 : 1, and then scale the original RGB image to the target-size RGB image.

(3) Use the gray values of the gray image as the transparency channel of the RGBA image, and merge them with the RGB image generated from the .asm file to form an RGBA image.

When the malicious code file is converted into an image, a zero-padding operation is performed instead of truncating part of the file content, which preserves the source integrity of the malicious code image information to a certain extent. The parameter of the resize() function is set to Image.ANTIALIAS, which performs high-quality compression so that image quality is not degraded when the image size changes. In this way, the RGBA image contains 4 channels, and the effective information it contains is 4 times that of a grayscale image, which can provide more potential malicious code features and effectively support the subsequent malicious code classification.

Input: The bytecode file and assembly file of the malicious code;
Output: The final training dataset RGBA images .
For each sample :
     Calculate the width of the image ;
     Calculate the gray value corresponding to each pixel, , form a grayscale image;
     Scale the original grayscale image to pixel size gray image .
End
For each sample :
    Calculate the width of the image ;
    Calculate the R, G, B value of each pixel, ; ; , form an RGB image;
    Scale the original RGB image to pixel size RGB image .
End
For each image and :
    The gray value of is used as the value of the RGBA image;
    Merged with to form an RGBA image .
End
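For concreteness, the following Python sketch illustrates the generation procedure of Algorithm 1 using NumPy and Pillow. It is a minimal illustration rather than the authors' implementation: the target size `size`, the square-image assumption, and the choice to read both files as raw binary streams are assumptions (in BIG2015, for instance, .bytes files are hex-dump text that would first need to be parsed into raw bytes).

```python
import numpy as np
from PIL import Image

def file_to_array(path, channels):
    """Read a file as a binary stream and shape it into a square image array."""
    data = np.frombuffer(open(path, "rb").read(), dtype=np.uint8)
    side = int(np.ceil(np.sqrt(len(data) / channels)))      # length : width = 1 : 1
    buf = np.zeros(side * side * channels, dtype=np.uint8)  # zero-pad rather than truncate
    buf[:len(data)] = data
    return buf.reshape(side, side, channels)

def to_rgba(bytes_path, asm_path, size=224):                # target pixel size assumed
    gray = Image.fromarray(file_to_array(bytes_path, 1)[:, :, 0], mode="L")
    rgb = Image.fromarray(file_to_array(asm_path, 3), mode="RGB")
    # High-quality resampling (Image.ANTIALIAS in older Pillow versions).
    gray = gray.resize((size, size), Image.LANCZOS)
    rgb = rgb.resize((size, size), Image.LANCZOS)
    rgba = rgb.copy()
    rgba.putalpha(gray)   # grayscale values become the transparency (A) channel
    return rgba
```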
3.3. Malicious Code Classification Module

After obtaining the malicious code image dataset through the above visualization module, the next step is to build an accurate malicious code classification model. In the malicious code classification module, we propose an innovative classification model based on the deep residual network ResNet50 [42] and the attention module structure of Woo et al. [38], combining a hybrid attention mechanism with the deep residual network to further improve classification accuracy. On the one hand, the deep network improves classification accuracy by increasing structural depth, while the residual structure effectively avoids gradient disappearance through shortcut connections. On the other hand, a hybrid attention mechanism is injected into the residual network to effectively capture the key features of malicious code images and assign different learning weights, so that the model learns the image features that deserve focus, further improving classification accuracy. Moreover, the attention mechanism adds few parameters and little computation, so it ensures the classification effect of the model without affecting classification efficiency.

The overall network architecture of the classification model Mcs-ResNet is shown in Figure 4 and contains 5 convolution parts (conv1~conv5). Among them, conv2_x, conv3_x, conv4_x, and conv5_x are formed by adding a hybrid attention module to the residual units of the corresponding convolution parts, ensuring the full integration of the hybrid attention module and the deep residual network to further enhance the mining of deep features. The detailed parameter information of the model is shown in Table 1. Here, a 50-layer ResNet with 3-layer bottleneck blocks is chosen as the base network for malicious code classification, so the model complexity is about 3.8 billion FLOPs (floating-point operations), comparable to the original ResNet50.

The convolutional layer implements feature extraction and feature mapping, weight sharing, and local connection of the input image through the convolution filters in the CNN. Generally, a convolution filter has multiple channels, and the filters of all channels perform feature extraction at the same time. For example, when the input image is $X \in \mathbb{R}^{H \times W \times C}$, that is, the image size is $H \times W$ and the number of channels is $C$, the convolution processing is given by

$$Y_j = \sum_{c=1}^{C} W_{j,c} * X_c + b_j$$

where $b_j$ represents the bias of the neural network and $W_{j,c}$ represents the weight of the $j$-th convolution filter on channel $c$, with a kernel size of $k \times k$.
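As an illustration, the first convolution of a ResNet-style network over a four-channel RGBA input can be written in PyTorch as follows. The 7×7/stride-2/64-filter configuration is the standard ResNet50 conv1; the 224×224 input size is an assumption.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 4, 224, 224)  # one RGBA image: C = 4 channels, H = W = 224 (size assumed)
conv1 = nn.Conv2d(in_channels=4, out_channels=64, kernel_size=7,
                  stride=2, padding=3, bias=True)  # each filter spans all 4 input channels
y = conv1(x)                     # y_j = sum_c W_{j,c} * x_c + b_j for each output channel j
print(y.shape)                   # torch.Size([1, 64, 112, 112])
```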

As shown in the lower part of Figure 4, the hybrid attention module is composed of a channel attention module and a spatial attention module to simultaneously obtain the channel feature weights and spatial feature weights of the malicious code image, thereby enhancing the obtained important features. After that, the enhanced features and the original input image features are connected through the shortcut connection structure to obtain the final output features. The channel attention module and the spatial attention module emphasize the special regions of the malicious code image to enhance the accuracy of malicious code image classification. The following describes the residual module and hybrid attention module and their combination in detail.

3.3.1. Residual Module

The deep learning model is usually composed of multiple layers, and its deep structure has powerful learning capabilities and efficient feature expression capabilities to automatically learn features from large amounts of data. It is widely used in image recognition, speech recognition, and other fields and has become an important part of computer vision technology. The network depth of a deep learning model determines whether it can extract deeper features, but as the depth continues to grow, network degradation and gradient disappearance arise. The residual network introduces a shortcut connection technique to solve these problems: the input is transferred across layers and added to the result of the convolution, forming an identity mapping, as shown in Figure 5(a). When the network input is $x$ and the learned residual is $F(x)$, the unit output is $H(x) = F(x) + x$; that is, the unit input and output are directly added and then activated by the ReLU activation function. This network structure adds no extra parameters, which facilitates subsequent network optimization and greatly improves training efficiency. Based on these characteristics of the residual network, the attention module is injected into the residual network to construct a residual attention network that exploits the advantages of both, as shown in Figure 5(b).

Here, the proposed Mcs-ResNet classification model uses the ResNet50 residual network as the backbone. ResNet50 is a deep residual network formed by adding a shortcut connection mechanism to a VGG19-style stacked architecture. In a traditional CNN, the network structure is directly stacked, which corresponds to a multiplicative composition of transformations; in the ResNet model, layers are connected through shortcut connections, and the composition changes from multiplication to addition. Feature calculation under this structure is more stable, so the original feature information in the malicious code image and the key feature information processed by the attention module flow to the next layer more stably, making malicious code image classification more efficient.

Based on the above analysis, the expression for an RGBA image processed by the residual module is as follows:

$$y = x + \mathcal{F}(x, M_c, M_s)$$

where $\mathcal{F}(\cdot)$ represents operations such as feature mapping, activation, and attention weighting; $M_s$ is the spatial attention weight; and $M_c$ is the channel attention weight. The specific calculations of $M_c$ and $M_s$ are described in the next section. At this point, the features of the RGBA image are not compressed, so the channel and spatial features can be learned more fully after the attention module is added. This ensures that more critical deep features in the two dimensions flow more stably to the next layer.
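A plain bottleneck unit of this kind can be sketched as follows; it is a simplified reading of the ResNet50 building block (channel widths illustrative, identity shortcut only), with the attention weighting deferred to the combined sketch in Section 3.3.3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """Plain ResNet50-style bottleneck: 1x1 -> 3x3 -> 1x1 convolutions plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        mid = channels // 4
        self.convs = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False), nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return F.relu(x + self.convs(x))  # shortcut connection: add the input across layers, then ReLU
```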

3.3.2. Hybrid Attention Module

As mentioned earlier, the use of attention mechanism in CNNs can focus on key information and improve the convolutional representation ability. Therefore, an attention module is added after the residual module to focus on key information and weaken the useless information. The one-dimensional channel attention feature matrix and the two-dimensional spatial attention feature matrix are derived in turn, and then, the generated attention feature matrix is multiplied with the original input feature matrix to form the output feature matrix, which enables the classification model to focus on key areas with higher correlation with malicious behaviors for more accurate classification.

(1) Channel Attention Module. Compared with grayscale or RGB images, RGBA images contain richer information and more channels, and using a CNN with channel attention for classification can assign different weights to each channel, effectively improving the classification accuracy of malicious code. In the CNN, a two-dimensional malicious code image generates an image feature matrix $F \in \mathbb{R}^{H \times W \times C}$ after the convolution kernel operations, where $H$ and $W$ represent the image height and width and $C$ represents the number of feature channels. Introducing the channel attention mechanism into the malicious code classification model can effectively strengthen the model's extraction of global texture features of malicious code images. The channel attention module attends to the importance of the different feature channels of the input image: by modeling the importance of each feature channel, it assigns different weights to the channel features and strengthens or suppresses different channels according to their degree of correlation with malicious behavior.

The operation process of the channel attention module is shown in Figure 6, and the specific steps are as follows. Firstly, the output feature matrix of the previous convolution layer is used as the intermediate input feature. The intermediate feature matrix is reduced to two channel descriptors of size $1 \times 1 \times C$ through average pooling and maximum pooling over the spatial dimensions, which compresses the spatial dimensions of the input feature matrix and gathers spatial information. Feature information is thus extracted from different angles, the importance of each feature channel is modeled, and weights are assigned to the channel features, thereby effectively utilizing the interaction relationships between the channels of the intermediate feature matrix obtained after convolution. Afterwards, the two descriptors pass through a shared multilayer perceptron whose output dimension matches the number of channels of the intermediate feature matrix; the resulting vectors are added element-wise and activated by the Sigmoid function, realizing the enhancement or suppression of different channels as needed. The expression of the channel attention module is shown in formula (5):

$$M_c(F) = \sigma\big(W_1(W_0(F_{avg}^c)) + W_1(W_0(F_{max}^c))\big)$$

where $\sigma$ denotes the Sigmoid function, $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$, and $r$ is the compression ratio. Note that $W_0$ and $W_1$ are the weights of the multilayer perceptron (MLP), shared by both pooled inputs, with a ReLU activation following $W_0$. $F_{avg}^c$ and $F_{max}^c$ represent the descriptors generated by the average pooling and the maximum pooling, respectively.

Finally, the channel attention module's output weights and the input intermediate feature matrix are multiplied channel by channel to complete the channel attention calculation of the output feature matrix. Combining the channel attention module with the residual module retains more global texture information from the input malicious code image, greatly improving the malicious feature representation ability.
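A minimal PyTorch sketch of this module, following the structure of Woo et al. [38] and formula (5), is given below; the compression ratio r = 16 is an assumed default.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention of formula (5): a shared MLP over average- and max-pooled descriptors."""
    def __init__(self, channels, r=16):      # r: compression ratio (default assumed)
        super().__init__()
        self.mlp = nn.Sequential(             # shared weights W0, W1 with ReLU after W0
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )

    def forward(self, x):                     # x: (N, C, H, W)
        avg = self.mlp(x.mean(dim=(2, 3)))    # spatial average pooling -> (N, C)
        mx = self.mlp(x.amax(dim=(2, 3)))     # spatial max pooling -> (N, C)
        w = torch.sigmoid(avg + mx).view(x.size(0), -1, 1, 1)
        return x * w                          # strengthen or suppress each channel
```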

(2) Spatial Attention Module. Since most new malicious code is derived from existing malicious code mutations, core modules are repeatedly reused when generating new malicious code. Hence, the key to malicious code variant detection is extracting the core module feature information and assigning different weights to different regions of the image so as to focus on key feature information and improve the detection and classification accuracy of malicious code variants. The spatial attention module focuses on the importance of different spatial locations, generates spatial attention weights for the output feature map, and enhances the spatial location features that are highly correlated with malicious behavior according to the feature weights.

The operation process of the spatial attention module is shown in Figure 7, and the specific steps are as follows. Firstly, the feature matrix processed by the channel attention is taken as the intermediate input feature, and average pooling and maximum pooling are performed along the channel dimension to obtain two spatial description matrices of size $H \times W \times 1$. This considers not only the contribution of local regions of the malicious code image but also that of the global space. Next, the two spatial description matrices are concatenated into one feature matrix, and a two-dimensional spatial attention map is generated through a convolutional layer to better fit the spatially complex correlations; the spatial attention map is obtained after activation by the Sigmoid function. Adding a spatial attention module to the classification model thus improves the learning ability in key regions that are highly correlated with malicious behavior and complements the channel attention, further improving classification accuracy. The spatial attention module is shown in formula (6):

$$M_s(F) = \sigma\big(f^{k \times k}([F_{avg}^s; F_{max}^s])\big)$$

where $\sigma$ represents the Sigmoid function, $f^{k \times k}$ represents the convolution operation with a kernel size of $k \times k$, and $F_{avg}^s$ and $F_{max}^s$ represent the matrices generated by the average pooling and maximum pooling, respectively.
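A matching sketch of the spatial attention module, per formula (6), is shown below; the kernel size k = 7 follows the default of Woo et al. [38] and is an assumption here.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention of formula (6): a k x k convolution over pooled channel descriptors."""
    def __init__(self, k=7):                  # kernel size assumed, as in Woo et al. [38]
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                     # x: (N, C, H, W)
        avg = x.mean(dim=1, keepdim=True)     # channel-wise average pooling -> (N, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)      # channel-wise max pooling -> (N, 1, H, W)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w                          # emphasize informative spatial locations
```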

(3) Hybrid Attention Mechanism. The classification model proposed in this study extracts malicious code features by fusing channel attention and spatial attention. The channel attention module focuses on the global feature information between channels, and the spatial attention module focuses on the local feature information within channels; their combination forms a hybrid attention mechanism that supports the learning of key features and further improves classification accuracy. Woo et al. [38] showed that the channel and spatial attention modules can be arranged in parallel or sequentially, that the sequential arrangement performs better, and that placing the channel attention module first is slightly better than placing the spatial attention module first. The reason is that channel attention focuses on “what” is critical and meaningful in an input image, while spatial attention focuses on “where” the informative parts are, complementing the channel attention. Therefore, the channel-attention-first order is used in the proposed classification model.

3.3.3. Classification Model Structure

Based on the above analysis, the classification model structure that combines the residual module and the hybrid attention mechanism is shown in Figure 8. Firstly, a convolution operation is performed on the features generated in the previous layer to produce the input feature $F_0$. $F_0$ passes through the channel attention module to obtain the importance of each feature channel, so that the model pays more attention to high-weight channels related to malicious behavior and suppresses weakly correlated channels, yielding the channel attention feature $M_c$. $F_0$ and $M_c$ are multiplied element by element to obtain the new feature $F_1$. Then, $F_1$ is used as the input of the spatial attention module to obtain the spatial attention feature $M_s$; $M_s$ and $F_1$ are multiplied element by element, improving the model's ability to extract local texture features, to obtain the mixed feature $F_2$. Finally, $F_2$ is added to the features generated in the previous layer to produce the feature $F_3$ as the input of the next module.

The whole attention calculation process is shown in formulas (7)–(9):

$$F_1 = M_c(F_0) \otimes F_0, \qquad F_2 = M_s(F_1) \otimes F_1, \qquad F_3 = x + F_2$$

where $\otimes$ denotes element-wise multiplication and $x$ denotes the features generated in the previous layer. This process strengthens the feature information between channels in the global features of the malicious code and the local location information within the channels, thereby improving the classification performance.

In order to fully learn the image characteristics of malicious code and improve the performance of the attention module, a hybrid attention module is added after each residual unit instead of just adding it once. Therefore, when the next module performs the deep convolution operation, the features learned by the attention module in the previous module will be retained to continue learning. Moreover, although channel attention and spatial attention are arranged sequentially, they are also connected by identity mapping, which can prevent information of different dimensions from interfering with each other.
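Putting the pieces together, one attention-augmented residual unit can be sketched as follows, reusing the ChannelAttention and SpatialAttention classes from Section 3.3.2. This is our reading of Figure 8 and formulas (7)–(9), not the authors' released code.

```python
import torch.nn as nn
import torch.nn.functional as F

class McsUnit(nn.Module):
    """One residual unit with the hybrid attention module (cf. formulas (7)-(9))."""
    def __init__(self, channels, r=16):
        super().__init__()
        mid = channels // 4
        self.convs = nn.Sequential(           # bottleneck convolutions producing F0
            nn.Conv2d(channels, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False), nn.BatchNorm2d(channels),
        )
        self.ca = ChannelAttention(channels, r)   # from the sketch in Section 3.3.2
        self.sa = SpatialAttention()              # channel attention first, then spatial

    def forward(self, x):
        f0 = self.convs(x)     # F0: convolutional features
        f1 = self.ca(f0)       # F1 = Mc(F0) * F0
        f2 = self.sa(f1)       # F2 = Ms(F1) * F1
        return F.relu(x + f2)  # F3 = x + F2 (identity shortcut)
```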

4. Experiment and Performance Analysis

4.1. Experimental Preparation
4.1.1. Experimental Dataset

The experimental datasets used in this study are the BIG2015 dataset (https://www.kaggle.com/c/malware-classification/data) and the Malimg dataset (https://www.kaggle.com/keerthicheepurupalli/malimg-dataset9010). The BIG2015 dataset is a 500 GB malware dataset released by Microsoft on Kaggle for its 2015 malware classification challenge, which includes assembly files and bytecode files of more than 20,000 malware samples. Beyond the Kaggle competition itself, the BIG2015 dataset has become a standard benchmark for studying malware behavior modeling and has so far been cited by more than 50 research papers. Therefore, this dataset is used here to verify the performance of the proposed malicious code classification model. To facilitate performance verification, the labeled training set of 10868 malware samples from 9 families is selected as the experimental dataset, as shown in the upper part of Table 2, and divided into training and test sets at a ratio of 8 : 2.

The Malimg dataset was released by the Advanced Visualization Research Project of the Visual Research Laboratory at the University of California, Santa Barbara, whose team first proposed a malicious code visualization method for malicious code detection and classification. In 2011, the team constructed the Malimg dataset and published the visualization method to promote software security research. The dataset contains a total of 9342 samples from 25 family categories, as shown in the lower part of Table 2, and consists of grayscale images converted from malware bytecode files.

4.1.2. Experimental Settings

The experimental environment is shown in Table 3.

The stochastic gradient descent (SGD) algorithm with momentum can effectively suppress the oscillation of SGD and accelerate convergence. The data distribution in this paper is relatively uniform and is well suited to SGD-based optimization. Therefore, the experiments use the SGD algorithm with momentum to update the model parameters and improve computational efficiency; the momentum is set to 0.9. A total of 2000 epochs are trained with a batch size of 16. A dynamically decaying learning rate is used with an initial value of 0.01, and the classification function is softmax.
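The described configuration can be sketched as follows. Only the optimizer, momentum, initial learning rate, epoch count, and batch size are fixed by the text; the stand-in network, dummy data, and step-decay schedule are assumptions.

```python
import torch
import torch.nn as nn

# Stand-in network; in practice this would be the Mcs-ResNet model of Section 3.3.
model = nn.Sequential(nn.Conv2d(4, 8, 3, padding=1), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(8, 9))             # 9 BIG2015 families

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # SGD with momentum 0.9
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=500, gamma=0.1)  # decay schedule assumed

images = torch.randn(16, 4, 224, 224)    # one batch of 16 RGBA images (image size assumed)
labels = torch.randint(0, 9, (16,))
for epoch in range(2000):                # 2000 epochs per the text
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(images), labels)    # softmax classification loss
    loss.backward()
    optimizer.step()
    scheduler.step()
```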

Since the classification of malicious code families is a multiclass problem, in order to facilitate comparison with other models and better measure the classification performance, the arithmetic average of the accuracy over the malicious code families is taken as the performance evaluation standard. Here, TP is defined as the number of malicious samples classified as malware, TN as the number of benign samples classified as benign, FP as the number of benign samples classified as malware, and FN as the number of malicious samples classified as benign. Thus, accuracy (Acc) is defined as follows:

$$\text{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$$

4.2. Ablation Experiment and Analysis

Two sets of ablation experiments are conducted to verify the feasibility and effectiveness of the proposed malicious code visualization and classification module, which includes visualization scheme validity verification and hybrid attention module performance analysis. The comparative experiments all use the classification accuracy (Acc) as the evaluation index to facilitate comparison.

4.2.1. Visualization Scheme Validity Verification

The first experiment applies the malicious code images obtained from different visualization schemes to classic classification models and to the proposed Mcs-ResNet classification model for comparative analysis. The classic pipelines use VGG16, VGG19, and ResNet50 pretrained models for feature extraction, with a KNN classifier where k is 5. This group of experiments uses the bytecode files and assembly files provided by the BIG2015 dataset. In addition to converting them into RGBA images with the proposed visualization scheme, bytecode files or assembly files alone are converted into grayscale and RGB images, and the image size is varied to show the classification effect of the different visualization schemes.
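As a sketch of these baselines, features from a pretrained backbone can be fed to a 5-nearest-neighbor classifier as follows. The ImageNet weights and 3-channel inputs are assumptions; the grayscale and RGB schemes fit directly, whereas RGBA inputs would require adapting the first convolution to 4 channels.

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.neighbors import KNeighborsClassifier

backbone = models.resnet50(weights="IMAGENET1K_V1")  # pretrained feature extractor (weights assumed)
backbone.fc = nn.Identity()                          # expose the pooled 2048-d features
backbone.eval()

def extract(batch):                                  # batch: (N, 3, 224, 224) image tensors
    with torch.no_grad():
        return backbone(batch).numpy()

knn = KNeighborsClassifier(n_neighbors=5)            # k = 5, as in the KNN5 baselines
# knn.fit(extract(train_images), train_labels)
# accuracy = knn.score(extract(test_images), test_labels)
```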

The experimental results are shown in Table 4, where the grayscale and RGB images generated from the bytecode file are marked as Byte-gray and Byte-RGB, the grayscale and RGB images generated from the assembly file are marked as ASM-gray and ASM-RGB, RGBA images are RGB images that additionally contain transparency information, and the image size is given per experiment. From the experimental results, the following conclusions can be drawn:

(1) According to the results of nos. 1-4, converting bytecode files into grayscale or RGB images yields almost the same classification effect, while converting assembly files into RGB images works better than grayscale images. For example, in the ResNet50+KNN5 model, the accuracy of the bytecode file converted into the two types of images is 89.17% and 89.98%, respectively, a difference of only 0.81%, while the accuracy of the RGB image converted from the assembly file is 2.27% higher than that of the grayscale image. The reason is that the assembly file of the same malware is much larger than the bytecode file: a grayscale image can effectively represent the bytecode file but not the assembly file. Therefore, the two image types perform similarly for bytecode files, while RGB images perform better for assembly files.

(2) Comparing nos. 5-8 with nos. 1-4, prescaling the image to a uniform size ( pixels) with high quality is significantly better than directly inputting the original-size image. In the VGG19+KNN5 model, the accuracies on the original images are 87.33%, 88.41%, 90.77%, and 90.19%, while the accuracies on the corresponding prescaled images improve by 6.51%, 4.11%, 2.72%, and 4.22%, respectively. In the Mcs-ResNet model, they likewise improve by 7.69%, 1.15%, 7.86%, and 6.66%, respectively, all significant improvements. The reason is that when the image is scaled during preprocessing, the parameter of the resize() function is set to Image.ANTIALIAS, a high-quality compression operation that preserves the original image features to the greatest extent, whereas directly compressing the original image at data-loading time is a low-quality operation that loses a large amount of effective information and degrades classification. In addition, the Mcs-ResNet model outperforms the other models; in the no. 7 and no. 8 experiments, its accuracy is 3.49% and 2.29% higher than that of the VGG19+KNN5 model, respectively.

(3) From nos. 8 and 9, classification based on RGBA images is better than classification based on the other image types, and all four models achieve their best accuracy on RGBA images. In particular, the classification accuracy of the proposed model is 97.21%, which is 2.95%, 2.8%, and 2.51% higher than the first three models on RGB images (no. 8) and 2.52%, 1.98%, and 2.09% higher than the first three models on RGBA images (no. 9). The information source of the RGBA image combines bytecode files and assembly files, so it is richer and carries more information than grayscale or RGB images and better describes the features of malicious code images.

Based on the above analysis, the proposed visualization scheme is feasible and effective and shows good classification performance with different classifiers. When fusing bytecode files and assembly files, using the bytecode file as the data source of the transparency channel and the assembly file as the R, G, and B channel data source is a reasonable choice. This operation deeply exploits the feature information of malicious code images and effectively supports the accurate classification of malicious code.

4.2.2. Hybrid Attention Module Performance Analysis

In order to verify the classification performance of different attention mechanisms on malicious code image classification, four sets of comparative experiments are conducted here, including the ResNet model without an embedded attention module; the ResNet model with only the channel attention module embedded (Mc-ResNet); the ResNet model with only the spatial attention module embedded (Ms-ResNet); and the ResNet model with the hybrid attention module (Mcs-ResNet), as shown in Figure 9.

Except for the embedded attention modules, the model parameters are consistent with those in Section 4.1.2. The experimental results are shown in Table 5. It can be seen that:

(1) The Acc value of the ResNet model without an attention module is significantly lower than those of the three residual network models with embedded attention modules. The ResNet model has its lowest accuracy, 78.32%, on the Byte-gray dataset, and its average accuracy of 87.39% is the lowest among the four models, 3.7%, 4.87%, and 6.01% lower than the other three models, respectively. Like the other models, it performs best on RGBA images, with an accuracy of 95.14%, which is still 0.64%, 1.95%, and 2.07% lower than the other models.

(2) The average Acc values of the Mc-ResNet and Ms-ResNet models are 3.7% and 4.87% higher than that of the ResNet model. The experiments show that introducing the attention mechanism helps extract key image features and can effectively improve classification accuracy. These two models likewise perform best on RGBA images.

(3) The average Acc value of the proposed Mcs-ResNet, which embeds the hybrid attention module, is 6.01% higher than that of the ResNet model and 2.31% and 1.14% higher than the Mc-ResNet and Ms-ResNet models, which embed a single attention module. In general, Mcs-ResNet, which embeds both the channel attention module and the spatial attention module, achieves the best classification performance across the different visualization schemes, and its accuracy on RGBA images remains the best at 97.21%. As shown in the no. 9 experiment, on RGBA images the model with embedded hybrid attention improves the classification accuracy by 2.07% over the plain residual network, and it also improves on the models that embed channel or spatial attention alone. This further verifies the effectiveness of the proposed malicious code visualization scheme.

The running times of the four models are shown in Table 6. With the introduction of the channel and spatial attention mechanisms, the training and prediction times of the model become longer, but given the improvement in accuracy, the increase in prediction time is small and within an acceptable range. The training times on the two datasets are relatively close, 21.7059 seconds and 22.3302 seconds, while the prediction times are 13.5375 seconds and 5.7758 seconds, respectively, and the model complexity remains close to 3.8 billion FLOPs. The experimental results therefore show that the proposed model achieves better prediction results within a reasonable running time.

Based on the above analysis, the attention mechanism improves the classification effect, and different attention mechanisms affect the model differently. In particular, introducing the hybrid attention mechanism into the deep residual network effectively improves classification accuracy. The reason is that a single attention mechanism is not enough to fully characterize the key features: ignoring channel attention weakens the extraction of global texture features, while ignoring spatial attention harms local texture feature learning. The hybrid attention mechanism learns different weights along the channel and spatial dimensions, extracts deep texture features of malicious code images from both global and local perspectives, and strengthens the model's ability to extract key features. Overall, the fusion of the channel and spatial attention mechanisms allows the deep features of malicious code images to be fully represented, giving the classification model better accuracy across different malicious families.

4.3. Overall Performance Experiment and Analysis

In this section, two groups of experiments are conducted to analyze the overall performance of the model: (1) performance analysis on different datasets: verify the general applicability of the proposed classification model on different experimental datasets; (2) performance comparison analysis: compare the proposed scheme with other malicious code classification schemes to verify the superiority of the proposed scheme.

4.3.1. Performance Analysis on Different Datasets

In order to verify the general applicability of the proposed classification model, the BIG2015 dataset and the Malimg dataset are selected for comparative experiments, and the experimental results are shown in Figure 10 and Table 7. The BIG2015 dataset provides both bytecode files and assembly files that can be used directly in the proposed classification model, while the Malimg dataset provides the malicious code only as grayscale images. Since this visualization does not truncate the malicious code file or perform other operations that cause information loss, reverse processing can completely restore the image file to a bytecode file. The IDA Pro disassembly tool (https://www.hex-rays.com/products/ida/) is then used to disassemble the bytecode files into the corresponding assembly files, and finally, the Mcs-ResNet model is used to classify the malicious code.
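The reverse step can be sketched as follows: because each grayscale pixel stores exactly one byte, flattening the image and stripping any zero padding recovers the original byte stream. The pre-padding file length is assumed to be known (e.g., recorded when the image was generated).

```python
import numpy as np
from PIL import Image

def image_to_bytes(image_path, original_length):
    """Invert the visualization: each grayscale pixel is one byte of the original file.

    original_length is the pre-padding file size, assumed to be recorded elsewhere."""
    pixels = np.asarray(Image.open(image_path).convert("L"), dtype=np.uint8)
    return pixels.flatten()[:original_length].tobytes()  # drop the zero padding

# The recovered bytecode can then be disassembled (e.g., with IDA Pro) to obtain
# the corresponding .asm file for the RGBA fusion step.
```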

The experimental results show that when the epoch is 250, the validation accuracy on the BIG2015 dataset is 81.13% and that on the Malimg dataset is 67.19%, so in the initial stage of training, optimization proceeds slightly faster on the BIG2015 dataset. When the epoch is 500, the validation accuracies of the two datasets reach 89.53% and 90.25%, respectively, at which point the prediction effect on the Malimg dataset is slightly better. Both runs stabilize by about 1250 epochs, where the accuracy is essentially at its maximum. Finally, the model achieves an average classification accuracy of 97.21% on the BIG2015 dataset and 98.06% on the Malimg dataset.

In addition to the accuracy rate, precision, recall, F1-score, and the confusion matrix are also used to evaluate the model performance:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

The experimental results are shown in Table 8.

It can be seen that the proposed model performs well under the different evaluation metrics. The Skintrim.N and Swizzor.gen!E families in the Malimg dataset contain only 80 and 128 samples, respectively, so the classification effect on these two families is unstable and largely affected by data imbalance; data imbalance is not the focus of this paper and will be studied in follow-up work. In summary, the proposed classification model shows good generalization performance: it is not limited to a specific dataset and achieves good classification results on different datasets while ensuring classification efficiency (Figure 11).

4.3.2. Performance Comparison of Malicious Code Classification Schemes

The last set of experiments compares the proposed Mcs-ResNet model with several models that currently perform well in malicious code classification to verify its classification performance. The experimental results on the two datasets are shown in Table 9. Among them, Nataraj et al. [13] convert the bytecode file into a grayscale image, extract GIST features, and use the KNN algorithm for classification; Wang et al. [16] convert the bytecode file into an RGB image and use a VGGNet model for classification; Cui et al. [43] convert the malicious code into grayscale images and use a CNN for classification; Cakir and Dogdu [25] use the assembly file of the malicious code, extract features based on Word2Vec, and then use a Gradient Boosting Machine (GBM) for classification; and Ma et al. [39] use both bytecode files and assembly files of the malicious code and classify them based on SVM.

It can be seen from Table 9 that the proposed Mcs-ResNet model reaches 97.21% and 98.06% classification accuracy on the two datasets, respectively. Compared with methods that only use .bytes files, such as Nataraj et al. [13] and Cui et al. [43], the classification accuracy increases by 1~3%. Compared with methods that only use .asm files, such as Cakir and Dogdu [25], the classification accuracy improves by 1.07%. The results of using both .bytes and .asm files are thus better than using only one of them, indicating that more file types provide more information and further improve classification accuracy. Compared with the method of Ma et al. [39], which also uses both .bytes and .asm files, the classification accuracy still improves by 1.12%: Ma et al. only use a global attention mechanism to weight each assembly statement, without considering the key channels and regions of the classification model's intermediate feature maps, so the hybrid attention module composed of channel attention and spatial attention outperforms their global attention mechanism. Overall, the proposed method has clear advantages in classification accuracy over other malicious code classification methods on different datasets. The reason is that it uses both bytecode files and assembly files to form RGBA images, obtaining more malicious code information, and the proposed hybrid attention mechanism pays more attention to the extraction of key regions and local features, which further improves the classification accuracy.

5. Conclusion and Future Work

Edge security is an important guarantee for edge computing, and it is of great significance to classify malicious code quickly and accurately across the entire life cycle of edge computing. Therefore, a malicious code visualization classification method based on a deep residual network and hybrid attention mechanism for edge security is proposed to effectively support the detection and accurate classification of malicious code. The main contributions are as follows:

(1) A visualization scheme that converts malicious code into RGBA images is proposed to improve the deep feature representation ability of malicious code images. This scheme effectively integrates the bytecode file and assembly file of the malware, deeply exploits the image feature information, and solves the problem of a single source of code images in other visualization solutions without adding computational complexity.

(2) A classification model, Mcs-ResNet, that combines a hybrid attention mechanism and a deep residual network is proposed to accurately classify malicious code. Owing to its powerful feature extraction capability and shortcut connection architecture, the deep residual network improves classification accuracy while alleviating model degradation and gradient disappearance. The hybrid attention mechanism, including channel and spatial attention, effectively extracts the key feature information of malicious code images. Their combination further improves classification accuracy and effectiveness.

The experimental results on the BIG2015 and Malimg datasets demonstrate the feasibility and effectiveness of the proposed visualization scheme and classification model. Compared with existing malicious code classification methods, the proposed model performs better in classification accuracy and generalization. Future work will start from the serialization of malicious code, considering associating the bytecode file of the malware with the assembly file and extracting sequence information in the vertical direction and association information in the horizontal direction to make full use of the malicious code information. How to better combine the attention mechanism with malicious code classification is also a focus of future work.

Data Availability

The datasets used in the experimental part include the BIG2015 dataset and the Malimg dataset, from the following websites: https://www.kaggle.com/c/malware-classification/data and https://www.kaggle.com/keerthicheepurupalli/malimg-dataset9010.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors appreciate the support from the Zhejiang Provincial Natural Science Foundation of China (LY20F020015 and LY21F020015), Zhejiang Province Key R&D Project (2021C02012), the National Science Foundation of China (61902345, 61972121, 61902099, 61702517, and 61802101), the Defense Industrial Technology Development Program (no. JCKY2019415C001), and the Open Project Program of the State Key Lab of CAD&CG (grant no. 2109), Zhejiang University.