Abstract

Software vulnerabilities are a major cause of network intrusions, so it is vital to detect and fix them in a timely manner. Existing vulnerability detection methods usually rely on a single code model, which may miss some vulnerabilities. This paper implements a vulnerability detection system by combining source code and assembly code models. First, code slices are extracted from the source code and assembly code. Second, these slices are aligned by the proposed code alignment algorithm. Third, the aligned code slices are converted into vectors and input into a hyper fusion-based deep learning model. Experiments are carried out to verify the system. The results show that the system achieves stable and convergent detection performance.

1. Introduction

Software vulnerability detection is crucial to cybersecurity defenses. In recent years, the number of reported software vulnerabilities has grown rapidly with the development of the software industry. According to Common Vulnerabilities and Exposures (CVE) [1], the number of published security vulnerabilities increased markedly from 2016 to the first half of 2020. These vulnerabilities have led to many cybersecurity incidents. An effective tool for handling software vulnerabilities is vulnerability detection. For this reason, many scholars and businesses have devoted considerable effort to research on vulnerability detection.

There are a wide variety of vulnerability detection methods for discovering vulnerabilities in software. They can be broadly divided into dynamic analysis and static analysis methods. The former identify vulnerable behaviors in the process of analyzing and executing software [2, 3]. Despite a low false alarm rate, such methods may miss some vulnerabilities due to the difficulty of analyzing the entire program's behavior [4]. The latter detect software vulnerabilities by analyzing code without executing it. They have high coverage and, thus, may be more popular than the former. Some static analysis-based detectors explore code similarity to detect the vulnerabilities caused by code cloning [5–7]. However, these approaches are hard to generalize to other types of vulnerabilities. In contrast, pattern-based, especially machine learning-based, detectors can use a large number of software codes to learn the patterns of universal vulnerabilities [8, 9]. Recently, deep learning models have been introduced to extract vulnerability features from program fragments [10–12] and have achieved considerable accuracy.

The method based on code metrics [13] selects appropriate metrics to predict vulnerabilities. This method completes detection quickly, but with low accuracy. The graph-based deep learning vulnerability detection method in [14] extracts a graph of the code to predict vulnerabilities. This method has high detection effectiveness; however, its detection speed is too slow for large-scale code detection. To balance detection speed and accuracy, we use a token sequence-based deep learning model similar to those in [10–12, 15]. These methods extract the token sequences related to vulnerability information. Some of them extract token sequences from source codes [10, 11], while others use assembly codes. Experimentally, we find that, due to the limitations of employing a single model, these schemes do not perform stably when facing some vulnerabilities, thus reducing their performance. In view of this, we combine the source code model and assembly code model to enhance the detection ability.

Multimodal machine learning has the ability to process and understand multisource modal information. Multimodal fusion is one of the earliest concerns of multimodal machine learning. It processes data from different modalities and obtains more characteristics by extracting multimodal data. Multimodal fusion usually achieves better results than using a single mode alone. In audio-visual speech recognition (AVSR), the scheme in [16] fuses visual and audio cues to improve performance relative to audio-only recognition. The scheme in [17] fuses video features with spectral and prosodic speech information for multimodal laughter detection, improving performance compared to using video features alone. In multimodal emotion recognition, the scheme in [18] considers video only, audio only, and audio-visual input for the purpose of emotion recognition. The scheme in [19] uses early fusion of information from facial expressions, body movements, gestures, and speech to recognize emotions. These methods successfully improved emotion recognition rates over their unimodal baselines. As a result, we believe that multimodal fusion of code information could also be useful for vulnerability detection.

In this paper, a vulnerability detection system is proposed by fusing the assembly code and source code models (code and data are available: https://github.com/onstar99/VulnerabilitySystem). Given the source code of a software program, it is compiled to obtain the corresponding assembly code. Then, code slices are extracted from both the source code and the assembly code. After that, a code alignment algorithm is suggested to connect each source code statement with its corresponding assembly code statements. Then, the aligned codes are combined to form a fused slice. These new slices, together with the original source code slices and assembly code slices, are fed into the hyper fusion-based deep learning model for vulnerability detection. The main contributions of this paper include the following:
(1) We improve the performance of vulnerability detection by fusing the models of assembly codes and source codes.
(2) We suggest a simple but effective alignment method between source codes and assembly codes to quickly align the data slices.
(3) We collect a vulnerability dataset composed of source and assembly codes, which can be used to train and verify the proposed multimodal-based vulnerability detection method.

2.1. Code Representation for Vulnerability Detection

In order to better express vulnerability characteristics in a deep learning-based vulnerability detector, code representation is usually introduced as an intermediate representation. Commonly used code representations include code metrics, token sequences, abstract syntax trees, and graphs. Token sequences are obtained directly by statistics at the code level, so they have a strong correlation with vulnerabilities. The scheme in [23] scans the character stream of a code and marks its important information according to the lexical rules of a programming language. The character stream of the code is thereby converted into a token sequence representation. Finally, through vectorization processing such as one-hot encoding and word2vec [24], vectors can be obtained and used as the input of a machine learning model.
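As a minimal illustration of this pipeline, the following sketch tokenizes a C statement and one-hot encodes the tokens. The token pattern and vocabulary here are simplified stand-ins, not the lexer of [23]:

```python
# Sketch: tokenize a C statement and one-hot encode the resulting tokens.
import re

def tokenize(stmt):
    """Split a code statement into identifier, number, and symbol tokens."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", stmt)

def one_hot(tokens, vocab):
    """Map each token to a one-hot vector over the given vocabulary."""
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for tok in tokens:
        vec = [0] * len(vocab)
        if tok in index:                # unknown tokens stay all-zero
            vec[index[tok]] = 1
        vectors.append(vec)
    return vectors

tokens = tokenize("buf[i] = data;")     # ['buf', '[', 'i', ']', '=', 'data', ';']
vocab = sorted(set(tokens))
vecs = one_hot(tokens, vocab)
```

In practice, word2vec replaces the sparse one-hot vectors with dense embeddings learned from a code corpus, as described in Section 4.2.3.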

2.2. Vulnerability Detection by Source Code

Source codes are computer program files that have not yet been compiled. Many vulnerability detectors are implemented by parsing the source code. Li et al. [10] extracted Syntax-based Vulnerability Candidates (SyVCs) from the source code and converted them to Semantics-based Vulnerability Candidates (SeVCs) through a suggested algorithm. Then, word2vec [24] was used to convert the SeVCs into vectors that facilitate deep learning. Li et al. [11] used the LLVM IR technique and suggested a fine-grained deep learning-based vulnerability detector to locate vulnerabilities accurately. Cao et al. [20] extracted the abstract syntax tree (AST), control flow graph (CFG), and data flow graph (DFG) from the source code. These are then used to generate the code composite graph (CCG), which can be combined with a bidirectional graph neural network for vulnerability detection.

Source codes may have more intuitive information than assembly codes or executable codes. This information can better help detect vulnerabilities. For example, source codes can provide more syntactic and semantic information, which makes it easier to know how the data and inputs drive the paths of execution [25].

2.3. Vulnerability Detection by Assembly Code

Assembly codes are program files obtained by compiling the source codes. Practically, assembly codes might be easier to obtain in comparison with source code files. Grieco et al. [12] developed and implemented VDiscover. This system extracts different feature sets including static features of an assembly code and dynamic features during execution to solve large-scale vulnerability discovery. Xu et al. [21] presented a neural network-based cross-platform method to detect the similarity between assembly codes, which is an aid to detect vulnerabilities. Liu et al. [22] employed the attention mechanism on top of a bidirectional long short-term memory to detect assembly code vulnerabilities. Tian et al. [15] extracted code slices from assembly codes in their vulnerability detection system. A summary of the above studies is presented in Table 1, where we have highlighted the key differences of different works.

It has been found that vulnerability detection on the assembly code level can better solve cross-architecture problems [26]. Moreover, assembly codes are sometimes more sensitive to the semantic errors of a program [27]. This means that the assembly code can be used as a powerful supplement to improve the detector’s performance.

3. Designed System

3.1. System Overview

The proposed method analyzes the source code and assembly code of software to obtain more comprehensive information. First, we give the definitions of the source code slice and assembly code slice.

Definition 1 (source code slice). A source code slice S = {s1, s2, …, sn} is a snippet of a semantically related multiline source code, where si, 1 ≤ i ≤ n, is the ith line in S.

Definition 2 (assembly code slice). An assembly code slice A = {a1, a2, …, am} is a snippet of a semantically related multiline assembly code, where aj, 1 ≤ j ≤ m, is the jth line in A.
Figure 1 shows an example of a source code and the corresponding assembly code, where the assembly code is compiled from the source code using the GCC compiler. Each statement in the source code corresponds to one or more statements in the assembly code.
The workflow of our system is depicted in Figure 2. It has two phases: the training phase and the testing phase. In the training phase, the system first extracts source code slices from training codes and adds labels to them. Second, the system compiles training codes to assembly codes, from which assembly code slices are extracted and labeled. Third, source code slices and assembly code slices are aligned for further multimodal fusion. Fourth, the system converts source code slices and assembly code slices to vectors. Finally, the system trains a deep learning model by using these vectors. In the testing phase, the system tests the trained model using targeted codes and evaluates its performance.

3.2. Multimodal Network Structure

The designed system fuses assembly codes and source codes to predict the outcome. We use the multimodal hybrid fusion strategy [28] to construct its network, as shown in Figure 3. It first combines assembly code slices and source code slices via early fusion. Then, these combined slices, together with the source code slices and assembly code slices, are fed into three individual networks. At last, the results of the three networks are used to give the final decision through late fusion.

3.2.1. Code Representation

We use code slices to represent different codes. Given a snippet of the source code, Joern [29] is first used to analyze it and generate a control flow graph (CFG), which represents the real-time execution of a process in the form of a graph. After that, we get the program dependency graph (PDG) from the source code, which is a graphical representation of the control dependency and data dependency among program statements. Finally, CFG and PDG are traversed forward and backward to extract all affected statements. These statements are combined to form a code slice, providing data for vulnerability detection. Figure 4 shows a running example of how we generate a code slice from CFG and PDG of a source code. All the affected statements of each function are captured through CFG and combined into a source code slice based on PDG.
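The forward and backward traversal described above can be sketched as a reachability computation over a dependency graph. The toy graph and line numbers below are hypothetical; Joern-produced CFGs and PDGs are far richer:

```python
# Sketch: collect all statements that affect, or are affected by, a
# slicing criterion by traversing a dependency graph in both directions.
from collections import deque

def reach(graph, start):
    """Breadth-first reachability from `start` over directed edges."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def extract_slice(pdg, criterion):
    """Union of forward reachability and backward reachability (via the
    reversed graph), sorted by line number to form a code slice."""
    reverse = {}
    for src, dsts in pdg.items():
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    affected = reach(pdg, criterion) | reach(reverse, criterion)
    return sorted(affected)

# Toy PDG keyed by line number: 1 -> 2 means line 2 depends on line 1.
pdg = {1: [2], 2: [4], 3: [4], 4: [5]}
print(extract_slice(pdg, 4))   # all lines related to the criterion at line 4
```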

Besides the assembly and source code slices, we further construct fused code slices by applying early fusion to them. Each source code slice is combined with its corresponding assembly code slice to derive a new fused slice. Redundant information introduced by this fusion is then removed, yielding a new set of code slices as part of the network input.
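A minimal sketch of this early fusion step, assuming redundancy removal simply drops duplicate lines while preserving order (the actual removal rule may be more involved):

```python
def fuse_slices(src_slice, asm_slice):
    """Early fusion: concatenate an aligned source/assembly slice pair and
    drop duplicate lines, keeping first-occurrence order."""
    seen, fused = set(), []
    for line in src_slice + asm_slice:
        if line not in seen:
            seen.add(line)
            fused.append(line)
    return fused

# Hypothetical aligned pair; real slices come from Algorithm 1.
fused = fuse_slices(["int a = data;", "use(a);"],
                    ["mov eax, [rbp-4]", "call use"])
```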

3.2.2. Network Model

The three types of slices are input into three individual networks having the same structure. Each consists of five layers: an input layer, a bidirectional LSTM (BLSTM) layer, a dense layer, a softmax layer, and an output layer, as shown in Figure 5. The input layer is responsible for inputting vector data. The BLSTM layer extracts vulnerability characteristics from vulnerability samples; it contains 300 LSTM units in a bidirectional form (600 LSTM units altogether). The dense layer reduces the dimensionality of the vector; we set up two dense layers to obtain better detection results. To prevent overfitting, we apply dropout with a rate of 0.5 after the dense layers. The softmax layer represents and formats the classification results. At last, the output layer gives a decision. In our experiment, the loss function is categorical cross-entropy, and the optimizer is Adamax [30]. During the training phase, we set the batch size to 64 and the number of epochs to 30, which gives good performance. More settings are detailed in Table 2.
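For reference, the softmax output and the categorical cross-entropy loss used by each network can be computed as follows. This is a self-contained numerical sketch of the math, not the Keras implementation itself:

```python
import math

def softmax(z):
    """Numerically stable softmax over the final layer's logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def categorical_cross_entropy(y_true, y_pred):
    """Training loss for one sample; y_true is a one-hot label."""
    eps = 1e-12                         # avoid log(0)
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, y_pred))

# Hypothetical logits for the "vulnerable" vs "not vulnerable" classes.
probs = softmax([2.0, 0.5])
loss = categorical_cross_entropy([1, 0], probs)
```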

3.2.3. Late Fusion Model

The decisions of the three networks may differ. As a result, at the end of our system, these decisions are passed through a voting layer to get the final result. Majority voting is employed in this layer. A late fusion model usually achieves better results than a single model because it can avoid erroneous decisions made by individual models.
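A minimal sketch of the voting layer, assuming the three networks emit hard 0/1 decisions:

```python
from collections import Counter

def majority_vote(decisions):
    """Late fusion: the final label is the majority of the networks' outputs."""
    return Counter(decisions).most_common(1)[0][0]

# Two of the three networks flag the slice as vulnerable.
final = majority_vote([1, 0, 1])
```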

3.3. Training Phase

The training phase is highlighted in Figure 2. It has 5 steps:

Step 1. Source Code Processing.
(1) Extract source code slices: extract source code slices from a training code according to some vulnerability syntax characteristics.
(2) Add labels: give each source code slice a label as the indicator of vulnerabilities ("1" means the presence of vulnerabilities, and "0" means the absence). The method for determining the vulnerability of a code slice is detailed in Section 3.6.

Step 2. Program Compiling.
(1) Generate an assembly code from the training code using the GCC compiler.

Step 3. Assembly Code Processing.
(1) Extract assembly code slices: use the code slice alignment algorithm detailed in Section 3.5 to analyze the assembly code, and extract assembly code slices corresponding to the source code slices.
(2) Add labels: similar to the processing of source code slices, each assembly code slice is assigned a label indicating whether it contains vulnerabilities ("1" means yes and "0" means no).

Step 4. Convert Source Code Slices and Assembly Code Slices into Vectors.
(1) As shown in Figure 6, the system uses word2vec to convert source code slices, assembly code slices, and hybrid code slices into vector forms, which are input into the deep learning model.

Step 5. Model Training.
(1) Use the data generated in Steps 1–4 to train the network presented in Section 3.2. The parameters and design of the model are detailed in Section 3.2.

3.4. Testing Phase

As highlighted in Figure 2, the testing phase has 5 steps.

Steps 1–4. Similar to Steps 1–4 in the training phase, except that label adding is not needed.

Step 5. Automatic Vulnerability Detection.

(1) Input the data generated in Steps 1–4 into the trained model, and the system will give a detection result: "1" means that vulnerabilities exist in the code slice, and "0" means there is no vulnerability.

3.5. Code Alignment

In this section, we describe the algorithm employed for data alignment. Its top level is listed in Algorithm 1.

(1) function ALIGN_DATA(S, A)
(2)  for s_i in S do
(3)   P ← IDA_PRO(A)  // each statement in P corresponds to several statements in A
(4)   C_i ← ∅
(5)   for p_j in P do
(6)    if STATEMENT_TYPE(p_j) = STATEMENT_TYPE(s_i) then
(7)     C_i ← C_i ∪ {p_j}
(8)    end if
(9)   end for
(10)  end for
(11)  sim ← 0
(12)  while sim < threshold do
(13)   M ← HUNGARIAN(C)
(14)   A′ ← GEN_SLICE(M)
(15)   sim ← SIMILARITY(S, A′)
(16)  end while
(17)  return A′
(18) end function

The whole algorithm can be summarized into three stages, namely, (1) pseudocode generation, (2) collection of candidate sets of matched assembly codes, and (3) best match finding. At stage (1), we use IDA Pro [31] to generate pseudocode P from the assembly code (i.e., Line 3 in Algorithm 1), where each statement in P corresponds to several statements in the assembly code. At stage (2), we search for the candidate set of pseudocode statements that match each source code statement (i.e., Lines 4–9 in Algorithm 1): if the statement type of a pseudocode statement (e.g., loop statement or assignment statement) is the same as that of the source code statement, it is considered a candidate match. At stage (3), the Hungarian algorithm [32] is used to select a slice A′ from the candidate set, which is considered a potential match for the source code slice. Then, we use string and integer constants, function and library calls, and function declaration information to compute the similarity between the source code slice and A′. If this similarity exceeds a threshold, a satisfactory match is achieved. Otherwise, this stage is repeated until a satisfactory match is found (i.e., Lines 11–16 in Algorithm 1).
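The similarity computation at stage (3) can be approximated as follows. The feature extraction here (string/integer constants and call-like identifiers, compared by Jaccard similarity) is a simplified stand-in for the full set of cues listed above:

```python
import re

def features(slice_lines):
    """Collect string/integer constants and call-like identifiers from a
    slice (source code or IDA-style pseudocode); a rough approximation."""
    text = "\n".join(slice_lines)
    consts = set(re.findall(r'"[^"]*"|\b\d+\b', text))
    calls = set(re.findall(r"\b(\w+)\s*\(", text))
    return consts | calls

def similarity(src_slice, candidate_slice):
    """Jaccard similarity of the two slices' feature sets."""
    a, b = features(src_slice), features(candidate_slice)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

score = similarity(['strcpy(buf, "hi");', 'return 0;'],
                   ['strcpy(v1, "hi");'])
```

A match would be accepted once `score` exceeds the chosen threshold; otherwise the Hungarian step is re-run on the remaining candidates.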

3.6. Slice Labeling

A simple tool is developed here to give each code slice a label indicating the presence of vulnerabilities. All the vulnerability functions are defined by the NIST Software Assurance Reference Dataset (SARD) [33]. We manually populate a vulnerability function library according to the vulnerability codes provided by SARD. The procedure of the tool is as follows:
(1) Obtain the file names, vulnerability locations, vulnerability types, and other information from the vulnerability file.
(2) Traverse the training codes and the vulnerability information to obtain the vulnerability code and vulnerability function names. Collect this information to populate the vulnerability function library.
(3) Traverse each code slice and the vulnerability function library to determine whether the code slice defines or calls vulnerability functions.
(4) Divide the code slices into two categories:
 (a) If a code slice defines or calls vulnerability functions, it is labeled "1."
 (b) Otherwise, it is labeled "0."
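The core of the labeling step can be sketched as follows. The vulnerability function name is hypothetical; the real tool matches against the full SARD-derived function library:

```python
def label_slice(slice_lines, vuln_functions):
    """Label a slice '1' if it defines or calls any known vulnerability
    function from the SARD-derived library, else '0'."""
    text = "\n".join(slice_lines)
    return 1 if any(fn in text for fn in vuln_functions) else 0

vuln_lib = {"CWE121_bad"}                         # hypothetical library entry
label = label_slice(["CWE121_bad(data);"], vuln_lib)
```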

4. Experimental Results

Our experiments are conducted on a computer with an NVIDIA GeForce GTX 2080 Ti GPU and an Intel Xeon v3 CPU operating at 2.50 GHz. The employed program compiler is GCC 9.3.0.

4.1. Evaluation Metrics

In this paper, we use five standard indicators suggested in [34] to measure the performance of the vulnerability detection system. We denote TP as the number of vulnerable samples that are detected as vulnerable (i.e., true-positives), FP as the number of samples that are not vulnerable but are detected as vulnerable (i.e., false-positives), TN as the number of samples that are not vulnerable and are not detected as vulnerable (i.e., true-negatives), and FN as the number of vulnerable samples that are not detected as vulnerable (i.e., false-negatives). Then, the five indicators, namely, accuracy (A), false positive rate (FPR), false negative rate (FNR), precision (P), and F1-measure (F1), are defined as

A = (TP + TN) / (TP + TN + FP + FN)
FPR = FP / (FP + TN)
FNR = FN / (TP + FN)
P = TP / (TP + FP)
F1 = 2 · P · R / (P + R), where R = TP / (TP + FN) is the recall.
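These indicators can be computed directly from the four confusion-matrix counts, as in the following sketch (the counts passed in are illustrative):

```python
def metrics(tp, fp, tn, fn):
    """Compute the five standard indicators from confusion-matrix counts."""
    a = (tp + tn) / (tp + fp + tn + fn)   # accuracy
    fpr = fp / (fp + tn)                  # false positive rate
    fnr = fn / (tp + fn)                  # false negative rate
    p = tp / (tp + fp)                    # precision
    r = tp / (tp + fn)                    # recall, used only inside F1
    f1 = 2 * p * r / (p + r)
    return {"A": a, "FPR": fpr, "FNR": fnr, "P": p, "F1": f1}

m = metrics(80, 10, 90, 20)               # hypothetical counts
```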

4.2. Dataset Preparing
4.2.1. Input Preparing

In this paper, we collect a large number of source codes from the NIST Software Assurance Reference Dataset (SARD) [33]. These codes are compiled with GCC on Windows x64 to obtain the corresponding assembly codes. We randomly select 80% of the source codes and their corresponding assembly codes as the training dataset; the remaining 20% are used for testing.

To evaluate the proposed system at a finer granularity, we divide the test cases into the following four categories:
(1) Library/API function call (FC): syntax vulnerability characteristics related to library/API function calls.
(2) Array usage (AU): vulnerability characteristics related to array element access.
(3) Pointer usage (PU): syntax vulnerability characteristics related to pointer usage.
(4) Arithmetic expression (AE): syntax vulnerability characteristics related to improper arithmetic expressions.

4.2.2. Extracting Code Slices and Adding Labels

(1) Extracting Code Slices. Source code slices are extracted first. We obtain the CFG and PDG by parsing the source code. After that, all the statements affected by library/API function calls are extracted based on the CFG and PDG. These statements form the source code slices. Then, we use Algorithm 1 to collect the assembly code slices corresponding to the source code slices. In total, 42938 source code slices and assembly code slices are extracted.

(2) Adding Labels. We obtain file names, vulnerability locations, vulnerability types, and other information by analyzing program documents in the SARD project. This information is used to build a vulnerability function database, and all the code slices are labeled by the tool developed in Section 3.6. In total, we label 16478 code slices with "1" and 26460 code slices with "0."

To ensure effective training of the detection system, 16478 code slices are randomly selected from the 26460 code slices labeled "0" and combined with all the code slices labeled "1" to form the dataset for each training/testing round. From this dataset, 13182 code slices labeled "0" and 13182 code slices labeled "1" are randomly selected to train the model; the rest are used for testing.
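The balancing and splitting described above can be sketched as follows. The element counts and random seed are illustrative; `train_frac` approximates the 13182/16478 ratio in the text:

```python
import random

def balance_and_split(pos, neg, train_frac=0.8, seed=42):
    """Undersample the majority class to the minority size, then draw a
    class-balanced training set; the remainder becomes the test set."""
    rng = random.Random(seed)
    neg = rng.sample(neg, len(pos))            # balance the two classes
    k = int(len(pos) * train_frac)
    train = rng.sample(pos, k) + rng.sample(neg, k)
    train_set = set(train)
    test = [x for x in pos + neg if x not in train_set]
    return train, test

# Toy dataset: 100 "vulnerable" and 200 "non-vulnerable" slice IDs.
train, test = balance_and_split(list(range(100)), list(range(100, 300)))
```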

4.2.3. Convert Datasets into Vectors

The neural network only accepts vectors, so we need to convert the data into vector form, which is carried out by word2vec. Since the effectiveness of our system depends largely on the quality of the word embeddings, we first use the constructed dataset as a corpus to train word2vec; this yields a trained word2vec model better suited to our task. After that, code slices are divided into a series of tokens, and each token is converted through the trained word2vec model into a fixed-length vector that can be recognized by the network. In this experiment, the length of each code slice is 50 and the dimension of the word embedding is 100.
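The conversion of a token sequence into a fixed-size network input can be sketched as follows. The embedding table here is a toy stand-in for the trained word2vec model, and the padding token is hypothetical:

```python
def vectorize(slice_tokens, embeddings, slice_len=50, dim=100):
    """Pad/truncate a token sequence to `slice_len` tokens and map each
    token to its embedding (zero vector for unknown or padding tokens)."""
    zero = [0.0] * dim
    tokens = (slice_tokens + ["<pad>"] * slice_len)[:slice_len]
    return [embeddings.get(tok, zero) for tok in tokens]

# Toy embedding table in place of the trained word2vec model.
emb = {"strcpy": [0.1] * 100, "(": [0.2] * 100}
matrix = vectorize(["strcpy", "(", "buf"], emb)   # 50 x 100 input matrix
```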

4.3. Training and Results
4.3.1. Network Training

In this experiment, we use Keras with TensorFlow to implement the system. The length of the input layer is fixed at 200, and the number of nodes in each BLSTM is 300. Two fully connected layers with 300 nodes each are employed, with LeakyReLU as their activation function, and the dropout rate is 0.5. The model is optimized by Adamax [30] with a learning rate of 0.002.

4.3.2. Result

We compare the proposed system with the approaches presented in [35] (denoted as VulDeePecker) and [15] (denoted as BVDetector). VulDeePecker detects vulnerabilities at the source code level, while BVDetector detects vulnerabilities at the assembly code level. The training accuracy and training loss curves of these systems are compared in Figure 7. It can be observed that all the compared systems converge quickly. The average accuracy of the compared systems is close to 90%, while the accuracy of our early fusion system reaches 97%.

Table 3 lists the number of marked code slices by different schemes. It can be observed that some vulnerabilities can be detected by the model based on source codes, but not by the model based on assembly codes. In contrast, some vulnerabilities can only be detected by the model based on assembly codes. Take the code fragment shown in Figure 8 as an example, which is a vulnerability related to out-of-bounds array access. The system, which uses source code slices, may incorrectly detect the vulnerability as normal pointer arithmetic because it is difficult to detect out-of-bounds array access in a calculation. By contrast, the system using assembly code slices can easily detect this vulnerability via memory addresses. As a result, fusing source codes and assembly codes in our model can accurately detect a wider range of vulnerabilities.

The comparison results on the test set are shown in Tables 4 and 5, where a representative static vulnerability mining tool on the market, Flawfinder [36], is also included for comparison. Our system achieves better results in all aspects than the other systems. Compared with VulDeePecker and BVDetector, our hybrid fusion-based system achieves higher F1 scores, and its score is also well above the one obtained by Flawfinder. This confirms the effectiveness of the proposed vulnerability detection system.

5. Conclusion

This paper proposes a system for detecting vulnerabilities in software through deep learning. It combines the vulnerability features at the source code level and assembly code level to improve the ability of vulnerability detection. The system extracts code slices from both the source code and the assembly code of a program. Then, a code alignment algorithm based on string and integer constants, function and library calls, and related information is suggested to align these code slices. The hyper fusion-based deep learning model considers both early fusion and late fusion: early fusion combines aligned source code slices and assembly code slices to generate a new dataset, while late fusion combines the decisions of the deep learning models on the source dataset, assembly dataset, and fused dataset. We implement a prototype and perform systematic experiments. The experimental results show the effectiveness of the proposed system. In future research, we can extend this method to cross-language and cross-platform settings. Moreover, we could consider additional data besides source codes and assembly codes.

Data Availability

The source code and dataset have been deposited in the GitHub repository (https://github.com/onstar99/VulnerabilitySystem).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Key R&D Program of Guangdong Province (Grant no. 2019B010136003), National Natural Science Foundation of China (Grant nos. 61802145 and 61932010), Natural Science Foundation of Guangdong Province, China (Grant no. 2019B010137005), Science and Technology Program of Guangzhou, China (Grant no. 202007040004), Fundamental Research Funds for the Central Universities, Opening Project of State Key Laboratory of Information Security, and Opening Project of Guangdong Key Laboratory of Intelligent Information Processing and Shenzhen Key Laboratory of Media Security.