Abstract

Vulnerability detection on source code can prevent the risk of cyber-attacks as early as possible. However, the lack of fine-grained code analysis leaves existing solutions with low performance; besides, the explosive growth of open-source projects has dramatically increased the complexity and diversity of source code. This paper presents HGVul, a code vulnerability detection method based on a heterogeneous intermediate representation of source code. The key of the proposed method is its fine-grained handling of the heterogeneous source-level intermediate representation (SIR) without expert knowledge. It first extracts a graph SIR of the code carrying multiple kinds of syntactic-semantic information. Then, HGVul splits the SIR into subgraphs according to the different semantic relations, which are used to obtain the semantic information conveyed by different types of edges. Next, a graph neural network with attention operations is deployed on each subgraph to learn node representations, capturing the subtle effects of neighbors on a node's representation. Finally, the learned code feature representations are used to perform vulnerability detection. Experiments are conducted on multiple datasets. The F1 of HGVul reaches 96.1% on the sample-balanced Big-Vul-VP dataset and 88.3% on the unbalanced Big-Vul dataset. Further experiments on datasets drawn from actual open-source projects confirm the better performance of HGVul.

1. Introduction

The explosive growth of open-source projects has confronted their code security with severe challenges. In 2020, more than 60 million new repositories were created on GitHub and the number of new contributions exceeded 9.1 billion [1], and the number of attacks against open-source projects has kept increasing accordingly. Besides, open-source is not only a key fuel for digital innovation but also an ideal target for software supply chain attacks. Supply chain attacks on open-source projects surged by 650% in the last year, inflicting severe damage, e.g., the SolarWinds, Codecov, and Kaseya events [2]. Furthermore, software code vulnerabilities are the main foundation for launching supply chain attacks: they often act as "door openers" that allow attackers to move laterally and deploy malware for disruptive attacks. Detecting vulnerabilities in source code efficiently is therefore significant for locating software security problems as early as possible, ensuring the stable operation of software systems, and securing confidential information from theft. Many classical approaches have been used to detect vulnerabilities, such as static analysis [3–5], symbolic execution [6–8], and fuzz testing [9–11]. However, these are still inefficient in practical detection, and their false positives and false negatives remain high because they do not process the subtle syntactic-semantic information of the source code. The efficiency of source code vulnerability detection still needs to be improved further.

Finding vulnerabilities in software code has always been a challenging task. Vulnerability pattern-based detection approaches are widely used in industry, but they rely heavily on artificially constructed vulnerability pattern libraries [4, 12, 13], which leaves them unable to cope with the large amount of emerging open-source code. Symbolic execution [6–8] and fuzz testing [9–11] are also commonly used vulnerability detection approaches, but their huge computational overhead keeps their detection performance low in real-world use. The data-driven approach based on machine learning (ML) provides an alternative way to identify vulnerabilities and can be further divided into traditional ML-based approaches and deep learning- (DL-) based approaches. Traditional ML-based approaches were the first proved feasible for vulnerability detection [14–16]; they take features extracted from the code as input and detect vulnerabilities with ML algorithms. The quality of the code features is critical to these approaches, but it depends entirely on expert experience, and extracting features is generally a time-consuming and error-prone task. In contrast, the DL-based approach has a stronger ability to learn vulnerability feature representations, extracting them automatically from data without manual intervention. Besides, the large amount of open-source code provides a sufficient corpus for DL, which has accelerated the application of this approach to the vulnerability detection task [3, 5, 17–21].

The data-driven approach provides a profitable way to detect vulnerabilities, and its key is capturing the syntactic and semantic information of code. Some approaches treat the code as a flat sequence, such as a function call sequence [22, 23] or various traversal sequences of the SIR of code [20, 24, 25], and then extract vulnerability information with recurrent neural networks (RNN) [20, 25, 26] or convolutional neural networks (CNN) [17, 24]. However, code has complex structural properties, and treating it merely as a sequence does not represent its syntax and semantics well. It loses code structure properties, which are often crucial for vulnerability detection. Therefore, to better capture vulnerability characteristics from the structural properties of code, algorithms that can learn directly on complex structures are required.

Graph neural networks (GNN) can meet such needs, and some works already use GNN for vulnerability detection [19, 27]. Devign [19] expands the abstract syntax tree (AST) of the function code into a graph structure and uses a GNN variant to identify vulnerabilities on the expanded graph. BGNN4VD [27] further improves the SIR of function code by treating it as a bidirectional graph and then uses a GNN variant to detect vulnerabilities. Despite these advances, graph-based approaches still struggle to improve efficiency and performance, which is essential for real-world detection. At present, the syntactic and semantic information processed by graph-based approaches is relatively coarse-grained, so the vulnerability information hidden in the code cannot be fully exploited, leading to high rates of false positives and false negatives. Capturing fine-grained syntactic-semantic information can yield more valuable vulnerability information, since the vulnerable code accounts for only a very small portion of the entire function.

This paper presents HGVul, a source code-oriented vulnerability detection method based on a heterogeneous source-level intermediate representation graph. It can improve detection effectiveness because it captures more subtle syntactic-semantic information. First, HGVul focuses on function-level code, an appropriate granularity [18, 19, 24, 28], because most vulnerability-related code involves only part of a single function [29]. HGVul characterizes function source code with SIR graphs; that is, it combines the code property graph and the natural code sequence (CPG+), which together contain abundant syntactic-semantic information. Second, HGVul treats the CPG+ as a heterogeneous graph with multiple types of edges and extracts subgraphs by edge type. For each subgraph, node feature representations are generated by a GNN with an attention mechanism, so as to catch the slight effects of different neighbors on semantics. Third, HGVul merges the corresponding node representations of each edge-typed subgraph and reads out the whole-graph representation as the function feature, which further captures subtle semantic information since each type of relation conveys different semantic information. Therefore, by meticulously processing the SIR of a function, HGVul can acquire more of the valuable information hidden in the code, which improves the performance of vulnerability detection. The main contributions of this paper are as follows:

(i) A source code vulnerability detection framework based on heterogeneous SIR is designed to extract the valuable information of function code. It provides better code information representation capability than existing methods.

(ii) A method for deriving fine-grained syntactic-semantic information of code is proposed. It not only distinguishes the different semantic information of multiple edge types but also captures the different effects of internode relations in the SIR.

(iii) We implement a prototype and evaluate the effectiveness of HGVul on multiple datasets. The experimental results show that HGVul has a better-balanced performance, with the best F1 on balanced and unbalanced datasets of 96.1% and 88.3%, respectively, and that it is able to detect vulnerabilities in practical open-source projects.

The rest of this paper is organized as follows: Section 2 reviews the previous related work. The preliminaries for vulnerability detection are presented in Section 3. Section 4 introduces the details of the methodology. The experimental evaluation is given in Section 5. Finally, Section 6 concludes this paper.

2. Related Work

Vulnerability detection has been a key concern in the field of cyberspace security. Targeting software source code draws a large number of researchers' attention because it can eliminate potential vulnerability threats as early as possible. Existing source code-oriented approaches can be categorized into pattern-based matching approaches, code similarity-based analysis approaches, and learning-based detection approaches.

2.1. Pattern-Based Approach

This approach identifies vulnerabilities by relying on a large database of vulnerable code pattern rules. The predefined pattern database allows these approaches to quickly detect known vulnerabilities; hence, they are widely used by code scanners such as RATS [30], Flawfinder [31], and Checkmarx [32]. The vulnerability-related pattern is crucial to this approach, and researchers keep exploring different methods of pattern extraction [4, 12, 13]. However, vulnerabilities can exhibit multiple variants, the patterns of complex vulnerabilities are challenging to construct, and building a sufficiently comprehensive pattern database is a laborious and unachievable task. So, this approach can only detect what exists in the pattern database and cannot cope with unknown vulnerabilities. Compared with such approaches, HGVul does not need a pattern rule database built on expert knowledge, which dramatically reduces labor costs; besides, it has the potential to find unknown vulnerabilities.

2.2. Similarity-Based Approach

This approach discovers vulnerabilities based on the similarity of code. Instead of comparing the original code directly, it usually extracts an abstract representation of the code, or the corresponding syntactic and semantic properties, for similarity analysis. ReDeBug [33] detects vulnerabilities by extracting basic tokens from the source code and comparing the similarity of the token sets. VUDDY [34] calculates a hash of the string sequence and compares hash values to achieve fast vulnerability identification. Some researchers [35, 36] extract complex metrics to calculate similarity with vulnerable code. A suitable abstract code representation or code metric is the key to this approach. Consequently, it is easily defeated by obfuscation techniques and weak at detecting unknown vulnerabilities. HGVul is more robust because its feature representation is learned from abundant data.

2.3. Learning-Based Approach

The learning-based approach combines ML algorithms to learn the vulnerability information hidden in code data. Early learning-based approaches use code features as input for vulnerability prediction, e.g., code sequences of different lengths [37, 38] or features from the function call sequence [16, 39]. Feature extraction is time-consuming and error-prone work, while DL has proven able to generate features automatically [40–43], so DL-based approaches are gradually being applied to vulnerability detection. Russell et al. [24] arrange source code tokens extracted by a lexical parser into an image-like matrix and then identify vulnerabilities with a CNN. More researchers [17, 20, 25, 26] consider that code sequences, such as function call sequences and different traversal sequences of the code representation, contain richer information, and use RNN algorithms to detect vulnerabilities. Capturing the syntactic-semantic information hidden in code well is the key to the learning-based approach.

Because graph-structured representations of code can represent its syntactic-semantic properties well, some researchers [19, 27, 44] have begun exploring GNNs to detect vulnerabilities based on the SIR of source code. Zhou et al. [19] extended the graph representation of code based on the AST and used a gated graph neural network (GGNN) variant to implement vulnerability detection. Wu et al. [44] extract a simplified code property graph (CPG) from code and then use a GNN to identify vulnerabilities. Cao et al. [27] combine the AST, control flow graph (CFG), and data flow graph (DFG) of the code into a Code Composite Graph (CCG); considering the valuable backpropagated information on the CCG also worth tackling, they employ a GNN to learn the representation of vulnerabilities. Compared with existing GNN-based approaches, HGVul not only distinguishes the heterogeneous features of the SIR but also applies attention mechanisms within each semantic subgraph to obtain fine-grained code semantic information, which in turn improves the efficiency of vulnerability detection.

3. Preliminaries for Vulnerability Detection on SIR

3.1. Problem Formulation

The goal of the proposed method is to determine whether function-level code is vulnerable or not. The data samples are represented as $D = \{(f_i, y_i)\}_{i=1}^{n}$, where $F = \{f_1, f_2, \ldots, f_n\}$ is a series of functions, $Y = \{y_1, y_2, \ldots, y_n\}$ with $y_i \in \{0, 1\}$ is the set of corresponding labels in which 0 denotes not vulnerable and 1 otherwise, and $n$ is the number of samples; the target of HGVul is thus to find the optimal mapping $\phi: F \rightarrow Y$. We extract the graph-based SIR of each function, which can be formulated as follows:

Definition 1. (Function) A function can be symbolized by its SIR as $g = (V, E, A)$, where $V$ is the set of nodes, $E$ is the set of edges, and $A$ is the set of all node attributes.
In particular, the SIR used in this paper is extracted with multiple kinds of semantic information, so we regard it as a directed heterogeneous graph with multiple edge types. The heterogeneous graph can be formally described as follows:

Definition 2. (Heterogeneous Graph) A heterogeneous graph can be represented as $g = (V, E, T_V, T_E)$, where $V$ is the node set, $E$ is the edge set, and $T_V$ and $T_E$ denote the set of all node types and the set of all edge types, with $|T_V| + |T_E| > 2$. Specifically, the SIR in this paper has multiple types of edges, i.e., $|T_E| > 1$.
Therefore, HGVul searches for the optimal mapping $\phi$ by minimizing the loss function, which can be defined as

$$\min_{\phi} \sum_{i=1}^{n} \mathcal{L}\left(\phi\left(f_i\right), y_i\right) + \lambda\, \omega\left(\phi\right), \quad (1)$$

where $\mathcal{L}(\cdot)$ is the cross-entropy loss function, $\lambda$ is the adaptive weight, and $\omega(\cdot)$ is a regularization term.
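As an illustration only, this objective can be realized in PyTorch as follows; treating the regularization term $\omega(\cdot)$ as L2 weight decay is our assumption, since the text does not fix its form, and the linear model is a stand-in for the full network.

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(101, 2)          # stand-in for the full HGVul network
criterion = nn.CrossEntropyLoss()  # the cross-entropy term L of equation (1)
# weight_decay realizes the lambda * omega(phi) regularization (our assumption)
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

x = torch.randn(8, 101)            # a batch of toy node/function features
y = torch.randint(0, 2, (8,))      # binary vulnerability labels
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```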

3.2. Source-Level Intermediate Representation of Code
3.2.1. Abstract Syntax Tree (AST)

AST is an ordered tree representation of the abstract syntactic structure of code. Each node of the AST represents the smallest lexical unit, and each edge denotes a parent-child relationship between nodes.

3.2.2. Control Flow Graph (CFG)

CFG is a graph representation of code, which accounts for all possible paths during its execution [19]. The nodes of a CFG represent basic blocks that can be statements or conditions. The edges of CFG indicate the transfer of control through directed connections.

3.2.3. Program Dependence Graph (PDG)

PDG is a program representation that makes data dependencies and control dependencies explicit [45]. It comprises two types of relationships: data dependency (DD) and control dependency (CD). Data-dependency edges represent the relevant data flow relationships, and control-dependency edges denote the essential control flow relationships.

3.2.4. Code Property Graph (CPG)

CPG merges the AST, CFG, and PDG into a single joint data structure [12]. The nodes of the CPG are the same as those of the AST, and its edges combine those of the other SIRs. So, the CPG is a heterogeneous graph that contains multiple types of edges.

3.2.5. Natural Code Sequence (NCS)

NCS connects all token nodes of the AST in the natural sequential order of the source code. It reflects the programming logic of the function through the order in which the code appears. The nodes of the NCS are the leaf nodes of the AST, and its edges connect them according to the natural sequential order.

Besides, there are various extended forms of the basic SIRs, e.g., the SIR combining AST and NCS (called AST+ for convenience) and the SIR integrating CPG and NCS (called CPG+). This paper chooses CPG+ as the SIR of the function code because it contains more syntactic-semantic information. A visual example of CPG+ is shown in Figure 1.
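To make the structure concrete, the following minimal sketch (not the authors' code) builds a toy CPG+ as a heterogeneous graph in DGL, the graph library used in our implementation (see Section 5.1.3); the node indices and edge lists are illustrative only.

```python
import dgl
import torch

# Five edge types of CPG+: AST, CFG, data dependency (DD),
# control dependency (CD), and natural code sequence (NCS).
cpg_plus = dgl.heterograph({
    ("node", "AST", "node"): (torch.tensor([0, 0, 1]), torch.tensor([1, 2, 3])),
    ("node", "CFG", "node"): (torch.tensor([1, 2]), torch.tensor([2, 3])),
    ("node", "DD",  "node"): (torch.tensor([1]), torch.tensor([3])),
    ("node", "CD",  "node"): (torch.tensor([2]), torch.tensor([3])),
    ("node", "NCS", "node"): (torch.tensor([1, 2]), torch.tensor([2, 3])),
})
# 100-dim word2vec embedding concatenated with a node-type encoding (Section 4.2.2)
cpg_plus.nodes["node"].data["feat"] = torch.randn(cpg_plus.num_nodes("node"), 101)
print(cpg_plus)
```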

4. Methodology

4.1. Overview of HGVul

The overall framework of the method is shown in Figure 2 and includes three major processes: Preparing SIR, Learning Representation, and Detecting Vulnerability. Preparing SIR collects function code from open-source projects, extracts the SIR of the code at function-level granularity, and initializes the primary features of each node in the SIR. Learning Representation takes the graph structure corresponding to the SIR of a function as input and outputs the feature representation of the function. It first uses a GNN to update node representations on the different edge-typed subgraphs, which distinguishes the semantic information of different types of edges, and then merges them and reads them out as the function representation. While updating the node representations, an attention operation is employed at each node to separate the influence of different neighbors. Detecting Vulnerability takes the function representation as input. It trains the detection model in the training state and uses the trained model in the detection state to determine whether the function is vulnerable.

4.2. Preparing SIR of Function

This process mainly transforms the original code into a graph structure with node attributes. It includes two steps:

4.2.1. Extracting SIR of Function

For function-level code, this paper treats the entire function as the basic processing unit and extracts its corresponding SIR as the object to be processed. Specifically, HGVul takes the CPG as the prototype and combines it with the NCS to constitute a more comprehensive graph representation of the code (called CPG+). CPG+ contains a variety of edge types with abundant syntactic-semantic information.

4.2.2. Embedding the Code Statement as Node Initial Representation

This step transforms the code of each node into a quantifiable vector and uses it as the node's initial feature. Firstly, HGVul uses a lexical analyzer to obtain the basic tokens of the node code. Then, the function and variable names among the tokens are mapped to symbolic names (e.g., "FUN," "VAR"), because user-defined function and variable names carry program-specific naming characteristics that would interfere with the initial feature of the node. Next, HGVul uses a pretrained word2vec model to obtain the primary embedding of each node. When a node's code contains multiple tokens, the average over each dimension of the token vectors forms the node's primary embedding. The corpus of the pretrained word embedding model consists of the mapped tokens of all training samples, and the token dimension is set to 100. Finally, to capture the information hidden in node types, we encode each type as an integer and concatenate the node-type encoding with the obtained node embedding as the feature representation of the node.
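A hedged sketch of this node-initialization step is given below, using gensim's word2vec; the regex tokenizer, helper names, and toy corpus are our own simplifications, not the exact pipeline.

```python
import re
import numpy as np
from gensim.models import Word2Vec

def symbolize(tokens, func_names, var_names):
    # Map user-defined identifiers to the generic symbols "FUN"/"VAR".
    return ["FUN" if t in func_names else "VAR" if t in var_names else t
            for t in tokens]

# Train word2vec on the symbolized token corpus (dimension 100, as in the text).
corpus = [["VAR", "=", "FUN", "(", "VAR", ")"], ["if", "(", "VAR", ">", "0", ")"]]
w2v = Word2Vec(sentences=corpus, vector_size=100, min_count=1)

def node_feature(node_code, node_type_id, func_names=set(), var_names=set()):
    tokens = symbolize(re.findall(r"\w+|\S", node_code), func_names, var_names)
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    emb = np.mean(vecs, axis=0) if vecs else np.zeros(100)
    # Concatenate the integer node-type encoding with the averaged embedding.
    return np.concatenate([[node_type_id], emb])

print(node_feature("x = foo(y)", 3, {"foo"}, {"x", "y"}).shape)  # (101,)
```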

4.3. Learning Representation

This process acquires the feature representation of the function by taking the function-level SIR with node features as input. It consists of two steps: node representation updating and function representation generation.

4.3.1. Learning Node Representation from Edge-Typed Subgraph

In this step, each node aggregates neighbor information along the edges of the SIR and updates its own feature representation accordingly. HGVul extracts subgraphs according to the different edge types and then performs the node learning process on each subgraph separately. Hence, the SIR of a function is represented as $g = \{g^r \mid r \in T_E\}$, where $r$ denotes the edge type. The initial representation of node $v$ in subgraph $g^r$ is set as $h_v^{(0)} = a_v$. So, the representation of node $v$ at state $t$ is $h_v^{(t)}$, and $h_v^{(t+1)}$ represents it at state $t+1$, aggregated along the edges of the subgraph $g^r$:

$$h_v^{(t+1)} = \mathrm{AGGREGATE}\left(h_v^{(t)}, \left\{h_u^{(t)} \mid u \in \mathcal{N}^r(v)\right\}\right), \quad (2)$$

where $\mathcal{N}^r(v)$ denotes the neighbors of node $v$ in the subgraph $g^r$.
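Concretely, the edge-typed subgraphs can be obtained from the heterogeneous SIR before any message passing; the following sketch uses DGL's edge_type_subgraph and the toy `cpg_plus` graph from the earlier example.

```python
import dgl

# One subgraph per semantic relation, e.g., AST, CFG, DD, CD, NCS.
subgraphs = {
    etype: dgl.edge_type_subgraph(cpg_plus, [etype])
    for etype in cpg_plus.etypes
}
for etype, sg in subgraphs.items():
    print(etype, sg.num_edges())  # message passing runs on each sg separately
```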

When updating the feature representation of nodes in each subgraph, this paper employs the attention operation to distinguish the impact of different neighbors. Firstly, the correlation coefficients between nodes and their direct neighbors in $g^r$ are calculated. For a specific node $v$, the correlation coefficient with neighbor $u \in \mathcal{N}^r(v)$ is calculated as

$$e_{vu} = a\left(\left[W h_v^{(t)} \,\Vert\, W h_u^{(t)}\right]\right), \quad (3)$$

where $W$ is a shared parameter matrix that increases the embedding dimensionality for generating an enhanced node representation, the $\Vert$ operation concatenates the transformed features of $v$ and $u$, and $a(\cdot)$ maps the high-dimensional embedding to a real number. Then, the attention coefficient of each neighbor node relative to $v$ is calculated as

$$\alpha_{vu} = \frac{\exp\left(\sigma\left(e_{vu}\right)\right)}{\sum_{k \in \mathcal{N}^r(v)} \exp\left(\sigma\left(e_{vk}\right)\right)}, \quad (4)$$

where $\sigma$ denotes the activation function. After obtaining the attention coefficients, a linear transformation is performed on the node representation, which is then updated by combining the attention coefficients. In fact, we adopt a multihead scheme to ensure the stability of the attention operation:

$$h_v^{(t+1)} = \Big\Vert_{k=1}^{K}\, \sigma\left(\sum_{u \in \mathcal{N}^r(v)} \alpha_{vu}^{k} W^{k} h_u^{(t)}\right), \quad (5)$$

where $\alpha_{vu}^{k}$ denotes the $k$-th head of $\alpha_{vu}$ and $W^{k}$ corresponds to the $k$-th head of $W$.

In addition, to extend the receptive field over which nodes learn neighboring features, HGVul repeats steps (3)–(5) to aggregate information from the multistep neighbors of each node. This step has a training state and a detection state; in the detection state, the GNN model trained in the training state is used directly to obtain the node feature representations.
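The sketch below illustrates this per-subgraph update with DGL's built-in GATConv (an assumption; the text gives the attention equations but not an implementation). Stacking two layers extends the receptive field to 2-hop neighbors, mirroring the repetition of steps (3)–(5); `subgraphs` refers to the earlier subgraph-extraction sketch.

```python
import torch
import torch.nn as nn
import dgl
from dgl.nn import GATConv

class SubgraphEncoder(nn.Module):
    def __init__(self, in_dim=101, hid_dim=64, heads=4):
        super().__init__()
        self.gat1 = GATConv(in_dim, hid_dim, num_heads=heads)
        self.gat2 = GATConv(hid_dim * heads, hid_dim, num_heads=heads)

    def forward(self, g, h):
        h = torch.relu(self.gat1(g, h).flatten(1))  # concatenate the K heads, eq. (5)
        return self.gat2(g, h).mean(1)              # average the heads in the last layer

# Usage on one edge-typed subgraph; self-loops avoid zero-in-degree errors.
sg = dgl.add_self_loop(dgl.to_homogeneous(subgraphs["AST"], ndata=["feat"]))
out = SubgraphEncoder()(sg, sg.ndata["feat"].float())
print(out.shape)  # (num_nodes, 64)
```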

4.3.2. Merging and Readout Graph Representation as Function Feature Representation

The feature representation of the function is generated in this step by reading out the node features on the SIR. Because the node representations are learned on different edge-typed subgraphs, they must first be merged on the entire graph. Common merge operations include average, maximum or minimum, summation, and concatenation; this paper chooses the average:

$$h_v = \frac{1}{|T_E|} \sum_{r \in T_E} h_v^{r}, \quad (6)$$

where $h_v$ represents the updated feature representation of node $v$, aggregating features from neighbors over the different edge types.

Then, the feature representation of the function is read out from the whole SIR, since each node of the SIR represents a basic block with syntactic and semantic information. Here, HGVul reads out the function feature by averaging the node features over the SIR:

$$h_g = \frac{1}{|V|} \sum_{v \in V} h_v, \quad (7)$$

where $h_g$ represents the feature representation of the sample function.
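A minimal sketch of equations (6) and (7), assuming the per-subgraph node representations have already been computed:

```python
import torch

def merge_and_readout(per_subgraph_h):
    # per_subgraph_h: dict {edge_type: tensor of shape (num_nodes, dim)}
    h_node = torch.stack(list(per_subgraph_h.values()), dim=0).mean(0)  # eq. (6)
    h_graph = h_node.mean(0)                                            # eq. (7)
    return h_graph

h = {"AST": torch.randn(4, 64), "CFG": torch.randn(4, 64), "NCS": torch.randn(4, 64)}
print(merge_and_readout(h).shape)  # torch.Size([64])
```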

4.4. Detecting Vulnerability

This process performs graph-level classification to determine whether a function is vulnerable. It takes the feature representation of the function as input and trains a classifier to output whether the function is vulnerable or not. The classifier employs a linear transformation on the function feature representation to further extract function-level abstract features. The proposed method uses a multilayer perceptron (MLP) followed by a sigmoid function for classification:

$$\hat{y} = \mathrm{sigmoid}\left(\mathrm{MLP}\left(h_g\right)\right), \quad (8)$$

where $\hat{y}$ is the final detection result and $h_g$ is the feature representation of the function.
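A sketch of this detection head in PyTorch; the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(64, 32), nn.ReLU(),   # MLP over the function representation
    nn.Linear(32, 1),
)
h_g = torch.randn(8, 64)                  # a batch of function representations
y_hat = torch.sigmoid(classifier(h_g))    # eq. (8): vulnerability probability
labels = torch.randint(0, 2, (8, 1)).float()
loss = nn.BCELoss()(y_hat, labels)        # binary cross-entropy for training
loss.backward()
```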

Similarly, this process includes both training and detection states. The classifier is trained in the training state and used directly in the detection state.

5. Evaluation

5.1. Experimental Setup
5.1.1. Datasets

Obtaining enough high-quality function samples with vulnerability labels is essential for both training and evaluating the model, but it is never trivial. This paper collects three different datasets to validate model performance, evaluate efficiency on actual projects, and test the ability to detect the functions corresponding to CVEs.

Dataset I: this paper trains and evaluates the models on the Big-Vul dataset [46], which has a large number of sample functions with vulnerability labels and is publicly available in its entirety. Since the distribution of positive and negative samples in the dataset is very uneven, this paper derives two datasets, one balanced and one unbalanced, from Big-Vul to better validate HGVul. The balanced dataset, called Big-Vul-VP, is composed of the vulnerable functions and their corresponding patched functions. The unbalanced dataset is the original Big-Vul dataset. The experiment uses Joern [47] to extract the basic SIR of each function, and HGVul uses only the samples that Joern can handle correctly. For convenience, we refer to the two datasets together as Dataset I; their details are shown in Table 1.

Table 1 lists the number of positive (vulnerable) and negative (non-vulnerable) samples in the two datasets. In the experiments, we performed 5-fold cross-validation on Big-Vul-VP. The larger Big-Vul dataset was divided into training, validation, and test sets in the ratio of 2 : 1 : 1, as sketched below.
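The following is a sketch of such a 2 : 1 : 1 shuffle-and-split; the helper is our own illustration, not the original pipeline.

```python
import numpy as np

def split_2_1_1(samples, seed=0):
    # Shuffle, then take 2/4 for training and 1/4 each for validation and test.
    idx = np.random.default_rng(seed).permutation(len(samples))
    n_train, n_val = len(samples) // 2, len(samples) // 4
    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:n_train + n_val]]
    test = [samples[i] for i in idx[n_train + n_val:]]
    return train, val, test

train, val, test = split_2_1_1(list(range(1000)))
print(len(train), len(val), len(test))  # 500 250 250
```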

Dataset II: to test the actual detection effect of the model, this paper extracted real test functions from 6 open-source projects based on the D2A dataset [48]: ffmpeg, openssl, libav, httpd, nginx, and libtiff. Each function extracted from the D2A dataset has a "touched_by_commit" flag, so a function is regarded as vulnerable when its flag is "true," and the correspondingly repaired function is regarded as not vulnerable. In particular, D2A not only contains multiple versions of the code of each project but also performs interprocedural analysis, so one vulnerability may involve multiple vulnerable functions. We strictly removed duplicate functions and rigorously confirmed the number of vulnerable functions. Again, only samples that Joern could handle were used in the experiment. Specific information about the data of each project is shown in Table 2.

Dataset III: in addition, to further test the ability of HGVul to detect the functions corresponding to CVEs and to explore whether it has the potential to detect unknown vulnerable functions, we manually scraped the latest 10 open CVEs of each of the six projects from CVE Details [49], obtaining 60 CVEs containing 73 vulnerable functions. It should be noted that some projects, such as httpd, do not disclose the details of their latest CVEs; we collected the latest public vulnerable functions as far as possible, but some older vulnerabilities remain. The CVEs used in Dataset III are shown in Table 3.

5.1.2. Baselines

We compared HGVul against 6 different approaches that cover vulnerability analysis tools, the similarity-based approach, the sequence learning-based approach, and the graph learning-based approach: (1) RATS, a well-known static analyzer [30]; (2) Flawfinder, a widely utilized vulnerability analyzer [31]; (3) VUDDY, a similarity-based approach [34]; (4) VulDeePecker, a sequence learning-based approach [20]; (5) BGNN4VD, a variant graph learning-based approach [27]; (6) Devign, a graph learning-based approach [19].

5.1.3. Evaluation Metrics and Implementation

This paper uses 6 widely used metrics to evaluate the performance of HGVul: Accuracy (ACC), Precision (P), Recall, F1-measure (F1), false positive rate (FPR), and false negative rate (FNR).
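For reference, all six metrics follow directly from the confusion matrix; the small helper below sketches these standard definitions.

```python
def metrics(tp, fp, tn, fn):
    acc = (tp + tn) / (tp + fp + tn + fn)
    p = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * p * recall / (p + recall)
    fpr = fp / (fp + tn)   # false positive rate
    fnr = fn / (fn + tp)   # false negative rate
    return acc, p, recall, f1, fpr, fnr

print(metrics(tp=85, fp=10, tn=90, fn=15))  # illustrative counts
```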

This paper uses the open-source tool Joern [47] to construct the basic SIR of each function. The DGL [50] v0.6 package is used to store and process the graph data. The GNN-based vulnerability detection model is implemented with PyTorch [51] v1.8.1. All experiments are performed on a multicore server with a 20-core 2.2 GHz Intel Xeon CPU and an Nvidia Tesla V100 GPU.

5.2. Experimental Results
5.2.1. Comparing with the Different Approaches

In this experiment, we applied the 7 different methods to Dataset I to compare the efficiency of HGVul with the others. Moreover, to observe the stability of the approaches, the experiment performed 5-fold cross-validation on the balanced Big-Vul-VP and averaged the results. For the unbalanced Big-Vul dataset, we conducted 10 independent experiments, shuffling and dividing the training/validation/test sets at 2 : 1 : 1 before each run, and report the average value with max-min bars. Figure 3 reports the experimental results.

Figure 3 exhibits the evaluation results of the seven methods on the two datasets. The following observations can be made. First, on both Big-Vul-VP and Big-Vul, RATS and Flawfinder perform worst: their Recall and F1 are both lower than 25%. They retain high FNR and FPR due to the limitations of their vulnerability pattern databases, while HGVul does not suffer from these limitations and is therefore significantly better. Secondly, VUDDY has the highest Precision and the lowest FPR, but also the highest FNR, which is caused by the nature of its clone-based code similarity method. VUDDY cannot cope with even slightly changed forms of vulnerable code, whereas HGVul can handle variant code. Thirdly, the sequence-based approaches perform better than the vulnerability pattern-based approaches because they use DL techniques to extract more complex information. Their FNR is obviously lower; in particular, the FNR of VulDeePecker is only 14.3% on Big-Vul-VP. However, they cannot suppress the FPR well because they fail to effectively exploit the semantic information in the code. Lastly, the graph-based approaches perform better than the others, and their better F1 values indicate that BGNN4VD, Devign, and HGVul all achieve a more balanced detection effect. Among the three GNN-based approaches, HGVul still performs best: its FNR and FPR are both below 5% on the Big-Vul-VP dataset, and its FNR remains the best at 15.1% on the extremely unbalanced Big-Vul. Compared with BGNN4VD and Devign, HGVul uses a heterogeneous GNN to collect different semantic information and applies an attention mechanism to obtain subtle code information in each semantic subgraph, while the other two approaches generate the code representation on the raw graph and cannot distinguish the subtle heterogeneous features of the code as well. We therefore find that HGVul is more effective than the state-of-the-art vulnerability detection methods.

5.2.2. Performance of the Different SIR

This experiment tested different SIRs separately to compare their influence on the effectiveness of vulnerability detection, including AST, CFG, PDG, CPG, AST+, and CPG+. We chose the gated graph neural network (GGNN) [52] to generate the function feature representations, as it has no attention operations that could affect the results. The experiment was also performed on Dataset I under an ablation setting to reduce the influence of other factors; in other words, the network settings were identical except for the SIR given as input. The experimental results are shown in Tables 4 and 5.

Tables 4 and 5 list the detection results using the different SIRs as input. Each SIR of the code induced different detection performance on both Big-Vul-VP and Big-Vul. The detection results based on CPG+ are better than those using the other SIRs as input. On Big-Vul-VP, the Accuracy, Precision, Recall, and F1 were all higher than 92% when detecting with CPG+, and the FNR and FPR were relatively low, the FPR being the best at 8.4%. Recall and FNR were best when using AST+ as input, at 92.9% and 7.1%, respectively. On Big-Vul, although the detection results decrease due to the large sample bias, performance remains good when using CPG+ as the SIR: its Accuracy is 98.2%, its Precision is the best at 89.0%, its F1 is 83.4%, and its FPR is only 0.6%, significantly better than the methods using the other SIRs as input. When using CPG as input, Recall and FNR reach the best values of 81.3% and 18.7%, respectively. Therefore, we can conclude that detection performance is greatly influenced by the SIR of the code and that vulnerability detection performs better when using CPG+ as input.

5.2.3. Different Influence of Internode in SIR

This experiment verifies that node representations are differentially affected by their neighbors in the SIR and demonstrates the positive impact of the attention operation on vulnerability detection. We varied only the GNN in this ablation experiment; all other experimental settings were the same. Three networks, GCN [53], GGNN, and GAT [54], were chosen, of which GAT contains the attention mechanism. Specifically, the experiment focused only on differences on CPG+. The detection results are listed in Tables 6 and 7.

Tables 6 and 7 show the detection performance when node representation learning is based on the different GNNs. Considering the different influences of internode relations on node representations yields better results on both Big-Vul-VP and Big-Vul. On Big-Vul-VP, the detection results based on GAT are better than those learning node feature representations with GCN or GGNN: its F1 is the best at 94.2%, its Recall is 93.9%, and its FNR and FPR, both 6.1%, are obviously lower than those of the other two methods. On Big-Vul, the method considering that nodes are influenced differently by their different neighboring nodes in the SIR is also clearly better than the others. The ACC and Precision of the GAT-based method are higher than 90%, and its Recall and F1 are also the best, at 80.8% and 85.2%, respectively. Its FNR and FPR also remain low, the FPR being only 0.5%. Therefore, the experimental results prove that capturing the different impacts of different neighbors on node representations in the SIR can enhance the vulnerability characterization of node features, which improves the performance of vulnerability detection.

5.2.4. Improvement of Heterogeneous SIR

This experiment examines whether treating the SIR of a function as a heterogeneous graph with multiple types of edges improves detection. We compared the effect of treating the SIR as a heterogeneous versus a homogeneous graph. To reduce the influence of other factors on the results, only the graph neural network part was varied, with GAT chosen as the comparison. Besides, the experiment was performed on only two types of graphs, AST+ and CPG+, because these two SIRs contain different types of edges with more detailed semantic information. The comparison results are displayed in Tables 8 and 9.

Tables 8 and 9 list the experimental results on whether paying attention to the different information delivered by heterogeneous edges in the SIR helps. Compared with the methods that use only the attention mechanism, the methods that also capture the heterogeneous nature of the edges in the SIR perform better on both Big-Vul-VP and Big-Vul. On the Big-Vul-VP dataset, both SIRs (AST+, CPG+) show higher Accuracy, Precision, Recall, and F1 when the different types of edges conveying different information are considered; correspondingly, their FNR and FPR are obviously lower. The best method treats CPG+ as a heterogeneous graph with multiple edge types, with FNR and FPR both below 5%. On Big-Vul, the methods based on heterogeneous graphs with multiple edge types still detect better, and the method that processes CPG+ as a heterogeneous graph obtains the best detection effect. Compared with the method using CPG+ as input and updating node representations with GAT alone, its detection effect is obviously better, with Recall and F1 of 84.9% and 88.3%, respectively. Therefore, we can conclude that considering the different information delivered by different types of edges yields more subtle code information, enhancing the representation of function features and improving the performance of vulnerability detection.

5.2.5. Performance on the Open-Source Projects

We applied the trained models to 6 open-source projects to examine their ability to detect actual function-level vulnerabilities. This experiment used Dataset II, whose functions are grouped by project; the functions of each project are fed into the trained model for detection. Table 10 shows the details of the detection results.

Table 10 lists the detection results of the seven methods on the six practical open-source projects. The experimental results show that the proposed method can detect most of the function-level vulnerabilities and outperforms the other methods: HGVul detects 3004 vulnerable functions out of a total of 3696 vulnerable samples, and its average F1 reaches 69.7%. Moreover, we find that most of the undetected vulnerable samples are integer overflow vulnerabilities, for which the proposed method shows poor detection performance. A possible reason is that integer overflow vulnerabilities are closely tied to variable types and depend on runtime input characteristics, which makes them hard to detect with a graph-based model. We thus find that the proposed method is feasible for detecting practical vulnerable functions.

5.2.6. Performance for the Functions of Actual CVEs

In addition, we further explored the detection capability of the proposed method for the vulnerable functions of actual CVEs. In this experiment, the trained models were applied to Dataset III, which contains the functions of recent CVEs. The detection results are shown in Table 11.

Table 11 lists the detection results of the methods on the vulnerable functions of the latest 10 CVEs per project. Limited by the scale of their vulnerability pattern libraries, RATS and Flawfinder are less effective. VUDDY detects well on openssl, libav, and httpd because its latest version has been updated with the newest vulnerabilities in these projects. However, emerging vulnerabilities in ffmpeg have not been added to VUDDY, which dramatically decreases its detection ability there; this indicates that VUDDY relies heavily on its clone template database and suffers severe detection delay. BGNN4VD and Devign detect more vulnerable functions than VulDeePecker since they obtain the function code features from the source-level graph structure representation. Among the 73 vulnerable functions, HGVul identifies 60 functions with threats, and its average F1 over the 6 projects reaches 63.3%; it is better than the other methods because of its fine-grained handling of function code. Thus, HGVul still performs better at detecting the vulnerable functions of actual CVEs.

6. Conclusion

This paper presents HGVul, a novel function-level, source code-oriented vulnerability detection method based on heterogeneous SIR. To cope with the increasing complexity and diversity of code caused by the surge of open-source projects, HGVul processes the SIR of function code at a fine granularity, capturing the syntactic and semantic information implied by the code from different types of subgraphs. A set of experiments shows that HGVul outperforms 6 existing methods, significantly improving both FNR and FPR. In the future, we will improve our study in several ways, including further enhancing vulnerability detection, extending its scope, and providing interpretable vulnerability detection models.

Data Availability

The data sets used in this paper are public, free, and available at https://github.com/ZeoVan/MSR_20_Code_vulnerability_CSV_Dataset, https://github.com/IBM/D2A, and https://www.cvedetails.com/.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This work was supported in part by the National Key Research and Development Program under Grant 2019QY1400; in part by the National Natural Science Foundation of China under Grant U2133208; in part by the Sichuan Youth Science and Technology Innovation Team under Grant 2022JDTD0014; and in part by the Basic Research Program of China under Grant 2020-JCJQ-ZD-021.