Abstract

Research on software defect prediction has achieved great success at modeling predictors. To build more accurate predictors, a number of hand-crafted features have been proposed, such as static code features, process features, and social network features. Few models, however, consider the semantic and structural features of programs. Understanding the context information of source code files could explain a lot about the cause of defects in software. In this paper, we leverage representation learning to generate semantic and structural features. Specifically, we first extract token vectors of code files based on Abstract Syntax Trees (ASTs) and then feed the token vectors into a Convolutional Neural Network (CNN) to automatically learn semantic features. Meanwhile, we construct a complex network model based on the dependencies between code files, namely, a software network (SN). To learn the structural features, we then apply a network embedding method to the resulting SN. Finally, we build a novel software defect prediction model based on the learned semantic and structural features (SDP-S2S). We evaluated our method on 6 projects collected from the public PROMISE repository. The results suggest that the contribution of structural features extracted from the software network is prominent, and the results improve further when they are combined with semantic features. In addition, compared with the traditional hand-crafted features, the F-measure values of SDP-S2S generally increase, with a maximum growth rate of 99.5%. We also explore the parameter sensitivity in the learning process of semantic and structural features and provide guidance for the optimization of predictors.

1. Introduction

A software defect is an error in the code or incorrect behavior in software execution, also defined as a failure to meet intended or specified requirements. Software reliability is regarded as one of the crucial problems in software engineering, so models that help ensure software quality are required, and the software defect prediction model is one of them. Defect prediction can identify the most defect-prone software components and help developers allocate limited resources to those parts of the system that are most likely to contain defects during the testing and maintenance phases [1].

It is well known that, in the software life cycle, the earlier a defect is found, the less it costs to fix [2]. Therefore, detecting defects quickly and accurately remains an open challenge in software engineering and has attracted extensive attention from industry and academia.

Typical defect prediction is composed of two parts: feature extraction from source files and classifier construction using various machine learning algorithms. Existing methods are dominated by traditional hand-crafted features, namely, source code metrics (e.g., CK, Halstead, MOOD, and McCabe’s CC metrics). Unfortunately, these metrics generally overlook some important information implied in the code, such as semantic and structural information. Meanwhile, a wide range of machine learning algorithms have been adopted for software defect prediction, including Support Vector Machine (SVM), Naïve Bayes (NB), Decision Tree (DT), etc.

Programs have well-defined syntax and rich semantics hidden in the Abstract Syntax Trees (ASTs), which have been successfully used for programming pattern mining [3, 4], code completion [5, 6], and code plagiarism detection [7]. For example, Figure 1 shows two Java files, both of which contain an assignment statement, a while statement, a function call, and an increment statement. If we use traditional features to represent these two files, they are identical because they share the same source code characteristics in terms of lines of code, function calls, raw programming tokens, etc. However, they are actually quite different in their semantic information. In other words, semantic information, as a new kind of discriminative feature, should also be useful for characterizing defects and improving defect prediction.

At present, deep learning has emerged as a powerful technique for automated feature generation, since deep learning architectures can effectively capture highly complicated nonlinear features. To make use of this ability, some researchers [8, 9] have already leveraged deep learning algorithms, such as Deep Belief Network (DBN) and Convolutional Neural Network (CNN), to learn semantic features from programs’ ASTs and verified that these features outperform traditional hand-crafted features in defect prediction.

As demonstrated by researchers [9], CNN is superior to DBN because of its ability to capture local patterns efficiently. CNN can therefore detect such local patterns and use them for defect prediction. Since a slight difference in local code structure, such as the code order difference illustrated in Figure 1, may trigger huge variance in the global program, we apply CNN instead of DBN to the construction of the defect prediction model.

However, although the abovementioned studies consider the fine-grained semantic information within program files, they still overlook the global structural information among program files, which can lead to more accurate defect prediction. In order to better represent the global structure of software, previous studies [10–12] have successfully abstracted software as a directed dependency network using complex network theory, usually termed a software network (SN), where software components such as files, classes, or packages are nodes and the dependency relationships between them are edges. Furthermore, using network analysis technologies, they have demonstrated the effectiveness of network structure information in improving the performance of defect prediction.

Unfortunately, the network features these authors used in defect prediction modeling, such as modularity, centrality, and node degree, are still traditional hand-crafted features. As an emerging deep learning technology, network representation learning has become a novel approach for automatically learning latent features of nodes in a network [13] and has received much attention. Therefore, using representation learning to extract structural information from code files and applying the learned features to defect prediction may effectively improve the performance of existing prediction models.

Unlike the existing studies, in our work, instead of using traditional hand-crafted metrics, we introduced deep learning technologies to automatically extract the semantic (local fine-grained) and structural (global coarse-grained) features of code files for defect prediction modeling and seek empirical evidence that they can achieve acceptable performance compared with the benchmark models. Our contributions to the current state of research are summarized as follows:
(i) We further demonstrated that the automatically learned semantic features can significantly improve defect prediction compared to traditional features
(ii) In terms of improving the performance of defect prediction, we also validated that the contribution of structural features extracted from the software network by representation learning is comparable to that of semantic features on the whole
(iii) Interestingly, we also found that the combination of semantic and structural features has a greater impact on the improvement of prediction performance

The rest of this paper is organized as follows. Section 2 is a review of related work on this topic. Sections 3 and 4 describe the preliminary theories and the approach of our empirical study, respectively. Section 5 is the detailed experimental setups and the primary results. Some threats to validity that could affect our study are presented in Section 6. Finally, Section 7 concludes the work and presents the agenda for future work.

2. Related Work

2.1. Software Defect Prediction

Software defect prediction technology has been widely used in software quality assurance and can effectively reduce the cost of software development. It uses the previous defect data to build a predictor and then employs the established model to predict whether a new code fragment is defective. At present, conventional software defect prediction can be roughly divided into two steps. The first stage is feature extraction, which makes the representation of defects more efficient by manually designing some features or combining existing features. The second is the classification by machine learning methods, specifically, by using the learning algorithm to establish an accurate model, so as to provide better prediction.

Most defect prediction techniques leverage features that are composed of the hand-crafted code metrics to train machine learning-based classifiers [14]. Commonly used code metrics include static code metrics and process metrics. The former include McCabe metrics [15], CK metrics [16], and MOOD metrics [17], which are widely examined and used for defect prediction. Compared to the above static code metrics, process metrics can reveal much about how programmers collaborate on tasks. Moser et al. [18] used the number of revisions, authors, past fixes, and ages of files as metrics to predict defects. Nagappan and Ball [19] proposed code churn metrics and showed that these features were effective for defect prediction. Hassan [20] used entropy of change features to predict defects. Other process metrics, including developer individual characteristics [21] and collaboration between developers [22, 23], were also useful for defect prediction.

Meanwhile, many machine learning algorithms have been adopted for defect prediction, including Support Vector Machine (SVM) [24], Bayesian Belief Network [25], Naive Bayes (NB) [26], Decision Tree (DT) [1], neural network [27], and ensemble learning [28]. For instance, Kumar and Singh [24] evaluated the capability of SVM with combinations of different feature selection and extraction techniques in predicting defective software modules and tested it on five NASA datasets. In [25], the authors predicted the quality of software by using the Bayesian Belief Network. Arar and Ayan [26] proposed a Feature Dependent Naive Bayes (FDNB) classification method for software defect prediction and evaluated their approach on PROMISE datasets. He et al. [1] examined the performance of tree-based machine learning algorithms on defect prediction from the perspective of simplifying metrics. Li et al. [28] proposed a novel Two-Stage Ensemble Learning (TSEL) approach to defect prediction using heterogeneous data. They experimented on 30 public projects and showed that the proposed TSEL approach outperforms a range of competing methods.

In addition, to overcome the lack of training data, a cross-project defect prediction (CPDP) model was proposed by some research studies. To improve the performance of CPDP, Turhan et al. [29] proposed to use a nearest-neighbor filter for target project to select training data. Nam et al. [30] proposed TCA+, which adopted a state-of-the-art technique called Transfer Component Analysis (TCA) and optimized normalization process. They evaluated TCA+ on eight open-source projects, and the results showed that TCA+ significantly improved CPDP. Nam et al. [21] also presented methods for defect prediction that match up different metrics in different projects to address the heterogeneous data problem in CPDP.

2.2. Deep Learning in Software Engineering

Representation learning has been widely applied to feature learning, which can capture highly complex nonlinear information. Recently, deep learning algorithms have been adopted to improve research tasks in software engineering. Yang et al. [31] proposed an approach to generate features from existing features by using Deep Belief Network (DBN) and then used these new features to predict whether a commit is buggy or not. This work was motivated by a weakness of Logistic Regression (LR): it cannot combine features to generate new features. They used DBN to generate features from 14 traditional features and several developer experience-related features. Wang et al. [8] also leveraged DBN to automatically learn semantic features from token vectors extracted from programs’ Abstract Syntax Trees (ASTs) and further validated that the learned semantic features significantly improve the performance of defect prediction. Similarly, Li et al. [9] used a convolutional neural network for feature generation based on the program’s AST and proposed a framework of defect prediction. To explore program semantics, Phan et al. [32] attempted to learn new defect features from program control flow graphs by a convolutional neural network.

However, these studies still ignore the structural features of programs, such as the dependencies between program files. Prior studies [12, 33–35] have demonstrated the effectiveness of network structure information in improving the performance of the defect prediction model. Nowadays, a node of a network can be represented as a low-dimensional vector by means of network embedding. A large number of network embedding algorithms have been successfully applied in network representation learning, including DeepWalk [36], Node2vec [37], and LINE [38]. In this paper, through representation learning on the software networks formed by various dependencies between code files, we extract structural features of program files, so as to supplement the existing semantic features for defect prediction.

2.3. Software Network

In recent years, software networks (SN) have been widely utilized to characterize the problems in software engineering practices [39]. For example, some complexity metrics based on software networks are proposed to evaluate the software quality. Gu et al. [40] proposed a metric of cohesion based on SN for measuring connectivity of class members. From the perspective of social network analysis (SNA), Zhang et al. [41] put forward a suite of metrics for static structural complexity, which overcomes the limitations of traditional OO software metrics. Ma et al. [42] proposed a hierarchical set of metrics in terms of coupling and cohesion and analyzed a sample of 12 open-source OO software systems to empirically validate the set. Pan and Chai [43] leverage a meaningful metric based on SN to measure software stability.

In addition to complexity metrics, software network-based measures for stability and evolvability have also been presented by some researchers. Zhang et al. [41] analyzed the evolution of software networks from several kinds of object-oriented software systems and discovered some evolution rules, such as the increasing distance among nodes and the scale-free property. Gu and Chen [44] validated software evolution laws using network measures and discussed the feasibility of modeling software evolution. Peng et al. [11] constructed the software network model from a multigranularity perspective and, using complex network theory, further analyzed the evolution of three open-source software systems in terms of network scale, quality, and structure control indicators.

Besides, for the software ranking task, Srinivasan et al. [45] proposed a software ranking model based on software core components. Pan et al. [46] constructed a novel model, ElementRank, based on SN, which leverages a multilayer complex network to rank software. In addition, SN is also applied to analyze software structure [47]. Furthermore, a generalized k-core decomposition model [48] is leveraged to identify key classes.

3. Preliminaries

3.1. Overview of Software Defect Prediction

Software defect prediction plays an important role in reducing the cost of software development and ensuring the quality of software. It can find potentially defective code blocks according to the features of historical data, thus allowing developers to focus their limited resources on the defect-prone code. Figure 2 presents a basic framework of software defect prediction, which has been widely used in existing studies [1, 8, 12, 18, 19, 24, 25].

Most defect prediction models are based on machine learning; therefore, the first step is to collect defect datasets. The defect datasets consist of various code features and labels. Commonly used features are the software code metrics mentioned above. For binary classification, the label indicates whether a code file is defective or not. In this setting, a predictor is trained using the labeled instances of a project and then used to predict unlabeled (“?”) instances as defective or clean. In the process of defect prediction, the instances used to learn the classifier are called the training set, and the instances used to evaluate the classifier are called the test set.

3.2. Convolutional Neural Network

Convolutional neural network (CNN) is one of the most popular algorithms for deep learning, a specialized kind of neural networks for processing data that have a known gridlike topology [49]. Compared with traditional artificial neural network, CNN has many advantages and has been successfully demonstrated in many fields, including NLP [50], image recognition [51], and speech recognition [52]. Here, we will use CNN for learning semantic features from software source code through multilayer nonlinear transformation, so as to replace the manual features of code complexity. Moreover, the deep structure enables CNN to have strong representation and learning ability. CNN has two significant characteristics: local connectivity and shared weight, which are helpful to extract features for our software defect prediction modeling.

Compared with the full connection in a feedforward neural network, the neurons in the convolutional layer are only connected to some neurons of the adjacent layer and generate spatially local connections. As shown in Figure 3, each unit in the hidden layer i is only connected with 3 adjacent neurons in the layer i − 1, rather than with all the neurons. Each subset acts as a local filter over the input vectors, which can produce strong responses to a spatially local input pattern. Each local filter applies a nonlinear transformation: multiplying the input with a linear filter, adding a bias term, and then applying a nonlinear function. In Figure 3, if we denote the k-th hidden unit in layer i as $h_k^i$, then the local filter in layer $i$ acts as follows:

$$h_k^i = f\left(\sum_{j=1}^{3} W_j^{i-1} h_{k+j-1}^{i-1} + b^{i-1}\right),$$

where $f$ is the nonlinear function and $W^{i-1}$ and $b^{i-1}$ denote the weights and bias of the local filter, respectively.

In addition, sparse connectivity has a regularization effect, which improves the stability and generalization ability of the network and helps to avoid overfitting. At the same time, it reduces the number of weight parameters, which accelerates the learning of the neural network and reduces the memory cost of computation.

Parameter sharing refers to using the same parameters ($W^{i-1}$ and $b^{i-1}$) for each local filter. In previous neural networks, when calculating the output of a layer, the parameters of each unit are different. However, in CNN, the same filter shares the same weight $W^{i-1}$ and bias $b^{i-1}$ across all positions. The reason is that a repeating unit can identify a feature regardless of its position in the receptive field. In addition, weight sharing enables us to conduct feature extraction more effectively.
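Local connectivity and parameter sharing can be illustrated with a minimal sketch in plain Python. It is not the implementation used in this paper: the function name `local_filter_layer`, the use of tanh as the nonlinear function, and all numeric values are assumptions made only for illustration.

```python
import math


def local_filter_layer(prev_layer, weights, bias, activation=math.tanh):
    """Apply one shared local filter across a 1-D layer.

    Each output unit k sees only len(weights) adjacent inputs
    (local connectivity), and every unit reuses the same weights
    and bias (parameter sharing).
    """
    width = len(weights)
    outputs = []
    for k in range(len(prev_layer) - width + 1):
        window = prev_layer[k:k + width]
        z = sum(w * x for w, x in zip(weights, window)) + bias
        outputs.append(activation(z))
    return outputs


# Hypothetical values, chosen only for illustration.
layer_prev = [0.0, 1.0, 2.0, 3.0, 4.0]
shared_w = [0.5, -1.0, 0.5]   # one filter of width 3, reused at every position
shared_b = 0.1
layer_next = local_filter_layer(layer_prev, shared_w, shared_b)
```

Whatever the length of the input layer, the filter here has only three weights and one bias, which is the source of the parameter savings described above.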

3.3. Construction of Software Network

In software engineering, researchers in the field of complex systems have used complex network theory to represent software systems by taking software components (such as packages, files, classes, and methods) as nodes and their dependency relationships as edges; the result is called a software network. The role of SN in software defect prediction, evolution analysis, and complexity measurement has been confirmed in the literature [11, 12, 33–35].

Files are key software components in a software system, and they are connected through interactions. An SN at the file level can be defined as in Figure 4: every file is viewed as a single node in the SN, and the dependency and association relationships between files are represented by edges (directed or undirected). Let $SN = (V, E)$ represent the software network, where each file is treated as a node $v_i \in V$. The relationship between a pair of files $v_i$ and $v_j$, if it exists, forms a directed edge $e_{ij} \in E$.
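As a rough illustration of this definition, the sketch below builds a file-level directed network from a list of dependency pairs. The function name `build_software_network`, the file names, and the decision to drop self-dependencies are assumptions for the example; the paper itself extracts dependencies with DependencyFinder and an in-house tool.

```python
from collections import defaultdict


def build_software_network(dependencies):
    """Build a directed file-level network from (source, target) pairs.

    Returns the node set and an adjacency mapping: file -> sorted list
    of files it depends on. Duplicate pairs collapse into one edge.
    """
    nodes = set()
    out_edges = defaultdict(set)
    for src, dst in dependencies:
        nodes.update((src, dst))
        if src != dst:  # assumption: self-dependencies are ignored
            out_edges[src].add(dst)
    return nodes, {n: sorted(out_edges[n]) for n in sorted(nodes)}


# Hypothetical dependency pairs between Java files.
deps = [
    ("A.java", "B.java"),
    ("A.java", "C.java"),
    ("B.java", "C.java"),
    ("A.java", "B.java"),  # repeated dependency, kept as a single edge
]
nodes, edges = build_software_network(deps)
```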

3.4. Network Embedding

Network embedding (EM) maps an information network into a low-dimensional space, in which every vertex is represented as a low-dimensional vector. Such a low-dimensional embedding is very useful in a variety of applications such as node classification [3], link prediction [10], and personalized recommendation [23]. So far, various network embedding methods have been proposed in the field of machine learning. In this paper, we adopt the Node2vec algorithm for embedding learning on the software network.

Node2vec performs random walks over neighbor nodes and then sends the generated random walk sequences to the Skip-gram model for training. In Node2vec, a 2nd-order random walk with two parameters p and q is used to flexibly sample neighborhood nodes, interpolating between BFS (breadth-first search) and DFS (depth-first search).

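Formally, given a source node $u$, we simulate a random walk of fixed length $l$. Let $c_i$ denote the $i$th node in the walk, starting with $c_0 = u$. Node $c_i$ is generated by the following distribution:

$$P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{\pi_{vx}}{Z}, & \text{if } (v, x) \in E, \\ 0, & \text{otherwise}, \end{cases}$$

where $\pi_{vx}$ is the unnormalized transition probability between nodes $v$ and $x$, and $Z$ is the normalizing constant.

As shown in Figure 5, for a walk that has just moved from node $t$ to node $v$, the unnormalized transition probability is set to $\pi_{vx} = \alpha_{pq}(t, x) \cdot w_{vx}$, where $w_{vx}$ is the weight of edge $(v, x)$ and $d_{tx}$ represents the shortest distance between nodes $t$ and $x$:

$$\alpha_{pq}(t, x) = \begin{cases} \dfrac{1}{p}, & \text{if } d_{tx} = 0, \\ 1, & \text{if } d_{tx} = 1, \\ \dfrac{1}{q}, & \text{if } d_{tx} = 2. \end{cases}$$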
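The biased 2nd-order walk can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the Node2vec reference implementation: it assumes an unweighted graph stored as a dictionary of neighbor sets (so every edge weight is 1), and the names `alpha`, `transition_probs`, `node2vec_walk`, and the toy graph are hypothetical. Here t is the previously visited node, v the current node, and x a candidate next node.

```python
import random


def alpha(p, q, dist_tx):
    """Search bias: 1/p to return to t, 1 to stay near t, 1/q to move away."""
    if dist_tx == 0:
        return 1.0 / p
    if dist_tx == 1:
        return 1.0
    return 1.0 / q


def transition_probs(graph, t, v, p, q):
    """Normalized probabilities of stepping from v to each neighbor x,
    given that the walk reached v from t (edge weights assumed to be 1)."""
    weights = {}
    for x in graph[v]:
        if x == t:
            dist = 0
        elif x in graph[t]:
            dist = 1
        else:
            dist = 2
        weights[x] = alpha(p, q, dist)
    z = sum(weights.values())  # normalizing constant
    return {x: w / z for x, w in weights.items()}


def node2vec_walk(graph, source, length, p, q, rng):
    """Simulate one walk containing at most `length` nodes."""
    walk = [source]
    while len(walk) < length:
        v = walk[-1]
        neighbors = sorted(graph[v])
        if not neighbors:
            break
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))  # first step is uniform
        else:
            probs = transition_probs(graph, walk[-2], v, p, q)
            walk.append(rng.choices(neighbors, weights=[probs[x] for x in neighbors])[0])
    return walk


# Hypothetical toy graph.
toy = {"t": {"v", "x1"}, "v": {"t", "x1", "x2"}, "x1": {"t", "v"}, "x2": {"v"}}
probs = transition_probs(toy, "t", "v", p=2.0, q=0.5)
walk = node2vec_walk(toy, "t", 6, 2.0, 0.5, random.Random(0))
```

With p = 2 and q = 0.5, the unnormalized weights for returning to t, moving to the shared neighbor x1, and moving outward to x2 are 0.5, 1, and 2, so the walk favors exploration (DFS-like behavior). The generated walks would then be passed to a Skip-gram model to obtain node vectors.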

4. Approach

In this section, we elaborate our proposed method of software defect prediction via semantic and structural features from source code files (SDP-S2S). The overall framework of SDP-S2S is presented in Figure 6. It mainly consists of three parts. The first part is the generation of semantic features from source code and is detailed in Section 4.1. The second part, explained in Section 4.2, focuses on the extraction of structural features from the software network by network embedding learning. The last part combines the semantic and structural features obtained in the first two steps into new features, which are then used for software defect prediction.

4.1. Generation of Semantic Features

In order to achieve semantic features for each source code file, we should first map the source code files into ASTs and parse them as real-valued token vectors. After that, the token vectors are encoded and preprocessed, and then, the resulting token vectors are fed to CNN for feature learning to generate the semantic features. The generation process is described in detail in the following three steps.

4.1.1. Parsing AST

In order to represent the semantic features of source code files, we need to find the appropriate granularity for representing the source code. As a previous study [8] has shown, the AST can represent the semantic and structural information of source code with the most appropriate granularity. We first parse the source code files into ASTs by calling the open-source Python package javalang. As in [9], we only select three types of AST nodes as tokens: (1) nodes of method invocations and class instance creations, which are recorded as their corresponding names; (2) declaration nodes, i.e., method/type/enum declarations, whose values are extracted as tokens; and (3) control flow nodes, such as while, if, and throw, which are recorded as their node types. The three types of selected nodes are listed in Table 1.

We call javalang’s API to parse the source code into an AST. Given a path of the software source code, the token sequences of all files in the software are output. As described in Algorithm 1, we first traverse the source code files under path p, and each file is parsed into an AST via the PARSE-AST function. For each AST, we employ the preorder traversal strategy to retrieve the three types of nodes selected in Table 1 and obtain the final token sequence.

Algorithm 1: Extracting token sequences from source code files.
Input: path p of the software source code
Output: token sequences S for all files
function EXTRACT (Path p)
  F ← the set of source code files under path p;
  S ← ∅;
  for each file f ∈ F do
    create sequence s_f;
    s_f ← PARSE-AST(f);
    add s_f to S;
  end for
  return S;
end function
function PARSE-AST (File f)
  create sequence s;
  root ← javalang.parseFile2AST(f);
  for all nodes k ∈ root (in preorder) do
    if k is a method invocation or class instance creation node then
      record its name and append to s;
    else if k is a declaration node then
      record its declared value and append to s;
    else if k is a control flow node then
      record its type and append to s;
    end if
  end for
  return s;
end function

4.1.2. Token Sequence Preprocessing

Since CNN only accepts numerical vectors as input, the token sequences generated from ASTs cannot be fed to CNN directly. The extracted token sequences must therefore be converted into integer vectors. To do this, we give each token a unique integer ID, so that every occurrence of the same token maps to the same ID. Note that, because the token sequences of different files are of unequal length, the converted integer token vectors may differ in their dimensions. Since CNN requires input vectors of the same length, we append 0s to each integer vector until its length matches that of the longest vector. Additionally, during the encoding process, we filter out infrequent tokens, which might be specific to a single file and not generalize to other files. Specifically, we only encode tokens occurring three or more times and denote the others as 0.
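The encoding and padding steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `encode_token_sequences`, the rule that IDs are assigned in order of first appearance, and the sample token sequences are all hypothetical.

```python
from collections import Counter


def encode_token_sequences(sequences, min_count=3):
    """Map tokens to integer IDs and zero-pad to the longest sequence.

    Tokens seen fewer than `min_count` times across all files are
    encoded as 0, the same value that is used for padding.
    """
    counts = Counter(tok for seq in sequences for tok in seq)
    vocab = {}
    for seq in sequences:
        for tok in seq:
            if counts[tok] >= min_count and tok not in vocab:
                vocab[tok] = len(vocab) + 1  # ID 0 is reserved
    max_len = max((len(seq) for seq in sequences), default=0)
    vectors = []
    for seq in sequences:
        ids = [vocab.get(tok, 0) for tok in seq]
        vectors.append(ids + [0] * (max_len - len(ids)))
    return vectors, vocab


# Hypothetical token sequences for three files.
files = [
    ["if", "foo", "while", "bar"],
    ["if", "foo", "if"],
    ["while", "foo", "while", "baz", "if"],
]
vectors, vocab = encode_token_sequences(files)
```

In this toy input, "bar" and "baz" each occur once and are therefore mapped to 0, while "if", "foo", and "while" occur at least three times and receive IDs 1 to 3.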

In addition, class imbalance often exists in defect datasets. Specifically, the number of clean instances vastly outnumbers that of defective instances. Assuming that the number of clean instances is $N_c$ and the number of defective instances is $N_d$, the imbalance rate (IR) [53] is used to measure the degree of imbalance:

$$IR = \frac{N_c}{N_d}.$$

The larger the IR, the greater the imbalance, and vice versa. Imbalanced data will degrade the performance of our model. To address this issue, we duplicate the defective files several times until the IR is close to 1.
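A possible reading of this balancing step is sketched below. It assumes that IR is the ratio of clean to defective instances and that defective instances are duplicated one at a time, in round-robin order, until the two classes are the same size; both the exact IR definition and the duplication schedule are assumptions, and the function names are hypothetical.

```python
def imbalance_rate(labels):
    """IR assumed here to be (#clean) / (#defective); label 1 means defective."""
    defective = sum(1 for y in labels if y == 1)
    if defective == 0:
        raise ValueError("no defective instances")
    return (len(labels) - defective) / defective


def oversample_defective(instances, labels):
    """Duplicate defective instances until both classes have equal size."""
    defective = [(x, y) for x, y in zip(instances, labels) if y == 1]
    if not defective:
        raise ValueError("no defective instances")
    clean_count = len(labels) - len(defective)
    out_x, out_y = list(instances), list(labels)
    count, i = len(defective), 0
    while count < clean_count:
        x, y = defective[i % len(defective)]  # cycle through defective files
        out_x.append(x)
        out_y.append(y)
        count += 1
        i += 1
    return out_x, out_y


# Hypothetical dataset: 7 clean files and 2 defective files.
X = [f"file{i}" for i in range(9)]
y = [0] * 7 + [1] * 2
ir_before = imbalance_rate(y)
X_bal, y_bal = oversample_defective(X, y)
ir_after = imbalance_rate(y_bal)
```

Duplication of this kind should be applied to the training set only, so that the test set keeps its original class distribution.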

4.1.3. Building CNN

In this paper, we adopt a classic CNN architecture for feature learning. After encoding and preprocessing the token vectors, we train a CNN model that, excluding the input and output layers, has four layers: an embedding layer (which turns integer token vectors into real-valued vectors of fixed size), a convolutional layer, a max-pooling layer, and a fully connected layer. The overall architecture is illustrated in Figure 7.

ReLU activation functions are used for training, and the implementation is based on Keras (http://keras.io). The output layer is activated by a sigmoid function and is used only to train the weight matrices of the neural network, so as to optimize the learned features. In addition, the Adam optimizer, an improved variant of the stochastic gradient descent (SGD) algorithm, is employed. Adam dynamically adjusts the learning rate for each parameter by calculating first- and second-order moment estimates of the gradient. Compared with other optimization algorithms, Adam keeps the learning rate within an explicit range after each iteration, so that the parameters change smoothly.

Given a project P, suppose it contains n source code files, all of which have been converted to integer token vectors of equal length l by the treatments described previously. Through the embedding layer, each token is mapped to a d-dimensional real-valued vector. In other words, each file becomes a real-valued matrix $X \in \mathbb{R}^{l \times d}$. In the convolutional layer, a filter $w \in \mathbb{R}^{h \times d}$ is applied to a region of $h$ tokens to produce a new feature. For example, a feature $c_i$ is generated from a region of tokens $x_{i:i+h-1}$ by

$$c_i = f(w \cdot x_{i:i+h-1} + b).$$

Here, $b \in \mathbb{R}$ is a bias term and $f$ is a nonlinear hyperbolic tangent function. The filter is applied to each possible region of tokens in the file, $\{x_{1:h}, x_{2:h+1}, \ldots, x_{l-h+1:l}\}$, to produce a feature map:

$$c = [c_1, c_2, \ldots, c_{l-h+1}],$$

where $c \in \mathbb{R}^{l-h+1}$. Then, a 1-max-pooling operation is carried out over the feature map, and the maximum value $\hat{c} = \max\{c\}$ is taken as the feature corresponding to this particular filter $w$. Usually, multiple filters with different region sizes are used to get multiple features. Finally, a fully connected layer further generates the semantic features.
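The convolution and 1-max-pooling steps can be traced by hand with a small plain-Python sketch. It is illustrative only: the paper's model is built in Keras, whereas the functions `conv_feature_map` and `one_max_pool`, the embedded file matrix, and the filter values below are assumptions. A filter spanning h tokens slides over the l embedded tokens of a file, tanh is applied to each response, and the maximum response is kept.

```python
import math


def conv_feature_map(embedded_file, filt, bias):
    """Slide a filter of h rows over an l-by-d matrix of token embeddings.

    Returns the feature map of length l - h + 1, one tanh response
    per region of h consecutive tokens.
    """
    h = len(filt)
    feature_map = []
    for i in range(len(embedded_file) - h + 1):
        region = embedded_file[i:i + h]
        z = sum(
            w * x
            for w_row, x_row in zip(filt, region)
            for w, x in zip(w_row, x_row)
        ) + bias
        feature_map.append(math.tanh(z))
    return feature_map


def one_max_pool(feature_map):
    """Keep only the strongest response of the filter."""
    return max(feature_map)


# Hypothetical file with l = 4 tokens embedded in d = 2 dimensions.
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filt = [[1.0, -1.0], [0.5, 0.5]]  # one filter covering h = 2 tokens
feature_map = conv_feature_map(embedded, filt, bias=0.0)
feature = one_max_pool(feature_map)
```

Because pooling keeps one value per filter, files of any (padded) length yield a fixed-size feature vector once several filters are used.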

4.2. Generation of Structural Features

Before applying network embedding to represent structural features of source codes, it is necessary to build a software network model according to source files. As we did in the previous studies [11, 12], we use DependencyFinder API to parse the compiled source files (.zip or .jar extension) and extract their relationships using a tool developed by ourselves. With the directed software network, we further perform embedding learning using the Node2vec method. For more details on Node2vec, please refer to the literature [37].

4.3. Feature Concatenation

So far, we have obtained the semantic and structural features of the source code files. In order to verify the effectiveness of semantic and structural features in software defect prediction, in addition to analyzing the impact of each type of learned feature on defect prediction, we also explore the impact of their combination. We directly concatenate the semantic feature vectors with the structural feature vectors via the Merge operator in Keras to obtain the combined feature vectors.

5. Experiment Setup

5.1. Dataset

In our study, 6 Apache open-source projects based on Java are selected (https://github.com/apache) and a total of 12 defect datasets available at the PROMISE repository (http://promise.site.uottawa.ca/SERepository/datasets-page.html) are picked for validation. Detailed information on the datasets is listed in Table 2, where #Avg. (files) and #Avg. (defect rate) are the average number of files and the average percentage of defective files, respectively. An instance in the defect dataset represents a class file and consists of two parts: independent variables including the learned features (e.g., the CNN-learned semantic features) and a dependent variable labeled as defective or not in this class file.

5.2. Evaluation Measures

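The essence of defect prediction in this study is a binary classification problem. Note that a binary classifier can make two possible errors: false positives and false negatives. In addition, a correctly classified defective class file is a true positive and a correctly classified clean class file is a true negative. We evaluate the classification results in terms of Precision, Recall, and F-measure, which are described as follows:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad \text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$

where $TP$, $FP$, and $FN$ denote the numbers of true positives, false positives, and false negatives, respectively.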

A false positive is a file predicted as defective that actually has no defects, and a false negative is an actually defective file predicted as clean. Precision and Recall often trade off against each other in practice. Therefore, F-measure, the harmonic mean of Precision and Recall, is more commonly adopted. The value of F-measure ranges between 0 and 1, with values closer to 1 indicating better classification performance.
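These three measures follow directly from the counts of true positives, false positives, and false negatives. The sketch below uses the standard definitions; the function name `precision_recall_f`, the convention of returning 0 when a denominator is zero, and the sample labels are assumptions for illustration (1 denotes a defective file).

```python
def precision_recall_f(y_true, y_pred):
    """Compute Precision, Recall, and F-measure for the defective class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1]
precision, recall, f_measure = precision_recall_f(y_true, y_pred)
```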

5.3. Experiment Design

First, to make a comparison between the traditional hand-crafted features and the automatically learned features in our context, four scenarios are considered in our experiments:
(i) SDP-base represents software defect prediction based on the traditional hand-crafted features
(ii) SDP-S1 represents software defect prediction based on the semantic features
(iii) SDP-S2 represents software defect prediction based on the structural features
(iv) SDP-S2S represents software defect prediction based on the semantic and structural features

Second, we will further explore prediction performance under different parameter settings for CNN and network embedding learning. For each project, note that we use the data from the older version to train the CNN model. Then, the trained CNN is used to generate semantic and structural features for both the older and newer versions. After that, we use the older version to build a defect prediction model and apply it to the newer version.

5.4. Experimental Results
5.4.1. Impact of Different Features

For each type of feature, Table 3 shows some interesting results: except for a few cases (Poi), the F-measure values of SDP-S1, SDP-S2, and SDP-S2S are greater than those of the benchmark SDP-base, implying a notable improvement in prediction performance. For example, for Camel, the growth rate of performance is more than 21.7% when using the learned semantic and/or structural features. The advantage is most obvious when semantic and structural features are used together, as indicated by the 99.5% performance growth. Additionally, for Xerces, although the growth rates of performance are lower than those of Camel, they are still considerable, around 30%. For Lucene, Synapse, and Xalan, the corresponding maximum growth rates are 27.9% (0.7564), 22.6% (0.5204), and 10% (0.2406), respectively, with the F-measure values given in parentheses. The majority of positive growth rates suggest the feasibility of our proposed method of automatically learning features from source code files.
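For readers who wish to recompute such figures, the sketch below shows one way to obtain a growth rate. It assumes that the growth rate is the relative increase of the F-measure over SDP-base, expressed in percent; the function name and the two example values are hypothetical and do not come from Table 3.

```python
def growth_rate(baseline_f, new_f):
    """Relative F-measure increase over the baseline, in percent.

    Assumption: growth rate = (new - baseline) / baseline * 100.
    """
    if baseline_f <= 0:
        raise ValueError("baseline F-measure must be positive")
    return (new_f - baseline_f) / baseline_f * 100.0


# Hypothetical F-measure values (not taken from Table 3).
print(f"{growth_rate(0.40, 0.50):.1f}%")
```

Under this definition, a negative value indicates that the learned features perform worse than the hand-crafted baseline.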

In Table 3, the results also show that SDP-S2S performs better than SDP-S1 and SDP-S2, as indicated by the larger number of F-measure values in bold. Specifically, compared to the other two methods, SDP-S2S achieves the best performance on projects Camel, Lucene, and Xerces. To better distinguish their influences on defect prediction, we make further statistical comparisons in terms of the Wilcoxon signed-rank test (p-value) and Cliff’s delta effect size. In Table 4, the Wilcoxon signed-rank test shows that there is no significant performance difference among the three predictors, as indicated by the p-values. However, the negative values of Cliff’s delta show that their effect sizes differ. Specifically, SDP-S2 outperforms SDP-S1, whereas SDP-S2S outperforms SDP-S2.
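To make the effect-size comparison concrete, the sketch below computes Cliff's delta for two groups of F-measure values using its standard definition: the proportion of pairs in which the first group is larger minus the proportion in which it is smaller. The sample values are hypothetical, and the order of the arguments is an assumption; it is not necessarily the order used in Table 4.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta in [-1, 1]; negative means xs tend to be smaller than ys."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Hypothetical F-measure values of two predictors on the same projects.
f_first = [0.45, 0.60, 0.52]
f_second = [0.50, 0.66, 0.58]
print(cliffs_delta(f_first, f_second))
```

A negative delta for the pair (first, second) therefore suggests that the second predictor tends to obtain higher F-measure values than the first.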

Given the evidence provided by the above analyses, the feature learning approach proposed in this paper is shown to be suitable for defect prediction.

5.4.2. Parameter Sensitivity Analysis

(1) Parameter Analysis of CNN. When using CNN to represent semantic features, the settings of some network-layer parameters affect the representation of semantic features and thus the prediction performance. In this section, we tune three key parameters of the CNN, namely, the filter length, the number of filters, and the embedding dimension, by conducting experiments with different values of each. For the other parameters, we directly adopt the values used in previous studies [9]: the batch size is set to 32 and the number of training epochs is 15. We analyze the influence of each of the three parameters on the results in turn while fixing the others.
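The tuning procedure described above, varying one parameter while the others stay fixed, can be sketched as follows. The default embedding dimension, the candidate value lists, the helper names, and the scoring function are all hypothetical; in particular, toy_score is only a stand-in for training the CNN and measuring the F-measure.

```python
def one_at_a_time(defaults, candidates):
    """List (parameter name, setting) pairs that vary one parameter at a time."""
    settings = []
    for name, values in candidates.items():
        for value in values:
            setting = dict(defaults)   # copy so the other parameters stay fixed
            setting[name] = value
            settings.append((name, setting))
    return settings


def best_value(settings, evaluate, name):
    """Return the value of `name` with the highest score under `evaluate`."""
    scored = [(evaluate(s), s[name]) for n, s in settings if n == name]
    return max(scored)[1]


# Batch size and epochs follow the text; the remaining defaults are assumed.
defaults = {"filter_length": 10, "num_filters": 20, "embedding_dim": 30,
            "batch_size": 32, "epochs": 15}
candidates = {"filter_length": [2, 5, 10, 20],
              "num_filters": [10, 20, 50, 100],
              "embedding_dim": [10, 30, 50]}


def toy_score(setting):
    # Stand-in for "train the CNN and return the F-measure".
    return (-abs(setting["filter_length"] - 10)
            - abs(setting["num_filters"] - 20) / 100)


settings = one_at_a_time(defaults, candidates)
print(best_value(settings, toy_score, "filter_length"))
```

Note that a one-at-a-time sweep explores far fewer settings than a full grid, at the cost of ignoring interactions between parameters.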

Figures 8-10 present the performance obtained under different filter lengths, different numbers of filters, and different embedding dimensions, respectively. All six curves reach their highest performance when the filter length is set to 10. The optimal number of filters is 20, where the performance generally peaks. Interestingly, for project Xerces, the performance becomes particularly poor when the number of filters is set to 100. With regard to the embedding dimension, the six curves are very stable on the whole, which means that the dimension of the representation vector has a very limited impact on the prediction performance.

(2) Parameter Analysis of Software Network Embedding. For the generation of structural features, Node2vec uses a pair of parameters, p and q, that control the random walk and therefore affect the learning. That is, different combinations of p and q lead to different random-walk behaviors during network embedding and thus to different structural features. Therefore, we further analyze these two parameters.
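To illustrate how p and q steer the walk, the sketch below implements a second-order biased random walk in the style of Node2vec: the unnormalized weight of moving to a neighbor is 1/p if it returns to the previous node, 1 if the neighbor is also adjacent to the previous node, and 1/q otherwise. The toy graph, the file names, and the function name are hypothetical, and the graph is assumed to be undirected and unweighted, which may differ from the software network used in our experiments.

```python
import random


def node2vec_walk(graph, start, length, p, q, rng):
    """Generate one biased random walk of at most `length` nodes."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(graph[cur])
        if not neighbors:
            break
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)   # return to the previous node
            elif nxt in graph[prev]:
                weights.append(1.0)       # stay close to the previous node
            else:
                weights.append(1.0 / q)   # move farther away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Toy dependency graph between class files (hypothetical names).
graph = {
    "A.java": {"B.java", "C.java"},
    "B.java": {"A.java", "C.java", "D.java"},
    "C.java": {"A.java", "B.java"},
    "D.java": {"B.java"},
}
print(node2vec_walk(graph, "A.java", 6, p=4, q=2, rng=random.Random(0)))
```

A large p discourages immediate backtracking, while a large q keeps the walk near its previous node, so different (p, q) pairs expose different neighborhoods of a file to the embedding.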

Taking Poi and Synapse as examples, we construct 25 combinations of (p, q). The results for the different combinations are shown in Figure 11, and their effects differ. For example, when (p, q) is set to (4, 2) in Poi, the best performance of 0.789 is achieved, whereas the most suitable combination in Synapse is (0.5, 0.25), with an F-measure value of 0.5204. Therefore, for each project in our context, we report the optimal combination (p, q) in Table 5, so as to better learn the defect-related structural information and generate the corresponding structural features.

6. Threats to Validity

To evaluate the feasibility of our method in defect prediction, we constructed four kinds of predictors according to different features and compared their performance. Although we do not explicitly compare with the state-of-the-art defect prediction techniques in this paper, SDP-S1 is actually equivalent to the method proposed in the literature [13]. Since the original implementation of the CNN has not been released, we reimplemented it using Keras. Throughout, we strictly followed the procedures and parameter settings described in the reference, such as the selection of AST nodes and the learning rate used when training the neural network. Therefore, we are confident that our implementation is very close to the original model.

In this paper, our experiments were conducted with defect datasets of six open-source projects from the PROMISE repository, which might not be representative of all software projects. More projects that are not included in this paper or written in other programming languages are still to be considered. Besides, we only evaluated our approach in terms of different features and did not compare with other state-of-the-art prediction methods. To make our approach more generalizable, in the future, we will conduct experiments on a variety of projects and compare with more benchmark methods.

7. Conclusion

This study aims to build better predictors by learning as much defect feature information as possible from source code files, so as to improve the performance of software defect prediction. In summary, this study has been conducted on 6 open-source projects and consists of (1) an empirical validation of the feasibility of the structural features learned from the software network at the file level, (2) an in-depth analysis of our method SDP-S2S, which combines semantic features and structural features, and (3) a sensitivity analysis with regard to the parameters in CNN and network embedding.

Compared with the traditional hand-crafted features, the F-measure values are generally increased, with a maximum growth rate of 99.5%, and the results indicate that the inclusion of structural features does improve the performance of SDP. Statistically, the advantages of SDP-S2S are particularly obvious from the perspective of Cliff’s effect size. More specifically, the combination of semantic features and structural features is the preferred choice for SDP. In addition, our results show that the filter length is preferably 10, the optimal number of filters is 20, and the dimension of the representation vector has a very limited impact on the prediction performance. Finally, we also analyzed the parameters p and q involved in the embedding learning process of the software network.

Our future work mainly includes three aspects. First, we plan to validate the generalizability of our study with more projects written in different languages. Second, we will focus on more effective strategies such as feature selection techniques. Finally, we plan to consider not only the CNN and Node2vec models but also RNN or LSTM for learning semantic features and graph neural networks for network embedding.

Data Availability

The experimental data used to support the findings of this study are available at https://pan.baidu.com/s/1H6Gw7UHb7vfBFFVfDBF6mQ.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Key Research and Development Program of China (2018YFB1003801); the National Natural Science Foundation of China (61902114); Hubei Province Education Department Youth Talent Project (Q20171008); and Hubei Provincial Key Laboratory of Applied Mathematics (HBAM201901).