Abstract

Software defect prediction studies usually build models without analyzing the data used in the procedure. As a result, the same approach performs differently on different data sets. In this paper, we introduce discrimination analysis as a way to gain insight into the inherent properties of software data. Based on this analysis, we find that the data sets used in this field are nonlinearly separable and class-imbalanced. Unlike prior work, we exploit the kernel method to nonlinearly map the data into a high-dimensional feature space. To combat these two problems, we propose an algorithm based on kernel discrimination analysis, called KDC, to build more effective prediction models. Experimental results on data sets from different organizations indicate that KDC is more accurate in terms of F-measure than the state-of-the-art methods. We are optimistic that our discrimination analysis method can guide more studies on data structure and help derive useful knowledge from data science for building more accurate prediction models.

1. Introduction

Predicting defect-prone software modules is critical for high-assurance and mission-critical systems. It tries to estimate a functional relationship between the features of software modules and the quality of those modules. Many software engineering researchers apply data mining methods to different software data sets. However, few researchers analyze the inner structure of these data sets, either because doing so requires a good technical background in data science or because the modules were developed in domains unfamiliar to the local companies. Building prediction models requires solving a binary classification problem, as in many pattern recognition applications. Many pattern recognition approaches have been applied to build predictors, but they perform differently on different data sets. As pointed out by Khoshgoftaar et al. [1] and Menzies et al. [2], the majority of defects in a software system are located in a small percentage of the program modules; software defect data sets are therefore highly class-imbalanced. Since then, many approaches to handling the class-imbalance problem have been proposed in software defect prediction, such as sampling approaches, cost-sensitive approaches, feature selection approaches, and ensemble approaches. In addition to the class-imbalance property, we argue that software data sets have another property, namely, nonlinear separability. On data sets with better separability, most methods yield good performance, while on data sets with worse separability, most methods perform poorly. To the best of our knowledge, however, very few studies have based the prediction model on the inherent properties of the software data sets.

This paper makes the following contributions. We introduce kernel-based discrimination analysis on software data sets to gain insight into the inherent properties of the data used in defect prediction. The results of the analysis suggest that the data sets used in this field are nonlinearly separable, which may require nonlinear algorithms to build predictors with improved performance. By comparing the transformation results of linear discrimination analysis with those of kernel discrimination analysis, we propose a kernel-based algorithm to build defect predictors, which addresses the nonlinear-separability and class-imbalance problems. We conducted our experiments on data sets drawn from different projects and different companies. The experimental results show that the proposed algorithm performs better on all the data sets than the state-of-the-art methods.

The rest of this paper is organized as follows. Section 2 briefly reviews the background of discrimination analysis techniques and software defect prediction algorithms. Based on the theories of linear and kernel discrimination, Section 3 presents our algorithm for building the defect predictor. Section 4 describes the software defect data sets and the performance metrics used in this study and reports the experimental results with discussions. Section 5 concludes the paper and outlines future work.

2. Background

2.1. Discrimination Analysis

Recently, Menzies et al. [2] proposed a well-known method based on the naive Bayes classifier, which ignores the dependence among variables, to build defect prediction models. However, Fan et al. [3] argue that the theoretical misclassification rate of the naive Bayes classifier is higher than that of linear discrimination analysis. Linear discriminant analysis (LDA) [4] is a classical multivariate technique for supervised learning, especially for classification problems that require projecting data vectors onto directions that separate the classes. It has been widely used in many applications such as traffic incident detection [5], face recognition [6], document classification [7], speech recognition [8], and image classification [9]. Linear discrimination analysis methods can find a compact representation of the original data when the data form a linear subspace. However, when the distribution of the data, such as face images, is highly nonlinear and complex, they cannot find such a compact representation. It is therefore reasonable, when linear discrimination analysis methods fail to provide reliable results, to try nonlinear methods to achieve robust performance. A number of nonlinear methods, such as kernel-based approaches, have been developed to deal with these shortcomings of linear discrimination analysis.

Kernel-based discrimination analysis (KDA) performs well in many applications such as face recognition [10], information retrieval [11], and image classification [12]. Recently, discriminant analysis has been used to combat the class-imbalance problem that exists within colon cancer, lymphoma, lung cancer, breast cancer, and gene-imprint data [13]. In order to find the defective modules, we likewise emphasize the importance of considering class imbalance during software quality modeling. We found that kernel-based discrimination analysis performs well on software defect prediction.

2.2. Software Defect Prediction

Software defect prediction aims to predict the defect-prone modules for the next release of a software system or for cross-project software, as shown in Figure 1. As software metrics research advances, more and more researchers apply machine learning methods to predict defective software modules, such as interpretable models [14], the J4.8 decision tree [15], Bayesian nets [16], ensemble methods [17], transfer learning [18], asymmetric learning [19, 20], and active learning [21]. These articles compare the performance of learning methods and endorse the use of static code attributes for predicting defective modules. A few articles also report that further progress in learning defect predictors may come not from better algorithms but from more information content in the training data, such as [22].

The studies [23, 24] used PCA and LDA to predict fault-prone modules directly, without analyzing the separability of the data sets. We not only analyze nonlinear separability, which is an inherent property of the software data sets, but also consider the class-imbalance problem, which has been widely studied recently [17].

Recently, kernel methods have been used in software engineering to estimate software effort [25], but few articles report the performance of kernel-based predictors for software defect prediction. In this paper we focus on the nonlinear-separability and class-imbalance problems in software defect prediction. Based on kernel discrimination analysis, we propose a new classifier that transforms the low-dimensional input space into a high-dimensional feature space, so as to make the software data separable in the new space, and then computes local mean distances using the class distribution information to find the minority of defective modules.

3. Defect Prediction Based on Discrimination Analysis

In this section, we introduce the linear discriminant analysis technique and describe kernel discrimination analysis based on the linear version. Then, we propose a kernel discrimination classifier, which is more suitable for building software defect predictors given the inherent properties of the software data sets.

3.1. Discrimination Analysis on Software Data Sets

Since the software data sets used in defect prediction are drawn from varied systems that are written in different languages, developed by different companies, and applied in different domains, the individual data sets have quite different structures. Discrimination analysis theory provides a good method for gaining insight into the data distributions. Since the concept of discrimination analysis belongs to the expert knowledge of the artificial intelligence and knowledge engineering fields, we describe this technique here to migrate the knowledge from data science to software engineering.

Here, we would like to predict the defective and nondefective modules in software by solving a binary classification problem. Therefore, we show the discriminant analysis method constrained to two classes. Firstly, we calculate the between-class scatter matrix $S_B$ and the within-class scatter matrix $S_W$ for the training data. Using the symbols in Table 1, with $\mu_i$ the mean vector of class $C_i$, these two matrices are shown as (1) and (2):

$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^{T}$, (1)

$S_W = \sum_{i=1}^{2} \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^{T}$. (2)

The separation to be maximized along a projection direction $w$ is the ratio

$J(w) = \dfrac{w^{T} S_B w}{w^{T} S_W w}$. (3)

Then, in order to find the maximum points of $J(w)$, we set the derivative of (3) equal to zero. This yields the generalized eigenvalue problem

$S_B w = \lambda S_W w$, (4)

which means that when $w$ is an eigenvector of $S_W^{-1} S_B$, the separation $J(w)$ will be equal to the corresponding eigenvalue $\lambda$. By substituting (1) and (2) into (4), we get $w$ as follows:

$w \propto S_W^{-1} (\mu_1 - \mu_2)$. (5)

Suppose $W$ is the matrix whose columns are the leading eigenvectors; then, in the subspace, the original data points to be discriminated are projected as follows:

$y = W^{T} x$. (6)

Setting the dimension of $W$ equal to the dimension of the training data, we can obtain the classifier with the threshold given by the prior probability, as in [26]. We can see that the objective of the linear discriminant analysis approach is to maximize the ratio of between-class variance to within-class variance. Therefore, we can exploit it to analyze the software data so as to gain insight into the separability of the defective and nondefective classes. As shown in Section 4, the software data turn out to be nonlinearly separable.
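To make the procedure above concrete, the following minimal Python sketch computes the two-class Fisher direction of (5) and projects data onto it as in (6). It is an illustration under our own naming, with synthetic data; the small diagonal regularization term is our assumption added for numerical stability, not part of the original formulation.

```python
import numpy as np

def fisher_direction(X_pos, X_neg, reg=1e-6):
    """Fisher direction w proportional to S_W^{-1}(mu_1 - mu_2), eq. (5)."""
    mu1, mu2 = X_pos.mean(axis=0), X_neg.mean(axis=0)
    # Within-class scatter S_W, eq. (2): sum of the per-class scatters.
    S_w = (X_pos - mu1).T @ (X_pos - mu1) + (X_neg - mu2).T @ (X_neg - mu2)
    S_w += reg * np.eye(S_w.shape[0])   # assumed stabilizer for the inverse
    w = np.linalg.solve(S_w, mu1 - mu2)
    return w / np.linalg.norm(w)

# Toy usage: an imbalanced two-class sample, projected via eq. (6).
rng = np.random.default_rng(0)
X_defective = rng.normal(loc=1.0, size=(30, 5))    # minority class
X_clean = rng.normal(loc=0.0, size=(300, 5))       # majority class
w = fisher_direction(X_defective, X_clean)
print((X_defective @ w).mean(), (X_clean @ w).mean())
```

On such linearly generated data the two projected means separate cleanly; on multimodal data, as Section 4 shows, this linear projection leaves the classes mixed.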

3.2. Kernel Discrimination Classifier

To deal with software data sets which are nonlinearly separable, we perform a nonlinear mapping $\phi$ to transform the input vectors into a higher dimensional feature space. Then, a new classifier based on kernel discrimination analysis (KDC) is proposed to deal with the nonlinearly separable and class-imbalanced problems, which are often inherent in software defect prediction. In kernel discrimination analysis (KDA) [27], the between-class scatter matrix in the feature space is as follows:

$S_B^{\phi} = \sum_{i=1}^{2} n_i \, (\mu_i^{\phi} - \mu^{\phi})(\mu_i^{\phi} - \mu^{\phi})^{T}$. (7)

And the within-class scatter matrix is

$S_W^{\phi} = \sum_{i=1}^{2} \sum_{j=1}^{n_i} \bigl(\phi(x_j^{i}) - \mu_i^{\phi}\bigr)\bigl(\phi(x_j^{i}) - \mu_i^{\phi}\bigr)^{T}$, (8)

where $\mu_i^{\phi}$ is the class-conditional mean vector in the feature space, $\mu^{\phi}$ is the mean vector of the total instances, $x_j^{i}$ is the $j$th instance in the $i$th class, and $n_i$ is the number of instances of the $i$th class. Expressing the projection direction as an expansion $w = \sum_{j} \alpha_j \phi(x_j)$ over the training instances, the modified objective function is given as follows:

$J(\alpha) = \dfrac{\alpha^{T} M \alpha}{\alpha^{T} N \alpha}$, (9)

where $M$ and $N$ are the between-class and within-class scatter matrices rewritten in terms of the kernel function $k(x, x') = \langle \phi(x), \phi(x') \rangle$. To maximize $J(\alpha)$, (9) can be transformed into a nonlinear eigenvalue problem [28]. Then, we can find the maximum eigenvalues of $(N + \mu I)^{-1} M$, where $\mu I$ is a regularizing diagonal term introduced to improve the numerical stability of the inverse computation, as described by Shawe-Taylor and Cristianini [27].

After the analysis described above, we obtain the eigenvectors $\alpha_1, \ldots, \alpha_d$ corresponding to the maximum eigenvalues of this eigenvalue problem. Finally, the original features can be projected into the new space by the transformation matrix $A = [\alpha_1, \ldots, \alpha_d]$ as follows:

$\kappa(x) = \bigl( k(x_1, x), \ldots, k(x_n, x) \bigr)^{T}$, (10)

$y = A^{T} \kappa(x)$. (11)
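As an illustration of the regularized eigenproblem and the projection in (10) and (11), the following numpy sketch implements a two-class kernel Fisher discriminant with a Gaussian kernel. The construction of M and N follows the standard KDA formulation rather than the authors' code, and the function names and parameter defaults (gamma, mu) are our assumptions.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kda_fit(X, y, gamma=0.5, mu=1e-3):
    """Expansion coefficients alpha maximizing J(alpha) in (9)."""
    K = rbf_kernel(X, X, gamma)              # n x n Gram matrix
    n = K.shape[0]
    idx1, idx0 = np.where(y == 1)[0], np.where(y == 0)[0]
    m1 = K[:, idx1].mean(axis=1)             # kernelized class means
    m0 = K[:, idx0].mean(axis=1)
    M = np.outer(m1 - m0, m1 - m0)           # between-class matrix M
    N = np.zeros((n, n))                     # within-class matrix N
    for idx in (idx1, idx0):
        Ki, ni = K[:, idx], len(idx)
        N += Ki @ (np.eye(ni) - np.full((ni, ni), 1.0 / ni)) @ Ki.T
    # Leading eigenvector of the regularized problem (N + mu*I)^{-1} M.
    evals, evecs = np.linalg.eig(np.linalg.solve(N + mu * np.eye(n), M))
    return np.real(evecs[:, np.argmax(np.real(evals))])

def kda_project(X_train, alpha, X_new, gamma=0.5):
    """Project new points: y = sum_j alpha_j k(x_j, x), eqs. (10)-(11)."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```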

In order to combat the class-imbalance problem, we propose a kernel discrimination classification (KDC) rule based on local mean vectors. Considering the correlation between the transformed features and the class distribution, KDC recovers the loss caused by the class imbalance. Firstly, for every transformed instance $y$, we compute the set of its $k_i$ nearest neighbors from class $i$:

$N_i(y) = \{ y_j^{i} \mid j = 1, \ldots, k_i \}$, (12)

where $y_j^{i}$ is the $j$th nearest neighbor of $y$ in class $i$. When $k_i = 1$, this rule reduces to the special 1-NN form of KNN. Then, we calculate the distance from $y$ to the local mean vector of each class:

$d_i(y) = \bigl\| \, y - \frac{1}{k_i} \sum_{j=1}^{k_i} y_j^{i} \, \bigr\|$. (13)

We assign $y$ to the class $c$ for which the distance between the local mean vector of class $c$ and the query pattern is minimum:

$c = \arg\min_i d_i(y)$. (14)
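Below is a minimal sketch of the local-mean rule in (12)-(14), assuming the one-dimensional transformed feature produced by the projection above; the function name, the default neighborhood size, and the guard for very small classes are our own choices.

```python
import numpy as np

def local_mean_classify(query, Y_train, labels, k=3):
    """Assign `query` to the class whose local mean vector is closest."""
    best_class, best_dist = None, np.inf
    for c in np.unique(labels):
        Yc = Y_train[labels == c]
        k_c = min(k, len(Yc))                    # guard for tiny classes
        # k_c nearest neighbors of the query within class c, eq. (12).
        nearest = Yc[np.argsort(np.abs(Yc - query))[:k_c]]
        dist = abs(query - nearest.mean())       # distance of eq. (13)
        if dist < best_dist:                     # argmin of eq. (14)
            best_class, best_dist = c, dist
    return best_class
```

Setting k = 1 reduces the rule to 1-NN, while a larger k smooths the local mean and makes the minority class less sensitive to outliers.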

KDC is summarized in Algorithm 1. It originates from the need to combat the nonlinear-separability and class-imbalance problems in classification. It not only compensates for the imbalanced distribution of the data sets but also inherits the advantage of the kernel method, which can realize quite general high-dimensional feature space mappings.

Require:
 Normalized training data, X;
 Labels vector of the training data, L;
 Normalized test data, Z;
 Kernel type, kernel;
Ensure:
 KDC classifier, f;
(1) Compute the kernel matrix K of the training data X using kernel;
(2) Compute the kernelized class-mean vectors from K and L;
(3) Compute the between-class matrix M;
(4) Compute the within-class matrix N;
(5) Add the regularizing diagonal term: N ← N + μI;
(6) Solve the eigenvalue problem of (N + μI)⁻¹M;
(7) Keep the eigenvectors α₁, …, α_d of the maximum eigenvalues;
(8) Form the transformation matrix A = [α₁, …, α_d];
(9) Project the training data into the new space using (10) and (11);
(10) Compute the kernel vector κ(z) between each test instance z ∈ Z and X;
(11) Project the test data into the new space using (11);
(12) for each projected test instance y do
(13)  for each class i do
(14)   Collect the neighbourhood of y in class i using (12);
(15)   Calculate the mean distance using (13);
(16)  Classify y by (14);
(17) return f;

4. Experiments

In this section we evaluate the KDC algorithm empirically. First of all, we use two types of discrimination methods, LDA and KDA, to analyze the well-known data sets used in software defect prediction. Then, based on the analysis results, we investigate the performance of our method compared with other state-of-the-art methods. We focus on the visualization and interpretation of the multivariate data so as to analyze the inherent properties of the data sets used in software defect prediction.

4.1. Data Set

In this study, twelve well-known data sets in software defect prediction are analyzed, including the eight sets used in [2] as well as four additional data sets used in [29], as shown in Tables 2 and 3. They come from NASA projects developed at different sites by different teams and from projects of a Turkish software company (SOFTLAB) related to embedded controllers for white goods, respectively. Since these data sets are collected from different companies and from different projects developed in different languages, they follow different distributions.

4.2. Discrimination Analysis on Software Data Sets

Firstly, we conduct the discrimination analysis on the data sets from NASA. In each figure, panel (a) shows the distributions of the defective and nondefective modules in two dimensions. Panels (b) and (c) depict the histograms of the first feature values obtained by LDA and KDA; the vertical axis corresponds to the number of instances, and the horizontal axis to the projected values on the first feature. Note that the two dimensions of the data shown in panel (a) are the first two input features, while the first features in panels (b) and (c) are obtained by projection from all the input features. The first two dimensions of the original pc1 data are depicted in Figure 2(a). Since the distributions of the first two dimensions of the positive and negative data are very similar, it is very hard to classify the two types without discrimination analysis. However, even when projected onto one dimension using LDA, this data set remains mixed together, as shown in Figure 2(b). In contrast, when conducting KDA, we find that the different patterns can be separated, as shown in Figure 2(c). This means that the pc1 data set is a multimodal data set, which is nonlinearly separable. The results of the analysis on the other data sets from the different companies contracted with NASA are very similar, as shown in Figures 2, 3, 4, 5, 6, 7, 8, and 9. In order to investigate the inherent properties of the data sets used in software defect prediction, we also apply the discrimination analysis to the SOFTLAB data sets from the local company, as shown in Figures 10, 11, 12, and 13. We can see that all the data sets have the property of nonlinear separability, which requires more sophisticated classification.

Panel (c) of each figure shows that KDA separates the defective modules from the nondefective modules reasonably well. In addition to maximizing the between-class variance, KDA also tries to minimize the within-class variance of each class. Another interesting finding from the figures is that the first feature obtained by KDA has a strong positive correlation with the defective modules. While LDA is incapable of providing correct classification because of its linear nature, KDA can usually provide correct classification through nonlinear transformations. It is therefore able to produce linearly separable features for data that have bad linear separability in the input space. KDA finds a nonlinear projection direction by which the original inputs are mapped into a high-dimensional feature space, where they become linearly separable, and then LDA is employed.

KDA produces a nonlinear decision boundary, which is very useful in defect prediction since the classes are not always well separated by a linear function. After transforming the low-dimensional input space into a high-dimensional feature space, the data set in the new feature space becomes linearly separable. Theoretically speaking, the kernel function is able to implicitly map the input space, which may not be linearly separable, into an arbitrarily high-dimensional feature space that can be linearly separable. Furthermore, we calculate the local mean distance for each class to combat the class-imbalance problem, as shown in Algorithm 1. The performance of KDC can also be seen in the following experiments.

4.3. Performance Measures

To evaluate the performance of the prediction model, we use the confusion matrix of the predictor, following Witten and Frank [30]. In the confusion matrix, True Positive (TP) is the number of defective modules predicted as defective; False Negative (FN) is the number of defective modules predicted as nondefective; False Positive (FP) is the number of nondefective modules predicted as defective; and True Negative (TN) is the number of nondefective modules predicted as nondefective.

Since the F-measure [31] serves as a good singular performance metric when dealing with the class-imbalance problem, it is widely used in the software defect prediction field [32, 33]. It can be expressed as follows:

$F\text{-}measure = \dfrac{2 \times recall \times precision}{recall + precision}$,

where $recall = TP / (TP + FN)$ is the proportion of true defective modules among all defective modules, and $precision = TP / (TP + FP)$ is the proportion of true defective modules among the modules predicted as defective.
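For illustration, here is a small helper computing the F-measure from the confusion-matrix counts defined above, treating the defective class as positive; the zero guards are our own additions for degenerate cases.

```python
def f_measure(tp, fn, fp):
    recall = tp / (tp + fn) if tp + fn else 0.0     # defects found
    precision = tp / (tp + fp) if tp + fp else 0.0  # alarms that are real
    if recall + precision == 0.0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# Example: 30 defects caught, 10 missed, 20 false alarms -> about 0.667.
print(f_measure(tp=30, fn=10, fp=20))
```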

4.4. Result

In order to investigate the performance of KDC (a Gaussian kernel is used here), we compare it with J4.8 decision trees [15, 34], naive Bayes [2], random forest (RF) [35], AdaBoost [17], Smote [36], and a linear discrimination analysis based classifier (LDC) [26]. The details are as follows.

(i) Each data set is divided into ten random partitions.
(ii) For each method, the defect predictor is built from nine partitions and tested on the remaining partition.
(iii) The steps above are run ten times.
(iv) To compare the results of these methods, we conducted the Mann-Whitney U-test (the Mann-Whitney test is a nonparametric statistical hypothesis test for comparing two independent groups of sampled data that does not assume a normal distribution; for details see [37]) with a significance level of 5% (α = 0.05). That is, we speak of two results for a data set as being "significantly different" only if the difference is statistically significant at the 0.05 level according to the Mann-Whitney U-test.
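The significance test in step (iv) can be reproduced with, for example, SciPy; the F-measure samples below are made-up placeholders standing in for two methods' ten-run results, not values from Table 4.

```python
from scipy.stats import mannwhitneyu

f_kdc = [0.61, 0.58, 0.64, 0.60, 0.57, 0.63, 0.59, 0.62, 0.60, 0.61]
f_j48 = [0.42, 0.47, 0.40, 0.45, 0.44, 0.41, 0.46, 0.43, 0.45, 0.42]
stat, p = mannwhitneyu(f_kdc, f_j48, alternative="two-sided")
print(f"U = {stat}, p = {p:.4f}, significant at 0.05: {p < 0.05}")
```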

We calculate the means and variances of the F-measure values over the ten runs for these methods, as summarized in Table 4. It shows that, on all the data sets, KDC achieves significantly higher F-measure values than J4.8. Although all the methods fail to build practical predictors on data sets such as pc2, we can still consider that KDC has the best performance. Note that this data set has an extreme imbalance ratio, which is too low (only 21 defective modules among 745 modules) to convey enough information about the defective modules.

4.5. Threats to Validity

As with every empirical experiment, our results are subject to potential threats to validity. Firstly, in this study, we validated our findings on open data sets with different characteristics from two different organizations, namely, NASA and SOFTLAB. By doing so, we have gained more confidence in the validity of the results reported in this paper. Secondly, since systems are developed for different domains and different applications, one might question the applicability of our method in industrial practice. Therefore, replicated studies examining our method on other software systems will be useful to generalize our findings and improve our method. Finally, to the best of our knowledge, there is little current criticism of the Mann-Whitney test for comparing data miners, unlike many other statistical tests. Although empirical studies are sometimes criticized on such grounds, this study shows encouraging results with the kernel discrimination analysis method. Our work should encourage more researchers to run similar studies on more kernel-based methods and deepen the understanding of the inherent properties of software engineering data. Our study could be replicated with more projects and different metrics, and extended with more sophisticated methods.

5. Conclusion and Future Work

In this paper, we addressed the analysis of the properties of the data sets used in software defect prediction. By conducting linear discrimination analysis on software data sets, we found that these data could not be separated by LDA. The projections produced by LDA and KDA showed that these data sets have the property of nonlinear separability. Motivated by this analysis, we proposed a new algorithm, KDC, based on kernel discrimination analysis to build defect predictors. It tackles the nonlinear-separability and class-imbalance problems. Experiments show that KDC performs well among the comparative methods on the test sets.

There are several ways in which this work can be improved. First, in analyzing the structure of the software data sets, we used only one type of discrimination method; other data analysis methods may yield more information about the data for building defect predictors. Second, in the future we will investigate other kernel-based algorithms for software defect prediction on more software data sets.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was partially supported by NSFC (61070151, 61373147) and the Xiamen Scientific Research Foundation (3502Z20123037). The authors thank the anonymous reviewers for their helpful comments.