Abstract

This paper deals with how kernel methods can be used for software defect prediction, since class imbalance can greatly reduce the performance of defect prediction. In this paper, two classifiers, namely, the asymmetric kernel partial least squares classifier (AKPLSC) and the asymmetric kernel principal component analysis classifier (AKPCAC), are proposed for solving the class imbalance problem. This is achieved by applying a kernel function to the asymmetric partial least squares classifier and the asymmetric principal component analysis classifier, respectively. The kernel function used for both classifiers is the Gaussian function. Experiments conducted on NASA and SOFTLAB data sets using the F-measure, Friedman's test, and Tukey's test confirm the validity of our methods.

1. Introduction

Software defect prediction is an essential part of software quality analysis and has been extensively studied in the domain of software-reliability engineering [15]. However, as pointed out by Menzies et al. [2] and Seiffert et al. [4], the performance of defect predictors can be greatly degraded by the class imbalance problem of real-world data sets. Here "class imbalance" means that the majority of defects in a software system are located in a small percentage of the program modules. Current approaches to the class imbalance problem can be roughly categorized into two groups: data-level methods and algorithm-level methods, as reported in [4]. The literature [4] shows that the algorithm-level method AdaBoost almost always outperforms even the best data-level methods in software defect prediction. AdaBoost is a typical adaptive algorithm which has received great attention since Freund and Schapire's proposal [6]. AdaBoost attempts to reduce the bias generated by majority class data by updating the weights of instances dynamically according to the errors in previous learning. Some other studies improved dimension reduction methods for the class imbalance problem by means of partial least squares (PLS) [7], linear discriminant analysis (LDA) [8], and principal component analysis (PCA) [9, 10]. Although PLS was not inherently designed for problems of classification and discrimination, it is widely used in many areas that need class discrimination. The authors of [7] reported that PLS is rarely followed by an actual discriminant analysis on the scores and that the classification rule is rarely given a formal interpretation; still, the method often produces good separation. Based on this previous work, Qu et al. recently investigated the effect of PLS in unbalanced pattern classification and reported that, beyond dimension reduction, PLS generates features favorable for classification. Thereafter, they proposed an asymmetric partial least squares (APLS) classifier to deal with the class imbalance problem and showed that APLS outperforms other algorithms because it can extract favorable features for unbalanced classification. As for PCA, it is an effective linear transformation, which maps high-dimensional data to a lower dimensional space. Based on PCA, the authors of [11] proposed kernel principal component analysis (KPCA), which performs a nonlinear mapping of an input vector to a higher dimensional feature space, where a kernel function is introduced to reduce the computation needed to map the data nonlinearly into the feature space. Linear PCA is then used in this feature space.

While both APLS and KPCA are of great value, they have their own disadvantages. For example, the APLS classifier is a bilinear classifier, in which the data are mapped to a bilinear subspace, which is, to some degree, obscure and not easy to implement. The KPCA regression model does not consider the correlation between principal components and the class attribution, and PCA dimension reduction is inevitably affected by asymmetric distribution. In this paper, we propose two kernel-based learning methods to solve the class imbalance problem, called the asymmetric kernel partial least squares classifier (AKPLSC) and the asymmetric kernel principal component analysis classifier (AKPCAC), respectively. The former is able to nonlinearly extract favorable features and retrieve the loss caused by the class imbalance problem, while the latter is more adaptive to imbalanced data sets.

It is worth explaining the relationship between this paper and our previous papers [12, 13]. The AKPLSC and AKPCAC were first proposed in [12, 13], respectively. However, we recently found some errors as we continued this work, and due to these errors, the AKPLSC and AKPCAC proposed in [12, 13] show superiority on only part of the data sets. We carefully rectified the source code and then tested the AKPCAC and AKPLSC again on the whole data sets by means of statistical tools, such as Friedman's test and Tukey's test. The outcomes show that our classifiers indeed outperform the others, namely, APLSC, KPCAC, AdaBoost, and SMOTE. We also carefully reexamined the theory and the experimental results, which are presented in this paper in more detail.

2. State of the Art

In software defect prediction, $L$ denotes the labeled example set with size $n$ and $U$ denotes the unlabeled example set with size $m$. For labeled examples $(x_i, y_i) \in L$, the defective modules are labeled "+1" and the nondefective modules are labeled "−1". Software defect data sets are highly imbalanced; that is, the examples of the minority class (defective modules) are heavily underrepresented in comparison to the examples of the majority class (nondefective modules). Thereby, many algorithms have been proposed to cope with this problem, as will be seen below.

2.1. Software Defect Predictor Related to Partial Least Squares

Linear partial least squares (PLS) [7] is an effective linear transformation, which performs the regression on a subset of extracted latent variables. Kernel PLS [14] first performs a nonlinear mapping, $\Phi: x \mapsto \Phi(x)$, to project an input vector into a higher dimensional feature space $\mathcal{F}$, in which linear PLS is then used.

Given the class centers $c_{+}$ and $c_{-}$, the radii of the class regions $r_{+}$ and $r_{-}$, and the overlapping parameter $\delta$, the relationship of the two classes can be expressed as $\lVert c_{+} - c_{-} \rVert = \delta (r_{+} + r_{-})$. The parameter $\delta$ indicates the level of overlapping between the regions of the two classes (the smaller the value of $\delta$ is, the higher the overlapping will be).

APLSC can be expressed as $f(x) = \operatorname{sgn}\bigl(\sum_{i=1}^{p} d_i t_i + b\bigr)$, which is derived from the regression model of the linear PLS, $\hat{y} = \sum_{i=1}^{p} d_i t_i$, where $p$ is the number of latent variables, $t_i$ is the $i$th score of the testing data, $d_i$ indicates the direction of the $i$th score, and the bias $b$ is determined by the class centers and radii defined above.

APLSC suffers from high overlapping, especially when the data sets are not linearly separable [15]. A suggestion for solving such an overlapping problem is to use a kernel method. Kernel PLS [14] corresponds to solving the following eigenvalue equation:

$\Phi \Phi^{T} \Psi \Psi^{T} \mathbf{t} = \lambda \mathbf{t},  \quad (1)$

where $\Phi$ and $\Psi$ denote the matrix of mapped X-space data and the matrix of mapped Y-space data in the feature space $\mathcal{F}$, respectively. Nonlinear feature selection methods can reduce the overlapping level of the two classes, but the class imbalance problem makes them fail to distinguish the minority class [15]. In order to retrieve the loss caused by the class imbalance problem, we want to obtain the bias of the kernel PLS classification (KPLSC) [14].

Different from the APLSC, the kernel PLS regression is $\hat{y}(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x)$, where $n$ is the size of the labeled example set, $K(\cdot,\cdot)$ is a kernel function, and $\alpha_i$ is the dual regression coefficient. Consequently, we may combine the APLSC and kernel PLS to obtain the asymmetric kernel PLS, as will be seen in Section 3.1.
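To make the dual form concrete, the following Python sketch (our own illustration, not the code used in the experiments; the kernel width sigma and the function names are chosen for exposition) evaluates the kernel PLS regression output for test points, given dual coefficients $\alpha$ learned on the training data:

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # K(a, b) = exp(-||a - b||^2 / (2 * sigma^2)) for every pair of rows of A and B.
    diff = A[:, None, :] - B[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=2) / (2.0 * sigma ** 2))

def kpls_predict(X_train, alpha, X_test, sigma=1.0):
    # Dual-form kernel PLS regression: y_hat(x) = sum_i alpha_i * K(x_i, x).
    K_test = gaussian_kernel(X_test, X_train, sigma)   # shape (m, n)
    return K_test @ alpha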

2.2. Kernel Principal Component Analysis Classifier for Software Defect Prediction

Principal component analysis (PCA) [10] is an effective linear transformation, which maps high-dimensional data to a lower dimensional space. Kernel principal component analysis (KPCA) [11] first performs a nonlinear mapping $\Phi$ to transform an input vector to a higher dimensional feature space. Linear PCA is then used in this feature space.

For both of the algorithms demonstrated in [10, 11], the input data are centered in the original space and in the transformed high-dimensional space; that is, $\sum_{i=1}^{n} x_i = 0$ and $\sum_{i=1}^{n} \Phi(x_i) = 0$, where $n$ is the number of labeled data and $x_i$ is the $i$th instance of the data set. In PCA, the correlation matrix $C = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^{T}$ should be diagonalized, while, in KPCA, the correlation matrix $\bar{C} = \frac{1}{n}\sum_{i=1}^{n} \Phi(x_i)\Phi(x_i)^{T}$ should be diagonalized. This is equivalent to solving the eigenvalue problem $\lambda \mathbf{v} = \bar{C}\mathbf{v}$, where $\lambda$ is an eigenvalue and $\mathbf{v}$ is the corresponding eigenvector in KPCA. It can also be written as $n\lambda\boldsymbol{\alpha} = K\boldsymbol{\alpha}$, where $K$ with $K_{ij} = \Phi(x_i)^{T}\Phi(x_j)$ is the kernel matrix and $\mathbf{v} = \sum_{i=1}^{n}\alpha_i\Phi(x_i)$.

The kernel principal component regression algorithm was proposed by Rosipal et al. [11]. The standard regression model in the transformed feature space can be written as

$f(x) = \sum_{k=1}^{p} w_k \Phi(x)^{T}\mathbf{v}_k + b,  \quad (2)$

where $p$ is the number of components, $w_k$ is the $k$th primal regression coefficient, and $b$ is the regression bias. Consider $\mathbf{v}_k = \sum_{i=1}^{n} \alpha_{k,i}\Phi(x_i)$, where $\boldsymbol{\alpha}_k$ is the $k$th eigenvector of the kernel matrix $K$; $\mathbf{v}_k$ and $\lambda_k$ are the eigenvectors and eigenvalues of the correlation matrix, respectively.
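For illustration, the following Python sketch (ours, under the assumption that the kernel matrix fits in memory; the function names are ours) centers a kernel matrix, extracts the leading kernel principal components by solving the dual eigenvalue problem, and fits the regression model (2) on the resulting scores by ordinary least squares:

import numpy as np

def center_kernel(K):
    # Double-center the kernel matrix so that the mapped data have zero mean in feature space.
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return J @ K @ J

def kpca_scores(K_centered, p):
    # Dual KPCA: keep the p leading eigenvectors of the centered kernel matrix,
    # normalised so that the principal axes in feature space have unit length.
    eigvals, eigvecs = np.linalg.eigh(K_centered)          # ascending order
    idx = np.argsort(eigvals)[::-1][:p]
    A = eigvecs[:, idx] / np.sqrt(np.maximum(eigvals[idx], 1e-12))
    return K_centered @ A                                  # scores (projections)

def kpcr_fit(K_centered, y, p):
    # Kernel principal component regression: least squares on the scores plus a bias term.
    T = kpca_scores(K_centered, p)
    coef, *_ = np.linalg.lstsq(np.column_stack([T, np.ones(len(y))]), y, rcond=None)
    return coef[:-1], coef[-1]                             # regression weights and bias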

2.3. Data Set

There are many data sets for machine learning experiments, such as the UCI repository [16] and the PROMISE data repository [17] (since the contributors maintain these data sets continuously, the metrics listed in Table 1 may vary at different times). What we use in this paper are the latest versions updated in June 2012, which differ from the data sets used in our previous papers [12, 13]; they are a data collection from real-world software engineering projects. The choice of data set depends on the area of machine learning to which it will be applied. In this paper, the experimental data sets come from NASA and SOFTLAB and can be obtained from PROMISE [17], as shown in Tables 1 and 2. These software modules were developed in different languages, at different sites, by different teams, as shown in Table 1. The SOFTLAB data sets (ar3, ar4, and ar5) are drawn from three controller systems for a washing machine, a dishwasher, and a refrigerator, respectively. They are all written in C. The rest are from NASA projects and are all written in C++, except for kc3, which is written in Java. All the metrics are computed according to [17].
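As a practical note, the PROMISE data sets are typically distributed as ARFF files; the following Python sketch (the file name "kc1.arff" and the label handling are illustrative assumptions, since attribute names differ between data sets) shows one way to load such a file and produce a {+1, −1} label vector:

import numpy as np
import pandas as pd
from scipy.io import arff

data, meta = arff.loadarff("kc1.arff")      # hypothetical local copy of a PROMISE data set
df = pd.DataFrame(data)
label_col = df.columns[-1]                  # the defect label is usually the last attribute
labels = df[label_col].apply(lambda v: v.decode() if isinstance(v, bytes) else str(v))
y = np.where(labels.str.lower().isin(["true", "yes", "y", "1"]), 1, -1)
X = df.drop(columns=[label_col]).to_numpy(dtype=float)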

3. Design the Asymmetric Classifiers Based on Kernel Method

3.1. The Asymmetric Kernel Partial Least Squares Classifier (AKPLSC)

As we illustrated in Section 2.1, APLSC can be expressed as $f(x) = \operatorname{sgn}\bigl(\sum_{i=1}^{p} d_i t_i + b\bigr)$ and the kernel PLS regression is $\hat{y}(x) = \sum_{i=1}^{n}\alpha_i K(x_i, x)$; thus the AKPLSC can be characterized as

$f(x) = \operatorname{sgn}\Bigl(\sum_{i=1}^{n} \alpha_i K(x_i, x) + b\Bigr),  \quad (3)$

where $\alpha_i$ is the dual regression coefficient, which can be obtained from kernel PLS as shown in Algorithm 1, and $b$ is the bias of the classifier.

Input: Labeled and unlabeled data sets, $L$ and $U$; number of components, $p$.
Output: Asymmetric Kernel Partial Least Squares Classifier, $f(x)$;
Method:
(1) Compute the kernel matrix $K$ on $L$ and center it according to (5);
(2) $K_1 \leftarrow K$, $Y_1 \leftarrow Y$, where $K$ is the kernel matrix and $Y$ is the label vector.
(3) for $i = 1$ to $p$ do
(4)   $u_i \leftarrow Y_i$, where $u_i$ is a projection direction.
(5)  repeat
(6)   $t_i \leftarrow K_i u_i$, $t_i \leftarrow t_i / \lVert t_i \rVert$
(7)   $u_i \leftarrow Y_i Y_i^{T} t_i$, $u_i \leftarrow u_i / \lVert u_i \rVert$
(8)  until convergence
(9)   $t_i \leftarrow K_i u_i / \lVert K_i u_i \rVert$, where $t_i$ is the score
(10)   $d_i \leftarrow Y_i^{T} t_i$, where $d_i$ is the direction of the score
(11)   $K_{i+1} \leftarrow (I - t_i t_i^{T}) K_i (I - t_i t_i^{T})$, where $K_{i+1}$ is the deflation of $K_i$
(12)   $Y_{i+1} \leftarrow Y_i - t_i t_i^{T} Y_i$
(13) end for
(14) $T \leftarrow [t_1, \ldots, t_p]$, $W \leftarrow [u_1, \ldots, u_p]$
(15) $\boldsymbol{\alpha} \leftarrow W (T^{T} K W)^{-1} T^{T} Y$, where $\boldsymbol{\alpha}$ is the vector of dual regression coefficients
(16) Calculate $b$ according to (4);
(17) $f(x) \leftarrow \operatorname{sgn}\bigl(\sum_{i=1}^{n} \alpha_i K(x_i, x) + b\bigr)$;
(18) return $f(x)$;
End Algorithm AKPLSC.

Since kernel PLS puts most of the information on the first dimension, the bias $b$ in the AKPLSC can be computed similarly to [15] as

$b = -d_1 \, \frac{r_{-} c_{+} + r_{+} c_{-}}{r_{+} + r_{-}},  \quad (4)$

where $d_1$ indicates the direction of the first score and the centers ($c_{+}$, $c_{-}$) and radii ($r_{+}$, $r_{-}$) are computed on the first score $t_1$, which can be obtained from (1). Then we move the origin to the center of mass by employing data centering, as reported in [14]:

$\tilde{K} = \Bigl(I - \tfrac{1}{n}\mathbf{1}_n \mathbf{1}_n^{T}\Bigr) K \Bigl(I - \tfrac{1}{n}\mathbf{1}_n \mathbf{1}_n^{T}\Bigr),  \quad (5)$

where $\mathbf{1}_n$ is a vector with all elements equal to 1. After data centering, the AKPLSC can be described as shown in Algorithm 1.
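To make Algorithm 1 concrete, the following Python sketch implements the dual NIPALS iteration of kernel PLS [14] together with an asymmetric bias computed on the first score. It is a simplified sketch under our stated assumptions (in particular, the class radii are estimated by standard deviations, and the bias follows our reconstruction of (4)); it is not the exact code used in the experiments:

import numpy as np

def kernel_pls_dual(K, y, p, tol=1e-8, max_iter=500):
    # K: centered n x n kernel matrix; y: labels in {-1, +1}; p: number of components.
    n = K.shape[0]
    Kd, Yd = K.copy(), y.astype(float).reshape(n, 1).copy()
    T, W = [], []
    for _ in range(p):
        u = Yd[:, [0]].copy()
        for _ in range(max_iter):
            t = Kd @ u
            t /= np.linalg.norm(t) + 1e-12
            u_new = Yd @ (Yd.T @ t)
            u_new /= np.linalg.norm(u_new) + 1e-12
            if np.linalg.norm(u_new - u) < tol:
                u = u_new
                break
            u = u_new
        T.append(t); W.append(u)
        D = np.eye(n) - t @ t.T                      # deflation operator
        Kd = D @ Kd @ D
        Yd = Yd - t @ (t.T @ Yd)
    T, W = np.hstack(T), np.hstack(W)
    # Dual regression coefficients: alpha = W (T' K W)^{-1} T' y  (cf. step (15) of Algorithm 1).
    alpha = W @ np.linalg.solve(T.T @ K @ W, T.T @ y.reshape(n, 1))
    return alpha.ravel(), T

def asymmetric_bias(t1, y):
    # Threshold on the first score between the class centers, weighted by the class
    # radii (standard deviations are used as radii here -- an assumption of this sketch).
    pos, neg = t1[y == 1], t1[y == -1]
    c_pos, c_neg = pos.mean(), neg.mean()
    r_pos, r_neg = pos.std(), neg.std()
    return -(r_neg * c_pos + r_pos * c_neg) / (r_pos + r_neg + 1e-12)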

3.2. The Asymmetric Kernel Principal Component Analysis Classifier (AKPCAC)

The KPCA regression model does not consider the correlation between principal components and the class attribution. PCA dimension reduction is inevitably affected by asymmetric distribution [15]. We analyze the effect of class imbalance on KPCA. Considering the class imbalance problem, we propose an asymmetric kernel principal component analysis classifier (AKPCAC), which retrieves the loss caused by this effect.

Suppose that $S_b$ denotes the between-class scatter matrix and $S_w$ denotes the within-class scatter matrix, where $\mu_i$ is the class-conditional mean vector, $\mu$ is the mean vector of the total instances, $\Phi(x_j^{(i)})$ is the $j$th instance in the $i$th class (mapped into the feature space), and $n_i$ is the number of instances of the $i$th class. The total scatter matrix in the feature space can be expanded as

$S_t = \sum_{i=1}^{2}\sum_{j=1}^{n_i} \bigl(\Phi(x_j^{(i)}) - \mu_i\bigr)\bigl(\Phi(x_j^{(i)}) - \mu_i\bigr)^{T} + \sum_{i=1}^{2} n_i (\mu_i - \mu)(\mu_i - \mu)^{T} + \sum_{i=1}^{2}\sum_{j=1}^{n_i} \bigl(\Phi(x_j^{(i)}) - \mu_i\bigr)(\mu_i - \mu)^{T} + \sum_{i=1}^{2}\sum_{j=1}^{n_i} (\mu_i - \mu)\bigl(\Phi(x_j^{(i)}) - \mu_i\bigr)^{T}.  \quad (6)$

The third term of (6) can be rewritten as

$\sum_{i=1}^{2}\Bigl(\sum_{j=1}^{n_i}\Phi(x_j^{(i)}) - n_i \mu_i\Bigr)(\mu_i - \mu)^{T}.  \quad (7)$

Note that $\sum_{j=1}^{n_i}\Phi(x_j^{(i)}) = n_i \mu_i$. Then the third term and the fourth term of (6) are equal to zero. Thus, we have the relation $S_t = S_b + n_{+} C_{+} + n_{-} C_{-}$, where $n_{+}$ is the number of positive instances, $n_{-}$ is the number of negative instances, $C_{+}$ is the positive covariance matrix, and $C_{-}$ is the negative covariance matrix. Since the class distribution has a great impact on $S_t$, the class imbalance also impacts the diagonalization problem of KPCA.

In order to combat the class imbalance problem, we propose the AKPCAC, based on the kernel method. It considers the correlation between principal components and the class distribution. The imbalance ratio can be denoted as $r = \sum_{i=1}^{n} I(y_i = +1) / \sum_{i=1}^{n} I(y_i = -1)$, which is the ratio of the positive instances to the negative instances of the training data, where $I(\cdot)$ is an indicator function: $I(y_i = +1) = 1$ if $y_i = +1$, zero otherwise. We assume that future test examples are drawn from the same distribution, so the imbalance ratio of the training data is the same as that of the test data. Then, we have

$\frac{\#\{x_j \in U : f(x_j) + \Delta b > 0\}}{\#\{x_j \in U : f(x_j) + \Delta b \le 0\}} = r,  \quad (8)$

where $\Delta b$ is the bias of the classifier and $f(x_j)$ is the regression result of $x_j$. $f(x_j)$ can be computed by the regression model equation (2). Note that the regression is conducted on the principal components. Solving this one-variable equation, we get

$\Delta b = -f_{(\lceil m/(1+r) \rceil)},  \quad (9)$

where $f_{(k)}$ denotes the $k$th smallest regression output on the unlabeled set of size $m$.

Based on the principal components, (9) describes the deviation of the classifier in detail. This deviation may be caused by class imbalance, noise, or other unintended factors. In order to remove this harmful effect, we compensate for the deviation. By transforming the regression model (2), the classifier model can be written as

$f'(x) = \operatorname{sgn}\bigl(f(x) + \Delta b\bigr),  \quad (10)$

where $f(x)$ is the regression result of (2) and $\Delta b$ is the bias compensation given by (9).

AKPCAC is summarized in Algorithm 2. Since the AKPCAC was designed to reduce the effect of class imbalance on classification, it inherently has the advantage of the kernel method, which can handle quite general feature space mappings. In this paper, again, we have illustrated how the unreliable dimensions produced by KPCA can be removed; thereby, the imbalance problem of PCA-based classification has also been addressed.

Input: The set of labeled samples, $L$;
The set of unlabeled samples, $U$;
Output: Asymmetric Kernel Principal Component Analysis Classifier, $f'(x)$;
Method:
(1) Compute the kernel matrix $K$ on $L$;
(2) $\tilde{K} \leftarrow (I - \frac{1}{n}\mathbf{1}_n \mathbf{1}_n^{T}) K (I - \frac{1}{n}\mathbf{1}_n \mathbf{1}_n^{T})$, where $\mathbf{1}_n$ is a vector with all elements equal to 1.
(3) Solve $n\lambda\boldsymbol{\alpha} = \tilde{K}\boldsymbol{\alpha}$ for the leading eigenvectors;
(4) Fit the regression model (2) on the principal components against $Y$; % $Y$ is the label vector
(5) Calculate $\Delta b$ according to (9), (10);
(6) return $f'(x)$;
End Algorithm AKPCAC.
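The following Python sketch mirrors Algorithm 2 under our assumptions (the ratio-matching bias follows our reconstruction of (8) and (9), and, for brevity, the bias is estimated on the training outputs rather than on the unlabeled set); it is an illustration rather than the exact experimental code:

import numpy as np

def akpcac_fit(K, y, p):
    # K: uncentered n x n kernel matrix of the labeled data; y: labels in {-1, +1}.
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                                     # kernel centering, cf. (5)
    eigvals, eigvecs = np.linalg.eigh(Kc)
    idx = np.argsort(eigvals)[::-1][:p]
    A = eigvecs[:, idx] / np.sqrt(np.maximum(eigvals[idx], 1e-12))
    T = Kc @ A                                         # scores on the principal components
    coef, *_ = np.linalg.lstsq(np.column_stack([T, np.ones(n)]), y.astype(float), rcond=None)
    f_train = T @ coef[:-1] + coef[-1]                 # regression outputs, cf. (2)
    r = np.sum(y == 1) / np.sum(y == -1)               # imbalance ratio
    k_neg = int(np.ceil(n / (1.0 + r)))                # number of instances predicted negative
    delta_b = -np.sort(f_train)[k_neg - 1]             # bias compensation, cf. (9)
    return A, coef, delta_b

def akpcac_predict(K_test_train, K_train, A, coef, delta_b):
    # K_test_train: m x n kernel matrix between test and training data (uncentered).
    n = K_train.shape[0]
    one_n = np.ones((n, n)) / n
    one_mn = np.ones((K_test_train.shape[0], n)) / n
    Kc_test = K_test_train - one_mn @ K_train - K_test_train @ one_n + one_mn @ K_train @ one_n
    f = (Kc_test @ A) @ coef[:-1] + coef[-1]
    return np.sign(f + delta_b)                        # classifier, cf. (10)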

4. Experimental Result

The experiments are conducted on data sets from NASA and SOFTLAB. The Gaussian kernel function is adopted for the performance investigation of both AKPLSC and AKPCAC. The efficiency is evaluated by the F-measure and Friedman's test, as will be explained presently.

4.1. Validation Method Using F-Measure

The F-measure is widely used for assessing a test's accuracy. It considers both the precision and the recall to compute the score. Precision is defined as the number of correct results divided by the number of all returned results. Recall is the number of correct results divided by the number of results that should have been returned. For the clarity of this paper, we give a short explanation of the F-measure as below. Obviously, there are four possible outcomes of a predictor:
(1) TP: true positives are modules classified correctly as defective modules;
(2) FP: false positives refer to nondefective modules incorrectly labeled as defective;
(3) TN: true negatives correspond to correctly classified nondefective modules;
(4) FN: false negatives are defective modules incorrectly classified as nondefective.

Thereby, the precision is defined as $P = \mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$ and the recall as $R = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$.

The general formula of the F-measure is

$F_{\beta} = (1 + \beta^{2}) \cdot \frac{P \cdot R}{\beta^{2} \cdot P + R},  \quad (11)$

where $\beta$ is a positive real number. According to the definitions of $P$ and $R$, (11) can be rewritten as

$F_{\beta} = \frac{(1 + \beta^{2}) \cdot \mathrm{TP}}{(1 + \beta^{2}) \cdot \mathrm{TP} + \beta^{2} \cdot \mathrm{FN} + \mathrm{FP}}.  \quad (12)$

Generally, there are three commonly used F-measures: $F_1$ (which balances $P$ and $R$), $F_2$ (which puts more emphasis on $R$ than $P$), and $F_{0.5}$ (which weights $P$ higher than $R$). In this paper, $F_1$ is used to evaluate the efficiency of the different classifiers. The $F_1$ measure can be interpreted as a weighted average of the precision and recall. It reaches its best value at 1 and its worst score at 0.
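As a small numerical check of (12) (the confusion counts below are illustrative only), the F-measure can be computed directly from the confusion matrix:

def f_measure(tp, fp, fn, beta=1.0):
    # F_beta computed from the confusion counts, cf. (12).
    b2 = beta * beta
    return (1.0 + b2) * tp / ((1.0 + b2) * tp + b2 * fn + fp)

# Example: TP = 30, FP = 10, FN = 20 gives P = 0.75, R = 0.6, and F1 = 0.667.
print(f_measure(30, 10, 20))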

We compare the F-measure values of different predictors, including AKPLSC, AKPCAC, APLSC [15], KPCAC [11], AdaBoost [4], and SMOTE [18]. The results are listed in Table 3. For each data set, we perform a $k$-fold cross validation.

From the table we may see clearly that the AKPLSC and the AKPCAC are superior to the other four classifiers, which validates the contributions of this paper.
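For reference, a minimal sketch of the cross-validation protocol, assuming scikit-learn is available and that each classifier exposes fit/predict methods (the fold count and random seed are illustrative), could look as follows; stratified folds keep the class ratio of every fold close to that of the full data set:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

def cross_validated_f1(clf, X, y, k=10, seed=0):
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        clf.fit(X[train_idx], y[train_idx])            # train on k-1 folds
        pred = clf.predict(X[test_idx])                # predict the held-out fold
        scores.append(f1_score(y[test_idx], pred, pos_label=1))
    return np.mean(scores)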

4.2. Validation Method Using Friedman’s Test and Tukey’s Test

The Friedman test is a nonparametric statistical test developed by Friedman [19, 20]. It is used to detect differences in algorithms/classifiers across multiple test attempts. The procedure involves ranking each block (or row) together and then considering the values of ranks by columns. In this section, we present a multiple AUC value comparison among the six classifiers using Friedman’s test.

At first, we make two hypotheses: $H_0$: the six classifiers have equal classification probability; $H_1$: at least two of them have different probability distributions.

In order to determine which hypothesis should be rejected, we compute the statistic

$\chi_{F}^{2} = \frac{12}{b k (k+1)} \sum_{j=1}^{k} R_{j}^{2} - 3 b (k+1),  \quad (13)$

where $b$ is the number of blocks (or rows), $k$ is the number of classifiers, and $R_j$ is the summation of the ranks of the $j$th column. The range of rejection for the null hypothesis is $\chi_{F}^{2} > \chi_{\alpha, k-1}^{2}$. In our experiment, the degree of freedom is $k - 1 = 5$ and we set $\alpha = 0.05$; thus the critical value is $\chi_{0.05, 5}^{2} = 11.07$, and the computed statistic falls in the rejection region, which implies that $H_0$ should be rejected.
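The statistic (13) can be computed as in the following Python sketch (ours; scipy.stats.friedmanchisquare provides an equivalent ready-made test), where each row of the input matrix holds the AUC values of the six classifiers on one data set:

import numpy as np
from scipy.stats import chi2, rankdata

def friedman_test(auc, alpha=0.05):
    # auc: b x k matrix (b data sets as blocks, k classifiers as columns).
    b, k = auc.shape
    ranks = np.vstack([rankdata(row) for row in auc])    # rank within each block
    R = ranks.sum(axis=0)                                # column rank sums
    chi_f = 12.0 / (b * k * (k + 1)) * np.sum(R ** 2) - 3.0 * b * (k + 1)   # cf. (13)
    critical = chi2.ppf(1.0 - alpha, df=k - 1)
    return chi_f, critical, chi_f > critical             # True means reject H0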

Friedman's test just tells us that at least two of the classifiers have different performance, but it does not give any indication of which one performs best. In this case, a post hoc test should be conducted. Actually, there are many post hoc tests, such as LSD (Fisher's least significant difference), SNK (Student-Newman-Keuls), Bonferroni-Dunn's test, Tukey's test, and Nemenyi's test, which is very similar to the Tukey test for ANOVA. In this paper, the Tukey test [21] is applied.

Tukey’s test is a single-step multiple comparison procedure and statistical test. It is used in conjunction with an ANOVA to find means that are significantly different from each other. It compares all possible pairs of means and is based on a studentized range distribution.

Tukey's test involves two basic assumptions:
(1) the observations being tested are independent;
(2) there is equal within-group variance across the groups associated with each mean in the test.

Obviously, our case satisfies the two requirements.

The steps of the Tukey multiple comparison with equal sample size can be summarized in Algorithm 3.

Input: $\alpha$, $k$, $\nu$, $MS_E$, $n$, and the samples. The meaning of these parameters is: $\alpha$ is an error rate,
$k$ is the number of means, $MS_E$ is the error mean square, $\nu$ is the degree of freedom related with $MS_E$, and $n$ is the
number of observations of each sample.
Output: The minimum significant difference $w$ and a deduction;
Method:
(1) Choose a proper error rate $\alpha$;
(2) Calculate the statistic according to
                 $w = q_{\alpha}(k, \nu) \sqrt{MS_E / n}$,
  where $q_{\alpha}(k, \nu)$ is the critical value of the Studentized range statistic, which can be found in any
  statistics textbook.
(3) Compute and rank all the means;
(4) Draw a deduction based on the ranks at the confidence level $1 - \alpha$.
End Algorithm Tukey Multiple Comparison.

In this paper, we set $\alpha = 0.05$. Since we compare 6 classifiers over 9 data sets, $k = 6$, $n = 9$, and $\nu = k(n-1) = 48$. The critical value $q_{0.05}(6, 48)$ can be found from the Studentized range statistic table. Now the only remaining problem in finding the value of $w$ is to determine $MS_E$. This can be calculated as

$MS_E = \frac{1}{k(n-1)} \Bigl( \sum_{i=1}^{n}\sum_{j=1}^{k} x_{ij}^{2} - \frac{1}{n} \sum_{j=1}^{k} T_{j}^{2} \Bigr),  \quad (14)$

where $x_{ij}$ is the corresponding AUC value in Table 4 and $T_j$ is the AUC summation of each column. From (14) we obtain $MS_E$ and hence the minimum significant difference $w$. The means comparison is listed in Table 5. From this table we can see clearly the following.
(1) The difference between the AKPCAC and APLSC means is greater than the critical value $w$, which indicates that the AKPCAC is significantly better than the APLSC.
(2) But compared to the rest, except the APLSC, the two newly proposed methods show no significant difference.
(3) Nevertheless, the AKPCAC and AKPLSC have the largest and second largest means, which implies that both indeed outperform the rest, although insignificantly.
(4) Compared to the AKPLSC, the AKPCAC is slightly more powerful, which supports our claim that the AKPCAC is more adaptive to dimensional feature space mappings over imbalanced data sets.
(5) The deduction is made at the confidence level $1 - \alpha = 0.95$.
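A sketch of the minimum significant difference computation, assuming SciPy 1.7 or later (which provides scipy.stats.studentized_range) and a one-way layout with equal sample sizes, is given below; statsmodels' pairwise_tukeyhsd offers a ready-made alternative:

import numpy as np
from scipy.stats import studentized_range

def tukey_msd(auc, alpha=0.05):
    # auc: n x k matrix of AUC values (n data sets, k classifiers).
    n, k = auc.shape
    nu = k * (n - 1)                                     # error degrees of freedom (assumed design)
    T = auc.sum(axis=0)                                  # column sums
    ss_error = np.sum(auc ** 2) - np.sum(T ** 2) / n     # within-group sum of squares, cf. (14)
    ms_error = ss_error / nu
    q = studentized_range.ppf(1.0 - alpha, k, nu)        # critical value of the Studentized range
    w = q * np.sqrt(ms_error / n)                        # minimum significant difference
    return w, auc.mean(axis=0)

# Any two classifiers whose mean AUCs differ by more than w are significantly
# different at confidence level 1 - alpha.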

5. Conclusion

In this paper, we introduce kernel-based asymmetric learning for software defect prediction. To eliminate the negative effect of the class imbalance problem, we propose two algorithms, called the asymmetric kernel partial least squares classifier and the asymmetric kernel principal component analysis classifier. The former is derived from the regression model of linear PLS, while the latter is derived from the kernel PCA method. The AKPLSC can extract feature information in a nonlinear way and retrieve the loss caused by class imbalance. The AKPCAC is more adaptive to feature space mappings over imbalanced data sets and has better performance. The F-measure, Friedman's test, and a post hoc test using Tukey's method are used to verify the performance of our algorithms. Experimental results on NASA and SOFTLAB data sets validate their effectiveness.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The work of this paper was supported by the National Natural Science Foundation of China (Grant no. 61300093) and the Fundamental Research Funds for the Central Universities in China (Grant no. ZYGX2013J071). The authors are extremely grateful to the anonymous referees of the initial version of this paper for their valuable comments; the present version incorporates all the requested changes, and their comments have significantly improved the quality of the paper.