Applied Computational Intelligence and Soft Computing

Volume 2016 (2016), Article ID 7658207, 12 pages

http://dx.doi.org/10.1155/2016/7658207

## Prediction of Defective Software Modules Using Class Imbalance Learning

Indian Institute of Information Technology, No. 5203, CC-3 Building, Allahabad, Uttar Pradesh 211012, India

Received 17 November 2015; Accepted 19 January 2016

Academic Editor: Zhang Yi

Copyright © 2016 Divya Tomar and Sonali Agarwal. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Software defect predictors help maintain the quality of software products. Early prediction of defective software modules helps software developers allocate the available resources so that high quality software products can be delivered. The objective of a software defect prediction system is to find as many defective software modules as possible without degrading the overall performance. Learning a software defect predictor is difficult because software modules are distributed unevenly between the defective and nondefective classes. Misclassifying a defective software module generally incurs a much higher cost than misclassifying a nondefective one. To account for this difference in misclassification cost, we have developed a software defect prediction system using the Weighted Least Squares Twin Support Vector Machine (WLSTSVM). This system assigns a higher misclassification cost to the data samples of the defective class and a lower cost to the data samples of the nondefective class. Experiments on eight software defect prediction datasets demonstrate the validity of the proposed defect prediction system. The significance of the results has been tested with the nonparametric Wilcoxon signed rank test.

#### 1. Introduction

The Software Development Life Cycle (SDLC) consists of five phases: Analysis, Design, Implementation, Test, and Maintenance. These phases should be carried out effectively in order to deliver a bug-free, high quality software product to the end users. Developing a defect-free software product is very challenging because unknown bugs or unforeseen deficiencies can occur even if all the guidelines of software project development are followed carefully. Early prediction of defective software modules helps the software project manager to utilize resources such as people, time, and budget effectively to develop high quality software [1–4]. Identifying defective software modules is a major concern in the software industry, and doing so facilitates further software evolution and maintenance. Software project managers, quality managers, and software developers monitor, detect, and correct software defects in all phases of the software development life cycle in order to deliver quality software on time and within budget. The quality of a software product is highly correlated with the absence or presence of defects [5, 6]. A software defect is an error or deficiency in a software process which occurs due to incorrect programming logic, miscommunication of requirements, lack of coding experience, poor software testing skill, and so forth. Defective software modules generate wrong output and lead to a poor quality software product, which further increases the development and maintenance cost and causes customer dissatisfaction [1, 2]. In the last two decades, researchers have addressed the software defect prediction problem by applying several statistical and machine learning techniques. Software defect data suffers from the class imbalance problem due to the skewed distribution of defective and nondefective software modules [7–11].
Most machine learning algorithms assume that data samples are equally distributed among classes and that the misclassification cost is the same for every class. However, in most cases the misclassification cost of minority class samples is higher than that of majority class samples [12]. In software defect prediction, predicting a defective software module as nondefective can increase the cost of maintenance, whereas predicting a nondefective module as defective involves unnecessary testing activities. The latter error is generally more acceptable than the former. Hence, the objective of this research work is to account for the different misclassification cost of each class for the effective prediction of defective software modules.

Software defect prediction requires a binary classifier, as it is a two-class classification problem. In recent years, researchers have proposed many nonparallel hyperplane Support Vector Machine (SVM) classifiers for binary classification [13–15]. For example, Mangasarian and Wild proposed the Generalized Eigenvalue Proximal Support Vector Machine (GEPSVM), the first nonparallel hyperplane classifier, which finds a pair of nonparallel hyperplanes such that each hyperplane is nearest to one of the two classes and as far as possible from the other class [16]. GEPSVM shows excellent performance on several benchmark datasets, especially on the “Cross-Planes” dataset. Later, by combining the concepts of traditional SVM and GEPSVM, Jayadeva et al. proposed a novel nonparallel hyperplane binary classifier named Twin Support Vector Machine (TWSVM) [13]. TWSVM has shown better performance than SVM and other classifiers, not only in predictive accuracy but also in training time [13, 14]. For equally distributed classes, the training process of TWSVM is four times faster than that of SVM, as it solves two smaller Quadratic Programming Problems (QPPs) instead of a single large QPP as in SVM. TWSVM seeks two nonparallel hyperplanes, one for each class, such that each hyperplane remains close to its corresponding class while being as far as possible from the other class. Although TWSVM is faster than conventional SVM, it still involves solving two QPPs, which is a complex process. Hence, Arun Kumar and Gopal proposed the Least Squares Twin Support Vector Machine (LSTSVM), a least squares variant of TWSVM that solves two systems of linear equations rather than two QPPs [17]. LSTSVM has shown its effectiveness over TWSVM in terms of better generalization ability and less computational time.
Therefore, this research work adopts the LSTSVM classifier for defect prediction in software modules. This study takes the misclassification cost issue into account and proposes a Weighted Least Squares Twin Support Vector Machine classifier to develop a software defect prediction system that considers the misclassification cost of each class. Experiments on eight software defect prediction datasets taken from the PROMISE repository demonstrate the superiority of our proposed system over existing approaches, including Support Vector Machine (SVM), Cost-Sensitive Neural Network (CBNN), weighted Naive Bayes (NB), Random Forests (RF), Logistic Regression (LR), k-Nearest Neighbor (k-NN), Bayesian Belief Network (BBN), C4.5 Decision Tree, and Least Squares Twin Support Vector Machine (LSTSVM). The effectiveness of the proposed software defect prediction system has also been analyzed by using the nonparametric Wilcoxon signed rank hypothesis test. The statistical inferences are made from the observed difference in the geometric mean.

The paper is organized into five sections. Section 2 summarizes the related work in the field of software defect prediction and class imbalance learning. Section 3 discusses the proposed software defect prediction approach. Experimental results are presented and discussed in Section 4, and conclusions are drawn in Section 5.

#### 2. Related Work

##### 2.1. Class Imbalance Learning

In an imbalanced data distribution, one class (the majority class) contains a large number of data samples compared with the other class (the minority class). Traditional classification algorithms assume a balanced distribution of data samples among classes. The degree of imbalance varies from one problem domain to another, and correctly predicting the samples of the rare class is often more important than the contrary case. In the software defect prediction problem, defective software modules are fewer than nondefective software modules. For this type of problem, software developers are more interested in the correct identification of defective software modules, since failure to identify them can degrade the software quality. Therefore, a software defect predictor is beneficial if it correctly recognizes the defective software modules.

Class imbalance learning is the process of learning from imbalanced datasets [18]. The challenge is that the rare class cannot draw as much attention from the learning algorithm as the majority class. For an imbalanced dataset, the learning algorithm generates overly specific classification rules for the rare class, or misses such rules altogether [18–20]. These rules do not generalize well to unseen data and thus are not appropriate for future prediction.

Researchers have recommended three kinds of solutions to handle the class imbalance problem: data level, algorithmic level, and cost-sensitive solutions. In data level solutions, the training data is manipulated to rebalance the distribution of data among classes, and thereby rectify the effect of class imbalance, by using resampling techniques such as random oversampling, random undersampling, SMOTE, informed undersampling, and cluster based sampling [20–27]. Data level solutions are more versatile in nature, as they are independent of the learning algorithm. In algorithmic level solutions, the learning algorithms modify their training mechanism with the objective of achieving better accuracy on the minority class. One-class learning approaches such as REMED and RIPPER are used to predict the data samples of the minority class [28]. Ensemble learning approaches have also been used for imbalanced data handling. In this approach, a set of classifiers is used for learning and their outputs are combined in order to predict the class of new data samples. Boosting, Random Forest, AdaBoost.NC, SMOTEBoost, and so forth are examples of ensemble learning approaches [29]. Cost-sensitive learning methods consider a different misclassification cost for each class in such a way that the data samples of the minority class gain importance. Cost-Sensitive Decision Tree, Cost-Sensitive Neural Network, and Cost-Sensitive Boosting methods such as AdaCost are some of the approaches proposed to handle the class imbalance learning problem [30–33]. Cost functions have also been combined with Support Vector Machine and Bayesian classifiers.
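As an illustration of a data level solution, the following minimal sketch performs random oversampling for a two-class problem. The function name `random_oversample`, its parameters, and the restriction to two classes are assumptions made for this example; it is not taken from any of the cited works.

```python
import random


def random_oversample(samples, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority samples until both classes have equal size.

    Hypothetical helper for a two-class dataset; `samples` and `labels` are
    parallel lists.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    minority = [s for s, y in zip(samples, labels) if y == minority_label]
    if not minority:
        raise ValueError("no samples carry the minority label")
    majority_count = len(labels) - len(minority)
    # Number of duplicates needed to match the majority class size.
    missing = max(0, majority_count - len(minority))
    extra = [rng.choice(minority) for _ in range(missing)]
    return list(samples) + extra, list(labels) + [minority_label] * len(extra)


# Usage: four nondefective modules (label 0) and two defective ones (label 1).
X = [[1.0], [1.2], [0.9], [1.1], [3.0], [3.2]]
y = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = random_oversample(X, y, minority_label=1)
```

Random undersampling is the mirror image: instead of duplicating minority samples, randomly chosen majority samples are discarded until the class sizes match.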

##### 2.2. Software Defect Prediction

Researchers have taken great interest in the software defect prediction problem, applying statistical and machine learning algorithms such as Neural Network, Support Vector Machine, Naive Bayes, Random Forest, Case Based Reasoning, Logistic Regression, and Association Rule Mining [34–40]. K. O. Elish and M. O. Elish investigated the capability of Support Vector Machine in predicting defective software modules and analyzed its performance against some statistical and machine learning approaches on four NASA datasets [37]. Czibula et al. developed a system to identify defective software modules using relational association rule mining, which is an extension of association rules [38]. Association rules are used to determine the different types of relations between metrics for defect prediction. Challagulla et al. evaluated the performance of various machine learning approaches and statistical models on four software defect prediction datasets taken from the NASA repository for predicting software quality [41]. Their experiments showed that the combination of 1-rule classification and instance based learning, together with consistency based subset evaluation, achieved the highest defect predictive accuracy among the compared methods. Guo et al. proposed Random Forests, which is an extension of Decision Tree, for identifying defective software modules [39]. They performed experiments on five case studies based on NASA datasets and compared the performance of their proposed methodology with the statistical and machine learning approaches of the WEKA and See5 machine learning packages. They concluded that the Random Forest algorithm produced a higher defect prediction rate than the other approaches. Moeyersoms et al. used data mining approaches such as Random Forest, Support Vector Regression, C4.5, and Regression Tree [42].
They applied the ALPA rule extraction technique to improve the rule sets in terms of accuracy, fidelity, and recall. Okutan and Yıldız developed a software defect prediction model by using Bayesian Network [43]. This model determines the probabilistic influential relationships of software metrics with defect-prone software modules. Bayesian Network is one of the most widely used approaches to analyze the effect of object-oriented metrics on the number of defects [43–48]. Pai and Dugan performed experiments on the KC1 project taken from the NASA repository using Bayesian Network [47]. Fenton et al. used Bayesian Network to predict the defects, quality, and risk of a software system [48]. They analyzed the influence of information variables such as test effectiveness and defects present on the target variable, defects detected. Catal and Diri investigated the effect of dataset size, metrics, and feature selection on the prediction of defective software modules [49]. They conducted experiments on five datasets and found that the Random Forest (RF) algorithm obtained better performance on large datasets, while Naive Bayes performed better than RF on small datasets. They also used Artificial Immune System (AIS) algorithms to analyze the effect of the metrics set. Artificial Immune Recognition Systems (AIRS2Parallel) perform better with method-level metrics, while the Immunos2 algorithm shows better results with class-level metrics. They found that the algorithm is a more important component of software defect prediction than the metrics suite. Apart from these basic classification approaches, several optimization approaches such as Genetic Algorithm, Particle Swarm Optimization (PSO), and Ant Colony Optimization (ACO) have also been applied to the software defect prediction problem [50–52].

The imbalanced distribution of defective and nondefective software modules leads to poor performance of machine learning approaches. To balance the distribution of data samples between classes, researchers have applied various solutions such as oversampling and undersampling methods. Arar and Ayan proposed a Cost-Sensitive Neural Network based defect prediction system with the objective of handling the class imbalance problem [53]. The Artificial Bee Colony algorithm was used to find the optimal weights. They investigated the performance of their proposed approach on five publicly available datasets taken from the NASA repository. Zheng considered different misclassification costs and developed a software defect prediction model by using Cost-Sensitive Boosting Neural Network [8]. Khoshgoftaar and Gao also studied the impact of data sampling and feature selection on software defect prediction datasets [10, 54]. They used a wrapper based feature selection approach to select relevant features and random undersampling to reduce the negative impact of imbalanced data on the performance of the software defect prediction model. Wang and Yao investigated the impact of imbalanced data on software defect prediction learning models [7]. They performed experiments on ten publicly available datasets taken from the PROMISE repository with different types of class imbalance learning approaches such as resampling, ensemble approaches, and threshold moving. Their experiments showed that AdaBoost.NC performed better than the other approaches. Jing et al. employed a dictionary learning approach and proposed a cost-sensitive discriminative dictionary learning (CDDL) based software defect prediction model. They analyzed the performance of their proposed model on ten NASA datasets [55].

Apart from these works, various studies have predicted software defects using data mining techniques. Researchers have also analyzed the impact of metrics on identifying defect-prone software modules, focusing on the selection of relevant metrics which are useful for defect prediction [52, 56–62]. From the literature, we conclude that data mining plays a crucial role in predicting software defects. The datasets used for defect prediction are highly imbalanced in nature, as the number of defective software modules is usually smaller than the number of nondefective software modules. Therefore, this research work focuses on the imbalanced nature of software defect prediction datasets in order to obtain effective results.

#### 3. Weighted Least Squares Twin Support Vector Machine

Only a few studies have considered the misclassification cost of defective and nondefective software modules. This research work uses Weighted Least Squares Twin Support Vector Machine (WLSTSVM) to develop an effective software defect prediction model in which a different misclassification cost or weight is assigned to each class according to its sample distribution. Let the training dataset contain $m$ data samples $(x_i, y_i)$, $i = 1, \ldots, m$, where $x_i \in \mathbb{R}^n$ denotes the feature vector and $y_i$ represents the corresponding class label. Suppose the sizes of class 1 and class 2 are $m_1$ and $m_2$, respectively, where $m = m_1 + m_2$. Let matrices $A$ and $B$ consist of the data samples of class 1 and class 2, respectively. The appropriate selection of cost is an important consideration. The weight or misclassification cost is determined for each class according to formula (1). The following conclusions can be drawn from this formula:

(1) Each cost lies within the range 0 to 1, so that the classifier can be trained with convergence.

(2) Costs are normalized without loss of generality.

(3) A lower misclassification cost is assigned to the majority class, while the minority class receives a higher misclassification cost.

The linear and nonlinear WLSTSVM classifiers are formulated as follows.
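The three properties listed above can be illustrated with a simple inverse-frequency weighting, in which each class is weighted by the share of samples belonging to the other class. This particular weighting is an assumption chosen for illustration and should not be read as formula (1) itself; the function name `class_costs` is likewise hypothetical.

```python
def class_costs(m1, m2):
    """Return (cost of class 1, cost of class 2) for class sizes m1 and m2.

    ASSUMED weighting: cost_i = 1 - m_i / m, where m = m1 + m2.
    It lies between 0 and 1, the two costs sum to 1, and the smaller
    class receives the larger cost.
    """
    if m1 <= 0 or m2 <= 0:
        raise ValueError("both classes must contain at least one sample")
    m = m1 + m2
    return m2 / m, m1 / m


# Usage: 20 defective modules (class 1) and 180 nondefective modules (class 2).
cost_defective, cost_nondefective = class_costs(20, 180)
```

Under this assumed weighting the defective (minority) class receives the larger cost, so errors on defective modules are penalized more heavily during training.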

##### 3.1. Linear WLSTSVM

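Least Squares Twin Support Vector Machine (LSTSVM), proposed by Arun Kumar and Gopal, is a binary classifier which classifies the data samples of two classes by generating a hyperplane for each class [17]. The hyperplanes are constructed in such a way that the data samples of each class lie in close proximity to the corresponding hyperplane while remaining clearly separated from the other hyperplane. For each new data sample, its distance from each hyperplane is calculated and the data sample is assigned to the class whose hyperplane lies closer to it. Weighted Least Squares Twin Support Vector Machine is obtained by adding the weight or misclassification cost to the formulation of LSTSVM according to (1). Linear WLSTSVM solves the following two objective functions:

$$\min_{w_1, b_1, \xi} \; \frac{1}{2}\|A w_1 + e_1 b_1\|^2 + \frac{c_1}{2}\, \xi^T D_2\, \xi \quad \text{s.t.} \quad -(B w_1 + e_2 b_1) + \xi = e_2, \tag{2}$$

$$\min_{w_2, b_2, \eta} \; \frac{1}{2}\|B w_2 + e_2 b_2\|^2 + \frac{c_2}{2}\, \eta^T D_1\, \eta \quad \text{s.t.} \quad (A w_2 + e_1 b_2) + \eta = e_1, \tag{3}$$

to determine the following two nonparallel hyperplanes:

$$x^T w_1 + b_1 = 0, \qquad x^T w_2 + b_2 = 0. \tag{4}$$

Here, $w_1$ and $w_2$ are the normal vectors to the hyperplanes and $b_1$ and $b_2$ are bias terms. $c_1$ and $c_2$ represent positive penalty parameters. $e_1$ and $e_2$ are vectors of 1's and $\xi$, $\eta$ are slack variables. $D_2$ and $D_1$ represent the diagonal matrices containing the misclassification cost for the data samples of class 2 and class 1, respectively, according to (1). The first term of the objective function in (2) measures the sum of squared distances of the data samples of class 1 from the hyperplane. Minimizing it keeps the hyperplane in close proximity to class 1. The second term of the objective function minimizes the misclassification error due to the data samples of class 2. In this way the hyperplane is kept near the data samples of class 1 and as far as possible from the data samples of class 2. The Lagrangian function corresponding to (2) is given by

$$L(w_1, b_1, \xi, \alpha) = \frac{1}{2}\|A w_1 + e_1 b_1\|^2 + \frac{c_1}{2}\, \xi^T D_2\, \xi - \alpha^T\left(-(B w_1 + e_2 b_1) + \xi - e_2\right). \tag{5}$$

Here, $\alpha$ is a Lagrangian multiplier. The following Karush-Kuhn-Tucker (KKT) necessary and sufficient optimality conditions are determined by differentiating (5) with respect to $w_1$, $b_1$, $\xi$, and $\alpha$:

$$A^T (A w_1 + e_1 b_1) + B^T \alpha = 0, \tag{6}$$

$$e_1^T (A w_1 + e_1 b_1) + e_2^T \alpha = 0, \tag{7}$$

$$c_1 D_2\, \xi - \alpha = 0, \tag{8}$$

$$-(B w_1 + e_2 b_1) + \xi - e_2 = 0. \tag{9}$$

Equations (6) and (7) lead to

$$\begin{bmatrix} A & e_1 \end{bmatrix}^T \begin{bmatrix} A & e_1 \end{bmatrix} \begin{bmatrix} w_1 \\ b_1 \end{bmatrix} + \begin{bmatrix} B & e_2 \end{bmatrix}^T \alpha = 0. \tag{10}$$

Let $E = [A \;\; e_1]$, $F = [B \;\; e_2]$, and $z_1 = [w_1^T \;\; b_1]^T$. With these notations, (10) can be rewritten as

$$E^T E\, z_1 + F^T \alpha = 0, \quad \text{that is,} \quad z_1 = -(E^T E)^{-1} F^T \alpha. \tag{11}$$

The solution of the above equation requires the inverse of $E^T E$. However, sometimes it is not possible to determine this inverse because the matrix is ill-conditioned. To avoid this situation, a regularization term $\epsilon I$ may be added to $E^T E$. Here, $\epsilon > 0$ and $I$ is an identity matrix of suitable dimension. Equation (11) can be rewritten as

$$z_1 = -(E^T E + \epsilon I)^{-1} F^T \alpha. \tag{12}$$

The Lagrangian multiplier $\alpha$ is determined by (8), (9), and (11) as

$$\alpha = \left(\frac{1}{c_1} D_2^{-1} + F (E^T E)^{-1} F^T\right)^{-1} e_2. \tag{13}$$

In the same way, the Lagrangian function of (3) is obtained as

$$L(w_2, b_2, \eta, \beta) = \frac{1}{2}\|B w_2 + e_2 b_2\|^2 + \frac{c_2}{2}\, \eta^T D_1\, \eta - \beta^T\left((A w_2 + e_1 b_2) + \eta - e_1\right). \tag{14}$$

Here, $\beta$ is a Lagrangian multiplier. The hyperplane parameters corresponding to class 2 are obtained by solving (14) as

$$z_2 = \begin{bmatrix} w_2 \\ b_2 \end{bmatrix} = (F^T F + \epsilon I)^{-1} E^T \beta, \tag{15}$$

$$\beta = \left(\frac{1}{c_2} D_1^{-1} + E (F^T F)^{-1} E^T\right)^{-1} e_1. \tag{16}$$

Hyperplane parameters are obtained using (12) and (15), which are further used to determine the nonparallel hyperplanes, one for each class. A class is assigned to a new data sample $x$ depending on which hyperplane lies closest to it. The decision function for class evaluation is defined as

$$\operatorname{class}(x) = \arg\min_{k = 1, 2} \frac{|x^T w_k + b_k|}{\|w_k\|}. \tag{17}$$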

*Algorithm 1.*

(1) Define the weight matrix for each class (defective or nondefective) using (1).

(2) Obtain the augmented matrices $E = [A \;\; e_1]$ and $F = [B \;\; e_2]$, where matrices $A$ and $B$ comprise the software modules of the defective and nondefective classes or vice versa.

(3) Select the penalty parameters on a validation basis.

(4) Determine the hyperplane parameters using (12) and (15), which are further used to determine the hyperplane for each class.

(5) For a new software module, its class (defective or not) is determined by using the decision function given in (17).
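The steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class weights use an assumed inverse-frequency rule in place of formula (1), the same regularized inverse is used when computing both the multipliers and the hyperplane parameters, and the closed forms follow the usual least squares twin SVM derivation with a diagonal weight matrix. The names `fit_linear_wlstsvm` and `predict_linear`, and the default parameter values, are hypothetical.

```python
import numpy as np


def fit_linear_wlstsvm(A, B, c1=1.0, c2=1.0, eps=1e-4):
    """Return z1 = [w1; b1] and z2 = [w2; b2] for class 1 rows A and class 2 rows B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    m1, m2 = len(A), len(B)
    # ASSUMED class costs (stand-in for formula (1)): the smaller class gets the larger cost.
    cost1, cost2 = m2 / (m1 + m2), m1 / (m1 + m2)
    E = np.hstack([A, np.ones((m1, 1))])  # E = [A e1]
    F = np.hstack([B, np.ones((m2, 1))])  # F = [B e2]
    I = np.eye(E.shape[1])
    G1 = E.T @ E + eps * I  # regularized E^T E
    G2 = F.T @ F + eps * I  # regularized F^T F
    # Multiplier for the class 1 hyperplane: (D2^{-1}/c1 + F G1^{-1} F^T) alpha = e2
    alpha = np.linalg.solve(np.eye(m2) / (c1 * cost2) + F @ np.linalg.solve(G1, F.T),
                            np.ones(m2))
    z1 = -np.linalg.solve(G1, F.T @ alpha)
    # Multiplier for the class 2 hyperplane: (D1^{-1}/c2 + E G2^{-1} E^T) beta = e1
    beta = np.linalg.solve(np.eye(m1) / (c2 * cost1) + E @ np.linalg.solve(G2, E.T),
                           np.ones(m1))
    z2 = np.linalg.solve(G2, E.T @ beta)
    return z1, z2


def predict_linear(X, z1, z2):
    """Assign each row of X to the class (1 or 2) whose hyperplane is nearer."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    d1 = np.abs(X @ z1[:-1] + z1[-1]) / np.linalg.norm(z1[:-1])
    d2 = np.abs(X @ z2[:-1] + z2[-1]) / np.linalg.norm(z2[:-1])
    return np.where(d1 <= d2, 1, 2)


# Usage on a toy dataset: class 1 (minority) near (2, 2), class 2 near the origin.
A = [[2.0, 2.0], [2.5, 2.0], [2.0, 2.6], [2.4, 2.5]]
B = [[0.0, 0.0], [0.5, 0.0], [0.0, 0.4], [0.6, 0.5],
     [-0.3, 0.2], [0.2, -0.4], [0.4, 0.3], [-0.2, -0.1]]
z1, z2 = fit_linear_wlstsvm(A, B)
labels = predict_linear([[2.2, 2.3], [0.1, 0.1]], z1, z2)
```

Solving the linear systems with `np.linalg.solve` rather than forming explicit inverses is a design choice made here for numerical stability; the penalty parameters `c1` and `c2` would normally be chosen on validation data, as step (3) states.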

##### 3.2. Nonlinear WLSTSVM

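Nonlinear WLSTSVM is obtained by using the kernel trick. A kernel function maps the data samples into a higher-dimensional feature space in order to make separation easier. The WLSTSVM classifier generates the following kernel surfaces in that space instead of hyperplanes:

$$K(x^T, C^T)\, u_1 + b_1 = 0, \qquad K(x^T, C^T)\, u_2 + b_2 = 0. \tag{18}$$

Here, $K$ is an appropriately chosen kernel function and $C = \begin{bmatrix} A \\ B \end{bmatrix}$. The nonlinear WLSTSVM classifier is constructed as

$$\begin{aligned} &\min_{u_1, b_1, \xi} \; \frac{1}{2}\|K(A, C^T)\, u_1 + e_1 b_1\|^2 + \frac{c_1}{2}\, \xi^T D_2\, \xi \quad \text{s.t.} \quad -(K(B, C^T)\, u_1 + e_2 b_1) + \xi = e_2, \\ &\min_{u_2, b_2, \eta} \; \frac{1}{2}\|K(B, C^T)\, u_2 + e_2 b_2\|^2 + \frac{c_2}{2}\, \eta^T D_1\, \eta \quad \text{s.t.} \quad (K(A, C^T)\, u_2 + e_1 b_2) + \eta = e_1. \end{aligned} \tag{19}$$

Let $P = [K(A, C^T) \;\; e_1]$ and $Q = [K(B, C^T) \;\; e_2]$. The kernel generated surface parameters are obtained as

$$\begin{bmatrix} u_1 \\ b_1 \end{bmatrix} = -(P^T P + \epsilon I)^{-1} Q^T \alpha, \tag{20}$$

$$\alpha = \left(\frac{1}{c_1} D_2^{-1} + Q (P^T P)^{-1} Q^T\right)^{-1} e_2, \tag{21}$$

$$\begin{bmatrix} u_2 \\ b_2 \end{bmatrix} = (Q^T Q + \epsilon I)^{-1} P^T \beta, \tag{22}$$

$$\beta = \left(\frac{1}{c_2} D_1^{-1} + P (Q^T Q)^{-1} P^T\right)^{-1} e_1. \tag{23}$$

These parameters generate the kernel surfaces, and the class is assigned to a new data sample depending on its distance from each kernel surface. The decision function is defined as

$$\operatorname{class}(x) = \arg\min_{k = 1, 2} \frac{|K(x^T, C^T)\, u_k + b_k|}{\|u_k\|}. \tag{24}$$

The algorithm of the nonlinear WLSTSVM classifier is similar to that of the linear WLSTSVM classifier, except that a kernel function must be chosen. The kernel function transforms the data samples into a higher-dimensional feature space, and then the kernel generated surface parameters are calculated using (20) and (22). The class is assigned to new data samples using (24).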
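A kernel variant can be sketched in the same way. The sketch below assumes a Gaussian (RBF) kernel, an inverse-frequency class weighting in place of formula (1), and a regularized inverse throughout; these choices, the function names `rbf_kernel`, `fit_kernel_wlstsvm`, and `predict_kernel`, and the default values of `gamma` and `eps` are illustrative assumptions rather than the settings used in this paper.

```python
import numpy as np


def rbf_kernel(X, Y, gamma):
    """Gaussian kernel matrix with entries exp(-gamma * ||x - y||^2)."""
    sq = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def fit_kernel_wlstsvm(A, B, c1=1.0, c2=1.0, gamma=0.5, eps=1e-3):
    """Return C = [A; B] and the surface parameters z1 = [u1; b1], z2 = [u2; b2]."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    m1, m2 = len(A), len(B)
    C = np.vstack([A, B])
    # ASSUMED class costs (stand-in for formula (1)).
    cost1, cost2 = m2 / (m1 + m2), m1 / (m1 + m2)
    P = np.hstack([rbf_kernel(A, C, gamma), np.ones((m1, 1))])  # P = [K(A, C^T) e1]
    Q = np.hstack([rbf_kernel(B, C, gamma), np.ones((m2, 1))])  # Q = [K(B, C^T) e2]
    I = np.eye(P.shape[1])
    G1 = P.T @ P + eps * I  # P^T P is rank deficient, so regularization is essential
    G2 = Q.T @ Q + eps * I
    alpha = np.linalg.solve(np.eye(m2) / (c1 * cost2) + Q @ np.linalg.solve(G1, Q.T),
                            np.ones(m2))
    z1 = -np.linalg.solve(G1, Q.T @ alpha)
    beta = np.linalg.solve(np.eye(m1) / (c2 * cost1) + P @ np.linalg.solve(G2, P.T),
                           np.ones(m1))
    z2 = np.linalg.solve(G2, P.T @ beta)
    return C, z1, z2


def predict_kernel(X, C, z1, z2, gamma=0.5):
    """Assign each row of X to the class (1 or 2) whose kernel surface is nearer."""
    Kx = rbf_kernel(np.atleast_2d(np.asarray(X, dtype=float)), C, gamma)
    d1 = np.abs(Kx @ z1[:-1] + z1[-1]) / np.linalg.norm(z1[:-1])
    d2 = np.abs(Kx @ z2[:-1] + z2[-1]) / np.linalg.norm(z2[:-1])
    return np.where(d1 <= d2, 1, 2)


# Usage on an XOR-like layout that a single linear boundary cannot separate.
A = [[1.0, 1.0], [-1.0, -1.0], [1.2, 0.9], [-0.9, -1.1]]
B = [[1.0, -1.0], [-1.0, 1.0], [0.9, -1.2], [-1.1, 0.9], [1.1, -0.8], [-0.8, 1.2]]
C, z1, z2 = fit_kernel_wlstsvm(A, B)
labels = predict_kernel([[1.0, 1.0], [1.0, -1.0]], C, z1, z2)
```

Because each kernel matrix has more columns than rows, the regularization term is what makes the linear systems solvable here; `gamma` and `eps` would normally be tuned together with the penalty parameters.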

#### 4. Numerical Experiment

##### 4.1. Dataset Description and Performance Measurements

In this study, we have performed the experiment on eight benchmark datasets taken from PROMISE repository [63]. These datasets are NASA MDP software projects which were developed in C/C++ language for spacecraft instrumentation, satellite flight control, scientific data processing, and storage management of ground data. The detailed description of each dataset is given in Table 1.