Research Article | Open Access
Shang Zheng, Jinjing Gai, Hualong Yu, Haitao Zou, Shang Gao, "Software Defect Prediction Based on Fuzzy Weighted Extreme Learning Machine with Relative Density Information", Scientific Programming, vol. 2020, Article ID 8852705, 18 pages, 2020. https://doi.org/10.1155/2020/8852705
Software Defect Prediction Based on Fuzzy Weighted Extreme Learning Machine with Relative Density Information
To identify software modules that are more likely to be defective, machine learning has been used to construct software defect prediction (SDP) models. However, several previous works have found that the imbalanced nature of software defect data can decrease model performance. In this paper, we discuss how to handle imbalanced data distributions in the context of SDP with the aim of finding better methods. Firstly, a relative density is introduced to reflect the significance of each instance within its class; it is irrelevant to the scale of the data distribution in feature space and hence more robust than absolute distance information. Secondly, a K-nearest-neighbors-based probability density estimation (KNN-PDE) alike strategy is utilised to calculate the relative density of each training instance. Furthermore, fuzzy memberships of samples are designed based on the relative density in order to reduce the classification error caused by noise and outlier samples. Finally, two algorithms are proposed to train software defect prediction models based on the weighted extreme learning machine. This paper compares the proposed algorithms with traditional SDP methods on benchmark data sets. The results show that the proposed methods have much better overall performance in terms of G-mean, AUC, and Balance. The proposed algorithms are more robust and adaptive to the distribution types of SDP data, can more accurately estimate the significance of each instance, and assign identical total fuzzy coefficients to the two classes without being affected by the data scale.
SDP (software defect prediction) has become an active research topic in software engineering and has drawn growing interest from both academia and industry. It can be formulated as a learning problem and is used to facilitate software testing and save testing cost. Various machine learning methods have utilised software defect training data sets to build prediction models, among which Random Forest  and Naive Bayes  were proved to have relatively stable performance. However, class imbalance [4–7] is a common problem in SDP data sets, which can affect model performance. The distribution of software defects over software modules roughly conforms to the Pareto principle, also known as the 80–20 rule: 80% of the defects are concentrated in 20% of the program modules, and the number of nondefective modules is much larger than the number of defective modules. Hence, the prediction accuracy for the minority class is lower.
Previous studies [8, 9] have indicated that models tend to fail when applied to data with the class imbalance problem. In order to address this problem, class imbalance techniques such as ROS (random oversampling) , RUS (random undersampling) , and SMOTE (synthetic minority oversampling technique)  have been considered for constructing SDP models. In addition, Wang and Yao  analysed three different types of class imbalance methods for software defect prediction. They found that their proposed ensemble approach DNC (Dynamic AdaBoost.NC) performs better than ROS, RUS, and SMOTE. DNC adjusts its parameter automatically during the training process, which can further improve the prediction model’s performance. However, the above prediction models may encounter the following problems: (1) choosing suitable coefficients for different classes, (2) abandoning the instances in small disjunctions, and (3) estimating the wrong class boundary, all of which can lead to unexpected SDP classification results.
In this paper, we present a more robust representation of data distribution information called relative density, which can be extracted by a K-nearest-neighbors-based probability density estimation (KNN-PDE) [14–16] alike strategy, to evaluate the significance of each training instance and to design the corresponding fuzzy membership function. In contrast to a Euclidean-distance-based measure, the relative density is irrelevant to the scale of the data distribution in feature space. Meanwhile, it can also reflect the proportional relation between different instances within a class. Moreover, the KNN-PDE alike strategy has the additional merit that there is no need to normalize the fuzzy coefficients after acquiring the relative density information of all training instances. The fuzzy membership function is then designed based on KNN-PDE and assigns larger weights to high-density instances. This paper uses the fuzzy values as the weights of training instances and embeds them into the weighted extreme learning machine (WELM), which can handle noise and outliers effectively. WELM is selected as the baseline classifier based on three observations: (1) compared to other classifiers, WELM always has better or at least comparable generalization ability and classification performance , (2) it can tremendously save training time compared to other classifiers , and (3) it can deal with data with imbalanced distributions based on cost-sensitive learning . Finally, two algorithms based on WELM are proposed: one relies on the intraclass relative density and the other depends on the interclass relative density. That is, the first function assigns larger weights to high-density instances, while the second function assigns larger weights to the examples that are nearer to the real classification boundary.
To evaluate the algorithms’ effectiveness, this paper performed a comparison with the previous works on the benchmark data sets, and the experimental results indicate that the proposed algorithms can generally produce better or at least comparable performance in terms of the measures including G-mean, AUC, and Balance.
The remainder of this paper is structured as follows. Section 2 introduces some a priori knowledge related to this work, including software defect prediction, extreme learning machine, and weighted extreme learning machine. Section 3 describes the proposed methods in detail. The experiments and analysis are given in Section 4, and Section 5 concludes the research and provides suggestions for future work.
2. Related Work
In this section, some preliminaries are presented, including software defect prediction, extreme learning machine, and weighted extreme learning machine.
2.1. Software Defect Prediction
SDP models are expected to improve software quality and reduce the maintenance cost of software systems. Researchers utilise defect prediction data sets to build comparable models for their studies. So far, a great deal of research has been devoted to metrics describing code modules and to learning algorithms for creating SDP models. A variety of machine learning methods have been proposed and compared for SDP problems, such as neural networks , decision trees , Naive Bayes , and support vector machines . However, the above methods ignore the effect of class imbalance on model performance ; that is, the number of defective instances differs greatly from that of nondefective instances. It is a great challenge for most conventional classification algorithms to work with data that have an unbalanced class distribution, because they may ignore the minority class, which can be the more valuable one in a wide range of applications. Thus, some class imbalance learning techniques have been utilised to reduce this negative effect. The work in  studied which type of static code metrics is useful for handling class imbalance. An undersampling approach was proposed to balance training data  and to check how little information is required to learn a defect predictor; the authors found that throwing away data does not affect the performance of the selected predictors. In addition, ensemble algorithms  and their cost-sensitive variants were studied and shown to be effective if a proper cost ratio can be set. Issam et al.  implemented software defect prediction using ensemble learning on features selected by greedy forward selection. Yang et al.  proposed a two-layer ensemble learning approach for just-in-time defect prediction to improve the performance of SDP.
Besides the works introduced above, there are other existing works on software defect prediction that will not be listed, because some of them do not consider the data distribution and several only combine basic sampling methods with different learning methods. As discussed in Section 1, WELM has three advantages over traditional learning methods. Therefore, this paper studies how to build more robust SDP models based on the data distribution and compares them with the sampling methods through extensive experiments and comprehensive analyses.
2.2. Extreme Learning Machine
Extreme learning machine (ELM), proposed by Huang et al. , is a specific learning algorithm for single-hidden-layer feedforward neural networks (SLFN). The main characteristic of ELM that distinguishes it from conventional learning algorithms for SLFN is the random generation of hidden nodes. Therefore, ELM does not need to iteratively adjust parameters to approach their optimal values; thus it has a faster learning speed and better generalization ability. Previous research has indicated that ELM can produce better or at least comparable generalization ability and classification performance compared to SVM and the multilayer perceptron (MLP), while consuming only a tenth or even a hundredth of their training time.
Let us consider a classification problem with N training instances to distinguish m categories; the ith training instance can be represented as (x_i, t_i), where x_i ∈ R^n is an input vector and t_i ∈ R^m is the corresponding output vector. Suppose that there are L hidden nodes in ELM and that all weights and biases on these nodes are generated randomly. Then, for the instance x_i, its hidden layer output can be represented as a row vector h(x_i). The mathematical model of ELM can be described as

Hβ = T, (1)

where H = [h(x_1)^T, …, h(x_N)^T]^T is the hidden layer output matrix over all training instances, T = [t_1, …, t_N]^T is the target matrix, and β is the weight matrix of the output layer. In equation (1), only β is unknown, so the least-squares algorithm is applied to acquire its solution, which can be described as follows:

β = H†T, (2)

where H† denotes the Moore–Penrose generalized inverse of the hidden layer output matrix H, which guarantees that the solution is the least-norm least-squares solution of equation (1).
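The training procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a sigmoid activation and the pseudo-inverse solution β = H†T; the function names are ours, not from the paper.

```python
import numpy as np

def elm_train(X, T, L, seed=0):
    """Basic ELM: random hidden layer, least-squares output weights.

    X: (N, n) inputs; T: (N, m) one-hot targets; L: number of hidden nodes.
    Returns (W, b, beta) so that predictions are sigmoid(X @ W + b) @ beta.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n, L))    # random input weights (never trained)
    b = rng.uniform(-1.0, 1.0, size=L)         # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # hidden layer output matrix H
    beta = np.linalg.pinv(H) @ T               # least-norm least-squares solution
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

Because only β is solved for, training reduces to one matrix pseudo-inverse, which is the source of ELM's speed advantage.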
According to previous work, ELM can also be trained from the viewpoint of optimization. In the optimization version of ELM, we wish to simultaneously minimize ‖β‖ and the training errors ‖ξ_i‖, so the problem can be described as follows:

min_{β,ξ} (1/2)‖β‖² + (C/2) Σ_{i=1}^{N} ‖ξ_i‖², subject to h(x_i)β = t_i^T − ξ_i^T, i = 1, …, N, (3)

where ξ_i denotes the training error vector of the m output nodes with respect to the training instance x_i and C is the penalty factor, representing the tradeoff between the minimization of training errors and the maximization of generalization ability. Obviously, this is a typical quadratic programming problem that can be solved by the Karush–Kuhn–Tucker (KKT) theorem . The solution of equation (3) can be described as follows:

β = H^T (I/C + HH^T)^{−1} T. (4)
2.3. Weighted Extreme Learning Machine
Weighted extreme learning machine (WELM), which can be regarded as a cost-sensitive learning version of ELM, is an effective way to handle imbalanced data . Similar to CS-SVM, the main idea of WELM is to assign different penalties to different categories: the minority class receives a larger penalty factor C, while the majority class receives a smaller one. WELM thereby focuses on the training errors of the minority instances, making the classification hyperplane emerge in a more impartial position. A weighted matrix W is used to regulate the parameter C for different instances; that is, equation (3) can be rewritten as

min_{β,ξ} (1/2)‖β‖² + (C/2) Σ_{i=1}^{N} W_ii ‖ξ_i‖², subject to h(x_i)β = t_i^T − ξ_i^T, i = 1, …, N, (5)

where W is an N × N diagonal matrix in which each value W_ii on the diagonal represents the regulation weight of the corresponding instance. Zong et al.  provided two different weighting strategies, which are described as follows:

WELM1: W_ii = 1/#(t_i), (6)
WELM2: W_ii = 0.618/#(t_i) if #(t_i) > AVG, and W_ii = 1/#(t_i) otherwise, (7)

where W_ii, #(t_i), AVG, and 0.618 denote the weight of the ith training instance, the number of instances belonging to the class t_i, the average number of instances over all classes, and the value of the golden ratio, respectively. Compared with WELM2, WELM1 is more practical and popular. The solution can then be shown as follows:

β = H^T (I/C + WHH^T)^{−1} WT. (8)
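The two weighting schemes and the weighted solution can be sketched as follows. This is a sketch assuming the common closed form β = Hᵀ(I/C + WHHᵀ)⁻¹WT from Zong et al.; the function names are illustrative.

```python
import numpy as np

def welm_weights(y, scheme=1):
    """Per-instance weights from the two WELM schemes.

    y: integer class labels.
    Scheme 1 (WELM1): w_i = 1 / #(class of i).
    Scheme 2 (WELM2): w_i = 0.618 / #(class) if the class is larger than the
    average class size, else 1 / #(class).
    """
    classes, counts = np.unique(y, return_counts=True)
    avg = counts.mean()
    w = np.empty(len(y))
    for c, n_c in zip(classes, counts):
        if scheme == 1:
            w[y == c] = 1.0 / n_c
        else:
            w[y == c] = (0.618 / n_c) if n_c > avg else (1.0 / n_c)
    return w

def welm_solve(H, T, w, C=1.0):
    """Output weights beta = H^T (I/C + W H H^T)^(-1) W T with diagonal W."""
    N = H.shape[0]
    Wd = np.diag(w)
    A = np.eye(N) / C + Wd @ H @ H.T
    return H.T @ np.linalg.solve(A, Wd @ T)
```

Note how scheme 1 gives every class the same total weight, so the minority instances automatically receive larger individual penalties.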
Obviously, no matter which weighting strategy is used, the minority-class samples are given larger weights. Hence, the higher the class imbalance ratio is, the higher the weight ratio between the classes becomes. According to the work in , users can define the weight of every sample x_i themselves to improve the performance, so this paper considers constructing new weights based on the data distribution.
3. The Proposed Methods
Although WELM can alleviate the class imbalance problem, it does not consider the distribution of samples in feature space. In addition, there are noise and outliers in software defect data, which can further affect the performance. Thus, this paper draws on the experience of the works in [30, 31] and introduces the concept of fuzzy sets, which can mine the distribution of each instance in feature space and allow a more personalized setting of the weights. In order to describe our method, this section first introduces the relative density, which is applied to avoid the heavy computation of probability density in high-dimensional space. Then the fuzzy membership functions designed to replace WELM's weight matrix based on relative density are presented, and the two proposed SDP algorithms are described. Finally, experiments are designed to validate the methods. The whole framework can be seen in Figure 1.
3.1. Relative Density Estimation Strategy
As is known, it is easy to separate outliers and noise from the significant instances if we can estimate the probability density of each training instance. However, in high-dimensional feature space, it is always difficult to acquire exact measurements of the probability density, and even an approximately accurate estimation would be time-consuming. To solve this problem, we introduce an improved method in this subsection. We consider that it is unnecessary to measure the probability density exactly; it is enough to precisely extract the proportional relation of the probability densities between any two training instances. We call the information reflecting this proportional relation the relative density.
To obtain the relative density, a K-nearest-neighbors-based probability density estimation (KNN-PDE) alike strategy is applied. As a nonparametric probability density estimation approach, KNN-PDE estimates the probability density distribution in multidimensional continuous space by measuring the Kth-nearest-neighbor distance of each training instance. When the number of training instances goes to infinity, the result obtained by KNN-PDE approximately converges to the real probability density distribution. Hence, the Kth-nearest-neighbor distances can be used to estimate the relative density, and the Euclidean distance is selected to calculate the distance in the proposed algorithms.
Suppose that a data set contains N instances; then, for each instance x_i, we can find its Kth-nearest neighbor and record the distance between them as d_i^K. As is known, the larger d_i^K is, the lower the density the instance x_i holds. Since both noise and outliers appear in regions of low density, we can use d_i^K as a measure to evaluate the significance of each instance. However, to provide larger values for high-density instances and lower values for low-density instances, such as noise and outliers, d_i^K should be transformed into its reciprocal 1/d_i^K. In this paper, the reciprocal of the Kth-nearest-neighbor distance is defined as the instance's relative density δ_i = 1/d_i^K. It is not difficult to observe that the proportional relation of the relative densities between any two instances exactly equals the inverse of that of the Kth-nearest-neighbor distances between them:

δ_i / δ_j = d_j^K / d_i^K.
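The relative density computation described above can be sketched with brute-force pairwise distances (adequate for data sets of the sizes used here; the function name is illustrative):

```python
import numpy as np

def relative_density(X, K):
    """Relative density of each instance: the reciprocal of its Kth-nearest-
    neighbor Euclidean distance (the KNN-PDE-alike estimate)."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))   # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)            # an instance is not its own neighbor
    dK = np.sort(dist, axis=1)[:, K - 1]      # Kth-nearest-neighbor distance d_i^K
    return 1.0 / dK                           # relative density delta_i = 1 / d_i^K
```

Instances in dense regions get large values; isolated points (potential noise or outliers) get small ones, without any reference to the absolute scale of the feature space.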
It is also important to confirm the selection of the parameter K for the relative density. If the value of K is too small, it would fail to identify the noise and outliers among the normal instances; if the value of K is too large, the distinction between the significant instances and the noise or outliers might become ambiguous and some small disjunctions would not be captured. To avoid this problem, this paper assigns an appropriate value to K, which is empirically set as a function of N during the experiments, where N denotes the number of training instances.
3.2. Design of Fuzzy Membership Functions
Based on the relative density, two different fuzzy membership functions are designed. One adopts intraclass relative density information, and the other uses interclass relative density information. The details will be introduced in the following sections.
3.2.1. Fuzzy Membership Function Based on Intraclass Density Information
In this type of fuzzy membership function, the membership of x_i is defined with respect to d_i^K, the distance between the instance x_i and its Kth-nearest neighbor within the same class. The instances appearing in a high-density region are seen as more informative ones and are assigned higher values, while the examples far from the high-density region are seen as noise or outliers and are assigned lower values. To avoid the impact induced by the data distribution scale, a normalized fuzzy membership function can be represented as follows:

m_i = (1/d_i^K) / Σ_{j=1}^{N_c} (1/d_j^K),

where N_c denotes the number of instances belonging to the class which x_i drops in and the sum runs over those instances. The merit lies in the fact that the fuzzy membership value only reflects the relative density within its own class and is irrespective of the number of instances in that class. Therefore, it is more robust to the variance of the data distribution scale. In addition, since each class is handled independently, it is adaptive for the class imbalance problem.
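A minimal sketch of this intraclass weighting, assuming the reciprocal Kth-neighbor distances are normalized per class so that every class receives the same total membership (the paper's exact normalization constant may differ):

```python
import numpy as np

def intraclass_membership(X, y, K):
    """Fuzzy membership from intraclass relative density.

    For each class independently: compute each instance's Kth-nearest-neighbor
    distance within the class, take its reciprocal (relative density), and
    normalize so the class memberships sum to 1.
    """
    m = np.empty(len(y), dtype=float)
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        Xc = X[idx]
        diff = Xc[:, None, :] - Xc[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=2))
        np.fill_diagonal(dist, np.inf)         # exclude self-distance
        dK = np.sort(dist, axis=1)[:, K - 1]   # intraclass Kth-NN distance
        dens = 1.0 / dK                        # intraclass relative density
        m[idx] = dens / dens.sum()             # per-class normalization
    return m
```

Because each class is normalized on its own, the two classes contribute identical total fuzzy coefficients regardless of how imbalanced they are.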
3.2.2. Fuzzy Membership Function Based on Interclass Relative Density Information
In this type of fuzzy membership function, the membership is associated with the estimated class boundary; that is, an instance closer to the estimated class boundary is assigned a higher membership value. To precisely estimate the class boundary, we investigate the characteristics of four kinds of instances with respect to different density distributions: normal, boundary, noise, and outliers. Figure 2 provides a visualized description of these instances. Their characteristics can be concluded as follows:(a)Normal: the instance appears in a high-density region within its own class but a low-density region of the other class(b)Boundary: the instance appears in a low- or medium-density region in both classes but always has a slightly higher density within its own class than in the other(c)Noise: the instance appears in a low-density region within its own class but a higher-density region of the other class(d)Outliers: the instance appears in a low-density region in both classes
According to the characteristics listed above, we can locate the class boundary. First, for each instance, we compare its intraclass relative density with its interclass relative density to find the noise, which can be detected with a discriminant. If the instance x_i is from the positive class, its discriminant is shown as follows: where d′ denotes the Kth-nearest-neighbor distance calculated only among the instances of the other class, N+ and N− denote the numbers of instances in the positive and negative classes, respectively, ⌈·⌉ provides the round-up operation, and IR is the class imbalance ratio of the two class sizes. Meanwhile, if x_i comes from the negative class, the discriminant is modified as shown in equation (11). For each instance satisfying the discriminant condition in equation (10) or (11), this paper extracts it as noise and then assigns a very small membership value to it.
Then, for the rest of the instances, we assign their membership values with interclass relative density information. The fuzzy membership function can be represented as the following piecewise function: where N_nonnoise and N_noise denote the numbers of nonnoise and noise instances in the same class as x_i, respectively.
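The core of the noise test above can be sketched as follows. This is a simplification: an instance is flagged as noise when its relative density within the opposite class exceeds its relative density within its own class; the paper's actual discriminant additionally adjusts K by the imbalance ratio IR, which is omitted here.

```python
import numpy as np

def _kth_nn_dist(Xq, Xref, K, exclude_self=False):
    """Kth-nearest-neighbor distance of each query point among Xref."""
    diff = Xq[:, None, :] - Xref[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    if exclude_self:                 # only valid when Xq is Xref
        np.fill_diagonal(dist, np.inf)
    return np.sort(dist, axis=1)[:, K - 1]

def flag_noise(X, y, K):
    """Flag instances that are denser in the opposite class than in their own."""
    noise = np.zeros(len(y), dtype=bool)
    for c in np.unique(y):
        own = np.where(y == c)[0]
        other = np.where(y != c)[0]
        d_own = _kth_nn_dist(X[own], X[own], K, exclude_self=True)
        d_other = _kth_nn_dist(X[own], X[other], K)
        noise[own] = d_other < d_own   # smaller distance = higher density there
    return noise
```

Instances that pass this test are then weighted by their interclass relative density, so that boundary instances end up with the largest memberships.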
3.3. Two Proposed Algorithms
In this section, the two proposed algorithms based on WELM are described. In order to set personalized weights, this paper first considers the distribution information and takes the fuzzy membership value m_i of each training sample as its new weight, which replaces the original diagonal entry W_ii. The new diagonal matrix can be described as follows:
Next, this section describes the algorithm based on the intraclass relative density information called FWELM-INTRA and the algorithm based on the interclass relative density information called FWELM-INTER. Their flow paths are briefly described in Algorithms 1 and 2.
4. Experiments and Analysis
4.1. Data Sets
During this study, the experimental data sets are available from the public PROMISE repository , which has been commonly used in empirical studies of SDP. Detailed information about the data sets is shown in Table 1; for each data set, the number of instances, the number of defects, the number of metrics, and the percentage of defective modules are given. According to the defective module ratios, every data set is imbalanced. In order to ensure the accuracy and convergence of the proposed solutions, zero padding is used to handle missing values in the data sets, and data normalization  is adopted before conducting the experiments.
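The preprocessing step can be sketched as below. The zero-fill of missing values follows the text; min-max scaling to [0, 1] is assumed as the normalization variant, since the paper does not specify which one it uses.

```python
import numpy as np

def preprocess(X):
    """Zero-fill missing values, then min-max normalize each metric to [0, 1]."""
    X = np.where(np.isnan(X), 0.0, X)        # zero padding for missing values
    mn, mx = X.min(axis=0), X.max(axis=0)
    rng = np.where(mx > mn, mx - mn, 1.0)    # guard constant columns against /0
    return (X - mn) / rng
```

Normalization matters here because the relative density is computed from Euclidean distances, which would otherwise be dominated by the metrics with the largest raw ranges.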
4.2. Experimental Settings
Firstly, to validate the effectiveness and superiority of the two proposed algorithms, this paper compared them not only with several representative class imbalance learning algorithms based on ELM but also with WELM1 and WELM2. In addition, we also compared them with the ensemble method DNC , which has been proved to be better than the traditional classifiers Naive Bayes and Random Forest. Brief descriptions are as follows:(1)ELM : the standard ELM algorithm without any operation to address the class imbalance problem of SDP data sets.(2)RUS : it first adopts the RUS algorithm to generate a totally balanced training set and then trains an ELM model on this training set.(3)ROS : it first adopts the ROS algorithm to generate a totally balanced training set and then trains an ELM model on this training set.(4)SMOTE : it first adopts the SMOTE algorithm to generate a totally balanced training set and then trains an ELM model on this training set.(5)WELM1 and WELM2 : the two weighting strategies of WELM adopted as balance control for the binary-classification task. They can be regarded as baseline algorithms used to indicate the effect of noise or outliers in SDP data sets.(6)DNC : an ensemble learning method to solve the imbalance problem of SDP data sets. It can be regarded as a baseline algorithm used for comparison with ensemble learning on performance.
Secondly, to measure the performance on the SDP data sets, the probability of detection (PD) and the probability of false alarm (PF) are used . For a more comprehensive evaluation of predictors in the imbalanced context, G-mean  and AUC  are frequently used to measure how well the predictor can balance the performance between the two classes. In the SDP context, G-mean reflects the change in PD efficiently . It can be calculated by

G-mean = √(PD × (1 − PF)).
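As a tiny helper, assuming the standard definition of G-mean as the geometric mean of PD (recall on the defective class) and 1 − PF (specificity on the nondefective class):

```python
import math

def g_mean(pd, pf):
    """G-mean of a binary predictor from PD (detection rate) and PF (false alarm rate)."""
    return math.sqrt(pd * (1.0 - pf))
```

A predictor that sacrifices one class for the other is punished: G-mean drops to 0 whenever PD = 0 or PF = 1, no matter how good the other rate is.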
AUC estimates the area under the ROC curve, formed by a set of (PF, PD) pairs. The ROC curve illustrates the tradeoff between detection and false alarm rates, which serves as the performance of a classifier across all possible decision thresholds. AUC provides a single number for performance comparison, varying in [0, 1]. A better classifier should produce a higher AUC. AUC is equivalent to the probability that a randomly chosen example of the positive class will have a smaller estimated probability of belonging to the negative class than a randomly chosen example of the negative class.
In the work in , the point (PF = 0, PD = 1) was proposed as the ideal position on the ROC curve, where all defects are recognized without mistakes; the measure Balance is introduced by calculating the Euclidean distance from the real (PF, PD) point to (0, 1) and is frequently used by software engineers in practice . By definition,

Balance = 1 − √(PF² + (1 − PD)²) / √2.
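The Balance measure can likewise be computed directly from a (PF, PD) pair, assuming the usual normalization by √2 so that the value lies in [0, 1]:

```python
import math

def balance(pd, pf):
    """Balance: 1 minus the normalized Euclidean distance from the observed
    (PF, PD) point to the ideal ROC point (PF = 0, PD = 1)."""
    return 1.0 - math.sqrt(pf ** 2 + (1.0 - pd) ** 2) / math.sqrt(2.0)
```

The ideal predictor (PD = 1, PF = 0) scores 1, and the worst case (PD = 0, PF = 1) scores 0.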
In summary, this paper uses G-mean, AUC, and Balance to guarantee that the experiments are effective. All of them are expected to be high for a good predictor. The advantage of the three measures is their insensitivity to class distributions in data .
Thirdly, to avoid the randomness of the experiments, this paper applied 10-fold cross validation, each time building models using nine of the ten partitions and testing on the remaining one. The above procedure is repeated for 100 runs in total (10 rounds of 10-fold cross validation) to calculate the average result for each algorithm, and the results are provided in the form of mean ± standard deviation. The whole setting can be seen in Figure 3.
Meanwhile, for each algorithm related to ELM and WELM, a sigmoid function is used to calculate the hidden layer output matrix, and the two main parameters L and C are determined by grid search over predefined ranges.
4.3. Research Questions
In this section, we are interested in answering the following three research questions:(i)RQ1: do the proposed algorithms perform better than the previous studies?(ii)RQ2: how does the value of K impact performance?(iii)RQ3: what is the time complexity of the two proposed algorithms?
In RQ1, we evaluate the effectiveness of the two proposed algorithms and compare them with previous studies in terms of G-mean, AUC, and Balance. In RQ2, we investigate how the value of K impacts our algorithms over its candidate range. In RQ3, we examine the time complexity of the two algorithms when the number of training instances and the number of attributes change, respectively. In the following sections, we provide the result analysis of these three research questions.
4.4. RQ1: Do the Proposed Algorithms Perform Better than the Previous Studies?
Previous studies note that good SDP methods can support developers in finding defects. In this RQ, we investigate whether our proposed algorithms effectively perform better. The benefits of identifying software defects lie in two aspects: first, once a software defect is predicted, a timely warning can be given to the development team, saving developer effort and time; second, identifying software defects can help to avoid defects in the future.
To answer this RQ, we implement our approach (the data and code for the algorithms are available at https://github.com/Dark204/Work) and compare the performance with the baselines based on the open-source projects. Then, we measure the performance and perform the statistical test.
Tables 2–4 show the mean and standard deviation values of G-mean, AUC, and Balance. By observing the results, it is not difficult to draw the following conclusions:(1)Class imbalance learning techniques are useful for promoting the prediction performance on imbalanced SDP data: on all the selected SDP data sets, the proposed fuzzy membership functions with KNN-PDE-alike information, the sampling algorithms, the WELM algorithms, and the ensemble DNC algorithm all gained higher G-mean and Balance values than the standard ELM predictive model.(2)Compared to DNC, the predictive models trained by FWELM-INTRA and FWELM-INTER have higher G-mean values on all the data sets except pc4 and higher Balance values on all the data sets except kc1 and pc4. In addition, FWELM-INTRA and FWELM-INTER also have higher AUC values than DNC on six of ten data sets. Therefore, it is clear that the proposed algorithms show better performance than DNC. For the data sets on which the performance is lower than DNC, we believe that the size of the class overlap and the absolute number of training instances affect the results. The results not only imply that the two algorithms achieve a better balance between PD and PF but also prove the importance of the distribution of samples in feature space.(3)The two proposed algorithms are based on WELM, so they are also compared with WELM1 and WELM2. In terms of G-mean, AUC, and Balance, the two algorithms outperform both, which proves that the fuzzy values based on relative density information can improve WELM to train better SDP models. Unlike traditional WELM, the proposed algorithms assign individual weights that avoid the effect of noise or outliers.(4)As illustrated in Section 2, ELM alone cannot deal with class imbalance, but its combination with data sampling techniques (ELM-RUS, ELM-ROS, and ELM-SMOTE) is effective. Table 2 shows that FWELM-INTRA and FWELM-INTER are still better than the three improved ELM algorithms on the G-mean metric. In Tables 3 and 4, the AUC and Balance results of FWELM-INTRA and FWELM-INTER are also higher than those of the improved ELM algorithms on 9 of 10 data sets.(5)Although WELM1 and WELM2 assign weights to instances, noise or outliers can still degrade their performance; even so, they achieve at least comparable performance to DNC.