Abstract

To identify software modules that are more likely to be defective, machine learning has been used to construct software defect prediction (SDP) models. However, several previous works have found that the imbalanced nature of software defect data can degrade model performance. In this paper, we discuss how to handle the imbalanced data distribution in the context of SDP so as to improve defect prediction. Firstly, a relative density is introduced to reflect the significance of each instance within its class; it is independent of the scale of the data distribution in feature space and is therefore more robust than absolute distance information. Secondly, a strategy similar to K-nearest-neighbors-based probability density estimation (KNN-PDE) is used to calculate the relative density of each training instance. Furthermore, the fuzzy memberships of the samples are designed based on relative density in order to eliminate classification errors caused by noise and outlier samples. Finally, two algorithms are proposed to train software defect prediction models based on the weighted extreme learning machine. The proposed algorithms are compared with traditional SDP methods on benchmark data sets. The results show that the proposed methods have better overall performance in terms of G-mean, AUC, and Balance. The proposed algorithms are more robust and adaptive to SDP data distribution types, estimate the significance of each instance more accurately, and assign identical total fuzzy coefficients to the two classes regardless of data scale.

1. Introduction

Software defect prediction (SDP) [1] has become an active research topic in software engineering and has drawn growing interest from both academia and industry. It can be formulated as a learning problem whose solution facilitates software testing and saves testing cost. Various machine learning methods have used software defect training data sets to build prediction models, among which Random Forest [2] and Naive Bayes [3] have been shown to have relatively stable performance. However, class imbalance [4–7] is a common problem in SDP data sets, and it can degrade model performance. The distribution of defects across software modules roughly conforms to the Pareto principle, also known as the 80–20 rule: 80% of the defects are concentrated in 20% of the program modules, so nondefective modules far outnumber defective ones. Hence, the prediction accuracy on the minority class is lower.

Previous studies [8, 9] have indicated that a model tends to fail when it is applied to data with a class imbalance problem. To address this problem, class imbalance techniques such as ROS (random oversampling) [10], RUS (random undersampling) [11], and SMOTE (synthetic minority oversampling technique) [12] have been used to construct SDP models. In addition, Wang and Yao [13] analysed three different types of class imbalance methods for software defect prediction. They found that their proposed ensemble approach, DNC (Dynamic Adaboost.NC), is better than ROS, RUS, and SMOTE. DNC adjusts its parameter automatically during the training process, which can further improve the prediction model’s performance. However, the above prediction models may encounter the following problems: (1) choosing suitable coefficients for different classes, (2) abandoning the instances in the small disjunctions, and (3) estimating the wrong class boundary, all of which can lead to unexpected SDP classification results.

In this paper, we present a more robust representation of data distribution information called relative density, which can be extracted by a strategy similar to K-nearest-neighbors-based probability density estimation (KNN-PDE) [14–16], to evaluate the significance of each training instance and to design the corresponding fuzzy membership function. In contrast to a Euclidean-distance-based measure, the relative density is independent of the scale of the data distribution in feature space. Meanwhile, it also reflects the proportional relation of different instances within the class. Moreover, the KNN-PDE-like strategy has another merit: there is no need to normalize the fuzzy coefficients after acquiring the relative density information of all training instances. The fuzzy membership function is then designed based on KNN-PDE and assigns larger weights to high-density instances. This paper treats the fuzzy values as the weights of training instances and embeds them into the weighted extreme learning machine (WELM), which can handle noise and outliers effectively. WELM is selected as the baseline classifier based on three observations: (1) compared to other classifiers, WELM always has better or at least comparable generalization ability and classification performance [17], (2) it can save a great deal of training time compared to other classifiers [18], and (3) it can deal with imbalanced data distributions through cost-sensitive learning [19]. Finally, two algorithms based on WELM are proposed: one relies on the intraclass relative density and the other depends on the interclass relative density. That is, the first membership function assigns larger weights to high-density instances, while the second assigns larger weights to the examples that are nearer to the real classification boundary.
To evaluate the algorithms’ effectiveness, this paper compares them with previous works on benchmark data sets, and the experimental results indicate that the proposed algorithms generally produce better or at least comparable performance in terms of G-mean, AUC, and Balance.

The remainder of this paper is structured as follows. Section 2 introduces background knowledge related to this work, including software defect prediction, extreme learning machine, and weighted extreme learning machine. Section 3 describes the proposed methods in detail. The experiments and analysis are given in Section 4, and Section 5 concludes the research and provides suggestions for future work.

In this section, some preliminaries are presented, including software defect prediction, extreme learning machine, and weighted extreme learning machine.

2.1. Software Defect Prediction

SDP models are expected to improve software quality and reduce the maintenance cost of software systems. Researchers have used defect prediction data sets to build comparable models for their studies. So far, a great deal of research has been devoted to metrics describing code modules and to learning algorithms for creating SDP models. A variety of machine learning methods have been proposed and compared for SDP problems, such as neural networks [20], decision trees [21], Naive Bayes [22], and support vector machine [23]. However, the above methods ignore the effect of class imbalance on model performance [1]; that is, the number of defective instances differs greatly from the number of nondefective instances. It is a great challenge for most conventional classification algorithms to work with data that have an unbalanced class distribution, because they may ignore the minority class, which could be more valuable in a wide range of applications. Thus, some class imbalance learning techniques have been used to reduce this negative effect. The work in [24] studied which types of metrics are useful for handling class imbalance based on static code. An undersampling approach was proposed to balance training data [25] and to check how little information is required to learn a defect predictor. The authors found that throwing away data does not affect the performance of the selected predictors. In addition, ensemble algorithms [26] and their cost-sensitive variants were studied and shown to be effective if a proper cost ratio can be set. Issam et al. [27] implemented software defect prediction using ensemble learning on features selected by greedy forward selection. Yang et al. [9] proposed a two-layer ensemble learning approach for just-in-time defect prediction to improve the performance of SDP.

Besides the works introduced above, there are other existing works on software defect prediction that are not listed here, because some of them do not consider data distribution and several others only combine basic sampling methods with different learning methods. Section 1 noted that WELM has three advantages over traditional learning methods. Therefore, this paper only studies how to build more robust SDP models based on data distribution and compares them with the sampling methods through extensive experiments and comprehensive analyses.

2.2. Extreme Learning Machine

Extreme learning machine (ELM), proposed by Huang et al. [28], is a specific learning algorithm for single-hidden-layer feedforward neural networks (SLFN). The main characteristic that distinguishes ELM from conventional learning algorithms for SLFN is the random generation of hidden nodes. Therefore, ELM does not need to iteratively adjust parameters to make them approach the optimal values; thus it has faster learning speed and better generalization ability. Previous research has indicated that ELM can produce better or at least comparable generalization ability and classification performance compared to SVM and the multilayer perceptron (MLP) while consuming only tenths or hundredths of their training time.

Let us consider a classification problem with N training instances to distinguish m categories; the ith training instance can be represented as $(x_i, t_i)$, where $x_i$ is an $n \times 1$ input vector and $t_i$ is the corresponding $m \times 1$ output vector. Suppose that there are L hidden nodes in ELM and that all weights and biases on these nodes are generated randomly. Then, for the instance $x_i$, its hidden layer output can be represented as a row vector $h(x_i) = [h_1(x_i), \ldots, h_L(x_i)]$. The mathematical model of ELM can be described as

$$H\beta = T, \tag{1}$$

where $H = \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_N) \end{bmatrix}$ is the hidden layer output matrix over all training instances, $\beta$ is the weight matrix of the output layer, and $T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}$ is the target matrix. In equation (1), only $\beta$ is unknown, so the least-squares algorithm is applied to acquire its solution, which can be described as follows:

$$\hat{\beta} = H^{\dagger} T, \tag{2}$$

where $H^{\dagger}$ denotes the Moore–Penrose generalized inverse of the hidden layer output matrix H, which guarantees that the solution is the least-norm least-squares solution of equation (1).
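As an illustration of equations (1) and (2), the following minimal sketch trains a basic ELM with NumPy. The function names `elm_fit` and `elm_predict`, the Gaussian initialisation of the hidden layer, and the toy data are assumptions made for this example only; they are not taken from the original implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(X, T, L, rng):
    """Basic ELM: random hidden layer, least-squares output weights."""
    n_features = X.shape[1]
    A = rng.standard_normal((n_features, L))  # random input weights, never trained
    b = rng.standard_normal(L)                # random hidden biases, never trained
    H = sigmoid(X @ A + b)                    # N x L hidden layer output matrix
    beta = np.linalg.pinv(H) @ T              # equation (2): beta = pinv(H) T
    return A, b, beta

def elm_predict(X, A, b, beta):
    """Return the N x m output matrix; argmax over columns gives the class."""
    return sigmoid(X @ A + b) @ beta

# Toy usage on synthetic data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))
labels = (X[:, 0] > 0).astype(int)
T = np.eye(2)[labels]                         # one-hot targets, N x m
A, b, beta = elm_fit(X, T, L=20, rng=rng)
pred = elm_predict(X, A, b, beta).argmax(axis=1)
```

Only `beta` is estimated from data; the hidden layer parameters `A` and `b` stay at their random values, which is what removes the iterative tuning step.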

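According to previous work, ELM can also be trained from the viewpoint of optimization. In the optimization version of ELM, we wish to simultaneously minimize $\|H\beta - T\|^2$ and $\|\beta\|^2$, so the problem can be described as follows:

$$\min_{\beta}\; \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{i=1}^{N}\|\xi_i\|^2 \quad \text{s.t.} \quad h(x_i)\beta = t_i^T - \xi_i^T,\; i = 1, \ldots, N, \tag{3}$$

where $\xi_i = [\xi_{i,1}, \ldots, \xi_{i,m}]^T$ denotes the training error vector of the m output nodes with respect to the training instance $x_i$ and C is the penalty factor, representing the tradeoff between the minimization of training errors and the maximization of generalization ability. This is a typical quadratic programming problem that can be solved by the Karush–Kuhn–Tucker (KKT) theorem [29]. The solution of equation (3) can be described as follows:

$$\beta = \begin{cases} H^T\left(\dfrac{I}{C} + HH^T\right)^{-1} T, & N < L, \\[2ex] \left(\dfrac{I}{C} + H^TH\right)^{-1} H^T T, & N \ge L. \end{cases} \tag{4}$$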

2.3. Weighted Extreme Learning Machine

Weighted extreme learning machine (WELM), which can be regarded as a cost-sensitive version of ELM, is an effective way to handle imbalanced data [19]. Similar to CS-SVM, the main idea of WELM is to assign different penalties to different categories, where the minority class has a larger penalty factor C and the majority class has a smaller C value. WELM then focuses on the training errors of the minority instances, so that the classification hyperplane emerges in a more impartial position. A weighted matrix W is used to regulate the parameter C for different instances; that is, equation (3) can be rewritten as

$$\min_{\beta}\; \frac{1}{2}\|\beta\|^2 + \frac{C}{2}\sum_{i=1}^{N} W_{ii}\|\xi_i\|^2 \quad \text{s.t.} \quad h(x_i)\beta = t_i^T - \xi_i^T,\; i = 1, \ldots, N, \tag{5}$$

where W is an $N \times N$ diagonal matrix in which each diagonal value $W_{ii}$ is the regulation weight of the parameter C for the ith instance. Zong et al. [19] provided two different weighting strategies, which are described as follows:

$$\text{WELM1: } W_{ii} = \frac{1}{\#(t_i)}, \qquad \text{WELM2: } W_{ii} = \begin{cases} \dfrac{0.618}{\#(t_i)}, & \#(t_i) > \mathrm{AVG}, \\[2ex] \dfrac{1}{\#(t_i)}, & \#(t_i) \le \mathrm{AVG}, \end{cases} \tag{6}$$

where $W_{ii}$, $\#(t_i)$, AVG, and 0.618 denote the weight of the ith training instance, the number of instances belonging to the class $t_i$, the average number of instances over all classes, and the value of the golden ratio, respectively. Compared with WELM2, WELM1 is more practical and popular. The solution can then be written as follows:

$$\beta = \begin{cases} H^T\left(\dfrac{I}{C} + WHH^T\right)^{-1} WT, & N < L, \\[2ex] \left(\dfrac{I}{C} + H^TWH\right)^{-1} H^TWT, & N \ge L. \end{cases} \tag{7}$$

Whichever weighting strategy is used, minority-class samples are given more weight. Hence, the higher the class imbalance ratio, the higher the weight ratio between samples of different classes. According to the work in [19], users can define $W_{ii}$ for every sample $x_i$ to improve the performance, so this paper constructs a new W based on the data distribution.
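The two weighting strategies of [19] and the weighted closed-form solution can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the helper names `welm_weights` and `welm_beta` are hypothetical, the weights are stored as a vector holding the diagonal of W, and the hidden layer output matrix H is assumed to be computed beforehand.

```python
import numpy as np

def welm_weights(y, scheme="W1"):
    """Per-instance weights: 1/#(t_i), or 0.618/#(t_i) for classes above the average size (W2)."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    count_of = dict(zip(classes.tolist(), counts.tolist()))
    avg = counts.mean()                       # average number of instances per class
    w = np.empty(len(y))
    for i, label in enumerate(y.tolist()):
        n = count_of[label]
        w[i] = 0.618 / n if (scheme == "W2" and n > avg) else 1.0 / n
    return w

def welm_beta(H, T, w, C):
    """Weighted ELM output weights; w holds the diagonal of W."""
    N, L = H.shape
    if N < L:
        WH = w[:, None] * H                   # W H (rows of H scaled by w_i)
        return H.T @ np.linalg.solve(np.eye(N) / C + WH @ H.T, w[:, None] * T)
    HtW = H.T * w                             # H^T W (columns of H^T scaled by w_i)
    return np.linalg.solve(np.eye(L) / C + HtW @ H, HtW @ T)
```

Passing a vector of ones as `w` reduces `welm_beta` to the unweighted regularized ELM solution, and any user-defined weight vector can be supplied in place of the two default schemes.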

3. The Proposed Methods

Although WELM can alleviate the class imbalance problem, it does not consider the distribution of samples in feature space. In addition, there are noise and outliers in software defect data, which can further affect the performance. Thus, this paper draws on the works in [30, 31] and introduces the concept of fuzzy sets, which can mine the distribution of each instance in feature space and set the weights in a more personalized way. To describe our method, this section first introduces the relative density, which is used to avoid the expensive calculation of probability density in high-dimensional space. Then the fuzzy membership functions based on relative density are designed to replace WELM’s weight matrix, and the two proposed SDP algorithms are described. Finally, experiments are designed to validate the methods. The whole framework can be seen in Figure 1.

3.1. Relative Density Estimation Strategy

It is easy to distinguish outliers and noise from the significant instances if we can estimate the probability density of each training instance. However, in a high-dimensional feature space, it is always difficult to acquire exact measurements of the probability density, and even an approximately accurate estimate is time-consuming to obtain. To solve this problem, we introduce an improved method in this subsection. We consider that it is unnecessary to measure the probability density exactly; it is enough to extract the proportional relation of the probability densities between any two training instances. We call the information reflecting this proportional relation the relative density.

To obtain the relative density, a strategy similar to K-nearest-neighbors-based probability density estimation (KNN-PDE) is applied. As a nonparametric probability density estimation approach, KNN-PDE estimates the probability density distribution in a multidimensional continuous space by measuring the K-nearest-neighbor distance of each training instance. When the number of training instances goes to infinity, the result obtained from KNN-PDE approximately converges to the real probability density distribution. Hence, the K-nearest-neighbor distances can be used to estimate the relative density, and Euclidean distance is selected to calculate the distance in the proposed algorithms.

Suppose that a data set contains N instances; then, for each instance $x_i$, we can find its Kth nearest neighbor and record the distance between them as $d_i^K$. The larger $d_i^K$ is, the lower the density around the instance. At the same time, both noise and outliers tend to appear in low-density regions, so we can use $d_i^K$ as the measure to evaluate the significance of each instance. However, to provide larger values for high-density instances and smaller values for low-density instances, for example, noise and outliers, $d_i^K$ should be transformed to its reciprocal $1/d_i^K$. In this paper, the reciprocal of the K-nearest-neighbor distance is defined as the instance’s relative density. It is not difficult to observe that the ratio of the relative densities of any two instances $x_i$ and $x_j$ exactly equals the inverse of the ratio of their K-nearest-neighbor distances:

$$\frac{1/d_i^K}{1/d_j^K} = \frac{d_j^K}{d_i^K}. \tag{8}$$
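The relative density described above can be sketched in a few lines of NumPy. The function names are hypothetical, K is left as a caller-supplied parameter, and the brute-force pairwise distance matrix is chosen for clarity rather than efficiency. The sketch assumes no duplicate instances, since a zero Kth-nearest-neighbor distance would make the reciprocal undefined.

```python
import numpy as np

def kth_nn_distance(X, K):
    """Euclidean distance from each row of X to its Kth nearest neighbor (self excluded)."""
    diffs = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diffs ** 2).sum(axis=2))     # N x N pairwise distance matrix
    D_sorted = np.sort(D, axis=1)             # column 0 is the zero self-distance
    return D_sorted[:, K]

def relative_density(X, K):
    """Reciprocal of the Kth-nearest-neighbor distance."""
    return 1.0 / kth_nn_distance(X, K)

# Toy usage: the isolated point at 10 receives the lowest relative density.
X = np.array([[0.0], [1.0], [2.0], [10.0]])
rho = relative_density(X, K=1)
```

Rescaling every feature by the same constant changes each distance by that constant but leaves all density ratios unchanged, which is the scale independence the text relies on.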

It is also important to select the parameter K for the relative density carefully. If the value of K is too small, noise and outliers cannot be distinguished from normal instances; if the value of K is too large, the distinction between significant instances and noise or outliers may become ambiguous and some small disjunctions will not be captured. To avoid this problem, this paper assigns an appropriate value to K: it is set empirically as a function of N during the experiments, where N denotes the number of training instances.

3.2. Design of Fuzzy Membership Functions

Based on the relative density, two different fuzzy membership functions are designed. One adopts intraclass relative density information, and the other uses interclass relative density information. The details will be introduced in the following sections.

3.2.1. Fuzzy Membership Function Based on Intraclass Density Information

In this type of fuzzy membership function, the membership value of an instance is defined with respect to its intraclass relative density, which is the reciprocal of the distance between the instance and its Kth nearest neighbor within the same class. Instances appearing in a high-density region are regarded as more informative and are assigned higher values, while instances far from the high-density region are regarded as noise or outliers and are assigned lower values. To avoid the impact of the data distribution scale, a normalized fuzzy membership function is used (equation (9)), whose definition involves the number of instances belonging to the class into which the instance falls. The merit lies in the fact that the fuzzy membership value only reflects the relative density within its own class and is independent of the number of instances in that class. Therefore, it is more robust to variation in the data distribution scale. In addition, because each class is handled independently, it is adaptive to the class imbalance problem.
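A possible realisation of an intraclass membership is sketched below. It is not the paper's exact formula: the sketch assumes that the membership of an instance is its intraclass relative density divided by the sum of relative densities in its class, which is one normalization consistent with both classes receiving the same total fuzzy coefficient. The rule `default_k`, which takes the rounded-up square root of the class size, is likewise an assumption, as are all function names.

```python
import numpy as np

def default_k(n):
    """Assumed rule for K: rounded-up square root of the class size, capped at n - 1."""
    return min(n - 1, int(np.ceil(np.sqrt(n))))

def intraclass_membership(X, y, k_rule=default_k):
    """Membership = intraclass relative density, normalized to sum to 1 within each class."""
    y = np.asarray(y)
    s = np.zeros(len(y))
    for label in np.unique(y):
        idx = np.where(y == label)[0]         # each class is handled independently
        Xc = X[idx]
        D = np.sqrt(((Xc[:, None, :] - Xc[None, :, :]) ** 2).sum(axis=2))
        d_k = np.sort(D, axis=1)[:, k_rule(len(idx))]   # Kth-NN distance within the class
        rho = 1.0 / d_k                       # assumes at least two distinct instances per class
        s[idx] = rho / rho.sum()              # assumed sum-normalization
    return s

# Toy usage: the point at 20 is far from the rest of class 0 and gets the lowest value.
X = np.array([[0.0], [1.0], [2.0], [3.0], [20.0], [100.0], [101.0], [102.0]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1])
s = intraclass_membership(X, y)
```

The resulting vector `s` can be placed on the diagonal of the WELM weight matrix in place of the class-size-based weights.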

3.2.2. Fuzzy Membership Function Based on Interclass Relative Density Information

In this type of fuzzy membership function, the membership value of an instance is associated with the estimated class boundary; that is, an instance closer to the estimated class boundary is assigned a higher membership value. To estimate the class boundary precisely, we investigate the characteristics of four kinds of instances with respect to different density distributions. The instances are divided into normal, boundary, noise, and outlier instances. Figure 2 provides a visualized description of these instances. Their characteristics can be summarized as follows:
(a) Normal: the instance appears in a high-density region within its own class but in a low-density region of the other class
(b) Boundary: the instance appears in a low- or medium-density region in both classes but always has a slightly higher density within its own class than in the other
(c) Noise: the instance appears in a low-density region within its own class but in a higher-density region of the other class
(d) Outliers: the instance appears in a low-density region in both classes

According to the characteristics listed above, we can locate the boundary. First, for each instance, we compare its intraclass relative density with its interclass relative density to find the noise, which can be detected with a discriminant. The discriminant for an instance from the positive class is given in equation (10); in it, d’ denotes the Kth-nearest-neighbor distance calculated only over the instances of the other class, the numbers of instances in the positive class and the negative class both appear, a round-up operation is applied, and IR is the class imbalance ratio. If the instance comes from the negative class, the discriminant is modified as in equation (11). Each instance satisfying the discriminant condition in equation (10) or (11) is extracted as noise and assigned a very small membership value.
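The idea behind the noise discriminant can be illustrated with a deliberately simplified sketch. It flags an instance as noise when its Kth-nearest-neighbor distance in the other class is smaller than in its own class, that is, when it sits in a denser region of the other class. This plain comparison is an assumption made for illustration: it omits the imbalance-ratio and round-up adjustments that equations (10) and (11) apply, and the function names are hypothetical.

```python
import numpy as np

def kth_distance_to_set(x, S, K):
    """Distance from x to its Kth nearest point in S (capped at the size of S)."""
    d = np.sort(np.sqrt(((S - x) ** 2).sum(axis=1)))
    return d[min(K, len(d)) - 1]

def flag_noise(X, y, K):
    """Simplified noise rule: denser among the other class than among its own class."""
    y = np.asarray(y)
    n = len(y)
    flags = np.zeros(n, dtype=bool)
    for i in range(n):
        own = [j for j in range(n) if j != i and y[j] == y[i]]
        other = [j for j in range(n) if y[j] != y[i]]
        d_own = kth_distance_to_set(X[i], X[own], K)      # intraclass Kth-NN distance
        d_other = kth_distance_to_set(X[i], X[other], K)  # interclass Kth-NN distance
        flags[i] = d_other < d_own
    return flags

# Toy usage: the class-0 instance at 5.15 lies inside the class-1 cluster.
X = np.array([[0.0], [0.1], [0.2], [0.3], [5.15], [5.0], [5.1], [5.2], [5.3]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])
flags = flag_noise(X, y, K=2)
```

With strongly imbalanced classes the same K covers very different neighborhood sizes in the two classes, which is why a rule of this kind needs the imbalance-aware correction mentioned in the text.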

Then, for the remaining instances, we assign membership values using the interclass relative density information. The fuzzy membership function is a piecewise function (equation (12)) that involves the numbers of nonnoise and noise instances in the same class as the instance.

3.3. Two Proposed Algorithms

In this section, the two proposed algorithms based on WELM are described. To set personalized weights, this paper first considers the distribution information and obtains a new weight for each training sample based on its fuzzy membership value, which replaces the original weight $W_{ii}$. The new weights form the diagonal of a new weighted matrix that replaces W.

Next, this section describes the algorithm based on the intraclass relative density information, called FWELM-INTRA, and the algorithm based on the interclass relative density information, called FWELM-INTER. Their procedures are briefly described in Algorithms 1 and 2.

Algorithm 1: FWELM-INTRA.
Input:
 Software defect training set;
 Penalty factor C, number of hidden layer neurons L.
Output:
 FWELM-INTRA classifier.
(1) Divide the training set into two subsets, one containing only positive instances and the other containing only negative instances. Here, SDP aims to discover the defective modules, so positive instances represent defective modules and negative instances represent nondefective modules;
(2) Count the numbers of instances in the positive and negative subsets and record them;
(3) Calculate the parameter K separately for the positive class and the negative class;
(4) For each instance in the positive subset, calculate its distance to its Kth nearest neighbor in the positive subset and record it; likewise, for each instance in the negative subset, calculate its distance to its Kth nearest neighbor in the negative subset and record it;
(5) For each instance in the training set, calculate its relative density by equation (8), and then calculate its fuzzy membership value by equation (9);
(6) Obtain the new weighted matrix from the fuzzy membership values, and train an FWELM-INTRA classifier by equation (5) with the given parameters C and L.
Algorithm 2: FWELM-INTER.
Input:
 Software defect training set;
 Penalty factor C, number of hidden layer neurons L.
Output:
 FWELM-INTER classifier.
(1) Divide the training set into two subsets, one containing only positive instances and the other containing only negative instances. Here, SDP aims to discover the defective modules, so positive instances represent defective modules and negative instances represent nondefective modules;
(2) Count the numbers of instances in the positive and negative subsets and record them;
(3) Calculate the class imbalance ratio IR;
(4) Calculate the parameter K separately for the positive class and the negative class;
(5) For each instance in the positive subset, calculate its distance to its Kth nearest neighbor and record it; do the same for each instance in the negative subset;
(6) Calculate the relative density of each instance and find the noise in the two classes by equations (10) and (11);
(7) For each instance in the training set, calculate its relative density by equation (8), and then calculate its fuzzy membership value by equation (12);
(8) Obtain the new weighted matrix from the fuzzy membership values, and train an FWELM-INTER classifier by equation (5) with the given parameters C and L.

4. Experiments and Analysis

4.1. Data Sets

The experimental data sets used in this study are available from the public PROMISE repository [32] and have been commonly used in empirical studies of SDP. Detailed information about the data sets is shown in Table 1, which lists, for each data set, the number of instances, the number of defects, the number of metrics, and the percentage of defective modules. According to the ratio of defective modules, every data set is imbalanced. To ensure the accuracy and convergence of the proposed solutions, missing values in the data sets are filled with zeros and data normalization [33] is applied before conducting the experiments.

4.2. Experimental Settings

Firstly, to validate the effectiveness and superiority of the two proposed algorithms, this paper compares them not only with representative class imbalance learning algorithms based on ELM but also with WELM1 and WELM2. In addition, we also compare them with the ensemble method DNC [13], which has been shown to be better than traditional classifiers such as Naive Bayes and Random Forest. The compared methods are briefly described as follows:
(1) ELM [28]: the standard ELM algorithm without any operation to address the class imbalance problem of SDP data sets.
(2) RUS [34]: it first applies the RUS algorithm to generate a totally balanced training set and then trains an ELM model on this training set.
(3) ROS [35]: it first applies the ROS algorithm to generate a totally balanced training set and then trains an ELM model on this training set.
(4) SMOTE [12]: it first applies the SMOTE algorithm to generate a totally balanced training set and then trains an ELM model on this training set.
(5) WELM1 and WELM2 [19]: the two weighting strategies of WELM are used as the balance control for a binary classification task. In particular, they serve as baselines that indicate the effect of noise or outliers in SDP data sets.
(6) DNC [13]: the ensemble learning method for the imbalance problem of SDP data sets. It serves as a baseline for comparing performance against ensemble learning.

Secondly, to measure the performance on the SDP data sets, the probability of detection (PD) and the probability of false alarm (PF) are used, following [13]. For a more comprehensive evaluation of predictors in the imbalanced context, G-mean [36] and AUC [37] are frequently used to measure how well the predictor balances the performance between the two classes. In the SDP context, G-mean reflects the change in PD efficiently [38]. It can be calculated by

$$\text{G-mean} = \sqrt{PD \times (1 - PF)}.$$

AUC estimates the area under the ROC curve, formed by a set of (PF, PD) pairs. The ROC curve illustrates the tradeoff between detection and false alarm rates, which serves as the performance of a classifier across all possible decision thresholds. AUC provides a single number for performance comparison, varying in [0, 1]. A better classifier should produce a higher AUC. AUC is equivalent to the probability that a randomly chosen example of the positive class will have a smaller estimated probability of belonging to the negative class than a randomly chosen example of the negative class.
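The probabilistic reading of AUC given above can be checked directly by comparing every positive instance with every negative one. The sketch below is a plain-Python illustration; the function name is hypothetical, the scores are assumed to be estimated probabilities of belonging to the positive (defective) class, and ties are counted as one half, which is the usual convention.

```python
def auc_by_ranking(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs in which the positive scores higher; ties count 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Toy usage: three of the four pairs are ordered correctly.
auc = auc_by_ranking([0.9, 0.3], [0.5, 0.1])
```

The pairwise loop is quadratic in the number of instances, so it is meant for checking small examples rather than for large data sets.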

In the work in [13], the point (PF = 0, PD = 1) was proposed as the ideal position on the ROC curve, where all defects are recognized without mistakes; the measure Balance is based on the Euclidean distance from the actual (PF, PD) point to (0, 1) and is frequently used by software engineers in practice [39]. By definition,

$$\text{Balance} = 1 - \frac{\sqrt{(0 - PF)^2 + (1 - PD)^2}}{\sqrt{2}}.$$
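For concreteness, the threshold-based measures can be computed from a confusion matrix as in the sketch below. It uses the usual definitions PD = TP / (TP + FN), PF = FP / (FP + TN), G-mean = sqrt(PD × (1 − PF)), and Balance = 1 − sqrt((0 − PF)² + (1 − PD)²) / sqrt(2). The function name and the example counts are hypothetical.

```python
import math

def sdp_measures(tp, fn, fp, tn):
    """Return (PD, PF, G-mean, Balance) from confusion-matrix counts."""
    pd = tp / (tp + fn)                       # probability of detection (recall on defective modules)
    pf = fp / (fp + tn)                       # probability of false alarm
    g_mean = math.sqrt(pd * (1.0 - pf))
    balance = 1.0 - math.sqrt((0.0 - pf) ** 2 + (1.0 - pd) ** 2) / math.sqrt(2.0)
    return pd, pf, g_mean, balance

# Toy usage: 10 defective modules (8 detected) and 100 nondefective modules (10 false alarms).
pd, pf, g_mean, balance = sdp_measures(tp=8, fn=2, fp=10, tn=90)
```

Both G-mean and Balance equal 1 only at the ideal point (PF = 0, PD = 1) and decrease as either error rate grows.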

In summary, this paper uses G-mean, AUC, and Balance to guarantee that the experiments are effective. All of them are expected to be high for a good predictor. The advantage of the three measures is their insensitivity to class distributions in data [40].

Thirdly, to reduce the effect of randomness in the experiments, this paper applies 10-fold cross validation, each time building models on nine of the ten partitions and testing on the remaining one. The above procedure is repeated 100 times (10-fold cross validation) to calculate the average result for each algorithm, and the results are reported in the form of mean ± standard deviation. The whole setting can be seen in Figure 3.
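The evaluation protocol can be sketched as a generator of train/test index splits together with a mean ± standard deviation summary. The sketch uses only the standard library; the function names, the seed, and the plain (non-stratified) shuffling are assumptions of this example, and the number of repetitions is left as a parameter.

```python
import random
import statistics

def repeated_kfold_indices(n, k=10, repeats=10, seed=0):
    """Yield (train, test) index lists for repeated k-fold cross validation."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)                    # new random partition for every repetition
        folds = [order[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
            yield train, test

def summarize(scores):
    """Mean and sample standard deviation over all runs."""
    return statistics.mean(scores), statistics.stdev(scores)

# Toy usage: 3 repetitions of 10-fold cross validation over 50 instances give 30 runs.
splits = list(repeated_kfold_indices(50, k=10, repeats=3))
```

For heavily imbalanced data sets a stratified partition, which keeps the class ratio equal in every fold, may be preferable to the plain shuffle used here.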

Meanwhile, for each algorithm related to ELM and WELM, a sigmoid function is used to calculate the hidden layer output matrix, and the two main parameters L and C are determined by grid search over predefined candidate values.

4.3. Research Questions

In this section, we are interested in answering the following three research questions:
(i) RQ1: do the proposed algorithms perform better than the previous studies?
(ii) RQ2: how does the K value impact performance?
(iii) RQ3: what is the time complexity of the two proposed algorithms?

In RQ1, we evaluate the effectiveness of the two proposed algorithms and compare them with previous studies in terms of G-mean, AUC, and Balance. In RQ2, we investigate how the K value affects our algorithms over a range of values. In RQ3, we examine the time complexity of the two algorithms as the number of training instances and the number of attributes change. In the following sections, we analyse the results for these three research questions.

4.4. RQ1: Do the Proposed Algorithms Perform Better than the Previous Studies?
4.4.1. Motivation

Previous studies note that good SDP methods can help developers find defects. In this RQ, we investigate whether our proposed algorithms perform better than existing methods. The benefits of identifying software defects lie in two aspects: first, once a software defect is predicted, we can provide a timely warning to the development team and save the developers’ effort and time; second, identifying software defects can help to avoid defects in the future.

4.4.2. Approach

To answer this RQ, we implement our approach (the data and code for the algorithms are available at https://github.com/Dark204/Work) and compare the performance with the baselines based on the open-source projects. Then, we measure the performance and perform the statistical test.

4.4.3. Results

Tables 2–4 show the mean and standard deviation values of G-mean, AUC, and Balance. From the results, the following conclusions can be drawn:
(1) Class imbalance learning techniques are useful for improving the prediction performance on imbalanced SDP data: on all the selected SDP data sets, the proposed fuzzy membership algorithms with KNN-PDE-like information, the sampling algorithms, the WELM algorithms, and the ensemble DNC algorithm all gain higher G-mean and Balance values than the standard ELM predictive model.
(2) Compared to DNC, the predictive models trained by FWELM-INTRA and FWELM-INTER have higher G-mean values on all the data sets with the exception of pc4 and higher Balance values on all the data sets except kc1 and pc4. In addition, FWELM-INTRA and FWELM-INTER also have higher AUC values than DNC on six of ten data sets. Therefore, the proposed algorithms show better overall performance than DNC. For the data sets on which they perform worse than DNC, we attribute the difference to the degree of class overlap and the absolute number of training instances, both of which affect the results. The results imply not only that the two algorithms achieve a better balance between PD and PF but also that the distribution of samples in feature space is important.
(3) The two proposed algorithms are based on WELM, so they are also compared with WELM1 and WELM2. For G-mean, AUC, and Balance, the two algorithms outperform WELM1 and WELM2, which indicates that fuzzy values based on relative density information can improve WELM and train better SDP models. Unlike traditional WELM, the proposed algorithms assign different weights that reduce the effect of noise and outliers.
(4) As illustrated in Section 2, ELM cannot deal with class imbalance, but the combination of ELM with data sampling techniques (ELM-RUS, ELM-ROS, and ELM-SMOTE) is effective. Table 2 shows that FWELM-INTRA and FWELM-INTER are still better than these three improved ELM algorithms on the G-mean metric. In Tables 3 and 4, the AUC and Balance results of FWELM-INTRA and FWELM-INTER are also higher than those of the three improved ELM algorithms on 9 of 10 data sets.
(5) According to the results, although WELM1 and WELM2 assign weights to instances, noise and outliers can still degrade their performance; nevertheless, they achieve at least comparable performance to DNC.

Besides, G-mean [36] is frequently used to measure how well a predictor performs. Hence, this paper mainly uses G-mean for the statistical analysis in Tables 5–7. For each data set, nine predictive models are built following the algorithm settings described in the previous section. The Friedman test is used to detect significant differences among a group of results, and Holm’s post hoc test is adopted to examine whether the proposed algorithms are distinct in a comparison [41, 42]. The post hoc procedure allows us to know whether a hypothesis of comparison of means can be rejected at a specified level of significance $\alpha$. The adjusted p-value (APV) is also calculated to denote the lowest level of significance of a hypothesis that results in a rejection. Furthermore, we consider the average rankings of the algorithms to measure how good an algorithm is with respect to the others. The ranking is calculated by assigning a position to each algorithm depending on its performance on each data set. The algorithm that achieves the best performance on a specific data set is given rank 1 (value 1); the algorithm with the second-best result is assigned rank 2, and so forth. This task is conducted over all data sets, and finally an average ranking is calculated.
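The average ranking described above can be sketched as follows. The function name and the toy scores are hypothetical, and tied results are assumed to share the mean of the positions they occupy, which is the usual convention for Friedman-type rankings.

```python
def average_ranks(scores):
    """scores[d][a] is the performance of algorithm a on data set d (higher is better)."""
    n_alg = len(scores[0])
    totals = [0.0] * n_alg
    for row in scores:
        order = sorted(range(n_alg), key=lambda a: -row[a])   # best algorithm first
        pos = 0
        while pos < n_alg:
            end = pos
            while end + 1 < n_alg and row[order[end + 1]] == row[order[pos]]:
                end += 1                      # extend over a group of tied results
            shared = (pos + end) / 2.0 + 1.0  # mean of the tied positions, ranks start at 1
            for a in order[pos:end + 1]:
                totals[a] += shared
            pos = end + 1
    return [t / len(scores) for t in totals]

# Toy usage: three algorithms on three data sets, with one tie on the last data set.
ranks = average_ranks([[0.7, 0.6, 0.5], [0.8, 0.9, 0.5], [0.6, 0.6, 0.4]])
```

A lower average rank indicates a better algorithm; these averages are the quantities on which the Friedman statistic and the subsequent post hoc comparisons are based.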

Table 5 first reports the average rankings and APVs. FWELM-INTRA obtains the lowest average ranking value, indicating that it is the best among all algorithms. FWELM-INTER ranks second and is only slightly worse than FWELM-INTRA. In addition, we observe that the APVs of ELM, ELM-RUS, and WELM2 are lower than the standard significance level of 0.05, which means that these three algorithms are significantly different from FWELM-INTRA. However, we cannot say that FWELM-INTRA is significantly different from FWELM-INTER, ELM-ROS, ELM-SMOTE, WELM1, and DNC, although it has a lower ranking value. Meanwhile, FWELM-INTRA is less complex than FWELM-INTER. Therefore, we recommend FWELM-INTRA as an efficient choice for learning from SDP data sets.

Next, this paper selects FWELM-INTRA as the baseline for one-versus-one comparisons. Specifically, we calculated the average percentage of performance improvement and counted the win/same/loss numbers based on the pairwise t-test at significance level throughout all data sets. The results are presented in Table 6. They show that the proposed FWELM-INTRA algorithm improves classification performance compared with the other algorithms, to varying degrees (2.84%–41.76%). In addition, FWELM-INTRA is better than all other algorithms on at least 8 data sets, indicating its superiority. Compared with FWELM-INTER, however, its average performance increases by only 0.59%. Therefore, the two proposed algorithms perform very similarly.

Furthermore, effect size is also computed, since it emphasises the practical size of a difference [43]. We use Cliff's delta [44], a nonparametric effect size measure that quantifies the amount of difference between two groups. In our context, Cliff's delta ($\delta$) is computed to compare FWELM-INTRA with the other approaches. Its magnitude is interpreted as follows: $|\delta| < 0.147$ is "negligible," $0.147 \le |\delta| < 0.33$ is "small," $0.33 \le |\delta| < 0.474$ is "medium," and otherwise it is "large." Table 7 presents Cliff's delta values for FWELM-INTRA compared with the other approaches in terms of G-mean values. We observe that the effect size for FWELM-INTRA compared with ELM and WELM2 is large on all data sets, whereas the effect size for FWELM-INTRA compared with ELM-RUS, ELM-ROS, ELM-SMOTE, WELM1, and DNC fluctuates across data sets. Moreover, the effect size for FWELM-INTRA compared with FWELM-INTER is essentially negligible. Therefore, readers are advised to choose an appropriate algorithm according to the characteristics of their practical applications.
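As an illustration of the effect size computation, the sketch below implements Cliff's delta from its definition, $\delta = (\#\{x > y\} - \#\{x < y\})/(n_x n_y)$, together with the magnitude thresholds 0.147, 0.33, and 0.474. The function names are illustrative, and the quadratic-time loop is chosen for clarity rather than speed.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta between two samples, in the range [-1, 1].

    Positive values mean that values in xs tend to exceed values in ys.
    """
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))


def magnitude(delta):
    """Map |delta| to the conventional qualitative label."""
    d = abs(delta)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"
```

Here `xs` would hold the G-mean values of FWELM-INTRA over repeated runs on one data set and `ys` those of a competing approach; how many runs were used per data set is not restated here.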

4.5. RQ2: How Does the K Value Impact Performance?
4.5.1. Motivation

In the two proposed algorithms, an appropriate K must be selected for the relative density calculation. If K is too small, the algorithms fail to identify noise and outliers. If K is too large, some distinctions between instances are not captured. Therefore, it is necessary to know whether there is a fixed K that can be applied to all data sets.

4.5.2. Approach

To examine how the value of K used in the relative density calculation affects classification performance, we choose the parameter K based on the number of instances. Figure 4 plots the variation of G-mean with the parameter K for all ten data sets under the two proposed algorithms.

4.5.3. Result

In Figure 4, the values 1–5 on the X axis denote , respectively. We observe that, although there are some fluctuations, the performance shows a rising trend as K increases in the initial phase on most data sets. It then reaches a peak and subsequently decreases. That is, when K is too small or too large, the performance of the proposed algorithms deteriorates. An excessively low K might assign overly large weights to some noise and outliers, while an excessively high K might make the weights of instances in the same class converge. Although the optimal setting of K might differ between the two proposed algorithms, Figure 4 still offers useful guidance; that is, data sets can choose K between and , and the performance of the proposed algorithms can be safely maintained. This also illustrates that the parameter K can be set conservatively and empirically in the experiments. Users are encouraged to choose an appropriate value for K according to their specific situation.

4.6. RQ3: How About Time-Complexity of the Two Proposed Algorithms?
4.6.1. Motivation

Setting aside the time consumed by training the WELM classifier, the computational complexity of the two proposed algorithms is determined by the distance calculation, the neighbor search, and the calculation of the fuzzy membership values. We therefore analyse the time complexity and relate it to the observed running time.

4.6.2. Approach

For a data set with n instances, where each instance holds a attributes, we first note that calculating distances will take time, sorting will take time, and calculating and assigning fuzzy membership values will require constant time. It follows that FWELM-INTER requires more training time than FWELM-INTRA, as the density calculation is run twice.
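To make the sources of cost concrete, the following is a rough brute-force sketch of a K-nearest-neighbor density proxy. It is not the authors' implementation: it assumes Euclidean distance and takes the reciprocal of the distance to the K-th nearest neighbor as the density value, which is one common KNN-PDE-style choice, and the function names are hypothetical. The cost remarks in the comments describe this sketch only.

```python
import math


def kth_neighbor_distance(points, k):
    """Distance from each point to its k-th nearest neighbor (brute force).

    For n points with a attributes, this sketch computes n * (n - 1)
    distances, each costing a operations, and sorts n - 1 values per point.
    """
    n = len(points)
    result = []
    for i in range(n):
        dists = sorted(
            math.dist(points[i], points[j]) for j in range(n) if j != i
        )
        result.append(dists[k - 1])
    return result


def density_proxy(points, k):
    """Reciprocal of the k-th neighbor distance; larger means denser."""
    return [
        1.0 / d if d > 0 else float("inf")
        for d in kth_neighbor_distance(points, k)
    ]
```

In this sketch an isolated point receives a small density value, which is the property a fuzzy membership function can exploit to down-weight noise and outliers. Running the density step once within a class and once across classes doubles the work, which is consistent with FWELM-INTER being slower than FWELM-INTRA.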

For the two algorithms, the computational complexity depends on the calculation of the relative density and the fuzzy membership value for each training instance. Taking jm1 as an example, this paper tracked how the running time varies with the number of training instances n and the number of attributes a, respectively. When n is varied, a is fixed at 2; when a is varied, n is fixed at 100. The values of n are 100, 500, 1000, 2000, and 10000, and the values of a are 1, 5, 10, 15, and 20. Figure 5 presents the variation of running time with n and a, respectively.

4.6.3. Result

In Figure 5, the running time of the two algorithms is approximately linear with respect to the number of attributes a but exponential with respect to the number of training instances n. In addition, FWELM-INTER always costs more running time than FWELM-INTRA, which is consistent with the theoretical analysis above. The computation of relative density is time-consuming, and the proposed algorithms sacrifice running time efficiency to improve the learning model.

4.7. Threats to Validity

In this section, the threats to validity are described from internal and external aspects.

Threats to internal validity concern the selection of data sets and methods. The research selected NASA data sets that have been commonly used in software defect prediction. These data sets come from software engineering projects and have different metrics. However, evaluating the proposed approach on a larger set of practical projects is always desirable. Meanwhile, for the NASA data sets, the parameter settings of the proposed algorithms were derived from the experimental results, so the parameters may need to be reselected for other data sets. Thus, the proposed method should be verified on more data sets.

Threats to external validity concern the possibility of generalizing the experimental results and the comparisons with related algorithms. The proposed approach only requires that the metrics can be computed for the data sets and that the data sets are available. In our experiment, the selected data sets are open source and provide these metrics. Meanwhile, we compare against only some of the state-of-the-art methods. Moreover, it is necessary to seek support from more software developers in companies to evaluate our approach and to discuss the causal relationship between software defects and software development. Further studies using different data sets, analysing metrics in detail, and comparing more algorithms may prove fruitful.

5. Conclusions

SDP aims at finding as many defective software modules as possible without hurting the overall performance of the constructed predictor. Although many SDP models have been proposed in previous works, the imbalanced distribution between classes is still not handled well. Drawing on the experience of previous works, this paper improved WELM to address the imbalance problem of SDP data sets. Firstly, a novel measure called relative density was proposed to estimate the significance of each instance in feature space. Then, fuzzy membership functions were designed to replace the weights of WELM. Next, the FWELM-INTRA and FWELM-INTER algorithms were created to train SDP models. Finally, the algorithms were evaluated on ten real-world SDP data sets, and three performance measures were considered: G-mean, AUC, and Balance.

To demonstrate the effectiveness and efficiency of our methods in tackling imbalanced SDP problems, this paper compared them with six class imbalance learning methods related to ELM (ELM, ELM-RUS, ELM-ROS, ELM-SMOTE, WELM1, and WELM2) and with DNC, which has been shown to perform better than the two top-ranked predictors (Naive Bayes and Random Forest) in the SDP literature. From the results obtained, the proposed algorithms yield better classification results in the context of SDP, with statistically significant gains in G-mean over ELM, ELM-RUS, and WELM2.

In summary, the proposed algorithms are more robust, as they adapt to different data distribution types, estimate the significance of each instance more accurately, and assign identical total fuzzy coefficients to the two classes regardless of data scale. They also have drawbacks: the K-nearest-neighbors calculation induces quite high time complexity, and the selection of the parameter K might affect the quality of the classification model to some extent.

In future work, it will be interesting to develop more efficient cost-sensitive learning algorithms with low time complexity. In addition, we will investigate how to adapt the proposed methods to address the cross-project SDP problem effectively.

Data Availability

The data and code used to support this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the Scientific Research Foundation for the introduction of talent of Jiangsu University of Science and Technology, China; Natural Science Foundation of the Higher Education Institutions of Jiangsu Province, China (Grant no. 18JKB520011); Primary Research and Development Plan (Social Development) of Zhenjiang City, China (Grant no. SH2019021); Natural Science Foundation of Jiangsu Province, China (Grant no. BK20191457); and high tech ship research project of Ministry of Industry and Information Technology.