Abstract

Cross-project defect prediction (CPDP) for projects with limited historical data has attracted much attention. To the best of our knowledge, however, the performance of existing approaches is usually poor because of low-quality cross-project training data. The objective of this study is to propose an improved CPDP method that simplifies training data, called TDSelector, which considers both the similarity and the number of defects of each training instance (denoted by defects), and to demonstrate its effectiveness. Our work consists of three main steps. First, we constructed TDSelector as a linear weighted function of an instance's similarity and defects. Second, the basic defect predictor used in our experiments was built with the Logistic Regression classification algorithm. Third, we analyzed the impact of different combinations of similarity and normalization of defects on prediction performance and then compared TDSelector with two existing methods. We evaluated our method on 14 projects collected from two public repositories. The results suggest that the proposed TDSelector method performs, on average, better than both baseline methods, with AUC values increased by up to 10.6% and 4.3%, respectively. That is, the inclusion of defects is indeed helpful for selecting high-quality training instances for CPDP. In addition, the combination of Euclidean distance and linear normalization is the preferred configuration for TDSelector. An additional experiment also shows that directly selecting instances with more bugs as training data can further improve the performance of the bug predictor trained by our method.

1. Introduction

Software defect prediction is one of the most active research topics in Software Engineering. Most early studies trained predictors (also known as prediction models) on historical defect/bug data from the same software project and predicted defects in its upcoming releases [1]. This approach is referred to as Within-Project Defect Prediction (WPDP). However, WPDP has an obvious drawback when a project has limited historical defect data.

To address the above issue, researchers in this field have attempted to apply defect predictors built for one project to other projects [27]. This method is termed Cross-Project Defect Prediction (CPDP). The main purpose of CPDP is to predict defect-prone instances (such as classes) in a project based on defect data collected from other projects in public software repositories such as PROMISE (http://openscience.us/repo/). The feasibility and potential usefulness of cross-project predictors built with a number of software metrics have been validated [1, 3, 5, 6], but how to improve the performance of CPDP models is still an open issue.

Peters et al. [5] argued that selecting appropriate training data from a software repository has become a major issue for CPDP. Moreover, some researchers have suggested that the success rate of CPDP models can be drastically improved when using a suitable training dataset [1, 7]. That is to say, the selection of high-quality training data could be a key breakthrough on the above issue. Thus, the construction of an appropriate training dataset, gathered from a large number of projects in public software repositories, is indeed a challenge for CPDP [7].

As far as we know, although previous studies on CPDP have taken different types of software metrics into account when selecting relevant training samples, none of them considered the number of defects contained in each sample (denoted by defects). We argue that this is also an important factor to consider. Fortunately, some studies have empirically demonstrated the relevance of defects to prediction. For example, “modules with faults in the past are likely to have faults in the future” [8], “17% to 54% of the high-fault files of release i are still high-fault in release i + 1” [9], “cover 73%–95% of faults by selecting 10% of the most fault prone source code file” [10], and “the number of defects found in the previous release of a file correlates with its current defect count on a high level” [11].

Does the selection of training data considering defects improve the performance of CPDP models? If the answer is “Yes”, on the one hand, it is helpful to validate the feasibility of CPDP; on the other hand, it will contribute to better software defect predictors by making full use of those defect datasets available on the Internet.

The objective of our work is to propose an improved method of training data selection for CPDP by introducing the information of defects. Unlike the prior studies similar to our work, such as [5, 12], which focus mainly on the similarity between instances from training set and test set, this paper gives a comprehensive account of two factors, namely, similarity and defects. Moreover, the proposed method, called TDSelector, can automatically optimize their weights to achieve the best result. In brief, our main contributions to the current state of research on CPDP are summarized as follows.

(1) Considering both similarity and defects, we proposed a simple and easy-to-use training data selection method for CPDP (i.e., TDSelector), which is based on an improved scoring scheme that ranks all possible training instances. In particular, we designed an algorithm to calculate their weights automatically, so as to obtain the best prediction result.

(2) To validate the effectiveness of our method, we conducted an elaborate empirical study based on 15 datasets collected from PROMISE and AEEEM (http://bug.inf.usi.ch), and the experimental results show that, in a specific CPDP scenario (i.e., many-to-one [13]), the TDSelector-based defect predictor outperforms its rivals that were built with two competing methods in terms of prediction precision.

With these technical contributions, our study could complement previous work on CPDP with respect to training data selection. In particular, we provide a reasonable scoring scheme as well as a more comprehensive guideline for developers to choose appropriate training data to train a defect predictor in practice.

The rest of this paper is organized as follows. Section 2 reviews the related work on this topic; Section 3 presents the preliminaries of our work; Section 4 describes the proposed method TDSelector; Section 5 introduces our experimental setup; and Section 6 shows the primary experimental results. A detailed discussion of several issues, including potential threats to the validity of our study, is presented in Section 7. Finally, Section 8 summarizes this paper and presents our future work.

2. Related Work

2.1. Cross-Project Defect Prediction

Many studies have been carried out to validate the feasibility of CPDP in the last five years. For example, Turhan et al. [12] proposed a cross-company defect prediction approach that uses defect data from other companies to build predictors for target projects. They found that the proposed method increased the probability of defect detection at the cost of an increased false positive rate. Ni et al. [14] proposed a novel method called FeSCH and designed three ranking strategies to choose appropriate features. The experimental results show that FeSCH can outperform WPDP, ALL, and TCA+ in most cases, and its performance is independent of the classifiers used. He et al. [15] compared the performance of CPDP and WPDP using feature selection techniques. The results indicated that for reduced training data WPDP obtained higher precision, but CPDP in turn achieved better recall or F-measure. Some researchers have also studied the performance of CPDP based on ensemble classifiers and validated their effects on this issue [16, 17].

Ryu et al. [18] proposed a transfer cost-sensitive boosting method by considering both distributional characteristics and the class imbalance for CPDP. The results show that their method significantly improves CPDP performance. They also [19] proposed a multiobjective naive Bayes learning technique under CPDP environments by taking into account the class-imbalance contexts. The results indicated that their approaches performed better than the single-objective ones and WPDP models. Li et al. [20] compared some famous data filters and proposed a method called HSBF (hierarchical select-based filter) to improve the performance of CPDP. The results demonstrate that the data filter strategy can indeed improve the performance of CPDP significantly. Moreover, when using appropriate data filter strategy, the defect predictor built from cross-project data can outperform the predictor learned by using within-project data.

Zhang et al. [21] proposed a universal CPDP model, which was built using a large number of projects collected from SourceForge (https://sourceforge.net/) and Google Code (https://code.google.com/). Their experimental results showed that it was indeed comparable to WPDP. Furthermore, CPDP is feasible for different projects that have heterogeneous metric sets. He et al. [22] first proposed a CPDP-IFS approach based on the distribution characteristics of both source and target projects to overcome this problem. Nam and Kim [23] then proposed an improved method called HDP, where metric selection and metric matching were introduced to build a defect predictor. Their empirical study on 28 projects showed that about 68% of predictions using the proposed approach outperformed or were comparable to WPDP with statistical significance. Jing et al. [24] proposed a unified metric representation (UMR) for heterogeneous defect data, and their experiments on 14 public heterogeneous datasets from four different companies indicated that the proposed approach was more effective in addressing the problem. More research can be found in [25–27].

2.2. Training Data Selection for CPDP

As mentioned in [5, 28], a fundamental issue for CPDP is to select the most appropriate training data for building quality defect predictors. He et al. [29] discussed this problem in detail from the perspective of data granularity, i.e., release level and instance level. They presented a two-step method for training data selection. The results indicated that the predictor built based on naive Bayes could achieve fairly good performance when using the method together with the Peter filter [5]. Porto and Simao [30] proposed an Instance Filtering method that selects the most similar instances from the training dataset, and their experimental results on 36 versions of 11 open-source projects show that the defect predictor built from cross-project data selected by Feature Selection and Instance Filtering generally performs better in both classification and ranking.

With regard to the data imbalance problem of defect datasets, Jing et al. [31] introduced an effective feature learning method called SDA to provide solutions for class-imbalance problems of both the within-project and cross-project types, by employing the semisupervised transfer component analysis (SSTCA) method to make the distributions of source and target data consistent. The results indicated that their method greatly improved both WPDP and CPDP performance. Ryu et al. [32] proposed a method of hybrid instance selection using nearest neighbor (HISNN). Their results suggested that instances with strong local knowledge could be identified via nearest neighbors with the same class label. Poon et al. [33] proposed a credibility theory based naive Bayes (CNB) classifier to establish a novel reweighting mechanism between source projects and target projects, so that the source data could simultaneously adapt to the target data distribution and retain its own pattern. The experimental results demonstrate that CNB achieves significant improvements over other CPDP approaches in terms of the performance metrics considered.

The existing studies mentioned above aim to reduce the gap in prediction performance between WPDP and CPDP. Although they have made progress towards this goal, there is clearly still room for improvement. For this reason, in this paper we propose a training data selection approach based on an improved instance-ranking strategy, rather than the single similarity-based strategy used in many prior studies [1, 5, 7, 12].

3. Preliminaries

In our context, a defect dataset D contains n instances, represented as D = {I_1, I_2, ..., I_n}. An instance I_i is an object class represented as a vector of metric values I_i = (f_{i1}, f_{i2}, ..., f_{im}), where f_{ij} is the jth metric value of instance I_i and m is the number of metrics (also known as features). Given a source dataset D_S and a target dataset D_T, CPDP aims to perform a prediction in D_T using the knowledge extracted from D_S, where D_S ≠ D_T (see Figure 1(a)). In this paper, source and target datasets have the same set of metrics, and they may differ in the distributional characteristics of metric values.

To improve the performance of CPDP, several strategies for selecting appropriate training data have been put forward (see Figure 1(b)); e.g., Turhan et al. [12] filtered out irrelevant training instances by returning the k-nearest neighbors of each test instance.

3.1. An Example of Training Data Selection

First, we introduce a typical method for training data selection at the instance level, and a simple example is used to illustrate it. For selection strategies at other levels of granularity, such as the release level, please refer to [7].

Figure 2 shows a training set (including five instances) and a test set (including one instance). Here, each instance contains five metrics and a classification label (i.e., 0 or 1). An instance is defect-free (label = 0) only if its defects equal 0; otherwise, it is defective (label = 1). According to the k-nearest neighbor method based on Euclidean distance, we can rank all five training instances in terms of their distances from the test instance. Because three of the training instances share the same nearest distance from the test instance, all three are suitable for use as training instances when k is set to 1. Among these three instances, two have identical metric values, but one of the two is labeled as defective because it contains a bug. In this case, the defect-free instance will be selected with the same probability as the defective one, regardless of the number of defects they include.
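To make the tie described above concrete, the following Python sketch ranks a few hypothetical training instances by Euclidean distance; the metric values and defect counts are illustrative and are not the actual values shown in Figure 2.

import numpy as np

# Hypothetical metric vectors (five metrics each) and defect counts; not the values in Figure 2.
test_instance = np.array([2.0, 1.0, 3.0, 0.0, 1.0])
training_set = {
    "A": (np.array([2.0, 1.0, 3.0, 0.0, 2.0]), 0),  # (metrics, number of defects)
    "B": (np.array([2.0, 1.0, 3.0, 0.0, 0.0]), 0),  # defect-free
    "C": (np.array([2.0, 1.0, 3.0, 0.0, 0.0]), 1),  # same metrics as B, but defective
    "D": (np.array([9.0, 4.0, 1.0, 2.0, 5.0]), 3),
    "E": (np.array([1.0, 1.0, 2.0, 0.0, 1.0]), 0),
}

# Rank training instances by Euclidean distance to the test instance.
ranked = sorted(training_set.items(),
                key=lambda item: np.linalg.norm(item[1][0] - test_instance))
for name, (metrics, defects) in ranked:
    print(name, round(float(np.linalg.norm(metrics - test_instance)), 3), defects)
# A, B, and C are tied at the same distance; B and C even share identical metrics,
# yet a plain distance ranking cannot distinguish the defect-free B from the defective C.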

In this way, the training instances most relevant to the test instance can be quickly determined. Clearly, the goal of training data selection is to preserve as many representative training instances as possible.

3.2. General Process of Training Data Selection

Before presenting our approach, we describe a general selection process of training data, which consists of three main steps: TDS (training dataset) setup, ranking, and duplicate removal.

TDS Setup. For each target project with little historical data, we need to set up an initial TDS where training data are collected from other projects. To simulate this scenario of CPDP, in this paper, any defect data from the target project must be excluded from the initial TDS. Note that different release versions of a project actually belong to the same project. A simple example is visualized in Figure 3.

Ranking. Once the initial TDS is determined, each instance is treated as a metric vector, as mentioned above. For each test instance, one can calculate its relevance to each training instance and then rank these training instances in terms of their similarity based on software metrics. Note that a wide variety of software metrics, such as source code metrics, process metrics, previous defects, and code churn, have been used as features for CPDP approaches to improve their prediction performance.

Duplicate Removal. Let m be the size of the test set. For each test instance, if we select its k-nearest neighbors from the initial TDS, there are a total of k × m candidate training instances. Considering that these selected instances may not be unique (i.e., a training instance can be the nearest neighbor of multiple test instances), after removing the duplicates, they form the final training set, which is a subset of the initial TDS.
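A minimal sketch of the ranking and duplicate-removal steps, assuming each instance is a numeric metric vector and Euclidean distance is used as the similarity measure, might look as follows (the function name is ours, not from the original implementation).

import numpy as np

def select_training_data(initial_tds, test_set, k=10):
    """Pick the k nearest training instances for every test instance,
    then merge the selections and drop duplicates to form the final TDS."""
    selected = set()
    for test_vec in test_set:
        # Rank all candidate training instances by Euclidean distance.
        distances = [np.linalg.norm(np.asarray(train_vec) - np.asarray(test_vec))
                     for train_vec in initial_tds]
        nearest = np.argsort(distances)[:k]        # indices of the k nearest neighbors
        selected.update(int(i) for i in nearest)   # duplicates collapse automatically
    return [initial_tds[i] for i in sorted(selected)]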

4. Our Approach TDSelector

To improve the prediction performance of CPDP, we leverage the following observations.

Similar Instances. Given a test instance, we can examine its similar training instances that were labeled before. The defect proneness shared by similar training instances could help us identify the probability that a test instance is defective. Intuitively, two instances are more likely to have the same state if their metric values are very similar.

Number of Defects (defects). During the selection process, when several training instances have the same distance from a test instance, we need to determine which one should be ranked higher. According to our experiences in software defect prediction and other researchers’ studies on the quantitative analysis of previous defect prediction approaches [34, 35], we believe that more attention should be paid to those training instances with more defects in practice.

The selection of training data based on instance similarity has been used in some prior studies [5, 12, 35]. However, to the best of our knowledge, the information about defects has not been fully utilized. So, in this paper, we attempt to propose a training data selection approach combining such information and instance similarity.

4.1. Overall Structure of TDSelector

Figure 3 shows the overall structure of the proposed approach to training data selection, named TDSelector. Before selecting appropriate training data for CPDP, we have to set up a test set and its corresponding initial TDS. For a given project treated as the test set, all the other projects available at hand are used as the initial TDS. This is the so-called many-to-one (M2O) scenario for CPDP [13]. It is quite different from the typical one-to-one (O2O) scenario, where only one randomly selected project is treated as the training set for a given target project (namely, the test set).

When both of the sets are given, the ranks of training instances are calculated based on the similarity of software metrics and then returned for each test instance. For the initial TDS, we also collect each training instance’s defects and thus rank these instances by their defects. Then, we rate each training instance by combining the two types of ranks in some way and identify the top-k training instances for each test instance according to their final scores. Finally, we use the predictor trained with the final TDS to predict defect proneness in the test set. We describe the core component of TDSelector, namely, scoring scheme, in the following subsection.

4.2. Scoring Scheme

Each instance in the training set and the test set is treated as a vector of features (namely, software metrics). We calculate the similarity between a training instance and a test instance in terms of a similarity index (such as cosine similarity, Euclidean distance, or Manhattan distance, as shown in Table 1). Training instances are then ranked by their similarity to a given test instance.

For instance, the cosine similarity between a training instance I_p and the target instance I_q is computed via their vector representations, described as follows:

\[ \mathrm{Sim}(I_p, I_q) = \cos(I_p, I_q) = \frac{\sum_{i=1}^{m} f_{pi} \times f_{qi}}{\sqrt{\sum_{i=1}^{m} f_{pi}^{2}} \times \sqrt{\sum_{i=1}^{m} f_{qi}^{2}}}, \tag{1} \]

where I_p and I_q are the metric vectors of the training instance and the target instance, respectively, and f_{pi} represents the ith metric value of instance I_p.

Additionally, for each training instance, we also consider the factor defects in order to further refine the ranking of candidate instances. The assumption here is that the more previous defects an instance has, the richer the information it carries. We therefore propose a scoring scheme to rank candidate training instances, defined as follows:

\[ \mathrm{Score}(I_p) = \alpha \times \mathrm{Sim}(I_p, I_q) + (1 - \alpha) \times N(d_p), \tag{2} \]

where d_p represents the defects of I_p, α is a weighting factor that is learned from training data using Algorithm 1 (see Algorithm 1), and N is a function used to normalize defects to values ranging from 0 to 1.

Algorithm 1: Optimizing the parameter α.
Input:
(1) Candidate TDS T, test set S,
(2) the number of nearest neighbors k, and the step size of α (0.1)
Output:
(3) the value of α that maximizes AUC
Method:
(4) Initialize α = 0;
(5) While α ≤ 1 do
(6)   For each test instance s in S
(7)     For each training instance t in T
(8)       Compute Score(t, s) according to (2);
(9)     End For
(10)    descSort(T); // sort training instances in descending order of score
(11)    Select the top k instances for s;
(12)  End For
(13)  Evaluate the prediction result (AUC) obtained with the selected training data;
(14)  α = α + 0.1;
(15) End While
(16) Return the α with the maximum AUC;
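To make the scoring scheme and Algorithm 1 concrete, the following Python sketch gives one possible reading of the procedure under the reconstructed form of (2); the inverse-distance similarity, the min-max normalization of defects, the helper names, and the use of scikit-learn's Logistic Regression (rather than the Weka classifier used in our experiments) are all assumptions made for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def score(sim, norm_defects, alpha):
    # Reconstructed form of (2): a linear weighted sum of similarity and normalized defects.
    return alpha * sim + (1.0 - alpha) * norm_defects

def optimize_alpha(train_X, train_y, train_defects, test_X, test_y, k=10, step=0.1):
    """Grid search over alpha in [0, 1], following the loop structure of Algorithm 1."""
    d = train_defects.astype(float)
    norm_d = (d - d.min()) / (d.max() - d.min() + 1e-12)   # linear (min-max) normalization

    best_alpha, best_auc = 0.0, -1.0
    for alpha in np.arange(0.0, 1.0 + 1e-9, step):
        selected = set()
        for x in test_X:
            dist = np.linalg.norm(train_X - x, axis=1)      # Euclidean distance
            sim = 1.0 / (1.0 + dist)                        # one possible similarity choice
            scores = score(sim, norm_d, alpha)
            selected.update(int(i) for i in np.argsort(scores)[::-1][:k])  # top-k by score
        idx = sorted(selected)
        clf = LogisticRegression(max_iter=1000).fit(train_X[idx], train_y[idx])
        auc = roc_auc_score(test_y, clf.predict_proba(test_X)[:, 1])
        if auc > best_auc:
            best_alpha, best_auc = alpha, auc
    return best_alpha, best_auc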

Normalization is a commonly used data preprocessing technique in mathematics and computer science [36]. Graf and Borer [37] confirmed that normalization can improve the prediction performance of classification models. For this reason, we normalize the defects of training instances when using TDSelector. Among the many available normalization methods, we introduce five typical ones used in machine learning [36, 38]. The descriptions and formulas of the five normalization methods are listed in Table 1.
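As an illustration, two of the normalization options mentioned above (linear min-max scaling and logistic scaling) could be written as follows; since the exact formulas of Table 1 are not reproduced here, these are standard textbook forms rather than the authors' definitions.

import numpy as np

def linear_normalize(d):
    """Min-max scaling of defect counts into [0, 1]."""
    d = np.asarray(d, dtype=float)
    return (d - d.min()) / (d.max() - d.min() + 1e-12)

def logistic_normalize(d):
    """Logistic (sigmoid) scaling; maps nonnegative counts into (0, 1)."""
    d = np.asarray(d, dtype=float)
    return 1.0 / (1.0 + np.exp(-d))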

For each test instance, the top-k training instances ranked in terms of their scores are returned. Hence, the final TDS is formed by merging the sets of top-k training instances for all test instances and removing the duplicate instances.

5. Experimental Setup

5.1. Research Questions

Our experiments were conducted to find empirical evidence that answers the following three research questions.

RQ1: Does the Consideration of Defects Improve the Performance of CPDP? Unlike the previous methods [1, 5, 7, 12, 29], TDSelector ranks candidate training instances in terms of both defects and metric-based similarity. To evaluate the effectiveness of the proposed method considering the additional information of defects, we tested TDSelector according to the experimental data described in Section 5.2. According to (2), we also empirically analyzed the impact of the parameter α on prediction results.

RQ2: Which Combination of Similarity and Normalization Is More Suitable for TDSelector? Equation (2) is composed of two parts, namely, similarity and the normalization of defects. For each part, several commonly used methods can be adopted in our context. To take full advantage of TDSelector, one may wonder which combination of similarity and normalization should be chosen. Therefore, it is necessary to compare the effects of different combinations of similarity and normalization methods on prediction results and to determine the best one for TDSelector.

RQ3: Can TDSelector-Based CPDP Outperform the Baseline Methods? Cross-project prediction has attracted much research interest in recent years, and a few CPDP approaches using training data selection have also been proposed, e.g., Peter filter based CPDP [5] (labeled as baseline1) and TCA+ (Transfer Component Analysis) based CPDP [39] (labeled as baseline2). To answer the third question, we compared TDSelector-based CPDP proposed in this paper with the above two state-of-the-art methods.

5.2. Data Collection

To evaluate the effectiveness of TDSelector, in this paper we used 14 open-source projects written in Java, hosted on two online public software repositories, namely, PROMISE [40] and AEEEM [41]. The data statistics of the 14 projects in question are presented in Table 2, where #Instance and #Defect are the numbers of instances and defective instances, respectively, and %Defect is the proportion of defective instances to the total number of instances. Each instance in these projects represents a class file and consists of two parts, namely, software metrics and defects.

The first repository, PROMISE, was collected by Jureczko and Spinellis [40]. The information of defects and 20 source code metrics for the projects on PROMISE have been validated and used in several previous studies [1, 7, 12, 29]. The second repository, AEEEM, was collected by D’Ambros et al. [41], and each project on it has 76 metrics, including 17 source code metrics, 15 change metrics, 5 previous defect metrics, 5 entropy-of-change metrics, 17 entropy-of-source-code metrics, and 17 churn-of-source-code metrics. AEEEM has been successfully used in [23, 39].

Before performing a cross-project prediction, we need to determine a target dataset (test set) and its candidate TDS. For PROMISE (10 projects), each of the 10 projects was selected as the target dataset once, and we then set up a candidate TDS for CPDP that excluded any data from the target project. For instance, if Ivy is selected as the test project, data from the other nine projects are used to construct its initial TDS.

5.3. Experiment Design

To answer the three research questions, our experimental procedure, which is designed under the context of M2O in the CPDP scenario, is described as follows.

First, as in many prior studies [1, 5, 15, 35], all software metric values in the training and test sets were normalized by using the z-score method, because these metrics differ in their numerical scales. Because the projects on AEEEM and PROMISE have different numbers of software metrics, the training set for a given test set was selected from the same repository.

Second, to examine whether the consideration of defects improves the performance of CPDP, we compared our approach TDSelector with NoD, a baseline method that considers only the similarity between instances, i.e., the case where the score in (2) reduces to the similarity term. Since three similarity computation methods are used in this paper, we designed three different TDSelectors and their corresponding baseline methods based on these similarity indexes. The prediction results of each method in question for the 15 test sets were analyzed in terms of mean value and standard deviation. More specifically, we also used Cliff's delta (δ) [42], a nonparametric effect size measure of how often the values in one distribution are larger than the values in a second distribution, to compare the results generated by our approach and its corresponding baseline method.

Because Cliff did not suggest corresponding δ values to represent small, medium, and large effects, we converted Cohen's effect size to Cliff's δ using the cohd2delta R package (https://rdrr.io/cran/orddom/man/cohd2delta.html). Note that Table 3 lists the descriptors for effect size magnitudes up to 2.0.
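For illustration, Cliff's delta can also be computed directly from two groups of AUC values; the following minimal sketch is ours and is not the orddom-based conversion mentioned above.

def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs drawn from the two samples."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

# Example usage (hypothetical lists of per-test-set AUC values):
# delta = cliffs_delta(auc_tdselector, auc_baseline)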

Third, according to the results of the second step of this procedure, 15 combinations based on three typical similarity methods for software metrics and five commonly used normalization functions for defects were examined by the pairwise comparison method. We then determined which combination is more suitable for our approach according to mean, standard deviation, and Cliff’s delta effect size.

Fourth, to further validate the effectiveness of the TDSelector-based CPDP predictor, we conducted cross-project predictions for all the 15 test sets using TDSelector and two competing methods (i.e., baseline1 and baseline2 introduced in Section 5.1). Note that the TDSelector used in this experiment was built with the best combination of similarity and normalization.

After this process is completed, we will discuss the answers to the three research questions of our study.

5.4. Classifier and Evaluation Measure

As the underlying machine learning classifier for CPDP, Logistic Regression (LR), which has been widely used in many defect prediction studies [4, 23, 39, 43–46], is also used in this study. All LR classifiers were implemented with Weka (https://www.cs.waikato.ac.nz/ml/weka/). For our experiments, we used the default parameter settings for LR specified in Weka unless otherwise stated.

To evaluate the prediction performance of different methods, in this paper we utilized the area under the Receiver Operating Characteristic curve (AUC). AUC is equal to the probability that a classifier will rank a randomly chosen defective class higher than a randomly chosen defect-free one [47], and it is a useful measure for comparing different models. Compared with traditional accuracy measures, AUC is commonly used because it is unaffected by class imbalance and independent of the prediction threshold used to decide whether an instance should be classified as negative [6, 48, 49]. An AUC value of 0.5 indicates the performance of a random predictor, and higher AUC values indicate better prediction performance.
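Although all predictors in our experiments were built with Weka's Logistic Regression, an analogous evaluation pipeline can be sketched with scikit-learn for illustration; the z-score scaling step mirrors the preprocessing described in Section 5.3, and default LR parameters are assumed.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def evaluate_cpdp(train_X, train_y, test_X, test_y):
    """Train an LR-based defect predictor on the selected TDS and report AUC on the target project."""
    model = make_pipeline(StandardScaler(),               # z-score normalization of metrics
                          LogisticRegression(max_iter=1000))
    model.fit(train_X, train_y)
    return roc_auc_score(test_y, model.predict_proba(test_X)[:, 1])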

6. Experimental Results

6.1. Answer to RQ1

We compared our approach, which considers defects, with the baseline method NoD, which selects training data in terms of cosine similarity only. Table 5 shows that, on average, TDSelector does achieve an improvement in AUC value across the 15 test sets. The average growth rates of AUC value vary from 5.9% to 9.0% when different normalization methods for defects are utilized. In addition, all the Cliff's delta (δ) values in this table are greater than 0.2, which indicates that each group of 15 prediction results obtained by our approach improves on that of NoD with a non-negligible effect size. In other words, our approach outperforms NoD. In particular, for Jedit, Velocity, Eclipse, and Equinox, the improvements of our approach over NoD are substantial. For example, when using the linear normalization method, the AUC values for the four projects are increased by 30.6%, 43.0%, 22.6%, and 39.4%, respectively; moreover, the logistic normalization method for Velocity achieves the biggest improvement in AUC value (namely, 61.7%).

We then compared TDSelector with the baseline methods using other widely used similarity calculation methods, and the results obtained by using Euclidean distance and Manhattan distance to calculate the similarity between instances are presented in Tables 6 and 7. TDSelector, compared with the corresponding NoD, achieves average growth rates of AUC value that vary from 5.9% to 7.7% in Table 6 and from 2.7% to 6.9% in Table 7, respectively. More specifically, the highest growth rate of AUC value is 43.6% for Equinox in Table 6 and 39.7% for Lucene2 in Table 7. Besides, all Cliff's delta (δ) effect sizes in these two tables are also greater than 0.1. Hence, the results indicate that our approach can, on average, improve the performance of those baseline methods that do not take defects into account.

In short, during the process of training data selection, the consideration of defects for CPDP can help us to select higher quality training data, thus leading to better classification results.

6.2. Answer to RQ2

Although the inclusion of defects in the selection of high-quality training data is helpful for better CPDP performance, it is worth noting that our method completely failed on Mylyn and Pde when computing the similarity between instances in terms of Manhattan distance (see the corresponding maximum AUC values in Table 7). This implies that the success of TDSelector depends largely on a reasonable combination of similarity and normalization methods. So, which combination of similarity and normalization is more suitable for TDSelector?

First, we analyzed the two factors (i.e., similarity and normalization) separately. For example, we evaluated the difference among cosine similarity, Euclidean distance, and Manhattan distance, regardless of any normalization method used in the experiment. The results, expressed in terms of mean and standard deviation, are shown in Table 4, where they are grouped by factors.

If we do not take into account normalization, Euclidean distance achieves the maximum mean value 0.719 and the minimum standard deviation value 0.080 among the three similarity indexes, followed by cosine similarity. Therefore, Euclidean distance and cosine similarity are the first and second choices of our approach, respectively. On the other hand, if we do not take into account similarity index, the logistic normalization method seems to be the most suitable method for TDSelector, indicated by the maximum mean value 0.710 and the minimum standard deviation value 0.078, and it is followed by the linear normalization method.

Therefore, the logistic normalization method is the preferred way for TDSelector to normalize defects, while the linear normalization method is a possible alternative. It is worth noting that the fact that all Cliff's delta (δ) effect sizes in Table 4 are negative also supports this result. A simple guideline for choosing similarity indexes and normalization methods for TDSelector from these two separate perspectives is presented in Figure 4.

Then, we considered both factors together. According to the results in Tables 5, 6, and 7, grouped by similarity index, TDSelector obtains its best result when using “Euclidean + Linear” (short for Euclidean distance + linear normalization), “Cosine + Logistic” (short for cosine similarity + logistic normalization), and “Manhattan + Logistic” (short for Manhattan distance + logistic normalization), respectively. We also calculated the Cliff's delta (δ) effect size for every pair of the combinations under discussion. As shown in Table 8, according to the largest number of positive δ values in this table, the combination of Euclidean distance and the linear normalization method still outperforms the other 14 combinations.

6.3. Answer to RQ3

A comparison between our approach and the two baseline methods (i.e., baseline1 and baseline2) across the 15 test sets is presented in Table 9. It is obvious that our approach is, on average, better than the two baseline methods, as indicated by the average growth rates of AUC value (i.e., 10.6% and 4.3%) across the 15 test sets. TDSelector performs better than baseline1 on 14 out of 15 datasets, and it has an advantage over baseline2 on 10 out of 15 datasets. In particular, compared with baseline1 and baseline2, the highest growth rates of AUC value of our approach reach up to 65.2% and 64.7%, respectively, for Velocity. We also analyzed the possible reason by examining the defective instances in the simplified training datasets obtained by the different methods. Table 10 shows that the proportions of defective instances in the simplified training datasets are very close. However, when considering instances with more than one defect among these defective instances, our method returns more of them, at a ratio approximately twice as large as that of the baselines. Therefore, a possible explanation for the improvement is that the information about defects is more fully utilized owing to the instances with more defects. This result further validates that the selection of training data considering defects is valuable.

In addition, the negative δ values in this table also indicate that our approach outperforms the baseline methods from the perspective of distribution, though we have to admit that the effect size 0.009 is too small to be of interest in a particular application.

In summary, since the TDSelector-based defect predictor outperforms those based on the two state-of-the-art CPDP methods, our approach is beneficial for training data selection and can further improve the performance of CPDP models.

7. Discussion

7.1. Impact of Top-k on Prediction Results

The parameter k determines the number of nearest training instances selected for each test instance. Since k was set to 10 in our experiments, here we discuss the impact of k on the prediction results of our approach as its value is changed from 1 to 10 with a step size of 1. As shown in Figure 5, for the three combinations in question, selecting a smaller number of nearest training instances (i.e., k < 10) for each test instance in the 10 test sets from PROMISE does not lead to better prediction results, because their best results are obtained when k is equal to 10.

Interestingly, for the combinations of “Euclidean + Linear” and “Cosine + Linear”, a similar trend of AUC value changes is visible in Figure 6. For the five test sets from AEEEM, they achieve stable prediction results when k ranges from four to eight, and then they reach peak performance when k is equal to 10. The combination of “Manhattan + Logistic”, by contrast, achieves its best result when k is set to 7. Even so, this best result is still worse than those of the other two combinations.

7.2. Selecting Instances with More Bugs Directly as Training Data

Our experimental results have validated the impact of defects on the selection of high-quality training data in terms of AUC, and we also want to know whether directly selecting defective instances with more bugs as training instances, which simplifies the selection process and reduces computation cost, would achieve better prediction performance. The answer to this question is of particular concern for developers in practice.

According to Figure 7(a), for the 15 releases, most of them contain instances with no more than two bugs. On the other hand, the ratio of the instances that have more than three defects to the total instances is less than 1.40% (see Figure 7(b)). Therefore, we built a new TDSelector based on the number of bugs in each instance, which is referred to as TDSelector-3. That is to say, those defective instances that have at least three bugs were chosen directly from an initial TDS as training data, while the remaining instances in the TDS were selected in light of (2). All instances from the two parts then form the final TDS after removing redundant ones.
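A sketch of this variant (TDSelector-3), under the same assumptions as the earlier sketches, is given below; the threshold of three bugs comes from the text, while the fixed α value and the inverse-distance similarity are placeholders for the learned configuration.

import numpy as np

def tdselector_3(train_X, train_y, train_defects, test_X, k=10, bug_threshold=3, alpha=0.5):
    """Take instances with >= bug_threshold bugs directly, then apply the scoring of (2) to the rest."""
    direct_idx = np.where(train_defects >= bug_threshold)[0]
    rest_idx = np.where(train_defects < bug_threshold)[0]

    # Score the remaining candidates: weighted similarity plus normalized defects (see (2)).
    d = train_defects[rest_idx].astype(float)
    norm_d = (d - d.min()) / (d.max() - d.min() + 1e-12)
    selected = set(int(i) for i in direct_idx)
    for x in test_X:
        dist = np.linalg.norm(train_X[rest_idx] - x, axis=1)
        sim = 1.0 / (1.0 + dist)
        scores = alpha * sim + (1.0 - alpha) * norm_d
        top = rest_idx[np.argsort(scores)[::-1][:k]]
        selected.update(int(i) for i in top)
    idx = sorted(selected)            # final TDS after removing duplicates
    return train_X[idx], train_y[idx]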

Figure 8 shows that the results of the two methods differ from dataset to dataset. For Ivy and Xerces collected from PROMISE, TDSelector outperforms TDSelector-3 in all the three scenarios, but only slightly. On the contrary, for Lucene and Velocity from PROMISE, the incremental AUC values obtained by using TDSelector-3 with “Cosine + Linear” reach up to 0.109 and 0.235, respectively. As shown in Figure 8, on average, TDSelector-3 performs better than the corresponding TDSelector, and the average AUC values for “Cosine + Linear”, “Euclidean + Linear”, and “Manhattan + Logistic” are improved by up to 3.26%, 2.57%, and 1.42%, respectively. Therefore, the direct selection of defective instances that contain quite a few bugs can, overall, further improve the performance of the predictor trained by our approach. In other words, those valuable defective instances can be screened out quickly according to a threshold for the number of bugs in each training instance (namely, three in this paper) at the first stage. Our approach is then able to be applied to the remaining TDS. Note that the automatic optimization method for such a threshold for TDSelector will be investigated in our future work.

7.3. Threats to Validity

In this study, we obtained several interesting results, but potential threats to the validity of our work remain.

Threats to internal validity concern any confounding factor that may affect our results. First, the raw data used in this paper were normalized by using the z-score method, while the baseline method TCA+ provides four normalization methods [39]. Second, unlike TCA+, TDSelector does not introduce any feature selection method to process software metrics. Third, the weighting factor α changes with a step size of 0.1 when Algorithm 1 searches for the maximum AUC value; there is no doubt that a smaller step size would result in greater calculation time. Fourth, we trained only one type of defect predictor, based on the default parameter settings configured in Weka, because LR has been widely used in previous studies. Hence, we are aware that the results of our study could change under different settings of these factors.

Threats to statistical conclusion validity focus on whether conclusions about the relationships among variables based on the experimental data are correct or reasonable [50]. In addition to mean value and standard deviation, in this paper we also utilized Cliff's delta effect size, instead of hypothesis testing methods such as the Kruskal–Wallis test [51], to compare the results of different methods, because only 15 datasets were collected from PROMISE and AEEEM. According to the criteria initially suggested by Cohen and expanded by Sawilowsky [52], nearly all of the effect size values in this paper fall into the small and very small categories. This indicates that there is no significant difference in AUC value between the different combinations in question, though some perform better in terms of mean value and standard deviation. However, it is clear that our method performs noticeably better than baseline1, as indicated by the corresponding effect size.

Threats to external validity concern the generalization of the obtained results. First, the selection of experimental datasets (AEEEM and PROMISE) is the main threat to the validity of our results. All 14 projects used in this paper are written in Java and come from the Apache Software Foundation and the Eclipse Foundation. Although our experiments can be repeated with more open-source projects written in other programming languages and developed with different software metrics, the empirical results may differ from our main conclusions. Second, we utilized only three similarity indexes and five normalization methods when calculating the score of each candidate training instance. Therefore, the generalizability of our method to other similarity indexes (such as the Pearson correlation coefficient and Mahalanobis distance [53]) and normalization methods has yet to be tested. Third, to compare our method with TCA+, the defect predictors used in this paper were built using LR, so the generalizability of our method to other classification algorithms remains unclear.

8. Conclusion and Future Work

This study aims to train better defect predictors by selecting the most appropriate training data from the defect datasets available on the Internet, so as to improve the performance of cross-project defect prediction. In summary, the study was conducted on 14 open-source projects and consists of (1) an empirical validation of the usefulness of the number of defects contained in an instance for training data selection, (2) an in-depth analysis of our method TDSelector with regard to similarity and normalization, and (3) a comparison between our proposed method and the benchmark methods.

Compared with similar previous studies, the results of this study indicate that the inclusion of defects does improve the performance of CPDP predictors. By striking a reasonable balance between the similarity of test instances to training instances and defects, TDSelector can effectively select appropriate training instances, so that TDSelector-based defect predictors built using LR achieve better prediction performance in terms of AUC. More specifically, the combination of Euclidean distance and linear normalization is the preferred configuration for TDSelector. In addition, our results demonstrate the effectiveness of the proposed method in comparison with the baseline methods in the context of M2O CPDP scenarios. Hence, we believe that our approach can help developers build suitable predictors quickly for their new projects, because one of our interesting findings is that candidate instances with more bugs can be chosen directly as training instances.

Our future work mainly includes the following aspects. First, we plan to validate the generalizability of our study with more defect data from projects written in different languages. Second, we will focus on more effective hybrid methods based on different selection strategies, such as feature selection techniques [32]. Finally, we also plan to explore the possibility of considering not only the number of defects but also time variables (such as bug-fixing time) for training data selection.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The authors greatly appreciate Dr. Nam and Dr. Pan, the authors of [39], for providing them with the TCA source program and teaching them how to use it. This work was supported by the Natural Science Foundation of Hubei province (no. 2016CFB309) and the National Natural Science Foundation of China (nos. 61272111, 61273216, and 61572371).