Abstract

Software failures, such as software defects, cause billions of dollars in losses every year and affect billions of people worldwide. Inadequate software testing is one cause of software failure. To predict software defects, this study proposes a model consisting of feature selection and classification. The correlation-based method was used for feature selection, and a radial basis function (RBF) neural network was used for classification. To test the proposed model, fourteen NASA data sets were used: CM1, JM1, KC1, KC2, KC3, KC4, MC1, MC2, MW1, PC1, PC2, PC3, PC4, and PC5. The well-known k-fold cross-validation method was used to divide each data set for training and testing the RBF, and the RBF was trained and tested both before and after feature selection. Precision, recall, F-measure, and accuracy were used to evaluate the performance of the proposed method. The precision obtained for the fourteen data sets was as follows: CM1, 94.01%; JM1, 85.18%; KC1, 83.24%; KC2, 81.27%; KC3, 79.30%; KC4, 85.29%; MC1, 99.89%; MC2, 73.27%; MW1, 90.90%; PC1, 98.79%; PC2, 100%; PC3, 95.67%; PC4, 95.12%; and PC5, 80.89%. Recall was as follows: CM1, 95.78%; JM1, 87.89%; KC1, 86.24%; KC2, 83.82%; KC3, 82.10%; KC4, 86.28%; MC1, 100%; MC2, 76.67%; MW1, 92.09%; PC1, 99.98%; PC2, 100%; PC3, 96.23%; PC4, 95.17%; and PC5, 81.80%. F-measure was as follows: CM1, 0.95; JM1, 0.87; KC1, 0.83; KC2, 0.82; KC3, 0.85; KC4, 0.86; MC1, 0.99; MC2, 0.76; MW1, 0.95; PC1, 0.99; PC2, 0.99; PC3, 0.97; PC4, 0.95; and PC5, 0.80. The accuracy obtained was as follows: CM1, 93.99%; JM1, 84.87%; KC1, 83.25%; KC2, 79.11%; KC3, 78.25%; KC4, 83.18%; MC1, 99.01%; MC2, 70.18%; MW1, 88.90%; PC1, 98.99%; PC2, 99.80%; PC3, 94.11%; PC4, 94.4%; and PC5, 79.02%. The results of the proposed method were compared with results obtained by different methods. The proposed model obtained better results than the other methods for the data sets CM1, KC4, MC1, PC1, PC2, PC3, PC4, and PC5.

1. Introduction

The volume of the global software business in 2020 was estimated at $389.86 billion, and it is expected to exceed $428.84 billion by 2028 [1]. Software defects have a huge financial impact on the software industry: software failures caused about $1.7 trillion in losses in 2017 [2]. The Mars Climate Orbiter, Mariner 1, and Ariane 5 are examples of expensive software failures, which cost $193 million, $18.5 million, and $8 million, respectively. Testing and quality assurance costs can reach up to 23% of the total software cost [3]. Moreover, fixing software faults during development costs 100 times less than fixing the software after deployment [3]. Different terms are used to describe software failures, such as fault, defect, bug, problem, error, and incident. Developers tend to use the term defect when the software is in a severe condition and may even be dangerous [4]. In this study, software defect refers to any fault that occurs during the development of software.

Software defect problems are traditionally addressed by inspectors who meet to inspect the software; later on, the programmer and moderator make the changes [4]. Inspectors need to spend time, effort, and money to detect software defects. In addition, the accuracy of detecting software defects can be affected by numerous factors, one of which is human error. Human error can make determining software defects a more complicated and high-risk task. Therefore, an additional opinion is needed to help inspectors detect software defects more accurately. Software defect prediction refers to the process of predicting defective code and specifying the area of the defective code using a classifier [5]. For the last few decades, software defect prediction has been a subject of interest for many researchers, and many automatic software defect prediction models have been proposed, allowing inspectors and programmers to spend their time and money on enhancing the models rather than on manually predicting the software defects. In addition, because predictions and classifications made by humans face many challenges, such as human error and low performance, most prediction and classification processes are now performed through automated systems. Artificial neural networks (ANNs) are among the most common methods used by researchers for prediction. ANNs have found their way into different real-world applications, such as medical diagnosis (heart disease, Wisconsin breast cancer, thyroid disease, and Pima Indian diabetes) [6]. ANNs are also applied in agriculture, for example, for plant leaf disease [7], predicting the mass of fruits [8], and olive oil harvesting time [9], as well as in weather forecasting [10], language recognition [11], and many other applications.

The entire data set can be used for training and testing the neural network. However, not all the features in the data set have the same importance for predicting the software defect. For instance, in high-dimensional problems, features with less importance can cause the generalization performance of the ANN to deteriorate [12]. Therefore, feature selection tools can be used to determine the features with high importance and to eliminate the features with less importance [13]. Different tools can be used to select the features, such as discriminant analysis (DA), principal component analysis (PCA), decision tree (DT), and multilayer perceptron (MLP). Feature selection has found many applications in solving real-world problems, such as landslide prediction [14, 15], medical applications [16, 17], the food industry [18, 19], face recognition [20, 21], and many others [22–24].

The aim of the study can be achieved by answering the following research questions:
(1) How does the RBF neural network perform when used to predict software defects for the fourteen NASA data sets?
(2) How does the RBF neural network perform when used to predict software defects for the fourteen NASA data sets after eliminating nonrelated features?
(3) How does the RBF neural network perform compared to previous methods used to predict software defects for the fourteen NASA data sets?
(4) How does the RBF neural network perform compared to previous methods used to predict software defects for the fourteen NASA data sets after eliminating nonrelated features?

Section 1 introduces this study. Section 2 summarizes previous work on prediction methods that were used to predict software defects using the same data sets used in this study. Section 3 introduces the data set and evaluation methods. Section 4 explains the experimental setup. Results are discussed in Section 5, and Section 6 concludes the paper.

2. Related Work

A framework to predict the software defect code area was introduced in [5]. The proposed system operates in two steps: feature selection, using the abstract syntax tree (AST) method, and prediction, using convolutional neural networks. The system was trained and tested using seven data sets imported from the PROMISE repository, and its performance was verified using three evaluation metrics: precision, recall, and F-measure. The system was trained and tested twice, first with all features and second with the features selected by AST; it achieved better performance when trained and tested with the AST-selected features.

Software defect prediction using a Bayesian neural network was proposed in [25]. Nine data sets consisting of nine features, obtained from the PROMISE data repository, were used to train the Bayesian neural network. Training was performed in two phases: in the first phase, the data were used without reducing the number of features, while in the second phase, the number of features was reduced to five using two feature selection methods, namely, ReliefF (the ReliefFAttributeEval evaluator with the Ranker search method) and the SubsetEval attribute evaluator with the BestFirst search method. The area under the curve (AUC) was used as the performance evaluation metric. The Bayesian neural network trained on the data after feature selection showed better performance than the network trained on the data with all features.

A deep belief network (DBN) was used in [26] to predict whether a defect exists in the software. Ten Java benchmark data sets obtained from the PROMISE repository were used, and abstract syntax trees (ASTs) were used as the feature selection tool to select the most important factors that can improve software defect prediction. The DBN performance was evaluated using precision, recall, and F1-measure. As expected, the DBN trained after eliminating unimportant features achieved better results.

Two classifiers, random forest (RF) and logistic regression (LR), together with standard data sets obtained from the PROMISE repository, were used in [27] to predict software defects. The data sets contain ten features that contribute to determining the location of the defect in the code. The study [27] proposed a model to predict software defects that reduces the number of features using abstract syntax trees before the data enter the RF and LR classifiers. Results were evaluated using four performance metrics: precision, recall, F-measure, and AUC. The evaluation demonstrated that the results achieved with feature selection were more accurate.

Four classifiers, random forest (RF), Naïve Bayes, RPart, and support vector machine (SVM), were used to predict software defects in the study proposed by [28], which aimed to identify the best classifier among the four. For this purpose, eighteen data sets were used: twelve obtained from the NASA data sets, three from open source data sets, and three from commercial data sets. The confusion matrix was used for performance evaluation. The study concludes that each classifier has its own strengths and weaknesses when used to classify certain data, and the strength of a classifier depends on the data type.

A study proposed in 2018 introduced a model to predict software defects based on Feature selection by Dependent Naïve Bayes (FDNB) and a Naïve Bayes neural network. The proposed system was trained and tested using eight NASA data sets; each data set consists of thirty-eight features, some of which are believed not to have a strong relation to software defect prediction. The cross-validation method with a confusion matrix was used to evaluate the performance of the proposed model. The system achieved better performance when it was trained with the selected features [29].

Software defects can be determined using some features associated with the software, some of which have a strong relationship with the class of the software (defective or nondefective). The authors of [30] introduced a method to investigate the relation between the features and the defective or nondefective state of the software. The study used eleven data sets whose number of features varied between twenty-one and thirty-nine (every data set has a different number of features). The similarity measure (SM) method was used to select the features with the strongest relation to the software defect. The strongest features were tested using the k-nearest neighbor (KNN) model, and the performance of KNN was evaluated using the AUC. The results show that KNN achieved better results with the selected features than with all features.

Since software defect prediction has been a subject of interest for the last few decades, a comparison between different software defect prediction models was introduced by [31]. The study used four classifiers from previous works: bagging, support vector machine (SVM), decision tree (DS), and random forest (RF). The performance of each classifier was evaluated using precision, recall, F-score, TPR, and FPR. In this study, feature selection was not considered. The results showed that RF had the best performance for most data sets.

A conventional neural network was also used to predict the number of software defects. Two data sets obtained from the NASA repository were used for the training stage, and two different data sets were used for testing. The proposed classifier's performance was evaluated using the mean square error, which showed the high performance of this model.

In addition to the previously listed work, different studies have been carried out to improve the methodology of predicting software defects: [32] introduced a framework for software defect prediction, the authors of [33] proposed a methodology that uses class imbalance learning in software defect detection, and a study introduced by [34] investigated the performance of cost-sensitive classification in predicting software defects.

3. Materials and Methods

The four main steps in this study are collecting and preprocessing the data set, followed by feature selection and software defect prediction, as shown in Figure 1.

3.1. Data Set

The NASA data sets are among the most common benchmark data sets widely used for software defect prediction analysis [35–39]. In this research, fourteen NASA data sets were used to train and test the proposed method. The data sets were obtained from different resources, including the PROMISE repository: http://openscience.us, https://data.mendeley.com/datasets/923xvkk5mm/1, and http://promise.site.uottawa.ca/SERepository/datasets-page.html. The number of attributes, the number of all samples, the number of defective samples, the number of nondefective samples, the number of missing attributes, and the defect rate for every data set used in this study are summarized in Table 1. The defect rate is obtained by dividing the number of defective samples by the total number of samples. For instance, for the JM1 data set, the number of attributes can be seen in column two (22), the number of all samples in column three (10885), the number of defective samples in column four (2106), and the number of nondefective samples in column five, while columns six and seven show the number of missing attributes (0) and the defect rate (19.35), respectively.
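To make the Table 1 quantities concrete, the following Python sketch computes them for a single data set. It assumes the data set has been exported to a CSV file with a boolean label column named defects; the file name and column name are hypothetical, since they vary across distributions of the NASA data sets.

```python
import pandas as pd

# Hypothetical file and column names: the NASA data sets are distributed in
# ARFF/CSV form, and the label column is variously named "defects" or "Defective".
df = pd.read_csv("JM1.csv")
label = df["defects"].astype(bool)

n_attributes = df.shape[1] - 1          # all columns except the label
n_samples = len(df)
n_defective = int(label.sum())
n_nondefective = n_samples - n_defective
n_missing = int(df.isna().sum().sum())  # count of missing attribute values
defect_rate = 100.0 * n_defective / n_samples  # e.g., 2106/10885 = 19.35% for JM1

print(n_attributes, n_samples, n_defective, n_nondefective, n_missing,
      f"{defect_rate:.2f}%")
```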

3.2. Data Set Preparation and Analysis

The data sets in this study have a different range of values for every attribute; Table 2 shows sample attribute values. For example, the MC1 data set has values between 0.09 and 256. This variation in values can adversely affect the neural network performance. Therefore, all data sets were normalized before being passed to the radial basis function neural network, i.e., each attribute value in every data set was normalized to lie between 0 and 1. Also, the true (defective software) and false (nondefective) labels were converted to one for the true value and zero for the false value.
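A minimal sketch of this normalization step, assuming the attributes are held in a NumPy matrix with one column per attribute:

```python
import numpy as np

def min_max_normalize(X: np.ndarray) -> np.ndarray:
    """Scale every attribute (column) to the [0, 1] range, as described above."""
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # guard constant columns
    return (X - x_min) / span

# Toy example: three samples of an attribute ranging from 0.09 to 256 (as in MC1).
X = np.array([[0.09], [128.0], [256.0]])
print(min_max_normalize(X).ravel())   # -> [0.0, 0.4998..., 1.0]

# Label encoding: true (defective) -> 1, false (nondefective) -> 0.
y = np.array([True, False, True]).astype(int)
print(y)                              # -> [1 0 1]
```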

Neural network performance highly depends on the data set used in the training stage, i.e., a neural network trained with a data set that represents all classes will achieve high performance [11]. Therefore, dividing the data set for training and testing a neural network has been a subject of interest in many previous studies [40]. Moreover, there is no standard for dividing the data set for training and testing purposes. Some scholars use 80% of the data set for training and the remaining 20% for testing and validation purposes. Others use the well-known k-fold cross-validation method (k is the number of folds), which is based on dividing the data set randomly into a specific number of folds (k), with each fold carrying the same number of samples. Training and testing a neural network using k-fold cross-validation is performed by holding out one of the folds for testing and using the rest of the folds (k−1) for training. This procedure is repeated k times, so each of the k folds is used for training k−1 times and for testing once [41]. In this study, the RBF neural network was trained and tested using 5-fold (5 : 1) cross-validation, where one fold is used for testing and the remaining four folds are used for training.
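The split described above can be sketched with scikit-learn's KFold; the feature matrix and labels below are random placeholders rather than one of the NASA data sets.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.random.rand(100, 21)          # placeholder feature matrix
y = np.random.randint(0, 2, 100)     # placeholder defect labels

kfold = KFold(n_splits=5, shuffle=True, random_state=42)  # random 5-fold split
for fold, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    X_train, y_train = X[train_idx], y[train_idx]   # k-1 = 4 folds for training
    X_test, y_test = X[test_idx], y[test_idx]       # 1 held-out fold for testing
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test samples")
```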

3.3. Background on Feature Selection

The size of the data used to train the neural network can affect the network's performance. Therefore, feature selection is widely used to choose the most important attributes for training the neural network. In this paper, the correlation-based feature selection method was used to reduce the dimensionality of the data by choosing the most important features; this method has been used before in [42, 43].
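The paper applies the correlation-based method through WEKA. As a rough stand-in, the following sketch ranks features by the absolute Pearson correlation between each feature and the class label and keeps the top 50% (the cutoff used later in Section 5); this is a simplification, not WEKA's exact attribute evaluator.

```python
import numpy as np

def top_correlated_features(X: np.ndarray, y: np.ndarray, keep_ratio: float = 0.5):
    """Rank features by |Pearson correlation| with the label, keep the top share."""
    scores = np.array([
        abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])
    ])
    n_keep = max(1, int(X.shape[1] * keep_ratio))
    selected = np.argsort(scores)[::-1][:n_keep]   # indices of the best features
    return np.sort(selected), scores

rng = np.random.default_rng(0)
X = rng.random((200, 10))
y = (X[:, 3] + 0.1 * rng.random(200) > 0.6).astype(int)  # label driven by feature 3
selected, scores = top_correlated_features(X, y)
print("selected feature indices:", selected)  # feature 3 should rank highly
```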

3.4. Radial Basis Function Neural Network

For this study, the well-known radial basis function (RBF) neural network was used to predict the software defect. The RBF neural network has been used in different applications such as medical applications [44], image processing [45], and agriculture [46]. The RBF network is a feedforward neural network consisting of three layers, an input layer, a hidden layer, and an output layer, as shown in Figure 2. The layers in the RBF neural network are fully connected to each other.

Figure 2 shows the RBF neural network, where $X_p$ is the input vector; each input represents one feature in the data set. The connection between the $i$th input and hidden neuron $h$ of the RBF hidden layer is scaled through the input weight $U_{i,h}$.

Equation (1) gives the input vector presented to hidden neuron $h$ of the RBF network, where $Y_{p,h}$ denotes the weighted input and $h$ is the index into the hidden layer:

$$Y_{p,h} = \left(U_{1,h}X_{p,1},\; U_{2,h}X_{p,2},\; \ldots,\; U_{I,h}X_{p,I}\right) \qquad (1)$$

Equation (2) gives the computation in the hidden layer, where $\varphi_h$ denotes the RBF activation function. RBF networks use different activation functions, the Gaussian being the most common one:

$$\varphi_h\left(Y_{p,h}\right) = \exp\left(-\frac{\left\lVert Y_{p,h} - C_h \right\rVert^2}{2\sigma_h^2}\right) \qquad (2)$$

where $C_h$ is the center of the RBF, $\sigma_h$ is the width of the RBF, and $\lVert\cdot\rVert$ is the Euclidean norm.

The RBF output, $O_p$, can be calculated as

$$O_p = \sum_{h=1}^{H} W_h\,\varphi_h\left(Y_{p,h}\right) + W_0 \qquad (3)$$

Hence, the weight the output layer receives from hidden neuron $h$ is denoted by $W_h$, and the bias weight is denoted by $W_0$.
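A minimal sketch of a Gaussian RBF network of the kind described by equations (1)-(3) follows. The original model was implemented in MATLAB; here the centers $C_h$ are chosen by k-means, a single width $\sigma$ is set by a common heuristic, and the output weights $W_h$ and bias $W_0$ are fitted by least squares, all of which are illustrative design choices rather than the paper's exact training procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

class RBFNetwork:
    """Minimal Gaussian RBF network: k-means centers, least-squares output weights."""

    def __init__(self, n_hidden: int = 10):
        self.n_hidden = n_hidden

    def _phi(self, X: np.ndarray) -> np.ndarray:
        # Equation (2): Gaussian activation over distances to the centers C_h.
        d2 = ((X[:, None, :] - self.centers_[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * self.sigma_ ** 2))

    def fit(self, X: np.ndarray, y: np.ndarray):
        km = KMeans(n_clusters=self.n_hidden, n_init=10, random_state=0).fit(X)
        self.centers_ = km.cluster_centers_
        # Width heuristic: max distance between centers / sqrt(2 * n_hidden).
        dmax = max(np.linalg.norm(a - b)
                   for a in self.centers_ for b in self.centers_)
        self.sigma_ = dmax / np.sqrt(2.0 * self.n_hidden) or 1.0
        # Equation (3): solve for output weights W_h and bias W_0 by least squares.
        H = np.hstack([self._phi(X), np.ones((len(X), 1))])
        self.w_, *_ = np.linalg.lstsq(H, y.astype(float), rcond=None)
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        H = np.hstack([self._phi(X), np.ones((len(X), 1))])
        return (H @ self.w_ >= 0.5).astype(int)  # threshold the continuous output

rng = np.random.default_rng(1)
X = rng.random((300, 5))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
model = RBFNetwork(n_hidden=15).fit(X[:250], y[:250])
print("test accuracy:", (model.predict(X[250:]) == y[250:]).mean())
```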

3.5. Evaluation Methods

In previous research, many models were used to predict software defects, and different performance indicators were used to evaluate the performance of the software defect models, such as precision [3], recall [10], and specificity [30]. The performance indicators used in this study are precision, recall, F-measure, and accuracy (ACC).

The abbreviations used in this study are given as follows:
(i) TP is the true positive, which represents the correct prediction of the positive values
(ii) FP is the false positive, which represents the incorrect prediction of the positive values
(iii) TN is the true negative, which represents the correct prediction of the negative values
(iv) FN is the false negative, which represents the incorrect prediction of the negative values

Precision: it represents the correctness of the classification. It is calculated by dividing the number of samples correctly predicted as defective by the total number of samples predicted as defective:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall: it represents the rate of detecting defective modules. It is calculated by dividing the number of samples correctly predicted as defective by the total number of modules that are actually defective:

$$\text{Recall} = \frac{TP}{TP + FN}$$

Accuracy: it is calculated by taking the ratio between the total number of correctly predicted defective and nondefective samples and the total number of predicted samples:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

F-measure: it considers recall and precision to be equally important:

$$F\text{-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
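The four indicators follow directly from the TP, FP, TN, and FN counts; the short sketch below computes them for illustrative counts (not taken from the paper's experiments):

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Precision, recall, accuracy, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f_measure": f_measure}

# Illustrative counts only.
print(classification_metrics(tp=80, fp=5, tn=100, fn=15))
```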

4. Experimental Setup

In this research, the radial basis function neural network was used to predict the software defect in fourteen data sets. The experiments were performed for every data set in two stages: in the first stage, the RBF was trained and tested using all the features in the data set; in the second stage, the best features of every data set were selected using the WEKA software, and the RBF was then trained and tested using the selected features. Moreover, the k-fold cross-validation method was used to divide the data randomly before it was used to train the RBF. Meanwhile, precision, recall, F-measure, and accuracy were used to evaluate the results achieved by the RBF, and the RBF performance was compared with the results reported in previous work. The RBF neural network was implemented using MATLAB (version 2014b) toolboxes.
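Since the original pipeline combines WEKA (feature selection) and MATLAB (RBF training), the following Python sketch only approximates the protocol: min-max scaling, a correlation-based top-50% feature filter, and an RBF-kernel feature map with a linear readout standing in for the RBF network, evaluated with 5-fold cross-validation on synthetic stand-in data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

def abs_corr(X, y):
    """Score features by |Pearson correlation| with the label (for SelectKBest)."""
    return np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

X, y = make_classification(n_samples=500, n_features=22, random_state=0)  # stand-in

pipeline = make_pipeline(
    MinMaxScaler(),                                  # normalize attributes to [0, 1]
    SelectKBest(abs_corr, k=X.shape[1] // 2),        # keep the top 50% of features
    Nystroem(kernel="rbf", n_components=50, random_state=0),  # Gaussian RBF features
    LogisticRegression(max_iter=1000),               # linear readout layer
)

scores = cross_validate(pipeline, X, y,
                        cv=KFold(n_splits=5, shuffle=True, random_state=42),
                        scoring=["precision", "recall", "f1", "accuracy"])
for name in ["precision", "recall", "f1", "accuracy"]:
    print(name, scores[f"test_{name}"].mean().round(3))
```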

5. Results and Discussion

In this section, the research questions will be answered, and the results will be analyzed.

Table 3 carries the answer to the first research question: “How does the RBF neural network perform when it is used to predict the software defect for fourteen NASA data sets?” Table 3 summarizes the performance of the RBF neural network for predicting the defects in the fourteen data sets before applying the feature selection technique. It shows the RBF neural network performance in the training and testing stages using the prediction evaluation indicators accuracy, F-measure, precision, and recall.

5.1. Training Stage

RBF attained 100% prediction accuracy for the MC1, PC1, and PC2 data sets and accuracies between 93.20% and 94.27% for the CM1, PC3, and PC4 data sets. Meanwhile, the prediction accuracy for JM1, KC1, KC3, KC4, and MW1 varied between 80.09% and 88.67%. The lowest RBF prediction accuracy was 70.08%. For the F-measure, the highest value, 1.00, was obtained for MC1, PC1, and PC2. For the CM1, MW1, PC3, and PC4 data sets, RBF achieved F-measures between 0.93 and 0.98, and for JM1, KC1, KC2, KC3, and PC5, between 0.82 and 0.89. The lowest F-measure achieved by RBF was 0.77, for MC2. The highest RBF precision, 99.91%, was achieved for MC1, while the lowest, 72.63%, was achieved for MC2. The RBF precision for CM1, MW1, PC1, PC2, PC3, and PC4 varied between 94.19% and 98.99%, while it varied between 80.88% and 89.91% for JM1, KC1, KC2, KC3, KC4, and PC5. Recall is the third performance indicator used in this study. As shown in Table 3, 100% is the highest recall, achieved by RBF for MC1, and 76.82% is the lowest, obtained for MC2. Furthermore, RBF predicted the software defect on CM1, MW1, PC1, PC2, PC3, and PC4 with recall between 90.09% and 99.98%; the prediction recall for the remaining data sets ranged between 80.82% and 89.91%.

5.2. Testing Stage

As is well known, neural networks achieve better performance in the training stage than in the testing stage. In the testing stage, the RBF neural network achieved its highest and lowest accuracy, 98.42% and 67.29%, for the PC2 and MC2 data sets, respectively. Besides PC2, RBF obtained accuracy over 90% for five data sets: CM1, MC1, PC1, PC3, and PC4. The prediction accuracy for JM1, KC1, KC4, and MW1 was between 80.05% and 86.89%, while the RBF prediction accuracy for the rest of the data sets was lower than 78.25%. The best F-measure was 0.99, for PC1, MC1, and PC2. CM1, MW1, and PC3 were predicted with an F-measure over 0.90, and JM1, KC1, and KC3 with an F-measure over 0.80, while the lowest F-measures were obtained for KC2 and MC2. For the recall performance indicator, MC1 with 99.41% and MC2 with 77.13% represent the highest and lowest recall, respectively. Eight data sets were predicted with recall over 90.0%: CM1, KC1, MC1, MW1, PC1, PC2, PC3, and PC4. The rest of the data sets were predicted with recall over 80.0%. The last software defect prediction indicator used in the testing stage was precision. The highest precision obtained was 99.41%, for the MC1 data set, while the lowest was 70.59%, for the MC2 data set. Six data sets were predicted with precision above 90.0%: CM1, MC1, PC1, PC2, PC3, and PC4. In addition, five data sets were predicted with precision above 80.0%: JM1, KC1, KC3, KC4, and MW1. KC2 and PC5 achieved 79.82% and 78.01%, respectively.

This section answers the second research question in this study: “How does the RBF neural network perform when it is used to predict the software defect for fourteen NASA data sets after eliminating nonrelated features?” Table 4 shows the number of features selected using the correlation-based method. In this study, only the top 50% of the features ranked by the correlation-based method were chosen.

Table 5 shows the RBF neural network performance after applying the correlation-based method as a feature selection method. The following section will discuss the RBF software prediction performance in both training and testing stages.

5.3. Training Stage

During the training stage, the RBF neural network attained 100% accuracy for predicting the software defect for the MC1, PC1, and PC2 data sets. In addition, the RBF accuracy ranged between 90.71% and 96.49% for the CM1, MW1, PC3, and PC4 data sets. The rest of the data sets obtained accuracies between 81.89% and 87.83%, except for MC2 with 71.72%. An F-measure of 1.00 was obtained for MC1, PC1, and PC2, while CM1, MW1, and PC3 obtained 0.99 and JM1 and PC4 obtained 0.90. The lowest F-measure for software defect prediction, 0.79, was obtained for MC2; the F-measure for the rest of the data sets ranged between 0.84 and 0.89. The highest recall achieved by the RBF neural network, 100%, was obtained for three data sets: MC1, PC1, and PC2. The CM1, JM1, MW1, PC3, and PC4 data sets obtained recall over 90.00%. Meanwhile, RBF predicted the software defect for KC1, KC2, KC3, KC4, and PC5 with recall between 84.67% and 86.89%. For precision, as Table 5 shows, the RBF neural network attained its highest values for PC1 and PC2, directly followed by 99.79% for MC1, 97.96% for PC4, 97.10% for CM1, and 95.39% and 91.28% for the PC3 and MW1 data sets, respectively. The rest of the data sets ranged between 83.67% and 88.87%. The lowest precision was 72.27%, for MC2.

5.4. Testing Stage

The testing results obtained by the RBF neural network using the data sets after feature selection are listed in the testing part of Table 5. The highest software defect prediction accuracy was 99.80%, for the PC2 data set, followed closely by 99.01% for the MC1 data set. PC1 achieved 98.99%; PC4, 94.44%; PC3, 94.11%; and CM1, 93.99%. The lowest software defect prediction accuracy, 70.18%, was achieved for the MC2 data set; the rest of the data sets obtained accuracies between 79.02% and 88.90%. The highest F-measure obtained by the RBF neural network was 0.99, for three data sets, MC1, PC1, and PC2, while the lowest, 0.76, was obtained for the MC2 data set. CM1, MW1, and PC3 achieved results between 0.95 and 0.97, and the F-measure results for the rest of the data sets were below 0.88. For recall, RBF predicted the software defect with 100% for MC1 and PC1; 99.98% and 96.23% for PC2 and PC3, respectively; 95.17% for PC4; and 92.09% for MW1. The recall for the rest of the data sets was below 89.00%. The precision obtained by the RBF neural network on the testing data varied from 100%, obtained for MC1 and PC2, down to 73.27%, the lowest result, for MC2. Precision over 90.00% was obtained for the CM1, MC1, MW1, PC2, PC3, and PC4 data sets. Meanwhile, the precision for the KC3 data set was 79.30%, and the rest of the data sets obtained results between 80.89% and 85.29%. Finally, it is worth mentioning that this study is the only one to report the precision and recall values.

5.5. Comparing RBF Performance with Previous Methods

The third research question in this study is as follows: how does the RBF neural network perform when it is compared to previous methods that are used to predict software defects for the same data sets?

This section compares the software defect prediction results obtained by the RBF neural network for the fourteen data sets before feature selection with the methods reported in previous studies. It is worth mentioning that the two evaluation metrics precision and recall were excluded from the comparison because few previous studies report results for them. Furthermore, the results for every data set are compared separately with the results of the previous methods.

Table 6 presents the results for the CM1 data set, for which eight classification methods are compared. It can be seen that the highest F-measure, 0.96, was obtained by the RBF used in this study, followed by RBF with 0.95 and SVM with 0.95 from the study [47]. Since the RBF neural network is a black box, the reason behind the difference between the RBF result in this study and that of [34] cannot be explained. The lowest result, 0.78, was obtained by the SVM proposed in [18]. Meanwhile, the best accuracy, 93.99%, was obtained by this study, followed by SVM with 91.97% and RBF with 90.33%, while the lowest accuracy, 75.00%, was obtained by SVM.

Table 7 reflects the results for the JM1 data set. The F-measure obtained by the RBF in this study was 0.87, while the highest F-measure, 0.90, was obtained by MLP, SVM, and RBF [34]. Naive Bayes, RF, DS, and SVM obtained 0.89, 0.76, 0.71, and 0.71, respectively. The highest accuracy, 89.97%, was achieved by MLP, followed by 84.87% achieved by the RBF proposed in this study. The lowest accuracy, 69.00%, was obtained by the SVM proposed in [18]. SVM [34] and RBF [34] achieved 81.43% and 81.73%, respectively, and the rest of the methods achieved results lower than 80.00%.

The performance of the RBF neural network for the KC1 data set is presented in Table 8. The highest F-measure, 0.92, was obtained by MLP, SVM, and RBF [34]. Naive Bayes obtained 0.90, followed by 0.83 obtained by the RBF proposed in this study. The highest accuracy, 85.51%, was obtained using MLP, while the RBF neural network proposed in this study obtained 83.25%. MLP [3] obtained the lowest accuracy, 79.46%. Naive Bayes, SVM, and RBF [34] obtained accuracies of 82.10%, 84.47%, and 84.99%, respectively.

The KC2 F-measure results are shown in Table 9, where seven previous methods are compared with the RBF result obtained in this study. Naive Bayes [34], MLP [34], SVM [34], and RBF [34] obtained the same F-measure, 0.90, while RF, DS, and SVM obtained 0.82, 0.78, and 0.80, respectively. Accuracy is the second evaluation metric used to compare with the RBF neural network in this study. The highest accuracy, 84.78%, was achieved by Naive Bayes, followed by 83.64% and 83.63% obtained by MLP [34] and RBF [34]. The KC2 software defect prediction accuracy was 82.34% and 82.00% for SVM [34] and RF [18], respectively, and MLP [3] predicted the software defect for the KC2 data set with an accuracy of 79.64%.

Table 10 reflects the results for the KC3 data set. It can be seen that the best F-measure, 0.95, was obtained using SVM [34], followed by MLP with 0.94, Naive Bayes with 0.91, and RBF [34] with 0.90. Meanwhile, the lowest F-measure, 0.71, was obtained by RF; DS, SVM [18], and LR obtained 0.71, 0.78, and 0.78, respectively. The RBF used in this research obtained 0.85. Table 10 also shows the accuracy results, where 90.04% and 90.02% were the best accuracies, obtained by MLP and SVM [34], respectively. 89.78%, 86.17%, and 82.80% were the accuracies obtained by RBF [34], Naive Bayes, and RF. SVM [18], DS, and the RBF in this research obtained 79.01%, 78.01%, and 78.25%, respectively. Also, LR obtained 77.01%, while the J star method obtained the lowest accuracy among the methods mentioned in Table 10.

The results for the KC4 data set were compared with three previous methods: SVM, RPart, and RF. Among these, the best F-measure, 0.82, was obtained by RPart, followed by RF with 0.80 and SVM with 0.79, while the RBF proposed in this study achieved a better F-measure of 0.86. It can be noticed from Table 11 that accuracy was not reported for the previous methods, while RBF obtained 83.18%.

Table 12 summarizes the results of previous work and RBF for the MC1 data set. The highest and lowest F-measures were 0.99 for the RBF in this research and 0.10 for MLP, SVM [34], and RBF [34]. Naive Bayes and RF obtained the same F-measure, 0.97; DS obtained 0.95; and SVM and LR [18] obtained 0.88. On the other hand, J star obtained 100% accuracy, while MLP and RBF [34] obtained the second and third best accuracies with 99.40% and 99.26%, respectively. SVM [34] obtained 99.26%, RF 97%, and DS 94%, while SVM [18] and LR obtained the lowest results, 81%. J48 obtained 88.60%.

Seven classification methods are listed in Table 13. The best F-measure, 0.82, was obtained by Naive Bayes and SVM [34], followed by 0.81 for RBF [34]. MLP and RBF (this research) obtained 0.78 and 0.76, respectively, while RF, DS, SVM [18], and LR obtained 0.67, 0.66, 0.64, and 0.63. RBF [36] obtained the lowest F-measure, 0.48. Accuracy is the second performance evaluation metric mentioned in Table 13. It can be noticed that 72.52% is the best accuracy, obtained by Naive Bayes, followed by 72.00% for SVM [34], 70.18% for RBF (this research), and 69.51% and 69% for RBF [34] and RF, respectively. DS obtained 67.00%, SVM [18] 65.00%, and RBF [36] 64.68%, while LR obtained the lowest accuracy, 63.00%.

The results for the MW1 data set are reflected in Table 14. The best F-measure, 0.96, was obtained by two algorithms, SVM [34] and RBF [34]; MLP [34] and RBF (this research) obtained 0.95, and Naive Bayes obtained 0.90. KNN and DT obtained the lowest F-measures, 0.44 and 0.16, respectively. The best accuracy for predicting the software defect for the MW1 data set was 92.19%, for SVM [34], followed by 91.99% and 91.09% for RBF [34] and MLP, respectively. RBF [36] obtained 89.33%, and the RBF in this research obtained 88.90%. DT and KNN obtained the same accuracy, 86.66%, and the lowest accuracy, 83.63%, was obtained by Naive Bayes.

The PC1 data set results are given in Table 15. RBF (this research) obtained the best F-measure, 0.99, while SVM [34] obtained the lowest, 0.07. MLP and Naive Bayes both obtained 0.11; RF, 0.91; DS, 0.88; LR, 0.85; DT, 0.50; KNN, 0.28; and RBF [36], 0.15. RBF (this research) obtained the best accuracy, 98.99%, while the lowest, 30.09%, was obtained by MLP. KNN obtained 94.60%; RBF [34] and DT, 93.13%; SVM [34], 93.09%; RF, 91.00%; Naive Bayes, 88.07%; J star, 87.61%; DS, 87.00%; LR, 81.00%; and SVM [18], 79.00%.

Table 16 shows the results for the PC2 data set. It can be noticed that the F-measure obtained by MLP, SVM, and RBF [34] is 1.00, while Naive Bayes and RBF (this research) obtained 0.99. The highest software defect prediction accuracy, 99.80%, was obtained by RBF (this research), followed by 99.59% for SVM, 99.58% for RBF [34], and 99.52% for MLP. In addition, DT obtained 97.69%; RBF [36], 97.65%; Naive Bayes, 96.96%; and KNN, 96.77%.

Table 17 reflects the PC3 data set results. RBF (this research) obtained the highest F-measure, 0.97, while KNN and DT obtained the lowest, 0.35. RBF [34] and SVM [34] obtained 0.95; MLP, 0.94; RF, 0.84; DS, 0.81; LR, 0.79; and SVM [18], 0.78. RBF (this research) obtained 94.11%, which represents the best accuracy, followed by RBF [34] with 89.76% and SVM [34] with 89.33%. In addition, MLP obtained 87.55%, and RBF [36] and DT obtained 86.39% each. KNN obtained 86.07%; RF [18], 84.00%; and SVM [18], 74.00%; the lowest accuracy, 46.87%, was obtained by Naive Bayes.

The results for PC4 are summarized in Table 18. The highest F-measure, 0.95, was obtained by RBF (this research), while 0.25, 0.28, and 0.58 were the lowest, for RBF [36], KNN, and DT, respectively. MLP and SVM [34] obtained 0.94; RBF [34], 0.93; and Naive Bayes and RF [18] obtained 0.92 and 0.90, respectively. Also, DS obtained 0.86, and SVM [18] and LR obtained 0.86 and 0.84. Moreover, RBF (this research) obtained the highest accuracy, 94.44%, while RBF [34] obtained the lowest, 27.27%. RF obtained 90.00%, followed by 89.11% for MLP, 88.45% for SVM, 87.40% for RBF [36], 86.87% for DT, 85.82% for KNN, 85.51% for Naive Bayes, 85.00% for DS, 82.00% for LR, and 81.00% for SVM.

Table 19 shows the results for the PC5 data set, where the best F-measure, 0.99, was obtained by MLP, SVM [34], and RBF [34]. Naive Bayes obtained 0.98, and RBF (this research) obtained 0.80, followed by RF, DS, and SVM [18] with 0.76, 0.72, and 0.69, respectively. LR obtained 0.70. Meanwhile, DT, KNN, and RBF [36] obtained the lowest results with 0.53, 0.49, and 0.23, respectively. The accuracies obtained by SVM, MLP, and Naive Bayes were 97.23%, 97.03%, and 96.03%, respectively. RBF [34] obtained 87.27%; RBF (this research), 79.02%; RF, 76.00%; RBF [36], 75.79%; DT, 75.00%; KNN, 73.03%; DS, 71.01%; SVM, 68.01%; and LR [18], 68.00%.

The fourth research question in this study is as follows: “How does the RBF neural network perform when it is compared to previous methods that are used to predict the software defect for fourteen NASA data sets after eliminating the nonrelated features?” Table 20 reflects the answer to question four. As seen in Table 20, this study is the first to report software defect prediction results for the KC4 data set. To compare the results obtained in this study with those obtained in previous work, two metrics were selected: F-measure and accuracy. Nevertheless, few studies report F-measure results, especially when feature selection is performed before predicting the software defects.

The CM1 results in Table 20 show that the best F-measure, 0.95, was obtained by RBF (this research) and SVM, followed by DT and MLP with 0.94 each. The best accuracy, 93.99%, was obtained by RBF (this research), followed by 90.69% obtained by SVM; MLP [36], Boost-RF, Bag-RF-FS, DT, and MLP [38] obtained accuracies between 89.32% and 89.97%. In addition, RF obtained 84.60%, and J star obtained the lowest accuracy, 81.69%, among all listed classification methods.

For the JM1 data set, RBF (this research) obtained the highest F-measure, 0.87, while MLP obtained the lowest, 0.17. RBF (this research) also obtained the highest accuracy, 84.87%, followed by J star with 81.90%; MLP, Boost-RF, and Bag-RF-FS obtained 80.44%, 80.56%, and 80.61%, respectively. RF obtained the lowest accuracy, 73.90%.

The results of the nine prediction methods used to predict the software defect in the KC1 data set are listed in Table 20. MLP [38], SVM [38], and DT obtained the best F-measures with 0.92, 0.91, and 0.91, respectively. RBF (this research) obtained 0.83, and MLP [36] obtained the lowest F-measure, 0.43. J star and DT obtained the best accuracies, 85.97% and 85.68%, respectively, while the lowest accuracy, 71.08%, was obtained by J48. RF [29] also obtained 71.08%; Boost-RF and Bag-RF-FS obtained 78.51%; MLP [36] obtained 77.65%; DT obtained 84.56%; RF obtained 84.16%; RBF (this research) obtained 83.25%; SVM [38] obtained 84.59%; and MLP [38] obtained 85.68%.

Only one previous study was available for comparison on the KC2 data set. RBF (this research) obtained an F-measure of 0.82, while the LSTSVM study did not report this value. RBF (this research) obtained an accuracy of 79.11%, while LSTSVM obtained a better accuracy of 86.12%.

For the KC3 data set, the highest F-measure, 0.96, was attained by three algorithms, MLP [36], DT, and SVM [38], while the lowest, 0.28, was obtained using MLP. RBF (this research) obtained 0.85. The highest and lowest accuracies, 93.50% and 77.58%, were attained using MLP [38] and Bag-RF-FS, respectively. SVM [38] and DT obtained 93.28% and 93.11%, respectively, while MLP obtained 82.75%; J star and RF, 80.50% each; and Boost-RF, 79.13%. RBF (this research) obtained 78.25%.

The results of software defect prediction obtained using the RBF neural network for the KC4 data set after feature selection can be seen in Table 20: 0.86 for F-measure and 83.18% for accuracy. Moreover, KC4 has not been a subject of interest for researchers; therefore, no previous work was found for comparison.

Different algorithms were used to predict the software defect for the MC1 data set, namely, Boost-RF, Bag-RF-FS, MLP, and RBF (this research). The latter is the only algorithm for which the F-measure was reported, and it was 0.99. Also, the RBF neural network attained the highest accuracy with 99.01%, while the remaining algorithms obtained 97.06%.

The same algorithms used with the MC1 data set were used with the MC2 data set. The F-measure obtained using RBF (this research) is the highest, 0.76, while the lowest are 0.33 for Boost-RF and 0.36 for Bag-RF-FS. MLP obtained the best accuracy (97.60%), and RBF (this research) obtained 70.18%, while Boost-RF and Bag-RF-FS obtained 64.86% and 62.16%, respectively.

For the MW1 data set, the highest F-measure, 0.95, was obtained using RBF (this research), and the lowest, 0.40, using MLP. Moreover, 92.00% was the best accuracy, obtained using MLP, followed by 89.33% obtained using Boost-RF and Bag-RF-FS. RBF obtained an accuracy of 88.90%.

The results of software defect prediction for the PC1 data set are summarized in Table 20. The best and lowest F-measures, 0.99 and 0.42, were obtained using RBF (this research) and MLP [38], respectively. The algorithms DT, MLP [36], and SVM obtained 0.96. RBF obtained the highest accuracy, 98.99%, followed by MLP [36] with 96.56% and Boost-RF and Bag-RF-FS with 96.07%. DT and MLP [36] obtained 93.58% and 93.59%, respectively. SVM obtained 93.10%, and J star and RF obtained the lowest accuracy, 91.17%.

The results of four algorithms for PC2 are mentioned in Table 20. RBF (this research), with 0.99, is the only method for which the F-measure was reported. In addition, the highest accuracy obtained is 99.80%, using RBF, and the lowest is 97.23%, using Boost-RF. Meanwhile, Bag-RF-FS and MLP obtained 97.69%.

The best F-measure result obtained for the PC3 data set was 0.97, using RBF (this research), and the lowest was 0.14, using MLP. RBF (this research) obtained an accuracy of 94.11%, while Boost-RF and Bag-RF-FS obtained 87.34% and MLP obtained the lowest accuracy, 85.12%.

The lowest and highest F-measures for the PC4 data set were obtained using MLP and RBF (this research), respectively. The highest accuracy for PC4, 94.44%, was obtained using RBF (this research), while MLP obtained the lowest, 88.97%. Boost-RF obtained 91.60%, and Bag-RF-FS obtained 90.81%.

The last data set in this study is PC5, where MLP obtained the lowest F-measure, 0.24, and RBF (this research) obtained the best. The highest accuracy, 79.02%, was obtained using RBF (this research), and the lowest, 74.80%, using MLP. Also, Boost-RF obtained 75.78%, and Bag-RF-FS obtained 76.96%.

As seen in Table 20 and discussed in this section, the RBF and feature selection proposed in this research showed better F-measure results than the previous methods for the data sets CM1, JM1, KC2, KC4, MC1, MC2, MW1, PC1, PC2, PC3, PC4, and PC5. However, they showed lower F-measure results than the previous methods for two data sets, KC1 and KC3. Furthermore, the RBF and feature selection proposed in this research showed better accuracy results than the previous methods for the data sets CM1, KC4, MC1, PC1, PC2, PC3, PC4, and PC5, while the previous methods showed better accuracy results for the data sets KC1, KC2, KC3, MC1, and MW1.

6. Conclusion

Software defects are one of the major causes of software failure. This study introduced a model to predict software defects, consisting of the correlation-based method for feature selection and an RBF neural network for prediction. The study focused on predicting the software defect in fourteen well-known NASA data sets, namely, CM1, JM1, KC1, KC2, KC3, KC4, MC1, MC2, MW1, PC1, PC2, PC3, PC4, and PC5. The RBF neural network was tested on the fourteen data sets in two stages: first, using the data sets before performing feature selection, and second, after using the correlation-based method to select the important features. The k-fold cross-validation method was used to divide the data set for training and testing the neural network. Furthermore, precision, recall, F-measure, and accuracy were used to evaluate the RBF neural network performance. The results obtained in this study were compared with the results obtained by other methods. The RBF neural network showed better performance than the other methods on data sets such as CM1, KC4, MC1, PC1, PC2, PC3, PC4, and PC5. The main goal of the study was to introduce a model that can detect software defects with high accuracy; therefore, the four questions raised earlier in this study were answered. The limitation of the proposed method is that it could not predict the software defect with higher performance for the KC1, KC2, KC3, MC1, and MW1 data sets, as some previous methods did. Future work should focus on these data sets. In conclusion, the proposed method in this study can be used to detect software defects.

Data Availability

The data used to support the findings of this study are available in http://openscience.us, https://data.mendeley.com/datasets/923xvkk5mm/1, and http://promise.site.uottawa.ca/SERepository/datasets-page.html.

Conflicts of Interest

The author declares no conflicts of interest.

Acknowledgments

The author would like to thank Ajloun National University for the facilities used in this work.