Abstract

Software effort estimation plays a critical role in project management. Erroneous results may lead to overestimating or underestimating effort, which can have catastrophic consequences on project resources. Machine-learning techniques are increasingly popular in the field. Fuzzy logic models, in particular, are widely used to deal with imprecise and inaccurate data. The main goal of this research was to design and compare three different fuzzy logic models for predicting software effort: Mamdani, Sugeno with constant output, and Sugeno with linear output. To assist in the design of the fuzzy logic models, we conducted regression analysis, an approach we call “regression fuzzy logic.” State-of-the-art and unbiased performance evaluation criteria such as standardized accuracy, effect size, and mean balanced relative error were used to evaluate the models, together with statistical tests. Models were trained and tested using industrial projects from the International Software Benchmarking Standards Group (ISBSG) dataset. Results showed that data heteroscedasticity affected model performance. Fuzzy logic models were found to be very sensitive to outliers. We concluded that when regression analysis was used to design the model, the Sugeno fuzzy inference system with linear output outperformed the other models.

1. Introduction and Motivation

Estimating project resources continues to be a critical step in project management, including software project development [1]. The ability to predict the cost or effort of a software project has a direct impact on management's decision to accept or reject any given project. For example, overestimating software costs may lead to resource wastage and suboptimal delivery time, while underestimation may lead to project understaffing, budget overruns, and delayed delivery [2, 3]. This can lead to loss of contracts and thus potentially substantial financial losses. Although, in practice, there is a difference between the expressions “software cost estimation” and “software effort estimation,” many authors use either to express the effort required to build a software project, measured in person-hours. In this paper, the two expressions are used interchangeably.

Accurate estimation of software resources is very challenging, and many techniques have been investigated in order to improve the accuracy of software estimation models [4, 5]. The techniques used in software effort estimation (SEE) are organized into three main groups: expert judgment, algorithmic models, and machine learning [6]. Expert judgment depends on the estimator’s experience, while algorithmic models use mathematical equations to predict software cost. On the other hand, machine-learning models are based on nonlinear characteristics [4]. Algorithmic models and machine-learning models depend on project and cost factors. Among machine-learning models, the fuzzy logic model, first proposed by Zadeh [7], has been investigated in the area of software cost estimation by many researchers who have proposed models that outperform the classical SEE techniques [5, 6, 8]. Even so, significant limitations of such models have been identified:
(i) When examined individually, the performance of different fuzzy logic models seems to fluctuate when tested on different datasets, which can in turn cause confusion around determining the best model [9].
(ii) Most fuzzy logic models were evaluated using mean magnitude of relative error (MMRE), mean magnitude of error relative to the estimate (MMER), relative error (RE), and prediction level (Pred). All these performance evaluation criteria are considered biased [10–12].
(iii) Several previous studies did not use statistical tests to confirm whether the proposed models were statistically different from other models. Failure to employ proper statistical tests would invalidate the results [13].
(iv) Effective design of Sugeno fuzzy logic models with linear outputs, which are scarce in the field of software effort estimation, is a challenging task, especially for models with multiple inputs, where identifying the number of input fuzzy sets is in itself challenging.

To address the above limitations, we developed and evaluated three different fuzzy logic models using proper statistical tests and identical datasets extracted from the International Software Benchmarking Standards Group (ISBSG) [14], according to the evaluation criteria proposed by Shepperd and MacDonell [10]. The three models were compared with multiple linear regression (MLR) and feed-forward artificial neural network models developed with the same training and testing datasets used for the fuzzy logic models. The MLR model was taken to be the base model for SEE.

Among the challenges in designing fuzzy logic models is determining the number of model inputs and the parameters for the fuzzy Sugeno linear model. To tackle these challenges, we proposed regression fuzzy logic, where regression analysis was used to determine the optimal number of model inputs, as well as the parameters for the fuzzy Sugeno linear model. Note that our regression fuzzy logic (RFL) model should not be confused with fuzzy regression. The latter is actually a regression model that uses fuzzy logic as an input layer [15], whereas RFL is a fuzzy model that uses regression as an input layer. Regarding the fuzzy Sugeno linear model, $(n + 1)$ parameters are required if the number of inputs is $n$. MLR models are used to find the $(n + 1)$ parameters.

In this study, we investigated the following research questions.

RQ1: What is the impact of using regression analysis to tune the parameters of fuzzy models?

To answer this question, we used stepwise regression to determine the number of model inputs and multiple linear regression to adjust the parameters of the Sugeno fuzzy linear model. Then, the three fuzzy logic models, as well as the multiple linear regression model, were evaluated using four datasets based on several evaluation performance criteria, such as the mean absolute error, mean balanced relative error, mean inverted balanced relative error, standardized accuracy, and the effect size. Statistical tests such as the Wilcoxon test and the Scott-Knott test were used to validate model performance. The mean error of all models was evaluated to determine if the models were overestimating or underestimating.

RQ2: How might data heteroscedasticity affect the performance of such models?

Heteroscedasticity is a problem when project effort varies widely among projects of the same size. To answer this question, we filtered the ISBSG dataset and divided it into four datasets based on project productivity (effort/size). Homoscedastic datasets are those that have very little variation in project productivity. We studied whether the performance of each model fluctuates when a heteroscedasticity problem exists.

RQ3: How do outliers affect the performance of the models?

To answer this question, we conducted experiments with datasets containing outliers and then repeated the experiments with the outliers removed. We studied the sensitivity of all four models to outliers.

In real life, a machine-learning software estimation model has to be trained on historical datasets. The main objectives of RQ2 and RQ3 are to show that data heteroscedasticity and outliers have a big impact on the performance of the fuzzy-regression estimation models. This would be very helpful in organizations where they have several historical projects. This implies that data cleansing, such as removing outliers and minimizing the data heteroscedasticity effect, would be very useful before training the machine-learning prediction model. So, identifying these characteristics is of paramount importance, and this is precisely what best-managed organizations are interested in for estimation purposes. When the software requirements are in such a state of uncertainty, best-managed organizations will work first at reducing these uncertainties of product characteristics. For instance, in the medical field, data cleansing is highly important. Causes and effects are identified within a highly specialized context within very specific parameters, and generalization is avoided outside of these selected limitations and constraints.

The contributions of this paper can be summarized as follows:
(i) To the best of our knowledge, this is the first SEE study that compares the three different fuzzy logic models: Mamdani fuzzy logic, Sugeno fuzzy logic with constant output, and Sugeno fuzzy logic with linear output. Both the training and testing datasets were the same for all models. In addition, the three fuzzy logic models were compared to an MLR model. The datasets are from the ISBSG industry dataset. The algorithm provided in Section 4 shows how the dataset was filtered and processed.
(ii) We investigate the use of regression analysis in determining the number of model inputs, as well as the parameters of the Sugeno model with linear output. We call this approach “regression fuzzy logic.”
(iii) We test the effect of outliers on the performance of fuzzy logic models.
(iv) We investigate the influence of the heteroscedasticity problem on the performance of fuzzy logic models.

The paper is organized as follows. Section 2 summarizes related work in the field. Section 3 presents additional background information on techniques used in the experiments. The preparation and characteristics of the datasets are defined in Section 4. Section 5 demonstrates how the models were trained and tested. Section 6 discusses the results. Section 7 presents some threats to validity and lastly, Section 8 concludes the paper.

2. Related Work

Software effort estimation (SEE) plays a critical role in project management. Erroneous results may lead to overestimating or underestimating effort, which can have catastrophic consequences on project resources [16]. Many researchers have studied SEE by combining fuzzy logic (FL) with other techniques to develop models that predict effort accurately. Table 1 lists research in FL related to our work.

Table 1 also shows that many studies used datasets from the 1970s to the 1990s, such as COCOMO, NASA, and COCOMO II, to train and test FL models, and compared performance with linear regression (LR) and COCOMO equations. Moreover, most measured software size in thousands of lines of code (KLOC), several used thousands of delivered source instructions (KDSI), and two used use case points (UCP).

Most studies showed promising results for fuzzy logic (FL) models. Much of the research focus was on Mamdani fuzzy logic models rather than Sugeno fuzzy logic. Only one paper studied the difference between MLR, Mamdani fuzzy logic, and Sugeno fuzzy logic with constant parameters [29]. Our study is the first to compare Mamdani to Sugeno with constant output and Sugeno with linear output. The column “standalone” in Table 1 indicates whether an FL model was used as a standalone model to predict software effort or, alternatively, used in conjunction with other models. In some papers, FL models were compared to neural network (NN), fuzzy neural network (FNN), linear regression (LR), and SEER-SEM models. The evaluation criteria used in related work can be summarized as follows:
(i) AAE: average absolute error
(ii) ARE: average relative error
(iii) AE: absolute error
(iv) Pred (x): prediction level
(v) MMER: mean magnitude of error relative to the estimate
(vi) MMRE: mean magnitude of relative error
(vii) VAF: variance-accounted-for, a criterion measuring the degree of closeness between estimated and actual values
(viii) RMSE: root mean squared error
(ix) MdMER: median magnitude of error relative to the estimate
(x) MdMRE: median magnitude of relative error
(xi) ANOVA: analysis of variance
(xii) RE: relative error
(xiii) MSE: mean squared error

Several limitations are evident in the reported work. First, the majority of the above studies used single datasets for model evaluations. This is a major drawback since the performance of machine-learning models might excel on one dataset and deteriorate on other datasets [39]. Second, most of the models in Table 1 were tested using only MMRE, MMER, and Pred (x). Moreover, researchers concentrated on Mamdani-type fuzzy logic and ignored Sugeno fuzzy logic, especially Sugeno with linear output. Furthermore, very few studies used statistical tests to validate their results. Myrtveit and Stensrud [13] state that it is invalid to claim that one model is better than another without using proper statistical tests.

Our paper addressed the above limitations. We developed and compared three different fuzzy logic models using four different datasets. We also used the statistical tests and evaluation criteria proposed by Shepperd and MacDonell [10].

3. Background

3.1. Fuzzy Logic Model

In attempting to deal with uncertainty of software cost estimation, many techniques have been studied, yet most fail to deal with incomplete data and impreciseness [40]. Fuzzy logic has been more successful [17, 41]. This is due to the fuzzy nature of fuzzy logic, where model inputs have multiple memberships. Fuzzy logic tends to smoothen the transition from one membership to another [7].

Fuzzy logic (FL) models are generally grouped into Mamdani models [42] and Sugeno models [43]. Inputs in FL are partitioned into membership functions with shapes such as triangular, trapezoidal, and bell, which represent how input points are mapped to output [44]. The output of an FL model depends on the model type, i.e., Mamdani or Sugeno. Mamdani FL has its output(s) partitioned into membership functions with shapes [45, 46]. On the other hand, in Sugeno models (also known as Takagi-Sugeno-Kang models), the output is represented as a linear equation or a constant. The Sugeno fuzzy format [43] is given below.

If $(x_1, x_2, \ldots, x_n)$ is the input group, then the output group is $y = f(x_1, x_2, \ldots, x_n)$. Thus, the rules are as follows:

If $x_1$ is $A_1$ and $x_2$ is $A_2$ and ... and $x_n$ is $A_n$, then $y = a_0 + a_1 x_1 + \cdots + a_n x_n$, where $n$ is the number of inputs in the model, $A_1, \ldots, A_n$ are the input fuzzy sets, and $a_0, \ldots, a_n$ are the coefficients of the linear equation. When the output equation is zero-order, $y$ will be equal to a constant value. In both model types, fuzzy logic has four main parts [47]:
(i) Fuzzification, which maps the crisp input data to fuzzy sets in order to obtain the degree of equivalent membership.
(ii) Rules, where expert knowledge can be expressed as rules that define the relationship between the input(s) and output.
(iii) Aggregation, which involves firing the rules mentioned above. This occurs by inserting data into the fuzzy model, after which the resulting shapes from each output are added to generate one fuzzy output.
(iv) Defuzzification, which involves conversion of the fuzzy output back to numeric output.
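To make the four parts concrete, the following is a minimal Python sketch of first-order Sugeno inference. It is an illustration under stated assumptions, not the models built in this study: triangular membership functions, the minimum operator for "and", and weighted-average defuzzification are all assumed, and the names `tri` and `sugeno_predict` are hypothetical.

```python
def tri(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b (assumed shape)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def sugeno_predict(inputs, rules):
    """First-order Sugeno inference for one project.

    inputs: crisp values [x1, ..., xn].
    rules:  list of (memberships, coeffs); memberships holds one (a, b, c)
            triangle per input and coeffs is [a0, a1, ..., an].
            Passing coeffs = [a0] alone gives a zero-order (constant) rule.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for memberships, coeffs in rules:
        # Fuzzification, then "and" as min, gives the rule firing strength.
        w = min(tri(x, *m) for x, m in zip(inputs, memberships))
        # Rule output y = a0 + a1*x1 + ... + an*xn.
        y = coeffs[0] + sum(c * x for c, x in zip(coeffs[1:], inputs))
        weighted_sum += w * y
        weight_total += w
    if weight_total == 0.0:
        raise ValueError("no rule fired for these inputs")
    # Weighted-average defuzzification back to a crisp output.
    return weighted_sum / weight_total
```

With two rules on a single input, a value that belongs 75% to the first fuzzy set and 25% to the second yields an output that is the same 75/25 blend of the two rule equations.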

3.2. Multiple Linear Regression Model

Regression is one method for representing the relationship between two kinds of variables [48]. The dependent variable, representing the output, is the one that needs to be predicted. The others are called independent variables. Multiple regression involves many independent variables. A linear relationship between the predicted (dependent) variable and the independent variables can be expressed as follows:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \varepsilon,$$

where $y$ is the dependent variable, $x_1, \ldots, x_n$ are the independent variables for $n$ variables, and $\beta_0, \ldots, \beta_n$ are constant coefficients that are produced from the data using different techniques, such as least square error or maximum likelihood, that aim to reduce the error between the approximated and real data. Regardless of technique, error will exist, which is represented by $\varepsilon$ in the above equation.
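As a small illustration of the least-squares option, the sketch below estimates the coefficients by solving the normal equations in plain Python. The function name `fit_mlr` is hypothetical, and a production analysis would normally rely on a statistics package rather than this hand-written solver.

```python
def fit_mlr(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X: list of rows, each a list of n independent-variable values.
    y: list of dependent-variable values.
    Returns [b0, b1, ..., bn], with b0 the intercept.
    """
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        if abs(A[col][col]) < 1e-12:
            raise ValueError("singular design matrix")
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]
```

On noise-free data generated from a known linear equation, the function recovers the generating coefficients up to rounding error.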

3.3. Evaluation Criteria

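Examining the prediction accuracy of models depends upon the evaluation criteria used. Criteria such as the mean magnitude of relative error (MMRE), the mean magnitude of error relative to the estimate (MMER), and the prediction level (Pred (x)) are well known, but may be influenced by the presence of outliers and become biased [10, 49]; therefore, other criteria were employed in order to improve the reliability of the experiments.

(i) Mean absolute error (MAE) calculates the average of the absolute differences between each actual effort ($e_i$) and the corresponding predicted effort ($\hat{e}_i$). The total number of projects is represented as $n$.

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| e_i - \hat{e}_i \right|$$

(ii) Standardized accuracy (SA) measures the meaningfulness of model results, which ensures our model is not a random guess. More details can be found in [10].

$$\mathrm{SA} = 1 - \frac{\mathrm{MAE}}{\overline{\mathrm{MAE}}_{P_0}},$$

where $\overline{\mathrm{MAE}}_{P_0}$ is the mean MAE of a large number of runs of random guessing.

(iii) Effect size ($\Delta$) tests the likelihood that the model predicts the correct values rather than being a chance occurrence.

$$\Delta = \frac{\mathrm{MAE} - \overline{\mathrm{MAE}}_{P_0}}{s_{P_0}},$$

where $s_{P_0}$ is the sample standard deviation of the random guessing strategy.

(iv) Mean balanced relative error (MBRE) is given by

$$\mathrm{MBRE} = \frac{1}{n} \sum_{i=1}^{n} \frac{AE_i}{\min(e_i, \hat{e}_i)},$$

where $AE_i$ is the absolute error and is calculated as $AE_i = \left| e_i - \hat{e}_i \right|$.

(v) Mean inverted balanced relative error (MIBRE) is given by

$$\mathrm{MIBRE} = \frac{1}{n} \sum_{i=1}^{n} \frac{AE_i}{\max(e_i, \hat{e}_i)}$$

(vi) Mean error (ME) is calculated as

$$\mathrm{ME} = \frac{1}{n} \sum_{i=1}^{n} \left( e_i - \hat{e}_i \right)$$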
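The criteria above can be sketched in a few lines of Python. This is an illustrative sketch only: the random-guessing baseline follows the general idea in [10] of predicting each project with the actual effort of another randomly chosen project, and the number of runs, the seed, and all function names are assumptions rather than settings taken from this study.

```python
import random
import statistics


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mbre(actual, predicted):
    # Assumes strictly positive efforts so the denominator is never zero.
    return sum(abs(a - p) / min(a, p) for a, p in zip(actual, predicted)) / len(actual)


def mibre(actual, predicted):
    return sum(abs(a - p) / max(a, p) for a, p in zip(actual, predicted)) / len(actual)


def mean_error(actual, predicted):
    # Residual is actual minus estimated effort.
    return sum(a - p for a, p in zip(actual, predicted)) / len(actual)


def random_guess_maes(actual, runs=1000, seed=0):
    """MAE of repeated random guessing (needs at least two projects)."""
    rng = random.Random(seed)
    n = len(actual)
    results = []
    for _ in range(runs):
        guesses = []
        for i in range(n):
            j = rng.randrange(n - 1)
            if j >= i:  # skip the project's own effort
                j += 1
            guesses.append(actual[j])
        results.append(mae(actual, guesses))
    return results


def sa_and_delta(actual, predicted, runs=1000, seed=0):
    baseline = random_guess_maes(actual, runs, seed)
    mean_p0 = statistics.mean(baseline)
    s_p0 = statistics.stdev(baseline)
    m = mae(actual, predicted)
    return 1 - m / mean_p0, (m - mean_p0) / s_p0
```

A perfect predictor has an MAE of zero and therefore an SA of exactly 1, while a predictor no better than random guessing has an SA near 0.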

4. Datasets

For this research, the ISBSG release 11 [14] dataset was employed to examine the performance of the proposed models. According to Jorgensen and Shepperd [1], utilizing real-life reliable projects in SEE increases the reliability of the study. The dataset contains more than 5,000 industrial projects written in different programming languages and developed using various software development life cycles. Projects are categorized as either a new or enhanced development. Also, the software size of all projects was measured in function points using international standards such as IFPUG, COSMIC, etc. Therefore, to make the research consistent, only projects with IFPUG-adjusted function points were considered. The dataset contains more than 100 attributes for each project and includes such items as: project number, project completion date, software size, etc. Also, ISBSG ranks project data quality into four levels, “A” to “D,” where “A” indicates projects with the highest quality followed by “B” and so on.

After examining the dataset, we noticed that while some projects had similar software size, effort varied extensively. The ratio between software effort (output) and software size (the main input) is called the productivity ratio. We noticed a substantial difference in the productivity ratio among projects with similar software size. For instance, for the same adjusted function point (AFP), productivity (effort/size) varied from 0.2 to 300. The large difference in productivity ratio makes the dataset heterogeneous. Applying the same model for all projects was therefore not practical. To solve this issue, projects were grouped according to productivity ratio, making the datasets more homogeneous. The main dataset was divided into subdatasets, where projects in each subdataset had only small variations in productivity [50]. For this research, the dataset was divided into three datasets according to the productivity ratio ($P$) as follows:
(i) Dataset 1: small productivity ratio, with $P$ below the range of Dataset 2;
(ii) Dataset 2: medium productivity projects, with $P < 20$; and
(iii) Dataset 3: high productivity ($P \geq 20$).

Also, to evaluate the effect of mixing projects with different productivities together, a fourth dataset was added, which combined all three datasets. Dataset 3 was not as homogeneous as the first two, since productivity in this dataset varied between 20 and 330. This dataset was used to study the influence of data heteroscedasticity on the performance of fuzzy logic models.

Given the ISBSG dataset characteristics discussed above, a set of guidelines for selection of projects was needed to filter the dataset. The attributes chosen for analysis were as follows:
(i) AFP: adjusted function points, which indicates software size.
(ii) Development type: indicates whether the project is a new development, enhancement, or redevelopment.
(iii) Team size: the number of members in each development team.
(iv) Resource level: identifies which group was involved in developing the project, such as development team effort, development support, computer operation support, and end users or clients.
(v) Software effort: the effort in person-hours.

In software effort estimation, it is important to choose nonfunctional requirements as independent variables, in addition to functional requirements [51]. All of the above features are continuous variables except resource level, which is categorical. The original raw dataset contained 5,052 projects. Projects were selected using the following guidelines:
(1) Data quality: only projects with data quality A and B, as recommended by ISBSG, were selected, which reduced the dataset size to 4,474 projects.
(2) Software size in function points.
(3) Four inputs: AFP, team size, development type, and resource level; and one output variable: software effort.
(4) New development projects only: projects that were considered enhancement development, redevelopment, or other types were ignored, bringing the total to 1,805 projects.
(5) Missing information: all rows with missing data were deleted, leaving only 468 fully described projects.
(6) Dividing the datasets according to their productivity, as explained previously, to generate three distinct datasets and a combined one.
(7) Dividing each dataset randomly into training and testing datasets, where 70% of each dataset was used for training and 30% for testing.

The resulting datasets after applying steps 6 and 7 were as follows:
(a) Dataset 1 (small productivity) consisted of 245 projects, with 172 projects for training and 73 projects for testing.
(b) Dataset 2 (medium productivity) consisted of 116 projects, with 81 projects for training and 35 projects for testing.
(c) Dataset 3, with productivity higher than or equal to 20 ($P \geq 20$), consisted of 107 projects, with 75 projects for training and 32 projects for testing.
(d) Dataset 4, combining projects from all three datasets, consisted of 468 projects, with 328 projects for training and 140 projects for testing.
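The selection guidelines and the split can be outlined in Python as below. This is a sketch under assumptions: the dictionary keys, the `"New Development"` label, and the function names are hypothetical stand-ins for the ISBSG fields, and `low_cut` (the boundary between Datasets 1 and 2) is a placeholder parameter that the caller must supply; only the threshold of 20 for Dataset 3 is taken from the text.

```python
import random

# Hypothetical field names standing in for the ISBSG attributes.
REQUIRED = ("afp", "team_size", "resource_level", "effort")


def select_projects(projects):
    """Keep quality A/B, new-development projects with no missing fields."""
    kept = []
    for p in projects:
        if p.get("quality") not in ("A", "B"):
            continue
        if p.get("dev_type") != "New Development":
            continue
        if any(p.get(k) is None for k in REQUIRED):
            continue
        kept.append(p)
    return kept


def split_by_productivity(projects, low_cut, high_cut=20.0):
    """Group projects by productivity P = effort / size."""
    groups = {"dataset1": [], "dataset2": [], "dataset3": []}
    for p in projects:
        ratio = p["effort"] / p["afp"]
        if ratio >= high_cut:
            groups["dataset3"].append(p)
        elif ratio >= low_cut:
            groups["dataset2"].append(p)
        else:
            groups["dataset1"].append(p)
    return groups


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Random 70%/30% split; the seed is an arbitrary choice."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Dataset 4 then corresponds to applying `train_test_split` to the full list returned by `select_projects`.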

Table 2 presents some statistical characteristics of the effort attribute in the four datasets. Before using the dataset, a check is needed as to whether or not the attribute data types can be used directly in the models. As discussed in Section 3, FL models divide the input into partitions; to ensure smoothness of transition among input partitions, these inputs should be continuous. If one of the inputs is categorical (nominal), a conversion to a binary input is required [52]. Thus, the resource-level attribute, a categorical variable, was converted to dummy variables. A further operation was performed on the datasets to remove outliers from the testing datasets. The aim here was to study the effects on the results of statistical and error measurement tests. In other words, we analyzed the datasets with outliers, then without outliers. A discussion of the results is presented in Section 6. Figure 1 shows the boxplot of the four datasets, where stars represent outliers. Datasets 1, 3, and 4 had outliers, while Dataset 2 had none. Removing the outliers reduced the testing sets of Datasets 1, 3, and 4 to 65, 29, and 130 projects, respectively, while Dataset 2 remained unchanged.
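The conversion of a categorical attribute to dummy variables can be illustrated as follows. The sketch is generic rather than a record of the exact encoding used here: dropping the first level as the reference category is an assumption, and `to_dummies` is a hypothetical helper name.

```python
def to_dummies(values, drop_first=True):
    """Encode a categorical column as 0/1 dummy columns.

    Returns (rows, levels), where rows[i][j] is 1 when values[i] equals
    levels[j]. With drop_first=True the first sorted level becomes the
    reference category and gets no column of its own.
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    rows = [[1 if v == lvl else 0 for lvl in kept] for v in values]
    return rows, kept
```

For a resource-level column with values 1, 2, and 4, this produces two binary columns (for levels 2 and 4), and a project at level 1 is encoded as all zeros.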

5. Model Design

In this section, the methods used to design the four models, MLR, Sugeno linear FL, Sugeno constant FL, and Mamdani FL, are presented. The training dataset for each of the four datasets was used to train each model and then tested using the testing datasets. Performances were analyzed and results are presented in Section 6.

As mentioned in Section 4, since all projects have the same development type, the latter was removed as an input, such that three inputs remained for each model. They are software size (AFP), team size, and resource level. The resource-level attribute was replaced by dummy variables since it was a categorical variable. A stepwise regression was applied to exclude input variables that were not statistically significant. The same inputs were then utilized for all models in each dataset.

A multiple linear regression model was generated from every training dataset. The fuzzy logic models were then designed using the same input dataset.

To design the Mamdani FL model, the characteristics of each input were examined first, specifically the min, max, and average. This gives us a guideline as to the overall shape of memberships. Then, considering that information, all inputs and output were divided into multiple overlapping memberships. Simple rules were written to enable output generation. Usually, simple rules take each input and map it to the output in order to determine the effect of every input on the output. This step can be shortened if some knowledge of the data is available. In our case, since this knowledge existed, setting the rules was expedited. Then, to evaluate and improve the performance of the model, training datasets were randomly divided into multiple sections, and a group was tested each time. Rules and memberships were updated depending on the resulting error from those small tests.

Sugeno constant FL has similar characteristics to Mamdani FL, so the same steps were followed except for the output design. The output was divided into multiple constant membership functions. Initial values for each membership function were set by dividing the output range into multiple subsections and then calculating the average of each subsection. Then, the performance of the model was improved by utilizing the training datasets as explained previously.
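The initialization of the constant outputs can be sketched as follows. Equal-width, non-overlapping subsections are assumed purely for illustration, since the exact partitioning is a design decision, and `initial_constants` is a hypothetical name.

```python
def initial_constants(efforts, n_sections=3):
    """Average effort within each equal-width slice of the output range.

    Returns one starting constant per slice, or None for an empty slice.
    """
    lo, hi = min(efforts), max(efforts)
    if hi == lo:
        return [float(lo)] * n_sections
    width = (hi - lo) / n_sections
    sections = [[] for _ in range(n_sections)]
    for e in efforts:
        # The maximum value falls on the upper edge, so clamp the index.
        idx = min(int((e - lo) / width), n_sections - 1)
        sections[idx].append(e)
    return [sum(s) / len(s) if s else None for s in sections]
```

These averages are only starting points; as described above, they are then adjusted against the training data.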

Lastly, the Sugeno linear FL model was designed. As explained in Section 3, this model is a combination of fuzzy logic and linear regression concepts, each of which is reflected in the design. The steps for designing the input memberships were similar to the steps followed in the Mamdani and Sugeno constant models, whereas the output required a different methodology. The output was divided into multiple memberships, where each membership was represented by a linear regression equation. Hence, the output of the dataset was divided into corresponding multiple overlapping sections, and a regression analysis was applied to each, in order to generate the MLR equation. Subsequently, model performance was improved using the training dataset, as mentioned previously. Note that, overimproving the models using training datasets leads to overfitting, where training results are excellent, but testing results are not promising. Therefore, caution should be taken during the training steps. After training, all the models were tested on the testing datasets that were not involved in the training steps.
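A simplified sketch of the per-section regression step is given below. To stay short it fits a one-predictor least-squares line (size against effort) in each overlapping output section, whereas the models in this study use multiple linear regression; the section bounds and the names `fit_line` and `section_models` are hypothetical.

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = intercept + slope * x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def section_models(sizes, efforts, bounds):
    """Fit one line per output section.

    bounds: list of (low, high) effort ranges; ranges may overlap, so a
    project can contribute to more than one section.
    Returns a list of (intercept, slope), one per section.
    """
    models = []
    for low, high in bounds:
        pts = [(s, e) for s, e in zip(sizes, efforts) if low <= e <= high]
        if not pts:
            raise ValueError("empty section: adjust the bounds")
        xs, ys = zip(*pts)
        models.append(fit_line(xs, ys))
    return models
```

Each fitted equation would then serve as the linear output of one Sugeno output membership function.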

A summary of the system is shown in Figure 2.

Table 3 depicts the membership functions (mfs) of the Mamdani, Sugeno constant, and Sugeno linear models in the presence of outliers. Tables 4–6 display the parameters of the fuzzy logic models for Dataset 1, Dataset 2, and Dataset 3, respectively. Table 7 displays the parameters of the ANN and MLR models.

Regarding the software tools used in this research, MATLAB was used in designing fuzzy logic and neural network models. For statistical tests and analysis, MATLAB, Minitab, and Excel have been used. Testing results are analyzed and discussed in Section 6.

6. Model Evaluation & Discussion

The following subsections discuss the performance of the models with and without outliers.

6.1. Testing Models with Outliers

The three fuzzy logic models, Sugeno linear, Sugeno constant, and Mamdani, were tested on four testing datasets from ISBSG and then compared to the multilinear regression model. The resulting actual and estimated values were examined using the error criteria: MAE, MBRE, MIBRE, SA, and Δ. Table 8 presents the results of the comparisons.

Since MAE measures the absolute error between the estimated and actual values, the model with the lowest MAE generated the most accurate results. As shown in Table 8, the Sugeno linear FL results (bold) had the lowest MAE in all four datasets. The MBRE and MIBRE criteria were also used to examine the accuracy of the results. The results, as shown in Table 8, indicate that Sugeno linear FL outperformed the other models. Also, SA measures the meaningfulness of the results generated by the models, and Δ measures the likelihood that the results were generated by chance. Table 8 shows that the Sugeno linear FL predicted more meaningful results than other techniques across the four datasets. It is also clear from the SA and Δ tests that the fuzzy Mamdani model does not predict well when outliers are present, as shown in Table 8.

We also examined the tendency of a model to overestimate or underestimate, which was determined by the mean error (ME). ME was calculated by taking the mean of the residuals (difference between actual effort and estimated effort) from each dataset with outliers. As shown in Table 8, all models tended to overestimate in Dataset 3, three models overestimated in Dataset 1, and three models underestimated in Dataset 2. Surprisingly, Dataset 2 was the only dataset not containing outliers. Nonetheless, the Sugeno linear model outperformed the other models. We then continued to study this problem by repeating the same process after removing the outliers.

To confirm the validity of results, we applied statistical tests to examine the statistical characteristics of the estimated values resulting from the models, as shown in Table 9. We chose the nonparametric Wilcoxon test to check whether each pair of the proposed models is statistically different based on the absolute residuals. The nonparametric test was chosen because the absolute residuals were not normally distributed, as confirmed by the Anderson-Darling test. The hypotheses tested were:
H0: There is no significant difference between model(i) and model(j)
H1: There is a significant difference between model(i) and model(j)

If the resulting p value is greater than 0.05, the null hypothesis cannot be rejected, which indicates that the two models are not statistically different. On the other hand, if the p value is less than 0.05, then the null hypothesis is rejected. Table 9 reports the results of the Wilcoxon test, with p values below 0.05 given in bold. The results of Dataset 1 show that Sugeno linear FL was significantly different from all the other models, while for Datasets 2 and 4, the Sugeno linear FL and MLR performed similarly, and both were statistically different from Mamdani and Sugeno constant FL. For Dataset 3, based on the Wilcoxon test, none of the models were statistically different. This is because a heteroscedasticity problem exists in this dataset. The productivity ratio for this dataset (Dataset 3) was between 20 and 330, as discussed in Section 4. This huge difference in productivity led to the heteroscedasticity problem and affected the performance of the models.
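The decision rule can be illustrated with a small Python sketch of a paired Wilcoxon test. Several assumptions apply: the signed-rank variant is used because the absolute residuals of two models are paired by project, the p value comes from the normal approximation without tie or continuity corrections, and the function name is hypothetical. Statistical packages should be preferred for real analyses, particularly with small samples.

```python
import math


def wilcoxon_signed_rank(a, b):
    """Two-sided Wilcoxon signed-rank test (normal approximation).

    a, b: paired samples, e.g. absolute residuals of two models on the
    same test projects. Returns (w_plus, p_value).
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank for ties
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return w_plus, p_value
```

A returned p value below 0.05 would lead to rejecting H0 for that pair of models.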

One of the tests used to examine the stability of the models was the Scott-Knott test, which clusters the models into groups based on data results using multiple comparisons in one-way ANOVA [53]. Models were grouped without overlapping, i.e., without classifying one model into more than one group. Results were obtained, simply, from the graphs.

The Scott-Knott test uses the normally distributed absolute error values of the compared models. Therefore, if the values are not normally distributed, a transformation should take place using the Box-Cox algorithm [54], which was the case in our study.
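For reference, the Box-Cox transformation for a given parameter can be written as the short sketch below. Choosing the parameter (commonly by maximum likelihood) is omitted, so `lam` is simply supplied by the caller, and `box_cox` is a hypothetical name.

```python
import math


def box_cox(values, lam):
    """Box-Cox transform: (v**lam - 1) / lam, or ln(v) when lam is 0."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]
```

Because absolute errors can be exactly zero, a small positive shift may be needed before the transform is applied; whether such a shift was needed in this study is not stated.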

The models to be compared are lined up along the x-axis, sorted according to rank, with the transformed mean error shown on the y-axis. The farther a model is from the y-axis, the higher its rank. The vertical lines indicate the statistical results for each model. Models grouped together have the same color. The mean of the transformed absolute error is shown as a circle on the dashed line. The results of the Scott-Knott tests are shown in Figure 3. The Sugeno linear model was grouped alone in Dataset 1 and was also the highest ranked in Datasets 1, 2, and 4. In Dataset 3, where there was a heteroscedasticity issue, the models showed similar behavior. Nevertheless, the Sugeno linear model was among the highest ranked. MLR was ranked second twice and third twice, generally showing stable average performance, while the other FL models did not show stable behavior. This demonstrates that the Sugeno linear model was stable and provided higher accuracy.

6.2. Testing Models without Outliers

In this section, the models were examined again to study the effect of outliers on model performance. The outliers were removed from the four datasets, and the filtered datasets were then used for testing the models. The same statistical tests and error measurement tools were applied to the generated results. We used the interquartile range (IQR) method to determine the outliers. The IQR is defined as IQR = Q3 − Q1, where Q3 and Q1 are the upper and lower quartiles, respectively. Any object greater than Q3 + 1.5 IQR or less than Q1 − 1.5 IQR was considered an outlier, since, for normally distributed data, the region between Q1 − 1.5 IQR and Q3 + 1.5 IQR contains 99.3% of the objects [55].
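The IQR rule can be sketched in a few lines. This is an illustrative sketch under stated assumptions: it filters a single numeric variable (for example, project effort), the function names are hypothetical, and the quartiles use the "inclusive" linear-interpolation convention of Python's statistics module, which may differ slightly from the convention used in this study.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the values that lie within the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


# Hypothetical effort values; the last one lies far above the upper fence.
efforts = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
filtered = remove_outliers(efforts)
```

Because quartile conventions differ between tools, borderline projects may be classified differently depending on the software used.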

An interval plot for mean absolute error was generated for all the models using the four testing datasets with and without outliers as depicted in Figure 4. Since the interval plot was for MAE results, the closer the midpoint of each variable to zero, the better it performed. Also, the shorter the interval range, the better and more accurate the results. Therefore, it can be concluded from the plots that the general behavior of all the models was improved after removing the outliers. The results were more accurate and the range interval decreased, while the midpoint was closer to zero. The Sugeno linear FL model was markedly more accurate than the other models with or without outliers. It is fair to note that the MLR model had equivalent behavior to the Sugeno linear FL in Dataset 2.

To examine the improvement resulting from removal of the outliers, the same error measures were applied to datasets without outliers. Table 10 presents the results for MAE, MBRE, MIBRE, SA, and Δ.
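For reference, the error measures can be sketched as below. The formulas are assumptions that follow the definitions commonly used in the effort-estimation literature: MBRE divides the absolute error by the smaller of the actual and estimated effort, MIBRE by the larger, and SA compares the model's MAE against random guessing. Here the random-guessing baseline is computed as its exact expected value (each project predicted by the actual effort of another, randomly chosen project) instead of by repeated simulation. Δ is omitted because it additionally needs the standard deviation of the simulated random-guessing runs. All function names and sample values are hypothetical.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mbre(actual, predicted):
    """Mean balanced relative error: |a - p| / min(a, p), averaged."""
    return sum(abs(a - p) / min(a, p) for a, p in zip(actual, predicted)) / len(actual)


def mibre(actual, predicted):
    """Mean inverted balanced relative error: |a - p| / max(a, p), averaged."""
    return sum(abs(a - p) / max(a, p) for a, p in zip(actual, predicted)) / len(actual)


def random_guess_mae(actual):
    """Expected MAE when each project is predicted by another project's actual effort."""
    n = len(actual)
    total = sum(abs(actual[i] - actual[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def standardized_accuracy(actual, predicted):
    """SA as a fraction: 1 - MAE(model) / MAE(random guessing)."""
    return 1 - mae(actual, predicted) / random_guess_mae(actual)


# Hypothetical actual and estimated efforts (person-hours).
actual = [100.0, 200.0, 400.0]
estimated = [110.0, 180.0, 400.0]
scores = {
    "MAE": mae(actual, estimated),
    "MBRE": mbre(actual, estimated),
    "MIBRE": mibre(actual, estimated),
    "SA": standardized_accuracy(actual, estimated),
}
```

An SA close to 1 means the model is far better than random guessing, while an SA near or below 0 means it is no better.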

Finally, the mean error (ME) from each dataset was calculated to check the effect of removing outliers on overestimating and underestimating project effort. We noticed that the majority of models tend to underestimate after removing the outliers. This confirms the findings of the test on the datasets with outliers, where models tended to overestimate.

The performance of all models without outliers was improved, as the data in Table 10 indicates. We conclude that FL models are sensitive to outliers.

In addition, we examined the effect of outlier removal using the Scott-Knott test. Figure 5 shows the results of the Scott-Knott test. Generally, our conclusions about model stability did not change. However, we noted that the mean of transformed absolute error decreased. This shows that removing the outliers increases the accuracy of the models. We conclude that the Sugeno linear FL model was the superior model, both in the presence and absence of outliers.

To visualize the effect of the outliers on the results, a scatterplot was generated for the Sugeno linear model in each dataset (with and without outliers), where the x-axis is the actual effort and the y-axis is the estimated effort, as shown in Figure 6. It is evident that removing the outliers reduced the drift of the fitted line. Note that Dataset 2 has no outliers.

To validate the conclusion that the Sugeno linear model outperforms the others in estimating software costs, its results were compared to those of a feed-forward artificial neural network (ANN) model. The ANN model was trained and tested on the eight datasets used in this research: four with outliers and four without. A comparison between the MAE of both models is shown in Table 11. The Sugeno linear FL model outperformed the ANN model on all the datasets.

6.3. Answers to Research Questions

RQ1: What is the impact of using regression analysis on tuning the parameters of fuzzy models?

Based on the results in Section 6, we conclude that the Sugeno linear FL model combined the fuzziness characteristics of fuzzy logic models with the nature of regression models. The different membership functions and rules used allowed the model to cope with software parameter complexity. The Sugeno linear FL model showed stable behavior and high accuracy compared to MLR and the other models, as shown in the Scott-Knott plots. We conclude that regression analysis can assist in designing fuzzy logic models, especially in setting the parameters of the Sugeno fuzzy model with linear output.
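To make the link between regression and the Sugeno model with linear output concrete, the following is a minimal sketch of first-order Sugeno inference. It is not the model built in this study: the triangular membership functions, the rule base, and every coefficient are hypothetical, and only software size is fuzzified to keep the example short. Each rule's consequent is a linear function of the three inputs (size, team size, resource level), which is where regression coefficients could be placed, and the crisp output is the firing-strength-weighted average of the rule outputs.

```python
def triangular(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical rules: (size membership parameters,
#                      (intercept, size coef, team coef, resource coef)).
RULES = [
    ((0, 100, 300), (50.0, 8.0, 20.0, 10.0)),      # "small" projects
    ((100, 300, 600), (100.0, 10.0, 30.0, 15.0)),  # "medium" projects
    ((300, 600, 1200), (200.0, 12.0, 40.0, 20.0)), # "large" projects
]


def sugeno_linear(size, team, resource, rules=RULES):
    """First-order Sugeno output: weighted average of linear rule outputs."""
    weighted_sum = 0.0
    weight_total = 0.0
    for mf_params, (b0, b_size, b_team, b_res) in rules:
        w = triangular(size, *mf_params)  # rule firing strength
        z = b0 + b_size * size + b_team * team + b_res * resource
        weighted_sum += w * z
        weight_total += w
    if weight_total == 0:
        raise ValueError("input falls outside every membership function")
    return weighted_sum / weight_total


# A project of size 200 fires the "small" and "medium" rules equally.
effort = sugeno_linear(size=200, team=5, resource=1)
```

With a constant output, each rule would return only its intercept; the linear consequent is what lets the output vary within a fuzzy region in the way a regression line would.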

RQ2: How might data heteroscedasticity affect the performance of such models?

A heteroscedasticity issue appears when the productivity (effort/size) fluctuates among projects in the same dataset. To see this impact, we divided the datasets into four sets containing different groups of productivity as described in Section 4. Heteroscedasticity appeared in the third dataset. Multiple tests were applied on all the datasets to identify the difference in performance. We concluded that heteroscedasticity had a detrimental effect on the performance of fuzzy logic models, but when we applied statistical tests, we found that in those datasets where heteroscedasticity existed, none of the models were statistically different. However, we concluded that the Sugeno linear FL model outperformed other models in the presence and absence of the heteroscedasticity issue.

RQ3: How do outliers affect the performance of the models?

After generating four datasets, we extracted the outliers from each testing dataset. We then applied the same error measurements and statistical tests to each, as described in Section 6.2. We generated interval plots for the mean absolute error of the predicted results with and without outliers, as shown in Figure 4. A general improvement was observed after removing the outliers: MAE decreased markedly and the interval range shortened. Furthermore, the results showed that the datasets became more homogeneous after removing the outliers. We also found that the models tend to underestimate in the presence of outliers and overestimate when outliers are removed, yet the performance of all models improved when outliers were removed. Although outliers affect the performance of the models, the Sugeno linear model still proved to be the best-performing model.

We have shown in this research that the Sugeno linear fuzzy logic model outperforms the other models both in the presence and in the absence of outliers, and whether the dataset is homogeneous or heterogeneous. We mentioned that “the same model for all projects was therefore not practical”; this is because each model was trained using a different dataset. To predict the effort of a new project in a certain organization, the Sugeno linear fuzzy logic model can be retrained on historical projects from the same organization and then used to predict future projects.

7. Threats to Validity

This section presents threats to the validity of this research, specifically internal and external validity. Regarding internal validity, the datasets used in this research work were divided randomly into training and testing groups, 70% and 30%, respectively. Although the leave-one-out (LOO) cross validation method is less biased than the random splitting method [56], the technique was not implemented because of the difficulty of designing fuzzy logic models with the LOO method. In order to apply the LOO in our work, more than 1,000 models would have had to be manually generated in order to conduct all experiments with and without outliers, which is extremely difficult to implement. In our case, fuzzy logic models were designed manually from the training datasets.
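The two validation schemes discussed above can be contrasted with a short sketch. The function names and the fixed random seed are assumptions for illustration only. The random split produces a single training and testing pair, whereas leave-one-out produces one pair per project, so a manually designed fuzzy model would have to be rebuilt once for every project in every experiment.

```python
import random


def split_train_test(projects, train_fraction=0.7, seed=0):
    """Randomly split projects into training and testing groups."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(projects)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def leave_one_out(projects):
    """Yield (training projects, held-out project) once per project."""
    for i in range(len(projects)):
        yield projects[:i] + projects[i + 1:], projects[i]


# Hypothetical project identifiers.
projects = list(range(10))
train, test = split_train_test(projects)
n_models_random_split = 1
n_models_loo = sum(1 for _ in leave_one_out(projects))
```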

External validity questions whether or not the findings can be generalized. In this work, four datasets were generated from the ISBSG dataset with projects ranked A and B. Moreover, unbiased performance evaluation criteria and statistical tests were used to affirm the validity of the results. So, we can conclude that the results of this paper can be generalized to a large degree. However, using more datasets would yield more robust results.

8. Conclusions

This paper compared four models: Sugeno linear FL, Sugeno constant FL, Mamdani FL, and MLR. Models were trained and tested using four datasets extracted from ISBSG. Then, the performance of the models was analyzed by applying various unbiased performance evaluation criteria and statistical tests that included: MAE, MBRE, MIBRE, SA, and Scott-Knott. Then, outliers were removed, and the same tests were repeated in order to draw a conclusion about superior models. The inputs for all models were software size (AFP), team size, and resource level, while the output was software effort. Three main questions were posed at the beginning of the research:

RQ1: What is the impact of using regression analysis on tuning the parameters of fuzzy models?

RQ2: How might data heteroscedasticity affect the performance of such models?

RQ3: How do outliers affect the performance of the models?

Based on the discussions of the results in Section 6, we conclude the following:

(1) Combining the multiple linear regression concept with the fuzzy concept, especially in the Sugeno fuzzy model with linear output, led to a better design of fuzzy models, especially by learning the optimized number of model inputs, as well as the parameters for the fuzzy linear model.

(2) Where a heteroscedasticity problem exists, the Sugeno fuzzy model with linear output was the best performing among all models. However, we note that although the Sugeno linear is the superior model, it is not statistically different from the others.

(3) When outliers were removed, the performance of all the models improved. The Sugeno fuzzy model with linear output did however remain the superior model.

In conclusion, results showed that the Sugeno fuzzy model with linear output outperforms Mamdani and Sugeno with constant output. Furthermore, Sugeno with linear output was found to be statistically different from the other models on most of the datasets using Wilcoxon statistical tests in the absence of the heteroscedasticity problem. The validity of the results was also confirmed using the Scott-Knott test. Moreover, results showed that despite heteroscedasticity and the influence of outliers on the performance of all the fuzzy logic models, the Sugeno fuzzy model with linear output remained the model with the best performance.

Data Availability

The dataset used in this study (ISBSG) is publicly available but not free of charge. It is copyrighted, and it is illegal to share it with anyone. However, Section 4 (Datasets) describes in detail how the datasets were filtered and used.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors thank part-time research assistant Omnia Abu Waraga, Eng., for conducting experiments for this paper. Ali Bou Nassif extends thanks to the University of Sharjah for supporting this research through the Seed Research Project number 1602040221-P. The research was also supported by the Open UAE Research and Development Group at the University of Sharjah. Mohammad Azzeh is grateful to the Applied Science Private University, Amman, Jordan, for the financial support granted to conduct this research.