Computational Intelligence and Neuroscience

Volume 2019, Article ID 8367214, 17 pages

https://doi.org/10.1155/2019/8367214

## Software Development Effort Estimation Using Regression Fuzzy Models

^{1}Department of Electrical and Computer Engineering, University of Sharjah, P.O. Box 27272, Sharjah, UAE

^{2}Department of Electrical and Computer Engineering, University of Western Ontario, London, Ontario, Canada

^{3}Department of Software Engineering, Applied Science Private University, P.O. Box 166, Amman, Jordan

^{4}Software Project Management Research Team, ENSIAS, Mohammed V University, Rabat, Morocco

^{5}Department of Software Engineering, École de Technologie Supérieure, Montréal, Quebec, Canada

Correspondence should be addressed to Ali Bou Nassif; anassif@sharjah.ac.ae

Received 27 October 2018; Revised 31 December 2018; Accepted 24 January 2019; Published 20 February 2019

Academic Editor: Maciej Lawrynczuk

Copyright © 2019 Ali Bou Nassif et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Software effort estimation plays a critical role in project management. Erroneous results may lead to overestimating or underestimating effort, which can have catastrophic consequences on project resources. Machine-learning techniques are increasingly popular in the field. Fuzzy logic models, in particular, are widely used to deal with imprecise and inaccurate data. The main goal of this research was to design and compare three different fuzzy logic models for predicting software development effort: Mamdani, Sugeno with constant output, and Sugeno with linear output. To assist in the design of the fuzzy logic models, we conducted regression analysis, an approach we call “regression fuzzy logic.” The models were evaluated using statistical tests together with state-of-the-art, unbiased performance evaluation criteria such as standardized accuracy, effect size, and mean balanced relative error. Models were trained and tested using industrial projects from the International Software Benchmarking Standards Group (ISBSG) dataset. Results showed that data heteroscedasticity affected model performance. Fuzzy logic models were found to be very sensitive to outliers. We concluded that when regression analysis was used to design the model, the Sugeno fuzzy inference system with linear output outperformed the other models.

#### 1. Introduction and Motivation

Estimating project resources continues to be a critical step in project management, including software project development [1]. The ability to predict the cost or effort of a software project has a direct impact on management’s decision to accept or reject any given project. For example, overestimating software costs may lead to resource wastage and suboptimal delivery time, while underestimation may lead to project understaffing, budget overruns, and delayed delivery [2, 3]. This can lead to loss of contracts and thus potentially substantial financial losses. Although, in practice, there is a difference between the expressions “software cost estimation” and “software effort estimation,” many authors use either to express the effort required to build a software project, measured in person-hours. In this paper, the two expressions are used interchangeably.

Accurate estimation of software resources is very challenging, and many techniques have been investigated to improve the accuracy of software estimation models [4, 5]. The techniques used in software effort estimation (SEE) are organized into three main groups: expert judgment, algorithmic models, and machine learning [6]. Expert judgment depends on the estimator’s experience, while algorithmic models use mathematical equations to predict software cost. Machine-learning models, on the other hand, are based on nonlinear characteristics [4]. Algorithmic models and machine-learning models depend on project and cost factors. Among machine-learning models, the fuzzy logic model, first proposed by Zadeh [7], has been investigated in the area of software cost estimation by many researchers, who have proposed models that outperform the classical SEE techniques [5, 6, 8]. Even so, significant limitations of such models have been identified:

(i) When examined individually, the performance of different fuzzy logic models seems to fluctuate across datasets, which makes it difficult to determine the best model [9].

(ii) Most fuzzy logic models were evaluated using mean magnitude of relative error (MMRE), mean magnitude of error relative to the estimate (MMER), relative error (RE), and prediction level (Pred). All these performance evaluation criteria are considered biased [10–12].

(iii) Several previous studies did not use statistical tests to confirm whether the proposed models were statistically different from other models. Failure to employ proper statistical tests would invalidate the results [13].

(iv) Effective design of Sugeno fuzzy logic models with linear outputs, which are scarce in the field of software effort estimation, is a challenging task, especially for models with multiple inputs, where identifying the number of input fuzzy sets is in itself challenging.

To address the above limitations, we developed and evaluated three different fuzzy logic models using proper statistical tests and identical datasets extracted from the International Software Benchmarking Standards Group (ISBSG) [14], according to the evaluation criteria proposed by Shepperd and MacDonell [10]. The three models were compared with a multiple linear regression (MLR) model and feed-forward artificial neural network models developed with the same training and testing datasets used for the fuzzy logic models. The MLR model was taken to be the base model for SEE.

Among the challenges in designing fuzzy logic models are determining the number of model inputs and determining the parameters for the fuzzy Sugeno linear model. To tackle these challenges, we proposed regression fuzzy logic, where regression analysis was used to determine the optimal number of model inputs, as well as the parameters for the fuzzy Sugeno linear model. Note that our regression fuzzy logic (RFL) model should not be confused with fuzzy regression. The latter is actually a regression model that uses fuzzy logic as an input layer [15], whereas RFL is a fuzzy model that uses regression as an input layer. Regarding the fuzzy Sugeno linear model, $(n + 1)$ parameters are required if the number of inputs is $n$. MLR models are used to find the $(n + 1)$ parameters.
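As an illustration of the parameter count, the following minimal sketch evaluates a Sugeno model with linear output for two inputs, so each rule carries three parameters (two coefficients and an intercept). Everything specific here is an assumption made for illustration: the triangular membership functions, the product firing strength, the two rules, and the coefficient values are hypothetical and are not taken from the paper. In the regression fuzzy approach, the coefficients would come from fitted MLR models.

```python
def triangular(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def sugeno_linear(inputs, rules):
    """Weighted-average output of a Sugeno system with linear rule outputs.

    Each rule holds one triangular membership function per input ("mfs")
    and n + 1 parameters ("params"): n coefficients followed by an intercept.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for rule in rules:
        # Firing strength: product of the memberships (an assumed AND operator).
        strength = 1.0
        for x, (a, b, c) in zip(inputs, rule["mfs"]):
            strength *= triangular(x, a, b, c)
        *coefficients, intercept = rule["params"]
        rule_output = intercept + sum(k * x for k, x in zip(coefficients, inputs))
        weighted_sum += strength * rule_output
        weight_total += strength
    if weight_total == 0.0:
        raise ValueError("no rule fires for these inputs")
    return weighted_sum / weight_total


# Hypothetical rules for two inputs; the numbers are placeholders, not fitted values.
example_rules = [
    {"mfs": [(0, 100, 300), (0, 2, 6)], "params": [8.0, 50.0, 100.0]},
    {"mfs": [(100, 300, 500), (2, 6, 10)], "params": [12.0, 80.0, 200.0]},
]
example_inputs = [200, 4]
example_effort = sugeno_linear(example_inputs, example_rules)
```

With these placeholder values both rules fire with equal strength, so the result is the midpoint of the two linear rule outputs.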

In this study, we investigated the following research questions.

RQ1: What is the impact of using regression analysis to tune the parameters of fuzzy models?

To answer this question, we used stepwise regression to determine the number of model inputs and multiple linear regression to adjust the parameters of the Sugeno fuzzy linear model. Then, the three fuzzy logic models, as well as the multiple linear regression model, were evaluated on four datasets using several performance evaluation criteria, such as the mean absolute error, mean balanced relative error, mean inverted balanced relative error, standardized accuracy, and the effect size. Statistical tests such as the Wilcoxon test and the Scott-Knott test were used to validate model performance. The mean error of all models was evaluated to determine whether the models were overestimating or underestimating.
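The evaluation criteria named above can be sketched as follows. This is a hedged illustration using their commonly cited definitions, with standardized accuracy and effect size computed against a random-guessing baseline as in Shepperd and MacDonell [10]. The function names, the number of baseline runs, the seed, and the sample values are all assumptions; the code assumes strictly positive effort values.

```python
import random
import statistics


def mae(actual, predicted):
    """Mean absolute error."""
    return statistics.fmean(abs(a - p) for a, p in zip(actual, predicted))


def mbre(actual, predicted):
    """Mean balanced relative error: error divided by the smaller of the pair."""
    return statistics.fmean(abs(a - p) / min(a, p) for a, p in zip(actual, predicted))


def mibre(actual, predicted):
    """Mean inverted balanced relative error: error divided by the larger of the pair."""
    return statistics.fmean(abs(a - p) / max(a, p) for a, p in zip(actual, predicted))


def random_guess_maes(actual, runs=1000, seed=0):
    """MAE of repeated random guessing: each project is 'predicted' by the
    actual effort of another, randomly chosen project."""
    rng = random.Random(seed)
    n = len(actual)
    results = []
    for _ in range(runs):
        guesses = []
        for i in range(n):
            j = rng.randrange(n - 1)
            if j >= i:  # skip the project itself
                j += 1
            guesses.append(actual[j])
        results.append(mae(actual, guesses))
    return results


def standardized_accuracy(actual, predicted, baseline_maes):
    """Percentage improvement of the model's MAE over random guessing."""
    return 100.0 * (1.0 - mae(actual, predicted) / statistics.fmean(baseline_maes))


def effect_size(actual, predicted, baseline_maes):
    """Absolute difference from the baseline MAE, in baseline standard deviations."""
    gap = mae(actual, predicted) - statistics.fmean(baseline_maes)
    return abs(gap) / statistics.stdev(baseline_maes)


# Made-up effort values in person-hours, for illustration only.
actual_effort = [100, 200, 400, 800]
predicted_effort = [110, 180, 500, 640]
baseline = random_guess_maes(actual_effort)
sa = standardized_accuracy(actual_effort, predicted_effort, baseline)
delta = effect_size(actual_effort, predicted_effort, baseline)
```

Because the baseline is sampled, `sa` and `delta` vary slightly with the seed and the number of runs; the three error measures are deterministic.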

RQ2: How might data heteroscedasticity affect the performance of such models?

Heteroscedasticity becomes a problem when project effort varies widely among projects of the same size. To answer this question, we filtered the ISBSG dataset and divided it into four datasets based on project productivity (effort/size). Homoscedastic datasets are those that show very little variation in project productivity. We studied whether the performance of each model fluctuated when the data were heteroscedastic.
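The idea of selecting projects by productivity can be sketched as below. This is only an illustration under stated assumptions: the record layout, the productivity range, and the project values are hypothetical, and the actual partition used in the study is the one described in Section 4. The sketch shows that restricting a dataset to a narrow productivity range reduces the spread of productivity, which is what makes the subset closer to homoscedastic.

```python
import statistics


def productivity(project):
    """Productivity as defined in the text: effort divided by size."""
    return project["effort"] / project["size"]


def filter_by_productivity(projects, low, high):
    """Keep projects whose productivity lies in the half-open range [low, high)."""
    return [p for p in projects if low <= productivity(p) < high]


# Hypothetical projects (size in function points, effort in person-hours).
sample_projects = [
    {"size": 100, "effort": 500},    # productivity 5
    {"size": 200, "effort": 1400},   # productivity 7
    {"size": 150, "effort": 3000},   # productivity 20
    {"size": 300, "effort": 2400},   # productivity 8
    {"size": 120, "effort": 4200},   # productivity 35
]

# Assumed range for illustration; not the thresholds used in the paper.
narrow_subset = filter_by_productivity(sample_projects, 0, 10)

spread_all = statistics.stdev(productivity(p) for p in sample_projects)
spread_subset = statistics.stdev(productivity(p) for p in narrow_subset)
```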

RQ3: How do outliers affect the performance of the models?

To answer this question, we conducted experiments with datasets containing outliers and then repeated the experiments with the outliers removed. We studied the sensitivity of all four models to outliers.
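One common way to flag outliers before repeating an experiment is the interquartile-range rule, sketched below. This is an assumed technique chosen for illustration; this section does not state which outlier-detection method the study used. The multiplier of 1.5 and the effort values are likewise hypothetical.

```python
import statistics


def split_outliers(values, multiplier=1.5):
    """Separate values into (kept, outliers) using the interquartile-range rule.

    A value is an outlier if it lies more than `multiplier` times the IQR
    below the first quartile or above the third quartile.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    kept = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return kept, outliers


# Made-up effort values in person-hours; the last one is an obvious extreme.
efforts = [500, 600, 700, 800, 900, 1000, 1100, 20000]
kept_efforts, outlier_efforts = split_outliers(efforts)
```

A model would then be trained and tested once on `efforts` and once on `kept_efforts`, and the two sets of results compared.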

In practice, a machine-learning software estimation model has to be trained on historical datasets. The main objectives of RQ2 and RQ3 are to show that data heteroscedasticity and outliers have a strong impact on the performance of the regression fuzzy estimation models. This finding would help organizations that hold data on several historical projects. It implies that data cleansing, such as removing outliers and minimizing the effect of data heteroscedasticity, would be very useful before training the machine-learning prediction model. Identifying these characteristics is therefore of paramount importance, and this is precisely what best-managed organizations are interested in for estimation purposes. When the software requirements are in a state of uncertainty, best-managed organizations will work first at reducing the uncertainties in product characteristics. In the medical field, for instance, data cleansing is highly important: causes and effects are identified within a highly specialized context and very specific parameters, and generalization outside these limitations and constraints is avoided.

The contributions of this paper can be summarized as follows:

(i) To the best of our knowledge, this is the first SEE study that compares three different fuzzy logic models: Mamdani fuzzy logic, Sugeno fuzzy logic with constant output, and Sugeno fuzzy logic with linear output. Both the training and testing datasets were the same for all models. In addition, the three fuzzy logic models were compared to an MLR model. The datasets are from the ISBSG industry dataset. The algorithm provided in Section 4 shows how the dataset was filtered and processed.

(ii) We investigated the use of regression analysis in determining the number of model inputs, as well as the parameters of the Sugeno model with linear output. We call this approach “regression fuzzy.”

(iii) We tested the effect of outliers on the performance of fuzzy logic models.

(iv) We investigated the influence of the heteroscedasticity problem on the performance of fuzzy logic models.

The paper is organized as follows. Section 2 summarizes related work in the field. Section 3 presents additional background information on techniques used in the experiments. The preparation and characteristics of the datasets are defined in Section 4. Section 5 demonstrates how the models were trained and tested. Section 6 discusses the results. Section 7 presents some threats to validity and lastly, Section 8 concludes the paper.

#### 2. Related Work

Software effort estimation (SEE) plays a critical role in project management. Erroneous results may lead to overestimating or underestimating effort, which can have catastrophic consequences on project resources [16]. Many researchers have studied SEE by combining fuzzy logic (FL) with other techniques to develop models that predict effort accurately. Table 1 lists research in FL related to our work.