Abstract

Project cost prediction is a key element in the development of civil engineering activities. Project cost is highly sensitive to diverse parameters and therefore follows complex trends that make it difficult to predict and fully understand. The present study was motivated by the rapid advancement of soft computing (SC) and the Internet of things (IoT). Three machine learning (ML) models, namely, extreme learning machine (ELM), multivariate adaptive regression spline (MARS), and partial least square regression (PLS), were adopted to predict field canal cost. Several essential predictors were used in the learning process, including the total length of the PVC pipeline, served area, geographical zone, construction year, and cost and duration of field canal improvement projects (FCIP) construction. Data were collected from the open-source published literature. The modeling results evidenced the potential of the applied SC models in predicting the FCIP cost. In numerical terms, the MARS model gave the lowest root mean square error (RMSE = 27422.7), mean absolute error (MAE = 19761.8), and mean absolute percentage error (MAPE = 0.05454), together with a Nash–Sutcliffe efficiency (NSE) of 0.94, an agreement index (MD) of 0.89, and a coefficient of determination (R2) of 0.94. The best prediction precision was obtained using all predictors except the geographical zone, which showed less influence on the construction cost. In general, the research outcome provides an informative preliminary cost estimate for civil engineering projects.

1. Introduction

The scarcity of freshwater has recently become a global problem and is expected to worsen in the future due to the increasing human population and the decline in annual water allocation per capita [1, 2]. The present scenario portrays water unsustainability due to the drastic increase in water utilization (more than sixfold) in the 20th century [3]. It is presently estimated that about 1.2 billion people globally have no access to a clean water supply [4]. Hence, several policies and projects are being implemented globally to ensure water sustainability. One such project is the FCIP, which aims at increasing the conveyance efficiency of field canals by about 25% through improvement of the field canals used for irrigation in farmlands [5]. The project requires the construction of a buried PVC pipeline, rather than relying on earthen field canals, to reduce water seepage and losses during field operations [6]. An FCIP comprises several simple components and structures, including plain concrete intakes for water collection from the source; water is channelled through the suction pipes to a plain concrete sump [7]. Water is first accumulated in the sump before being pumped by the pumping sets through the PVC pipelines to the irrigation valves. The major components of FCIPs are civil works, mechanical components, and electrical components. The civil works comprise the pump house, pipelines, suction pipes, intake, and sump structure, while the mechanical components are the irrigation valves, pump sets, and mechanical connections. The electrical boards and connections make up the electrical components of FCIPs [8].

A key aspect of FCIPs is the cost estimation that must be performed; manual cost estimation processes are time-consuming [9]. In some cases, estimates can be attained based on the personal expertise of engineers and decision-makers. However, such cost estimation is highly prone to bias and inaccuracy [10]. SC models have therefore been proposed as a potential solution to these issues. In line with this, the aim of this work is to develop a robust ML-based SC model for FCIP cost estimation. The proposed models are expected to help decision-makers and management engineers in making decisions from the perspective of the stakeholders.

The literature shows that numerous studies have focused on the development of reliable regression and mathematical techniques for cost estimation in civil engineering projects [11–16]. The persistent problem in this domain is the accuracy of these models, as the predicted cost must be highly accurate before the conception of the project. A weighted ANN was developed for unit cost prediction in highway projects by [16], while a parametric cost model based on a questionnaire survey was developed for the estimation of the final cost of pump stations by [17]. A fuzzy logic- (FL-) based parametric cost estimate model was presented by [18] for the prediction of the cost of building projects in the Gaza Strip. The study by [19] presented a hybrid ANN-FL model for cost prediction of water infrastructure. The prediction of the unit cost of highway projects in Libya using an ANN model was presented by [20], and the performance of the ANN model was excellent. A conceptual cost model for German residential building projects was developed by [21] using historical data for 75 residential projects sourced from the building cost information center. The use of ANN to determine the relevant parameters for cost prediction during tunnel construction in Greece was reported by [22] based on survey questionnaires. The survey was based on expert opinions and interviews in relation to the key cost drivers.

The reviewed literature suggests the need for intelligent models that are robust and capable of capturing the complexity of civil engineering problems in a more realistic manner. Several ML models have been reported recently, such as ANN [23], SVM [24], ANFIS [25], genetic programming [26], decision tree [27], and gradient boosting [28], and several others were reported in the latest review [29]. However, each of these models behaves differently in terms of prediction accuracy. Some existing models are also capable of providing interpretable results; for instance, the variable coefficients of regression models can explain the influence of each variable on the response of the model.

Numerous studies have focused on building projects without giving much attention to the conceptual cost of FCIPs. Hence, this study focuses on pipeline construction projects, which have not attracted appropriate research attention, especially regarding the provision of detailed model development steps in terms of sample size, multicollinearity, outliers, and singularity. For instance, the study by [16] applied only 14 and 4 cases for the training and validation of its neural network model, which raises concerns about the sample size of that study, as stated by [30]. The current study was motivated by the reviewed literature to predict the FCIP cost using newly explored machine learning models including ELM, MARS, and PLS. These models are advantageous as they learn very quickly with good performance and are useful in capturing complicated data mappings over a large set of predictors while producing interpretable results [31–34]. The modeling structure was based on correlation statistics, which were used to identify the input predictors for the ML models. Based on the modeling results, comprehensive comparative analyses were reported and discussed.

2. Soft Computing Models

2.1. Extreme Learning Machine

The ELM is a relatively recent method for training single-layer feedforward neural networks [35]. The traditional ELM, as shown in Figure 1, has one input layer, one hidden layer, and one output layer; each of these layers has a specific number of neurons. A linear function is generally selected as the activation function of the input and output layers of the ELM, while the sigmoid function is selected for the hidden layer [36]. The first step of the standard ELM is the random determination of the input weights and hidden biases, followed by the determination of the output weights using the Moore–Penrose generalized inverse to obtain the optimal solution of the resulting linear system [37]. The advantages of the ELM over gradient-based methods are its strong generalization capability, no iterative parameter tuning, and fast learning; these have made the ELM popular in numerous engineering tasks [38–40]. Consider a training dataset with N samples; the first process is to map the input vectors into an L-dimensional feature space via a nonlinear transformation. The simulated values of the ELM model are expressed as follows:

$$\sum_{i=1}^{L} \beta_i \, g(w_i \cdot x_j + b_i) = o_j, \quad j = 1, 2, \ldots, N, \tag{1}$$

where N represents the number of samples for training; $o_j$ represents the output vector associated with the input vector $x_j$; $\beta_i$ stands for the weight vector that connects the ith hidden neuron to the output layer; $w_i$ is the weight vector that connects the ith hidden neuron with the input layer; $b_i$ is the bias of the ith hidden neuron; and $g(\cdot)$ is the activation function.

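In the ELM, the idea is that the classical single-layer ANN can approximate all N samples with zero deviation, as mathematically expressed in the relation

$$\sum_{i=1}^{L} \beta_i \, g(w_i \cdot x_j + b_i) = t_j, \quad j = 1, 2, \ldots, N, \tag{2}$$

where $t_j$ is the target output vector that is related to the input vector $x_j$. Writing the above expression in matrix form gives the following:

$$H\beta = T, \tag{3}$$

where

$$H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_L \cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_L \cdot x_N + b_L) \end{bmatrix}_{N \times L}, \quad \beta = \begin{bmatrix} \beta_1^{T} \\ \vdots \\ \beta_L^{T} \end{bmatrix}, \quad T = \begin{bmatrix} t_1^{T} \\ \vdots \\ t_N^{T} \end{bmatrix}, \tag{4}$$

where $\beta$ is the matrix of weights that connects the hidden and output layers; H is the hidden layer output matrix based on N samples; and T is the target output matrix based on N samples.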

Assume that the hidden biases and input weights are constant; the model may then be considered a special linear system in which H and T are the known matrices of the independent and dependent parameters, respectively, while β is the coefficient matrix to be optimized. Hence, the least-squares solution of the linear system mentioned above can be derived as

$$\hat{\beta} = H^{\dagger} T, \tag{5}$$

where $H^{\dagger}$ is the Moore–Penrose generalized inverse matrix of H.
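The training procedure described above (random input weights and biases, a sigmoid hidden layer, and output weights obtained from the Moore–Penrose generalized inverse of H) can be sketched in a few lines. This is a minimal illustration, not the implementation used in the study: it assumes NumPy is available, the function names `elm_fit` and `elm_predict` are hypothetical, the weight range [-1, 1] and the 30 hidden neurons are arbitrary choices, and the data are synthetic rather than the FCIP dataset.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(X, T, n_hidden, rng):
    """Train a single-hidden-layer ELM and return (W, b, beta)."""
    n_features = X.shape[1]
    # Step 1: draw input weights and hidden biases at random; they stay fixed.
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    # Step 2: hidden-layer output matrix H with shape (N, L).
    H = sigmoid(X @ W + b)
    # Step 3: least-squares output weights, beta = pinv(H) @ T.
    beta = np.linalg.pinv(H) @ T
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Apply the fixed hidden layer, then the learned output weights."""
    return sigmoid(X @ W + b) @ beta

# Synthetic regression problem (illustrative only, not the FCIP data).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
T = np.sin(3.0 * X[:, 0]) + 0.5 * X[:, 1] ** 2

W, b, beta = elm_fit(X, T, n_hidden=30, rng=rng)
rmse = float(np.sqrt(np.mean((elm_predict(X, W, b, beta) - T) ** 2)))
print(f"training RMSE: {rmse:.4f}")
```

Because only the output weights are solved for, training reduces to a single pseudoinverse computation, which is the source of the fast learning speed attributed to the ELM.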

2.2. Multivariate Adaptive Regression Spline Model

MARS algorithms are nonlinear, nonparametric, flexible regression models that were first developed by [41] and have found application in many fields of engineering due to their robustness [42]. The model is built from three major components: the basis functions (BFs), the knots, and the spline function [43]. The role of the BFs is to capture the relationship between the predictors and the predicted variable; each BF takes the form $\max(0, x - c)$ or $\max(0, c - x)$, where x is the input variable value and c is the threshold value. The knots mark the endpoints of the regions over which the BFs apply. A regression model is developed for each node by applying a spline function that consists of one or more BFs, followed by the substitution of the principal predictors [44]. In the MARS model, the predicted value is mainly a linear combination of BFs. The MARS model can be summarized as follows: consider Y as the target variable and X as the input variable matrix; then, the equation of the MARS model can be written as follows:

$$Y = \beta_0 + \sum_{m=1}^{M} \beta_m \, BF_m(X), \tag{6}$$

where $\beta_0$ is the initial fixed (constant) value; $BF_m(X)$ is the mth BF applied for the fitting of the MARS model, with coefficient $\beta_m$; and M is the total number of BFs [45]. The two major phases of the MARS model are the selection phase (or forward search) and the backward pruning phase, as seen in Figure 2. The forward (selection) phase can be regarded as the search for a set of optimum input parameters. An excessive forward stepwise selection process normally results in a complicated, overfitted model due to a series of splits, and such models cannot predict well despite fitting the data perfectly. Hence, the backward procedure is normally applied to improve the predictive performance of the model by removing the unwanted variables that were selected in the selection phase.
The generalized cross-validation (GCV) is calculated as the deletion criterion, as it is the basis for the backward pruning process [46, 47]:

$$GCV = \frac{\dfrac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}{\left[ 1 - \dfrac{M + d \, (M - 1)/2}{N} \right]^2}, \tag{7}$$

where $y_i$ is the observed value; N is the number of data; $\hat{y}_i$ is the predicted value for pattern i; M is the number of BFs; and d is the penalty factor. In equation (7), the value of parameter d significantly impacts the procedure, as it is the optimization cost of each BF; its range is 2 ≤ d ≤ 4. The inclusion of several BFs can result in overfitting; therefore, it is important to omit some BFs during the pruning phase to enable the emergence of a well-fitted model with the least GCV value [48].
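The two building blocks of the MARS procedure, the mirrored hinge BFs and the GCV pruning criterion, can be illustrated with a short sketch. It is a hedged example rather than the study's implementation: the function names `hinge_pair` and `gcv` are hypothetical, the effective-parameter count M + d(M − 1)/2 is an assumed (commonly used) form of the penalty, d = 3 is an arbitrary value within the stated range 2 ≤ d ≤ 4, and the numbers are made up.

```python
def hinge_pair(x, c):
    """Return the mirrored pair of hinge basis functions at knot c."""
    return max(0.0, x - c), max(0.0, c - x)

def gcv(observed, predicted, n_basis, d=3.0):
    """GCV score, assuming an effective parameter count of M + d * (M - 1) / 2."""
    n = len(observed)
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n
    effective_params = n_basis + d * (n_basis - 1) / 2.0
    return mse / (1.0 - effective_params / n) ** 2

# Illustrative values only (not the FCIP data).
observed = [float(v) for v in range(1, 11)]
predicted = [v + 0.5 for v in observed]

right, left = hinge_pair(5.0, 3.0)   # input above the knot: only x - c is active
gcv_small = gcv(observed, predicted, n_basis=1)
gcv_large = gcv(observed, predicted, n_basis=3)
print(right, left, gcv_small, gcv_large)
```

For identical residuals, the model with more BFs receives the larger GCV score, which is why a BF survives backward pruning only if it reduces the residual error by more than its complexity penalty.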

2.3. Partial Least Square Regression (PLS) Model

The PLS regression model was first introduced in the literature by [49], and since then it has been widely considered a new multivariate analysis technique in many fields [50, 51]. It combines the features of principal component, typical multiple regression, and linear regression analyses; hence, it is suitable for solving numerous problems, especially problems that cannot be solved using conventional multiple regression methods and problems with multiple correlations [52]. The efficiency of PLS in such cases is based on its ability to decompose and screen the variables that best explain the dependent variables [53]. The first step of the PLS method is to extract the new variables, called components, which serve as the independent variables, followed by the determination and establishment of the linear relationship between the dependent and independent variables [54]. After calculating the coefficients using PLS, the next step is the construction of the regression equation of the dependent variable. The regression model developed by using the PLS method is represented as

$$Y = b_0 + b_1 t_1 + b_2 t_2 + \cdots + b_A t_A, \tag{8}$$

where $t_1, \ldots, t_A$ represent the linear combinations of the predictor variables (the extracted components) and $b_0, \ldots, b_A$ are the PLS-computed regression model parameters. A higher number of principal components in the PLS model translates to better model accuracy; however, an excessive number of principal components results in overfitting and higher error. Thus, the optimal number of principal components must be determined to achieve a balanced PLS model. The cross-validation method was used to calculate the sum of squared residuals in this study. The prediction ability of the resulting model is a function of the predictive residual errors sum of square (PRESS) value.
The optimal number of principal components can therefore be determined based on the minimum PRESS value, which can be calculated as

$$PRESS = \sum_{i=1}^{k} \left( y_i - \hat{y}_{(-i)} \right)^2, \tag{9}$$

where $y_i$ and $\hat{y}_{(-i)}$ represent the measured value of the ith sample and the estimated value upon exclusion of the ith sample, respectively, and k is the number of iterations for validation.
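The selection of the number of components by minimum PRESS can be sketched as follows. This is an illustrative example under several assumptions: it requires NumPy, it uses a single-response PLS fitted with the NIPALS algorithm (the study does not state which PLS algorithm was used), the validation is leave-one-out so that k equals the number of samples, the function names `pls1_fit` and `press` are hypothetical, and the data are synthetic.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit single-response PLS via NIPALS; return (coef, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    weights, loadings, inner = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)      # weight vector of the component
        t = Xk @ w                      # component scores
        tt = t @ t
        p = Xk.T @ t / tt               # X loadings
        c = (yk @ t) / tt               # regression of y on the scores
        Xk = Xk - np.outer(t, p)        # deflate X
        yk = yk - c * t                 # deflate y
        weights.append(w)
        loadings.append(p)
        inner.append(c)
    W, P, q = np.array(weights).T, np.array(loadings).T, np.array(inner)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, x_mean, y_mean

def press(X, y, n_components):
    """Leave-one-out PRESS: refit without sample i, then predict sample i."""
    total = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        coef, x_mean, y_mean = pls1_fit(X[keep], y[keep], n_components)
        pred = (X[i] - x_mean) @ coef + y_mean
        total += (y[i] - pred) ** 2
    return float(total)

# Synthetic, partly collinear predictors (illustrative only).
rng = np.random.default_rng(1)
base = rng.normal(size=(40, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 1],
    base[:, 0] + 0.1 * rng.normal(size=40),
    base[:, 1] - base[:, 0] + 0.1 * rng.normal(size=40),
])
y = 2.0 * base[:, 0] - 1.0 * base[:, 1] + 0.2 * rng.normal(size=40)

press_by_k = {k: press(X, y, k) for k in range(1, 5)}
best_k = min(press_by_k, key=press_by_k.get)
print(press_by_k, best_k)
```

When all components are retained, the PLS coefficients coincide with ordinary least squares, so PRESS is what distinguishes a parsimonious model from an overfitted one.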

3. Case Study and Data Explanation

For modeling purposes, the datasets were collected from the open-source literature [7]. The datasets describe the key cost drivers of the FCIPs. The variables are the served area, the total length of the PVC pipeline, the irrigation valve number, the construction year, the geographical zone, and the cost and duration of field canal improvement project (FCIP) construction. The dataset is significant because it gives irrigation authorities and decision-makers a prior understanding of the FCIP cost. The data of the current research were collected from a survey conducted for the Soltani Canal, Egypt. The quantitative costs relate to construction sites recorded between 2011 and 2018. The polyvinyl chloride (PVC) pipeline system is illustrated in Figure 3, with diameters ranging between 22.5 and 35 cm. The statistical properties of the dataset over the training and testing phases are reported in Tables 1 and 2. A total of 228 data records were used across the training and testing phases. In Tables 1 and 2, the statistics reported for the training and testing phases are the mean, standard error, median, mode, standard deviation, sample variance, kurtosis, skewness, range, minimum and maximum, sum, count, and confidence interval. The mean value of the FCIP cost is 353463.0 for the training phase, while the mean cost is 352714.35 for the testing phase. From Tables 1 and 2, it can be seen that the duration of FCIP construction ranges from 58 days to 127 days in the training dataset and from 59 days to 126 days in the testing dataset. The training and testing datasets are well distributed and almost resemble a normal distribution, as the mean and median are very close to each other for most of the variables.

4. Application Results and Analysis

The feasibility of three machine learning models (ELM, MARS, and PLS) for predicting the cost of FCIP construction was evaluated. The models were built based on different input combinations, as reported in Table 3. Based on the correlation statistics, the input combinations were constructed as shown in Figure 4.

Based on the tabulated input parameters, it can be recognized that the total length of the PVC pipeline has the strongest correlation with the construction cost, followed by the time duration, served area, irrigation valve number, and geographical zone.
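A correlation-based construction of input combinations can be sketched as follows. The sketch assumes that the combinations are nested, that is, that predictors are added one at a time in decreasing order of their absolute Pearson correlation with the cost; the actual combinations are those reported in Table 3. The function names and the small dataset are hypothetical and serve only to show the mechanics.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def nested_combinations(predictors, target):
    """Rank predictors by |r| with the target and build nested input sets."""
    ranked = sorted(
        predictors,
        key=lambda name: abs(pearson(predictors[name], target)),
        reverse=True,
    )
    return [ranked[:k] for k in range(1, len(ranked) + 1)]

# Hypothetical values for illustration (not the FCIP dataset).
cost = [100.0, 150.0, 210.0, 260.0, 300.0]
predictors = {
    "zone": [1.0, 3.0, 2.0, 3.0, 1.0],
    "duration": [60.0, 70.0, 95.0, 90.0, 120.0],
    "pipeline_length": [10.0, 15.0, 21.0, 26.0, 30.0],
}

combos = nested_combinations(predictors, cost)
for index, combo in enumerate(combos, start=1):
    print(f"M{index}: {combo}")
```

In this toy example the weakly correlated zone variable enters last, which mirrors the ordering described above, where the geographical zone has the lowest correlation with the cost.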

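Different statistical performance metrics, including the determination coefficient (R2), root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), Nash–Sutcliffe efficiency (NSE), and agreement index (MD), were calculated to validate the applied models statistically [55, 56]:

$$R^2 = \frac{\left[ \sum_{i=1}^{N} (O_i - \bar{O})(P_i - \bar{P}) \right]^2}{\sum_{i=1}^{N} (O_i - \bar{O})^2 \, \sum_{i=1}^{N} (P_i - \bar{P})^2}, \tag{10}$$

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (O_i - P_i)^2}, \tag{11}$$

$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| O_i - P_i \right|, \tag{12}$$

$$MAPE = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{O_i - P_i}{O_i} \right|, \tag{13}$$

$$NSE = 1 - \frac{\sum_{i=1}^{N} (O_i - P_i)^2}{\sum_{i=1}^{N} (O_i - \bar{O})^2}, \tag{14}$$

$$MD = 1 - \frac{\sum_{i=1}^{N} \left| O_i - P_i \right|^{j}}{\sum_{i=1}^{N} \left( \left| P_i - \bar{O} \right| + \left| O_i - \bar{O} \right| \right)^{j}}, \tag{15}$$

where $O_i$ and $P_i$ are the observed and predicted values of the FCIP cost; $\bar{O}$ and $\bar{P}$ are the mean values of the observed and predicted values of the FCIP cost; N is the number of observations; and j is the exponent term.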
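The six metrics can be computed together from paired observed and predicted values, as in the sketch below. It assumes the standard definitions of these metrics, with MAPE expressed as a fraction (consistent with the reported MAPE = 0.05454) and MD taken as the modified index of agreement with exponent j. The function name `evaluate` and the sample values are hypothetical.

```python
from math import sqrt

def evaluate(observed, predicted, j=1):
    """Return R2, RMSE, MAE, MAPE, NSE, and MD for paired value lists."""
    n = len(observed)
    o_mean = sum(observed) / n
    p_mean = sum(predicted) / n
    errors = [o - p for o, p in zip(observed, predicted)]

    sse = sum(e * e for e in errors)                       # squared errors
    sst = sum((o - o_mean) ** 2 for o in observed)         # spread of observations
    spp = sum((p - p_mean) ** 2 for p in predicted)        # spread of predictions
    cov = sum((o - o_mean) * (p - p_mean) for o, p in zip(observed, predicted))
    potential = sum(
        (abs(p - o_mean) + abs(o - o_mean)) ** j
        for o, p in zip(observed, predicted)
    )

    return {
        "R2": cov ** 2 / (sst * spp),
        "RMSE": sqrt(sse / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": sum(abs(e / o) for e, o in zip(errors, observed)) / n,
        "NSE": 1.0 - sse / sst,
        "MD": 1.0 - sum(abs(e) ** j for e in errors) / potential,
    }

# Hypothetical values for illustration (not the FCIP results).
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
metrics = evaluate(observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

RMSE and MAE are in the units of the cost itself, whereas MAPE, NSE, MD, and R2 are dimensionless, which is why the two groups differ by orders of magnitude in the reported results.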

Tables 4 and 5 report the statistical measures over the training and testing phases, respectively. In general, the models were less accurate when few predictors were used. Nevertheless, the MARS model exhibited better predictability over both the training and testing phases. For MARS, the maximum determination coefficient was achieved by model M6 (R2 = 0.94), with a minimum RMSE of 28458.17 in the training phase and 27422.7 in the testing phase. Model M6 uses all the predictor parameters except the geographical zone, which showed less influence on the cost. By comparison, the coefficient of determination (R2) of ELM and PLS peaked at 0.90, with RMSE of 36011.43 and 36013.16, respectively, for model M6 in the training phase. Similarly, in the testing phase, the R2 of ELM and PLS peaked at 0.89, with RMSE of 37141.8 and 37140.3, respectively, for model M6. In addition, the agreement index (MD), which is based on the ratio of the mean square error to the potential error, is 0.89 for the MARS M6 model in both the training and testing phases.

The model performances were also assessed using graphical presentations, namely, scatter plots and a Taylor diagram. Figure 5 shows the scatter plots of the actual observations against the predicted values. Among the three applied prediction models, the MARS model gave the closest match to the observations, with a high correlation value. Figure 6 shows the Taylor diagram, in which the prediction models were evaluated based on their distance from the observations in terms of multiple statistical metrics (i.e., standard deviation, RMSE, and correlation value).

5. Discussion

Various studies have been conducted to develop reliable parametric cost models, but no such study is available for FCIPs [5]. Cost prediction itself is not new; a simplex optimization of ANN weights was used to create a model for estimating the unit cost of highway projects with a mean absolute percentage error (MAPE) of 1% [16]. Another study used a combination of ANN and fuzzy logic to create a high-precision cost prediction model for water infrastructure based on the sum of squared errors. During the validation phase, the researchers produced multiple prediction models with reported error levels ranging from 4.6 percent to 0.6 percent [19]. Furthermore, by varying the ANN structure, training function, and training algorithm until an optimum model was found, a researcher built a prediction model with a MAPE of 1.4 percent for the unit cost of highway projects in Libya [20]. In this study, the MAPE of MARS model M6 ranges from 5% to 6% across the training and testing phases, which is the lowest among the applied models.

6. Conclusion and Remarks

The prediction of costs related to civil engineering projects is a vital topic that deserves comprehensive study. In this study, three machine learning models, namely, extreme learning machine (ELM), multivariate adaptive regression spline (MARS), and partial least square regression (PLS), were developed to predict field canal improvement project (FCIP) cost. For the purpose of model development, datasets related to irrigation projects were collected from the open-source published literature. Input combinations were built from the total length of the PVC pipeline, served area, geographical zone, construction year, and cost and duration of FCIP construction. The prediction results showed that the MARS and ELM models performed favorably in comparison with the PLS model, with the MARS model giving the best results. The research findings also showed that all the predictors are substantial for the cost calculation, except the geographical zone of the pipeline network, which has almost no influence.

Nomenclature

ANFIS: Adaptive neuro-fuzzy inference system
MD: Agreement index
ANN: Artificial neural network
BF: Basis function
R2: Determination coefficient
ELM: Extreme learning machine
FCIP: Field canal improvement project
FL: Fuzzy logic
GCV: Generalized cross validation
IoT: Internet of things
NSE: Nash–Sutcliffe efficiency
ML: Machine learning
MAE: Mean absolute error
MAPE: Mean absolute percentage error
MARS: Multivariate adaptive regression spline
PLS: Partial least square regression
PVC: Polyvinyl chloride
PRESS: Predictive residual errors sum of square
RMSE: Root mean square error
SC: Soft computing
SVM: Support vector machine.

Data Availability

The data used in this study can be provided upon request from the authors.

Conflicts of Interest

The authors report no conflicts of interest.