Abstract

Construction cost estimation is one of the essential processes in construction management. Estimating project cost is a complex engineering problem due to the various factors affecting the construction industry. Accurate cost estimation is important in construction management and significantly impacts project performance. Artificial intelligence (AI) models have been effectively implemented in construction management studies in recent years owing to their capability to deal with complex problems. In this research, extreme gradient boosting (XGBoost) is developed as an advanced input selector algorithm and coupled with three AI models, namely, random forest (RF), artificial neural network (ANN), and support vector machine (SVM), for cost estimation. Datasets were gathered through a survey of 90 building projects in Iraq. Statistical indicators and graphical methods were used to evaluate the developed models. Among the input predictors examined, XGBoost highlighted inflation as the most influential parameter. The results indicated that the best prediction was attained by XGBoost-RF using six input parameters, with r-squared and mean absolute percentage error equal to 0.87 and 0.25, respectively. The comparison results revealed that all AI models showed good prediction performance when applied to datasets with more than two input parameters. The outcomes of this research offer a promising strategy that can help decision makers select the influencing parameters in the early phases of project management. Developing a high-precision prediction model can also assist a project's estimators in decreasing errors in the cost estimation process.

1. Introduction

The construction industry is complex and comprises parties such as owners, contractors, and consultants [1, 2]. The construction sector affects the global economy of countries, so many studies have explored methods to improve the performance of construction projects [3–5]. Due to this global impact, several scholars have measured project success to understand performance better. A construction project's success can be measured by completing the project within the estimated cost, duration, and specifications [6, 7]. Implementing a project with successful performance is challenging given growing environmental awareness and changing customer requirements [8]. Because the construction industry is dynamic and complex, it requires the implementation of successful project management strategies [9, 10]. Unsuccessful strategies cause cost and schedule overruns, leading to undesirable results and reduced customer satisfaction [11–13].

Consequently, there is a need to develop effective strategies and methods to mitigate risks and uncertainties in a construction project [14, 15]. The construction project management process mainly includes the initial design phase, detailed design phase, construction cost estimation, project bidding phase, construction phase, and final delivery after completion [16–18]. Cost is an essential criterion of project performance due to its impact on feasibility studies and the choice of design alternatives [19]. Cost studies describe and evaluate the costs of buildings and other construction projects [20]. These studies seek to maximize the project's revenue by using the available resources. Cost estimation is essential in construction and significantly impacts project management [21]. The accuracy of cost estimation is considered a necessary factor of project success during the various phases of project construction [22, 23]. Accurate cost estimation affects the project's profitability, the owner's satisfaction, and financial decisions [24]. Inaccurate cost estimation leads to problems such as cost overruns and disputes among project parties [25, 26]. Several studies on cost management indicated that accurate cost estimation affects the profitability of construction projects at the tender phase and is a vital part of project survival [27, 28]. They also showed that establishing accurate cost estimation ensures the contract's profit at the tendering phase.

Cost estimation is performed through the coordination of tender managers and technical experts called estimators. Existing cost estimation methods require complete information about construction projects and are costly and time consuming [29, 30]. During the tendering phase, estimators have little information; therefore, they depend on their knowledge and expertise to attain the estimated cost [31]. The level of knowledge differs among estimators, which can affect the accuracy of the cost estimation process [13]. Many studies have used statistical and artificial intelligence techniques to prevent these problems and accurately estimate construction costs [13, 32, 33]. Several studies applied regression as a traditional technique for cost estimation [34–37]. This method is simple to apply but produces only basic results; it cannot capture the nonlinear relationships among parameters in a complex system. Recently, computer-aided algorithms have been successfully applied in construction management studies. AI models can handle the complexity and nonlinearity of construction projects and help the project's parties deal with the uncertainties and incomplete information at the early stage of the construction process [38–41].

The capacity of artificial neural network (ANN) and support vector machine (SVM) models in estimating the cost and duration of road projects was analysed [42]. The comparison analysis showed that SVM has higher accuracy and fewer errors than the ANN model. Two researchers presented the ordinary least square regression (OLSR) for construction cost forecasting of the Pune region in India [43, 44]. The model was applied over a 12-year prediction period and attained 91%–97% accuracy. The SVM model's capability in estimating the residential building's conceptual cost was studied by the authors in [45]. The results showed that the model achieved low mean absolute percentage errors, with values ranging between 7% and 8.19%. Another study used SVM to estimate the cost of bridge construction [46]. The study showed that the SVM model could estimate the cost during the initial process of project construction. The model obtained the highest performance results, with a correlation coefficient equal to 0.974. An ANN model was trained with a backpropagation algorithm for early performance estimation of buildings in India [47]. The study revealed the ability of the ANN model in cost prediction and its importance to the financial investors in the construction project. The precision of construction cost prediction was improved by integrating a genetic algorithm (GA) with the ANN model [48]. The study reported that the developed model attained a high predictive performance equal to 0.9471 and assisted project managers at the beginning of the project. Principal component analysis (PCA) and particle swarm optimization (PSO) were combined with the SVM model to predict the cost of substation projects [49]. The developed model was compared with PCA-SVM and PSO-SVM, and the study demonstrated that the integration of the three algorithms achieved better prediction outputs than the other models, which can help decision makers in substation projects.
Twenty AI techniques for construction cost estimation of field canal improvement projects were compared by the author in [50]. The author concluded that the extreme gradient boosting (XGBoost) algorithm gained the best prediction results, with r-squared equal to 0.929. Three AI methods, namely, multilayer perceptron (MLP), radial basis function neural network (RBFNN), and general regression neural network (GRNN), were developed for cost prediction of road projects [51]. The study showed that GRNN gained better prediction results than the other models, with R2 equal to 0.9595. The study also indicated that ANN obtained fewer prediction errors and could handle limited information during the early phases of the construction process. The study by the authors in [52] used three machine learning algorithms, MLP, GRNN, and RBFNN, with a process-based method for cost prediction at the early stage of project management. The study confirmed that the GRNN algorithm provided better outcomes than the other models, which can help project managers predict construction costs in the contracting phase, a stage at which several input parameters are still unknown. The labor cost of BIM projects was explored by integrating simple linear regression (SLR) with the random forest (RF) model [53]. The study demonstrated the effectiveness of the hybrid model in the cost estimation process. An optimization algorithm called PSO was integrated with the ANN model to improve the performance of the cost prediction of high-rise residential buildings [54]. The study concluded that the PSO-ANN model has higher prediction precision and generalization than the single ANN algorithm.

The ability of RF, SVM, and multilinear regression (MLR) to predict the cost overrun of high-rise buildings' engineering services was examined [55]. The authors showed that the RF model achieved better prediction results than the other two AI models. Based on the reported studies, AI algorithms can be applied successfully for cost estimation. These studies indicated that the performance of these algorithms is affected by the algorithm's structure and the selection of input variables. Correlation statistics, factor analysis, and the relative importance index are the popular input selection methods used in past studies [56–59].

In these studies, data were collected based on personal opinions and expert surveys, introducing bias into the existing approaches. Moreover, these methods capture only the linear relationship between input and output predictors. Hence, exploring an advanced method that can investigate the complex system of cost estimation parameters is very important for achieving accurate results [38]. Furthermore, integrating a new feature selection method with AI algorithms is significant for construction management engineering to obtain accurate prediction performance [60]. Recently, the XGBoost algorithm has been explored as an advanced feature selection approach in engineering problems. XGBoost is a recent version of gradient boosting and has been applied effectively as an input selection method by civil engineering scholars [61, 62]. The research scope of applying AI algorithms in construction management is still limited, so exploring a new cost estimation method is the motivation of this research.

Developing effective predictive models that can achieve accurate performance is a vital issue in the early phases of engineering management. The construction industry in Iraq faces particular challenges due to the risky conditions and exceptional political circumstances in the region [2]. The instability of political and economic conditions has a massive impact on the performance of construction projects. Due to these circumstances, most constructed projects have failed to be completed within the specified budget [63]. Thus, developing an integrative model using AI algorithms in cost management can improve cost performance by evaluating, controlling, and monitoring the project cost under uncertain conditions.

The current study integrates the XGBoost algorithm with three AI algorithms, namely, RF, SVM, and ANN. XGBoost was used to select the influencing parameters of the prediction process, and the AI models were then trained on these parameters. The attained results are analysed and discussed using statistical and visualization methods. The output of this study can assist a project manager in selecting the influencing parameters at the early stage of a construction project. Also, introducing an integrated model with high-precision results helps project estimators reduce errors in the cost estimation phase.

2. Construction Cost Data Explanation

This study used public building projects in Iraq as a case study for the modelling process and used their dataset. These projects are managed by the Iraqi government, so developing an accurate prediction model can help decision makers by producing precise results. The information in the dataset was collected from a survey of 90 construction projects built between 2016 and 2021. The information was gathered from historical records of construction projects, including project drawings, bills of quantities, and project schedules. The dataset includes information on ground floor area (GFA), total floor area (TFA), floor number (FN), elevator number (EN), footing type (FT), inflation (F), duration (D), and construction cost (C). The inflation data were collected from the Iraqi central bank (https://cbiraq.org/). Tables 1 and 2 show the descriptive measures of the cost data for the training and testing phases. The statistical descriptions include the minimum, maximum, mean, median, standard deviation, skewness, and kurtosis. For the training phase, the mean construction cost is $1,623,076, while for the testing phase, it is $2,571,431. The maximum and minimum values of project duration are 731 days and 122 days for the training phase, and 787 days and 150 days for the testing phase. The datasets for the training and testing phases are well distributed and close to a normal distribution, since the mean and median values of the datasets are close to each other.
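The descriptive measures reported in Tables 1 and 2 follow the standard moment formulas. The sketch below computes them in Python (the study itself was implemented in R); the duration values are hypothetical placeholders, not the paper's actual records:

```python
import statistics

def describe(values):
    """Descriptive measures of the kind reported in Tables 1 and 2."""
    n = len(values)
    mean = statistics.mean(values)
    # Population standard deviation, used for the standardized moments below
    psd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n   # fourth central moment
    return {
        "min": min(values), "max": max(values),
        "mean": mean, "median": statistics.median(values),
        "std": statistics.stdev(values),            # sample standard deviation
        "skewness": m3 / psd ** 3,                  # moment-based skewness
        "kurtosis": m4 / psd ** 4,                  # moment-based kurtosis
    }

# Hypothetical project durations in days (illustration only)
durations = [122, 180, 240, 300, 365, 420, 510, 600, 731]
stats = describe(durations)
```

A distribution is roughly symmetric when the mean and median nearly coincide, which is the check applied to the training and testing datasets above.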

3. Method Overview

3.1. Extreme Gradient Boosting (XGBoost)

XGBoost is an advanced version of tree-based boosting modelling introduced by Chen and Guestrin [62], which is applied effectively in input selection problems [64, 65]. The boosting algorithm's concept uses an iteration process for learning the functional relationship between the target and predictor values [66]. Through this iterative process, the individual trees are trained sequentially on the residual output of the previous trees to reduce the training errors [67]. The algorithm uses a cache-aware structure and a regularized method for boosting learning. The prediction can be expressed mathematically as

\hat{y}_i = \sum_{k=1}^{n} f_k(X_i),

where \hat{y}_i is the predicted value of the target, X_i represents the input variables, k ranges between 1 and n, f_k is the function between the input and output variables, and n is the number of functions trained by the boosted trees. The loss function in XGBoost must be minimized to train the functions f_k, as shown in the following expression:

\mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2,

where l measures the loss between \hat{y}_i (prediction value) and y_i (actual value), \Omega is a regularization term that penalizes the building of additional trees in the model to decrease overfitting and error, \Omega(f) is the leaf complexity, T is the number of leaves in the tree model, \gamma is the penalty parameter, \lambda is the regularization weight, and w is the score vector on the leaves.

In the input selection process, the main aim of XGBoost is to produce the feature importance of input variables [68]. According to Hastie et al. [69], the algorithm uses gain, frequency, and cover to calculate feature importance. The gain method calculates the role of each feature in the model's development. Frequency is the weight representing the number of occurrences of each feature in the boosted trees. The cover method shows the number of samples related to each feature. XGBoost calculates the importance of feature j as

F_j = \frac{1}{M} \sum_{m=1}^{M} \sum_{t=1}^{N} g_t \, \mathbb{1}(v_t = j),

where m indexes the trees, M is the number of trees, N is the number of internal nodes in each tree, g_t is the gain at node t, v_t represents the feature used to split node t, and \mathbb{1}(\cdot) is the indicator function.
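As a rough illustration of gain-based importance, the Python sketch below aggregates the split gains recorded across a set of boosted trees and normalises them; the trees and gain values are hypothetical, and an actual run would read them from the fitted booster (in R, via xgb.importance):

```python
from collections import defaultdict

def gain_importance(trees):
    """Total split gain contributed by each feature across the boosted
    trees, normalised to sum to 1 (the 'gain' importance measure)."""
    totals = defaultdict(float)
    for tree in trees:                 # each tree: list of (feature, gain) splits
        for feature, gain in tree:
            totals[feature] += gain
    total = sum(totals.values())
    return {f: g / total for f, g in totals.items()}

# Hypothetical split records for three small trees; feature codes match
# the paper's predictors, but the gains are made-up numbers
trees = [
    [("F", 0.9), ("TFA", 0.4)],
    [("F", 0.7), ("GFA", 0.3), ("TFA", 0.2)],
    [("F", 0.5), ("D", 0.1)],
]
importance = gain_importance(trees)
```

With these toy numbers, inflation (F) accumulates the largest share of gain, mirroring the kind of ranking reported in Figure 5.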

3.2. Random Forest (RF)

Random forest (RF) is an ensemble algorithm introduced by Breiman [70] and based on combining multiple decision trees to produce a robust prediction model. It has been used effectively for classification and regression problems in several areas of construction management [71–73]. The RF model can deal with many input variables and works efficiently with outliers and unbalanced datasets. The algorithm reduces overfitting and performs accurately with a simple computation process [74]. The RF model uses bootstrap and random subspace techniques to improve the predictive model's performance [75, 76]. In the RF model, the algorithm uses a bootstrap method to randomly choose new training sets from the original data, and these new data are utilized to develop a regression tree. The number of splits for each node in the regression tree is computed using a stochastic random subspace technique.

The modelling steps to develop the RF model are as follows: First, a new training dataset is generated using a bootstrap algorithm, where two-thirds of the original data (in-bag data) are used to train the developed model. Then, several regression trees are built based on the bootstrap samples, and these regression trees are used to develop the RF model. The variance between the trees can be controlled by randomly choosing the optimal number of attributes based on maximum depth values. These computations increase the ability of the RF model to reduce the errors in prediction results. Finally, the algorithm collects the output value of each tree and calculates the final prediction using the average method [77]:

\hat{y}_{RF}(x) = \frac{1}{K} \sum_{k=1}^{K} h_k(x),

where \hat{y}_{RF}(x) represents the prediction value of the RF model, K is the number of regression trees, and h_k(x) is the k-th regression tree based on the input value x. The schematic diagram of the RF algorithm is described in Figure 1.
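The bootstrap-then-average steps above can be sketched in a few lines of Python (the study's RF was built with the R ranger package); the "trees" here are constant-value stubs fitted to bootstrap samples, standing in for real regression trees:

```python
import random

def bootstrap_sample(data, rng):
    """Draw a bootstrap sample (with replacement) of the same size."""
    return [rng.choice(data) for _ in data]

def rf_predict(trees, x):
    """Average the predictions of the individual regression trees."""
    return sum(tree(x) for tree in trees) / len(trees)

rng = random.Random(42)
data = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy targets, not the paper's data

# Each 'tree' is a stub that predicts the mean of its bootstrap sample;
# a real forest would fit a depth-limited regression tree instead
trees = []
for _ in range(10):
    sample = bootstrap_sample(data, rng)
    mean = sum(sample) / len(sample)
    trees.append(lambda x, m=mean: m)

pred = rf_predict(trees, x=None)   # stubs ignore the input here
```

Because every tree sees a different resample, their individual errors partly cancel in the average, which is the variance-reduction effect described above.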

3.3. Artificial Neural Network (ANN)

ANN and other machine learning algorithms have been widely adopted in recent years. These models are characterized by their capability to handle complex datasets and produce accurate results [78]. An artificial neural network is a mathematical model whose components emulate the biological structure of the human brain [79]. The main elements of the ANN model are the series of connected units called neurons, arranged in layers. Several types of ANN exist; scholars commonly apply a feedforward ANN with a backpropagation algorithm [80]. The popularity of the backpropagation algorithm in ANN applications stems from its capacity to train ANN networks through supervised learning [81]. In this method, the error in the prediction results is computed by comparing the predicted values with the actual values. The weights in the ANN model are then updated by backpropagation to reduce the error to an allowable value. A feedforward ANN consists of input, hidden, and output layers. In this type of ANN, information moves directly from the input to the output neurons without returning in the reverse direction. The number of neurons in the input and output layers matches the number of input and output variables in the ANN model. In the hidden layer, neurons nonlinearly transform the input variables before passing them to the output layer [82]. The hidden layer of the ANN algorithm can be expressed mathematically as

H_j = f\left( \sum_{i} w_{ij} x_i + b_j \right),

where H_j represents the j-th hidden neuron, x_i are the input parameters, w_{ij} refers to the weight between the input and hidden layers, b_j is the bias, and f is the activation function. The value of the output layer can be computed as

y = f\left( \sum_{j} w_{jk} H_j + b_k \right).

The design of an ANN requires identifying the number of neurons and hidden layers. According to previous studies, the best prediction results can be achieved using one or two hidden layers [83, 84]. Selecting the optimum input variables helps attain the best performance during the training process. The relationship between input and output variables is learned by training the ANN to improve prediction performance. In each iteration, the biases and weights are modified by the algorithm to reduce the error between the original and predicted values. The error of the predicted values can be presented as

E = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2,

where y_i represents the actual value and \hat{y}_i is the predicted value achieved by the ANN model. The presentation of the ANN model is shown in Figure 2.
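A single forward pass through such a network, together with the squared-error measure that backpropagation minimizes, can be sketched as follows (a Python illustration with made-up weights for a 2-input, 2-hidden-neuron network; the study's model was built with the R neuralnet library):

```python
import math

def sigmoid(z):
    """Common nonlinear activation for the hidden layer."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_ih, b_h, w_ho, b_o):
    """One forward pass: input -> single hidden layer -> scalar output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_ih, b_h)]
    return sum(w * h for w, h in zip(w_ho, hidden)) + b_o

def sse(pairs, predict):
    """Sum of squared errors, the quantity training seeks to reduce."""
    return sum((y - predict(x)) ** 2 for x, y in pairs)

# Hypothetical weights and biases (illustration only)
w_ih = [[0.5, -0.3], [0.8, 0.2]]   # input -> hidden weights
b_h = [0.1, -0.1]                  # hidden biases
w_ho = [1.2, -0.7]                 # hidden -> output weights
b_o = 0.05                         # output bias
y_hat = forward([1.0, 2.0], w_ih, b_h, w_ho, b_o)
```

Backpropagation would repeat this pass, compute E, and nudge each weight against the error gradient until E falls below the allowable value.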

3.4. Support Vector Machine (SVM)

Support vector machine is a supervised machine learning algorithm that uses a hyperplane to divide the data and measures the nearest position between an external point and the hyperplane [85]. SVM is a popular algorithm commonly used by scholars to improve the estimation of engineering problems [86, 87]. The algorithm simulates the errors between actual and predicted parameters by measuring the distance from the SVM margin. The training dataset can be expressed mathematically as

D = \{(x_i, y_i)\}_{i=1}^{n},

where x_i and y_i are the input and output parameters. SVM uses the following function to learn the dataset during the training phase:

f(x) = w^{T} \varphi(x) + b,

where w represents the weight vector, \varphi is the nonlinear mapping of the input parameter x, and b is a scalar bias term. The standard error of the prediction process is minimized using the following formulation:

\min \; \tfrac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^{*}) \quad \text{subject to} \quad y_i - f(x_i) \le \varepsilon + \xi_i, \; f(x_i) - y_i \le \varepsilon + \xi_i^{*}, \; \xi_i, \xi_i^{*} \ge 0,

where \xi_i and \xi_i^{*} refer to the slack variables, C is a penalty variable controlling the trade-off between regularization and empirical prediction error, and \varepsilon defines the accuracy tolerance of the training process. The SVM model can be optimized by using Lagrange multipliers, and the resulting dual formulation gives

f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^{*}) K(x_i, x) + b,

where K(x_i, x) represents the kernel function. The main feature of the SVM algorithm in regression problems is that it correlates input and output parameters using a nonlinear relationship. The SVM model's kernel function helps the algorithm generate nonlinear mappings in high-dimensional space. The SVM model has four common kernel functions: linear, sigmoid, polynomial, and radial basis function (RBF) [87]. The RBF kernel function is simple, effective, and reliable and has been used in several complex studies [88]. The RBF kernel was used in this study; it depends on three parameters, C, \gamma, and \varepsilon, whose optimal values can be reached using the trial-and-error method. The illustrative diagram of the SVM algorithm is depicted in Figure 3.
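The two ingredients specific to SVM regression, the RBF kernel and the ε-insensitive error, can be illustrated with a short Python sketch (the study used the R kernlab implementation; the inputs below are arbitrary examples):

```python
import math

def rbf_kernel(xi, xj, gamma):
    """K(xi, xj) = exp(-gamma * ||xi - xj||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * sq_dist)

def eps_insensitive_loss(y_true, y_pred, eps):
    """Errors inside the epsilon tube cost nothing; outside it,
    the cost grows linearly with the excess deviation."""
    return max(0.0, abs(y_true - y_pred) - eps)

k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)   # identical points
k_far = rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.5)    # distant points
inside = eps_insensitive_loss(10.0, 10.3, eps=0.5)       # within tube
outside = eps_insensitive_loss(10.0, 11.0, eps=0.5)      # outside tube
```

The kernel equals 1 for identical points and decays toward 0 with distance, which is how the RBF mapping expresses similarity in the high-dimensional feature space; γ controls how fast that decay happens.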

4. Model Development and Performance Assessment

In the present study, the construction cost estimation of building projects in Iraq was explored. First, input selection was performed to abstract the appropriate features for the prediction process. Due to the complex nature of cost estimation, the XGBoost algorithm was developed to choose the most important parameters. XGBoost was integrated with popular AI algorithms, namely, RF, ANN, and SVM. The hybrid models were developed using the R programming language (version 4.1.1). Three libraries, XGBoost, Matrix, and ggplot2, were applied to construct the XGBoost algorithm. The feature selection results were obtained using the xgb.importance function. For the SVM model, the dplyr, caret, ggplot2, and kernlab libraries were used. The trainControl function was applied with method = "cv" and number = 5, i.e., 5-fold cross-validation. The radial basis function kernel was applied using the svmRadial method. The RF algorithm was designed using the ranger library and the ranger function with the following parameters: num.trees = 200, mtry = 3, and min.node.size = 3. For the ANN model, the neuralnet library with one hidden layer and resilient backpropagation was applied to enhance model prediction. The performance of the developed models was assessed using several statistical evaluators, including the coefficient of determination (R2), error measures (i.e., mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE)), Nash–Sutcliffe efficiency (Nash), and Willmott's index (WI) [89, 90]. The process of the presented AI models is shown in Figure 4.
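The statistical evaluators listed above can be written out explicitly. The following Python sketch implements the textbook definitions of R2 (squared Pearson correlation), RMSE, MAE, MAPE, Nash, and WI on hypothetical observed/predicted pairs; it reproduces the standard formulas, not the authors' R evaluation code:

```python
import math

def metrics(obs, pred):
    """Evaluators of the kind used in this study, textbook definitions."""
    n = len(obs)
    mo = sum(obs) / n
    mp = sum(pred) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mo) ** 2 for o in obs)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    return {
        # Squared Pearson correlation between observed and predicted
        "R2": cov ** 2 / (ss_tot * sum((p - mp) ** 2 for p in pred)),
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(o - p) for o, p in zip(obs, pred)) / n,
        "MAPE": sum(abs((o - p) / o) for o, p in zip(obs, pred)) / n,
        # Nash-Sutcliffe efficiency: 1 is perfect, 0 matches the mean model
        "Nash": 1.0 - ss_res / ss_tot,
        # Willmott's index of agreement, bounded in (0, 1]
        "WI": 1.0 - ss_res / sum((abs(p - mo) + abs(o - mo)) ** 2
                                 for o, p in zip(obs, pred)),
    }

# Hypothetical costs in arbitrary units (not the paper's data)
obs = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.1, 1.9, 3.2, 3.8, 5.1]
m = metrics(obs, pred)
```

A perfect model gives R2 = Nash = WI = 1 and zero error measures, which is a quick sanity check when wiring up an evaluation pipeline.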

5. Result and Discussion

5.1. Results Analysis

In this research, the ability of three AI algorithms, namely, RF, ANN, and SVM, was examined to estimate the cost of construction projects. The authors introduced the XGBoost model to determine the best combination of input parameters. The developed models were built based on different combinations extracted from the XGBoost model. Input combinations were constructed using the important parameters selected by the XGBoost algorithm, as shown in Figure 5, which presents the relative importance values of the input features. The XGBoost results show that the inflation parameter (F) is the most important for cost estimation, followed by the total floor area and ground floor area. The results also indicated that the elevator number and footing type gained the least significant scores from XGBoost. To determine the impact of the input variables on the performance of the predictive models, several combinations were constructed and tested by each model. Seven models were developed for each algorithm (Model I, Model II, Model III, …, Model VII), containing one to seven input parameters.

Tables 3 and 4 show the statistical measurements of the presented computer-aided algorithms for the training and testing phases. The tabulated measurements revealed that both the XGBoost-RF and XGBoost-ANN models have excellent performance in the training phase when using more than two input variables. The XGBoost-SVM model achieved lower predictive performance than the other AI models for all input combinations, and its best accuracy was attained using three input parameters. The best results for the training phase were demonstrated by XGBoost-ANN-Model V, with R2 = 0.97551, RMSE = 253464.6776, MAE = 151999.7328, MAPE = 0.40876, Nash = 0.97522, and WI = 0.99376. For the testing phase, the results revealed that increasing the number of input variables increases the predictive accuracy of all AI models. RF outperformed the SVM and ANN models, and the best performance was achieved by XGBoost-RF-Model VI, with R2 = 0.87211, RMSE = 693311.4488, MAE = 424619.6505, MAPE = 0.25539, Nash = 0.86739, and WI = 0.962557.

The developed AI models were also evaluated using graphical presentations such as scatter plots, box plots, and Taylor diagrams. Figures 6–8 illustrate scatter plots of the testing phase for the three hybrid models (XGBoost-RF, XGBoost-ANN, and XGBoost-SVM). The XGBoost-RF model exhibits good prediction with R2 greater than 0.81 for all combinations except Model I, where R2 decreases to 0.6215. For the ANN algorithm, the developed model performs well, with R2 reaching 0.83 as the number of input parameters increases for Models V, VI, and VII, as shown in Figure 6. The SVM model shows an enhancement in prediction accuracy as the number of inputs increases, except for Models VI and VII, where R2 reduces to 0.7579.

Figure 9 shows box plots of the residual error between the observed and estimated values of cost estimation. The results showed that XGBoost-RF-Model III and XGBoost-RF-Model V gained the minimum residuals, with an error value of less than 50%. For the ANN model, the minimum positive error was attained by Models V and VII, while Models V and VI gained the minimum negative error. In the XGBoost-SVM model, the combinations with two to seven input parameters show a reduction in the error value, and the minimum error was achieved by Models III, IV, and V, with a residual error of less than 50%. The maximum residual error was demonstrated by Model I for all the developed models, with a negative error exceeding 85% in magnitude.

Another graphical method (i.e., the Taylor diagram) was constructed to evaluate the developed models based on correlation and standard deviation [91]. Figure 10 illustrates the Taylor diagram for the three AI algorithms with different input combinations on the tested data. Based on the constructed Taylor diagram, the XGBoost-RF model attained the nearest position to the actual cost using three and six input parameters (i.e., Model III and Model VI). The remaining combinations of the RF model demonstrate good prediction performance, except for Model I, which attained a lower correlation value than the other combinations.

XGBoost-ANN and XGBoost-SVM models achieved the best performance using five input variables (i.e., Model V). XGBoost-ANN attained the nearest distance to the actual cost using three models (i.e., Model V, Model VI, and Model VII), whereas the remaining models achieved the farthest distance with a correlation value of less than 0.9. For XGBoost-SVM, only three models gained the best performance using three, four, and five input parameters, whereas the poorest performance was achieved using one input variable.

5.2. Validation against Previous Studies

To confirm the ability of the introduced AI models in cost estimation, it is necessary to validate them against AI models developed in past studies. An ANN algorithm was developed to estimate the construction cost of highway projects in India [92]. The study showed that the ANN model could estimate construction cost with an R of 0.94. In another study, the developed approach gained a correlation coefficient of 0.97 for the cost prediction of bridge construction by using an SVM model with 27 input variables [46]. Three AI algorithms, namely, multivariate adaptive regression spline (MARS), extreme learning machine (ELM), and partial least square regression (PLS), were investigated to estimate the construction cost of field canal projects [93]. According to the reported results, the MARS model attained the best results, with R2 = 0.94 using five input parameters. Previous studies also reported hybrid models, such as ANN-GA and RF-SLR, as effective models for cost estimation [48, 53]. These previous studies attained acceptable performance in cost estimation. However, they mostly focused on single models, gave little attention to hybrid models, and developed AI models based on all input parameters. This study combined the XGBoost algorithm with AI models to enhance the cost estimation accuracy. XGBoost-RF achieved good estimation performance, with r-squared ranging from 0.87 to 0.91 for the testing and training phases using only three input variables.

5.3. Discussion

Using an AI approach in complex construction projects is highly recommended to obtain accurate estimation results and to simulate the nonlinear relationships between input and output parameters. Using XGBoost as an advanced input selector revealed that inflation, total floor area, and ground floor area are the most important variables in cost estimation. The comparison between the AI models showed the ability of the developed algorithms to predict construction costs, because all models attained good predictive performance except those that used a single input variable. The XGBoost-RF model showed a significant enhancement in the prediction process using only three input parameters (R2 = 0.87 and MAPE = 0.308), together with the minimum negative error, as shown in Table 4 and Figure 9. Applying the RF model with six input parameters (i.e., XGBoost-RF-Model VI) reduced MAPE to 0.25 and produced a residual plot with no outlier points.

The poor performance was achieved by the RF algorithm in Model I and Model II, where the gained R2 was less than 0.7 and the residual error was high, as illustrated in Figures 8 and 9. For the ANN algorithm, the model increased its prediction performance by increasing the number of input parameters, and the best R2 was achieved by XGBoost-ANN-Model V with RMSE equal to 750698.034, as reported in Table 4. The XGBoost-ANN-Model revealed the poorest results of Model I with high residual errors and the farthest distance to actual cost (see Figures 9 and 10). In the case of the SVM model, three models (i.e., Model III, Model IV, and Model V) illustrated good performance with r-squared maxed out at 0.8, as depicted in Figure 7. The other combinations of the XGBoost-SVM model achieved good predictive accuracy with R2 greater than 0.7, except for Model I that showed the lowest correlation coefficient and farthest position to the observed value, as shown in Figures 8 and 10. Based on the evaluation results, all AI models exhibited good performance when the number of input variables was increased in the estimation process. RF and SVM models performed better than ANN when using a few input variables, especially when applied to one input parameter. The comparison results revealed that integrating XGBoost with AI models enhanced the prediction accuracy by selecting the appropriate parameters for the modelling process. The integrated XGBoost algorithm with AI models revealed that using three input parameters, i.e. inflation, the total floor area, and the ground floor area, is necessary to get the accurate performance of cost estimation. The results revealed that increasing input variables from three to six reduced the error percentage and increased modelling efficiency. The results also showed that tree-based models outperformed classical models in their ability to handle complex models based on a few input variables. 
The reported results indicated that the RF model was able to capture the complex nature of construction cost estimation. Integrating the XGBoost algorithm with the AI models demonstrated the robustness of the predictive models when only limited input variables are known to the project's stakeholders, indicating that the developed model can be used under uncertain circumstances. For future studies, other advanced selection algorithms, such as the genetic algorithm (GA), can be tested to simulate the complex behaviour among the modelling parameters. Recent algorithms such as deep neural networks can also be integrated with input selector algorithms to obtain lower errors and more accurate results [94, 95].

6. Conclusions

Developing a reliable predictive model is an essential issue in construction cost estimation. In this research, the XGBoost algorithm was used to select the correlated parameters of the modelling process and was hybridized with three AI models, namely, RF, ANN, and SVM, to estimate construction cost. Datasets were collected through a survey of 90 building projects constructed between 2016 and 2021. The results showed that the most correlated variables selected by XGBoost were inflation, total floor area, and ground floor area. Regarding prediction performance, all AI models showed good reliability when applied to more than two input variables. The XGBoost-RF model revealed a high correlation coefficient in all combinations except Model I, where R2 fell below the acceptable performance level. The graphical evaluation showed that XGBoost-RF-Model VI achieved the best performance, with an R2 above 0.8 and low residual error. The results also indicated that tree-based models can deal with complex systems and obtain accurate results from a limited number of input variables. For future work, more input parameters should be investigated, and the GA algorithm can be explored to generate significant feature selection. New deep learning algorithms can also be introduced to enhance the capability of predictive models.

Data Availability

The data used in this research can be obtained from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

The authors would like to thank the Institute for Big Data Analytics and Artificial Intelligence (IBDAAI), Kompleks Al-Khawarizmi, Universiti Teknologi MARA (UiTM), 40450 Shah Alam, Selangor, Malaysia, for the support in publishing this paper. In addition, the authors would like to thank Al-Mustaqbal University College (MUC-E-0122) for providing technical support for this research. This research was funded by Universiti Teknologi MARA. The authors would also like to thank the support received by the University of Baghdad.