Abstract

The main objective of this study is to evaluate and compare the performance of different machine learning (ML) algorithms, namely, Artificial Neural Network (ANN), Extreme Learning Machine (ELM), and Boosting Trees (Boosted) algorithms, considering the influence of various training to testing ratios in predicting the soil shear strength, one of the most critical geotechnical engineering properties in civil engineering design and construction. For this aim, a database of 538 soil samples collected from the Long Phu 1 power plant project, Vietnam, was utilized to generate the datasets for the modeling process. Different ratios (i.e., 10/90, 20/80, 30/70, 40/60, 50/50, 60/40, 70/30, 80/20, and 90/10) were used to divide the datasets into the training and testing datasets for the performance assessment of models. Popular statistical indicators, such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Correlation Coefficient (R), were employed to evaluate the predictive capability of the models under different training and testing ratios. Besides, Monte Carlo simulation was simultaneously carried out to evaluate the performance of the proposed models, taking into account the random sampling effect. The results showed that although all three ML models performed well, the ANN was the most accurate and statistically stable model after 1000 Monte Carlo simulations (Mean R = 0.9348) compared with other models such as Boosted (Mean R = 0.9192) and ELM (Mean R = 0.8703). Investigation on the performance of the models showed that the predictive capability of the ML models was greatly affected by the training/testing ratios, where the 70/30 one presented the best performance of the models. Concisely, the results presented herein showed an effective manner in selecting the appropriate ratios of datasets and the best ML model to predict the soil shear strength accurately, which would be helpful in the design and engineering phases of construction projects.

1. Introduction

Soil is a crucial material in civil engineering, as most structures are built on soil ground [1]. Ground failure and the collapse of buildings are often associated with soil shear strength. Under different loading conditions, the soil shear strength, or shear resistance, depends on the cohesion, friction, and interlocking between particles [1]. The mechanical behavior of soil is complex because soil often contains different particle sizes, high water content, and large voids [1]. Soil shear strength is governed by basic parameters such as soil mineralogy, overburden pressure, water content, density, and voids. Commonly, the soil shear strength is calculated by determining the effective stress and soil parameters, such as internal friction angle and cohesion [1, 2]. These soil parameters can be determined in the field by the Standard Penetration Test (SPT) or shear vane test and in the laboratory by the direct shear test, ring shear test, triaxial test, and unconfined compression test [3, 4]. These tests are time-consuming and costly, particularly when a large number of samples must be tested.

Over the last decades, many researchers have tried to improve existing methods and find alternatives for determining the shear strength of soil [3, 5–10]. Nam et al. [11] used a multistage direct shear test for determining the shear strength of unsaturated and saturated soils. Such a method could reduce some disadvantages of conventional direct shear tests and produced highly accurate results. Besides, many researchers have attempted to relate shear strength to soil indexes, such as clay fraction, liquid limit, plastic limit, and clay mineralogy [9, 12]. Many efforts have also been made to evaluate the shear strength of soil through other soil parameters, for example by establishing a correlation between suction and shear strength [10, 13]. In addition, several conventional procedures were introduced to estimate the shear strength of unsaturated soil, in which the relationship between the water content and suction is used as a prediction tool [6, 14–16]. Other work has estimated the soil shear strength in situ through shear wave velocity [16–18]. Overall, the conventional and traditional techniques have disadvantages and limitations, such as relying on a restricted set of basic soil parameters or covering only a small range of soils. As an example, Kaya [2] indicated that the empirical formula suggested by Wright [19] is limited to soils with a clay fraction greater than 50%.

In recent years, Machine Learning (ML) techniques have developed rapidly and have been successfully applied in many fields of civil engineering [20–27] and Earth sciences [28–31], including geotechnical engineering problems such as landslide susceptibility [32–41] and estimation of soil parameters [42–47], including the shear strength of soil [47–52]. In the work of Das et al. [53], the authors successfully applied an Artificial Neural Network (ANN) for estimating the residual friction angle of tropical soil in a specified area. Besides, it was found that the Support Vector Machine (SVM) performed better than ANN for estimating the shear strength of soil using basic soil parameters, such as liquid limit, plastic limit, and clay fraction. In another work, Besalatpour et al. [54] showed that Adaptive-Network-based Fuzzy Inference System (ANFIS) and ANN models had higher predictive ability than conventional regression methods. In another study, three optimization techniques, namely, the Dragonfly Algorithm (DA), Invasive Weed Optimization (IWO), and Whale Optimization Algorithm (WOA), were employed to optimize the weights and biases of an ANN structure in estimating the shear strength of soil [50], where it was noticed that the learning error decreased significantly. Thus, the IWO-ANN hybrid algorithm was found to be a promising alternative to conventional methods for solving soil shear strength problems. Further, Moayedi et al. [49] used four neural-metaheuristic models for estimating the shear strength of soil and stated that the Salp Swarm Algorithm-Multilayer Perceptron (SSA-MLP) model is a potential alternative method for estimating the soil shear strength. In general, ML techniques have significantly improved the prediction ability compared to conventional methods.

Despite the significant growth of research applying ML algorithms in soil science, surprisingly few studies are dedicated to assessing model performance under a combination of factors during the model development phase. These factors could be the choice of data splitting, the selection of sampling technique, or the ML algorithm. For instance, a study comparing ML techniques in digital soil mapping found that sample design and model choice significantly affected the outputs [55]. With regard to data splitting, the data sample is often divided into two datasets: a training set for model training and a testing set for model validation. Many researchers proposed a ratio of 70/30 or 80/20 (training/testing set) for producing datasets in landslide susceptibility problems [56–61]. Regarding studies on estimating the residual strength of soil using ML algorithms, previous works mainly used ratios of 70/30, 80/20, and 90/10 (training/testing) for generating datasets [22, 43, 47–49, 51–53]. Recently, Pham et al. [47] conducted a study on estimating the shear strength of soil while varying the training dataset size from 30% to 90% using the Random Forest (RF) algorithm. The study revealed that increasing the size of the training dataset improved the training performance and made the model more stable. Increasing the training set's size from 30% to 80% also enhanced the testing performance. However, when the training size increased from 80% to 90%, the opposite trend was found in testing performance. In general, the training set size had an important effect on the prediction ability of the ML models [62].

The main objective of the present study is to evaluate the performance of ML models considering different ratios of soil data splitting for the prediction of soil shear strength. In this research, three ML techniques, namely, ANN, Extreme Learning Machine (ELM), and the Boosting algorithm, were adopted to estimate the soil shear strength based on different splitting ratios of input data for the training and testing phases. The main difference between this study and previously published works is that, for the first time, the influence of the splitting strategy of training and testing datasets used in ML models was investigated for predicting the soil shear strength. Results were evaluated using standard statistical measures, namely Mean Absolute Error (MAE), Correlation Coefficient (R), and Root Mean Squared Error (RMSE), in order to select the best model for predicting the soil shear strength and to study the influence of different ratios of training and testing data on the performance of the models.

2. Research Significance

ML, which includes advanced soft computing based techniques, has been developed and applied successfully and efficiently to solve many real-world problems [63–68]. The main advantage of ML is that it can objectively analyze very large amounts of data and give reliable outcomes and assessments [69]. However, its performance depends significantly on the quality of the data and the strategy for using the data [70–72]. Therefore, assessing the influence of data splitting on the performance of ML models is highly significant and will help in selecting a suitable data split for better ML-based modeling. In this study, we have selected three popular ML models, namely, ANN, ELM, and Boosted, for modeling. In addition, we have selected a research problem, “the prediction of soil shear strength,” which is an important geotechnical engineering task [43, 46, 47, 73]. This will help construction engineers and managers to quickly and accurately predict the soil shear strength, which can be used for the design and verification of construction projects.

3. Data Used

Soil investigation data of the Long Phu 1 power plant project, located in Soc Trang province, Vietnam (latitude of 9°59′07.3″N and longitude of 106°04′48.0″E), was used in this study for the development of the ML models. The construction of this power plant started in June 2015 as a key project under the Vietnamese Government’s 2011–2020 National Power Development Plan [73]. A database of 538 soil samples was used to build the training and testing datasets. Soil parameters such as clay content (%), void ratio, moisture content (%), liquid limit (%), plastic limit (%), and specific gravity were used as input variables, whereas the soil shear strength (kg/cm²) determined by the direct shear test under the Unconsolidated Undrained (UU) scheme was used as the output variable.

Statistical analysis of the input variables suggests that, in the samples, the clay content varied from 0 to 65 (%), plastic limit from 15 to 35 (%), liquid limit from 20 to 65 (%), specific gravity from 2.6 to 2.7, and void ratio from 0.5 to 1.0 (Figures 1(a)–1(g)), whereas the output variable varied from 0.45 to 0.7 (kg/cm²) (Figure 1(g)).

Considering different ranges of variables (Figure 1), these values were scaled in the range of [0, 1] to avoid unexpected jumps and reduce fluctuations within the datasets used for modeling.
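As a minimal illustration of the [0, 1] scaling described above, the following Python sketch applies min-max normalization to a single variable. The source does not state the exact scaling formula, so the common form (x − min)/(max − min) is assumed here; the function name `min_max_scale` and the sample values are hypothetical.

```python
def min_max_scale(values):
    """Scale a list of numbers linearly into [0, 1] (assumed min-max form)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every entry to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical clay-content values (%), chosen inside the reported 0-65 range.
clay_content = [0.0, 13.0, 32.5, 65.0]
scaled = min_max_scale(clay_content)
print(scaled)
```

In practice the minimum and maximum would normally be taken from the training set only and then reused for the testing set, so that no information from the testing data leaks into the model.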

4. Methods Used

4.1. Artificial Neural Network (ANN)

ANN has been known as a popular and powerful machine learning technique (computational model) [74, 75], based on structures and functions of biological neural networks: the nervous system of the human brain [20, 76–78]. This method has been used successfully in solving a wide range of civil engineering problems, including geotechnical engineering problems. ANN method is used to identify the relationship between input and output neurons in both linear and nonlinear patterns [21, 22, 79]. Thus, ANN could make a decision by analyzing patterns and relationships in data by itself [2, 43, 80]. In this study, a multilayered perceptron neural network, a popular ANN [81], was employed as a regression technique to estimate the soil shear strength.

4.2. Boosting Trees (Boosted)

Boosted (Trees) is a hybrid method that combines decision trees with the boosting method. In this ensemble-type method, decision trees are employed to link input and output variables through recursive binary splits, while the boosting method is adopted to combine many individual models in order to improve the performance of the hybrid model [82]. The Boosted method, having the merits of tree-based techniques, can overcome the disadvantages of a single tree model for the following reasons. Firstly, this ensemble can choose a proper variable to match the appropriate functions. Secondly, it is suitable for various types of data using random boosting. Finally, this method can mitigate both bias and variance via model averaging [83].
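The idea of adding many small trees one after another can be sketched as follows. This is only an illustration under stated assumptions: a single input feature, one-split trees (stumps), a squared-error loss, and hypothetical data. The source does not describe its implementation, so the helper names `fit_stump`, `boost`, and `predict` are invented for this example.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises the squared error."""
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so that both sides of the split are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_cycles=20, learning_rate=0.1):
    """Least-squares boosting: each stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)
    fitted = [base] * len(ys)
    stumps = []
    for _ in range(n_cycles):
        residuals = [y - f for y, f in zip(ys, fitted)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        fitted = [f + learning_rate * (left_mean if x <= threshold else right_mean)
                  for x, f in zip(xs, fitted)]
    return base, stumps, fitted


def predict(x, base, stumps, learning_rate=0.1):
    """Sum the shrunken contributions of all stumps for a new input x."""
    return base + learning_rate * sum(l if x <= t else r for t, l, r in stumps)


def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical scaled input and shear strength values (not the study's data).
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
ys = [0.46, 0.48, 0.47, 0.55, 0.58, 0.62, 0.66, 0.69]
base, stumps, fitted = boost(xs, ys)
print(mse(ys, [base] * len(ys)), mse(ys, fitted))
```

The defaults of 20 learning cycles and a learning rate of 0.1 mirror the Boosted settings reported in Section 5.3; the minimum leaf size of 8 used there is not reproduced in this toy example.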

4.3. Extreme Learning Machine (ELM)

ELM, first proposed by Huang et al. [84, 85], is a learning algorithm for Single hidden Layer Feedforward Neural Networks (SLFNs) [86]. The ELM algorithm learns faster than conventional algorithms such as backpropagation and the least-squares support vector machine [61, 84, 87]. The main aim of ELM is to reach the smallest training error together with the smallest norm of weights in order to optimize the model performance. A detailed description of the ELM algorithm is available in published papers [84, 88–90].
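A compact way to see why ELM trains quickly is that the hidden-layer weights are drawn at random and only the output weights are solved for, in one linear least-squares step. The sketch below is an assumption-laden illustration, not the implementation used in the study: it uses a sigmoid activation, a small ridge term for numerical stability in place of the Moore-Penrose pseudo-inverse, synthetic two-input data, and hypothetical helper names (`solve_linear_system`, `hidden_layer`, `train_elm`, `predict_elm`).

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def hidden_layer(x, weights, biases):
    """Sigmoid activations of the random hidden layer for one input vector."""
    out = []
    for w, b in zip(weights, biases):
        z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        out.append(1.0 / (1.0 + math.exp(-z)))
    return out


def train_elm(inputs, targets, n_hidden=8, ridge=1e-6, seed=0):
    rng = random.Random(seed)
    n_inputs = len(inputs[0])
    # Step 1: hidden weights and biases are random and are never trained.
    weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)]
               for _ in range(n_hidden)]
    biases = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    h = [hidden_layer(x, weights, biases) for x in inputs]
    # Step 2: output weights beta solve (H^T H + ridge * I) beta = H^T t.
    hth = [[sum(row[i] * row[j] for row in h) + (ridge if i == j else 0.0)
            for j in range(n_hidden)] for i in range(n_hidden)]
    htt = [sum(row[i] * t for row, t in zip(h, targets))
           for i in range(n_hidden)]
    beta = solve_linear_system(hth, htt)
    return weights, biases, beta


def predict_elm(model, x):
    weights, biases, beta = model
    return sum(b * h for b, h in zip(beta, hidden_layer(x, weights, biases)))


# Synthetic data (NOT the soil database): two scaled inputs, one output.
rng = random.Random(1)
inputs = [[rng.random(), rng.random()] for _ in range(50)]
targets = [0.4 + 0.3 * x1 + 0.2 * x2 for x1, x2 in inputs]
model = train_elm(inputs, targets)
fitted = [predict_elm(model, x) for x in inputs]
rmse_elm = math.sqrt(sum((f - t) ** 2 for f, t in zip(fitted, targets)) / 50)
mean_t = sum(targets) / 50
rmse_mean = math.sqrt(sum((mean_t - t) ** 2 for t in targets) / 50)
print(rmse_elm, rmse_mean)
```

The default of 8 hidden neurons follows the ELM configuration reported in Section 5.3. Because the hidden weights depend on the random seed, repeated runs give different fits, which is consistent with the wider spread of ELM results reported there.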

4.4. Monte Carlo Approach

The Monte Carlo method has been widely used to solve problems relating to the variability of input parameters in various fields, including geotechnical engineering [45, 91, 92]. Monte Carlo methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. Basically, this technique provides a powerful way to compute, statistically, the relationships in data for both linear and nonlinear problems [45, 91]. The Monte Carlo technique is implemented by repeatedly drawing random input variables according to their probability density distributions and computing the corresponding outputs via a simulated model [93, 94]. The concept of the Monte Carlo method includes the following: (i) the variability of the input parameters can be fully propagated through predetermined models and (ii) the sensitivity of the inputs can be evaluated using statistical analysis of the output results.
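Applied to model validation, repeated random sampling amounts to redrawing the training/testing split many times and collecting the resulting error values. The sketch below illustrates this with a deliberately simple stand-in model (a least-squares line) and synthetic data; the model, the data, the number of runs, and the helper names `fit_line` and `holdout_rmse` are all assumptions made for this example.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares fit y = a*x + b (stand-in for an ML model)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def holdout_rmse(train, test):
    """Fit on the training pairs and return the RMSE on the testing pairs."""
    a, b = fit_line([x for x, _ in train], [y for _, y in train])
    return math.sqrt(sum((a * x + b - y) ** 2 for x, y in test) / len(test))


rng = random.Random(0)
# Synthetic (x, y) pairs with Gaussian noise; NOT the soil database.
data = []
for _ in range(200):
    x = rng.random()
    data.append((x, 0.5 * x + 0.45 + rng.gauss(0.0, 0.02)))

n_runs, train_fraction = 100, 0.7
rmse_runs = []
for _ in range(n_runs):
    shuffled = data[:]
    rng.shuffle(shuffled)  # a new random split on every run
    n_train = round(train_fraction * len(shuffled))
    rmse_runs.append(holdout_rmse(shuffled[:n_train], shuffled[n_train:]))

print(f"mean RMSE = {statistics.fmean(rmse_runs):.4f}, "
      f"std = {statistics.stdev(rmse_runs):.4f}")
```

The spread of `rmse_runs` is what a probability density plot of the error would summarize: it reflects only the effect of which samples happen to fall in the training and testing sets.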

4.5. Performance Evaluation Criteria

In this paper, standard statistical measures, namely, Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Correlation Coefficient (R), were used to compare and validate the performance of ML models [47, 95]. In general, RMSE is the square root of the mean squared difference between the estimated and actual values, while MAE is the mean magnitude of the errors. Lower values of RMSE and MAE mean higher prediction ability of the models. Besides, R is employed to evaluate the correlation of the predicted and actual values of soil shear strength. The values of R are between −1 and +1, where absolute values of R close to 1 mean higher prediction ability. These indicators can be computed using the following formulas [45, 96]:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{co,i} - y_{ac,i}\right)^{2}},$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_{co,i} - y_{ac,i}\right|,$$

$$R = \frac{\sum_{i=1}^{n}\left(y_{co,i} - \bar{y}_{co}\right)\left(y_{ac,i} - \bar{y}_{ac}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_{co,i} - \bar{y}_{co}\right)^{2}\sum_{i=1}^{n}\left(y_{ac,i} - \bar{y}_{ac}\right)^{2}}},$$

where $y_{co,i}$ and $\bar{y}_{co}$ represent the output value of the $i$th sample and the corresponding output mean value computed by the ML model, respectively; $y_{ac,i}$ and $\bar{y}_{ac}$ denote the measured value of the $i$th sample and the measured mean value, respectively; and $n$ indicates the total number of samples.
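The three indicators can be computed directly from paired lists of measured and model-computed values. The following Python sketch uses their standard definitions; the function names and the four sample values are hypothetical and serve only to show the arithmetic.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between measured and computed values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error between measured and computed values."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n


def pearson_r(actual, predicted):
    """Pearson correlation coefficient between measured and computed values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


# Hypothetical shear strength values in kg/cm2 (not taken from the study).
measured = [0.45, 0.50, 0.60, 0.70]
estimated = [0.47, 0.49, 0.58, 0.71]

print(f"RMSE = {rmse(measured, estimated):.4f}")
print(f"MAE  = {mae(measured, estimated):.4f}")
print(f"R    = {pearson_r(measured, estimated):.4f}")
```

RMSE weights large errors more heavily than MAE because the differences are squared before averaging, so the two values diverge when a few predictions are far off.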

5. Results and Analysis

In this section, the prediction results of the soil shear strength are presented using various ML models (ANN, ELM, and Boosted). In the modeling, clay content, void ratio, moisture content, liquid limit, plastic limit, and specific gravity were considered as input variables, whereas soil shear strength was considered as the output variable. As a first step, the influence of training and testing ratio on the performance of the ML models is presented, followed by the study of the random sampling effects on the performance of ML models, and finally, comparisons of different ML models are performed.

5.1. Influence of Different Training and Testing Ratios on the Performance of the ML Models

To evaluate the influence of different ratios on the performance of ML models, ANN model was used to select the best train-to-test ratio for the estimation of soil shear strength. Using ANN to perform the study, six parameters (Table 1) were selected using trial and error tests to train the model. The dataset was divided into two parts, with different ratios: 10 : 90, 20 : 80, 30 : 70, 40 : 60, 50 : 50, 60 : 40, 70 : 30, 80 : 20, and 90 : 10 train/test split. Basically, a training dataset was used to construct the model, whereas the testing dataset was used to assess the model’s predictive capability. Finally, the performance of ANN model on different ratio-based training and testing datasets using various statistical indices was evaluated, as shown in Figure 2.
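For orientation, the nine train/test ratios translate into the following sample counts for a database of 538 samples. This sketch assumes a simple shuffle-and-cut split with the training size rounded to the nearest integer; the source does not state how the split was implemented, and `split_by_ratio` is a hypothetical helper.

```python
import random


def split_by_ratio(samples, train_fraction, seed=0):
    """Shuffle a copy of `samples` and cut it into (train, test) parts."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))  # rounding is an assumption
    return shuffled[:n_train], shuffled[n_train:]


n_samples = 538  # database size reported in the study
indices = list(range(n_samples))
train_percents = [10, 20, 30, 40, 50, 60, 70, 80, 90]

split_sizes = {}
for pct in train_percents:
    train, test = split_by_ratio(indices, pct / 100)
    split_sizes[pct] = (len(train), len(test))
    print(f"{pct}/{100 - pct}: {len(train)} training, {len(test)} testing")
```

At the 10/90 end the model is trained on only about 54 samples, while at the 90/10 end the testing set shrinks to about 54 samples, so its error estimate becomes noisier.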

It can be seen that as the number of data in the training datasets increased, the errors (RMSE and MAE) of the ANN model increased, and R values of the ANN model decreased, showing the accuracy of ANN decreased (Figures 2(a), 2(c), and 2(e)). In contrast, as the number of data in the testing datasets increased, the errors (RMSE and MAE) of ANN decreased, and R values increased, reflecting an increase of the ANN accuracy (Figures 2(b), 2(d), and 2(f)). It can be observed that the performance of the ANN model on both training and testing datasets was the best on the training/testing ratio of 70/30, based on the values of mean, standard deviation, and quantile levels of the three criteria.

5.2. Random Sampling Effects on the Performance of ANN

To assess the random sampling effects on the performance of the ML models, the ANN model was trained on different training/testing ratios using Monte Carlo simulation. In this process, 1000 simulations were carried out to check the statistical convergence of the model, as shown in Figure 3. The RMSE and MAE values stabilized within 10% of their average values after only 10 iterations and within 5% after 20 Monte Carlo iterations. Similarly, the values of R stabilized within 2% of the average after 8 iterations and within 1% after 50 iterations.
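One way to quantify this kind of statistical convergence is to track the running mean of a metric over the Monte Carlo runs and find the first run after which it stays within a given relative tolerance of the final mean. The study does not spell out its exact criterion, so the rule below is an assumption, and the metric values are random placeholders rather than results from the paper.

```python
import random


def running_means(values):
    """Cumulative mean of `values` after each run."""
    means, total = [], 0.0
    for k, v in enumerate(values, start=1):
        total += v
        means.append(total / k)
    return means


def iterations_to_stabilize(values, tolerance):
    """Smallest k such that every running mean from run k onward lies
    within `tolerance` (relative) of the final mean."""
    means = running_means(values)
    final = means[-1]
    k_stable = len(means)
    # Walk backwards from the last run until the band is first violated.
    for k in range(len(means), 0, -1):
        if abs(means[k - 1] - final) / abs(final) <= tolerance:
            k_stable = k
        else:
            break
    return k_stable


# Placeholder RMSE values for 1000 runs (random draws, not study results).
rng = random.Random(42)
rmse_runs = [rng.gauss(0.07, 0.01) for _ in range(1000)]

for tol in (0.10, 0.05, 0.01):
    k = iterations_to_stabilize(rmse_runs, tol)
    print(f"running mean stays within {tol:.0%} of the final mean from run {k}")
```

A tighter tolerance can never be reached earlier than a looser one, which matches the pattern of the reported thresholds (10% before 5%, 2% before 1%).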

In addition, the analysis of the probability density of R, RMSE, and MAE values was also carried out to study the random sampling effects on the performance of ANN model (Figure 4). It can be observed that the distribution of the probability density of R, RMSE, and MAE values was different on various training/testing ratios.

In general, it can be stated that the performance of the ANN model is sensitive to the random selection of data in the datasets used for training and validating the model. In this study, the ANN model converged after more than 700 Monte Carlo simulations, and the train-to-test ratio of 70 : 30 was found to be the best option for ML modeling.

5.3. Validation and Comparison of Different ML Models

Validation and comparison of three ML models (i.e., ANN, ELM, and Boosted) were conducted using the best ratio of 70/30 of training and testing datasets. The ANN was trained with the parameters provided in Table 1, whereas ELM was trained with a network constructed of one input layer (6 neurons), one hidden layer (8 neurons), and one output layer (1 neuron). Regarding the Boosted algorithm, the minimum leaf size was taken as 8, the number of learning cycles was 20, and the learning rate was set at 0.1. Values of R, RMSE, and MAE of the models using the testing dataset are shown in Figures 5–7. On the basis of the RMSE indicator, it can be observed that the RMSE of the ANN model ranged from about 0.05 to 0.1, whereas this value ranged from about 0.08 to 0.125 for the Boosted algorithm and from 0.07 to 0.3 for the ELM model over 1000 Monte Carlo simulations (Figure 5). Regarding the MAE indicator, it can be seen that the MAE of the ANN model ranged from 0.04 to 0.07, whereas this value ranged from 0.06 to 0.09 for the Boosted model and from 0.075 to 0.25 for the ELM model over 1000 Monte Carlo simulations (Figure 6). In terms of the R indicator, the R values ranged from 0.95 to 0.97 for the ANN model, from 0.88 to 0.95 for the Boosted model, and from 0.62 to 0.95 for the ELM model (Figure 7). Based on these results, it can be generally seen that the ANN model had the lowest error values (RMSE and MAE) and the highest R values compared with the other models (Boosted and ELM), whereas ELM had the most unstable values of RMSE, MAE, and R. ELM also had the highest error values and the lowest R values over 1000 Monte Carlo simulations. A summary of the main results of the three methods is presented in Table 2. Overall, it can be stated that the ANN model is the best and most stable model compared with the other models (Boosted and ELM) for the prediction of soil shear strength.

6. Discussion

ML models are known as advanced techniques and approaches for quick and accurate prediction of real-world problems. These models, based on objective computational algorithms, can handle complex relationships between input and output variables [97]. However, it is observed that ML models are quite sensitive to the quality of data and the way they are used in the modeling process, especially the ratio used to divide the datasets for training and validating the ML models [98]. In this study, this problem is analyzed by investigating the influence of the training/testing ratio on the performance of three popular ML models, namely, ANN, ELM, and Boosted, in predicting the soil shear strength.

Overall, the results showed that the ML models’ performance changed significantly under different training/testing ratios. The results showed that the training/testing ratio of 70/30 was the most suitable one for training and validating the ML models. This finding is in line with other published works, such as Pham et al. [99], who investigated different training/testing ratios for training and validating various ML models (SVM, Logistic Regression, ANN, and Naïve Bayes) for spatial prediction of landslides and found that 70/30 was the best training/testing ratio for getting the best performance of the ML models. Other studies have also reported findings consistent with this study [100–105]. In addition, it is noticed that when the percentage of data in the training dataset increased, the errors (RMSE and MAE) of the models increased, and R values decreased. Thus, an increase of data (or samples) in the training dataset might negatively affect the prediction accuracy and make the models more difficult to apply.

Besides, the validation and comparison results showed that all the ML models performed well, but ANN was the best model for the prediction of soil shear strength. It can be stated that the ANN model has been reaffirmed as the best single ML model for solving most real-world problems [106, 107]. ANN has several advantages compared with other ML models, such as (i) the capability to extract the essential process information from data for analysis and prediction, (ii) the ability to generalize from data, (iii) the ability to correctly process information that only broadly resembles the original training data, and (iv) essential features related to nonlinearity, fault tolerance, independence from assumptions, and universality. Thus, the ANN algorithm is particularly suitable for extremely complex data. Last but not least, ANN is an adaptive algorithm, so that the learning process can be more effective [108, 109]. Therefore, it can be stated that the ANN was the best predictor of soil shear strength.

7. Conclusions

Soil shear strength is one of the most critical geotechnical engineering properties used for designing and constructing civil engineering structures and constructions. Prediction of this parameter using advanced ML models might help in saving time and reducing cost for construction projects. In this study, three popular ML models, including ANN, ELM, and Boosted, were applied and compared to predict the soil shear strength using a database collected from Long Phu 1 power plant project, Vietnam. In addition, the performance of these models was also investigated under the influence of different training and testing ratios over 1000 Monte Carlo simulations.

Validation and comparison results showed that although all models performed well, ANN performed best compared with the other models. It can also be observed that the performance of the models changed significantly under the different training and testing ratios used for training and validating the models. Based on the statistical analysis, a ratio of 70/30 for training and testing datasets was considered the best ratio for training and validating the models. In addition, Monte Carlo simulations showed that the performance of the models varies under the random sampling effect over 1000 simulations. ANN was found to be the best and most stable method under the variability of the input space.

In short, civil engineers can use the results of this study for quick and accurate prediction of soil shear strength for design purposes, for instance, for roads, bridges, retaining walls, and other geotechnical and civil structures. Although the single dataset used in this study is sufficient for the development of the ML models, it is recommended that these ML models be applied and validated with data from different regions for better justification and verification. However, these models are black-box models and do not provide equations for engineers’ calculations; therefore, other ML models such as GEP, GMDH, and EPR, which can provide equations, can be considered for future application and comparison.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was funded by the Ministry of Transport, project titled “Building Big Data and Development of Machine Learning Models Integrated with Optimization Techniques for Prediction of Soil Shear Strength Parameters for Construction of Transportation Projects” under Grant no. DT 203029.