Abstract

Determination of the permeability coefficient (K) of soil is one of the essential steps in assessing infiltration, runoff, groundwater, and drainage during the design of construction projects. In this study, three cost-effective algorithms, namely, artificial neural network (ANN), support vector machine (SVM), and random forest (RF), which are well-known advanced machine learning techniques, were used to predict the permeability coefficient (K) of soil (10⁻⁹ cm/s) from a set of six simple input parameters: natural water content (%), void ratio (e), specific density (g/cm³), liquid limit (LL) (%), plastic limit (PL) (%), and clay content (%). For this purpose, data from a total of 84 soil samples collected during the detailed design stage investigations of the Da Nang-Quang Ngai national road project in Vietnam were used to generate training (70%) and testing (30%) datasets for building and validating the models. Statistical error indicators (RMSE and MAE) and the correlation coefficient (R) were used to evaluate and compare the performance of the models. The results show that all three models performed well (R > 0.8) in predicting the permeability coefficient of soil, but the RF model (RMSE = 0.0084, MAE = 0.0049, and R = 0.851) is more efficient than the other two models, namely, ANN (RMSE = 0.001, MAE = 0.005, and R = 0.845) and SVM (RMSE = 0.0098, MAE = 0.0064, and R = 0.844). Thus, it can be concluded that the RF model can be used for accurate estimation of the permeability coefficient (K) of soil.

1. Introduction

The permeability of soil is one of the most important factors governing the fluid flow characteristics of the soil. Generally, permeability is represented by the amount of water transmitted through the interconnected voids of a soil mass in a certain period, and it can be determined using field and laboratory tests. It is an accepted fact that determining the soil permeability coefficient is crucial, yet the task is difficult, time-consuming, and expensive [1, 2].

From a geotechnical point of view, soil permeability depends on many factors such as the soil density, water content, void ratio, mineralogy, and soil structure. The permeability coefficient is used in many geotechnical problems such as slope stability, the failure of structures related to ground settlement, seepage, and leakage. Thus, many authors have tried to establish empirical relationships between these influencing factors and the permeability coefficient [3–5]. There are several direct relationships between grain size and the permeability coefficient of soil. Hazen [6] indicated that the permeability is proportional to the square of the effective grain size for sand with uniform particles. Other authors proposed a regression that considers porosity and the percentages of clay and sand particles to estimate the permeability of soil [7]. Some other authors predicted soil permeability based on bulk density, grain size, and particle shape [8, 9]. As mentioned above, such relationships assume that the permeability of soil is strongly dependent on the particle size distribution; however, they are not applicable to a wide range of soils [1, 10]. These studies indicate that the empirical relationships have certain limitations as well as uncertainties.
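To make the form of Hazen's relationship concrete, the short sketch below evaluates a permeability that scales with the square of the effective grain size. The function name `hazen_permeability`, the coefficient `c`, its default value, and the grain sizes are illustrative assumptions and are not taken from this study; only the quadratic dependence comes from the text.

```python
def hazen_permeability(d10_mm: float, c: float = 1.0) -> float:
    """Hazen-type estimate k = c * D10^2.

    d10_mm: effective grain size D10 (assumed here to be in mm).
    c: empirical coefficient (hypothetical default; it must be
       calibrated for a given soil and unit system).
    """
    if d10_mm <= 0:
        raise ValueError("effective grain size must be positive")
    return c * d10_mm ** 2


# Doubling the effective grain size multiplies the estimate by four.
k_fine = hazen_permeability(0.1)
k_coarse = hazen_permeability(0.2)
ratio = k_coarse / k_fine
```

Because the estimate depends on a single grain-size measure and one calibrated coefficient, it illustrates why such relationships are hard to transfer to soils with different gradation or clay content.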

Nowadays, machine learning (ML) and artificial intelligence (AI) techniques are applied successfully in many fields, including civil engineering. ML techniques enable engineers to estimate unknown parameters in these problems with superior approximation ability. Soft computing methods, for example, fuzzy logic, artificial neural networks (ANNs), and support vector machines (SVMs), are now being used in geotechnical engineering for predicting the compressive and shear strength of soil, the load-bearing capacity of foundations, and so on [11–13]. Several authors have used ML techniques to estimate the tensile strength of rock as well as flyrock caused by blasting [14]. AI and ML techniques are also frequently used in landslide studies, flood management, and infrastructure development.

Regarding the prediction of the soil permeability coefficient, several studies have used ML methods, for instance, ANN, the adaptive neuro-fuzzy inference system (ANFIS), and the hybrid genetic algorithm-ANFIS optimization model (GA-ANFIS) [1, 2, 15–17]. Sezer et al. [17] used an ANFIS to estimate the permeability of granular soil; they concluded that the ANFIS algorithm is well suited to estimating the permeability of granular soil from grain-size distribution and particle shape [2]. However, the hybrid GA-ANFIS model outperformed the single ANN and ANFIS models and the hybrid GA-ANN model in terms of prediction accuracy [15]. In general, soft computing-based models are effective tools for predicting the properties of soil.

Random forest (RF) was first proposed by Breiman to solve unsupervised learning, regression, and classification problems [18–20]. It is known as a powerful algorithm and has been successfully applied to many problems in geotechnical engineering [21, 22]. For example, RF has been used successfully to predict soil parameters such as the shear strength of soil and the soil permeability coefficient [20, 23]. The RF algorithm has important merits in handling large databases, and it can also deal with thousands of input variables [24].

Based on the literature survey, it can be concluded that these ML techniques have many advantages in predicting soil parameters. To the best of the authors’ knowledge, there is no study on estimating the permeability coefficient of soil using these techniques under Vietnamese conditions. The main difference from earlier studies is that here we have used different datasets to compare the performance of different models and to select the best model for estimating the permeability coefficient of soil (K). Moreover, this is the first time the RF model has been used to determine ‘K’ in the study area of Vietnam.

Therefore, the main objective of this study is to apply popular soft computing techniques (ANN, SVM, and RF) at the Da Nang-Quang Ngai expressway project site in Vietnam to estimate the permeability coefficient (K) of soil and to select the best model for the prediction of “K.” Statistical evaluation indicators, namely RMSE, MAE, and the correlation coefficient (R), were used to validate and evaluate the models. Matlab software was used for data processing and for simulating the ANN, SVM, and RF models.

2. Materials and Methods

2.1. Data Used

The dataset consists of 84 soil samples collected during the detailed design stage investigations of the Da Nang-Quang Ngai expressway development project near Da Nang, central Vietnam (Figure 1). To predict the “K” of soil, input data related to permeability were selected: water content (%), void ratio, specific density (g/cm³), liquid limit (LL), plastic limit (PL), and clay content (%). All these inputs are closely related to permeability, especially the void ratio, which is a critical parameter linked to hydraulic conductivity in both Darcy’s equation and the Kozeny–Carman equation, equations (1) and (2).

An initial statistical analysis of the dataset is presented in Table 1. The natural water content varies from 15.1% to 99%. The void ratio varies from 0.46 to 2.63. The specific density ranges from 2.58 g/cm³ to 2.74 g/cm³. The liquid limit ranges from 18.9% to 88.93%, the plastic limit from 12.2% to 54.8%, and the clay content from 5.7% to 64%. Figure 2 shows the histograms of the input parameters.

2.2. Methods Used
2.2.1. Artificial Neural Network (ANN)

ANN is a common and powerful technique that imitates the activity and performance of the human brain and nervous system [15–17]. It has important abilities such as generalization and learning from data, and it can deal with a large number of variables. It has been reported that the major characteristics of ANN comprise continuous nonlinear dynamics, high fault tolerance, collective computation, self-learning, self-organization, and real-time treatment [25]. Thus, this algorithm has been widely and successfully applied to many problems in geotechnical engineering. For both linear and nonlinear patterns, ANN uses hidden layers between the input and output neurons; as a result, it can identify relationships and patterns in the data by itself. In order to predict the permeability coefficient of soil, a multilayer perceptron (MLP) was adopted as a regression technique. The sigmoid function is used as the activation function in the neurons, which is applied to the weighted inputs. In this formulation, hi indicates the permeability coefficient (output) and x = (x1, x2, …, xi) denotes the input parameters (i.e., the factors affecting the permeability coefficient).
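The forward pass of a small MLP with sigmoid hidden neurons can be sketched as follows. This is a minimal illustration only: the network size, the weights, the scaled input values, and the linear output neuron are hypothetical assumptions, not the trained Matlab model used in this study.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer with sigmoid activations and a linear output (assumed)."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


# Hypothetical, untrained weights: 6 scaled inputs -> 2 hidden neurons -> 1 output.
x = [0.4, 0.3, 0.5, 0.6, 0.2, 0.1]
w_hidden = [
    [0.1, -0.2, 0.05, 0.3, -0.1, 0.2],
    [-0.3, 0.4, 0.1, -0.2, 0.25, 0.0],
]
b_hidden = [0.0, 0.1]
w_out = [0.5, -0.5]
b_out = 0.2
k_hat = mlp_forward(x, w_hidden, b_hidden, w_out, b_out)
```

Training consists of adjusting `w_hidden`, `b_hidden`, `w_out`, and `b_out` so that the output matches the measured permeability coefficients on the training samples.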

2.2.2. Support Vector Machine (SVM)

SVM is a statistical learning algorithm first proposed by Vapnik to deal with high-dimensional nonlinear problems such as regression and classification [26, 27]. The concept of SVM is to build a hyperplane that separates the dataset into different classes. In SVM, the original input space is mapped to a high-dimensional feature space using the training dataset [28–30].

Then, the optimal hyperplane is defined by optimizing the class boundary. The support vectors are the training points that lie closest to the optimal hyperplane [28, 29]. SVM has been widely used in landslide prediction, and the results showed that this technique has high accuracy [28, 31]. In this study, the SVM was employed as a regression method by adopting a δ-insensitive loss function [32].
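The insensitive loss used in support vector regression (usually written ε-insensitive in the SVM literature, δ-insensitive here) ignores residuals smaller than a tolerance and penalizes larger ones linearly. A minimal sketch is given below; the function name and the tolerance value 0.1 are illustrative assumptions rather than settings reported in this study.

```python
def insensitive_loss(y_true: float, y_pred: float, delta: float) -> float:
    """Zero inside the tube |y_true - y_pred| <= delta, linear outside it."""
    if delta < 0:
        raise ValueError("delta must be non-negative")
    return max(0.0, abs(y_true - y_pred) - delta)


# A residual of 0.05 lies inside a tube of half-width 0.1 and costs nothing;
# a residual of 0.5 lies outside and is penalized by the excess.
loss_inside = insensitive_loss(1.0, 1.05, delta=0.1)
loss_outside = insensitive_loss(1.0, 1.5, delta=0.1)
```

Only training points with a nonzero loss, or lying exactly on the tube boundary, influence the fitted regression function, which is why they play the role of support vectors.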

2.2.3. Random Forest (RF)

RF is a widely used algorithm first suggested by Breiman to solve classification, regression, and unsupervised learning problems [18, 22]. It is commonly employed in different fields of civil engineering, including geotechnical engineering [21, 33, 34]. The method has several merits, such as high performance on complex datasets with little calibration and the ability to deal with highly noisy variables [35, 36]. In addition, it has been reported that the algorithm is very user-friendly because it has only two parameters (the number of variables and the number of trees) and is usually not sensitive to their values [22].

In a random forest, the bagging technique is used to randomly draw samples from the whole dataset for model calibration. In this study, two kinds of errors, the reduction in Gini and the reduction in accuracy, together with the Out-of-Bag (OOB) error, were computed because these error factors can be used to rank and select variables [37, 38]. For each variable, the values of that variable are permuted across the OOB observations, and the error of the estimated model is then evaluated.
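The relation between a bootstrap (in-bag) sample and its Out-of-Bag observations can be sketched as follows. The helper name `bootstrap_split` and the seed are hypothetical; the sample count of 84 mirrors the dataset size, but the sketch does not reproduce the Matlab implementation used in this study.

```python
import random


def bootstrap_split(n_samples: int, rng: random.Random):
    """Draw n_samples indices with replacement; unseen indices are Out-of-Bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    in_bag_set = set(in_bag)
    oob = [i for i in range(n_samples) if i not in in_bag_set]
    return in_bag, oob


rng = random.Random(0)  # fixed seed for repeatability (illustrative)
in_bag, oob = bootstrap_split(84, rng)
oob_fraction = len(oob) / 84
```

Each tree is fitted on its own in-bag sample, and its OOB observations serve as an internal test set. For large samples, the expected OOB share approaches 1 − 1/e, about 37%, so every observation is left out by a substantial number of trees.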

2.2.4. Relief F for Attribute Importance

In general, evaluating attribute quality (feature quality) is a crucial task for both regression and classification problems in machine learning, for example in constructive induction, regression and decision trees, and feature selection [39, 40]. In many learning problems, each instance is described by a large number, sometimes thousands, of attributes (features). Many learning techniques cannot deal with this situation because numerous features are irrelevant or carry little information. Attribute (feature) selection is the task of choosing a small subset of features that is sufficient to describe the target concept. In order to decide which features should be kept and which should be removed, a practical and reliable method is needed for evaluating how much information each feature carries about the target.

In recent years, many researchers have devoted considerable effort to feature estimation, and many methods exist for estimating the quality of attributes. For regression problems, the mean squared error and mean absolute error [41] and Relief F [39] are used as estimation heuristics. Most heuristic methods for evaluating attribute quality assume the conditional independence of the features. These methods are thus less suitable for problems with strong feature interaction. In contrast, Relief F does not make this assumption. The algorithm is efficient, takes contextual information into account, and can correctly estimate attribute quality in problems with strong dependencies between features [42]. It has been reported that Relief F is widely regarded as an attribute selection method, used as a preprocessing step before the model is trained [43]. It is known as one of the most effective algorithms to date [44]. Finally, Relief F provides a unified assessment of feature quality in regression problems. The details of this algorithm can be found in previous studies [39, 40].

2.2.5. Validation Indices

In this research, RMSE, MAE, and R were employed to assess, compare, and validate the performance of the models. RMSE measures the square root of the mean squared difference between actual and estimated values, while MAE measures the average error magnitude. Smaller values of RMSE and MAE indicate higher predictive ability. By contrast, higher values of R indicate higher predictive ability. These indicators (RMSE, MAE, and R) are commonly applied to regression problems and are computed following [45, 46]. In their definitions, q1 and q2 correspond to the measured and modeled values, q̄ indicates the average permeability coefficient value, and M is the number of samples.
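The three indicators can be computed as in the following sketch. The standard textbook definitions are assumed, with R taken to be the Pearson correlation coefficient, which may differ in form from the expression used in [45, 46]; the function names and the sample values are illustrative only.

```python
import math


def rmse(measured, modeled):
    """Root mean squared error between measured and modeled values."""
    m = len(measured)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(measured, modeled)) / m)


def mae(measured, modeled):
    """Mean absolute error between measured and modeled values."""
    m = len(measured)
    return sum(abs(a - b) for a, b in zip(measured, modeled)) / m


def pearson_r(measured, modeled):
    """Pearson correlation coefficient (assumed definition of R)."""
    m = len(measured)
    mean_a = sum(measured) / m
    mean_b = sum(modeled) / m
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(measured, modeled))
    var_a = sum((a - mean_a) ** 2 for a in measured)
    var_b = sum((b - mean_b) ** 2 for b in modeled)
    if var_a == 0 or var_b == 0:
        raise ValueError("R is undefined for constant inputs")
    return cov / math.sqrt(var_a * var_b)


# Made-up values purely for illustration (not data from this study).
measured = [1.0, 2.0, 3.0, 4.0]
modeled = [1.1, 1.9, 3.2, 3.8]
scores = (rmse(measured, modeled), mae(measured, modeled), pearson_r(measured, modeled))
```

RMSE and MAE carry the units of the permeability coefficient, whereas R is dimensionless, so the three indicators should be read together rather than in isolation.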

2.3. Methodology

In this research, three main steps were carried out to predict the “K” of soil, as indicated in Figure 3.

Step 1. First, the input dataset is generated and loaded, and then it is randomly divided into training (70%) and testing (30%) groups. The 70 : 30 ratio for training and testing the models was chosen based on the experience of the authors and on similar studies by other researchers aimed at obtaining the best model performance [47]. In this step, the Relief F feature selection method was applied to evaluate the importance of the input variables; the important parameters were then selected to generate the final training and testing datasets after removing irrelevant parameters.

Step 2. In this step, the training dataset was used to train the soft computing-based models (ANN, SVM, and RF). To get the best performance from these models, the hyperparameters of each model were optimized by a trial-and-error process. In this study, the ANN was trained with 10 hidden layers with sigma loss function, the SVM was trained with the Radial Basis Function (RBF) kernel using a gamma value of 0.25, and the RF was trained with 100 iterations.

Step 3. The models (ANN, SVM, and RF) were validated in this step using the testing dataset. The statistical indicators (RMSE, MAE, and R) were calculated using both the training and testing datasets. While the values of these indicators on the training dataset indicate the goodness of fit of the models to the data used, the values on the testing dataset indicate the predictive capability of the models.
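The random 70 : 30 partition in Step 1 can be sketched as below. The helper name, the seed, and the rounding convention are assumptions for illustration; the exact sample counts per subset are not stated in the text.

```python
import random


def train_test_split_indices(n_samples: int, train_fraction: float = 0.7, seed: int = 0):
    """Shuffle sample indices and cut them into training and testing parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(train_fraction * n_samples)  # rounding rule is an assumption
    return indices[:n_train], indices[n_train:]


# With 84 samples, this convention yields 59 training and 25 testing samples.
train_idx, test_idx = train_test_split_indices(84)
```

Fixing the seed makes the partition repeatable, so that all three models are trained and tested on exactly the same samples and their indicators remain comparable.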

3. Results

3.1. Attribute Importance Using Relief F

We evaluated the importance of the six input parameters (water content, void ratio, specific density, liquid limit, plastic limit, and clay content) using the Relief F technique (Table 2). The clay content was found to be the least important variable for permeability, with a weight of merely 0.025. The weights of the other index parameters, namely plastic limit, liquid limit, and specific density, are 0.0753, 0.0762, and 0.0877, respectively. Finally, the water content and void ratio are the most important parameters, with weights of 0.096 and 0.0942, respectively.

3.2. Validation and Comparison of the Models

The models (ANN, SVM, and RF) were validated using both the training and testing datasets, as shown in Figures 4–6 and summarized in Table 3. On the training dataset, the RF has the highest value of R (0.972), followed by the ANN (0.948) and the SVM (0.861). Likewise, the RF has the lowest values of RMSE (0.0035) and MAE (0.0023), followed by the ANN (0.0047 and 0.0027) and the SVM (0.0078 and 0.0056). These results show that the RF has the best goodness of fit to the training data compared with the other models (SVM and ANN). On the testing dataset, the RF similarly has the highest value of R (0.851), followed by the ANN (0.845) and the SVM (0.844). However, the ANN has the lowest value of RMSE (0.001), followed by the RF (0.0084) and the SVM (0.0098), whereas the RF has the lowest value of MAE (0.0049), followed by the ANN (0.005) and the SVM (0.0064). Figure 5 visualizes the actual and predicted values of the permeability coefficient of soil obtained from experiments and models, respectively.

4. Discussion and Conclusion

In geotechnical studies, the permeability coefficient (K) of soil is an important factor in designing civil engineering structures on soil. However, determining “K” in the laboratory or in the field is time-consuming and expensive. Indirect estimation of “K” using empirical equations and correlations with other engineering properties of soils may not be accurate [3–5]. Moreover, such equations may be applicable to specific soils only. Therefore, in this study, we applied three popular, cost-effective soft computing-based models, namely ANN, SVM, and RF, to predict the “K” of soil at the Da Nang-Quang Ngai expressway development project site, using six soil parameters as model inputs: water content, void ratio, specific density, liquid limit, plastic limit, and clay content.

The results of the Relief F feature selection method showed that the void ratio and the water content are the most important input variables (parameters) for predicting the “K” of the soil. This is reasonable because several studies have found the void ratio to be highly correlated with permeability [48]. In addition, the water content represents the level of saturation, which is directly linked to fluid flow in porous media [49].

The validation results showed that all three models perform well in predicting the coefficient of permeability of soil. However, the RF was found to be the most accurate method for predicting the “K” of soil in comparison with SVM and ANN. This can be attributed to the ability of the RF algorithm to process large databases with a large number of input parameters [49]. The results of this study are also in good agreement with other studies on estimating the shear strength of soil, in which the RF model performed best in comparison with other ML models [20, 23].

In general, the soft computing-based models developed in this study provide a powerful tool for estimating the permeability coefficient of soil accurately. However, the performance of a model depends on the input parameters, so strategies for improving the input samples are needed to improve model performance. In addition, it is necessary to consider the over-fitting problem [50]. The training data are therefore crucial for accurate prediction, and one needs to ensure that the data are reliable and sufficient before applying a machine learning technique in practice. In this study, we used 70% of the total data as training data to obtain optimum results, based on earlier studies [51–53].

Developing and improving the performance of models is a continuous process. The main finding of this study is that the RF model can be used to estimate the permeability coefficient of soil accurately from a limited set of soil parameters; however, more studies at different sites are required to confirm its wider applicability.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

The authors would like to thank the support of the Department of Science, Technology, and Environment (Ministry of Education and Training), University of Transport and Communications, and other agencies for providing data used in this research. This study was funded by the Ministry of Education and Training under grant number B2020-GHA-03 chaired by the University of Transportation.