Abstract

Soil erosion induced by rainfall is a critical problem in many regions in the world, particularly in tropical areas where the annual rainfall amount often exceeds 2000 mm. Predicting soil erosion is a challenging task, subjecting to variation of soil characteristics, slope, vegetation cover, land management, and weather condition. Conventional models based on the mechanism of soil erosion processes generally provide good results but are time-consuming due to calibration and validation. The goal of this study is to develop a machine learning model based on support vector machine (SVM) for soil erosion prediction. The SVM serves as the main prediction machinery establishing a nonlinear function that maps considered influencing factors to accurate predictions. In addition, in order to improve the accuracy of the model, the history-based adaptive differential evolution with linear population size reduction and population-wide inertia term (L-SHADE-PWI) is employed to find an optimal set of parameters for SVM. Thus, the proposed method, named L-SHADE-PWI-SVM, is an integration of machine learning and metaheuristic optimization. For the purpose of training and testing the method, a dataset consisting of 236 samples of soil erosion in Northwest Vietnam is collected with 10 influencing factors. The training set includes 90% of the original dataset; the rest of the dataset is reserved for assessing the generalization capability of the model. The experimental results indicate that the newly developed L-SHADE-PWI-SVM method is a competitive soil erosion predictor with superior performance statistics. Most importantly, L-SHADE-PWI-SVM can achieve a high classification accuracy rate of 92%, which is much better than that of backpropagation artificial neural network (87%) and radial basis function artificial neural network (78%).

1. Introduction

Soil erosion induced by water is the main culprit of the degradation of upland and mountain ecosystems [1]. The erosion process poses a threat to the capacity of the land to provide ecosystem services that are needed to reach the Sustainable Development Goal (SDG) target 15.3 [2]. Currently, soil erosion is a critical concern in many regions; this issue must be prevented to achieve sustainable development goals [3, 4]. Soil erosion is characterized by a complex and dynamic process involving detachment, transport, and deposition of soil material. There are many factors affecting erosion magnitude, namely, climate, soil type, soil structure, vegetation, and cropping on top and especially land management [5].

In tropical areas, such as Northwest Vietnam, soil erosion potential is high due to heavy rainfall and is currently accelerating under maize monocropping in the uplands [6, 7]. A recent increase in maize monocropping in the region is caused by market demand for the rapid growing livestock and poultry industry. The majority of the maize area expansion has been achieved by encroachment into steep forested mountain watersheds. Land preparation for a maize cropping season employed a weeding prior to burning, followed by several ploughs before sowing seeds. The practice is carried out at the onset of the monsoon rains; by the time, bare fields are exposed to high-intensity rains. This often results in high runoff leading to large erosion and longer-term degradation with declining crop production on-site and strong environmental impacts off-site [8, 9]. Erosion control measures are based on the following: covering the soil to protect it from raindrop impact; increasing the infiltration capacity of the soil to reduce runoff; improving the aggregate stability of the soil; and increasing surface roughness to reduce the velocity of runoff [1012]. Many attempts have been tested worldwide: contour ploughing or contour farming using stone, wood, or vegetation barriers/hedgerow; cover crop; minimum tillage or zero tillage; and mulching. Such measures can be effective in different geographic and climatic conditions under various soil characteristics and land management systems. Among these, land management is crucial in controlling erosion [11, 1316].

Cerdà et al. [17] investigate the hydrological and erosional impact as well as farmer’s perception on catch crops and weeds in citrus organic farming; this study recommends that farmers should be informed and given initial subsidies for implementing new measures for improving soil quality and preventing soil erosion. In the study of Tuan et al. [18], kinetic energy of rainfall is found to be one of the most important factors that cause soil erosion, particularly when heavy rains coincide with poor ground cover at the beginning of the cropping season. However, it should be noted that the importance of a factor varies from case to case and other comprehensive variables affecting erosion should also be studied.

Soil loss studies at the plot scale have been of crucial importance to identify the mechanism of the processes. The erosion plot experiments can help to introduce new erosion prevention as it provides access to dependable and faithful erosion measurements and large numbers of data necessary to test new models [19]. A number of empirical erosion models such as USLE and RUSLE [20], SWAT [21], and WEPP [22] used data from these types of studies for prediction. However, the performance and accuracy are not as good as expected [23].

In the last decades, machine learning-based models have proved to be a helpful alternative to deal with the multivariate and complex nature of the phenomena in various disciplines of engineering [2438]. Optimized kernel logistic regression models were employed in [39] for landslide susceptibility assessment. The evidential belief function-based function tree (FT), logistic regression, and logistic model tree were utilized for predicting landslide occurrences in [40]. Ensemble learning methods used for natural hazard risk assessment have been proposed in [4144].

In addition, ANN is used to estimate the soil erodibility in Malaysia [45] and to conclude that rainfall and runoff are important factors affecting the amount of soil eroded [46]. In [47], Kohonen neural networks (KNNs) are utilized to model runoff erosion with a better outcome than that of the conventional multiple linear regression models. It was also shown that ANN is generally more competitive than the WEPP model in quantitative prediction of soil loss [48]. Albaradeyia et al. [49] confirm that while WEPP underestimated the soil loss, ANN provided results that are in line with observation. Recently, Vu et al. [50] successfully constructed a machine learning model based on multivariate adaptive regression splines for predicting soil loss occurrences.

Based on the literature review, it can be seen that advanced machine learning methods can be a good option for soil erosion modeling. Even though previous research works extensively relied on ANN, other advanced machine learning algorithms are worth explored because the prediction of soil erosion is both complex and dynamic. We are particularly interested in the support vector machine (SVM) [5154] because it usually has comparable results with ANN and has a lower risk of overfitting. In addition, there has been a great success in using SVM for different but closely related hydrological and geological problems [5561].

Furthermore, to enhance the predictive performance of the SVM, this study relies on the approach of using metaheuristic approaches. It is noted that the process of fine-tuning a machine learning model can be formulated as a global optimization problem. In recent years, various metaheuristic approaches have been successfully employed, including monarch butterfly optimization [62, 63], slime mould algorithm [64], moth search algorithm [65, 66], Harris hawks optimization [6770], differential flower pollination [71], symbiotic organisms search [72], Henry gas solubility optimization [73], and satin bowerbird optimizer [74]. As can be seen from the literature, there is an increasing trend of hybridizing metaheuristics and machine learning to tackle complex problems in the field of engineering.

Therefore, the current study aims at extending the body of knowledge by establishing soil erosion prediction models for tropical hilly regions based on an integration of the history-based adaptive differential evolution with linear population size reduction and population-wide inertia term (L-SHADE-PWI) metaheuristic and the support vector machine pattern classification method. A dataset, featuring ten explanatory variables, collected from plot experiments in Son La province (Vietnam) is employed to construct and verify the prediction models. In this study, the problem of erosion status prediction is formulated as a pattern classification task within which a vector of explanatory variables is assigned to either “erosion” or “nonerosion.” We hypothesize that the structure of the L-SHADE-PWI-SVM, an integration of machine learning and metaheuristic optimization, is capable of predicting rainfall-induced soil erosion.

In summary, the innovative points of the current study can be stated as follows:(i)A novel integration of the L-SHADE-PWI metaheuristic and the SVM pattern classifier is first proposed to predict the complex phenomenon of soil erosion.(ii)The effort to fine-tune the machine learning model is eliminated via the utilization of metaheuristic optimization.(iii)The model construction phase of the proposed L-SHADE-PWI-SVM is entirely data-driven. Therefore, it is convenient to be used by land-use planners and hazard prevention agencies without much domain knowledge in machine learning.(iv)The superiority of the proposed hybrid approach is validated using experiments based on a repeated subsampling of the collected data.

The rest of the paper is organized as follows. In Section 2, we review the related research methods. In Section 3, we present the proposed L-SHADE-PWI-SVM model. Section 4 is devoted to experiments and result comparisons. We end with the conclusions in Section 5.

2. Material and Methods

2.1. General Description of the Study Area and the Collected Dataset

For this study, data in Son La province (Northwest Vietnam) were collected from 2009 to 2011 at two experimental sites, which are two catchments with bounded plots. Figure 1 shows the location of the study area in the maps of Son La and Vietnam. The area is governed by tropical monsoon climate with mild and generally warm temperature (the annual average is 21°C, and the lowest and highest average temperatures are 16°C in February and 27°C in August, respectively [75]). The winter (from November to March) is relatively dry while the summer (from May to October) can have a lot of rain.

At the experimental sites, erosion plots of 72 m2 (4 m × 18 m) with boundaries were installed to avoid the exchange of water runoff from outside. There were 24 of them in total: four treatments, three replicates arranged in a completely randomized block design. In order to gather deposited sediment from soil erosion at the plots, a system of buckets was employed. Erosion data were collected on storm basic during 3 years from 2009 to 2011. Detailed information on soil, crop, and field managements can be found in the previous study of Tuan et al. [18].

There are many factors contributing towards rill and interrill soil erosion induced by raindrop impact and surface runoff. They include climate, soil, topography, and land use. The amount and intensity of rainfall affect erosivity, while soil properties, land use, and topography (described via slope length and steepness) can have a big influence on the degree of erodibility. To represent these factors, a set of ten explanatory variables has been chosen. In Table 1, these variables are listed together with their statistics.

First, I30 is the prolonged peak rates of detachment and runoff in 30 minutes (30 min intensity). The storm energy is calculated according to the following formula [76]:where i is the maximum intensity of 30 minutes. The product of E and I30 gives us the kinetic rainfall energy (EI) which is the combined interaction of total energy and peak intensity in each particular storm. This quantity represents how particle detachment is combined with transport capacity [77].

The second and third variables are topographic factors, namely, slope length and steepness. They all have a great influence on soil loss. The bigger the slope is, the higher potential of soil loss will be. Similarly, the larger the slope length is, the more likely there will be soil erosion as there is a greater accumulation of runoff. In this study, a Nikon Forestry 550 Inclinometer was used to measure the slope of the plots.

Soil erodibility is heavily influenced by texture, but permeability, structure, and organic matter also play a role. In this study, wet analysis [78] was used to analyze soil texture with a two pseudoreplicate plot. In addition, it is known that aggregate stability is the index that is closely related to soil erodibility. However, there is currently no efficient method to calculate this index. Therefore, OC and pH, two simple characteristics of the soil, were used instead. Total OC was obtained using a C/N analyzer with carbonate content removed by HCl. pH was measured using a glass electrode with a soil-to-water ratio of 1 : 2.5.

The final variable chosen is ground cover rate as this is one of the most influential factors of soil erosion rate in tropical areas [18]. For more information on how this variable is measured, we refer the reader to Tuan et al. [18].

In our model to predict soil erosion, we aim to classify each sample, which is associated with a vector of values for the explanatory variables, into two classes: “erosion” or “nonerosion.” In order to provide ground truth classification to train the model, we follow the same criterion on soil loss as in [79]: “erosion” label is assigned to any sample with more than 3 tons per hectare soil loss, and other records are classified as “nonerosion.” A total of 236 data samples have been collected within which 118 records were classified as “erosion.”

2.2. Support Vector Machine (SVM) for Pattern Classification

The original SVM algorithm was first introduced in [80] for linear binary classification. The algorithm builds a hyperplane to separate positive and negative samples with the margin as large as possible. However, in practice, it is often the case that samples are not linearly separable and such hyperplane does not exist. This can lead to the poor performance of the algorithm. Accordingly, the original SVM algorithm is extended for nonlinear classification via the use of kernel functions [81]. Kernel functions, which are often nonlinear, map the space of input variables into a much higher-dimensional space, where separable hyperplanes are more likely to exist [82, 83]. In addition, soft margin is proposed to handle the case the hyperplane does not exist even in the higher-dimensional space [53]. With these improvements, the SVM has become one of the most popular and successful classification algorithms [8486].

Given a training set consists of N samples where is a feature or input vector and is the corresponding response or class label. The training phase of the SVM algorithm can be mathematically formulated as the following optimization problem [53, 87].

Find to minimizesubjected towhere the first two outcomes Rn and bR define the classification hyperplane (they are a normal vector and the corresponding offset, respectively); the third outcome is the vector of slack variables introduced to handle the case data cannot be separated without error; is the penalty constant decides how much weight should be put on classification error; and denotes the nonlinear mapping from the input space to a much higher-dimensional space.

The explicit formula of is not normally required. However, one needs to know the dot product of in the higher-dimensional space. It is referred to as the kernel function given by

There are different choices of kernel functions, such as linear, polynomial, sigmoid, and radial basis functions. In this work, radial basis function (RBF) is employed due to its good performance in pattern recognition tasks [88]. It is defined aswhere and is a parameter that can be tuned.

Once training is finished, the final classification is performed as follows:where is the solution of the dual problem of equations (2) and (3) and SV is the number of support vectors (the number of ).

2.3. The History-Based Adaptive Differential Evolution with Linear Population Size Reduction and Population-Wide Inertia (L-SHADE-PWI)

From Section 2.2, it is obvious that, before using the SVM algorithm, appropriate values of the penalty coefficient (C) and the parameter in the kernel function () need to be determined. These hyperparameters play an important role in the learning phase of the model. As a consequence, they can have essential influences on the predictive capability of the soil erosion prediction model. The step of selecting these hyperparameters is usually referred to as model selection. It is problem-dependent. It is also very challenging because the hyperparameters must be selected from continuous domains, where there is an infinite number of possible choices [33, 8992].

Traditional approaches for model selection include grid search and random search. In grid search, hyperparameters are exhaustively searched from a specified subset of the space of hyperparameters. Random search, on the other hand, randomly select combinations of hyperparameters for consideration. Both grid search and random search are simple and can be done in parallel. However, they cannot deliver desired predictive accuracy since hyperparameter selection is highly data-dependent and complex. For the case of an SVM model, the complexity coefficient (C) and the kernel function parameter (γ) are both real numbers. Hence, there are an infinite number of possible combinations of C and γ. This means that an exhaustive grid search is infeasible and researchers have turned to metaheuristic methods to address the model selection problem [72, 91, 93, 99].

In this work, we propose to use the history-based adaptive differential evolution with linear population size reduction and population-wide inertia (L-SHADE-PWI) algorithm to select the hyperparameters for the SVM algorithm. The L-SHADE-PWI [100] is a newly proposed metaheuristic with promising optimization capability. With the L-SHADE-PWI algorithm, the model selection of an SVM model is formulated as a global optimization problem where the objective function measures how well the model fits the training and validating sets using a set of the hyperparameters of the penalty coefficient (C) and the kernel function parameter (γ).

The L-SHADE-PWI is essentially an extension of the history-based adaptive differential evolution with linear population size reduction (L-SHADE) algorithm proposed in [101]. This newly developed method inherits the effective mutation-crossover operator of the standard differential evolution (DE), the adaptive tuning strategy of the L-SHADE, the DE/current-to-pbest/1 mutation strategy described in [102], the linear population size reduction proposed in [101], and a population-wide inertia term (PWI) incorporated in the mutation operation [100].

A brief summary of the L-SHADE-PWI algorithm is presented in Algorithm 1. In the heart of the algorithm, there are three main steps, including mutation, crossover, and selection. The mutation step generates a trial vector. The mutation operator requires the computing of the population-wide inertia term . The PWI term indicates an averaged direction and size of successful moves that lead to better solutions in the preceding generation. The incorporation of the population-wide inertia aims at enhancing the variations of population members in the potential direction that on average brought about cost function’s reduction [100]. The PWI term is calculated as follows [100]:where denotes the move of a population member in the search space occurred in generation  − 1. It is worth noticing that individual moves that do not lead to better cost function values are not taken into account. Therefore, the sum described in equation (7) is divided by which represents the number of beneficial moves [100].

Define PS, D, CF(x), Gmax, an archive A = 
Define the two archives of MF and MCR and the two sets of SF and SCR
Create a population P with PS members randomly
Identify and record xbest,g which is currently the best solution
For each generation
For each individual xi,g
  Select MFk and CRk randomly from MF and MCR
  Calculate Fi = Gaussian(MFk, 0.1)
  Calculate CRi = Gaussian(CRk, 0.1)
  Select a xpbest,g randomly from A
  Select two random members and
  Calculate the population-wide inertia term
  Create a trial vector according to [100]:
  
  Perform crossover according to [103]:
  
  Carry out the selection operator [103]
End For
 Collect successful Fi and CRi
 Update SF, SCR, MF, MCR, and A [104]
 Update PS [101]
End For
Return xbest,g

Accordingly, this trial vector is used in the crossover step to create a mutated vector u. The final step compares the mutated vector with its parent and replaces them in the next generation if better target is achieved. In Algorithm 1, PS is the population size; D represents the number of the searched variables (two in our case); CF(x) is the cost function; and x denotes the vector of the searched variables. The two vectors of length H MF and MCR are archives containing the mean values of the mutation scale (F) and the crossover probability (CR). The two vectors SF and SCR keep CR and F values delivering offspring better than the parent. Gmax is the limit (maximum number) of generations. Finally, a scheme for population size reduction is implemented to facilitate the convergence rate of the metaheuristic algorithm.

3. The Proposed L-SHADE-PWI Optimized SVM for Rainfall-Induced Soil Erosion Prediction

This section aims at describing the structure of the L-SHADE-PWI-SVM model used for rainfall-induced soil erosion. The newly developed model is an integration of machine learning and metaheuristic optimization. In detail, the SVM machine learning is employed to generalize a decision boundary that separates the input space into two distinctive domains: nonerosion (the negative class) and erosion (the positive class). The factors of EI30, slope, OC topsoil, pH topsoil, bulk density, total pore volume, texture-silt, texture-clay, texture-sand, and soil cover rate are employed as influencing factors for pattern classification. Because the model establishment phase of the SVM necessitates an appropriate specification of the penalty coefficient and the kernel function parameter, the proposed model relies on the L-SHADE-PWI to optimize the hyperparameter selection phase.

Figure 2 demonstrates the overall structure of the proposed L-SHADE-PWI-SVM model. A software program based on the model structure has been developed by the authors in Visual C# .NET (framework 4.6.2) environment and the Accord.NET Framework [105]. The newly established model has been tested on the ASUS FX705GE-EW165T (Core i7 8750H and 8 GB RAM) platform.

It is proper to note that the influencing factors of the collected dataset have been randomly separated into a training (90%) dataset and a testing dataset (10%). The first set is used for model construction and the second set is reserved for model validation. Moreover, the Z-score equation has been used to standardize the data range since it may enhance the classification performance [106]. The Z-score equation is given bywhere XZ and XD are the normalized and the original variables, respectively. MX and STDX denote the mean value and the standard deviation of the soil erosion influencing factors, respectively.

As described earlier, the model training and soil erosion prediction phases of the SVM necessitate an appropriate set of the penalty coefficient and the kernel function parameter. The first hyperparameter specifies how the SVM’s loss function increases due to misclassified data samples. The second hyperparameter influences the smoothness of the classification boundary. Thus, these two hyperparameters strongly affect the generalization and predictive accuracy of the SVM-based soil erosion prediction model. This research proposes to use the L-SHADE-PWI metaheuristic for optimizing the performance of the SVM model. At the first generation, the L-SHADE-PWI initializes a population of hyperparameters randomly. In each subsequent generation, this metaheuristic approach gradually explores and exploits the search space to identify potential regions which contains high-quality sets of the SVM model’s hyperparameters.

Relying on the hyperparameters found by the L-SHADE-PWI metaheuristic, the SVM model analyzes the training dataset and generalizes a decision boundary that separates input samples associated with nonerosion and erosion classes. It is noted that this study employs the Accord.NET Framework [105] to carry out the SVM model’s training and prediction phases. Furthermore, to optimize the SVM-based soil erosion prediction model, this research relies on a K-fold cross-validation with K = 5. Based on the cross-validation framework, the dataset is separated into 5 mutually exclusive sets. In each of the five runs, one set is utilized for model verification and the other four sets are used as training samples. The average predictive performance obtained from the 5 folds is used to express the generalization capability of the rainfall-induced soil erosion prediction model. Accordingly, the objective function (F) used for the L-SHADE-PWI metaheuristic-based optimization is given bywhere FNRk and FPRk are the false-negative rate (FNR) and the false-positive rate (FPR) obtained from kth run, respectively.

The FNR and FPR are calculated according to the following equations:where FN, FP, TP, and TN represent the false-negative, false-positive, true-positive, and true-negative data samples, respectively.

4. Model Results and Discussions

4.1. Preliminary Feature Importance Investigation

For the purpose of constructing a robust soil prediction model, a preliminary assessment of the relevance of input features (EI30, slope, OC, pH, bulk density, soil porosity, soil texture (silt, clay, and sand fractions), and soil cover rate) should be carried out. This study relies on the well-known ReliefF method [107] to investigate the importance of each aforementioned soil erosion influencing factor. This step of the study aims at identifying irrelevant influencing factors. Using the ReliefF method, a weighting value is computed for each input factor. A large weight indicates a strong relevance between an influencing factor and the soil erosion status (either erosion or nonerosion). The preliminary assessment of the relevancy of input features is reported in Figure 3. In this figure, the ReliefF weight values of all input variables are shown. It is observed that the input factor X1 (EI30) receives the highest weight value, followed by the input factor X10 (soil cover rate), X3 (OC topsoil), and X9 (sand fraction). The factors X7 (silt fraction) and X8 (clay fraction) have low weight values. However, these weights are still significantly larger than zero. Therefore, the L-SHADE-PWI metaheuristic-optimized SVM model employs all of the ten factors.

4.2. Model Performance Evaluation Metrics

To quantify and compare performances of soil erosion prediction models, the classification accuracy rate (CAR), which is the percentage of correctly classified cases, is often employed:where Nc and Na denote the numbers of correctly classified samples and the total number of data samples, respectively.

Besides CAR, the true-positive rate (TPR) (the percentage of positive instances correctly classified), false-positive rate (FPR) (the percentage of negative instances misclassified), false-negative rate (FNR) (the percentage of positive instances misclassified), and true-negative rate (TNR) (the percentage of negative instances correctly classified) are also employed [56, 108]. The FPR and FNR indices have been defined in the previous section of the article; the TPR and TNR indices are given bywhere TP, TN, FP, and FN represent the true-positive, true-negative, false-positive, and false-negative data samples, respectively.

In addition, based on the results of the TP, TN, FP, and FN, indices of precision, recall, negative predictive value (NPV), and F1 score are also useful for expressing the model predictive capability [109]:

4.3. Experimental Results and Comparison

To evaluate the model performance, the original dataset has been randomly divided into two mutually exclusive sets. Since the size of the collected dataset is moderate, the training/testing data ratio is selected to be 9/1 [110]. This means that 90% of the original dataset is used for model construction; the rest of the dataset is reserved for the model testing phase. The testing set is employed as novel data instance to verify the generalization of the constructed L-SHADE-PWI-SVM model used for rainfall-induced soil erosion. Illustrations of training and testing datasets are provided in Table 2.

In addition, since the model construction phase of the SVM-based model requires the setting of the penalty coefficient and kernel function parameters, the L-SHADE-PWI has been used to automatically fine-tune these hyperparameters. The L-SHADE-PWI-based optimization process is illustrated in Figure 4. It is noted that the metaheuristic population consists of 20 members and the maximum number of generations is 100. The outcomes of the optimization process are as follows:(i)The penalty coefficient(ii)The kernel function parameter

To demonstrate the capability of the proposed L-SHADE-PWI-SVM, the backpropagation artificial neural network (BPANN) [106, 111] and radial basis function artificial neural network (RBFANN) models are employed as benchmark methods. The BPNN and RBFANN models have been constructed in MATLAB environment with the help of built-in functions provided in the Statistics and Machine Learning Toolbox [112].

It is noted that the number of neurons in the hidden layer of the BPANN model is selected to be based on the suggestion in [113], where DX and CN denote the numbers of input features and class outputs, respectively. Herein, the numbers of input features and class outputs are 10 and 2, respectively. Moreover, the BPANN is trained using the Levenberg–Marquardt (LM) algorithms [111, 114]. The sigmoidal activation function is used for the BPANN to train with the LM algorithm (denoted as LM-BPANN). The BPANN model is trained with the maximum number of epochs = 1000. Moreover, based on several trial-and-error experiments with the collected dataset, the appropriate RBFANN model [115] consists of 30 neurons and has been trained with a basic width of 1.2.

The L-SHADE-PWI-SVM as well as the used benchmark models is trained with a training set (90%) and a testing set (10%) randomly drawn from the original datasets. It is noted that since one-time evaluation is not sufficient to verify the model capability due to the effect of the random data sampling process, this study has repeated the random splitting of the dataset into model training and testing phases 20 times to negate such an undesired effect. Accordingly, the prediction results of the L-SHADE-PWI-SVM and benchmark models obtained from 20 runs are reported in Table 3 and Figure 5.

It can be seen that the proposed hybrid machine learning model has achieved the most desired accuracy on soil erosion detection with CAR = 92.292% while LM-BPANN ranks second with CAR = 87.083% and RBFANN ranks third with CAR = 77.917%. L-SHADE-PWI-SVM is also the most reliable predictor with a smaller CAR deviation (5.1 versus 6.6 and 9.9). This is clearly demonstrated in the box plots in Figure 5.

The proposed model is also superior in other metrics such as precision, recall, NPV, and F1-score with better measurement value and less variation (smaller deviation). More specifically, L-SHADE-PWI-SVM has precision = 0.944, recall = 0.900, NPV = 0.904, and F1-score = 0.919, and LM-BPANN is the second best prediction approach with precision = 0.901, recall = 0.854, NPV = 0.868, and F1-score = 0.869, followed by RBFANN with precision = 0.804, recall = 0.758, NPV = 0.769, and F1-score = 0.776.

Moreover, the receiver operating characteristic curve (ROC) and the area under the curve (AUC) [116] are also employed for assessing the prediction model performance [56, 117, 118]. It is because the AUC can express the overall predictive accuracy of the employed classifiers used for soil erosion prediction. This study also computes the AUC to indicate the generalization capability of the proposed L-SHADE-PWI-SVM as well as the benchmark models. The AUC results are reported in Table 4. As can be seen from the experimental results, the AUC value in the testing phase of the proposed L-SHADE-PWI-SVM (0.908) is higher than those of the LM-BPANN (0.898) and RBFANN (0.787). In addition, the ROCs of the proposed model are illustrated in Figure 6.

4.4. Discussion

To further confirm the superiority of the proposed hybrid L-SHADE-PWI-SVM model, the Wilcoxon signed-rank test [119] with the significant level ( value) = 0.05 is also employed in this study to demonstrate the statistical significance of the difference in model results. This nonparametric hypothesis testing method is widely employed for comparing classification models [120]. The test outcomes of pairwise model comparison are reported in Table 5. Observably, with values <0.05, the null hypothesis of equal means is rejected.

Moreover, to assess the reliability of the L-SHADE-PWI-SVM-based soil erosion prediction model, the coefficient of variation (COV) [121, 122] is employed in this section of the study. The COV (%) is computed as follows [121, 122]:

The COV calculation results of the proposed L-SHADE-PWI-SVM and the benchmark approaches used for rainfall-induced erosion susceptibility prediction are reported in Table 6. It is noted that the COV is the ratio of the standard deviation to the mean and is used to quantify the dispersion of a model prediction outcomes obtained from a repeated data sampling process [123]. In the particular case of rainfall-induced soil erosion susceptibility evaluation, a small value of COV is desirable since it indicates a stable data-driven model used for predicting such phenomenon. As shown in Table 6, the proposed L-SHADE-PWI-SVM has achieved the lowest COV in all of the performance measurement metrics (5.535% for CAR, 7.309% for precision, 9.000% for recall, 8.628% for NPV, and 5.767% for F1-score). These outcomes point out that the newly developed model is the most reliable tool for rainfall-induced soil erosion susceptibility assessment.

In conclusion, L-SHADE-PWI-SVM is more accurate and more reliable model compared to LM-BPANN ( values = 0.009816) and RBFANN ( values = 0.000631). Statistical tests also confirm that the superior of L-SHADE-PWI-SVM in terms of accuracy is also significant with values <0.05. This means that the newly constructed L-SHADE-PWI-SVM is highly appropriate for soil erosion prediction in the study area.

Compared to the conventional approaches for soil erosion prediction such as the Revised Universal Soil Loss Equation (RUSLE) [124, 125] which requires significant efforts on parameter calibration, the newly proposed method is entirely data-driven in which all of the model parameters are determined via the model training process. In addition, recently proposed machine learning methods for soil erosion susceptibility prediction have dominantly relied on individual or ensemble of models [126128]. Therefore, trial-and-error processes and modeling experience are required for constructing such machine learning models. An integration of machine learning and metaheuristic used for soil erosion prediction is rarely investigated. Thus, the current study is an attempt to fill this gap in the literature and show the great potentiality of this hybrid framework for tackling the problem at hand.

Nevertheless, one disadvantage of the proposed approach is that the optimization process to determine an optimal set of parameters of the SVM model might be costly, especially when the size of the collected dataset is large. It is because the model training and prediction processes of the SVM are embedded into the cost function calculation phase of the employed metaheuristic. Another limitation of the current approach is that automatic feature selection has not been integrated into the hybridization of the L-SHADE-PWI and SVM. Such drawbacks should be resolved with in future extensions of the study.

5. Conclusions

In tropical regions, soil erosion is a natural hazard that causes various harmful effects on the land including loss of soil, soil structure breakdown, and the decline of organic matter as well as nutrients within the soil. These ultimately lead to a critical economic loss for landowners. This study has developed a hybrid intelligent method, named as L-SHADE-PWI-SVM, for predicting the status of soil erosion based on an integration of machine learning and metaheuristic. The SVM pattern recognition method is employed to generalize a decision boundary that separates input data into two categories of erosion and nonerosion.

The ten variables of EI30, slope, OC topsoil, pH topsoil, bulk density, soil porosity, soil texture (silt, clay, and sand fractions), and soil cover rate are used as erosion conditioning factors. In addition, to optimize the SVM model performance, the state-of-the-art L-SHADE-PWI metaheuristic is employed. The newly developed L-SHADE-PWI-SVM has achieved a good predictive performance (CAR = 92.292%, F1 score = 0.919, and AUC = 0.908) obtained from a repeated data sampling process. The experimental results supported by the Wilcoxon signed-rank test demonstrate that the proposed hybrid model is superior to the benchmark methods including the LM-BPANN (CAR = 87.083%, F1-score = 0.869, and AUC = 0.898) and the RBFANN (CAR = 77.917%, F1-score = 0.776, and AUC = 0.787). The proposed L-SHADE-PWI-SVM has outperformed the benchmark approaches in all of the evaluation metrics. These facts strongly confirm the efficacy of applying the proposed hybrid machine learning method for solving the problem of interest. The L-SHADE-PWI-SVM can be a very promising tool to assist landowners and managers to quickly identify the potential soil erosion areas and develop preventive measures (Table 7).

Future developments of the current study may include the following:(i)The applications of other advanced metaheuristics in optimizing the machine learning model used for soil erosion prediction(ii)The integration of state-of-the-art feature selection into the machine learning structure to further enhance the prediction accuracy(iii)The investigation of advanced kernel functions for better dealing with nonlinearity in soil erosion data classification

Data Availability

The data used to support the findings of this study can be found in Table 7.

Conflicts of Interest

The authors confirm that there are no conflicts of interest regarding the publication.

Acknowledgments

This research was funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant no. 105.08-2017.302.