Abstract

Soil moisture-holding capacity data are required in modelling agrohydrological functions of dry subhumid environments for sustainable crop yields. However, they are hardly sufficient and costly to measure. Mathematical models called pedotransfer functions (PTFs) that use soil physicochemical properties as inputs to estimate soil moisture-holding capacity are an attractive alternative but limited by specificity to pedoenvironments and regression methods. This study explored the support vector machines method in the development of PTFs (SVR-PTFs) for dry subhumid tropics. Comparison with the multiple linear regression method (MLR-PTFs) was done using a soil dataset containing 296 samples of measured moisture content and soil physicochemical properties. Developed SVR-PTFs have a tendency to underestimate moisture content with the root-mean-square error between 0.037 and 0.042 cm3·cm−3 and coefficients of determination (R2) between 56.2% and 67.9%. The SVR-PTFs were marginally better than MLR-PTFs and had better accuracy than published SVR-PTFs. It is held that the adoption of the linear kernel in the calibration process of SVR-PTFs influenced their performance.

1. Introduction

Sustainability of crop yields in dry subhumid zones of marginal agricultural productivity requires integrated modelling approaches to provide the necessary feedback for adapting agrohydrological functions to changing seasonal soil moisture regimes [1]. Soil moisture-holding capacity is an important parameter for modelling moisture availability. It is a measure of the difference between moisture at field capacity and wilting point [2]. Moisture-holding capacity facilitates the description of soil hydrological processes such as drainage, infiltration, and percolation and is vital input data in models such as Soil Water Assessment Tool (SWAT) [3], and AQUACROP [4]. However, soil moisture data are generally in limited supply for tropical soils [5, 6], largely due to high costs of measurement and lack of associated equipment [6, 7].

Mathematical equations known as pedotransfer functions (PTFs), linking easily measured soil properties as input variables to soil moisture data, have been employed to bridge data gaps. With extensive development for temperate soils [8], PTFs application is fraught with specificity to calibration datasets [1] and geographic regions [8, 9]. Tropical soils have a bimodal particle-size distribution in contrast to the unimodal soils of the temperates [5, 10], with maximal weight percentage for clay- and sand-size fractions and low silt content [5]. This is suggested to impart contrasting soil hydraulic characteristics [5, 1113], limiting transferability of PTFs for modelling processes across their statistical and pedoclimatic calibration bounds [9, 14].

Utility of PTFs necessitates validation or development of new PTFs for improved modelling outputs [9, 12]. Studies to this end for tropical soils in sub-Saharan Africa include Wosten et al. [7], Botula et al. [11], Young et al. [15], Mdemu and Mulengera [16], Mugabe [17], Obalum and Obi [18], and Mdemu [19]. All these studies have drawbacks including evaluation on small soil datasets or compiled soil databases or frequent application of the multiple linear regression method. Among the many PTF development methods, the multiple linear regression method has been highlighted to be inadequate in capturing the nonlinearity associated with moisture-holding properties [14, 20]. An insufficient data size has been reported to be a major weakness for PTF evaluations [2123]. Substantial uncertainty also exists with soil databases used to derive the PTFs [12], probably associated with data entry or measurement inconsistencies [5, 12].

Machine learning algorithms generally have better flexibility in mimicking the complex nonlinear pattern in the soil moisture continuum [11]. Enhanced computational efficiency of computers has spiralled the advancement of sophisticated machine learning algorithms such as artificial neural networks (ANNs) [9, 24], k-nearest neighbour [11, 21, 25], and support vector machines (SVMs) [20, 2628]. Interest here is skewed to the SVMs because they are easier to implement than the popular ANNs [29] and have circumvented typical drawbacks associated with ANNs [20, 30, 31]. Results from ANNs are nonunique, highly dependent on the initialisation parameters, require a relatively large dataset for effective training, and often end in overfitting [30].

Support vector machines are a supervised machine learning algorithm based on statistical learning theory [32, 33], developed for data classification in [34] and later extended to solve regression problems [30, 31, 33]. The key advantage of the SVMs is structural risk minimisation over the empirical risk minimisation which checks overfitting during model development [29, 33]. Lamorski et al. [35] and Twarakavi et al. [20] pioneered the use of SVMs in the development of parametric and point PTFs, reporting improvements over ANNs. There is currently stimulated interest in the use of SVMs for PTF development [25, 26, 31], but with no evident work for sub-Saharan Africa soils. Flexibility of SVMs in incorporating limited soil data [27] would be of added benefit particularly for sub-Saharan countries, where soil data are limited but in high demand for developing sustainable farming systems. In view of this, the objective of this research was to apply support vector machines to develop pedotransfer functions for moisture-holding capacity using experimentally measured data.

2. Materials and Methods

2.1. Study Area

The study area was the Ilakala village in Kilosa District, Morogoro Region, Tanzania (Figure 1), within latitudes 7°5′30″S and 7°9′30″S and longitudes 36°50′30″E and 36°57′30″E. It has a total area of about 44 km2. Agriculture (both cropping and livestock keeping) is the major livelihood activity in the area. The cropping system is a maize-sesame-pigeon pea small-holder system, with maize and pigeon peas as the main food crops. Sesame is a cash crop. Livestock keeping is typically undertaken by pastoralist communities of Masai and Sukuma ethnicities.

Major soil types traversing the study area are Hyperdystric Cambisol (loamic and ochric), Rhodic Acrisol (clayic, cutanic, epieutric, and profondic), Luvic Stagnic Umbrisol (endoeutric and loamic), Endogleyic Protovertic Eutric Cambisol (colluvic and ruptic), and Pellic Vertisol (ferric, humic, and mesotrophic) [36]. Many seasonal streams drain the area from the hilly regions in the southwest and western edges, feeding into River Mhenda that flows along the eastern edge of the village.

2.2. Soil Sampling and Analysis

A soil dataset of 296 samples collected between June 2014 and July 2015 was used in this study. Soil samples were taken from 100 locations at three depths (0–30 cm, 30–60 cm, and 60–100 cm). The 100 cm soil depth interval covers the complete root zone essential for available water for crop growth [37]. However, soil samples at the 60–100 cm depth interval were not taken at four sampling locations due to rockiness. Bulk soil samples were air-dried and crushed and sieved through a 2 mm sieve. Sieved soil samples were then analysed in the laboratory for particle-size distribution and organic carbon. Particle-size fractions were determined by the Bouyoucos hydrometer method [38] and separated according to the United States Department of Agriculture (USDA) particle-size classification system [39]. Organic carbon was determined by the wet oxidation method of Walkley and Black [40]. Duplicate undisturbed soil core samples in 100 cc Kopecky rings with height and diameter dimensions of 5 cm were used to determine soil moisture at field capacity (FC) and moisture content at wilting point (WP) using a pressure plate apparatus. Soil matric suctions of 30 kPa and 1500 kPa were used for FC (θ30) and WP (θ1500), respectively. The very soil core samples were used to determine bulk density (BD) after drying the soil core samples at 105°C for 24 hours [41].

2.3. Descriptive Statistical Analyses

The soil dataset was randomly split into a ratio of 2 : 1 for a training dataset () and a testing dataset (), respectively. Descriptive statistics, normality tests, and correlation analyses were performed for constitutive soil variables in the dataset (BD, OC, sand, clay, and silt contents, FC, and WP) using the R statistical software [42].

2.4. PTFs Development

The training dataset was used for SVR model calibration. Input θ30 and θ1500 data were log-transformed prior to model development for both MLR and SVR methods. This was necessary for the target variables to conform to a normal distribution. Epsilon support vector regression (ε-SVR) was used for development of SVR-PTFs in the R software package e1071 [43]. Mathematical formulation of SVR is elegantly explained in the studies [20, 27, 33, 44]. Success of SVR calibration depends on three key issues: (1) selection of a suitable kernel function, (2) choice of the cost/regularisation parameter C, and (3) the “tube” insensitivity variable ε [20]. Table 1 shows some common kernel functions for SVR. The Gaussian radial basis function (RBF) kernel has been used most frequently [20, 25, 26, 35], but a linear kernel was chosen for this study because of the overfitting challenges reported with the RBF kernel [28].

The parameters C and ε are known as the hyperparameters and their optimisation determines how good the SVR model is, while “γ,” “r,” and “d” are kernel parameters. The parameter C determines the tolerance of the calibration prediction error and structural complexity of the SVR model. With large C values, higher penalties are assigned to the calibration error, resulting in model complexity and a computationally inefficient model with a low generalisation capability. The ε parameter controls the loss function which controls the width of the insensitive zone leading to minimisation of the regression risk. Large values of ε lead to smaller numbers of support vectors and poor generalisation.

The SVR calibration procedure was carried out in three steps: first, the training dataset was used to initially fit the SVR model with the linear kernel function through epsilon regression in the R software package e1071 [43]. Linear kernel functions have only two hyperparameter values that require setting, that is, the C and ε parameters. The default package C parameter value (C = 1) was retained for the initial fit, while the ε parameter was set to the following equation [45]:where is the number of records in the training dataset and is the standard deviation of the data.

In the second step, tuning of the SVR model hyperparameters was performed using a grid-search method with a 10-fold cross validation in 5 repeats. The grid-search method facilitates optimisation of hyperparameters by estimating the training prediction error for each set of all possible combinations of hyperparameters within the feasible feature space [20]. With insights from earlier studies [31, 35], the parameter search space was a priori set to 0.001 ≤ C ≤ 1000 at an incremental ratio of 10 and 0 ≤ ε ≤ 0.3 at steps of 0.001. Subsequent fine tuning was performed using a parameter search space within the neighbourhood of the best optimised hyperparameters from the second step. This process generated the best optimal hyperparameters that were ultimately used for developing the SVR-PTFs in the third step.

Multiple linear regression- (MLR-) PTFs were also developed for comparison purposes. Stepwise regression was used to develop the MLR-PTFs using the SPSS software package version 20 [46]. Both the SVR-PTFs and MLR-PTFs were then applied to the testing data to assess their validity. Performance of the developed PTFs was evaluated using the root-mean-square error (RMSE), mean error (ME), and coefficient of determination (R2) as indicators. The RMSE, ME, and R2 indices were calculated using (2)–(4), respectively. The RMSE and ME should ideally be close to zero, while R2 should be close to one:where is the measured moisture content, is the predicted moisture content, is the mean of the measured moisture content, and is the number of datasets.

3. Results and Discussion

3.1. Descriptive Statistics of Soil Datasets

The training and testing datasets were distributed within seven USDA soil texture classes (Figure 2). Most samples are coarse textured with more than 50% of the samples either of a sandy clay loam or a sandy loam soil textural class.

Table 2 shows the summary statistics of the training and testing datasets. Across both datasets, bulk density ranged from 1 to 1.19 g·cm−3. Organic carbon ranged from 0.06% to 3.37%, and clay, sand, and silt contents ranged from 0.1% to 63.6%, 20% to 96.6%, and 1.4% to 35.4%, respectively. Moisture content at FC (θ30) ranged from 0.08 to 0.48 cm3·cm−3, while moisture at WP (θ1500) ranged from 0.03 to 0.39 cm3·cm−3. Mean values of training and testing datasets were similar for all soil variables. Although the skewness indices were consistent with a Gaussian distribution [47], kurtosis values were nonoptimal for an assumption of normality to be held [48].

Table 3 shows the correlation coefficients for the soil physicochemical properties on moisture content at FC (θ30) and moisture content at WP (θ1500). Sand and clay had a strong correlation (r > 0.7) but with opposite polarity for both FC and WP. Bulk density, OC, clay, and sand had highly significant correlations with moisture content at θ30 and θ1500. Silt was poorly correlated with θ30 and θ1500 with r < 0.07. Organic carbon was positively correlated with moisture content at both suction extremes. Organic carbon content influences moisture retention properties due to its role in many other physical and physicochemical soil properties. Higher OC content improves soil structure and porosity, leading to increased moisture-holding capacity [3]. Organic matter also has high cation-exchange capacity and high specific surface area which enhances its moisture absorption capacity [49, 50].

3.2. PTFs Development

The initial fit generated SVR models with support vectors ranging from 188 to 191 at hyperparameter settings of C = 1 and ε = 0.034 derived from (1) (results not shown). This translated to about 95%–96% of the total support vectors used in model formulation, suggesting poor generalisation of the models with this initial choice of hyperparameters. The number of support vectors within the SVR model signifies its suitability for predictions on a new dataset. A larger proportion of support vectors lead to overfitting of the model and poor predictions on a new dataset, while a smaller proportion lead to underfitting [20]. A 50% threshold has been held as the theoretically optimal proportion of support vectors for good generalisation on new datasets [20, 28].

Figures 3(a)3(h) show model sensitivity with varying SVR hyperparameters’ combinations and increasing soil predictor variables during the coarse grid-search tuning process. The cross-validation error for FC SVR models ranged between 0.035 and 0.08 (Figures 3(a)3(d)), respectively, corresponding to model types FC1 to FC4 (Table 4), while for WP SVR-models, the cross-validation error ranged between 0.04 and 0.14 (Figures 3(e)3(h)) for model types WP1 to WP4. Models were most sensitive to values of C parameters, generally with lower C values (C < 10−2) leading to higher errors for both WP and FC SVR models. The WP models were most sensitive to the hyperparameters than the FC models, with the CV errors of the WP models almost twice those of FC models for the same predictor variables (Figure 3(a) versus Figure 3(e), Figure 3(b) versus Figure 3(f), and Figure 3(c) versus Figure 3(g)). However, that was not the case for the models FC4 and WP4, which showed similar CV errors because the model WP4 had an extra predictor variable (OC) than model FC4.

The pattern of higher CV errors for the WP SVR models was perhaps related to the input predictor variables in the model. Apparently, inclusion of sand as a predictor variable in WP models is counterintuitive as its adsorption of moisture at high matric suctions (i.e., 1500 kPa) is negligible. At high soil matric suctions, forces of molecular attraction (specific surface area and capillary forces) are greatly responsible for moisture retention in the soil matrix [49, 51]. Specific surface area is highly dependent on the soil particle-size distribution or texture class [52]—finer soil particle sizes have the highest specific surface area and sand the least. However, high correlation of sand with θ1500 (Table 3) informed its inclusion as a predictor for WP. This observed correlation could be linked to modification of surface properties of sand grains via fine-sized coatings (e. g. clay) thereby enhancing positive interactions with residual moisture.

Table 4 shows the most optimal hyperparameters from the coarse grid-search (1st) and fine grid-search (2nd) processes, with their corresponding cross-validation errors (CV errors) and number of support vectors (SVs). A lower number of SVs were evident for the SVR models after the 2nd tuning except for the FC4 model.

Table 5 shows the coefficients for the MLR-PTFs. All input variables were statistically significant. The tolerance and VIF scores indicate that the soil variables included as inputs were important predictors for moisture retention at FC and WP. A tolerance score >0.1 and VIF <10 indicate absence of multicollinearity and hence model parsimony. This implies that only the most influential predictor variables were objectively retained in the regression model. Bulk density has an inverse influence on the prediction of moisture retention. Increase in bulk density results in the destruction of the pedostructure and pore architecture, leading to a reduction in the available volume for soil moisture storage. Sand as a predictor variable had an inverse and the least influence on the moisture predictands in the MLR model. This trend could be explained by increases in soil macropores associated with sandy soils, which results in a decline in moisture retention [53]. Furthermore, sand particle fractions have a low cation-exchange capacity [54], which results in limited adsorptive sites for retaining moisture [49]. The small β coefficient for OC could have been because of the calibration of the model on a dataset with a low OC content. The average value for OC content in the training dataset was 0.8% (Table 2), which corresponds to a rating class of very low [55].

3.3. Evaluation of PTFs

Table 6 shows the performance indicators for the SVR models with varying predictors. The RMSE values ranged from 0.037 cm3·cm−3 to 0.042 cm3·cm−3. These RMSE values suggest good model accuracy, given that typical RMSE values for PTFs are reported to range within 0.02 and 0.07 cm3·cm−3 [23]. The MEs for the developed SVR models except for the model FC3 were greater than zero, indicating a tendency to underestimate moisture at FC and WP. Coefficients of determination (R2) were between 56.2% and 67.9% but slightly higher for the SVR models for wilting point (WP2, WP3, and WP4) than the field capacity models (FC2, FC3, and FC4). The best SVR model was FC3 for moisture prediction at field capacity with sand, clay, bulk density, and organic carbon as predictors. For wilting point, the model WP4 was the best performing model with sand, bulk density, and organic carbon as predictors. The developed models explain a substantial proportion of variance of the data and provide satisfactory quantitative estimates of moisture. According to [56], models with R2 values of 0.50 to 0.65 show good discrimination between low and high values, while those within 0.66–0.81 indicate approximate quantitative predictions, 0.82–0.90 indicate good prediction, and >0.91 indicate excellent prediction.

Unit plots of SVR- and MLR-predicted moisture content on the testing dataset are shown in Figure 4. The best performing SVR-PTFs (SVR-FC3 and SVR-WP4) were compared here. The R2, ME, and RMSE values were marginally better for the SVR-PTFs (Figures 4(a) and 4(c)) than the developed MLR-PTFs (Figures 4(b) and 4(d)). These results are comparable to those of Nguyen et al. (2017) who also found marginal differences between MLR-PTFs and SVR-PTFs. They attributed the observed good performance of MLR-PTFs fitted by the least-squares approach to model stability which results in low variance.

Evaluation indices (R2, ME, and RMSE) were also better at wilting point than at field capacity for both SVR and MLR models. Miháliková et al. [57] also found moisture content predictions to be more reliable at WP than at FC. The possible reason for this trend could be linked to the fact that moisture content at higher matric potentials (i.e., FC) is controlled by numerous soil factors which results in large variability within measurements. In contrast, moisture at wilting point is mainly influenced by specific surface area of the soil constituents which minimises variability in measurement values. The positive ME values indicate that the PTFs tend to underestimate moisture content. Although the ME of SVR-FC3 (ME = 5.0 × 10−4) might suggest an unbiased model, this result was due to the deviations above and below the line of fit cancelling out.

The RMSE values for both FC and WP were 0.037 cm3·cm−3 for the SVR-PTFs and 0.038 cm3·cm−3 for the MLR-PTFs. The RMSE values of the SVR-PTFs developed in this study were lower than those reported in similar studies [20, 25, 26, 28, 35] (Nguyen et al., 2017), for matric suctions at or near FC or WP. Explanation for this is not clear-cut and can only be suggested because PTF results are highly bound by datasets [14, 22, 31]. It is surmised that the observed RMSE trend is associated with the kernel function adopted in the SVR model development. This is plausible as a linear kernel was adopted in this study, while the radial basis function (RBF) kernel was used in the studies highlighted. This is corroborated by Lamorski et al. [28] who compared the linear and RBF kernels and found that the latter led to overfitting with high RMSE and poor generalisation capability on independent data when used in development of SVR-PTFs. Ben-Hur et al. [44] also observed that the use of nonlinear kernels (Gaussian RBF or polynomials) only provided marginal improvements in accuracy compared to the linear kernel. Lamorski et al. [28] attributed this to high sensitivity of SVR models to the RBF kernel width parameter (γ) (Table 1). The Gaussian RBF kernel width parameter determines the flexibility of the SVR in fitting the data, with small gamma (γ) values leading to overfitting and reduced accuracy [44]. On the contrary, it has also been shown that large γ parameters beyond an optimal threshold resulted in unrealistic R2 values (R2 = 1), which indicates overfitting [28]. Such results occur when a model starts to describe the random error in the data rather than the relationships between variables resulting in suboptimal performance outside the original training dataset [58].

Another possible explanation could be the variations in dataset characteristics used in the different studies as well as the predictors adopted for the SVR. Differences in measurement approaches and textural composition of the samples in datasets induce variability which affects the quality of PTF outputs [12, 31]. Including additional predictors to the particle-size fractions improved the accuracy of the SVR-models [20, 25, 26]. A similar trend was observed in this study. However, careful consideration is needed to avoid including difficult-to-measure soil properties as predictors.

4. Conclusions

This study was undertaken to develop SVR pedotransfer functions for estimating soil moisture-holding capacity for dry subhumid soils. Performance indices for MLR-PTFs were comparable to SVR-PTFs. The SVR-PTFs developed in this study performed slightly better than published SVR-PTFs. The linear kernel appeals for developing SVR-PTFs. However, further evaluations in this respect will be needed to establish the most optimal kernels to utilise for PTF development in view of popular application of the Gaussian radial basis function.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This research work was supported with a Ph.D. fellowship grant to Jacob Kaingo under the Soil Health Training Programme of the Alliance of a Green Revolution in Africa (AGRA) and Trans-Sec Project. The research article is based on the Ph.D. thesis of Jacob Kaingo. The authors extend gratitude to the smallholder farmers in the Ilakala village who gave permission to sample their fields. The authors appreciate the financial support from AGRA and Trans-Sec Project towards this research.