Abstract

Groundwater plays an important role in global climate change and in satisfying human needs. In this study, RS (remote sensing) and GIS (geographic information system) were used to generate five thematic layers (lithology, lineament density, topology, slope, and river density), which were considered as factors influencing groundwater potential. Then, the multicriteria decision model (MCDM) was integrated with the C5.0 and CART algorithms, respectively, to generate decision trees from 80 surveyed tube wells divided into four classes on the basis of yield. To test the precision of the decision tree algorithms, 10-fold cross validation and the kappa coefficient were adopted; the average kappa coefficient was 90.45% for C5.0 and 85.09% for CART. After the decision trees were applied to the whole study area, four classes of groundwater potential zones were demarcated. According to the classification result, the four grades of groundwater potential zones, “very good,” “good,” “moderate,” and “poor,” occupy 4.61%, 8.58%, 26.59%, and 60.23% of the study area, respectively, with the C5.0 algorithm and 4.68%, 10.09%, 26.10%, and 59.13%, respectively, with the CART algorithm. Given its higher average kappa coefficient, we conclude that the C5.0 algorithm is more appropriate than CART for groundwater potential zone prediction.

1. Introduction

Increasing population and water scarcity have raised the importance of groundwater zones, as they are a major source of freshwater. Integrated remote sensing and GIS are widely used in groundwater mapping. Locating potential groundwater targets is becoming more convenient and cost-effective with the growing availability of satellite imagery. Owing to its wide field of view, short revisit cycle, and increasing spatial resolution, remote sensing based groundwater exploration has made it feasible to explore areas with limited human access.

Much work has been done on the delineation of groundwater potential zones, including in tropical humid regions, such as Tirnavos area, Greece [1], Timor Leste, Indonesia [2], SW, Nigeria [3], and New Delhi [4], and in mid-latitude semiarid areas, such as Boryeong and Pohang, Korea [5, 6], Sultan Mountains (Konya, Turkey) [7, 8], Udaipur, India [9], and Beheshtabad watershed, Chaharmahal-and-Bakhtiari Province, Iran [10]. Through the analysis of the characteristics and factors of these typical regions, we identified basic principles and methods for factor selection, which provided a reference and basis for the study area.

Various researchers have effectively implemented the multicriteria decision model (MCDM) to identify groundwater potential zones accurately [14]. The major factors that influence groundwater potential include lithology, rainfall, slope, drainage density, and lineament density. The factors’ values are mostly continuous; however, previous studies mostly divided the continuous factors into several discrete levels according to their relationship with groundwater potential, causing a great loss of the original information. In our previous research, we established fuzzy membership functions to analyze each factor’s impact on groundwater enrichment from the perspective of continuity [11].

The MCDM is based on either a manual decision method, such as the analytic hierarchy process (AHP) [14, 9], or a machine learning method, such as artificial neural network fitting [5], frequency ratio, weights of evidence, and logistic regression [7, 8], boosted regression tree, classification and regression tree (CART), and random forest [10], or the chi-squared automatic interaction detector and the quick, unbiased, and efficient statistical tree algorithms [12, 13]. The flexibility of the AHP method allows the weights and ratings of parameters to be revised to suit other regions according to their specific characteristics [1]. However, when weights are assigned to the different thematic layers, personal judgment reduces the objectivity of the model [9]. Decision tree techniques provide a multivariate method that is known as a successful automatic classification scheme [14, 15]. The CART algorithm [16] has been applied more widely than other decision tree algorithms. The C5.0 algorithm [16–19], an enhancement of C4.5, shows relatively better classification results because it can handle continuous and categorical variables simultaneously and without bias. The $k$-fold cross validation method [20] is an effective way to assess and improve the precision of the decision tree model; its basic idea is to randomly divide the samples into a training set and a validation set and to repeat the procedure $k$ times to ensure the robustness of the model.

In the current study, the decision tree algorithms C5.0 and CART, each with 10-fold cross validation, were applied to the factors related to groundwater potential (lithology, lineament density, topology, slope, and river density) to generate and test decision tree models for predicting the groundwater potential grade across the whole study area.

2. Materials and Methods

2.1. Study Area and Materials

The study area (shown in Figure 1) is located in the southwestern part of Ritu county, Ali city, with an extent of 79°30′–80°20′E and 33°–33°30′N, and is surrounded by Kailash mountain in the south and Karakoram mountain in the north. The county is in the southeast of the Ban Gong Lake Basin, across the Ban Gong Lake-Nu Jiang fault zone. The area belongs to the wide valley in the plateau and mountain lake basin and provides runoff to Ban Gong Lake in the southeast direction. The study area is dominated by a subfrigid monsoon climate. The temperature ranges from −22.1 to 13.6°C, with an annual average temperature of 0.5°C. The annual sunshine duration is 3370.9 hours, and the frost-free period is 95 days. The annual rainfall is quite low, only 75 mm, whereas evaporation is high, at 2456.3 mm. Ban Gong Lake has a maximum depth of 41.3 m and extends in the east-west direction, with salt water in the midwest and freshwater in the east. The native Ban Gong willow growing along the lake valley helps in soil and water conservation. The lowest catchment depression runs along the Ban Gong Lake-Nu Jiang fault, and beaded lake basin depressions lie between the surrounding mountains. The elevation of the study zone ranges between 4196 and 6240 m, and its area is about 2240 km².

The available data sources include a geology map at a scale of 1 : 250,000 purchased from the National Geological Library; a DEM from the ASTER satellite with a horizontal resolution of 30 m; a Landsat 8 OLI image acquired on 22 May 2013 and downloaded from https://www.usgs.gov/ (cloud cover 2.43%, sun elevation 68.06°, sun azimuth 119.95°); and tube well yield data from a field survey. The 80 investigated tube wells were divided into four grades according to yield (Table 1) [21].

2.2. Methodological Framework

The delineation of groundwater potential zones is carried out from the perspective of hydrogeology, considering the occurrence space and the supply condition. Lithology and lineament density are chosen as the occurrence space factors; topology, slope, and river density are related to the groundwater supply condition. Lithology is a categorical variable, and the rest of the variables are continuous. After the relationship between the factors and the groundwater potential grade was analyzed, the C5.0 and CART algorithms were each applied to generate a decision tree, and 10-fold cross validation was adopted to test the classification accuracy. The workflow is shown in Figure 2.

2.3. Factors Related to Groundwater Potential

Lithology [22] influences the water-holding capacity of an aquifer and directly affects the occurrence and distribution of groundwater. The lithology thematic map was derived by digitizing the 1 : 250,000 scale geology map from the National Geological Library, as shown in Figure 3. The Quaternary (Q), which includes three sedimentary types (alluvial, diluvial, and lacustrine), is distributed in low-lying areas and consists of loose sediment deposited by snowmelt flowing from the high mountains. The Jurassic (J) layers are distributed widely along the east-west direction, with the Late (J3), Middle (J2), and Early (J1) Jurassic lying in the middle, south, and mid-north, respectively. The Late Cretaceous (K2) spreads along the north-south direction. The Early Cretaceous (K1) is mostly distributed in the north along the east-west direction. The Paleogene (E) lies mostly in the west, with scattered occurrences in the central and eastern parts.

The linear faults, accompanied by fractures, provide space for the occurrence of groundwater [23]. Within strata of the same lithology, the intersection of faults promotes fracture development and enhances connectivity, so these intersections tend to be groundwater enrichment zones. The linear structures were extracted from the satellite image based on color discontinuities relative to the surrounding areas. Orthorectification was applied to the Landsat 8 OLI image with the DEM to eliminate the influence of shadows on visual interpretation in the study area. The combination of bands 7/5/3 (SWIR 2/near infrared/green) proved to be the most suitable for the extraction, which was carried out with reference to the geology map in ENVI 5.1. The lineament density shown in Figure 4 was calculated in ArcGIS 10.1 with the “line density” command.

Topography controls the groundwater supply conditions. Mountainous regions favor runoff, so most of the precipitation becomes surface runoff, with minimal infiltration to the groundwater. In contrast, runoff on plains is slower, which facilitates groundwater recharge. The topography map, shown in Figure 5, was extracted from the DEM data in ArcGIS 10.1 at 30 m spatial resolution.

Groundwater flow is usually driven by surface force, and the boundary of the terrain is mostly the boundary of the shallow aquifer. Slope [24] is important in analyzing the terrain, as it affects the storage, flow, and discharge of groundwater, especially in mountainous areas. Slope was extracted from the DEM in ArcGIS 10.1 and is shown in Figure 6. In general, slope controls the infiltration and flow of surface water. A steep slope usually indicates greater water velocity, so in areas of steeper relief the runoff increases and groundwater recharge is reduced. In contrast, on relatively gentle terrain, greater infiltration increases the groundwater potential. Thus, a lower slope results in greater recharge.
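As a rough illustration of how slope can be derived from a gridded DEM, the following sketch applies central finite differences to a small, made-up elevation grid. The function name `slope_degrees`, the sample grid, and the simple differencing scheme are assumptions chosen for illustration; ArcGIS applies its own 3 × 3 neighbourhood method, so its values may differ from this sketch.

```python
import math

def slope_degrees(dem, cell_size=30.0):
    """Slope in degrees for interior cells of a DEM given as a list of rows.

    Uses central differences; border cells are left as None because they
    lack a neighbour on one side.
    """
    rows, cols = len(dem), len(dem[0])
    out = [[None] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            dz_dx = (dem[r][c + 1] - dem[r][c - 1]) / (2 * cell_size)
            dz_dy = (dem[r + 1][c] - dem[r - 1][c]) / (2 * cell_size)
            out[r][c] = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
    return out

# Hypothetical 3 x 3 elevation grid (m) rising 30 m per 30 m cell towards the east.
dem = [
    [4200.0, 4230.0, 4260.0],
    [4200.0, 4230.0, 4260.0],
    [4200.0, 4230.0, 4260.0],
]
slope = slope_degrees(dem)
print(round(slope[1][1], 1))  # 45.0
```

A rise of 30 m over a 30 m cell gives a gradient of 1, that is, a 45° slope; cells below the roughly 2.4° split values reported later in the paper would count as nearly flat.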

Flow accumulation reflects the upstream flow quantity. In the study area, the supply source is mainly melting snow. Flow accumulation was derived from the DEM to generate a stream network. Figure 7 shows that the study area has a dendritic drainage pattern. A dendritic network is usually found in regions underlain by homogeneous material without abrupt changes in geological conditions.

The river density [25] represents the recharge conditions to quantify the influence caused by surface water, where higher density provides better recharge conditions. Based on the flow accumulation, the river density was calculated using the “line density” command in ArcGIS 10.1, as shown in Figure 8.

2.4. Research Method

Decision tree algorithms [16] are suitable for multifactor classification problems. Because the groundwater potential evaluation model mixes continuous and discrete factors, the C5.0 and CART decision tree algorithms were adopted together with the 10-fold cross validation method to improve classification accuracy and reduce bias.

2.4.1. C5.0

The C5.0 decision tree algorithm is rooted in ID3 and C4.5. The ID3 algorithm [26] uses maximum information gain as the splitting criterion at every node, which often gives priority to variables with more classes. To make up for this defect of ID3, C4.5 adopts the gain-ratio criterion [19, 27, 28]. With the gain-ratio criterion, continuous variables are divided by binary nodes. Categorical variables are processed by first treating each category as a branch and then merging pairs of branches iteratively until only two branches remain. However, this heuristic search may not find the best split for categorical variables. For continuous variables, the C5.0 algorithm can easily find the division point [29]. To remove the bias on continuous variables, the algorithm adjusts their gain. In addition, the C5.0 algorithm can simplify an originally complex decision tree into an equivalent tree that is easier to understand. With more splitting layers than other algorithms, it can ensure high purity of the resulting nodes. The C5.0 algorithm achieves self-correction after several iterations with boosting technology [30], which adjusts the decision tree according to the misclassified samples until a high precision is reached [38]. Manual methods emphasize the individual impact of each factor, whereas the C5.0 algorithm considers all the factors together and analyzes their comprehensive influence based on data statistics. The C5.0 algorithm has been widely applied to multivariate classification because of its unbiasedness and precision with continuous and categorical variables compared to other decision tree algorithms [17, 31–37]. The C5.0 algorithm is also more applicable to large data samples.

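Assume the splitting node $T$ is expected to separate the samples into $m$ target categories [27]. The symbol $p_j$ stands for the percentage of category $j$ at node $T$, $j = 1, \ldots, m$. The information contained at node $T$ can be expressed as

$$\mathrm{Info}(T) = -\sum_{j=1}^{m} p_j \log_2 p_j.$$

For a decision tree branch on variable $X$, the $N$ samples at node $T$ are divided into the subsets $T_1$, $T_2$, and $T_3$, with $N_1$, $N_2$, and $N_3$ subsamples, respectively. Then, the information after the split can be expressed as

$$\mathrm{Info}_X(T) = \sum_{i=1}^{3} \frac{N_i}{N}\,\mathrm{Info}(T_i).$$

After combining the above information formulas, the information gain can be expressed as

$$\mathrm{Gain}(X) = \mathrm{Info}(T) - \mathrm{Info}_X(T).$$

To improve the application of the information gain concept, the information gain ratio was put forward:

$$\mathrm{GainRatio}(X) = \frac{\mathrm{Gain}(X)}{\mathrm{SplitInfo}(X)},$$

where

$$\mathrm{SplitInfo}(X) = -\sum_{i=1}^{3} \frac{N_i}{N} \log_2 \frac{N_i}{N}.$$

Practice showed that the information gain ratio favored continuous variables; therefore, the information gain ratio for a continuous variable with $v$ distinct values should be expressed as [16]

$$\mathrm{GainRatio}(X) = \frac{\mathrm{Gain}(X) - \log_2(v - 1)/N}{\mathrm{SplitInfo}(X)}.$$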
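The information gain and gain ratio described in this section can be illustrated with a minimal sketch. The function names, the eight hypothetical well labels, and the split are assumptions made for illustration only; this is not the C5.0 implementation, and it omits the correction for continuous variables.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def gain_ratio(labels, branches):
    """Return (information gain, gain ratio) for splitting `labels`.

    `branches` is a list of non-empty label lists that together contain
    every element of `labels`.
    """
    n = len(labels)
    info_after = sum(len(b) / n * entropy(b) for b in branches)
    gain = entropy(labels) - info_after
    split_info = -sum(len(b) / n * math.log2(len(b) / n) for b in branches)
    return gain, gain / split_info

# Hypothetical node with 8 wells: 4 "good" and 4 "poor".
labels = ["good"] * 4 + ["poor"] * 4
# Hypothetical binary split on a continuous factor (for example, a slope threshold).
left = ["good", "good", "good", "poor"]
right = ["good", "poor", "poor", "poor"]
gain, ratio = gain_ratio(labels, [left, right])
print(round(gain, 3), round(ratio, 3))  # 0.189 0.189
```

Because the two branches are the same size, the split information equals 1 bit and the gain ratio equals the gain; unbalanced splits would have a smaller split information and hence a larger gain ratio for the same gain.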

2.4.2. CART

The CART decision tree algorithm [39] divides a sample set into two subsample sets, so that the root and intermediate nodes each have two branches, based on recursive binary segmentation. CART can handle both continuous and discrete variables and uses the Gini coefficient, a measure of impurity, as the discriminant basis, considering the probability distribution under the division node.

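Assume a total of $J$ classes for the target variable, and let $p(j \mid t)$ denote the proportion of class $j$ at node $t$, $j = 1, \ldots, J$; then the Gini index is defined as [16]

$$\mathrm{Gini}(t) = 1 - \sum_{j=1}^{J} p(j \mid t)^2.$$

When the classes have equal probability at node $t$, the Gini index reaches its maximum, $1 - 1/J$; when only one class is present at node $t$, the Gini index reaches its minimum, 0. The Gini index increases with impurity; therefore, subnodes should be added to lower the impurity. When the misclassification cost matrix is taken into consideration, the Gini index formula becomes

$$\mathrm{Gini}(t) = \sum_{i \neq j} C(i \mid j)\, p(i \mid t)\, p(j \mid t),$$

where $C(i \mid j)$ represents the cost of misclassifying a case of category $j$ as $i$. Assuming that a split $s$ divides node $t$ into the subnodes $t_L$ and $t_R$, the decrease in the Gini index can be expressed as

$$\Delta \mathrm{Gini}(s, t) = \mathrm{Gini}(t) - p_L\, \mathrm{Gini}(t_L) - p_R\, \mathrm{Gini}(t_R),$$

where $p_L$ and $p_R$ stand for the proportions of cases in node $t$ sent to $t_L$ and $t_R$.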
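The Gini index and the impurity decrease of a binary split can be sketched as follows, assuming unit misclassification costs. The function names and the sample labels are hypothetical and serve only to show the arithmetic.

```python
from collections import Counter

def gini(labels):
    """Gini index, 1 minus the sum of squared class proportions, at one node."""
    n = len(labels)
    return 1.0 - sum((k / n) ** 2 for k in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity decrease obtained by splitting `parent` into `left` and `right`."""
    p_left = len(left) / len(parent)
    p_right = len(right) / len(parent)
    return gini(parent) - p_left * gini(left) - p_right * gini(right)

# Hypothetical node holding the four yield grades in equal numbers.
parent = ["very good", "good", "moderate", "poor"] * 2
print(gini(parent))  # 0.75, the maximum for four classes

# Hypothetical binary split separating the two better grades from the rest.
left = ["very good", "good"] * 2
right = ["moderate", "poor"] * 2
print(gini_decrease(parent, left, right))  # 0.25
```

With four equally likely classes the index reaches its maximum of 0.75, and a node containing a single class has an index of 0, in line with the bounds stated above.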

2.4.3. 10-Fold Cross Validation

The basic idea of $k$-fold cross validation [40, 41] is to divide the surveyed samples equally into $k$ parts, of which $k - 1$ parts serve as the training samples and the remaining part serves for validation, and to repeat this procedure $k$ times. The method guarantees that every sample acts as a verification sample exactly once and as a training sample in the other rounds. The commonly used value for $k$ is 10, which gives 10-fold cross validation [42].
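The partitioning step can be sketched in a few lines. The helper `k_fold_indices` and the fixed random seed are assumptions for illustration; the sample count of 80 matches the number of surveyed tube wells, but the actual fold assignment used in the study is not reported.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, validation) index lists for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# 80 samples, matching the number of surveyed tube wells in this study.
splits = list(k_fold_indices(80, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 72 8
```

Each of the ten rounds trains on 72 wells and validates on the remaining 8, and every well appears in exactly one validation fold.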

3. Results and Discussion

In this study, five factors related to groundwater were used in the analysis, and all factors except lithology were continuous. The 80 tube wells were used for training with C5.0 and CART, respectively, in the statistical analysis software SPSS Clementine 12.0 and MATLAB. Based on the confusion matrix [43] constructed from the verification results, the kappa coefficient [44, 45] was used to evaluate the accuracy:

$$K = \frac{N \sum_{i=1}^{r} x_{ii} - \sum_{i=1}^{r} x_{i+}\, x_{+i}}{N^2 - \sum_{i=1}^{r} x_{i+}\, x_{+i}},$$

where $r$ is the total number of categories; $N$ is the total number of samples for verification; $x_{ii}$ is the number of correct classifications for category $i$; $x_{i+}$ is the total of row $i$ of the confusion matrix (the samples of category $i$, including those mistaken for other categories); and $x_{+i}$ is the total of column $i$ (the samples assigned to category $i$, including those mistaken from other categories).

Based on the 10-fold cross validation with the C5.0 and CART algorithms, respectively, the precision of each loop is shown in Table 2. The importance of each factor was calculated first to determine the selection order of the factors for classification, as shown in Table 3. For the C5.0 algorithm, the importance was 0.363, 0.331, 0.159, 0.117, and 0.031 for lithology, topology, river density, slope, and lineament density, respectively. For the CART algorithm, the importance was 0.355, 0.308, 0.024, 0.010, and 0.312, respectively, in the same order. After the ten loops, the decision tree with the higher classification accuracy was chosen as optimal. Figure 9 shows the optimal decision tree generated by the C5.0 algorithm, with 6 layers, 21 nodes, 10 internal nodes, and 11 terminal nodes. Table 4 shows the rules for the optimal decision tree by C5.0. The optimal decision tree generated by the CART algorithm is shown in Figure 10, with 8 layers, 21 nodes, 11 internal nodes, and 10 terminal nodes, and the rules are shown in Table 5.
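As an illustration of how a kappa coefficient is obtained from a confusion matrix, the sketch below implements the standard Cohen's kappa. The function name and the 4-class matrix are hypothetical; the matrix is not taken from Table 2 and only stands in for one validation fold of eight wells.

```python
def kappa(confusion):
    """Cohen's kappa for a square confusion matrix given as a list of rows."""
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / n
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected by chance from the row and column totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical confusion matrix: rows are surveyed grades, columns are predictions.
confusion = [
    [2, 0, 0, 0],
    [0, 2, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 2],
]
print(round(kappa(confusion), 3))  # 0.833
```

Here seven of eight wells are classified correctly (87.5% raw agreement), but kappa is lower (about 0.83) because it discounts the agreement expected by chance.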

According to the classification rules, not every rule takes all the variables into account; therefore, for areas lacking detailed information, the typical factors can still help to predict the groundwater potential grade. The decision trees show that both algorithms can determine the dividing points of the variables from the training data, especially for the continuous variables, which makes the division of intervals more objective and reduces the segmentation error compared with manual division.

We analyzed the decision tree generated by the C5.0 algorithm in detail. Topology was divided into five intervals by six break values: 4196–4301–4316.5–4357–4400.5–6240; groundwater was distributed in the low-lying areas. Slope contained two ranges: 0–2.43–75; flat areas were beneficial to surface water infiltration. River density was classified into four intervals: 0–0.506–0.650–0.784–1.14; it reflected the flow capacity in the region, and higher values were better for groundwater enrichment. Lineament density was divided into two intervals: 0–0.558–1; a higher lineament density indicated more occurrence space for groundwater. For the CART algorithm, in contrast, topology was divided into three intervals: 4196–4304–4400–6240; slope contained two ranges: 0–2.386–75; river density was classified into three intervals: 0–0.685–0.703–1.14; and lineament density was divided into four intervals: 0–0.296–0.335–0.713–1. After the optimal decision trees were applied to the whole study area [6, 41], the groundwater potential zone maps were derived; they are shown in Figures 11 and 12.
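To show how such break values partition a continuous factor, the following sketch assigns an elevation to one of the C5.0 intervals reported above. The helper `interval_index` is hypothetical, and treating each break value as the lower bound of the next interval is an assumption, since the text does not state on which side the boundaries are inclusive.

```python
from bisect import bisect_right

# Interior break values (m) reported in the text for the C5.0 tree.
C50_ELEVATION_BREAKS = [4301, 4316.5, 4357, 4400.5]

def interval_index(value, breaks):
    """Return the 0-based index of the interval that `value` falls into."""
    return bisect_right(breaks, value)

print(interval_index(4250, C50_ELEVATION_BREAKS))  # 0: 4196 to 4301 m
print(interval_index(4350, C50_ELEVATION_BREAKS))  # 2: 4316.5 to 4357 m
print(interval_index(5000, C50_ELEVATION_BREAKS))  # 4: 4400.5 to 6240 m
```

Four interior break values yield five intervals; the same helper could be used with the slope, river density, or lineament density break values of either tree.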

According to the results generated by the decision trees, the “very good” area is mostly located in the broad plain zone with a patchy distribution, where water has sufficient time and space to infiltrate underground; it covers 103.25 km², about 4.61% of the study area, for the C5.0 algorithm and 105.05 km², about 4.68% of the study area, for the CART algorithm. The study shows that low-lying areas with good flow conditions, well-developed stratigraphic gaps, and strong connectivity, like the southwest beaches of Ban Gong Lake, can be the first target for groundwater resources. The “good” area is distributed along the river like a long strip, covering 192.10 km², about 8.58% of the study area, for the C5.0 algorithm and 226.02 km², about 10.09% of the study area, for the CART algorithm. These zones mainly have a planar distribution at the bottom of the diluvial fans on the low mountain and hilly terrain and a banded distribution on the mountain watershed of pluvial valleys; they are situated upstream and on the periphery of the “very good” area and can serve as candidate targets for groundwater exploration. The “moderate” zone, covering 595.51 km², about 26.59% of the study area, for the C5.0 algorithm and 584.53 km², about 26.10% of the study area, for the CART algorithm, spreads around the upstream tributaries. The zone is located on both sides of the river valleys and at the top of the diluvial fans. The “poor” area covers 1349.14 km², about 60.23% of the study area, for the C5.0 algorithm and 1324.40 km², about 59.13% of the study area, for the CART algorithm; it is mostly a mountainous region at high altitude.

4. Conclusions

In this study, the C5.0 and CART algorithms were applied to generate decision trees for predicting the groundwater potential zones from five related factors, and the 10-fold cross validation method was adopted to verify the classification results with the kappa coefficient. The following conclusions can be drawn.

In the study area, the five factors related to groundwater (lithology, topology, river density, slope, and lineament density) were appropriate for predicting the groundwater potential grade; their importance based on the C5.0 algorithm was 0.363, 0.331, 0.159, 0.117, and 0.031, respectively, and based on the CART algorithm it was 0.355, 0.308, 0.024, 0.010, and 0.312, respectively.

Based on the 10-fold cross validation, both C5.0 and CART could be applied to the MCDM with categorical and continuous variables simultaneously, with average kappa coefficients of 90.45% and 85.09%, respectively; the C5.0 algorithm thus showed higher classification accuracy than the CART algorithm.

After the optimal decision trees were applied to the whole study area, the groundwater potential zone maps were delineated. The four grades of groundwater potential zones, “very good,” “good,” “moderate,” and “poor,” occupied areas of 103.25 km², 192.10 km², 595.51 km², and 1349.14 km² (4.61%, 8.58%, 26.59%, and 60.23%), respectively, for C5.0 and areas of 105.05 km², 226.02 km², 584.53 km², and 1324.40 km² (4.68%, 10.09%, 26.10%, and 59.13%), respectively, for CART.

The study results can provide a reference for groundwater exploration; in future work we will consider more related factors and survey more wells to enrich the model. Because prior knowledge is lacking over the large area, the integration of decision tree algorithms and MCDM in our study applies only to qualitative assessment; therefore, extra analysis is needed for the investigation of specific points. The accuracy demonstrates that 10-fold cross validation is suitable for training and verifying the decision tree; however, the tested dataset is limited, and more tube wells should be investigated to validate the stability of the model.

Competing Interests

The authors declare that there is no conflict of interests in this research work.

Acknowledgments

This work was supported by Development Program of China: Groundwater Exploration Technology in the Water Shortage Region (863 Program 2012AA062601).