Abstract

Landslides are one of the most devastating natural hazards causing huge loss of life and damage to properties and infrastructures and adversely affecting the socioeconomy of the country. Landslides occur in hilly and mountainous areas all over the world. Single, ensemble, and hybrid machine learning (ML) models have been used in landslide studies for better landslide susceptibility mapping and risk management. In the present study, we have used three single ML models, namely, linear discriminant analysis (LDA), logistic regression (LR), and radial basis function network (RBFN), for landslide susceptibility mapping at Pithoragarh district, as these models are easy to apply and so far they have not been used for landslide study in this area. The main objective of this study is to evaluate the performance of these single models for correctly identifying landslide susceptible zones for their further application in other areas. For this, ten important landslide affecting factors, namely, slope, aspect, curvature, elevation, land cover, lithology, geomorphology, distance to rivers, distance to roads, and overburden depth based on the local geoenvironmental conditions, were considered for the modeling. Landslide inventory of past 398 landslide events was used in the development of models. The data of past landslide events (locations) was randomly divided into a 70/30 ratio for training (70%) and validation (30%) of the models. Standard statistical measures, namely, accuracy (ACC), specificity (SPF), sensitivity (SST), positive predictive value (PPV), negative predictive value (NPV), Kappa, root mean square error (RMSE), and area under the receiver operating characteristic curve (AUC), were used to evaluate the performance of the models. Results indicated that the performance of all the models is very good (AUC > 0.90) and that of the LR model is the best (AUC = 0.926). 
Therefore, these single ML models can be used for the development of accurate landslide susceptibility maps. Our study demonstrated that the single models which are easy to use and can compete with the complex ensemble/hybrid models can be applied for landslide susceptibility mapping in landslide-prone areas.

1. Introduction

Landslides are the most common natural disasters in the hilly and mountainous areas all over the world. Landslides are the downward movement of rock mass/groundmass/rock blocks by gravity [1]. Rainfall, earthquakes, and slope excavation are triggering factors for the occurrence of landslides [2]. Some of the influencing factors of landslides include the topography, geology, hydrology, and land use pattern of the area [3]. In the recent decade, landslide events have increased in both magnitude and frequency due to climate change effect reflected in rainfall patterns [4, 5]. Therefore, it is desirable to identify landslide-susceptible zones for better landslide management and disaster reduction [6].

Nowadays, single, ensemble, and hybrid machine learning (ML) methods are used in conjunction with remote sensing (RS) and geographic information systems (GIS) for landslide susceptibility studies [7–11]. Some of the single methods include support vector machine (SVM), kernel logistic regression (KLR), logistic model tree (LMT), classification and regression trees (CRT), function trees (FT), alternative decision tree (ADT), artificial neural network (ANN), Naïve Bayes (NB), Fisher’s linear discriminant function (FLDF), reduced error pruning trees (REPT), decision tree (DT), neurofuzzy (NF), and adaptive regression splines (ARS). In addition, many researchers have used ensemble and hybrid ML techniques, which combine ensemble methods such as Bagging, Dagging, MultiBoost, AdaBoost, and Random Forest (RF), as well as optimization algorithms such as biogeography-based optimization (BBO), genetic algorithm (GA), gray wolf optimizer (GWO), and particle swarm optimization (PSO), with single ML algorithms for landslide susceptibility mapping and prediction [12–14]. These methods are being used for better land use planning and infrastructure development besides the development of early warning systems [15, 16]. Because single ML models are simpler to use than ensemble and hybrid models, there is a need to reevaluate their ability to produce good landslide susceptibility maps for application in landslide-prone areas. The literature survey indicates that some of the single ML algorithms can compete with complex ensemble and hybrid models and can be used for the prediction of landslide susceptibility [17–19]. Therefore, in the present study, we have used the single linear discriminant analysis (LDA), logistic regression (LR), and radial basis function network (RBFN) methods for landslide susceptibility mapping in Pithoragarh district of Uttarakhand state, India.
Landslide influencing factors, namely, slope, aspect, curvature, elevation, land cover, lithology, geomorphology, distance to rivers, distance to roads, and overburden depth, were used for landslide susceptibility modeling. Distance to faults/lineaments has not been considered in this study as a separate factor because distance from rivers which are structurally controlled (faults/lineaments) has been considered in the present model study to avoid repetition. Moreover, rainfall has not been considered in the model study as most of the landslides in the area occur during rains; thus, it has been considered as a triggering factor.

Popular statistical measures such as accuracy (ACC), specificity (SPF), sensitivity (SST), positive predictive value (PPV), negative predictive value (NPV), Kappa index, root mean square error (RMSE), and area under the receiver operating characteristic curve (AUC) were used for the evaluation and comparison of the performances of the studied models. The main objective of the study is to select the best single model for accurate and reliable landslide susceptibility mapping in the hilly and mountainous areas. Weka and GIS software have been used for the data processing and modeling.

2. Methods Used

2.1. Linear Discriminant Analysis (LDA)

LDA is known as a feature selection and dimensionality reduction method that optimizes the number of factors used for modeling or classification processes. One of the functions of the LDA model is to create an optimal space in which the data carry the most discriminative information [20]. In ML, the purpose of dimensionality reduction is to project d-dimensional data onto a k-dimensional subspace, where k ≤ d. This step improves computational efficiency while reducing noise and redundant information and preserving as much information about the original data as possible [21]. Dimension reduction can be done by searching for a linear transformation matrix W (d ≤ n) such that the scatter between the classes is maximized and, at the same time, the scatter within each class is minimized. In other words, LDA is a supervised algorithm [22] which calculates the linear discriminant by maximizing the distance between classes and minimizing the distance within classes. In their standard form, the between-class scatter matrix SB and the within-class scatter matrix SW can be given by

$$S_B = \sum_{c=1}^{C} n_c (\mu_c - \mu)(\mu_c - \mu)^T, \qquad S_W = \sum_{c=1}^{C} \sum_{x_i \in c} (x_i - \mu_c)(x_i - \mu_c)^T,$$

where C is the number of classes, n_c is the number of samples in class c, µ_c is the mean of class c, and µ is the overall mean of the data.

The LDA model then solves the optimization problem as follows:

$$W^{*} = \arg\max_{W} \operatorname{tr}\left( (W^T S_W W)^{-1} (W^T S_B W) \right),$$

where tr and T denote the trace and the transpose of a matrix, respectively.
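For the two-class case considered in landslide modeling (landslide versus nonlandslide), the LDA criterion has a well-known closed-form solution in which the discriminant direction is proportional to SW⁻¹(µ1 − µ0). The following is a minimal sketch of that special case in plain Python, not the Weka implementation used in this study. The two features (slope in degrees and distance to road in meters) and all sample values are hypothetical and serve only to illustrate the computation.

```python
# Two-class Fisher/LDA sketch with two features, using only the standard library.
# All sample values are hypothetical: (slope in degrees, distance to road in m).

def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(2)]

def within_scatter(groups):
    """S_W: sum over classes of (x - mu_c)(x - mu_c)^T, returned as a 2x2 list."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for pts in groups:
        mu = mean(pts)
        for p in pts:
            d = [p[0] - mu[0], p[1] - mu[1]]
            for a in range(2):
                for b in range(2):
                    s[a][b] += d[a] * d[b]
    return s

def fisher_direction(class0, class1):
    """Discriminant direction w proportional to S_W^{-1} (mu_1 - mu_0)."""
    sw = within_scatter([class0, class1])
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    mu0, mu1 = mean(class0), mean(class1)
    diff = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def project(w, p):
    """Project a 2-D point onto the discriminant direction."""
    return w[0] * p[0] + w[1] * p[1]

non_landslide = [(10.0, 900.0), (15.0, 700.0), (12.0, 800.0), (18.0, 850.0)]
landslide = [(40.0, 90.0), (50.0, 150.0), (45.0, 60.0), (55.0, 120.0)]

w = fisher_direction(non_landslide, landslide)
# Decision threshold: midpoint between the two projected class means.
threshold = 0.5 * (project(w, mean(non_landslide)) + project(w, mean(landslide)))
```

A point is assigned to the landslide class when its projection exceeds the threshold. With more than two features, the 2 × 2 inverse would be replaced by a general linear solver.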

2.2. Logistic Regression (LR)

LR is a supervised learning algorithm mainly used for binary classification. In landslide modeling using this method, the presence of landslides is coded as “1” and the absence of landslides as “0” [23]. When LR is fitted to a regression or classification problem, a cost function is first constructed. Afterwards, the optimal model parameters are identified through an iterative nonlinear optimization procedure [24]. Finally, the quality of the model is tested. The prediction function is based on the sigmoid function [25]:

$$g(x) = \frac{1}{1 + e^{-x}},$$

where x is an influencing factor.

The prediction function of logistic regression is derived by the following equation:

$$h_{\omega}(x) = g(\omega^T x) = \frac{1}{1 + e^{-\omega^T x}},$$

where ω denotes the weights assigned to the input variables.

The loss function is determined as follows:

$$J(\omega) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \ln h_{\omega}(x_i) + (1 - y_i) \ln\left(1 - h_{\omega}(x_i)\right) \right],$$

where m is the size of the dataset, xi is the feature vector of the i-th data instance, and yi is the class label of the i-th data instance. From the previous equations, it can be shown that the basic concept of logistic regression is the natural logarithm of the odds ratio [26, 27].

The regression parameter ω is the one that minimizes the loss function, and it is obtained by the gradient descent method [28]. The gradient descent algorithm is used in ML to find a locally optimal solution of a function. In the gradient descent method, the gradient of the above loss function can be derived as follows:

$$\frac{\partial J(\omega)}{\partial \omega_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_{\omega}(x_i) - y_i \right) x_{ij},$$

where x_{ij} is the j-th component of xi.

The parameter can then be updated in the direction of the negative gradient as

$$\omega_j \leftarrow \omega_j - \eta \frac{\partial J(\omega)}{\partial \omega_j}.$$

Substituting the gradient, with the learning rate η, the parameter ω is updated as

$$\omega_j \leftarrow \omega_j - \frac{\eta}{m} \sum_{i=1}^{m} \left( h_{\omega}(x_i) - y_i \right) x_{ij}.$$
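The gradient descent procedure described above can be sketched as follows. This is a minimal illustration using the standard cross-entropy loss of logistic regression; it is not the Weka implementation used in this study. The toy data, the learning rate `eta`, and the number of steps are assumptions chosen only for demonstration.

```python
import math

def sigmoid(z):
    """Sigmoid function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Predicted landslide probability for one feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))

def loss(w, X, y):
    """Mean cross-entropy loss over all m instances."""
    total = 0.0
    for xi, yi in zip(X, y):
        h = predict(w, xi)
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -total / len(X)

def gradient(w, X, y):
    """Gradient of the loss: (1/m) * sum((h(x_i) - y_i) * x_ij)."""
    m = len(X)
    g = [0.0] * len(w)
    for xi, yi in zip(X, y):
        err = predict(w, xi) - yi
        for j, xij in enumerate(xi):
            g[j] += err * xij / m
    return g

def fit(X, y, eta=0.5, steps=2000):
    """Plain gradient descent starting from zero weights."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        g = gradient(w, X, y)
        w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w

# Hypothetical toy data: [bias term, slope scaled to 0..1]; 1 = landslide, 0 = no landslide.
X = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.35], [1.0, 0.6], [1.0, 0.8], [1.0, 0.9]]
y = [0, 0, 0, 1, 1, 1]
w = fit(X, y)
```

After fitting, `predict(w, [1.0, 0.9])` returns a probability above 0.5 and `predict(w, [1.0, 0.1])` a probability below 0.5, which is the kind of per-pixel susceptibility value the model outputs.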

2.3. Radial Basis Function Network (RBFN)

RBFN is a neural network model built from radial basis functions [29]. RBFN can be applied to function approximation and pattern recognition [30, 31]. A radial function is defined as a function φ that is radially symmetric about a certain point µ in the input space. One popular kind of radial function is the Gaussian function [32]; for example,

$$\varphi(x, \mu) = \exp\left(-\gamma \|x - \mu\|^2\right),$$

where γ > 0 is a parameter and ||·|| denotes the Euclidean norm. The point µ is simply called the center of the basis function. The radial basis function method uses r radial functions with different centers µ1, …, µr, where r is chosen by the user and is usually much smaller than the training dataset size n [33]. One common approach to specifying suitable centers is to use a cluster analysis method, for example, k-means [34, 35]. Using k-means, one can divide the training dataset T into r disjoint clusters, in which every cluster has an associated centroid that can be thought of as a prototype of the cluster. It makes sense to use these centroids as the centers of the corresponding basis functions. Hence, RBFN approximates the target function as

$$f(x) \approx \sum_{j=1}^{r} w_j \varphi(x, \mu_j),$$

where w1, …, wr are parameters. It is convenient to introduce the vector W = (w1, …, wr)^T and the n × r matrix Φ with entries Φ(i, j) = φ(xi, µj). The parameter vector W is generally specified as the vector that minimizes the error function

$$E(W) = \|\Phi W - N(T)\|^2,$$

where N(T) denotes the vector of target values of the training dataset T.

An analytic expression for the optimal weight vector is [36]

$$W = \Phi^{+} N(T),$$

where Φ+ is the pseudoinverse of Φ.

A special case arises when n = r, where there are n basis functions with centers µj = xj. In this case, the optimal weight vector is given by W = Φ−1·N(T).

RBFN is particularly well suited to very large training datasets, for which the centers are obtained by cluster analysis.
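The RBFN construction described above (k-means centers, Gaussian basis functions, and least-squares weights) can be sketched as follows. This is a minimal one-dimensional illustration in plain Python and not the Weka implementation used in this study. The toy data, the initial centers, and the value of γ are assumptions, and the least-squares weights are obtained from the normal equations, which coincide with the pseudoinverse solution when Φ has full column rank.

```python
import math

def gaussian_rbf(x, mu, gamma):
    """Gaussian radial basis function exp(-gamma * (x - mu)^2) for scalar inputs."""
    return math.exp(-gamma * (x - mu) ** 2)

def kmeans_1d(xs, centers, iters=20):
    """Very small k-means for scalar data; empty clusters keep their old center."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for x in xs:
            j = min(range(len(centers)), key=lambda k: abs(x - centers[k]))
            clusters[j].append(x)
        centers = [sum(c) / len(c) if c else mu for c, mu in zip(clusters, centers)]
    return centers

def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_rbfn(xs, ys, centers, gamma):
    """Least-squares weights from the normal equations (Phi^T Phi) W = Phi^T y."""
    phi = [[gaussian_rbf(x, mu, gamma) for mu in centers] for x in xs]
    r = len(centers)
    ata = [[sum(row[a] * row[b] for row in phi) for b in range(r)] for a in range(r)]
    aty = [sum(row[a] * y for row, y in zip(phi, ys)) for a in range(r)]
    return solve(ata, aty)

def predict(x, centers, weights, gamma):
    """Network output: weighted sum of the basis functions."""
    return sum(w * gaussian_rbf(x, mu, gamma) for w, mu in zip(weights, centers))

# Hypothetical toy data: one scaled factor; target 1 = landslide, 0 = no landslide.
xs = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
centers = kmeans_1d(xs, centers=[0.0, 1.0])
weights = fit_rbfn(xs, ys, centers, gamma=10.0)
```

Here r = 2 centers summarize n = 6 training points, which illustrates why the weight computation stays small even when the training dataset is large.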

2.4. Validation Methods

Validation is a crucial step in every modeling project [37–39]. In this study, we used the receiver operating characteristic (ROC) curve and a set of statistical indices for model validation. These metrics are briefly described as follows.

2.4.1. Receiver Operating Characteristic (ROC) Curve

The ROC curve is one of the most common methods for evaluating the performance of methods and algorithms used for spatial modeling, including landslide susceptibility models [40, 41]. The area under the ROC curve (AUC) varies between 0 and 1 and is used quantitatively for the validation and comparison of the models. Higher AUC values indicate higher model performance and a greater ability to capture the general trend of landslide occurrences as well as to predict future landslides [42–45].
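The AUC can be computed without drawing the curve, because it equals the probability that a randomly chosen landslide sample receives a higher susceptibility score than a randomly chosen nonlandslide sample (ties counted as one half). The following minimal sketch uses this rank-based formulation; the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical example: 1 = landslide, 0 = nonlandslide, with model scores in [0, 1].
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
auc = roc_auc(labels, scores)  # 8 of the 9 pairs are ranked correctly
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a few hundred validation points; a sort-based variant would be preferable for larger datasets.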

2.4.2. Statistical Indices

Other evaluation metrics used in the present study to select the best model were ACC, SPF, SST, PPV, NPV, Kappa index, and RMSE, which have been widely used in the literature [46–50]. SST is the proportion of landslide pixels that are correctly classified as landslide. SPF is the proportion of nonlandslide pixels that are correctly classified as nonlandslide [48]. These indicators are calculated based on true positive (TP), true negative (TN), false positive (FP), and false negative (FN) using the following equations:

$$\mathrm{SST} = \frac{TP}{TP + FN}, \qquad \mathrm{SPF} = \frac{TN}{TN + FP}.$$

ACC is a statistical index that measures the overall accuracy of the classification process by calculating the proportion of landslide and nonlandslide pixels that are correctly classified:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}.$$

PPV is the proportion of pixels classified as landslide that are actually landslide pixels. NPV is the proportion of pixels classified as nonlandslide that are actually nonlandslide pixels:

$$\mathrm{PPV} = \frac{TP}{TP + FP}, \qquad \mathrm{NPV} = \frac{TN}{TN + FN},$$

where TP is the number of correctly classified landslide pixels, TN is the number of correctly classified nonlandslide pixels, FP is the number of nonlandslide pixels wrongly classified as landslide, and FN is the number of landslide pixels wrongly classified as nonlandslide.
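These indices can be computed directly from the four confusion-matrix counts, as in the following minimal sketch. The observed and predicted labels are hypothetical and do not correspond to the results of this study.

```python
def confusion_counts(observed, predicted):
    """Count TP, TN, FP, FN for binary labels (1 = landslide, 0 = nonlandslide)."""
    pairs = list(zip(observed, predicted))
    tp = sum(1 for o, p in pairs if o == 1 and p == 1)
    tn = sum(1 for o, p in pairs if o == 0 and p == 0)
    fp = sum(1 for o, p in pairs if o == 0 and p == 1)
    fn = sum(1 for o, p in pairs if o == 1 and p == 0)
    return tp, tn, fp, fn

def classification_indices(tp, tn, fp, fn):
    """Return SST, SPF, ACC, PPV, and NPV as fractions between 0 and 1."""
    return {
        "SST": tp / (tp + fn),                    # sensitivity
        "SPF": tn / (tn + fp),                    # specificity
        "ACC": (tp + tn) / (tp + tn + fp + fn),   # overall accuracy
        "PPV": tp / (tp + fp),                    # positive predictive value
        "NPV": tn / (tn + fn),                    # negative predictive value
    }

# Hypothetical labels for ten pixels.
observed = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
tp, tn, fp, fn = confusion_counts(observed, predicted)
indices = classification_indices(tp, tn, fp, fn)
```

For this toy example the counts are TP = 3, TN = 4, FP = 2, and FN = 1. Multiplying the fractions by 100 gives percentages of the kind reported in Table 2.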

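In addition to the indices described, the reliability of models can be tested using Kappa. This is a robust and trustworthy metric for classification because it accounts for the probability of agreement occurring by chance. The Kappa indicator is computed as follows:

$$\kappa = \frac{P_o - P_e}{1 - P_e}, \qquad P_o = \frac{TP + TN}{n}, \qquad P_e = \frac{(TP + FN)(TP + FP) + (TN + FP)(TN + FN)}{n^2},$$

where P_o is the observed agreement, P_e is the agreement expected by chance, and n = TP + TN + FP + FN is the total number of pixels.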
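As an illustration, Cohen's Kappa compares the observed agreement with the agreement expected by chance, both of which follow from the confusion-matrix counts. The sketch below uses hypothetical counts.

```python
def cohen_kappa(tp, tn, fp, fn):
    """Cohen's Kappa from binary confusion-matrix counts."""
    n = tp + tn + fp + fn
    # Observed agreement: share of pixels where model and inventory agree.
    p_observed = (tp + tn) / n
    # Chance agreement: product of the marginal shares, summed over both classes.
    p_expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    return (p_observed - p_expected) / (1.0 - p_expected)

# Hypothetical counts: observed agreement 0.7, chance agreement 0.5.
kappa = cohen_kappa(tp=3, tn=4, fp=2, fn=1)
```

With these counts the result is (0.7 − 0.5) / (1 − 0.5) = 0.4, which shows how an accuracy of 70% can correspond to only moderate agreement once chance is discounted.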

RMSE was also used to measure the difference between observations and predictions, known as modeling error [51], which is calculated as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_{\mathrm{obs},i} - y_{\mathrm{pre},i} \right)^2},$$

where N is the number of landslide samples, yobs is the observed value of sample i, and ypre is the predicted value of sample i.
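A minimal sketch of the RMSE computation is given below; the observed labels and predicted susceptibility values are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical observations (1 = landslide, 0 = nonlandslide) and model outputs.
y_obs = [1, 1, 0, 0]
y_pre = [0.9, 0.6, 0.2, 0.1]
error = rmse(y_obs, y_pre)
```

Because the observations are 0 or 1 and the predictions lie between 0 and 1, the RMSE is bounded by 1, and lower values indicate a smaller modeling error.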

3. Study Area

Pithoragarh District (30.0° N, 80.15° E) is located in Uttarakhand State, Northern India (Figure 1). Physiographically, the area consists of numerous NW-SE-aligned curvilinear hills. The rivers of the Ganga system, namely, Girthi, Keogad, Kali, Gori, Dhauli, Kutiyangti, Sarju, and Ramganga flowing along the periphery and through the district formed rugged topography and deep valleys. The study area is dissected by numerous faults/thrusts and shears. The district is bounded on the northern side by the Main Central Thrust (MCT) and on the southern side by North Almora Thrust (NAT). The area is seismotectonically active (GSI 2000). The main rock types in this area belong to the Garhwal Group of Mesoproterozoic age (1600–1000 Ma). These rock types include quartzite, phyllite, schist, limestone, dolomitic limestone, slate, and also metavolcanics and intrusive amphibolites at places. This district has extreme variation in temperature from below zero (high hills above 3,500 m remain covered with snow) to 40°C (river gorges). The rainy season in this area extends from mid-June to mid-September. The annual average rainfall is about 360 centimeters at lower elevation.

4. Data Used

4.1. Landslide Inventory

Accurate knowledge of the influencing variables and the corresponding data is required for dealing with practical engineering problems [52–55]. Identification and collection of landslide events and their past and current locations using time series Google Earth and satellite images [56] in conjunction with field surveys and reports are required to develop a landslide inventory map [14]. The collection of geospatial information on landslides is fundamental to the spatial modeling process [12]. In this study, we have used data of 398 landslide events extracted from the Geological Survey of India report (http://www.portal.gsi.gov.in) and from the interpretation of Google Earth images [57]. The delineated landslide polygons were represented by points on the maps for modeling purposes. These landslide locations (398 points) were randomly divided in the ratio of 70/30 into training (70% points: 261) and testing/validation (30% points: 137) data, respectively, based on the experience of the authors and other researchers in the field of ML modeling [58, 59]. The majority of the landslides in the study area are of debris flow and rock fall types (Figure 2).

4.2. Landslide Influencing Factors

One of the important issues in the process of spatial modeling of landslide susceptibility is the selection of the landslide influencing factors and the preparation of the thematic maps of these factors. There is no uniform rule for the selection of landslide influencing factors [60, 61]; therefore, these factors are selected based on previous similar studies and knowledge of the topography, geology, and geoenvironmental conditions of the study area [62, 63]. Anthropogenic activities such as construction, infrastructure development, and land use changes can disturb the natural stability of the ground mass, causing landslides in the area. Natural phenomena such as earthquakes, floods, and heavy rains are some of the main triggering factors for the occurrence of landslides. In this study, an ASTER DEM with 30 m resolution (https://earthexplorer.usgs.gov) was used for the development of the topographic thematic factors [64]. Figure 3 presents thematic maps of the landslide influencing factors, and Table 1 presents the relationships between landslides and the influencing factors using the frequency ratio (FR) method as described by Lee and Pradhan [65].
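The FR of a factor class is commonly defined as the share of landslides falling in that class divided by the share of the study area occupied by the class, so that FR > 1 indicates a stronger than average association with landslides. The following minimal sketch illustrates this computation; the class names and all counts are hypothetical and are not taken from Table 1.

```python
def frequency_ratio(landslides_per_class, pixels_per_class):
    """FR per class: (% of landslides in class) / (% of area in class)."""
    total_landslides = sum(landslides_per_class.values())
    total_pixels = sum(pixels_per_class.values())
    return {
        name: (landslides_per_class[name] / total_landslides)
        / (pixels_per_class[name] / total_pixels)
        for name in pixels_per_class
    }

# Hypothetical distance-to-road classes with landslide counts and pixel counts.
landslides = {"0-100 m": 60, "100-200 m": 25, ">200 m": 15}
pixels = {"0-100 m": 1000, "100-200 m": 2000, ">200 m": 7000}
fr = frequency_ratio(landslides, pixels)
```

In this toy example, the nearest class holds 60% of the landslides on 10% of the area, giving FR = 6.0, whereas the farthest class has an FR well below 1.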

The ten influencing factors (slope, aspect, curvature, elevation, geomorphology, land cover, distance to roads, distance to rivers, lithology, and overburden depth) considered in this study are briefly described as follows.

4.2.1. Slope

Slope is an effective factor in landslide susceptibility modeling because of its significant effects on material sliding and water flow under gravity [66]. Moderate and higher slopes are more vulnerable to landslides. Natural slopes in the area are not uniform from lower elevation to higher elevation and also laterally, as they depend on many factors such as geology, structure, soil, drainage, and erosion. Therefore, they have been classified into five groups based on the natural break method using GIS software from DEM for better site representation of this factor in the modeling (Figure 3). The highest weight (FR = 1.40) of landslide occurrence was observed in areas with slope angles of 47–88.677° (Table 1).

4.2.2. Aspect

Aspect is a subset of topographic factors which is closely related to sunlight and precipitation falling on different faces of slopes [67]. This factor has also an effect on the hydrological process, evaporation, and transpiration and also on weathering processes and vegetation [68]. In this study, an aspect map was generated from DEM and grouped into nine categories (Figure 3). The highest weight (FR = 1.57) of landslides occurrence was observed on the southern slopes (Table 1).

4.2.3. Curvature

Curvature is one of the morphometric factors which affects water flow and erosion of the ground causing landslide [68, 69]. Curvature map of the study area was divided into three classes: flat (FR = 3.57), convex (FR = 1.24), and concave (FR = 0.80) curvatures, respectively (Table 1 and Figure 3).

4.2.4. Elevation

Elevation affects local climate and air/weather conditions of regions [70] and is an important factor for describing the other factors, such as landform, rainfall, land use type, human activities, vegetation type and distribution, and soil weathering conditions as well as the rate of erosion which affects landslides [71, 72]. Generally, rainfall and weathering occur at lower to moderate elevations. At higher elevations, snow falls and rocks are less weathered. The elevation map of the study area was produced from DEM which was then divided into nine classes based on the natural break method (Figure 3). Most landslides occurred in an altitude class between 551 and 1000 m with an FR value of 5.46 (Table 1).

4.2.5. Geomorphology

Geomorphology of the area describes the relief features of the Earth and can significantly control landslide occurrence [54, 73, 74]. This factor map was extracted from the GSI report, checked against the DEM and Google Earth images, and classified into twelve geomorphological units (Figure 3). Among the geomorphological units, the “intermontane plateau” had the highest FR of 8.91 (Table 1).

4.2.6. Land Cover

Land cover, which consists of vegetation, barren land, and built-up areas [75], affects the landslide susceptibility of the area [76, 77]. This map was prepared from the satellite images in conjunction with Google Earth images and classified into eight classes (Figure 3). Most of the landslides occurred in sparsely vegetated areas (FR = 5.35) and wasteland (FR = 3.51) (Table 1).

4.2.7. Distance to Roads

Road construction and roadside excavation can destroy natural slope stability, thereby affecting landslide occurrence [78–80]. In this study, road networks were digitized from Google Earth images and buffered into different classes (Figure 3). Most of the landslides in the study area are located at 0–100 m distance from the roads with an FR value of 12.62 (Table 1).

4.2.8. Distance to Rivers

Distance to rivers is one of the important factors related to adjacent ground slope erosion and saturation causing landslides [81–83]. The drainage in the area is structurally controlled (faults/lineaments). The river map of the study area was generated from DEM using the flow direction tool of GIS and Google Earth images and buffered into six classes (Figure 3). Most of the landslides in the study area are located at 0–100 m distance from the rivers with an FR value of 8.19 (Table 1).

4.2.9. Lithology

Lithology is important in identifying the types of rocks/soil vulnerable to landslides under rainfall and gravitational forces acting on the ground [84–86]. The map of the lithology of the study area was obtained from the website of Geological Survey of India (https://earthexplorer.usgs.gov) and classified into 18 main groups as follows: group 1 (alluvium and colluvium); group 2 (alluvium, in situ soil); group 3 (amphibolite and mica schist); group 4 (phyllite, quartzite, slate, and limestone); group 5 (carbonaceous phyllite, quartzite, slate, and limestone); group 6 (chlorite schist and massive amphibolite); group 7 (colluvium); group 8 (glacial deposit); group 9 (gravel, boulder embedded in oxidised sandy matrix); group 10 (granite with tourmaline and quartz vein); group 11 (morainic material); group 12 (quartz, mica, chlorite, and hornblende schist); group 13 (scree); group 14 (slate, quartzite, sandstone, talc, limestone, dolomite, and stromatolitic limestone); group 15 (stromatolitic dolomitic limestone, phyllite, and talc); group 16 (transported soil and scree); group 17 (alluvium); and group 18 (younger loose debris). The results of the FR method revealed that the areas underlain by group 17 (FR = 10.50), group 7 (FR = 2.39), group 1 (FR = 2.09), and group 13 (FR = 1.26) were the portions of the study area most susceptible to landslide occurrence.

4.2.10. Overburden Depth

The overburden depth represents the information on the soil/overburden thickness that is controlled by erosion processes and slope morphometry [87, 88] and can affect shallow translational debris landslides [89]. Overburden maps are important for identifying areas prone to slump and creep types of landslides. In this study, the map of overburden depth was extracted from the website of Geological Survey of India (https://earthexplorer.usgs.gov) and classified into five classes (Figure 3 and Table 1). Most of the landslides occurred where the depth of the overburden was greater than 5 m. The FR value of this class, which is found on the river valley slopes, is 8.14.

5. Methodology

Our methodology is presented in Figure 4 and briefly described as follows:
(1) Generation of a database of past landslides using satellite and Google Earth images in conjunction with extracted data from the GSI report
(2) Random split of landslide points (polygons represented by points) into two groups: 70% for the training and 30% for the validation of the models
(3) Development of thematic maps of landslide influencing factors such as slope, aspect, curvature, elevation, land cover, lithology, geomorphology, distance to rivers, distance to roads, and overburden depth, as well as synthesis of available data from various sources
(4) Spatial modeling and generation of landslide susceptibility maps of the study area using the LDA, LR, and RBFN models
(5) Performance evaluation of the models using ACC, SPF, SST, PPV, NPV, Kappa, RMSE, and ROC-AUC metrics

6. Results

6.1. Validation of the Models

The evaluation results of the LDA, LR, and RBFN ML models are presented in Table 2. In the training phase, LR is the most accurate in terms of the NPV rating index (85.82). Other statistical measures indicated that performance of the RBFN algorithm is the best (PPV = 86.97%, SST = 85.66%, SPF = 86.77%, ACC = 86.21%, and kappa = 0.72). In the validation phase, LDA shows the best performance in terms of the NPV (81.75) and SST (82.01) metrics. In addition, LR outperforms the other models in terms of the PPV (87.59), SPF (86.29), ACC (82.85), and kappa (0.66) metrics. RBFN shows a validation performance similar to LR in terms of the ACC (82.85) and kappa (0.66) metrics.

In terms of RMSE, the LR (0.319) and RBFN (0.319) models have better results than LDA (0.337) in the training phase (Figure 5). In the validation phase, the RBFN model with RMSE = 0.327 has the lowest modeling error compared to the LR (0.337) and LDA (0.344) models (Figure 6).

The AUC value for the training phase is the highest for the RBFN model (0.938) in comparison with the LDA (0.929) and LR (0.936) models. However, in the validation phase, the RBFN model has a lower AUC (0.805) compared to the LR (AUC = 0.926) and LDA (AUC = 0.920) (Figure 7). According to the results, the LR model is the most accurate model for the prediction of landslide susceptibility in the study area.

6.2. Construction of Landslide Susceptibility Maps

Landslide susceptibility maps were produced using the models’ outputs that gave landslide susceptibility values ranging from 0 to 1. These raster values maps were classified into five susceptibility classes using the natural break method of ArcGIS (Figure 8).

Figure 9 shows FR analysis of the susceptibility maps. On the FR diagram, it is clear that the landslides are mostly located in the very high susceptibility class. Further, with a closer look, it can be seen that the LR model has the highest frequency of landslides in the very high susceptibility class, followed by the LDA and RBFN models. On the other hand, a few landslides fall into the very low susceptibility class. In the very low landslide susceptibility class, RBFN showed the highest percentage of susceptibility, followed by LDA and LR, respectively.

7. Discussion

Nowadays, ML methods are used for the development of landslide susceptibility maps in many regions around the world [90]. In the ML modeling, binary classification is adopted, where “1” indicates presence and “0” indicates absence of landslides. The ML models used in the landslide studies are based on single, ensemble, and hybrid algorithms. In this study, we have used three single ML models, namely, LDA, LR, and RBFN, for the generation of landslide susceptibility maps of Pithoragarh district, which is one of the landslide-prone areas in Northern India. These models have been selected as they have not previously been applied for the landslide susceptibility mapping of this part of the Himalayas. Ten landslide influencing factors (slope, aspect, curvature, elevation, land cover, lithology, geomorphology, distance to rivers, distance to roads, and overburden depth) were used for the landslide susceptibility mapping. Drainage in the area is structurally controlled by faults/lineaments, so distance to faults/lineaments has not been considered separately in the model study to avoid duplication. Moreover, rainfall has been considered as a triggering factor, not as an influencing factor, because most of the landslides in this area occurred during the rainy season. For the model development and validation, landslide data was split into a 70/30 ratio: 70% for training and 30% for testing/validation based on the experience of researchers [91, 92].

Performance of the models was checked using standard statistical metrics: ACC, PPV, NPV, SPF, SST, Kappa, RMSE, and AUC, which have also been used by several researchers [46, 93, 94]. All three models used in this study have shown good performance in predicting landslide susceptible zones with high accuracy, which is consistent with similar studies in other areas [48, 95]. However, the LR model with the AUC value of 0.926 is the most accurate and consequently the best model in predicting the landslide susceptibility in this area. The results revealed that about 20% of the study area fell in the high and very high susceptibility classes. The first two classes (0–100 and 100–200 m) of distance to roads and rivers had the strongest association with the historical landslides in the study area. This association was clearly reflected in the susceptibility maps such that the areas at a distance to roads/rivers of 0–200 m were classified as high and very high susceptibility classes to landslide occurrence. Our results are supported by previous studies that had reported on the significance of roads and rivers on landslide occurrence [78, 96–98]. Delineation of accurate landslide susceptible zones using single models will help in better planning and management of the study area. Overall, this case study demonstrated that the single models can still compete with the complex ensemble and hybrid models for landslide susceptibility mapping of landslide-prone areas.

8. Concluding Remarks

Using three single ML models, we have modeled landslide susceptibility in one of the landslide-prone areas of the Himalayas. Results indicated that the LR model was the best model with an AUC value of 0.926, which can be considered an excellent model for predicting the landslide susceptibility of hilly and mountainous areas. Despite widespread development and application of ensemble and hybrid models, our results suggest that single ML models are still efficient in producing accurate landslide susceptibility maps for better planning and management of landslide-prone areas. However, the applicability of these models for other areas depends on local conditions that may change the set of landslide influencing factors. This uncertainty should be addressed by repeating such a study in different landslide-prone regions around the world.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

The authors acknowledge University of Transport Technology for supporting this research.