Research on the Impact of Random Negative Training Samples on the Spatial Quantitative Model of Landslide Hazards
In the present study, a spatial quantitative model of landslide hazards based on a deep belief network (DBN) is constructed. First, environmental similarity-based sampling (ESBS) is used to determine the negative sampling area. Second, multiple data sets are constructed; each data set contains seven landslide-conditioning factors, with 70% of the data used for training and 30% used for validation. Performance evaluation indices for the spatial quantitative model of landslide hazards are then established: the AUC mean (AUCmean) is used to measure the stability of the model, and the AUC standard deviation (AUCSD) is used to measure the uncertainty of the model. Finally, the accuracy of the prediction results of the DBN model is analyzed. The results show that the area with negative sample reliability greater than 0.51 is the best negative sampling area, and the stability of the DBN model is maintained at a relatively good level in both the training step (AUCmean = 0.9597) and the validation step (AUCmean = 0.8897). The standard deviation of the AUC is close to 0 (AUCSD = 0.0081 in the training step and AUCSD = 0.0085 in the validation step), indicating that the random selection of negative samples has a weak impact on the performance of the model. The very high susceptibility areas obtained by the DBN model are realistic: they contain 55.03% of the known landslide points. Therefore, the DBN model constructed in the present study is effective and can be used in the field of landslide hazard spatial prediction.
China is one of the countries with a high frequency of geological disasters. It is reported that in 2021, a total of 4,772 geological disasters occurred in China, resulting in 80 deaths and 11 missing. Landslide hazards are one of the most destructive geological disasters. In 2021, there were 2,335 landslide hazards in China, accounting for approximately 50% of the total number of geological disasters and causing direct economic losses of 3.2 billion CNY (Ministry of Natural Resources of the People’s Republic of China). In addition, in southwestern China, due to the complex geological environment and the concentration of the population, landslide hazards pose a greater threat to people’s lives and property. Therefore, achieving rapid, accurate, and refined regional landslide susceptibility assessment is one of the cutting-edge scientific issues in the field of disaster prevention and mitigation.
Landslides are complex processes related to geology, topography, and other geoenvironmental factors associated with different conditioning and triggering factors. Establishing a high-precision spatial quantification model of landslide hazards relies on sufficient high-quality data, so it has remained a difficult task for a long time [1, 2]. Although many techniques have been proposed and used, there is currently no unified paradigm for landslide hazard susceptibility research. Spatial quantitative models of landslide hazards have passed through three main stages of development: qualitative prediction models, knowledge-driven quantitative mathematical prediction models, and data-driven mathematical prediction models. The qualitative approach has the advantage that it does not depend on field data from landslide areas, which are difficult to collect. However, because of its low prediction accuracy, spatial quantitative models of landslide hazards are usually based on quantitative methods. Among them, knowledge-driven quantitative mathematical prediction models have produced many results, such as multicriteria decision analysis (MCDA), frequency ratio (FR), logistic regression (LR), evidence belief function (EBF), weight of evidence (WoE), and linear discriminant analysis (LDA). In addition, to improve the accuracy of predictions, a large number of scholars have begun to use data-driven mathematical prediction models, that is, machine learning methods such as artificial neural networks (ANNs), support vector machines (SVMs), decision trees (DTs), Gaussian models, random forests (RFs), naive Bayes models (NBMs), neuro-fuzzy systems (NFSs), and boosted regression trees (BRTs).
With the development of supervised data-driven models, deep learning has achieved many results in search technology, data mining, machine translation, natural language processing, multimedia learning, and other fields, and it performs well in feature extraction and model expression. The DBN is a commonly used deep learning model that is mainly based on Bayesian thinking; by finding the joint probability distribution of the data, it automatically extracts deep-level information hidden in the data that is otherwise difficult to decipher.
Landslide susceptibility assessment is usually driven by supervised data, and the selection of training samples is the basis of supervised learning methods, which can affect the accuracy and generalization ability of the prediction model. Landslide susceptibility assessment models usually use positive and negative sample points for training. A positive sample point is a point where a landslide has occurred, and it is generally taken from historical landslide data. Screening requires that historical landslide hazard points have definite spatial geographic coordinates and quantifiable landslide conditioning factors. Generally, the sampling accuracy of positive samples is high. Negative samples are points where no landslide has occurred, and they generally cannot be obtained directly. The commonly used method is spatial random sampling, which is likely to introduce “false negative samples” into the collected negative samples. The geographical environment of these “false negative samples” is similar to that of landslide points; they are potential landslide points or are already within the range of landslides. The existence of “false negative samples” reduces the quality of the training sample set and affects the performance of the model. Xiao et al. considered the influence of the scale of landslide hazards and took negative sample points outside a certain buffer zone around the positive sample points. However, because landslide hazards exhibit spatial autocorrelation and spatial heterogeneity, this method still carries large uncertainties, and there is no uniform standard for selecting the buffer distance. Zhu et al. argued that points that are more dissimilar to the positive samples in the environmental feature space are more likely to lie in nonhazardous areas, so negative sample points can be collected in such areas, and on this basis they proposed ESBS.
However, a reasonable framework for assessing the impact of random negative sample points on the performance of the spatial quantitative model of landslide hazards has not yet been proposed. The present study discussed this issue.
There are many directions that can be considered in the context of landslide susceptibility. The present study focuses on the uncertainty of the effect of randomly selected negative sample points on the model performance to establish a high-precision landslide spatial quantification model. By calculating the AUCmean and AUCSD of 40 different sample sets, the uncertainty of the effect of random negative sample points on model performance was analyzed. In addition, the present study also analyzed the accuracy of the prediction results of the supervised data-driven DBN model. The remainder of the paper is organized as follows: Section 1 summarizes the research progress in landslide hazard susceptibility assessment. Section 2 introduces the research area of the present study, the landslide hazards conditioning factors selected in the present study, the spatial quantitative model of landslide hazards based on DBN, and the impact assessment index of random negative sample points on the model. At the same time, the accuracy of the prediction results of the DBN model has been analyzed, and the results are explained in Section 3. This study’s discussion and conclusion are presented in Sections 4 and 5, respectively.
2. Models and Methods
2.1. Study Area
The 2008 Wenchuan earthquake caused severe and widespread damage and induced a variety of secondary hazards, such as landslides, debris flows, and riverbank bursts. The 10 extremely earthquake-stricken areas of the Wenchuan earthquake (Dujiangyan, Pengzhou, Mianzhu, Shifang, Anxian, Beichuan, Qingchuan, Pingwu, Wenchuan, and Maoxian) are distributed in the Longmen Mountains and nearby mountainous areas, controlled by the Longmenshan fault zone (NE-SW). This area is located in the northwest of the Sichuan Basin and the southeast of the Aba Tibetan Autonomous Prefecture. It is dominated by mountainous landforms with relatively high altitudes, large elevation differences, and a complex geological structure, and it has historically been prone to geological disasters such as landslides. The Wenchuan earthquake was the most destructive earthquake in the history of New China, and secondary disasters continued for a long time after the earthquake. Because geological disasters in the severely affected area increased significantly and persisted for a long time, this area is well suited as a representative study area. Therefore, in the present study, the 10 extremely earthquake-stricken areas of the Wenchuan earthquake are used as the study area, as shown in Figure 1.
2.2. Research Frame
In the present study, the 10 extremely earthquake-stricken areas of the Wenchuan earthquake are used as the study area to assess landslide hazard susceptibility, with a focus on analyzing the impact of random negative samples on the model. The research framework shown in Figure 2 is built, which specifically includes the following five steps:
Step 1: Fully collect relevant data and landslide hazard data in the study area; select three categories of conditioning factors, namely, topography and geomorphology factors, geological environment factors, and inducing factors; and conduct GIS layer rasterization processing on the study area.
Step 2: Construct 40 groups of sample sets (with different negative samples and the same positive samples); use 70% of the sample points in each group as the training set and 30% as the validation set (the numbers of landslide points and nonlandslide points are equal in the training set and likewise in the validation set); and construct a DBN-based spatial prediction model for landslide hazards.
Step 3: Calculate the AUCmean and AUCSD to analyze the impact of randomly selected nonlandslide points on the performance of the model and to assess the stability of the model.
Step 4: Use ArcGIS to classify the landslide hazard susceptibility in the study area and construct a landslide hazard susceptibility map.
Step 5: Analyze the accuracy of the prediction results of the model constructed in the present study.
2.3.1. Conditioning Factors of Landslide Hazards
Based on existing research results and the topographic and geomorphic environment of the study area, the present study selects 7 direct and indirect landslide hazard conditioning factors. These 7 factors can be roughly divided into three categories: topographic and geomorphic factors (topographic information entropy), geological environment factors (distance to rivers, distance to faults, lithology, and normalized difference vegetation index (NDVI)), and inducing factors (distance to roads and peak ground acceleration (PGA)). The descriptions of the conditioning factors are shown in Table 1. The layers of these 7 conditioning factors are gridded with 60 m × 60 m pixels in ArcGIS 10.0, and a total of 10,027,131 grids are generated as input data for model training, as shown in Figure 3.
2.3.2. Data Preparation
The present study investigated and consulted the historical information on landslide hazards in the study area, collected landslide hazard data from the study area in 2010, and used Landsat 8 remote sensing interpretation to obtain 885 landslide points together with their specific spatial locations in the study area.
ESBS was used to determine the sampling area of negative samples. Firstly, the natural breakpoint method was used to discretize the continuous conditioning factors. Secondly, the FR is combined to calculate the landslide frequency ratio to express the relationship between the conditioning factors and landslide occurrence frequency. The formula is as follows:
$$FR_{ij} = \frac{N_{ij} / A_{ij}}{\left( \sum_{k=1}^{n_i} N_{ik} \right) / A} \quad (1)$$
where $N_{ij}$ represents the number of landslides in level $j$ of conditioning factor $i$, $A_{ij}$ represents the area of level $j$ of conditioning factor $i$, $n_i$ represents the number of levels of conditioning factor $i$, $FR_{ij}$ represents the frequency ratio of landslides in level $j$ of conditioning factor $i$, and $A$ represents the total area of the study area.
By normalizing the calculated frequency ratio with formula (2), the similarity between each level of a conditioning factor and the typical landslide category under that conditioning factor can be obtained. The formula is as follows:
Since positive and negative samples have different geographical environment characteristics, the reliability of negative samples decreases with increasing similarity. The present study defines an index to represent the reliability of negative samples. The formula is as follows:
Finally, the negative sample reliability of the 7 conditioning factors was assigned to each grid in the study area, and the negative sample reliability of each grid was calculated using the sum of the mean and standard deviation of the negative sample reliability to determine the boundaries of the negative sample sampling area. Forty groups of nonlandslide points were randomly selected in the area (885 points per group) and combined with known landslide points to form 40 groups of sample points, as shown in Figure 4. For each group, 70% of the landslide points and nonlandslide points (620 each) were randomly used as the training set, and the remaining 30% of the landslide points and nonlandslide points (265 each) were used as the validation set.
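The data preparation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The frequency ratio is computed as landslide density in a level divided by overall landslide density, with equal-area grid cells assumed so that cell counts stand in for areas. The max-normalized similarity, the reliability taken as one minus the similarity, the plain mean over factors, and the use of the mean reliability as the sampling threshold are all assumptions made for illustration, as are the function names and the synthetic data.

```python
import numpy as np


def frequency_ratio(levels, is_landslide, n_levels):
    """Frequency ratio of each level of one discretized conditioning factor.

    Grid cells are assumed to have equal area, so cell counts replace areas.
    """
    cells = np.bincount(levels, minlength=n_levels)
    slides = np.bincount(levels, weights=np.asarray(is_landslide, dtype=float),
                         minlength=n_levels)
    level_density = slides / np.maximum(cells, 1)  # landslides per cell in each level
    overall_density = slides.sum() / cells.sum()   # landslides per cell overall
    return level_density / overall_density


def negative_reliability(fr):
    """ASSUMPTION: similarity = FR / max(FR); reliability = 1 - similarity."""
    return 1.0 - fr / fr.max()


def reliability_map(factor_levels, is_landslide):
    """Per-cell reliability. ASSUMPTION: plain mean over the conditioning factors."""
    per_factor = []
    for levels in factor_levels:
        n_levels = int(levels.max()) + 1
        rel = negative_reliability(frequency_ratio(levels, is_landslide, n_levels))
        per_factor.append(rel[levels])  # look up each cell's level reliability
    return np.mean(per_factor, axis=0)


def build_sample_sets(reliability, is_landslide, threshold, n_groups, n_train, seed=0):
    """Same positives in every group, fresh random negatives from the reliable area."""
    rng = np.random.default_rng(seed)
    is_landslide = np.asarray(is_landslide, dtype=bool)
    pos = np.flatnonzero(is_landslide)
    candidates = np.flatnonzero((reliability > threshold) & ~is_landslide)
    groups = []
    for _ in range(n_groups):
        neg = rng.choice(candidates, size=pos.size, replace=False)
        pos_shuffled = rng.permutation(pos)
        groups.append({
            "train_pos": pos_shuffled[:n_train], "train_neg": neg[:n_train],
            "val_pos": pos_shuffled[n_train:], "val_neg": neg[n_train:],
        })
    return groups


# Synthetic demonstration data (hypothetical, not the study area).
rng = np.random.default_rng(42)
n_cells = 2000
factor_levels = [rng.integers(0, 5, n_cells) for _ in range(3)]
is_landslide = rng.random(n_cells) < 0.02 + 0.03 * factor_levels[0]

reliability = reliability_map(factor_levels, is_landslide)
threshold = reliability.mean()                # illustrative choice of threshold
n_pos = int(is_landslide.sum())
n_train = (7 * n_pos + 9) // 10               # ceil(0.7 * n_pos); gives 620 for 885
groups = build_sample_sets(reliability, is_landslide, threshold,
                           n_groups=40, n_train=n_train)
```

Because every group shares the same landslide points and differs only in its nonlandslide points, any spread in model performance across the groups can be attributed to the random negative samples.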
2.3.3. Application of Methods
In the present study, the DBN is used to construct a spatial quantitative model of landslide hazards in extremely earthquake-stricken areas. The DBN is composed of multiple layers of restricted Boltzmann machines (RBMs) and one layer of a backpropagation (BP) neural network. The DBN structure is shown in Figure 5.
The DBN shown in Figure 5 consists of two stacked RBMs and a BP layer. Each RBM has two layers: the first layer is a visible layer, and the second layer is a hidden layer. When RBMs are stacked in a DBN structure, the output of the first RBM is used as the input of the second RBM, and the output of the last RBM is used as the input of the BP neural network for training. The training steps of the DBN are as follows.
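First, the contrastive divergence algorithm is used to train each RBM to obtain the optimal weight values and bias coefficients of each RBM. Suppose the training sample set is $S = \{v^{1}, v^{2}, \ldots, v^{n_s}\}$, where $n_s$ is the number of training samples and the samples are independent and identically distributed. Define the energy function of an RBM in the DBN as follows:
$$E(v, h \mid \theta) = -\sum_{i} a_i v_i - \sum_{j} b_j h_j - \sum_{i} \sum_{j} v_i w_{ij} h_j \quad (4)$$
where $\theta = \{w_{ij}, a_i, b_j\}$; $w_{ij}$ is the weight value between visible neuron $i$ and hidden neuron $j$, which represents the correlation strength between the two neurons; $a_i$ is the bias coefficient of visible layer neuron $i$; and $b_j$ is the bias coefficient of hidden layer neuron $j$.
According to formula (4), the joint probability distribution function of $(v, h)$ and the partition function $Z(\theta)$ can be derived as follows:
$$P(v, h \mid \theta) = \frac{e^{-E(v, h \mid \theta)}}{Z(\theta)}, \qquad Z(\theta) = \sum_{v, h} e^{-E(v, h \mid \theta)}$$
During the adjustment process, the probabilities that the neurons in the hidden layer and the visible layer are activated are
$$P(h_j = 1 \mid v, \theta) = \sigma\Big(b_j + \sum_{i} v_i w_{ij}\Big), \qquad P(v_i = 1 \mid h, \theta) = \sigma\Big(a_i + \sum_{j} w_{ij} h_j\Big)$$
where $\sigma(x) = 1 / (1 + e^{-x})$ is the sigmoid function.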
Training RBM means adjusting the parameters so that the probability distribution represented by the RBM under the set of parameters fits as much as possible with the distribution satisfied by the training data. When the likelihood function reaches the maximum, the training of this layer is stopped, and the optimal weight value and bias coefficient of this layer are obtained.
After the above training process, the locally optimal weight values and bias coefficients of each layer are obtained. Finally, the BP neural network is used for backward fine-tuning to obtain the globally optimal weight values and bias coefficients.
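The layer-wise pretraining described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it uses binary units, one-step contrastive divergence (CD-1), full-batch updates, and a toy data set of two binary patterns; the class name `RBM`, the layer sizes, the learning rate, and the epoch counts are all hypothetical. The final BP fine-tuning stage is omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class RBM:
    """Binary restricted Boltzmann machine trained with one-step contrastive divergence."""

    def __init__(self, n_visible, n_hidden, rng):
        self.rng = rng
        self.W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))  # weights
        self.a = np.zeros(n_visible)  # visible-layer biases
        self.b = np.zeros(n_hidden)   # hidden-layer biases

    def hidden_prob(self, v):
        # Probability that each hidden neuron is activated given the visible layer.
        return sigmoid(v @ self.W + self.b)

    def visible_prob(self, h):
        # Probability that each visible neuron is activated given the hidden layer.
        return sigmoid(h @ self.W.T + self.a)

    def cd1_step(self, v0, lr=0.1):
        # Positive phase: hidden probabilities driven by the data.
        h0 = self.hidden_prob(v0)
        h0_state = (self.rng.random(h0.shape) < h0).astype(float)
        # Negative phase: one Gibbs step back to the visible layer and up again.
        v1 = self.visible_prob(h0_state)
        h1 = self.hidden_prob(v1)
        n = v0.shape[0]
        self.W += lr * (v0.T @ h0 - v1.T @ h1) / n
        self.a += lr * (v0 - v1).mean(axis=0)
        self.b += lr * (h0 - h1).mean(axis=0)

    def reconstruction_error(self, v):
        # Mean squared error of a deterministic (mean-field) reconstruction.
        return float(np.mean((v - self.visible_prob(self.hidden_prob(v))) ** 2))


rng = np.random.default_rng(0)
patterns = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
data = np.tile(patterns, (10, 1))  # 20 toy samples with 6 visible units

# First RBM learns features of the raw input.
rbm1 = RBM(6, 4, rng)
err_before = rbm1.reconstruction_error(data)
for _ in range(1000):
    rbm1.cd1_step(data)
err_after = rbm1.reconstruction_error(data)

# Output of the first RBM becomes the input of the second RBM.
features = rbm1.hidden_prob(data)
rbm2 = RBM(4, 3, rng)
for _ in range(200):
    rbm2.cd1_step(features)
top_features = rbm2.hidden_prob(features)  # would feed the BP layer
```

In a full DBN, `top_features` would be passed to a supervised output layer, and all weights would then be fine-tuned jointly by backpropagation against the landslide and nonlandslide labels.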
2.3.4. Accuracy Assessment Index
The present study uses the area under the curve (AUC) of the receiver operating characteristic (ROC) curve to assess the performance of the model; the AUC is the area enclosed by the ROC curve and the abscissa axis. The abscissa of the ROC curve is the false positive rate (FPR), and the ordinate is the true positive rate (TPR). The ROC curve plots the TPR against the corresponding FPR over all classification thresholds to visualize the performance of the model. The definitions of FPR and TPR are shown in Table 2.
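As an illustration of this index, the following Python sketch builds the ROC curve from predicted scores and integrates it with the trapezoidal rule. It assumes the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN); the function names and the four-sample example are hypothetical and are not taken from the study's data.

```python
import numpy as np


def roc_points(y_true, scores):
    """Return (FPR, TPR) arrays obtained by sweeping the decision threshold."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="mergesort")  # descending score
    y_sorted = y_true[order]
    s_sorted = scores[order]
    tp = np.cumsum(y_sorted)    # true positives when thresholding after each sample
    fp = np.cumsum(~y_sorted)   # false positives at the same thresholds
    # Keep one point per distinct score so tied scores share a threshold.
    last = np.append(np.diff(s_sorted) != 0, True)
    tpr = np.concatenate(([0.0], tp[last] / y_true.sum()))
    fpr = np.concatenate(([0.0], fp[last] / (~y_true).sum()))
    return fpr, tpr


def roc_auc(y_true, scores):
    """Area between the ROC curve and the FPR axis (trapezoidal rule)."""
    fpr, tpr = roc_points(y_true, scores)
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


# 1 = landslide point, 0 = nonlandslide point; scores are predicted probabilities.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

An AUC of 0.5 corresponds to a model that cannot separate landslide points from nonlandslide points, and an AUC of 1 corresponds to perfect separation.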
3.1. Sampling Area Determination of the Negative Samples
To determine the sampling area of negative samples, the present study calculates the reliability of negative samples in the study area based on the similarity of geographical environment features and generates a spatial distribution map of the reliability of negative samples in the study area, as shown in Figure 6. The mean of the negative sample reliability is 0.51; therefore, the present study uses the area where the negative sample reliability is greater than 0.51 as the sampling area of negative sample points.
3.2. Accuracy Assessment
To characterize the impact of randomly selected nonlandslide points on the performance of the model, the present study randomly selects 40 groups of nonlandslide points in the negative sample sampling area and calculates the AUC for each of the 40 groups of sample points. AUCmean is used to assess the stability of the model, and AUCSD is used to represent the uncertain impact of randomly selected nonlandslide points on the model performance. After the parameters of the DBN are adjusted, the ROC curves of the 40 sets of sample points are obtained for the training step and the validation step, as shown in Figure 7.
Table 3 shows the AUCmean and AUCSD of the 40 sets of sample points and the AUC of each set. The AUCmean is 95.97% and 88.97% in the training step and validation step, respectively, which indicates that the DBN model has high stability. The AUCSD is approximately 0.01 in both the training step and the validation step, so the AUC of the DBN model changes very little across the sample sets. This means that the random selection of negative sample points in an appropriate area has only a weak impact on the performance of the model, and the model performs well.
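The two summary indices can be computed from the per-set AUC values as in the short Python sketch below. The four AUC values are placeholders for illustration only and are not taken from Table 3, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not state which estimator was used.

```python
import numpy as np


def summarize_auc(auc_values, ddof=1):
    """Return (AUCmean, AUCSD) for the AUC values of repeated sample sets."""
    auc = np.asarray(auc_values, dtype=float)
    # ASSUMPTION: sample standard deviation (ddof=1); use ddof=0 for the population form.
    return float(auc.mean()), float(auc.std(ddof=ddof))


demo_auc = [0.88, 0.89, 0.90, 0.89]  # illustrative values, not study results
auc_mean, auc_sd = summarize_auc(demo_auc)
```

A high mean with a small standard deviation indicates that model performance is both good and insensitive to which negative samples were drawn.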
3.3. Landslide Hazard Mapping
All the grid data in the study area were fed into the constructed DBN model, and the natural breakpoint method in ArcGIS was used to divide the landslide hazard susceptibility assessment results into five levels: very high, high, moderate, low, and very low. Then, a susceptibility map of landslide hazards was constructed (Figure 8). In addition, a statistical analysis of the landslide hazards in each of the five levels was performed, as shown in Table 4.
Figure 8 shows that the landslide hazards predicted by the DBN model are mainly concentrated along the diagonal from the northeast to the southwest of the study area, which is consistent with the direction of the fault zone. According to the statistical indicators in Table 4, the very high susceptibility areas identified by the DBN model account for 26.80% of the total area of the study area and contain 55.03% (487) of the known landslide points. The very high susceptibility areas thus contain more than half of the known landslide points in roughly a quarter of the study area, so the delineated very high susceptibility area is realistic, indicating that the DBN model can accurately reflect the spatial distribution characteristics of landslide hazards.
The accuracy of the spatial prediction of landslides is affected by both the models used and the input data. At present, data-driven supervised learning methods are widely used to establish spatial quantification models of landslide hazards. For data-driven supervised methods, the accuracy and generalization ability of prediction models are affected to a certain extent by the selection of data sets. A large number of recent studies have shown that a reliable landslide inventory (positional accuracy and sampling strategy) is vital to accurate susceptibility maps [29, 30]. Although the spatial position of the positive sample points is determined, Huang et al. considered the irregular landslide boundaries and spatial shapes when selecting the positive sample points and discussed uncertainty patterns for landslide susceptibility modeling. Negative sample points are usually selected by random spatial sampling outside a certain buffer around the positive samples. However, due to the spatial autocorrelation and regional heterogeneity of landslide hazards, this method still has high uncertainty, and there is no uniform standard for selecting the buffer distance. In the present study, ESBS is used to determine the negative sample sampling area; AUCmean is used to measure the stability of the landslide prediction models; and AUCSD is used to determine the uncertainty of the effect of random negative sample points on model performance. Although the present study determined the sampling area of negative samples and examined the impact of a random selection of negative sample points in this area on model performance, it does not reach a clear conclusion about a quantitative method for sample point selection, which deserves further study.
The present study analyzes the accuracy of the prediction results of the landslide hazard spatial quantification model based on DBN. The DBN model can automatically obtain the hidden information in the data by finding the joint probability distribution of the data, which allows it to identify the spatial distribution characteristics of landslide hazards. The prediction results of the DBN model are satisfactory: Table 4 shows that 55.03% of the known landslide points (487) fall within the areas that the DBN model classifies as very high susceptibility. The trained DBN model can accurately reflect the spatial distribution characteristics of landslide hazards, and the DBN model has relatively high prediction accuracy. However, the DBN model also has certain limitations. The training process of the model is mainly a “black box” process, and the settings of network parameters can only be determined after multiple experiments. The interpretability of the results deserves further exploration.
The present study takes the 10 extremely earthquake-stricken areas of the Wenchuan earthquake as an example, constructs a DBN-based landslide hazard prediction model, focuses on the analysis of the impact of negative sample training data on the model, and conducts a landslide hazard susceptibility assessment. The following conclusions were obtained:
(1) The best negative sample sampling area determined by ESBS is the area where the negative sample reliability is greater than 0.51. Nonlandslide points are randomly selected in this area to train and validate the model; the AUCmean is used to assess the stability of the DBN model; and the AUCSD is used to assess the uncertain impact of randomly selected nonlandslide points on the performance of the model. We calculated the AUCmean and AUCSD of 40 groups of sample points. The AUCmean is 95.97% and 88.97% in the training step and validation step, respectively, which shows that the DBN model has relatively high stability. The AUCSD is approximately 0.01 in both the training step and the validation step, indicating that the randomly selected nonlandslide points in an appropriate area have a weak impact on the performance of the model.
(2) The present study selects seven conditioning factors (topographic information entropy, distance to rivers, distance to faults, lithology, NDVI, distance to roads, and PGA) and constructs a DBN-based spatial quantitative model of landslide hazards. A probability map of landslide hazards in the study area is then constructed from the predicted probabilities. The natural breakpoint method is used to divide the landslide susceptibility into five levels (very high, high, moderate, low, and very low). Among them, the areas that the DBN model classifies as very high susceptibility contain the most known landslide points, accounting for 55.03% of all landslide points, which indicates that the model constructed in the present study can highlight the spatial distribution characteristics of landslide hazards.
Some or all data, models, or codes that support the findings of this study are available from the corresponding author upon reasonable request.
Conflicts of Interest
The authors declare that there are no conflicts of interest.
This work was supported by the National Natural Science Foundation of China (42072322); Chengdu University of Technology Development Funding Program for Young and Middle-Aged Key Teachers (10912-JXGG2020-06251); and the General Program of Mineral Resources Research Center in Sichuan Province (SCKCZY2020-YB05).
I. N. Aghdam, B. Pradhan, and M. Panahi, “Landslide susceptibility assessment using a novel hybrid model of statistical bivariate methods (FR and WOE) and adaptive neuro-fuzzy inference system (ANFIS) at southern Zagros Mountains in Iran,” Environmental Earth Sciences, vol. 76, no. 6, pp. 1–22, 2017.
F. G. Murillo-Garcia and I. Alcantara-Ayala, “Landslide susceptibility analysis and mapping using statistical multivariate techniques: Pahuatlan, Puebla, Mexico,” in Recent Advances in Modeling Landslides and Debris Flows, W. Wu, Ed., Springer Series in Geomechanics and Geoengineering, Springer, Berlin, Germany, 2015.
P. Badal, F. ., A. Omar, A. Ali, K. Sang-Wan, L. Samsung, and P. Hyuck-Jin, “Spatial clustering and modelling for landslide susceptibility mapping in the north of the Kathmandu Valley, Nepal,” Landslides, vol. 18, pp. 1403–1419, 2020.
W. Chen, X. S. Yan, Z. Zhao, H. Hong, D. T. Bui, and B. Pradhan, “Spatial prediction of landslide susceptibility using data mining-based kernel logistic regression, naive Bayes and RBF Network models for the Long County area (China),” Bulletin of Engineering Geology and the Environment, vol. 78, no. 1, pp. 247–266, 2019.
Y. L. Chang, Y. C. Wang, Y. S. Fu, C. C. Han, J. Chanussot, and B. Huang, “Multisource data fusion and Fisher criterion based nearest feature space approach to landslide classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 8, no. 2, pp. 576–588, 2015.
M. M. S. K. T. M. M. Mufti and H. Amir, “Deep learning in mining biological data,” Cognitive Computation, vol. 13, no. 1, pp. 1–33, 2021.
Y. H. Han and J. J. Chen, “Multimodal deep learning for multimedia understanding and reasoning,” Multimedia Tools and Applications, vol. 80, no. 11, Article ID 17167, 2021.