The main purpose of the present study is to investigate the possible application of decision trees in landslide susceptibility assessment. The study area, which has a surface area of 174.8 km², is located on the northern coast of the Sea of Marmara, in the western part of the Istanbul metropolitan area. When applying data mining and extracting the decision tree, geological formations, altitude, slope, plan curvature, profile curvature, heat load, and stream power index are taken into consideration as landslide conditioning factors. Using the predicted values, the landslide susceptibility map of the study area is produced. The AUC value of the produced landslide susceptibility map is 89.6%. According to the results of the AUC evaluation, the produced map exhibits a good performance.

1. Introduction

Landslide susceptibility and hazard assessments can be carried out by using either direct or indirect mapping techniques. In direct hazard assessment, the degree of hazard is determined by the mapping geomorphologist, based on his/her experience and knowledge of the terrain conditions [39]. In indirect hazard assessment, either statistical or deterministic models are used to predict landslide-prone areas, based on information obtained from the interrelation between landscape factors and the landslide distribution [39]. In recent years, many studies on indirect landslide susceptibility assessment have been published, following developments in Geographical Information Systems (GISs) and digital cartography. It is possible to produce a landslide susceptibility map by employing various indirect mapping techniques such as combination of index maps (e.g., [1, 3, 5, 26]), bivariate and multivariate statistical analyses (e.g., [4, 6–15, 17–19, 21, 25, 46]), neural networks (e.g., [2, 19, 23, 24, 27, 32, 43]), and fuzzy approaches (e.g., [28–31, 37]). Each landslide susceptibility assessment method considered by the landslide community has its own advantages and drawbacks.

According to Miller and Han [49], similar to many research and application fields, geography has moved from a data-poor and computation-poor environment to a data-rich and computation-rich one. The scope, coverage, and volume of digital geographic datasets are growing rapidly. Moreover, new high spatial and spectral resolution remote sensing systems and other monitoring devices are gathering vast amounts of georeferenced digital imagery, video, and sound. Traditional spatial analytical techniques cannot easily discover new and unexpected patterns, trends, and relationships that can be hidden deep within very large and diverse geographic datasets [49]. At this point, data mining, which encompasses a variety of statistical analysis, pattern recognition, and machine learning techniques, can be used to overcome the problems of processing very large datasets. Producing landslide susceptibility maps requires processing such very large datasets. As mentioned above, landslide susceptibility maps have been produced by using several methods. The main purpose of this study is to investigate the decision tree in landslide susceptibility assessment. For this purpose, a decision tree is used to analyze the geographic data and to predict landslide-susceptible zones on a map of the study area, the Cekmece area of Istanbul, Turkey (Figure 1). The study also includes an assessment of the landslide conditioning factors, some theoretical information on the decision tree technique, and the application of decision trees to landslide susceptibility assessment.

2. General Characteristics of the Study Area

The study area, which has a surface area of 174.8 km², is located on the northern coast of the Sea of Marmara, in the western part of the Istanbul metropolitan area (see Figure 1). Buyuk Cekmece Lake forms the western border of the study area, while Kucuk Cekmece Lake and Dikilitas creek form the eastern border. The area has a high seismicity. In the last decade, Turkey has experienced some large earthquakes. More than 300 earthquakes have been reported to have occurred in the region between 2100 BC and AD 1900 [55]. The active northern branch of the North Anatolian Fault Zone (NAFZ) passes approximately 9 km south of the application site. In the last 20 centuries, 29 historically large earthquakes (Ms between 6.3 and 7.4) occurred along the northern branch of the NAFZ between Izmit and the Gulf of Saros [35]. Landslides have two main triggers: earthquakes and heavy rainfall. Considering the characteristics of the region, it is possible to say that these landslide triggers exist in the application area.

In the study area, summers are hot and slightly rainy while winters are warm and rainy. The topography of the region and the presence of lakes and dams also affect the weather conditions (http://istanbul.meteor.gov.tr/). The region receives 85% of the total annual precipitation in the period of September-May (http://istanbul.meteor.gov.tr/). In this study, the data of the Florya Meteorology Station, the nearest station to the application site, were assessed. According to the meteorological data for the period 1937–1990, the average monthly rainfall varies between 20.5 mm and 102.0 mm. The annual precipitation in the region varies between 500 mm and 1000 mm, while the long-term average annual precipitation at the Florya Meteorology Station is 642.4 mm. The maximum daily precipitation observed in Istanbul is 88.9 kg/m² (http://www.meteor.gov.tr/). As will be explained in later sections, precipitation strongly affects landslide occurrence in the study area.

Various lithological units from Middle-Late Eocene to Quaternary crop out in the region. The 1/25,000 scale geological map of the study area was prepared by Duman et al. [36]. The areal distributions of the geological formations of the study area are given in Table 1 and Figure 2. The Kirklareli limestone, of Middle-Late Eocene age, is the oldest unit in the study area, while the youngest unit is the recent alluvium. Some inactive normal faults are typical of the area. The dip values of the bedding of the sedimentary units in the area are rather low, 5–15 degrees. For this reason, the strikes and the dip directions vary considerably over short distances. However, there is no considerable folding in the study area [17].

The altitude values in the study area are between 0 and 200 m, while the dominant altitude ranges are 75–100 m and 100–125 m (Table 2). The study area has a dendritic drainage pattern because of the presence of soft lithologies and low slope angles. The general physiographic trend of the area is NW-SE. The slope gradient values range from 0 to 57 degrees. These slope gradient values indicate that the majority of the region has gentle slopes.

3. Assessment of Landslide Conditioning Factors

In this section, the landslide conditioning factors observed in the study area are explained. Before the explanations, the data used are described. In the present study, the digital elevation model (DEM) produced by Duman et al. [17] is used. The DEM was prepared by digitizing the 10 m altitude contours of the 1/25,000 scale topographical map [17]. Maps of slope gradient, heat load, altitude, stream power index, plan curvature, and profile curvature are produced using the DEM in raster format with a pixel size of . The lithology map taken into consideration in this study was prepared by Duman et al. [36]. This vector map was converted to a raster map with a pixel size of by Duman et al. [17].

A reliable landslide inventory defining the type and activity of all landslides, as well as their spatial distribution, is essential before any analysis of the occurrence of landslides and their relationship to environmental conditions is undertaken [34]. A reliable landslide inventory is therefore a crucial part of a landslide susceptibility map, because it is the fundamental component of the assessments. In Turkey, a national-scale landslide inventory project has been conducted by the Geological Research Department of the General Directorate of Mineral Research and Exploration (MTA). To identify the landslides when preparing the inventory, vertical black-and-white aerial photographs of medium scale (1 : 35,000), dated 1955–1956, were used [17]. When describing the type and activity of the landslides in the project [38], mass movements were classified according to the kinematic classification proposed by Varnes [16], that is, as flows, falls, and slides. The landslides are also classified according to their relative depths as shallow (depth < 5 m) and deep seated (depth > 5 m). Landslide activities were classified into two groups, active and inactive, by Duman et al. [38]. Active landslides are defined as those currently moving, whereas inactive ones are defined as relict according to the Working Party on World Landslide Inventory (WP/WLI) [40]. Shallow landslides are classified as active because of their ongoing observed movements [41].

One of the most important stages of landslide susceptibility mapping is to describe the factors governing the landslides identified in the area. A landslide susceptibility mapping procedure for the application site was performed previously by Duman et al. [17] using the logistic regression technique. Duman et al. [17] employed geological formations, geomorphological units, relative permeability of the lithological units, slope gradient, slope aspect, altitude, plan and profile curvatures, and stream power index as the landslide conditioning factors. In the present study, however, two of these parameters, geomorphological units and relative permeability of the lithological units, have been eliminated. Several topographic parameters used in the study already reflect the geomorphological characteristics of the study area, so the geomorphological units are eliminated from the model to avoid redundancy. The relative permeability of the lithological units is too difficult, and sometimes impossible, to determine, so it is also excluded from the model. One of the topographic parameters used by Duman et al. [17] is the slope aspect. The aspect of a slope is the direction, or azimuth, that the slope faces. It strongly influences potential direct incident radiation and temperature. Untransformed slope aspect is a poor variable for quantitative analysis: since 1 degree is adjacent to 360 degrees, the numeric values are very different even though the slope aspect is about the same. Hence, slope aspect values need to be transformed in one of several ways, depending on the precision with which they were measured and the environmental factor(s) the analyst would like them to represent [42]. In this study, the heat load index [42] has been used instead of the slope aspect. According to the descriptive statistics of the parameter, the mean heat load index value on the grid cells with landslides is 0.529 (0.338) (Table 2). The other conditioning parameters used in the study are slope gradient, stream power index, plan and profile curvatures, and altitude. These parameters can also be obtained easily from the digital elevation model of the study area.
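The effect of transforming the slope aspect can be illustrated with a short sketch. It assumes the simple cosine form of the heat load index, (1 − cos(θ − 45°))/2, which ranges from 0 on north-east-facing slopes to 1 on south-west-facing slopes; the exact variant applied in this study is not stated in the text, and the function name is hypothetical.

```python
import math


def heat_load_index(aspect_deg: float) -> float:
    """Rescale slope aspect (degrees clockwise from north) to the range 0..1.

    Assumed form: (1 - cos(aspect - 45 degrees)) / 2, so that north-east
    (45 degrees) maps to 0 and south-west (225 degrees) maps to 1.
    """
    return (1.0 - math.cos(math.radians(aspect_deg - 45.0))) / 2.0


# Aspects of 1 and 360 degrees describe almost the same direction; unlike the
# raw angles, the transformed values are almost the same as well.
print(heat_load_index(1.0), heat_load_index(360.0))
```

With this transformation, numerically distant but physically adjacent aspects receive nearly equal values, which is the property the untransformed aspect lacks.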

The landslides identified in the region are mainly deep seated and active. They are generally located in lithologies that include permeable sandstone layers together with impermeable layers such as claystone, siltstone, and mudstone. This setting is typical of the landslides identified in the study area. Considering this finding, it may be said that one of the main conditioning factors of the landslides in the study area is lithology [17]. As can be seen in Table 1, the majority of the landslides (approximately 60%) occurred in two formations, namely, the Danisment formation-Acmalar member (Toda) and the Ergene formation (Tme). According to Duman et al. [17], another factor governing the landslides in the study area is the sandstone bedding planes. If the orientations of the slope and the bedding plane are roughly similar, some large landslides can occur. In these areas, the initiation of the landslides is controlled by the bedding planes as planar failure, and some rotational landslides are then observed in the displaced and accumulated material [17]. Rarely, some earth flows may occur in this material, depending on heavy precipitation [17]. Besides, the permeability differences between sandstones and claystones are highly important for landslide occurrence in the study area. The sandstones at upper levels are saturated by the surface waters because the surface waters cannot infiltrate into the claystones at lower levels.

One of the most important topographical factors conditioning landslides is the slope gradient. In regional landslide susceptibility or hazard assessments, several researchers (e.g., [7, 8, 33, 44, 45]) used statistical techniques to assess slope gradient in terms of landslide activity. In the present study, the slope gradient is considered as a conditioning factor during the analyses. Descriptive statistics revealed that the landslides in the application site typically occur on gentle slopes (Table 2). Some authors (e.g., [26, 30]) pointed out that altitude is a good indicator of landslide susceptibility. For this reason, altitude has also been accepted as a conditioning factor. The mean altitude value on the grid cells with landslides is 85.1 m (45.2 m) (Table 2).

The term curvature is generally defined as the curvature of a line formed by the intersection of a random plane with the terrain surface [40]. Plan curvature influences land degradation processes through the convergence or divergence of water during downhill flow. In addition, this parameter constitutes one of the main factors controlling the geometry of the terrain surface where landslides occur [47]. In this study, profile and plan curvature values have been calculated using a script, the Digital Elevation Model Analysis Tool (DEMAT), written by Behrens [48] in the Avenue language of ArcView GIS, and these parameters have been considered as conditioning factors. Positive curvature values indicate convexity of the terrain surface, while negative values indicate concavity. The minimum and the maximum profile curvature values were calculated as −2.930 and 3.080 on the grid cells with landslides, and −2.920 and 3.430 on the grid cells without landslides (Table 2). The corresponding ranges of the plan curvature values are −3.130 to 2.870 on the grid cells with landslides and −4.500 to 2.690 on the grid cells without landslides (Table 2).

The last parameter considered in the present study is the stream power index (SPI). It is a measure of the erosive power of water flow based on the assumption that discharge (q) is proportional to the specific catchment area ($A_s$) [20]:

$$\mathrm{SPI} = A_s \tan\beta \qquad (3.1)$$

where $A_s$ is the specific catchment area (m²/m) and $\beta$ is the slope gradient in degrees. Although the minimum and maximum SPI values on the grid cells with and without landslides are close (Table 2), the mean SPI value for the grid cells with landslides is higher than that for the grid cells without landslides (Table 2). Theoretically, the maximum SPI values are calculated in drainage channels. Therefore, the highest SPI values should not be expected on the grid cells with landslides. This expectation is supported by the observation of the maximum SPI value on the grid cells without landslides (Table 2).
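As a minimal sketch, assuming the common formulation SPI = A_s tan β (specific catchment area multiplied by the tangent of the slope gradient), the index can be computed per grid cell as follows. The function name and the input values are hypothetical and are not taken from the study data.

```python
import math


def stream_power_index(specific_catchment_area: float, slope_deg: float) -> float:
    """Return SPI = A_s * tan(beta), with the slope gradient beta in degrees."""
    return specific_catchment_area * math.tan(math.radians(slope_deg))


# Hypothetical cells: a drainage channel collects a large upslope area, so its
# SPI exceeds that of a hillslope cell even when the hillslope is steeper.
channel_cell = stream_power_index(specific_catchment_area=5000.0, slope_deg=4.0)
hillslope_cell = stream_power_index(specific_catchment_area=60.0, slope_deg=12.0)
print(channel_cell, hillslope_cell)
```

The example reflects the remark in the text that the largest SPI values are expected in drainage channels rather than on landslide-affected slopes.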

4. Modeling Approach

Data mining involves various techniques, such as statistics, neural networks, decision trees, genetic algorithms, and visualization techniques, that have been developed over the years. Data mining problems are generally categorized as association, clustering, classification, and prediction [50]. Classification refers to finding rules to assign data items to pre-existing classes [49]. Association analysis is used to discover patterns that describe strongly associated features in the data. The aim of clustering, on the other hand, is to find groups of closely related observations so that observations that belong to the same cluster are more similar to each other than to observations that belong to other clusters [51].

In practice, there are several commercial data mining tools such as Oracle DM, SQL Server Analysis Services, SPSS Clementine, and SAS Enterprise Miner. In the present study, the decision tree technique is used to predict the landslide susceptibility classes by employing Microsoft SQL Server 2008 Analysis Services. The decision tree is a data mining approach that is often used for classification and prediction. Although other methodologies such as neural networks can also be used for classification, the decision tree has the advantage of being easy to interpret and understand, so decision makers can compare it with their domain knowledge for validation and justify their decisions [50]. In addition, decision trees have a few advantages over other data mining algorithms: for instance, they are quick to build and easy to interpret, and prediction based on decision trees is efficient [22].

Decision trees are built through recursive data partitioning, where in each iteration the data are split according to the values of a selected attribute. The recursion stops at “pure” data subsets, which only include instances of the same class [53]. In other words, the principal idea of a decision tree is to split the data recursively into subsets so that each subset contains more or less homogeneous states of the target variable (predictable attribute). At each split in the tree, all input attributes are evaluated for their impact on the predictable attribute. When this recursive process is completed, a decision tree is formed [22]. If the predictable target attribute consists of discrete data, the resulting decision tree model is called a classification tree. However, if the target attribute is a continuous variable, the model is called a regression tree. The process of decision tree building is sometimes called decision tree induction. Many techniques have been developed for decision tree induction. Nevertheless, the general idea of decision tree induction is the same for every type of decision tree method. Each technique employs a learning algorithm to identify a model that best fits the relationships between the attribute set and the class label of the input data. The model generated by a learning algorithm should both fit the input data well and correctly predict the class labels of records it has never seen before [51]. An example showing the generation of a decision tree from training data and its prediction ability on test data is given in Figure 3.
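The recursive partitioning described above can be sketched in a few lines. The sketch below is illustrative only: it uses the Gini impurity as the split criterion, which is an assumption and not necessarily the criterion of the algorithm applied in this study, and the toy records and function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 for a pure subset)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return the (attribute index, threshold) that lowers impurity most, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (j, t), score
    return best


def build_tree(rows, labels):
    """Recursively partition until a subset is pure or no split reduces impurity."""
    split = None if gini(labels) == 0.0 else best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return (j, t,
            build_tree([r for r, _ in left], [y for _, y in left]),
            build_tree([r for r, _ in right], [y for _, y in right]))


def predict(tree, row):
    """Walk from the root to a leaf and return the class label stored there."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree


# Hypothetical training records: (slope in degrees, altitude in m);
# label 1 = landslide, 0 = no landslide.
train_rows = [(5, 40), (8, 60), (12, 90), (20, 150), (25, 170), (30, 180)]
train_labels = [1, 1, 1, 0, 0, 0]
tree = build_tree(train_rows, train_labels)

# Records not seen during training
print(predict(tree, (10, 70)), predict(tree, (28, 160)))
```

Each internal node stores the selected attribute and threshold, and each leaf stores a class label, so the fitted tree can be read directly as a set of rules.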

ID3 is a well-known decision tree algorithm proposed by Ross Quinlan of the University of Sydney, Australia. ID3 was later enhanced to become C4.5, which can handle numeric attributes, missing values, and noisy data. Some decision trees can perform regression tasks, for example, to predict continuous variables such as temperature and humidity. The Classification and Regression Tree (CART) proposed by Breiman is a popular decision tree algorithm for classification and regression [22].
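ID3 selects the attribute to split on by information gain, that is, the reduction in Shannon entropy of the class labels. The following sketch computes this quantity for a single discrete attribute; the records are hypothetical and serve only to illustrate the calculation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by partitioning the labels on a discrete attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical records: lithology class and landslide presence (1) or absence (0)
lithology = ["sandstone", "sandstone", "claystone", "claystone", "limestone", "limestone"]
landslide = [1, 1, 1, 0, 0, 0]
print(information_gain(lithology, landslide))
```

An attribute whose values separate the classes well leaves little entropy in the resulting groups and therefore receives a high gain.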

5. Landslide Susceptibility Mapping and Results

To perform the research reported in the present paper, Microsoft SQL Server 2008 Analysis Services is chosen as the analysis platform because it supports decision trees with continuous variables (called regression trees). Its high scalability and its support for nested tables, automatic feature selection, and automatic cardinality reduction are the other reasons for choosing this data mining platform. Additionally, Microsoft Analysis Services allows building data mining applications via the support of the Microsoft Visual Studio Integrated Development Environment and ADOMD extensions [22]. Since the purpose of this data mining study is to develop a model for predicting the landslide-susceptible areas, the decision tree, a well-known classification technique, has been utilized during the analysis. Microsoft SQL Server Analysis Services employs its own decision tree algorithm, called Microsoft Decision Trees. This commercial algorithm can handle both discrete and continuous variables/attributes and provides useful parameters for configuring the tree induction step, such as tree splitting options. Another reason for selecting this algorithm and software is that it can build dependency network graphs, which show the effects of the variables on one another and the degree of impact of the independent variable(s) on the predictable variable(s).

In this study, all of the input variables and the target output variable are continuous, so the resulting tree is a special version of the decision tree called a regression tree. Regression is similar to classification; the only difference is that regression predicts continuous attributes. Although the basic task of a decision tree algorithm is classification, it can be used for regression as well. Another well-known regression tree algorithm is CART. The Microsoft Decision Trees algorithm supports regression in SQL Server 2005 and 2008. Microsoft Regression Trees contain a linear regression formula at each leaf node. A regression tree has an advantage over simple linear regression in that a tree can represent both linear and nonlinear relationships [22].
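The advantage of a regression tree with a linear formula at each leaf can be illustrated with a one-split sketch. This is not the Microsoft Decision Trees algorithm; it is a simplified, hypothetical illustration with a single input variable, in which the threshold is chosen to minimize the total squared error of two least-squares lines.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b


def sse(xs, ys, line):
    """Sum of squared residuals of a fitted line."""
    a, b = line
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def best_split_with_linear_leaves(xs, ys, min_leaf=2):
    """Try every threshold and keep the one whose two leaf lines give the lowest SSE.

    Returns (total SSE, threshold, left line, right line).
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot place a threshold between equal x values
        lx, ly = zip(*pairs[:i])
        rx, ry = zip(*pairs[i:])
        left, right = fit_line(lx, ly), fit_line(rx, ry)
        total = sse(lx, ly, left) + sse(rx, ry, right)
        if best is None or total < best[0]:
            best = (total, (pairs[i - 1][0] + pairs[i][0]) / 2, left, right)
    return best


# Hypothetical V-shaped response that a single straight line cannot follow
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [4, 3, 2, 1, 1, 2, 3, 4]
single_line_error = sse(xs, ys, fit_line(xs, ys))
tree_error, threshold, left_line, right_line = best_split_with_linear_leaves(xs, ys)
print(single_line_error, tree_error, threshold)
```

A single regression line leaves a large residual error on such data, whereas one split with a separate line in each leaf follows both branches of the relationship.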

The data understanding and data preparation stages are among the most important steps in data mining applications [54]. At the start of this study, the entire dataset is converted from SPSS format to Access 2003 format so that it can be used easily in Analysis Services. The dataset used in the analysis consists of 17 input attributes, 1 predictable attribute (landslide; cells with landslide, cells without landslide), and 280,132 records/cases (Table 3). Of these, 226,041 cases belong to pixels without landslides, 53,674 cases belong to pixels with landslides, and 417 cases are missing. Although the predictable attribute (landslide information) is discrete, it is handled as a continuous attribute to obtain more accurate results, that is, values in the range of 0-1 instead of only 0 and 1. After the data preparation stage, the “Microsoft Decision Trees” algorithm is run on the training dataset, with the whole dataset separated into 85% training and 15% test cases. Using the trained model, the landslide susceptibility value of each pixel on the map is determined via the prediction tool of the software, and the landslide susceptibility map is produced from these predicted numerical values. One of the useful features of the “Microsoft Decision Trees” algorithm is building and reporting dependency network graphs. The dependency network graphs display the relationships among attributes derived from the decision tree model’s content [22]. An example view of the decision tree derived from our model is presented in Figure 4.
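The case counts and the 85%/15% partition can be checked with a short sketch. It assumes a simple random holdout over all cases; how the software actually draws the partition, and whether the missing cases take part in it, is not stated in the text, and the function name is hypothetical.

```python
import random


def split_cases(n_cases, train_fraction=0.85, seed=0):
    """Shuffle case indices and split them into training and test sets."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    n_train = round(n_cases * train_fraction)
    return indices[:n_train], indices[n_train:]


# Case counts reported in the text
without_landslide, with_landslide, missing = 226_041, 53_674, 417
total = without_landslide + with_landslide + missing

train_idx, test_idx = split_cases(total)
print(total, len(train_idx), len(test_idx))
```

Under these assumptions the three reported counts add up to the 280,132 records of the dataset, and the split yields about 238 thousand training cases and about 42 thousand test cases.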

Using the predicted landslide susceptibility values, the landslide susceptibility map of the study area is produced (Figure 5). On the map, the reddish tones (close to 1) indicate highly susceptible areas, while the green parts (close to 0) represent more stable zones in the region. The dependent variable is controlled by the coefficients of the independent variables in the equations. The order of importance of the effects is obtained from the dependency network graphs by considering the coefficients of the independent variables in the model. According to the dependency network graph, the most effective parameters are the geological formations Danisment fm.-Acmalar m. (Toda) and Kirklareli limestone (Tek), SPI, slope gradient, and profile curvature (Table 3). This result is in good accordance with the landslide density assessment of the lithological units, because the majority of the landslides occurred in these lithologies. After production of the landslide susceptibility map, its performance is also assessed. To assess the spatial effectiveness of the susceptibility map using a threshold-independent method, the ROC curve was drawn in the present study (Figure 6). Plotting the accuracy values obtained against the whole range of possible threshold values constitutes the essence of a ROC curve evaluation [51]. The area under the ROC curve (AUC) is one of the most commonly used accuracy statistics for prediction models in natural hazard assessments [56]. The minimum AUC value of 0.5 means no improvement over random assignment, while the maximum value of 1 denotes perfect discrimination. Finally, according to the results of the AUC evaluation (Figure 6), the obtained map exhibits a good performance.
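The AUC statistic can be computed directly from predicted susceptibility values and observed landslide labels. The sketch below uses the rank interpretation of the AUC, namely the probability that a randomly chosen landslide cell receives a higher predicted value than a randomly chosen non-landslide cell; the scores and labels are hypothetical and are not the study data.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical predicted susceptibility values (0..1) and observed labels
scores = [0.92, 0.81, 0.40, 0.65, 0.30, 0.15, 0.55, 0.05]
labels = [1, 1, 1, 0, 0, 0, 1, 0]
print(roc_auc(scores, labels))
```

Because the calculation compares every landslide cell with every non-landslide cell, it is quadratic in the number of cells; for a full raster, a sort-based implementation would be preferable.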

6. Conclusions

Duman et al. [17] previously produced the landslide susceptibility map of the same area by logistic regression analysis. The model produced by Duman et al. [17] was used to generate a landslide susceptibility map that correctly classified 83.8% of the landslide-prone areas. This percentage (83.8%) refers only to the correct classification of cells coded “one” (landslide presence). When the cells coded “zero” (landslide absence) are also considered, the overall correct classification value becomes 76.0% [17]. In the present study, to check the performance of the produced map, the ROC curve is drawn and the AUC value is obtained as 89.6%. This result indicates that the performance of the map produced in this study is higher than that of the map produced by Duman et al. [17]. As a result of the application of the decision tree method, two geological formations (Danisment fm.-Acmalar m., Toda; Kirklareli limestone, Tek), stream power index, and slope gradient are obtained as the most effective parameters for landslide occurrence in the study area. If the field conditions are taken into account, this result is physically plausible. It is possible to produce landslide susceptibility maps by various approaches such as statistical methods, artificial neural networks, fuzzy approaches, and so forth. The main difference between the decision trees employed in the present study and the other methods is that decision trees can exhibit the order of importance of the conditioning parameters. This allows the expert to compare the analysis results with field observations. In particular, the visual interpretation of decision trees is a powerful tool when compared with the other approaches. Additionally, the decision tree applied in the study shows a high prediction capacity. These interpretations reveal that the decision tree is a useful tool for producing reliable landslide susceptibility maps. The reliability of landslide susceptibility maps is highly important because landslide hazard and risk maps are produced from the landslide susceptibility maps. The produced map has a sufficient capacity for medium-scale and regional planning purposes. This means that, according to the results of the present study, the decision tree is an efficient tool for medium-scale and regional landslide susceptibility analyses.

As a final point, in the present study, the decision tree, one of the data mining methods, is investigated to produce the landslide susceptibility map of a landslide-prone area (Cekmece, Istanbul, Turkey). The developed decision tree model yields two important results: it predicts the landslide susceptibility degrees, and it reveals the order of effect of the input attributes on landslide occurrence.