Accurate and timely collection of urban land use and land cover information is crucial for many aspects of urban development and environmental protection. Very high-resolution (VHR) remote sensing images have made it possible to detect and distinguish detailed information on the ground. However, the abundant texture information and limited spectral channels of VHR images increase the intraclass variance and decrease the interclass variance. Numerous studies on pixel-based classification algorithms have revealed the limitations of conventional pixel-based classifiers when extracting land cover information from VHR remote sensing imagery. To evaluate the advantages of classifier ensemble strategies and the object-based image analysis (OBIA) method for VHR satellite data classification in complex urban areas, we present an approach that integrates multiscale segmentation OBIA with a mature classifier ensemble method, random forest. The framework was tested on Chinese GaoFen-1 (GF-1) and GF-2 VHR remotely sensed data over the central business district (CBD) of the Zhengzhou metropolitan area. The process flow of the proposed framework includes data fusion, multiscale image segmentation, optimal segmentation scale evaluation, multivariance texture feature extraction, random forest ensemble classifier construction, accuracy assessment, and time consumption analysis. The proposed framework was compared with several mature state-of-the-art machine learning algorithms, namely the k-nearest neighbor (KNN), support vector machine (SVM), and decision tree classifier (DTC). Experimental results showed that the overall accuracy (OA) of the proposed method reaches 99.29% and 98.98% for the GF-1 and GF-2 datasets, respectively. On the GF-1 dataset, the OA is 26.89%, 11.79%, 11.89%, and 4.62% higher than that of the DTC, SVM, KNN, and pixel-based random forest (RF), respectively; on the GF-2 dataset, the corresponding gains are 32.31%, 13.48%, 9.77%, and 7.72%. In terms of running time, rough statistics show that OBIA-RF takes 223.55 s, SVM 403.57 s, KNN 86.93 s, and DT 0.61 s on average over the GF-1 and GF-2 datasets. Taking both classification accuracy and running time into account, the proposed method shows good generalization ability and robustness for complex urban surface classification with high-resolution remotely sensed data.

1. Introduction

The classification accuracy of remotely sensed data and its sensitivity to classification algorithms are of critical importance to the geospatial community, as classified images provide the base layers for many applications and models [1]. The recent availability of submeter resolution imagery from advanced satellite sensors, such as WorldView-3 and the Chinese GaoFen series, provides new opportunities for detailed urban land cover mapping at the object level [2]. Applications such as environmental monitoring, natural resource management, and change detection require more accurate, detailed, and constantly updated land cover mapping [3]. Detailed urban land cover information is essential not only for understanding urban environmental change and for monitoring and managing the urban ecological environment but also for supporting government decisions on urban expansion, planning, and management [4–6]. In the last few decades, machine learning classification of remotely sensed imagery has become an effective and convenient means of obtaining this information, because of the unique advantages of frequent and wide coverage [7].

Most land use and land cover classification research has traditionally been based on low- and medium-resolution remotely sensed imagery, such as MODIS [6, 8, 9], Landsat [10–12], and SPOT1/4 [13]. However, urban surface cover is highly heterogeneous at fine spatial scales, resulting in a large number of mixed pixels in medium- and low-resolution images. With the rapid development of sensor technology, high-resolution remotely sensed imagery (IKONOS, Quickbird, GeoEye-1, WorldView-1-4, GF-1/2, etc.) with meter or submeter resolution is becoming more and more popular [14]. With its high definition and abundant spatial information, high-resolution satellite imagery can compensate for the mixed-pixel problem of low- and medium-resolution images in urban land cover classification [15, 16]. High spatial resolution images, with a spatial resolution of about 4 m or finer, make it possible to map complex urban surfaces. A major challenge in using high spatial resolution imagery for detailed urban mapping comes from the high level of intraclass spectral variability, such as between building roofs and roads, and the low level of interclass spectral variability, such as between water bodies and shadows. In this situation, traditional pixel-based classification algorithms such as maximum likelihood classification (MLC) can easily produce misclassification errors and generate the salt-and-pepper effect, which may reduce classification accuracy for very high-resolution imagery.

There are currently various classification algorithms, each with its own advantages and limitations [17]. Common mature statistical machine learning algorithms, such as MLC, assume that the training data follow a normal distribution, an assumption that high-resolution images cannot meet. Many previous studies have shown that machine learning algorithms such as support vector machines (SVM) [1], artificial neural networks (ANN) [18], and decision trees [19] are popular for land cover classification. These classifiers have limitations in practical applications over volatile and complex urban areas, owing to the increased complexity of the spatial relationships between pixels and the complexity of earth surface phenomena [21, 22]. Recently, ensemble methods have been introduced to integrate multiple single classifiers and improve classification performance. The combination of multisource remote sensing and geographic data is believed to offer improved accuracies in land cover classification [23]. In general, building an ensemble involves two steps: generating the base learners and combining them. To obtain a good ensemble, the base learners should be as accurate and as diverse as possible. Owing to their high potential and superior performance, ensemble methods have been employed in the remote sensing community. Existing theoretical and empirical studies have reported that ensemble classifiers can obtain more accurate predictions and outperform individual classifiers [3, 17, 23–25]. The random forest (RF) classifier, one of the more popular ensemble learning algorithms in recent years, is composed of multiple decision trees, each trained on a bootstrap sample, with the final prediction made by majority vote [26, 27]. It has received increasing attention due to its excellent classification results, its ability to avoid overfitting, and its high processing speed [28–31].

To overcome the limitations of pixel-based classification, object-based image analysis (OBIA), or geographic object-based image analysis (GEOBIA), has been introduced to improve the quality of information extraction from high-resolution imagery. OBIA consists of two main steps, segmentation and classification. Image segmentation is the critical step: the final feature extraction and classification in OBIA are highly dependent on the quality of image segmentation [32]. The unit processed in OBIA is not a pixel but an object composed of multiple adjacent homogeneous pixels obtained through segmentation, which contains not only spectral information but also textural and contextual information from the imagery [32, 33]. Owing to these advantages, OBIA has become more popular in the remote sensing community and has been successfully applied in land cover classification [34–36]. Many previous studies have shown that the OBIA method outperforms pixel-based classification [32, 37–40]. However, the number of input features used for classification has grown exponentially, and some of these features are irrelevant or redundant and affect the performance of the classifier, especially now that the purpose of segmentation has changed from helping pixel labeling to object identification [32].

In this paper, we verify the ability of GF-1 and GF-2 very high-resolution imagery in urban land use and land cover classification. For this purpose, we combined the random forest ensemble classifier with the OBIA method and tested the method on two selected complex urban areas of a metropolitan city. The proposed strategy was also compared with the pixel-based random forest and with mature state-of-the-art machine learning algorithms, including the SVM, KNN, and DT classifiers, in terms of classification accuracy and operational efficiency.

The rest of this paper is organized as follows: a brief introduction about the study area and dataset and preprocessing are given in Section 2. The framework and details of the proposed methodology strategy based on object-oriented analysis and random forest are drawn in Section 3. The results and discussion are shown in Section 4. Finally, the conclusions are drawn in Section 5.

2. Study Area and Data Preprocess

The study site is the central business district (CBD) of Zhengdong New District, located in the eastern part of Zhengzhou, the capital of Henan province. Zhengdong New District is a new urban area invested in and developed by the Zhengzhou Municipal Committee and municipal government, in accordance with the Zhengzhou city master plan approved by the State Council, in order to implement the megacity framework, expand the size of the city, and accelerate urbanization and urban modernization. Based on the National Economic and Technological Development Zone, the original area of the CBD is about 25 km², bounded by National Road 107 to the west, the Jingzhu Expressway to the east, the airport highway to the south, and the Lianhuo Expressway to the north; the long-term planning area of the CBD is about 150 km². The study area is focused on Ruyihu, the center of the CBD (see Figure 1), which is surrounded by three landmarks of the CBD: the Zhengzhou International Convention and Exhibition Center, the Henan Arts Center, and the Zhengzhou Convention and Exhibition Hotel. The land surface is dominated by human-made materials, which makes identifying different land use and land cover types a challenging task. According to the planning and construction situation, the surface cover is mainly divided into urban building areas (UB), urban commercial areas (UC), urban green areas (UG), urban road areas (UR), urban water areas (UW), and high building shadow (HS) (see Table 1).

Data availability: the very high-resolution remotely sensed data used in this research were provided by the Henan Data and Application Center of the High Resolution Earth Observation System under a contract with the National Defense Science and Technology Bureau of Henan. Unfortunately, we do not have permission to share the high-resolution satellite data. However, we can share the code used in this research, so that researchers can test the algorithms with their own datasets and repeat the experiments to obtain similar conclusions. Researchers who are interested in this code can download it from https://pan.baidu.com/s/19nXD7oHwq0FnpZJ7T6p5HQ using the password 6xae, or contact the corresponding author to obtain the source data for secondary analysis.

2.1. Remote Sensing Data and Preprocessing

Under the “Chinese high-resolution earth observation” major project, a series of high-resolution satellites have been launched, including GF-1 (2 m res. panchromatic camera/8 m res. multispectral camera/16 m res. wide-angle multispectral camera), GF-2 (1 m res. panchromatic camera/4 m res. multispectral camera), GF-3 (1 m res. C-band synthetic aperture radar), GF-4 (50 m res. fixed-point camera in geostationary orbit), GF-5 (VNIR hyperspectral camera), GF-6 (2 m res. wide-angle multispectral camera), and GF-7 (stereographic cartography cameras). The GF-1 satellite is the first low earth orbit remote sensing satellite of China’s high-resolution earth observation system, and it breaks through the key technologies of optical remote sensing that combine high spatial resolution, multispectral imaging, and wide coverage. It meets the need for research data in fields such as resources and environment, precision agriculture, and disaster measurement and has become an important means of providing information services. It is of great strategic significance for improving the level of satellite engineering in China and the self-sufficiency rate of high-resolution data.

The selected remotely sensed data are GF-1 and GF-2 very high-resolution satellite images, acquired on July 14, 2015 and July 30, 2015, respectively. The specific parameters are shown in Table 2.

The preprocessing of the selected dataset includes radiometric calibration, atmospheric correction, geometric registration, orthorectification, image fusion (NNDiffuse pan-sharpening algorithm), and image resizing. Radiometric calibration is the process of converting the DN values of the image data into apparent reflectance. Equation (1) can be used to convert the channel observation DN value to the equivalent brightness (radiance) value:

$$L = \mathrm{Gain} \cdot \mathrm{DN} + \mathrm{Bias}, \tag{1}$$

where $L$ is the equivalent brightness value, Gain is the calibration slope, DN is the satellite observation value, and Bias is the calibration intercept; these parameters can be obtained from the metadata file delivered with the satellite data. The apparent reflectance can then be calculated from the brightness value using

$$\rho = \frac{\pi \, L \, d^{2}}{\mathrm{ESUN} \cdot \cos\theta}, \tag{2}$$

where $\rho$ is the apparent reflectance, ESUN is the solar spectral irradiance, $d$ is the sun-earth distance, and $\theta$ is the solar zenith angle.
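The two calibration steps can be sketched in Python as follows. This is a minimal illustration that assumes the linear DN-to-radiance calibration and the standard apparent (top-of-atmosphere) reflectance formula; the function names and the numeric values in the usage lines are hypothetical and are not actual GF-1 or GF-2 coefficients.

```python
import math

def dn_to_radiance(dn, gain, bias):
    """Equivalent radiance from a raw digital number: L = Gain * DN + Bias."""
    return gain * dn + bias

def radiance_to_apparent_reflectance(radiance, esun, sun_earth_distance, solar_zenith_deg):
    """Apparent reflectance: rho = pi * L * d^2 / (ESUN * cos(theta))."""
    cos_theta = math.cos(math.radians(solar_zenith_deg))
    return math.pi * radiance * sun_earth_distance ** 2 / (esun * cos_theta)

# Hypothetical inputs, for illustration only (not real sensor coefficients).
radiance = dn_to_radiance(dn=400, gain=0.2, bias=0.0)
reflectance = radiance_to_apparent_reflectance(
    radiance, esun=1900.0, sun_earth_distance=1.0, solar_zenith_deg=30.0
)
print(round(radiance, 2), round(reflectance, 4))
```

In practice the gain, bias, ESUN, and solar geometry would be read per band from the metadata file mentioned above.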

The atmospheric correction is based on the 6S radiative transfer model, which is included as a package in Pixel Information Expert (PIE) (http://www.piesat.cn/en/index.html). The purpose of atmospheric correction is to eliminate the effects of atmospheric absorption and scattering on the signal from the sun and the target.

Geometric correction includes image registration and orthorectification. The purpose of geometric correction is to correct image deformation caused by system and nonsystemic factors. In this research, the image-to-image registration method was selected to correct multispectral data based on panchromatic data of GF-1 and GF-2 sensors, respectively.

Orthorectification is the process of correcting the spatial and geometric distortions of an image to generate an orthographic plane image from the multicenter projection image. In addition to correcting geometric distortions caused by general system factors, it can also eliminate geometric distortion caused by terrain.

Image fusion is the process of generating new images under a prescribed geographical coordinate system according to a certain algorithm. This study combines multispectral data of lower spatial resolution with single-band (panchromatic) images of high spatial resolution, so that the fused images have both high spatial resolution and rich spectral information.

3. Methodology

The proposed methodology in this research (shown in Figure 2) includes three main stages: (1) multiscale segmentation and multifeature extraction; (2) construction of state-of-the-art machine learning algorithms, namely the decision tree classifier (DTC), random forest (RF), support vector machine (SVM), k-nearest neighbor (KNN), and object-based image analysis random forest (OBIA-RF); and (3) accuracy assessment, comparison analysis, and discussion.

3.1. Multiscale Segmentation and Feature Extraction

Very high-resolution (VHR) remote sensing images are limited in spectral information: in general, they have only four spectral bands, namely blue, green, red, and near infrared. At the same time, VHR images are rich in detail and present more specific details of the land surface. To overcome the spectral limitation and make full use of the spatial detail of these data, we performed multiscale segmentation after data preprocessing and extracted features from the segmented results to obtain multifeature image sets as inputs to the image classification models.

The quality of segmentation has a direct effect on classification performance and depends on the segmentation parameters selected by the analyst. Parameter selection for most segmentation algorithms is a subjective task carried out by trial and error. The multiscale segmentation algorithm employed in this experiment, currently the most popular method, merges pixels of the original image into small object patches from the bottom up and then merges the small patches into larger ones to complete the merging of regional objects [41–43]. The three major parameters of the multiscale segmentation algorithm are scale, shape, and compactness, which define within-object homogeneity. Here, we select appropriate scale parameters and heterogeneity criteria to ensure the highest homogeneity within each generated object and the highest heterogeneity between adjacent objects. The scale parameter is considered the parameter that most affects segmentation quality [44–46]. In this study, the shape parameter was set to 0.1. The optimal segmentation scale parameter of the GF-1 image is quantitatively evaluated by ESP-2 (estimation of scale parameter), whose principle is to select the optimal scales based on the rate of change (ROC) curve of the local variance (LV) of object heterogeneity at the corresponding scale [44, 47]. Peaks of the ROC-LV curve are considered the most appropriate segmentation scales, at which the image is segmented at the most suitable levels. The ROC is defined as

$$\mathrm{ROC} = \frac{\mathrm{LV}_{l} - \mathrm{LV}_{l-1}}{\mathrm{LV}_{l-1}} \times 100, \tag{3}$$

where $\mathrm{LV}_{l}$ is the mean standard deviation of the objects in the target layer and $\mathrm{LV}_{l-1}$ is the mean standard deviation of the objects in the next lower layer. Once the ROC was obtained, the optimal segmentation scale parameter was selected by visual interpretation of the segmentation result and of how well the object boundaries match the actual features [47].
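The ROC-LV criterion can be illustrated with a short Python sketch. It is not the ESP-2 tool itself; it only assumes that the local variance has already been measured at a sequence of increasing scale parameters, and the function names and sample values are hypothetical.

```python
def rate_of_change(local_variance):
    """ROC (%) of local variance between each scale level and the next lower one."""
    return [
        (current - lower) / lower * 100.0
        for lower, current in zip(local_variance, local_variance[1:])
    ]

def roc_peak_scales(scales, local_variance):
    """Scales at which the ROC-LV curve has a local maximum (candidate optimal scales)."""
    roc = rate_of_change(local_variance)
    roc_scales = scales[1:]  # the lowest scale has no lower level to compare with
    return [
        roc_scales[i]
        for i in range(1, len(roc) - 1)
        if roc[i] > roc[i - 1] and roc[i] > roc[i + 1]
    ]

# Hypothetical local-variance measurements, for illustration only.
scales = [50, 60, 70, 80, 90, 100]
local_variance = [10.0, 10.5, 11.6, 11.9, 12.1, 12.2]
print(roc_peak_scales(scales, local_variance))
```

The candidate scales returned this way would still be checked visually against the actual feature boundaries, as described above.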

On the basis of the best segmentation result, a total of 24 spectral features, texture features, and spatial geometric features were extracted. The extracted spectral, textural, and spatial features from segmented VHR remotely sensed data can be summarized as shown in Table 3.

In detail, the selected spectral features are the mean values of all four bands (that is, the average value of each band over the image object). The brightness feature is the sum of the average values of the layers containing spectral information divided by the number of layers of the image object, which can be calculated by

$$B = \frac{1}{n}\sum_{i=1}^{n} \bar{c}_{i}, \tag{4}$$

where $B$ is the brightness value of the object, $n$ is the total number of bands contained in the object, and $\bar{c}_{i}$ is the average gray value of the object in band $i$.

In the high-resolution image, since the reflectivity of the water body in the near-infrared band is significantly lower than that of other ground objects, the shadows are very similar in many features with water body so that they are difficult to separate. In order to highlight the water, the ratio and standard deviation of the fourth band and NDWI were additionally extracted. The ratio of the fourth band is the average gray value of all pixels in the fourth band divided by the average gray value of all pixels in all 4 bands of the image. In addition, only layers containing spectral information can be used to obtain reasonable results. The standard deviation of band 4 is calculated from all the pixel values contained in an object in band 4.

The NDWI is the normalized ratio index between the green band and the near-infrared band of the image. Using the NDWI makes it easier to distinguish water from other features in the image. It can be calculated by

$$\mathrm{NDWI} = \frac{G - \mathrm{NIR}}{G + \mathrm{NIR}}, \tag{5}$$

where $G$ is the value of the object in the green band and $\mathrm{NIR}$ is the value of the object in the near-infrared band.

NDVI is a vegetation index based on the reflection characteristics of vegetation in the visible and infrared bands. It is the ratio of the difference between the reflection intensity in the near-infrared band and that in the visible red band to the sum of the two. The formula can be described as

$$\mathrm{NDVI} = \frac{\mathrm{NIR} - R}{\mathrm{NIR} + R}, \tag{6}$$

where $\mathrm{NIR}$ is the value of the object in the near-infrared band and $R$ is the value of the object in the red band.
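The object-level spectral features described above (brightness, NDWI, and NDVI) can be computed from the per-band mean values of an object, as in the following minimal Python sketch. The function names and the band means in the usage lines are hypothetical; returning 0 when both bands are zero is an assumption made here to avoid division by zero.

```python
def brightness(band_means):
    """Mean of the per-band average gray values of one image object."""
    return sum(band_means) / len(band_means)

def normalized_difference(a, b):
    """(a - b) / (a + b); returns 0.0 when both values are zero (assumption)."""
    total = a + b
    return 0.0 if total == 0 else (a - b) / total

def ndwi(green, nir):
    """Normalized difference water index from green and near-infrared means."""
    return normalized_difference(green, nir)

def ndvi(nir, red):
    """Normalized difference vegetation index from near-infrared and red means."""
    return normalized_difference(nir, red)

# Hypothetical band means of a water-like object: blue, green, red, near infrared.
blue, green, red, nir = 80.0, 90.0, 70.0, 30.0
print(brightness([blue, green, red, nir]), ndwi(green, nir), ndvi(nir, red))
```

For this water-like example the NDWI is positive and the NDVI is negative, which is the contrast that the two indices are meant to expose.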

Although limited in spectral information, high-resolution remote sensing images contain rich geometric and structural information, which reflects the spatial distribution and geometric forms of ground objects. Eight texture features were extracted from the gray-level cooccurrence matrix (GLCM): entropy, mean, variance, homogeneity, contrast, dissimilarity, correlation, and angular second moment. The GLCM, also known as the gray-level spatial dependence matrix, is a statistical method and one of the most popular techniques for texture analysis. It has a strong ability to assess texture by considering the spatial relationship between a pixel and its surroundings. The GLCM mean is not simply the average of all original pixel values; each pixel value is weighted by the frequency of its occurrence in combination with a certain neighboring pixel value. The GLCM variance plays the same role as the common descriptive statistic called variance.

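Entropy measures the complexity of a given image, reflecting the sharpness of the image and the depth of the texture; it can be calculated using

$$\mathrm{Entropy} = -\sum_{i}\sum_{j} p(i,j)\,\ln p(i,j), \tag{7}$$

where $i$ and $j$ denote the row and column positions in the GLCM and $p(i,j)$ is the probability of occurrence of the pixel pair at a certain distance and angle.

Contrast measures the local variations and the depth of the texture grooves in the GLCM. The larger the contrast, the deeper the texture grooves and the clearer the image. Contrast can be calculated by

$$\mathrm{Contrast} = \sum_{i}\sum_{j} (i-j)^{2}\, p(i,j). \tag{8}$$

Homogeneity weights the values by the inverse of the contrast weight, with weights decreasing away from the diagonal, and can be calculated using

$$\mathrm{Homogeneity} = \sum_{i}\sum_{j} \frac{p(i,j)}{1+(i-j)^{2}}. \tag{9}$$

The correlation coefficient measures the degree to which the gray levels of paired pixels are associated and can be calculated by

$$\mathrm{Correlation} = \sum_{i}\sum_{j} \frac{(i-\mu_{i})(j-\mu_{j})\, p(i,j)}{\sigma_{i}\,\sigma_{j}}, \tag{10}$$

where $\mu_{i}$, $\mu_{j}$ and $\sigma_{i}$, $\sigma_{j}$ are the GLCM means and standard deviations along the rows and columns.

Angular second moment (ASM) uses each $p(i,j)$ as a weight for itself; high values of ASM occur when the window is very orderly, so it measures the homogeneity of a given image. ASM can be calculated using

$$\mathrm{ASM} = \sum_{i}\sum_{j} p(i,j)^{2}. \tag{11}$$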
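To make the GLCM texture measures concrete, the following pure-Python sketch builds a symmetric, normalized cooccurrence matrix for a single pixel offset and derives entropy, contrast, homogeneity, and ASM from it. It is an illustration under stated assumptions (a small integer-quantized image, one horizontal offset, natural logarithm), not the implementation used in this study; the function names are hypothetical.

```python
import math

def glcm(image, levels, dx=1, dy=0):
    """Symmetric, normalized gray-level cooccurrence matrix for one pixel offset."""
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1.0
                counts[j][i] += 1.0  # count each pair in both directions
    total = sum(map(sum, counts))
    return [[value / total for value in row] for row in counts]

def texture_features(p):
    """Entropy, contrast, homogeneity, and ASM of a normalized GLCM p."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    return {
        "entropy": -sum(v * math.log(v) for _, _, v in cells if v > 0),
        "contrast": sum(v * (i - j) ** 2 for i, j, v in cells),
        "homogeneity": sum(v / (1 + (i - j) ** 2) for i, j, v in cells),
        "asm": sum(v * v for _, _, v in cells),
    }

# A 2 x 2 checkerboard with two gray levels: maximal local variation.
checkerboard = [[0, 1], [1, 0]]
print(texture_features(glcm(checkerboard, levels=2)))
```

A perfectly uniform patch gives contrast 0 and ASM 1, whereas the checkerboard gives a higher contrast and a lower ASM, which matches the interpretation of the two measures given in the text.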

In addition to spectral and texture features, the geometric characteristics of high-resolution remote sensing images are extremely significant for detailed land cover information extraction. In this article, seven spatial geometric features were selected: area, border index, compactness, density, length, length/width, and shape index. The area of an image object is obtained by multiplying the number of pixels constituting the object by the ground area covered by one pixel. The border index is the ratio between the border length of the image object and that of its smallest enclosing rectangle. The more compact the image object, the smaller its border. The density describes the distribution of the image object in pixel space, that is, how compact the image object is. Density is based on the covariance matrix and is calculated by dividing the number of pixels constituting the image object by its approximate radius. The length-width ratio, calculated from the length and the width of the object, can be used as one of the features for road extraction. In addition, the length-width ratio can be used to calculate the length of the image object. The shape index was selected to describe the smoothness of the border of an object. The smoother the border of the image object, the lower its shape index; the more fragmented the image object, the larger its shape index. It can be calculated by dividing the border length of an object by the volume of the object.

3.2. Classification Algorithms

In this research, an OBIA-RF method, which combines OBIA with a classifier ensemble to take advantage of both, was constructed, and four state-of-the-art classification algorithms, namely RF, KNN, SVM, and DTC, were selected for performance comparison. Performance was evaluated by quantitative indicators such as overall accuracy, kappa coefficient, and execution time.

3.2.1. Random Forest

The random forest algorithm is an ensemble learning method proposed by Leo Breiman in 2001 [48]. It is one of the best-known ensemble learning methods and has the advantages of rapid out-of-sample prediction, only slight parameter tuning, and the ability to rank the importance of features [28]. Decision trees in RF are generated by randomly selecting subsets of the training sample set (bootstrap sampling) and randomly selecting the feature variables considered for the optimal split. The resulting decision trees do not need pruning, and the final classification result is obtained by a majority vote over the classification results of all decision trees in the ensemble. The Gini index, which measures the impurity of a given element with respect to the classes, is selected as the measure for best split selection in RF [49]. There are two key parameters in constructing a random forest: the number of trees and the number of randomly selected features. According to the literature, the number of selected features is more important than the number of trees; in general, the number of randomly selected features at each split is set to the square root of the number of input features [50–52]. This parameter can be optimized based on the out-of-bag error estimate. In this research, the number of trees is set to 100 and the number of randomly selected features is $\sqrt{M}$, where $M$ is the total number of features. The number of trees for RF was then tuned from 50 to 500 in steps of 10.
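The building blocks named above (bootstrap sampling, the Gini index, the square-root rule for the number of candidate features, and majority voting) can be sketched in a few lines of Python. This is not a full random forest and not the implementation used in the experiments; the function names are hypothetical, the class labels reuse the abbreviations of Table 1, and rounding the square root down is an assumption.

```python
import math
import random
from collections import Counter

def gini_impurity(labels):
    """Gini index of a set of class labels: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def bootstrap_sample(samples, rng):
    """Draw len(samples) items with replacement, as used to train each tree."""
    return [rng.choice(samples) for _ in samples]

def features_per_split(n_features):
    """Square-root rule for candidate features per split (rounded down: assumption)."""
    return max(1, math.isqrt(n_features))

def majority_vote(tree_predictions):
    """Final class label: the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
print(gini_impurity(["UB", "UB", "UW", "UG"]))     # impurity of a mixed node
print(len(bootstrap_sample(list(range(10)), rng)))  # same size as the training set
print(features_per_split(24))                       # 24 features are used in this study
print(majority_vote(["UB", "UW", "UB"]))            # votes of three hypothetical trees
```

With the 24 features extracted in Section 3.1, the square-root rule gives 4 candidate features per split under the rounding assumption above.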

3.2.2. Support Vector Machine

SVM is one of the most appealing algorithms for remotely sensed data classification because it generalizes well even with limited training samples, which are common in remote sensing data processing [53]. As a supervised nonparametric statistical learning method, SVM does not require a training set that strictly conforms to the standard independent and identical distribution. The advantages of SVM come from two aspects: transforming the training set from the original space into a very high-dimensional new space and finding a large-margin linear boundary in the new space. SVM is a classifier based on the theory of structural risk minimization, which tries to lower the generalization error by maximizing the margin on the training data. Thus, SVM looks for an ideal margin by solving the optimization problem

$$\min_{\mathbf{w},\, w_{0}} \ \frac{1}{2}\lVert \mathbf{w} \rVert^{2} \quad \text{subject to} \quad y_{i}\left(\mathbf{w}^{\top}\mathbf{x}_{i} + w_{0}\right) \ge 1, \quad i = 1, \ldots, n, \tag{12}$$

where $\mathbf{x}_{i}$ are the training samples and $y_{i} \in \{-1, +1\}$ are their labels.

Furthermore, for classes that are nonseparable, the optimization can be solved by the so-called ‘kernel trick.’ The optimization procedure seeks the coefficients $a_{i}$ and $w_{0}$ in Equation (13),

$$f(\mathbf{x}) = \sum_{i=1}^{n} a_{i}\, y_{i}\, K(\mathbf{x}_{i}, \mathbf{x}) + w_{0}, \tag{13}$$

where $K(\cdot,\cdot)$ is the kernel function. By default, the kernel function is set to the Gaussian kernel, the kernel scale is 8, the box constraint is 1, and the data are standardized. The kernel scale for SVM was then tuned from 0.1 to 8 in steps of 0.4.
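The kernelized decision function can be illustrated with the sketch below. It assumes that the kernel scale acts by dividing the features before the Gaussian kernel is evaluated, which is one common convention and is not confirmed by the text; the support vectors, coefficients, and function names are hypothetical and are not fitted values from this study.

```python
import math

def gaussian_kernel(x, z, kernel_scale=8.0):
    """Gaussian kernel; ASSUMPTION: features are divided by kernel_scale first."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / kernel_scale ** 2)

def decision_value(x, support_vectors, labels, coefficients, w0, kernel_scale=8.0):
    """f(x) = sum_i a_i * y_i * K(x_i, x) + w0; the sign of f(x) gives the class."""
    return w0 + sum(
        a * y * gaussian_kernel(sv, x, kernel_scale)
        for sv, y, a in zip(support_vectors, labels, coefficients)
    )

# Hypothetical two-support-vector model, for illustration only.
support_vectors = [(0.0, 0.0), (8.0, 0.0)]
labels = [+1, -1]
coefficients = [1.0, 1.0]
print(decision_value((1.0, 0.0), support_vectors, labels, coefficients, w0=0.0))
```

A larger kernel scale makes the kernel flatter and the decision boundary smoother, which is why this parameter is the one tuned in the sensitivity analysis.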

3.2.3. k-Nearest Neighbor

The k-nearest neighbor classifier (k-NN) is a nonparametric, memory-based learning method, also referred to as instance-based or lazy learning, used for classification and regression [26]. In the classification procedure, a given pixel is classified by a plurality vote of its neighbors in the feature space. The most intuitive k-NN classifier is the 1-NN classifier, in which a given pixel is assigned to the class of its closest neighbor in the feature space, which can be described as $C^{1\mathrm{NN}}(x) = y_{(1)}$, where $y_{(1)}$ is the label of the training sample nearest to $x$. A useful technique that lets nearer neighbors contribute more than distant ones is to assign different weights to the neighbors. A common weighting scheme gives each neighbor a weight of $1/d$, where $d$ is the distance to the neighbor. The classification performance of k-NN can be significantly improved by metric learning, and diversity can be introduced into the k-NN classifier by using different subsets of features, different distance metrics, different values of $k$, etc. In the experiments, the Euclidean distance is selected, the number of neighbors is 100, and the distance weighting is set to equal in order to evaluate algorithm performance; the number of neighbors for KNN was then tuned from 1 to 300 in steps of 10.
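The plurality vote and the inverse-distance weighting described above can be sketched as follows. This is a minimal brute-force illustration with the Euclidean distance, not the classifier used in the experiments; the function name, the sample points, and the small floor on the distance (to avoid division by zero) are assumptions.

```python
import math
from collections import defaultdict

def knn_predict(query, samples, labels, k=3, weighted=False):
    """Classify query by a vote of its k nearest samples (Euclidean distance)."""
    neighbors = sorted(
        (math.dist(query, sample), label) for sample, label in zip(samples, labels)
    )[:k]
    votes = defaultdict(float)
    for distance, label in neighbors:
        # Equal weights by default; 1/d weights when weighted=True.
        votes[label] += 1.0 / max(distance, 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)

# Hypothetical two-dimensional feature vectors with class labels from Table 1.
samples = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["UW", "UW", "UB", "UB", "UB"]
query = (0.2, 0.2)
print(knn_predict(query, samples, labels, k=1))                 # nearest neighbor only
print(knn_predict(query, samples, labels, k=5))                 # equal weights
print(knn_predict(query, samples, labels, k=5, weighted=True))  # 1/d weights
```

With k equal to the full sample size and equal weights, the majority class wins regardless of distance, whereas the 1/d weighting lets the two nearby samples dominate; this illustrates why the result is sensitive to the number of neighbors.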

3.3. Comparisons and Assessment

Classification accuracy was evaluated by the confusion matrix as well as the overall accuracy and the Kappa coefficient [54]. With the help of the confusion matrix, the overall accuracy and Kappa coefficient can be calculated using

$$\mathrm{OA} = \frac{1}{N}\sum_{i=1}^{r} x_{ii}, \tag{14}$$

$$\mathrm{Kappa} = \frac{N\sum_{i=1}^{r} x_{ii} - \sum_{i=1}^{r} x_{i+}\,x_{+i}}{N^{2} - \sum_{i=1}^{r} x_{i+}\,x_{+i}}, \tag{15}$$

where $r$ is the number of classes, $N$ represents the total number of considered pixels, $x_{ii}$ are the diagonal elements of the confusion matrix, $x_{i+}$ represents the marginal sum of the rows in the confusion matrix, and $x_{+i}$ represents the marginal sum of the columns in the confusion matrix [55].
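Both accuracy measures follow directly from the confusion matrix, as the following short Python sketch shows. The function names and the 2 x 2 example matrix are hypothetical and do not correspond to any result in Table 5.

```python
def overall_accuracy(confusion):
    """Sum of the diagonal elements divided by the total number of samples."""
    total = sum(map(sum, confusion))
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total

def kappa_coefficient(confusion):
    """Kappa from the diagonal and the row/column marginal sums of the matrix."""
    total = sum(map(sum, confusion))
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(col) for col in zip(*confusion)]
    chance = sum(r * c for r, c in zip(row_sums, col_sums))
    return (total * correct - chance) / (total ** 2 - chance)

# Hypothetical confusion matrix (rows and columns in the same class order).
confusion = [
    [45, 5],
    [10, 40],
]
print(overall_accuracy(confusion), kappa_coefficient(confusion))
```

For this example the overall accuracy is 0.85 while Kappa is 0.70, because Kappa discounts the agreement expected by chance from the marginal sums.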

All experiments were performed on a high-performance computing system using the Portable Batch System (PBS), as shown in Figure 3. Each algorithm is programmed as a job that can be submitted to the cluster. Finally, the time consumption of all algorithms is compared.

4. Results and Discussion

4.1. LULC Mapping

On the basis of the segmentation results, spectral features and texture features are integrated and fused to generate multifeature images as inputs to all classifiers. In this research, the best segmentation scale is set to 105 for the GF-1 dataset and 210 for the GF-2 dataset through the estimation of scale parameter analysis. Land cover types in the study area include six categories, namely UB, UC, UR, UG, UW, and HS; more details can be found in Table 1. Training and testing samples were labeled by a remote sensing expert with the assistance of Google Earth. The labeled samples for training and testing are shown in Table 4. To test the sample sensitivity of the proposed process chain, limited and sufficient samples were selected from the GF-1 and GF-2 datasets, respectively. A 3-hold-out validation method was selected for accuracy assessment.

During the research procedure, the same labeled training and testing samples are used as inputs of the DT, SVM, KNN, and RF classifiers. The default parameters of the constructed DT, RF, SVM, and KNN classifier models are selected for GF-1 and GF-2 image processing. Overall accuracy, kappa coefficient of all experiments, and the best classification results for GF-1 and GF-2 are shown in Table 5 and Figures 4 and 5.

It can be seen that the OBIA-RF algorithm has the best classification results, with an overall accuracy of 99.29% and 98.98% for GF-1 and GF-2, respectively (Table 5 and Figures 4 and 5). In general, the classification accuracy of all categories of urban land surface cover reached 91%. The overall accuracy of the original RF classification is 94.67% for GF-1 and 91.26% for GF-2, and the Kappa coefficient is 0.93 and 0.89, respectively. The overall accuracy of the SVM classification method is 87.5% and 85.5%, and the Kappa coefficient is 0.85 and 0.83, for the GF-1 and GF-2 datasets, respectively. The overall accuracy of the DTC classification is the lowest, only 72.4% and 66.67% for GF-1 and GF-2.

An analysis of the per-class accuracy of the original RF and OBIA-RF classification results shows that the accuracies of UB, UR, and UC are relatively lower than those of the other classes, at 91.4%, 90.4%, and 92.2% for the GF-1 dataset, while the classes with the lowest accuracy for the GF-2 dataset are UR and UC, at 85.4% and 86.7%, respectively. When the OBIA method is combined with RF, the accuracies of these hard-to-identify classes are improved to 91.4%, 90.4%, and 92.2% for GF-1 data and 99.2% and 99.5% for GF-2 data.

When the pixel-based and object-based approaches are compared, the results demonstrate that the object-based approach outperforms the pixel-based approach, which is consistent with other research [37, 38]. Our research further demonstrates that, among the selected algorithms, OBIA is better suited to the classifier ensemble method than to state-of-the-art single classifiers, especially for the higher spatial resolution satellite data of GF-2 rather than GF-1.

4.2. General Discussion

Based on Table 5, the best accuracies are achieved by the OBIA-RF model for both the GF-1 and GF-2 datasets. These experiments show the superiority of OBIA-RF over the selected state-of-the-art methods in terms of classification accuracy. The OBIA-RF method achieves 4.62%, 11.89%, 11.79%, and 26.89% higher overall accuracy than RF, KNN, SVM, and DTC for the GF-1 dataset and 7.72%, 9.77%, 13.48%, and 32.31% higher overall accuracy than RF, KNN, SVM, and DTC for the GF-2 dataset.
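The reported gains are simple differences between overall accuracies, which the short sketch below reproduces. The OBIA-RF value for GF-1 is taken as the 99.29% stated in the abstract, and KNN is omitted because its overall accuracy is not quoted in the text; the dictionary layout and function name are illustrative assumptions.

```python
# Overall accuracies (%) quoted in the text and abstract.
reported_oa = {
    "GF-1": {"OBIA-RF": 99.29, "RF": 94.67, "SVM": 87.5, "DTC": 72.4},
    "GF-2": {"OBIA-RF": 98.98, "RF": 91.26, "SVM": 85.5, "DTC": 66.67},
}


def gains_over_baselines(scores, proposed="OBIA-RF"):
    """Difference in OA (percentage points) between the proposed method and each baseline."""
    return {
        name: round(scores[proposed] - value, 2)
        for name, value in scores.items()
        if name != proposed
    }


for dataset, scores in reported_oa.items():
    print(dataset, gains_over_baselines(scores))
```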

The DTC achieved the worst overall accuracy. Relative to the DTC, the SVM model improves classification accuracy by 15.1% and 18.83% at the pixel level and by 21.95% and 30.21% at the object level for GF-1 and GF-2, respectively. The RF model led to a further improvement, raising classification accuracy by 22.27% and 24.59% at the pixel level for GF-1 and GF-2. RF reduces the correlation between trees through random sampling of observations and features, which demonstrates the advantage of the classifier ensemble for classification of VHR remotely sensed data. The OBIA method led to a final improvement in classification accuracy of 4.62% and 7.72% in this study. This gain is especially valuable given the relatively high baseline of random forest performance. When the OBIA method is combined with a traditional machine learning model, especially a classifier ensemble that takes advantage of textural, spatial structure, and spectral information, classification results are improved.
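The two sources of randomness mentioned above can be illustrated with a minimal sketch. It is not the implementation used in this study: the function names are hypothetical, and the square-root subset size is only a common default. Note that in Breiman's random forest the feature subset is redrawn at every split rather than once per tree.

```python
import math
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples observation indices with replacement; also return the out-of-bag indices."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    in_bag_set = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in in_bag_set]
    return in_bag, out_of_bag


def random_feature_subset(n_features, rng):
    """Pick about sqrt(n_features) distinct candidate features for one split."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
in_bag, out_of_bag = bootstrap_indices(100, rng)
print(len(set(in_bag)), "distinct in-bag samples,", len(out_of_bag), "out-of-bag samples")
print("candidate features at one split:", random_feature_subset(16, rng))
```

Because each tree sees a different bootstrap sample and different candidate features, the errors of individual trees are less correlated, and the out-of-bag samples provide a built-in validation set.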

Furthermore, we investigate the sensitivity of the proposed model, as well as of the selected state-of-the-art machine learning models (DT, SVM, and KNN), to parameter choice. Figures 6 and 7 plot the OA as a function of the parameter of each model for the GF-1 and GF-2 datasets, with values given below as (pixel-based, OBIA) pairs. The results show that (a) the OA of the DTC model increases with the maximum number of splits, peaking at (290, 430) for the GF-1 dataset and (480, 250) for the GF-2 dataset; (b) the OA of the SVM model rises to a peak at a kernel scale of (0.5, 2.5) for the GF-1 dataset and (0.1, 4.1) for the GF-2 dataset and then declines significantly; (c) the OA of the KNN model peaks when the number of neighbors is (291, 71) for the GF-1 dataset and (221, 171) for the GF-2 dataset; (d) the accuracy of RF fluctuates continuously as the parameter changes, peaking when the number of trees is (70, 240) for the GF-1 dataset and (230, 450) for the GF-2 dataset; and (e) the OBIA method performs better than the pixel-based method for all selected models on both the GF-1 and GF-2 datasets.
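A sensitivity test of this kind amounts to sweeping one parameter over a grid and recording the OA at each value. The sketch below shows the pattern; `sweep_parameter` is a hypothetical helper, and `toy_oa` is a synthetic stand-in for "train the model with this parameter and return its OA", with an arbitrary peak that does not reproduce Figures 6 and 7.

```python
def sweep_parameter(values, evaluate):
    """Evaluate OA for each candidate value; return (best_value, best_oa, curve)."""
    curve = [(v, evaluate(v)) for v in values]
    best_value, best_oa = max(curve, key=lambda pair: pair[1])
    return best_value, best_oa, curve


def toy_oa(n_trees):
    """Synthetic OA curve with a single peak at 120 (illustration only)."""
    return 0.95 - ((n_trees - 120) / 1000) ** 2


best_value, best_oa, curve = sweep_parameter(range(10, 501, 10), toy_oa)
print(f"peak OA {best_oa:.4f} at parameter value {best_value}")
```

In practice, `evaluate` would train and test the classifier on the same labeled samples for every parameter value so that the curves of different models remain comparable.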

The sensitivity test on GF-1 and GF-2 shows that the selected models rank by OA as RF, DTC, SVM, and KNN. We also examine which features are most important for prediction. With the help of out-of-bag estimation, feature importance in the RF ensemble learning is calculated and shown in Figure 8 for the GF-1 dataset and Figure 9 for the GF-2 dataset. This ranking shows the contribution weight of different features to complex urban surface classification. Figure 8 demonstrates that the most important feature for GF-1 remotely sensed data interpretation is texture information (the mean value calculated using the gray-level cooccurrence matrix (GLCM)), the second most important feature is the standard deviation calculated using the GLCM, followed by the spatial feature ratio and then the spatial feature length/width. For the GF-2 dataset, the five top-ranked features for complex urban surface interpretation are spatial features (length, ratio) and spectral features (NDVI, NDWI, mean2). Considering the results for both datasets, a preliminary conclusion can be drawn that spatial and texture features play an important role in complex urban surface classification with high- and very high-resolution remotely sensed data.
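To make the two top-ranked GF-1 texture features concrete, the sketch below builds a normalized, symmetric GLCM for one pixel offset and derives its mean and standard deviation. It follows one common definition (mean as the sum of i·P(i, j)); the exact definition used by the software in this study may differ, and the function names and the 4 × 4 test image are illustrative assumptions.

```python
import math


def glcm(image, levels, dx=1, dy=0):
    """Normalized, symmetric gray-level co-occurrence matrix for one offset."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                counts[j][i] += 1  # count both directions for symmetry
    total = sum(sum(row) for row in counts)
    return [[v / total for v in row] for row in counts]


def glcm_mean_std(p):
    """GLCM mean and standard deviation taken over the row index i."""
    levels = len(p)
    mean = sum(i * p[i][j] for i in range(levels) for j in range(levels))
    var = sum((i - mean) ** 2 * p[i][j] for i in range(levels) for j in range(levels))
    return mean, math.sqrt(var)


# Small image with 4 gray levels (illustration only).
toy_image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
p = glcm(toy_image, levels=4)
mean, std = glcm_mean_std(p)
print(f"GLCM mean = {mean:.4f}, GLCM std = {std:.4f}")
```

In an object-based workflow these statistics would be computed per image object rather than over a whole image, so that each segment carries its own texture descriptors.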

The quality of segmentation directly affects subsequent classification. In this research, the multiscale segmentation method was selected, which has three user-defined parameters: scale, shape, and compactness. The scale parameter, which defines the average size of the image objects, is considered the parameter that most strongly affects segmentation quality, yet there is no universal rule for determining it. Following the literature [46], the ESP2 (estimation of scale parameter) tool is combined with visual interpretation to evaluate the optimal scale value and segmentation effect for the GF-1 and GF-2 remote sensing images. Based on the statistical results (Figure 10), the first peaks appear at 105 and 220 for GF-1 and GF-2, respectively, which are taken as the optimal segmentation scales of the GF-1 and GF-2 images in this study. The shape and compactness parameters have limited influence on the performance of OBIA, so they were held constant at 0.1 and 0.5, respectively, to produce the final segmentation map (Figure 10).
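The idea of reading the optimal scale from the first peak can be sketched as follows, assuming the usual ESP formulation in which the rate of change (ROC) of local variance (LV) between consecutive scale levels is computed in percent. This is not the ESP2 tool itself; the function names and the LV values are hypothetical.

```python
def rate_of_change(local_variance):
    """ROC of local variance, in percent, between consecutive scale levels."""
    return [
        (curr - prev) / prev * 100.0
        for prev, curr in zip(local_variance, local_variance[1:])
    ]


def first_peak_scale(scales, local_variance):
    """Scale at the first local maximum of the ROC-LV curve, or None if there is none."""
    roc = rate_of_change(local_variance)
    roc_scales = scales[1:]  # roc[k] describes the step into scales[k + 1]
    for k in range(1, len(roc) - 1):
        if roc[k] > roc[k - 1] and roc[k] > roc[k + 1]:
            return roc_scales[k]
    return None


# Hypothetical scale levels and local variance values (illustration only).
toy_scales = [10, 20, 30, 40, 50]
toy_lv = [10.0, 10.5, 11.55, 11.7, 11.8]
print("first ROC-LV peak at scale", first_peak_scale(toy_scales, toy_lv))
```

A peak in the ROC-LV curve marks a scale at which objects start to match real-world structures, which is why the candidate scale is still checked by visual interpretation.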

Finally, to evaluate more explicitly the practical speed of the proposed image classification chain compared with RF, SVM, and KNN, we consider empirical run times. To ensure a fair comparison in the speed test, all methods shared the same code. The algorithms were deployed on the Henan Polytech High Performance Computing Center server shown in Figure 3 (here, the number of nodes is 1, and the number of threads per CPU core per node is 24). By rough statistics, OBIA-RF spent 223.55 s, SVM spent 403.57 s, KNN spent 86.93 s, and DTC spent 0.64 s on average over the GF-1 and GF-2 datasets. The most time-consuming model is SVM, followed by OBIA-RF and KNN. The fastest model is DTC, which achieved less than 70% classification accuracy.
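Empirical run times of this kind can be collected with a small wall-clock timing helper such as the sketch below. `time_call` and `toy_workload` are hypothetical names; the workload merely stands in for training and applying a classifier and is unrelated to the timings reported above.

```python
import time
from statistics import mean


def time_call(fn, *args, repeats=3):
    """Mean wall-clock time in seconds of fn(*args) over several repeats."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        durations.append(time.perf_counter() - start)
    return mean(durations)


def toy_workload(n):
    """Stand-in for a classification run (illustration only)."""
    return sum(i * i for i in range(n))


# Average the per-dataset timings, as done for GF-1 and GF-2 in the text.
per_dataset = [time_call(toy_workload, 100_000), time_call(toy_workload, 200_000)]
print(f"mean run time over both runs: {mean(per_dataset):.4f} s")
```

Repeating each measurement and running all methods on the same machine reduces the influence of system load on the comparison.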

5. Conclusion and Future Work

In this paper, we have proposed a novel urban mapping process chain that takes advantage of both the OBIA and classifier ensemble methods. The novelty of this paper lies in the successful evaluation of OBIA-RF on the Chinese high-resolution GF-1 and GF-2 satellite image datasets. The performance of the OBIA-RF method has been compared with state-of-the-art models such as DTC, SVM, and KNN and has also been examined in terms of urban mapping accuracy, sensitivity to parameter selection, and time consumption. As the proposed process chain considered only spectral and GLCM texture features for image semantics, other features that might be useful for urban mapping, such as local indicators of spatial association, mathematical morphology profiles, and decomposition characteristics of fully polarimetric SAR data, will be considered in future research.

Data Availability

The very high-resolution remotely sensed data used in this research were provided by the Henan Data and Application Center of the High Resolution Earth Observation System under a contract with the National Defense Science and Technology Bureau of Henan. Unfortunately, we do not have permission to share the high-resolution satellite remotely sensed data. However, we can share the code used in this research. Researchers can test the algorithms using this code with their own datasets and repeat the experiments to obtain similar research conclusions. Researchers who are interested in this code can download it from https://pan.baidu.com/s/1hUyKWuoSzv2mrDx-NZ192w, using password ku0w, or contact the corresponding author.

Conflicts of Interest

The authors declare that they have no conflict of interest.

Authors’ Contributions

RM.H. did the conceptualization. P.L. did the methodology and the formal analysis. RM.H. and HW.Z. wrote the manuscript. RM.H. proposed the idea and wrote the original manuscript and the following revisions. P.L. provided the funding. RM.H., GY.W., and XL.W. executed all the experiments; XL.W., RM.H., and HW.Z. contributed to the revisions and provided valuable comments. Ruimei Han, Pei Liu, Guangyan Wang, Hanwei Zhang, and Xilong Wu contributed equally to this work.


Acknowledgments

We would also like to thank the data provider, the Henan Data and Application Center of the High Resolution Earth Observation System. This research was supported by grants from the National Natural Science Foundation of China (41601450), Natural Science Foundation of Hebei Province (grant number D20200409002), and Henan Key Technology R&D Projects (182102310860).