Abstract
Remote-sensing images play a crucial role in a wide range of applications and have been receiving significant attention. In recent years, great efforts have been made to develop various methods for the intelligent interpretation of remote-sensing images. Generally speaking, machine learning-based methods of remote-sensing image interpretation require a large number of labeled samples, yet there are still not enough annotated datasets in the field of remote sensing. Moreover, manual annotation of remote-sensing images is labor-intensive, requires expert knowledge, and yields results of relatively low accuracy. The goal of this paper is to propose a novel tile-level annotation method for remote-sensing images in order to obtain remote-sensing datasets that are well labeled and contain accurate semantic concepts. Firstly, we use a set of images with defined semantic concepts as the training set and divide them into several nonoverlapping regions. Secondly, the color features, texture features, and spatial features of each region are extracted, and discriminative features are obtained by a weight-optimized feature fusion method. Then, the features are quantized into visual words by applying a density-based clustering center selection method and an isolated feature point elimination method, so that the remote-sensing images can be represented by a series of visual words. Finally, the LDA model is used to calculate the probabilities of the semantic categories for each region. Experiments conducted on remote-sensing images demonstrate that our proposed method achieves good performance on remote-sensing image tile-level annotation. Our research makes it possible to obtain annotated datasets with accurate semantic concepts for the intelligent interpretation of remote-sensing images.
1. Introduction
As a widely used emerging technology, remote-sensing technology is closely connected with spatial geography, photoelectric information, and other disciplines and has become a relatively important part of modern science, as well as an important technical means to study the earth's resources and environment [1]. With the growing number of remote-sensing images, efficient content extraction and scene annotation that can help us quickly understand very large images are becoming increasingly necessary. Machine learning-based methods, which have achieved many improvements in many research fields, have been widely applied in remote-sensing image classification and object recognition [2]. However, machine learning-based methods require large amounts of manually annotated training data. Since the image data delivered by remote-sensing technology usually have a large size, extensive human effort is usually needed to annotate them manually, and due to cognitive differences, the manual annotation of remote-sensing images usually results in large errors. Hence, an effective annotation method is strongly required in remote-sensing applications. Automatic annotation of remote-sensing images automatically assigns semantic concepts, in the form of captions, to a remote-sensing image [3]. It can be applied to image retrieval systems to retrieve image data of specific content from databases.
The methods of automatic remote-sensing image annotation can be regarded as a type of multiclass image classification with predefined semantic labels. During the past decades, several models have been proposed for image annotation. Lu et al. proposed an automatic image annotation method based on an adaptive similarity graph model; their method considers both visual and textual information [4]. Wang et al. developed a novel probabilistic model for image annotation, which is based on an approximate inference and estimation algorithm [5]. Hu et al. presented an effective method to realize the image annotation task [6]. They apply both labeled images and unlabeled images to extract structural knowledge. Song et al. proposed a semisupervised image annotation method, which exploits the geometric relationships between training data [7]. Inspired by the ensemble of human annotations, Wu et al. proposed a new automatic image annotation model, dubbed diverse and distinct image annotation (D2IA), which creates semantically relevant, yet distinct and diverse tags [8]. Lin et al. proposed a block subimage annotation method as a replacement for full-image annotation [9]. Their method can reduce the cost of full-image annotation and generate high-quality annotated sample data for semantic segmentation. By exploiting dual-level semantic concepts, Zhu et al. proposed an end-to-end framework for object-level multilabel annotation of remote-sensing images [10]. By applying their method, a remote-sensing image can be assigned a single label at the scene level to depict the overall understanding of the scene and multiple labels at the object level to represent the major components. In [11], the method proposed by Huang et al. uses a multiscale feature fusion module, a channel spatial attention learning module, and a label relevance extraction module to achieve the annotation of remote-sensing images.
Other algorithms in machine learning have also been used to deal with image annotation. Liu et al. focused on the issue of multilabel learning with missing labels, where only partial labels are available, and proposed a new approach, namely, SVMMN, for image annotation [12]. Jiu et al. proposed an image annotation method based on multiple kernel support vector machines [13]. Bahrololoum et al. developed a multiexpert-based framework for automatic image annotation [14]. Their method integrates information from the feature space and the concept space. Inspired by clustering multitask learning, Deng et al. proposed a classification method to achieve image annotation with a high-performance multitask feature hash learning algorithm [15]. Verma and Jawahar proposed a 2-pass k-nearest neighbor method [16]. Their method uses both the image-to-label similarity and the image-to-image similarity. Ding et al. proposed a multi-instance multilabel learning method, which integrates the instance context and label context into a general framework [17]. Ivasic-Kos et al. developed a two-tier image annotation model, where the first tier corresponds to object-level annotation and the second tier to scene-level annotation [18]. Motivated by the fact that multiple concepts often co-occur in images, Feng and Bhanu proposed a semantic concept co-occurrence model, which enables the semantic description of complex scene images [19]. Yin et al. paid attention to the specific case in which images are both labeled with a category and annotated with free text and developed a supervised multimodal hierarchical semantic model [20]. Zamiri and Sadoghi Yazdi proposed a multiview robust spectral clustering (MVRSC) method, which models the relationship between the semantics and multiple features of training images based on the Maximum Correntropy Criterion [21]. Zhang et al. proposed a novel learning-to-rank approach to address the image autoannotation problem [22]. Their approach integrates learning-to-rank algorithms and nearest neighbor-based models and inherits their advantages. Feng and Lang developed a novel graph regularized low-rank feature mapping for image annotation under a semisupervised multilabel learning framework [23]. Their method is able to learn reliable labeled models with only a small number of accurately labeled training images.
There are also some studies that focus on innovation in the feature extraction procedure. To fully utilize the natural structure information of the image, Yuan et al. proposed a low-rank matrix regression model for feature extraction and feature selection [24]. Zhang et al. proposed a novel multiview multilabel sparse feature selection (MSFS) method, which exploits both view relations and label correlations to select discriminative features for image annotation [25]. Wan et al. presented a feature extraction algorithm named sparse fuzzy 2D discriminant local preserving projection (SF2DDLPP), which reduces the sensitivity to sparse data points through elastic net regression and enhances the robustness of feature extraction and recognition algorithms [26]. In order to annotate the image with context features, Mehmood et al. presented a novel image representation based on the weighted average of triangular histograms (WATH) of visual words, which can reduce overfitting problems on larger dictionary sizes and the semantic gap issue between high-level image semantics and low-level image features [27]. Zhang et al. proposed a new method of texture feature extraction based on the direction measure and a gray level co-occurrence matrix (GLCM) fusion algorithm, which applies the GLCM to extract the texture feature value of an image and integrates the weight factor introduced by the direction measure to obtain the final texture feature of the image [28]. Wang et al. proposed a dual low-rank regularized multilabel learning model under a graph regularized semisupervised learning framework, which can effectively capture the label correlations in the learned feature space and enforce the label matrix to be self-recovered in the label space as well [29]. Huang et al. presented a semantic-enhanced image and sentence matching model, which can improve the image representation by learning semantic concepts and organizing them in a correct semantic order [30].
Recently, a latent Dirichlet allocation (LDA)-based method was presented for remote-sensing image annotation [31]. Given the training set, LDA is employed to learn the parameters of the probability distribution of visual words for each class. Then, classification is performed based on the maximum-likelihood method, which assigns the image to the class that maximizes the likelihood. Wang et al. proposed a regularized latent Dirichlet allocation (rLDA) model [32]. In their approach, tag similarity and tag relevance are jointly estimated in an iterative manner so that they can benefit from each other, and the multiwise relationships among tags are explored. Nguyen et al. proposed the M3LDA method, which combines visual features and user-provided tags to annotate the image [33]. Wang et al. aimed to organize images in an unsupervised manner using latent features and used the LDA model to implement image annotation [34]. Ding and Li proposed a new approach to topic modeling, termed Vocabulary-Selection-Embedded Correspondence-LDA (VSEC-LDA), which learns the latent model while simultaneously selecting the most relevant words [35]. Zheng et al. proposed an improved multimodal data fusion-based latent Dirichlet allocation topic model (MMDF-LDA) to annotate images via fusing visual content, user-supplied tags, and geographic information [36].
To obtain annotated images, the above methods mainly use visual semantic features such as color features, texture features, and shape features. However, some categories in remote-sensing images are highly similar in color, texture, and shape, and a simple combination of visual semantic features cannot provide a highly discriminative feature representation of remote-sensing images, so highly similar regions are likely to be mislabeled. Therefore, it is difficult to introduce reliable semantic concepts into remote-sensing images. The LDA model clusters features into multiple visual words by using the K-means algorithm and uses these visual words to encode the features corresponding to different categories. The K-means algorithm uses random selection to initialize the clustering centers, and when the initial clustering centers are located in regions with low feature point density, the clustering process incurs high computational complexity and the generated visual words have poor generalization performance. In addition, redundant feature points can adversely impact the visual word generation results. To solve the above problems, we propose a novel method of tile-level annotation for remote-sensing images, which adaptively assigns different weights to different visual semantic features according to the characteristics of the categories, increasing the weight of important and well-discriminated features and decreasing the weight of minor and supplementary features. At the same time, we change the way the clustering centers are initialized in the visual word generation process to ensure that the initial clustering centers are selected in regions with high feature point density and to eliminate the interference of redundant feature points on the clustering results. Visual words with high generalization ability are thus obtained while the computational complexity is decreased.
For this paper, the main contributions are as follows:
(1) In the optimized feature extraction process, the relationship between feature weights and image representation is investigated, and a feature fusion method based on weight optimization is proposed. Our proposed method provides a better understanding of scene classes and generates discriminative features for further learning.
(2) In the visual word generation and representation process, the importance of initial clustering center selection is revealed, and an improved method is proposed in this paper, which not only reduces the time complexity of the visual word generation procedure and obtains expressive visual words but also suppresses the problem of incorrect annotation.
(3) Our proposed method can produce a large number of remote-sensing images that contain rich semantic concepts and provide accurately labeled training data for the intelligent interpretation of remote-sensing images. The effectiveness of our proposed method is tested on different datasets and compared with classical and existing algorithms.
The organization structure of the paper is as follows. Section 2 describes the proposed method. Section 3 provides some experimental results of the proposed method. Section 4 draws the conclusion of this paper.
2. Proposed Method
A novel method is proposed for remote-sensing image tile-level annotation in this paper. In the proposed method, the image can be annotated with user-given semantic concepts. Our proposed method mainly contains four stages: tile-level region generation, optimized feature extraction, visual word generation and representation, and graphical model generation together with image annotation. An overview of the proposed approach can be seen in Figure 1. First of all, the image is divided into several nonoverlapping regions. Secondly, the color features, texture features, and spatial features of each region of the remote-sensing image are obtained, and we concatenate them into a high-dimensional descriptor using the weight optimization method. Thirdly, in the process of visual word generation, we propose a new clustering center selection method and an isolated feature point elimination method to optimize the clustering process, so that each semantic category can be represented by a set of visual words. Finally, the LDA model is used to annotate the images and assign semantic concepts to them.

2.1. Tile-Level Region Generation
For a large image to be annotated with semantic concepts, we first divide the remote-sensing image into nonoverlapping rectangular regions of equal size. Thus, the annotation of the image can be viewed as the classification of each region into one of the predefined semantic classes. We employ the evenly sampled grid method [37] to segment each region into local patches of the same size; the patch size is a fixed parameter whose influence is discussed in Section 3.3.1. Then, each region can be represented by the features of its patches obtained with our proposed method.
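As an illustration of this step, a minimal Python sketch of the tiling and even grid sampling is given below; the function names and the tile and patch sizes in the example are illustrative only and are not the values used in our experiments.

import numpy as np

def split_into_tiles(image, tile_size):
    # Split an (H, W, C) image into nonoverlapping square tiles of equal size;
    # border areas that do not fill a whole tile are simply discarded here.
    h, w = image.shape[:2]
    tiles = []
    for top in range(0, h - tile_size + 1, tile_size):
        for left in range(0, w - tile_size + 1, tile_size):
            tiles.append(image[top:top + tile_size, left:left + tile_size])
    return tiles

def split_into_patches(tile, patch_size):
    # Evenly sampled grid of nonoverlapping local patches inside one tile.
    return split_into_tiles(tile, patch_size)

# Illustrative sizes only (not the values used in the experiments).
tiles = split_into_tiles(np.zeros((600, 600, 3), dtype=np.uint8), tile_size=100)
patches = split_into_patches(tiles[0], patch_size=20)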
2.2. Optimized Feature Extraction
In our proposed method, the color histogram is used to represent the color feature of the patches. For the color histogram, we follow the approach proposed by A. R. Smith and convert the image into HSV color space [38]. Hue and color were taken as invariants, and the saturation characteristics were taken as variables for the statistics [39]. The obtained histograms are converted into three-channel vectors, with the number of bins in each channel being 30, 32, and 32, respectively.
The other two regional features selected in our paper are texture features and spatial features, which can be represented by several descriptors, such as the standard deviation and pyramid histograms of oriented gradients (PHOG) [40]. The PHOG feature is inspired by the histogram of oriented gradients (HOG) and the image pyramid representation [41]. It can represent the shape and spatial distribution of local areas of the image. We select Gabor features as texture features and PHOG features as spatial features. In our paper, 40 kernel functions over 5 directions and 8 scales are selected to filter one channel of the region, and each pixel thus obtains a 40-dimensional Gabor feature. To avoid the effects of light intensity and shadow, the image is normalized with Gamma correction before PHOG feature extraction. By setting the level of the image pyramid, we can obtain PHOG features at different levels.
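As an illustration, the following Python sketch (using OpenCV) extracts an HSV color histogram with 30, 32, and 32 bins and builds a 40-kernel Gabor filter bank over 5 directions and 8 scales as described above; the kernel size, the wavelength progression, and the per-patch averaging of the Gabor responses are assumptions for illustration only.

import cv2
import numpy as np

def hsv_histogram(patch_bgr):
    # HSV color histogram with 30 hue bins and 32 bins each for saturation and value.
    hsv = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2HSV)
    parts = []
    for channel, bins, upper in zip(cv2.split(hsv), (30, 32, 32), (180, 256, 256)):
        hist = cv2.calcHist([channel], [0], None, [bins], [0, upper]).ravel()
        parts.append(hist / (hist.sum() + 1e-8))  # normalize each channel histogram
    return np.concatenate(parts)

def gabor_bank(ksize=15, scales=8, directions=5):
    # 40 Gabor kernels (8 scales x 5 directions); the wavelength progression is assumed.
    kernels = []
    for s in range(scales):
        lambd = 4.0 + 2.0 * s
        for d in range(directions):
            theta = d * np.pi / directions
            kernels.append(cv2.getGaborKernel((ksize, ksize), 0.56 * lambd, theta, lambd, 0.5, 0))
    return kernels

def gabor_features(patch_gray, kernels):
    # Mean response per kernel: a 40-dimensional texture descriptor for the patch
    # (a per-patch simplification of the per-pixel features described in the text).
    return np.array([cv2.filter2D(patch_gray, cv2.CV_32F, k).mean() for k in kernels])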
After the features are obtained, we propose a novel multifeature fusion method with weight optimization. General feature fusion methods simply concatenate the features to form a new feature vector. However, the linear concatenation of features is unable to reflect the variability between remote-sensing image regions, and the generated features will not be discriminative. Our method fully considers the correlation between remote-sensing image features and their corresponding regions. The features with a higher correlation to the corresponding regions are given a higher weight, while the features with a lower correlation are given a lower weight. The level of correlation is reflected by the stability of the features, which is measured by their standard deviation: a lower standard deviation indicates that a feature is more stable and more correlated with the region and therefore receives a higher weight. Suppose that the remote-sensing images contain a total of $C$ classes, that each class contains $N$ images, and that the dimension of the concatenated feature vector is $D$. The weight $w_{c,d}$ of the $d$-th feature dimension of class $c$ is determined by the standard deviation $\sigma_{c,d}$ of that dimension, so that dimensions with a lower standard deviation receive a higher weight, where
$$\sigma_{c,d} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_{i,d}^{c} - \bar{x}_{d}^{c}\right)^{2}},$$
with $x_{i,d}^{c}$ denoting the value of the $d$-th feature dimension of image $i$ in class $c$ and $\bar{x}_{d}^{c}$ the average value of the $d$-th feature dimension over class $c$. Finally, the weight of each feature dimension is obtained, and a weight matrix $W$ containing the weights of all $D$ feature dimensions is constructed.
Let the original high-dimensional feature vector of a region be $x$. The final fused feature of the region can be expressed as
$$\tilde{x} = W x,$$
where $W$ represents the weight matrix and $x$ represents the original high-dimensional feature vector. The above method can optimize the weights of the regional features and obtain a discriminative fused feature representation of remote-sensing image regions.
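A minimal Python sketch of this weight-optimized fusion is given below; the exact normalization (here, the inverse of the standard deviation normalized over all dimensions) is an assumption consistent with the rule that a lower standard deviation yields a higher weight, and the function names are illustrative.

import numpy as np

def class_weights(class_features, eps=1e-8):
    # Per-dimension weights for one class: inverse of the standard deviation,
    # normalized to sum to one (assumed realization of "lower std -> higher weight").
    # class_features: (N, D) array of concatenated feature vectors of one class.
    std = class_features.std(axis=0) + eps
    w = 1.0 / std
    return w / w.sum()

def fuse(feature_vector, weights):
    # Element-wise weighting, i.e. applying a diagonal weight matrix to the feature vector.
    return weights * feature_vector

# Example: 100 training vectors of one class, 64-dimensional concatenated features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))
fused = fuse(X[0], class_weights(X))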
2.3. Visual Word Generation and Representation
As the weight-optimized features are obtained, each patch can be represented by a high-dimensional descriptor. The descriptors of all patches are quantized into different kinds of visual words by K-means clustering. However, the K-means clustering algorithm is deficient in the selection of initial clustering centers and in the handling of isolated points, since it randomly selects the initial clustering centers for the first iteration. The Euclidean distance between the randomly selected initial clustering centers and the final clustering centers can be large, which not only leads to inaccuracy of the final visual words but also increases the time consumption. Meanwhile, a large number of outliers are easily generated in the clustering process. If these outlier points are taken into account in every iteration of the clustering process, the obtained clustering centers will deviate from the actual clustering centers and the obtained visual words will not be expressive. To address the above problems, we propose a density-based clustering center initialization method and an isolated point elimination method.
2.3.1. Density-Based Clustering Center Initialization
The purpose of clustering is to divide the set of feature points into $K$ clusters such that the sum of squared errors within the clusters is less than a predetermined threshold. The obtained clustering centers should have low similarity to each other and a high density of feature points around them; the initial clustering centers should have the same characteristics. In this paper, the Euclidean distance is adopted as the similarity measure between cluster centers in Euclidean space. Inspired by the density-based spatial clustering of applications with noise (DBSCAN) method [42], the density of feature points in the neighborhood of a cluster center is considered when selecting the initial cluster centers. The relevant definitions are as follows.
We define the Euclidean distance between two high-dimensional feature points $x_i$ and $x_j$ as
$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{D}\left(x_{i,k} - x_{j,k}\right)^{2}},$$
where $D$ is the feature dimension.
For a feature point set $X = \{x_1, x_2, \dots, x_n\}$ containing $n$ feature points, $M$ represents the Euclidean distance matrix between every two feature points in $X$:
$$M = \left[d(x_i, x_j)\right]_{n \times n}, \quad i, j = 1, 2, \dots, n.$$
The neighborhood radius $r$ can be defined as
$$r = \frac{1}{n}\sum_{i=1}^{n} m_i,$$
where $m_i$ denotes the smallest nonzero distance in the $i$-th row of the matrix $M$ and $i = 1, 2, \dots, n$. Thus, the number of feature points in the neighborhood of radius $r$ around a feature point $x_i$ (i.e., its density) can be represented as
$$\rho(x_i) = \sum_{j=1}^{n} f\left(d(x_i, x_j)\right).$$
When $d(x_i, x_j) \le r$, $f(d(x_i, x_j))$ is set to 1; when $d(x_i, x_j) > r$, it is set to 0. The minimum Euclidean distance between a feature point $x_i$ and the set of selected initial clustering centers $S$ can be defined as
$$D(x_i) = \min_{s \in S} d(x_i, s).$$
In Euclidean space, let $S$ be the set of feature points already selected as initial clustering centers and $U$ be the set of feature points that have not been selected. The weight of a feature point $x_i \in U$ for being selected as the next initial clustering center can be indicated as
$$w(x_i) = \rho(x_i) \cdot D(x_i).$$
Figure 2 shows an example of our proposed method. The red points in Figure 2 indicate the feature points, and the blue point indicates the currently selected clustering center. According to the definitions above, our proposed method first calculates the Euclidean distance matrix $M$ corresponding to the set of feature points $X$. Then, we calculate the neighborhood radius $r$ based on the matrix $M$, which is indicated by the orange line in Figure 2. After that, the density corresponding to each feature point is calculated, and the feature point with the highest density is selected to join the set of initial clustering centers $S$. Finally, the weights of the remaining feature points are calculated, and the feature points with the largest weight coefficients are continuously selected and added to $S$ until the number of feature points in $S$ reaches the number of clusters $K$. The density-based clustering center initialization algorithm is shown in Algorithm 1. In our paper, we use the approach described in [43] to determine the optimal number of clusters for visual word generation.

[Algorithm 1: Density-based clustering center initialization.]
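For reference, a minimal Python sketch of this initialization is given below; the concrete choices of the neighborhood radius (the mean nearest-neighbor distance) and of the selection weight (density multiplied by the distance to the already chosen centers) are our assumptions based on the description above, and the function name is illustrative.

import numpy as np
from scipy.spatial.distance import cdist

def init_centers_by_density(X, k):
    # X: (n, D) array of feature points. Returns the indices of k initial centers.
    M = cdist(X, X)                                   # pairwise Euclidean distances
    nearest = np.partition(M, 1, axis=1)[:, 1]        # nearest-neighbor distance per point
    radius = nearest.mean()                           # assumed neighborhood radius
    density = (M <= radius).sum(axis=1) - 1           # neighbors within the radius (excluding self)

    centers = [int(np.argmax(density))]               # first center: the highest-density point
    while len(centers) < k:
        d_to_centers = M[:, centers].min(axis=1)      # distance to the closest chosen center
        weight = density * d_to_centers               # assumed weight: dense AND far from centers
        weight[centers] = -np.inf                     # never re-pick an already chosen center
        centers.append(int(np.argmax(weight)))
    return centers

# Usage sketch: pass the chosen points to K-means as its initialization, e.g.
# from sklearn.cluster import KMeans
# km = KMeans(n_clusters=k, init=X[init_centers_by_density(X, k)], n_init=1).fit(X)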
2.3.2. Isolated Point Elimination
Isolated points in the clustering process are points that lie far from the clustering centers. In our proposed method, we select a deviation measure to identify such points and eliminate the isolated points before the start of the clustering process.
The sum of Euclidean distances between a feature point $x_i$ and the remaining feature points can be defined as
$$s(x_i) = \sum_{j=1, j \ne i}^{n} d(x_i, x_j).$$
The deviation of a feature point $x_i$ is then obtained by comparing $s(x_i)$ with the average of $s(\cdot)$ over all feature points: a point whose distance sum is much larger than the average lies far from the bulk of the feature points.
As shown in Algorithm 2, according to the above definitions, we propose an elimination algorithm for isolated points. The deviation of each feature point is calculated on the basis of the Euclidean distance matrix $M$ and the feature point set $X$. If the deviation of a feature point exceeds a preset threshold, the point is defined as an isolated point and eliminated from the feature point set $X$. After all the isolated points have been eliminated, a new feature point set is obtained.
[Algorithm 2: Isolated point elimination.]
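A minimal Python sketch of this elimination step is given below; the concrete deviation measure (the ratio of a point's distance sum to the average distance sum) and the threshold value are assumptions consistent with the description above rather than the exact formulation.

import numpy as np
from scipy.spatial.distance import cdist

def remove_isolated_points(X, threshold=1.5):
    # X: (n, D) array of feature points. A point is treated as isolated when the
    # sum of its distances to all other points deviates strongly from the average
    # sum; the ratio form and the threshold value are assumptions.
    M = cdist(X, X)
    s = M.sum(axis=1)                 # total distance from each point to the rest
    deviation = s / s.mean()          # assumed deviation measure
    keep = deviation <= threshold
    return X[keep], keep

# Usage sketch: cluster only the retained points.
# X_clean, mask = remove_isolated_points(X)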
Our proposed method is able to obtain discriminative features correlated with corresponding regions, select appropriate initial clustering centers in the visual word generation procedure, and eliminate the influence of isolated points on the clustering results. Then, each patch of the region is mapped to a certain visual word through the clustering process. Thus, each category of the image can be represented by a series of visual words.
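As an illustration of the mapping from patches to visual words, the following Python sketch assigns each patch descriptor to its nearest visual word and builds the per-tile word counts that serve as the "documents" in the next stage; the function names and array shapes are illustrative assumptions.

import numpy as np
from scipy.spatial.distance import cdist

def patches_to_visual_words(patch_descriptors, vocabulary):
    # Assign each patch descriptor (rows) to the index of its nearest visual word.
    return cdist(patch_descriptors, vocabulary).argmin(axis=1)

def tile_word_counts(word_ids, vocab_size):
    # Bag-of-visual-words count vector for one tile (its "document" for LDA).
    return np.bincount(word_ids, minlength=vocab_size)

# Example with a vocabulary of 300 visual words, as used in the experiments:
# words = patches_to_visual_words(descriptors, kmeans_centers)   # descriptors: (P, D)
# counts = tile_word_counts(words, vocab_size=300)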
2.4. Graphical Model Generation and Image Annotation
Once the representation of the images' regions as sequences of visual words is obtained, the next step is to learn a model for each class from its respective training set. LDA is a document-topic generative model, also known as a three-layer Bayesian probabilistic model of words, topics, and documents [44]. In this paper, since we apply techniques used in the text domain to images, we need to define an analogy between their respective terminologies as follows:
(1) A word corresponds to a segment of the image or a window of pixels (patch).
(2) A document corresponds to a region of the image (tile).
(3) A corpus is equivalent to an image.
The graphical model representation of LDA is shown in Figure 3, where $w$ is an observed word, $N$, $K$, and $M$ represent the numbers of words, topics, and documents, respectively, and $z$ represents the selected topic. We determine the number of topics of each class by computing the perplexity of a corpus [45]. Used by convention in language modeling, perplexity is a measure of the ability of a model to generalize to unseen data and is defined as the reciprocal geometric mean of the likelihood of a test corpus given the model [31]. A lower perplexity value indicates better generalization performance of the model. More formally, for a test corpus of $M$ documents $\{\mathbf{w}_1, \dots, \mathbf{w}_M\}$, the perplexity is
$$\mathrm{perplexity} = \exp\left(-\frac{\sum_{d=1}^{M}\log p(\mathbf{w}_d)}{\sum_{d=1}^{M}N_d}\right),$$
where $p(\mathbf{w}_d) = \prod_{n=1}^{N_d}\sum_{k=1}^{K}\theta_{d,k}\,p(w_{d,n}\mid z = k)$, $\theta_{d}$ represents the distribution of document $d$ over topics, $w_{d,n}$ represents the $n$-th word in document $d$, and $N_d$ represents the size of document $d$ (i.e., its number of words).
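To make this step concrete, a small Python sketch of perplexity-based topic number selection is given below, using scikit-learn's LatentDirichletAllocation (which, like our approach, relies on variational inference); the candidate topic numbers, the 80/20 split, and the function name are illustrative assumptions rather than the exact procedure of this paper.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def choose_topic_number(doc_word_counts, candidates=(10, 20, 40, 50, 80), seed=0):
    # Pick the topic number with the lowest held-out perplexity (illustrative sketch).
    rng = np.random.default_rng(seed)
    mask = rng.random(doc_word_counts.shape[0]) < 0.8          # simple 80/20 split
    train, test = doc_word_counts[mask], doc_word_counts[~mask]
    best_k, best_perp = None, np.inf
    for k in candidates:
        lda = LatentDirichletAllocation(n_components=k, random_state=seed).fit(train)
        perp = lda.perplexity(test)                            # lower = better generalization
        if perp < best_perp:
            best_k, best_perp = k, perp
    return best_k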

LDA generates a model for each document in the corpus through the following process:
(1) Choose a $K$-dimensional Dirichlet random variable $\theta \sim \mathrm{Dir}(\alpha)$, where $K$ is the number of topics.
(2) For each word $w_n$, choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$.
(3) Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, the multinomial probability of selecting a word given the topic $z_n$, with $\beta$ being a $K \times V$ matrix that expresses the probability of each topic generating each of the $V$ words.
The joint probability of a topic mixture $\theta$, a set of topics $\mathbf{z}$, and a set of words $\mathbf{w}$ can be expressed as
$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha)\prod_{n=1}^{N} p(z_n \mid \theta)\,p(w_n \mid z_n, \beta).$$
In learning, the goal is to find the corpus-level parameters $\alpha$ and $\beta$ such that the log likelihood of the entire database is maximized. We apply the variational expectation-maximization method [45] to ensure convergence of the iterative process. To estimate the category of each region with the learned values of $\alpha$ and $\beta$ for every class, we choose the category label with the highest likelihood:
$$c^{*} = \arg\max_{c}\, p(\mathbf{w}_R \mid \alpha_c, \beta_c),$$
where $\mathbf{w}_R$ denotes the visual words of the region and $\alpha_c$ and $\beta_c$ are the parameters learned for class $c$.
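The decision rule above can be sketched in Python as follows; scikit-learn's score() returns a variational approximation of the log likelihood, which is used here as a stand-in for the exact likelihood, and the data structures are illustrative assumptions.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def train_class_models(counts_by_class, topics_by_class, seed=0):
    # Fit one LDA model per semantic class on that class's tile/word-count matrix.
    return {c: LatentDirichletAllocation(n_components=topics_by_class[c],
                                         random_state=seed).fit(X)
            for c, X in counts_by_class.items()}

def annotate_tile(tile_counts, models):
    # Assign the class whose model gives the highest (approximate) log likelihood.
    x = tile_counts.reshape(1, -1)
    scores = {c: m.score(x) for c, m in models.items()}   # variational bound as a proxy
    return max(scores, key=scores.get)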
An annotated image can be obtained by combining all of the annotated regions. Therefore, given the training images with their annotations and the visual word representations of the training and test images, we can estimate the annotation of a test image.
3. Experimental Results
In this section, we first describe the datasets used for the experiments and the parameter settings of the proposed method. Second, we compare the results of annotation using single features with those using fused features on the basic LDA model and demonstrate the necessity of feature fusion. The performances of using weight-optimized fusion features (WOF features) and nonoptimized fusion features (NOF features) are also discussed. Then, we validate the improvement of the annotation results brought by the density-based clustering center selection method and the isolated point elimination method in the visual word generation process. Finally, the results obtained for tile-level annotation of remote-sensing images using three other methods are discussed.
3.1. Data Description and Parameter Setting
To evaluate the performance of the proposed method for remote-sensing image tile-level annotation, two datasets with different scenes and spatial resolutions are selected. The first dataset is a set of remote-sensing images from Google Earth with 17 m spatial resolution. It was captured over Japan and contains 105 images. There are four classes in the dataset: water, farmland, vegetation, and residential area. Figure 4 shows examples of such images. The regions of water, farmland, vegetation, and residential area are denoted by red, blue, green, and yellow, respectively. To ensure the reliability of the ground truth, we invited five different people to label the images separately. For each region, we counted the categories of all annotation results at the corresponding positions, and the category with the highest number of occurrences was adopted as the category of the region. The numbers of topics for water, farmland, vegetation, and residential area are 50, 20, 50, and 40, respectively. The second dataset is the large-scale classification set of the Gaofen image dataset (GID), which contains 150 pixel-level annotated GF-2 images belonging to 5 categories [46]. As shown in Figure 5, the five categories in this dataset are built-up, farmland, forest, meadow, and water. The numbers of topics corresponding to these categories are 80, 50, 50, 20, and 60, respectively. In our paper, the level of the image pyramid for PHOG is set to 1, and the total number of visual words is set to 300, which shows good performance in our empirical study. The region size is chosen to obtain the best efficiency of the algorithm. We randomly choose a portion of each class as the training set and the remainder as the testing set. The computer environment in which the algorithm runs is a 2.6 GHz Intel(R) Core(TM) i7-6700HQ CPU with 8 GB RAM. The system takes about 15 minutes to learn the LDA model.

[Figure 4: example images from the first dataset, panels (a)–(d)]
[Figure 5: example images from the GID dataset, panels (a)–(d)]
3.2. Evaluation of the Proposed Method
3.2.1. Evaluation Measures
In the evaluation of a machine learning classifier, it is generally necessary to compute the confusion matrix, i.e., an $n \times n$ matrix $[m_{ij}]$, where $n$ represents the number of categories in the classification task. Each column of the confusion matrix represents a predicted category, and the total of each column is the number of samples predicted to be in that category. Each row represents the true category of the samples, and the total of each row is the number of sample instances of that class. Accordingly, $m_{ij}$ denotes the number of regions of category $i$ classified as category $j$.
According to the confusion matrix, the evaluation measures used in our paper include user accuracy (UA) [47], producer accuracy (PA) [48], overall accuracy (OA) [48], and Kappa coefficient [48]. The definitions are as follows.
UA indicates the proportion of correctly classified samples among all samples identified as a given category. For a category $c$, the UA can be represented by
$$\mathrm{UA}_{c} = \frac{m_{cc}}{\sum_{i=1}^{n} m_{ic}}.$$
PA denotes the proportion of correctly classified samples with respect to the ground truth. For a category $c$, it can be calculated as follows:
$$\mathrm{PA}_{c} = \frac{m_{cc}}{\sum_{j=1}^{n} m_{cj}}.$$
OA describes the overall performance of the classification results:
$$\mathrm{OA} = \frac{\sum_{c=1}^{n} m_{cc}}{N},$$
where $N$ is the total number of samples.
However, OA has some limitations when the sample sizes vary widely between categories. Therefore, another evaluation measure, the Kappa coefficient, is introduced in this paper:
$$\mathrm{Kappa} = \frac{N\sum_{i=1}^{n} m_{ii} - \sum_{i=1}^{n} a_{i} b_{i}}{N^{2} - \sum_{i=1}^{n} a_{i} b_{i}},$$
where $n$ represents the number of categories, $N$ represents the total number of samples, $m_{ii}$ represents the diagonal element in row $i$ and column $i$ of the matrix, $a_{i}$ represents the sum of column $i$ of the matrix, and $b_{i}$ represents the sum of row $i$ of the matrix.
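The four measures can be computed directly from the confusion matrix defined above (rows as true classes, columns as predicted classes); the following Python sketch is illustrative.

import numpy as np

def annotation_metrics(cm):
    # UA, PA, OA and the Kappa coefficient from a confusion matrix
    # (rows = true categories, columns = predicted categories).
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    diag = np.diag(cm)
    ua = diag / cm.sum(axis=0)                    # user accuracy per class (column-wise)
    pa = diag / cm.sum(axis=1)                    # producer accuracy per class (row-wise)
    oa = diag.sum() / total                       # overall accuracy
    expected = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - expected) / (1.0 - expected)    # equivalent to the Kappa formula above
    return ua, pa, oa, kappa

# Example with a small 3-class matrix:
ua, pa, oa, kappa = annotation_metrics([[50, 2, 3], [4, 45, 1], [2, 3, 40]])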
3.2.2. Comparison with Basic Methods
We first compare the results of tile-level image annotation using NOF features and single features on the first dataset. As can be seen from Figures 6(c)–6(f), owing to the obvious distinctiveness of water and vegetation in color, these two classes obtain better annotation results when only color features are used. However, in the middle area of Figure 6(c), there are still some isolated areas that are identified as classes other than the vegetation category. In the annotation results at the top of Figure 6(c), there are also a large number of mislabeled areas. Annotation using only texture features results in a large number of noisy areas in the vegetation category, and a large number of regions belonging to the vegetation and residential area categories are misclassified in Figure 6(d). In Figure 6(e), some areas in the left part of the image are misidentified as the vegetation category. Using NOF features for annotation provides better results than using single features. However, in the upper-right corner of Figure 6(f), there is still a large number of residential areas wrongly annotated as the farmland category.

[Figure 6: annotation results of the basic methods and the proposed method, panels (a)–(h)]
To verify the improvement of annotation performance using WOF features, we calculate the variance of the extracted features and obtain the weight matrix based on the proposed method. The original features are transformed based on the weight matrix. Finally, the optimized fused features are used for tile-level annotation. As shown in Figure 6(g), it can be seen that the accuracy of the annotation result is improved by using WOF features. As the weights of the corresponding features of vegetation and residential area were adjusted in the feature fusion process, at the upper-right and lower-right corners of the image, the mislabeling of the corresponding areas was also reduced.
Figures 6(c)–6(g) show the annotation results obtained by using the K-means algorithm in the visual word generation process, and Figure 6(h) shows the annotation result using our proposed method. It can be seen that the isolated regions in the annotation results are significantly reduced. In the upper-right corner of the image in Figure 6(h), the residential area category is no longer wrongly identified as farmland. This is due to the appropriate selection of clustering centers and the reduction of the number of isolated feature points in the visual word generation process. Our proposed method can significantly reduce the number of isolated regions and mislabeled regions in the annotation results and improve the smoothness of the annotation results.
Tables 1 and 2 show the UA and PA values obtained using the different methods for annotation. For both the water and farmland categories, the UA values obtained using NOF features are higher than those obtained using single features. For the vegetation and residential area categories, because the weights of the texture and spatial features are not considered in the feature fusion process, the UA values are lower when using NOF features than when using color features only. When WOF features were employed and visual words were generated by the K-means algorithm, the UA values increased by 0.0016, 0.0024, 0.0398, and 0.0144 for the water, vegetation, farmland, and residential area categories, respectively, compared with using NOF features. The improvement in annotation performance is more significant for the farmland and residential area categories. The weight optimization in the feature fusion process enhances the flexibility of the features and ensures more accurate annotation results when labeling similar regions. Similarly, in the case of using K-means to generate visual words, the PA values of using WOF features all increased compared with using NOF features, by 0.0059, 0.0118, 0.0048, and 0.0286, respectively.
Our proposed method can obtain a more discriminative feature representation for different regions. More appropriate clustering centers are selected and a large number of isolated feature points are eliminated in the visual word generation process. In Tables 1 and 2, the UA values of all categories reach their highest point and the PA values of the vegetation and residential area categories also improve. Our proposed method can achieve higher annotation accuracy, especially for the residential area and farmland categories. The PA values of the water and farmland categories also remain above 0.9.
Table 3 shows the values of OA and the Kappa coefficient for annotation using the different methods. In Table 3, the OA of our method is 90.95% and the Kappa coefficient reaches 0.8293. Compared with the basic methods, our proposed method can achieve higher accuracy for remote-sensing tile-level annotation.
3.2.3. Comparison with Other Approaches
In order to verify the effectiveness of our proposed method, we compare its annotation results with those of three other methods, both qualitatively and quantitatively, on the two datasets. Figures 7 and 8 show the performance of the different approaches, and Table 4 shows the mean values of OA and the Kappa coefficient on the testing set using the four different methods.

[Figure 7: annotation results of the four methods, panels (a)–(f)]
[Figure 8: annotation results of the four methods, panels (a)–(f)]
The first method is the multilevel max-margin discriminative random field method proposed by Hu et al. [49]. The annotation results of this method are shown in Figures 7(c) and 8(c). It can be seen that there is considerable confusion among the classes. For instance, in Figure 7(c), a large number of water areas have been misclassified into the vegetation category. This is because the two categories are similar in color, which makes them difficult to distinguish in the annotation process. On the left side of the image, a large number of vegetation areas are misclassified into the residential area category. In Figure 8(c), in the upper-left area of the image, the water areas are also annotated with blue and yellow. In the lower and right parts of the image, some vegetation areas are also misclassified as water. Therefore, their method achieves only a low accuracy of remote-sensing image annotation.
The second method is an object-based LDA (OB-LDA) method [50]. The results of this method are shown in Figures 7(d) and 8(d). Compared with Figure 7(c), the annotation result in Figure 7(d) reaches a higher accuracy, and some areas in the vegetation and residential area categories are no longer incorrectly identified as the water category. Meanwhile, the water category on the right side of the image obtains better labeling results. However, their method wrongly identifies the river in the upper-right corner of the image and covers it with green. In Figure 7(d), there are still a large number of misclassified areas in the water category, but the number of areas of other categories misclassified as water is significantly reduced. A small number of isolated areas appear in the lower-right corner of the image and are mistaken for the residential area category. The vegetation category at the top of the image is also misclassified as residential area. As shown in Table 4, the average performance of the second method is 87.46%.
The third method is the LEGION + SVM method [51]. The results of this method are shown in Figures 7(e) and 8(e). Compared with the first two methods, the annotation results obtained by this method are smoother, and the results are better for the water category. However, in Figure 7(e), the river in the image is still identified as the vegetation category, and a small number of residential areas at the top of the image are wrongly identified as the vegetation category. In Figure 8(e), in the middle of the image, plenty of farmland areas are wrongly detected as residential areas. The accuracy of this method is shown in Table 4.
Figures 7(f) and 8(f) show the results of our proposed method. Our proposed method can achieve better annotation results for both the residential area and farmland categories. At the top of Figure 7(f), our method eliminates the isolated areas in the residential area category and obtains better accuracy. Similarly, in Figure 8(f), our method reduces the number of misclassified areas in farmland. Compared with the other three methods, our proposed method achieves the highest OA value and Kappa coefficient.
After the experiments on the first dataset, we evaluated the performance of our proposed method on the large-scale classification set of GID and compared the results with those of the three other methods. Figure 9 shows the annotation results of the different methods.

[Figure 9: annotation results of the four methods on the GID dataset, panels (a)–(f)]
As shown in Figure 9, from visual inspection, all the annotation results seem to be compact. We selected a region of 1500 × 1000 pixels to verify the performance of our proposed method. As shown in Figure 9(a), the image contains three categories: built-up, water, and farmland. In Figure 9(c), a portion of the water area in the upper part of the image is clearly misclassified as the farmland category. In Figures 9(c) and 9(d), obvious misclassification between built-up and farmland can be observed. This phenomenon can be explained by the fact that there exists a clear textural similarity between the farmland and the buildings in Figure 9(a). In Figure 9(e), a part of the water category in the lower-right corner is misidentified as the farmland category, due to the darker color of the farmland in this area, which is closer to the color of the water. In Figure 9(f), the obvious speckle noise and isolated areas are largely eliminated. Our proposed method can avoid the abovementioned misrecognition problems and achieves the highest accuracy. The OA values and Kappa coefficients of the four methods are shown in Table 5.
3.3. Sensitivity of the Parameters to the Annotation Accuracy
In this section, we analyze the impact of the parameters on the annotation results. A confusion matrix of actual and predicted classes is formed, comprising true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), to evaluate the sensitivity of the parameters to the annotation accuracy. The significance of these terms is given below.
Precision, which refers to the proportion of TP among the positive cases predicted by the classifier, can be represented as follows:
$$\mathrm{precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}.$$
Similarly, recall represents the proportion of TP among all actual positive cases, which is calculated as follows:
$$\mathrm{recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}.$$
Thus, the $F_{1}$ measure used to represent the annotation accuracy on the two datasets can be represented as follows:
$$F_{1} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$
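A minimal Python sketch of these measures is given below; the combination of precision and recall into the F1 measure follows the standard definition assumed here, and the counts in the example are illustrative.

def f1_score(tp, fp, fn):
    # Precision, recall, and their harmonic mean (the F1 measure used above).
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: 90 true positives, 10 false positives, 15 false negatives.
print(round(f1_score(90, 10, 15), 3))   # ~0.878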
3.3.1. Patch Size
A large patch size can capture more features of the image, but it may also introduce some redundant features. On the contrary, a small patch size refines the representation of the image, although it may ignore some essential features. To determine the proper patch size, we set three different patch sizes and test their influence on the annotation results on the two datasets. The size of each region was kept fixed and the number of visual words was set to 300.
As shown in Figure 10(a), the $F_{1}$ values of the water and vegetation categories are larger than 0.9, while those of the residential area and farmland categories are smaller, which is due to their more complex texture information. At the finally selected patch size, the $F_{1}$ value of the residential area category reaches its highest point. In Figure 10(b), the $F_{1}$ value shows no large variability. As shown in Figure 10, the patch size only slightly influences the annotation quality, and our proposed method reaches its best performance at the finally selected patch size.

[Figure 10: annotation accuracy for different patch sizes on the two datasets, panels (a) and (b)]
3.3.2. Number of Visual Words
We also investigated the effect of the number of visual words on the annotation results. Figure 11 shows the annotation quality on the two datasets for different numbers of visual words. In our experiments, we selected 50, 150, 300, and 450 visual words to evaluate their effect on the annotation results. The patch size and the size of each region were kept fixed. Increasing the number of visual words enhances the coding ability of the features, which yields a more discriminative image representation. However, an excessive number of visual words may result in redundancy in the feature encoding process. As shown in Figures 11(a) and 11(b), the $F_{1}$ value improves as the number of visual words increases and reaches its best result when the number of visual words is equal to 300. Therefore, we finally choose 300 as the number of clusters.

[Figure 11: annotation accuracy for different numbers of visual words on the two datasets, panels (a) and (b)]
3.3.3. The Introduction of Undefined Categories
In our experiments, except for the predefined categories, a small number of other categories also exist, such as roads and sandy areas. We ignored these categories in our previous experiments owing to their small proportion in the images. Here, we treat the categories of roads, sandy areas, and bare soil as an "undefined" category and introduce it into the training and evaluation process. We analyze the effect of introducing the "undefined" category on the annotation results on the first dataset. In these experiments, the number of visual words was set to 300, and the patch size and tile size were kept fixed. Part of the annotation results is shown in Figure 12, in which the areas labeled as the "undefined" category are colored black. Since the "undefined" category includes ground objects with large variability in color, texture, and shape, its features may be less discriminative, which not only prevents high annotation accuracy for this category but also affects the annotation accuracy of the other categories. In Figure 12, although the bare soil and sandy areas are correctly labeled as the "undefined" category, some of the water and vegetation areas are incorrectly labeled. Meanwhile, some sandy areas on the seashore are mistaken for residential areas because of the similarity in color between residential areas and sandy areas. In our experimental results, only a small portion of the road areas is correctly labeled. This is because the texture features of the road category are not obvious and the shape of roads is highly variable. When remote-sensing images are divided into tile-level regions, the continuity of the road category is cut off, making it difficult to extract shape features.

Figure 13 shows the variation of the $F_{1}$ value for each category after the introduction of the "undefined" category. As can be seen from Figure 13, after the introduction of the "undefined" category, the accuracy of the original categories is slightly reduced but remains at a high value, and the $F_{1}$ value of the "undefined" category nearly reaches 0.7. Our proposed method can obtain a good annotation result for undefined categories such as bare soil and sandy areas. However, the annotation result needs to be improved for categories with obvious shape and topological features, such as roads. In order to avoid such problems, we plan in future work to add an "undefined" class or to define semantic concepts that cover all the possible area types in the testing set.

4. Conclusions
In this paper, we focus on the tile-level annotation of remote-sensing images. Our proposed method can learn semantic representations and produce discriminative class labels. The annotation performance is improved by the weight optimization method, which considers the relationship between features and image representation. The density-based clustering center selection method is used in the visual word generation process to improve the expressiveness of the visual words, and meanwhile, the isolated feature points are eliminated. The experimental results on two datasets with quite different land covers have demonstrated its robustness and effectiveness, and a comparison with three other methods is also provided. It has been shown that our method achieves better performance for remote-sensing image tile-level annotation than the other methods. Our method can effectively reduce the cost of manual annotation, minimize the errors caused by the manual annotation process, and obtain remote-sensing images with accurate semantic concepts. However, the performance of our proposed method is not satisfactory for labeling classes outside the predefined categories, such as roads and deserts. In the future, we will concentrate on remote-sensing image tile-level annotation in the presence of unknown categories.
Data Availability
The first dataset used to support the findings of this study is available in Google Earth software. The second GID dataset used to support the findings of this study is from previously reported studies and datasets, which have been cited. The processed dataset is available at https://x-ytong.github.io/project/GID.html.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research was funded by State Key Laboratory of Geo-Information Engineering (no. SKLGIE2019-Z-3-2).