Abstract

This paper addresses the classification of large-scale very high resolution (VHR) remote sensing (RS) images in a semisupervised scenario, where the training set is limited (fewer than ten training samples per class). Typical pixel-based classification methods are unfeasible for large-scale VHR images. Thus, as a practical and efficient solution, we propose to subdivide the large image into a grid of tiles and then classify the tiles instead of classifying pixels. Our proposed method uses the power of a pretrained convolutional neural network (CNN) to first extract descriptive features from each tile. Next, a neural network classifier (composed of two fully connected layers) is trained in a semisupervised fashion and used to classify all remaining tiles in the image. This yields a coarse classification of the image, which is sufficient for many RS applications. Our second contribution is the use of semisupervised learning to improve the classification accuracy. We present a novel semisupervised approach that exploits both the spectral and the spatial relationships embedded in the remaining unlabelled tiles. In particular, we embed a spectral graph Laplacian in the hidden layer of the neural network, and we regularize the output labels using a spatial graph Laplacian and the random walker algorithm. Experimental results obtained by testing the method on two large-scale images acquired by the IKONOS-2 and GeoEye sensors reveal promising classification accuracy even with fewer than ten training samples per class.

1. Introduction

The intent of the classification process is to categorize all pixels in a remote sensing (RS) image into one of several land-cover classes. Most classification solutions have used RS images of small size (typically less than 1000 × 1000 pixels). However, the latest generation of satellite-based imaging sensors (Pleiades, Sentinel, etc.) acquires large volumes of Earth imagery with high spatial, spectral, and temporal resolution. This leads to images of very large size and creates new challenges for land-use classification algorithms. In this case, traditional pixel-based algorithms become unfeasible due to (1) increased computational and memory costs, (2) increased spectral variation within the same land-cover class, (3) increased variability in the data distributions over such a large-scale image, and (4) the need to collect a large set of training samples to properly model the underlying distribution of the data in the image. The latter issues are aggravated by the increased resolution, which heightens the effects of lighting conditions, the depth dimension, and sensor noise. As a result, the same land cover, and even the same object, can appear under different classes in different parts of the image. Thus, there is a great need to develop efficient solutions for classifying large-scale images.

One solution to this problem is to use object-based techniques. This can be accomplished by first segmenting the large image into regions and then performing object analysis on the regions. However, segmenting large images is, in itself, still a difficult problem [1]. Another solution is to divide the image into tiles and then perform a coarse classification of the image by classifying the tiles instead of pixels, as introduced in [2–4].

The problem of tile classification then becomes similar to object classification or recognition in computer vision applications. In those applications, many local descriptors, such as bag of words (BOW) [5], local binary patterns (LBP) [6], and histograms of oriented gradients (HOG) [7], with their invariance to geometric and photometric transformations, have proven effective, especially for object recognition. They can be extracted both in sparse and in dense ways. Recently, deep convolutional neural networks (CNNs) [8–10] have shown outstanding results in many computer vision applications such as image classification [11], object recognition [12], face recognition [13], medical image analysis [14], speech recognition [15], and traffic flow prediction [16]. Several CNN architectures have already been proposed and tested in computer vision tasks, and most of them have been implemented and made available online [17].

CNNs use images directly as input, and they automatically learn a hierarchy of features instead of relying on handcrafted features. This is accomplished by successively convolving the input image with learned filters to build up a hierarchy of feature maps. The hierarchical approach allows learning an increasingly complex and highly descriptive set of features. One disadvantage of CNNs is the high computational cost of the training phase due to the large number of weights to be estimated (often running into the millions). Another stringent requirement is the need for a huge number of training images [18]. One way to address this problem is to use CNNs that have been pretrained on huge auxiliary datasets and then transfer the knowledge embedded in them to extract highly descriptive features in the new domain. These highly descriptive features are then used to train another classifier (such as a smaller neural network) for the particular recognition task at hand.

In the remote sensing community, several works have introduced CNNs as a solution to classification and object detection problems [19–23]. For the scene classification problem, the reader is referred to the work by Cheng et al. [24] for a recent review of the state of the art in the area. As a sample of this work, Castelluccio et al. proposed a solution for the classification of RS scenes that starts with a pretrained CNN and then fine-tunes it on the target data in order to improve the accuracy [19]. Zhang et al. proposed a method called gradient boosting random convolutional network (GBRCN), which trains and fuses the results of several CNNs for the purpose of RS scene classification [22]. Chen et al. proposed a method to detect small objects such as vehicles in satellite images using a CNN to extract and learn rich features from the training data [21]. Marmanis et al. extracted feature representations using a pretrained CNN and then fed them into another CNN, which is trained in a supervised manner [20]. Another more recent work proposes a novel feature for scene classification called bag of convolutional features (BoCF) [23]. This is similar to the bag of words concept, except that the codebook is composed of convolutional feature vectors collected from the 5th layer of the VGGNet-16 CNN when it is applied to all training and testing scenes.

Another important issue in RS classification is that collecting a statistically significant and representative amount of training (labelled) samples is a tedious task. Thus, the RS community has introduced semisupervised learning (SSL) methods to tackle this problem [25–29]. SSL methods attempt to exploit the wealth of information provided by unlabelled data, besides the few available labelled data, to improve the performance of the classifier. SSL is based on the assumption that points within the same structure (such as a cluster or a manifold) are likely to have the same label [29]. We can use the large amount of unlabelled data to uncover such structure. For example, in RS images, it is reasonable to assume that samples are likely to have similar labels if they have close spectral information or if they are spatial neighbours. Many machine-learning researchers have found that unlabelled data, when used in conjunction with a small amount of labelled data, can produce considerable improvement in learning accuracy [30].

Semisupervised learning is natural for deep CNN architectures, since such architectures require huge amounts of training data, and in practice labelled data are either never available in sufficient quantity or quite costly to obtain. Nonlinear embedding algorithms are popular with shallow semisupervised learning techniques such as kernel methods, but they can be applied to deep multilayer architectures as well, as a regularizer either at the output layer or on the hidden layers of the architecture [31]. Graph-based techniques are one way to perform semisupervised learning. In this approach, one tries to preserve the structure present in the unlabelled data by building a graph where each data point is a node and the edges encode the similarity between data points. In the literature, the graph Laplacian has been embedded at the hidden layer and optimized jointly within the error function of the network [29]. Another approach is to use the graph Laplacian as a regularizer at the output layer, as in a Markov random field (MRF).

In this work, we present an effective solution for the classification of large-scale VHR RS images, where the number of labelled samples is limited (no more than ten tiles per class). To deal with the huge size of the VHR images, we subdivide the large image into square tiles and classify them instead of classifying pixels. This is our first contribution in this paper. Our second contribution is presenting a deep architecture for semisupervised classification of large-scale images. This deep architecture is composed of a pretrained CNN, for feature extraction, followed by two fully connected layers for classification. The proposed solution accomplishes semisupervised learning via simultaneous embedding of a graph Laplacian in the hidden layer followed by a spatial graph Laplacian after the output layer for regularization. The experimental results, carried out on two large-scale VHR images, have shown that embedding a graph Laplacian at the feature level and at the output level simultaneously provides significant improvement in the classification result.

The rest of the paper is organized as follows: in Section 2, we present the details of the proposed method. In Section 3, we present and discuss the experimental results. Finally, we present our concluding remarks and future work in Section 4.

2. Materials and Methods

The general overview of our proposed solution is shown in Figure 1. The first step in this solution is to divide the large image into a grid of tiles; the goal is to classify these tiles instead of pixels to produce a coarse tile-based classification of the RS image. Let $I$ be a large VHR RS image that is divided into a grid of $N$ tiles $\{T_i\}_{i=1}^{N}$, where $T_i$ represents a tile of size $d \times d$ pixels. The second step in the algorithm is feature extraction using one of the pretrained CNNs found in the literature. Let $X = \{x_i\}_{i=1}^{N}$ be the set of features extracted from each one of the tiles. Furthermore, let $\{(x_i, y_i)\}_{i=1}^{l}$ be a training set of $l$ labelled sample tiles, where $y_i \in \{1, \dots, K\}$ are the corresponding class labels.
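As an illustration of this first step, the following minimal NumPy sketch divides an image array into a grid of square tiles. The function name and the choice to drop leftover border pixels are our own assumptions (Section 3.2 notes that leftover pixels fall at the right and bottom sides of the image).

```python
import numpy as np

def image_to_tiles(image, tile_size=128):
    """Divide a large image (H x W x bands) into a grid of square tiles.

    Leftover pixels at the right and bottom edges (when H or W is not a
    multiple of tile_size) are discarded in this sketch.
    """
    H, W = image.shape[:2]
    rows, cols = H // tile_size, W // tile_size
    tiles = [
        image[r * tile_size:(r + 1) * tile_size,
              c * tile_size:(c + 1) * tile_size]
        for r in range(rows) for c in range(cols)
    ]
    return np.stack(tiles), (rows, cols)
```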

The classification step is carried out using a fully connected neural network (NN), as shown in Figure 1, with one hidden layer and an output (softmax) layer. As mentioned previously, this architecture allows us to implement semisupervised learning in different ways. We can embed the graph Laplacian at the hidden layer of the fully connected neural network (i.e., at the feature level). We can also embed a graph Laplacian at the output level, which implements what is called output regularization. Finally, we can implement a suitable combination thereof. Our final proposed solution, shown in Figure 1, embeds a graph Laplacian (spectral or spatial) in the hidden layer of the NN and adds an additional layer that performs regularization of the output labels. In the remainder of this section, we first explain how to use a pretrained CNN for feature extraction, then show how to perform classification using a fully connected neural network, and finally explain how to achieve semisupervised learning by embedding a graph Laplacian at the hidden layer of the neural network and another graph Laplacian that performs output regularization spatially.

2.1. Feature Extraction from Tiles

The second step in the proposed solution is to extract feature descriptors from the image tiles. Several types of handcrafted feature descriptors have been presented in the literature for image representation, such as bag of words (BOW) [5], local binary patterns (LBP) [6], and histograms of oriented gradients (HOG) [7]. More recently, however, the state-of-the-art approach has been to use feature descriptors that are learned instead of handcrafted. CNNs play a major role in this regard, as they have proven effective at learning features from images. However, since a CNN needs a huge amount of training data, one can resort to a pretrained CNN as a compromise solution. In this case, the pretrained CNN serves as a feature extraction tool rather than the classifier it was originally designed to be. This approach exploits the rich information embedded in pretrained CNNs to provide highly descriptive features for the tiles.

A CNN is composed of several layers that perform different functions such as convolution, pooling, nonlinear activation, and classification, as shown in Figure 1. The last layer is always a softmax layer, which is used for classification. Given this deep structure, one can extract features at different layers of the CNN. In this work, we take the output of the hidden layer just before the softmax layer as the feature vector representation. Thus, given a CNN with $L$ layers, we feed each tile image $T_i$ as input to the CNN and generate its feature representation from layer $L-1$:

$$x_i = f^{(L-1)}(T_i) \in \mathbb{R}^{d_f}, \quad (1)$$

where $d_f$ is the dimensionality of the feature vector.
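As a concrete sketch of this step, the snippet below uses torchvision's pretrained VGG16 as a stand-in for the imagenet-vgg-very-deep-16 model used in our experiments (the published models were distributed for other frameworks, so weights and preprocessing details may differ). Dropping the last classifier layer exposes the 4096-dimensional activations of the layer before the softmax.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Pretrained VGG16; removing the final 1000-way layer leaves the
# 4096-d hidden activation just before the softmax as the output.
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
model.classifier = nn.Sequential(*list(model.classifier.children())[:-1])
model.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),   # VGG16 expects 224 x 224 RGB input
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(tiles):
    """Map uint8 tile arrays (H x W x 3) to 4096-d descriptors."""
    batch = torch.stack([preprocess(t) for t in tiles])
    return model(batch)              # shape: (n_tiles, 4096)
```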

2.2. Classification via a Fully Connected Layer

Given a training set composed of feature vectors $x_i$ and their corresponding labels $y_i$, the hidden layer takes the input $x_i$ and maps it to another representation $h_i$ of dimension $m$ through a nonlinear activation function $g(\cdot)$ as follows:

$$h_i = g(W_1 x_i), \quad (2)$$

where $W_1$ is the mapping weight matrix. A typical choice of the activation function is the sigmoid function, that is, $g(a) = 1/(1 + e^{-a})$. For simplicity, we omit the bias vector in the expression, as it can be incorporated as an additional column vector in the mapping matrix; in that case, the feature vector is appended with the value 1.

The softmax layer performs multiclass classification and takes as input the resulting hidden representation $h_i$. It produces, as a result, an estimate of the posterior probability for each class label $k$:

$$P(y_i = k \mid h_i) = \frac{\exp(w_k^{T} h_i)}{\sum_{j=1}^{K} \exp(w_j^{T} h_i)}, \quad (3)$$

where $w_k$ are the weights of the softmax layer and the superscript $T$ refers to the transpose operation. To learn the weights representing the complete network structure, we minimize the training error on the labelled training data. The cost function is formulated as

$$E_L = \sum_{i \in \mathcal{L}} \ell(y_i, \hat{y}_i), \quad (4)$$

where $\mathcal{L}$ is the set of labelled data and $\ell(\cdot)$ is the cross-entropy loss, which measures the error between the actual network outputs and the desired outputs of the labelled source data. As the outputs of the network are probabilistic, we propose to maximize the log-posterior probability to learn the network weights, which is equivalent to minimizing the so-called cross-entropy error:

$$E_L = -\sum_{i \in \mathcal{L}} \sum_{k=1}^{K} \mathbb{1}\{y_i = k\} \ln P(y_i = k \mid h_i), \quad (5)$$

where $\mathbb{1}\{\cdot\}$ is an indicator function that takes the value 1 if its argument is true and 0 otherwise.
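The forward pass and cross-entropy loss above can be summarized in a short NumPy sketch; the variable names and shape conventions are our own, and the bias is assumed to be absorbed into the weights as described in the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(X, W1, W2):
    """Forward pass of the two-layer classifier.

    X  : (n, d)  CNN feature vectors (append a column of ones to absorb
                 the bias, as noted in the text)
    W1 : (d, m)  hidden-layer weights; W2 : (m, K) softmax weights
    """
    H = sigmoid(X @ W1)                           # hidden representation
    scores = H @ W2
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)             # class posteriors
    return H, P

def cross_entropy(P, y):
    """Mean cross-entropy between posteriors P (n, K) and labels y (n,)."""
    n = len(y)
    return -np.log(P[np.arange(n), y] + 1e-12).mean()
```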

2.3. Graph-Based Semisupervised Learning

The idea of graph-based semisupervised learning is to build a graph that connects similar sample data points to each other, and then use this graph to propagate estimated labels among similar data points. This follows directly from the assumption that similar data points should have similar labels, which the graph tries to encode.

Figure 2 illustrates how a Laplacian matrix is built using a simple example. Let $G = (V, E)$ be a graph with vertices $V$ and edges $E$. Let $e_{ij}$ denote an edge spanning two vertices $v_i$ and $v_j$, and let $w_{ij}$ be its associated weight. The degree of a vertex is $d_i = \sum_j w_{ij}$, taken over all edges incident on $v_i$. Then the tile-based Laplacian matrix, indexed by vertices $v_i$ and $v_j$, is given by

$$L_{ij} = \begin{cases} d_i & \text{if } i = j, \\ -w_{ij} & \text{if } v_i \text{ and } v_j \text{ are adjacent}, \\ 0 & \text{otherwise}. \end{cases} \quad (6)$$
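Given a symmetric weight matrix, building the Laplacian of (6) takes one line; here is a minimal NumPy version used by the sketches that follow.

```python
import numpy as np

def graph_laplacian(W):
    """L = D - W for a symmetric weight matrix W (n x n).

    D is the diagonal degree matrix with d_i = sum_j w_ij, so L_ii = d_i,
    L_ij = -w_ij for adjacent vertices, and 0 elsewhere.
    """
    return np.diag(W.sum(axis=1)) - W
```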

In our case, each tile image is associated with a vertex in the graph, while edges can be defined in several ways. The k-nearest-neighbour graph, for example, connects each vertex to the $k$ most similar vertices; the number $k$ is usually left as a parameter to be determined. Another approach is to connect each vertex to all vertices whose distance falls below a certain threshold value. A graph built this way on feature similarity is known as a spectral graph and is illustrated in Figure 2(d).

However, here the tiles are also spatially dependent. Thus, we can also treat adjacency as similarity, that is, we can make the assumption that neighbouring tiles are similar and should have similar labels. In this case, the number of connections per node is usually limited to two choices, 8 or 4, depending on whether we consider all 8 neighbouring tiles or only the 4 horizontal and vertical ones. This is known as a spatial graph and is illustrated in Figure 2(c).
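A spatial graph over a rows × cols grid of tiles can be built as follows; this helper is our own illustration (not from the paper) and assumes row-major tile indexing.

```python
import numpy as np

def spatial_grid_weights(rows, cols, connectivity=4):
    """Binary adjacency over a rows x cols grid of tiles.

    Tiles are indexed row-major (tile (r, c) -> r * cols + c). With
    connectivity=4 only horizontal/vertical neighbours are linked;
    connectivity=8 also links the diagonals.
    """
    n = rows * cols
    W = np.zeros((n, n))
    offsets = [(0, 1), (1, 0)]
    if connectivity == 8:
        offsets += [(1, 1), (1, -1)]
    for r in range(rows):
        for c in range(cols):
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    i, j = r * cols + c, rr * cols + cc
                    W[i, j] = W[j, i] = 1.0
    return W
```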

Finally, a weight $w_{ij}$ is associated with each edge in the graph. This weight is defined as the similarity between the image tiles corresponding to vertices $v_i$ and $v_j$. It can be calculated using many types of similarity or distance measures. A common choice for obtaining these weights is the Gaussian weighting function, that is,

$$w_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{\sigma^2}\right),$$

where $\sigma$ is a free parameter (usually set to 1).
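Putting the pieces together, a spectral graph with Gaussian weights and k-nearest-neighbour sparsification might look like the following sketch; the default k = 10 is an arbitrary placeholder, and the result can be passed directly to graph_laplacian above.

```python
import numpy as np

def spectral_knn_weights(X, k=10, sigma=1.0):
    """Gaussian-weighted k-nearest-neighbour graph over features X (n x d).

    w_ij = exp(-||x_i - x_j||^2 / sigma^2) when x_j is among the k most
    similar vertices to x_i (symmetrised), and 0 otherwise.
    """
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # pairwise sq. dists
    np.maximum(d2, 0.0, out=d2)          # guard against rounding errors
    W = np.exp(-d2 / sigma ** 2)
    np.fill_diagonal(W, 0.0)
    mask = np.zeros_like(W, dtype=bool)
    idx = np.argsort(-W, axis=1)[:, :k]                # k strongest per row
    np.put_along_axis(mask, idx, True, axis=1)
    W *= (mask | mask.T)                               # keep kNN, symmetrise
    return W
```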

2.3.1. Embedding Graph Laplacian in the Hidden Layer

The first approach to implement semisupervised learning is to embed a spectral or spatial graph Laplacian in the hidden layer. The fully connected neural network is supplied with both labelled and unlabelled data. Then, to learn the weights representing the complete network structure, we propose to simultaneously minimize (i) the training error on the labelled data and (ii) the energy of the graph Laplacian built on all unlabelled tiles. The proposed cost function is formulated as

$$E = E_L + \lambda E_G, \quad (7)$$

where $\lambda$ is a regularization parameter. The first term is the same as (4), while the second term is added for graph regularization. It minimizes the energy of the graph and can be written as

$$E_G = \frac{1}{2} \sum_{i,j} w_{ij} \|h_i - h_j\|^2. \quad (8)$$

It can be shown that (8) can be written in the matrix form

$$E_G = \mathrm{tr}(H^{T} L H), \quad (9)$$

where $H$ is the matrix whose rows are the hidden representations $h_i$ of the unlabelled samples and $L$ is the graph Laplacian.

By substituting (5) and (9) into (7), the total cost function is then given by

$$E = -\sum_{i \in \mathcal{L}} \sum_{k=1}^{K} \mathbb{1}\{y_i = k\} \ln P(y_i = k \mid h_i) + \lambda\, \mathrm{tr}(H^{T} L H). \quad (10)$$
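In code, the total cost of (10) is a straightforward combination of the two terms; the names and shapes below are illustrative.

```python
import numpy as np

def total_cost(P_l, y_l, H_u, L_u, lam):
    """Combined cost of (10): cross-entropy on labelled tiles plus
    lam * tr(H^T L H) on the unlabelled ones.

    P_l : (l, K) posteriors of labelled tiles; y_l : (l,) their labels
    H_u : (u, m) hidden representations of the unlabelled tiles
    L_u : (u, u) graph Laplacian over the unlabelled tiles
    lam : regularization parameter (0.1 in the experiments)
    """
    ce = -np.log(P_l[np.arange(len(y_l)), y_l] + 1e-12).sum()
    graph_energy = np.trace(H_u.T @ L_u @ H_u)
    return ce + lam * graph_energy
```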

2.3.2. Spatial Regularization of the Output Layer

Another approach to exploiting the graph Laplacian is as an output regularizer based on the random walker (RW) algorithm. This technique is similar to a Markov random field (MRF) approach; however, unlike the MRF, it yields a closed-form solution as opposed to an iterative one [32]. This regularization step takes a spatial graph Laplacian and posterior probability estimates as initial inputs. The RW algorithm then improves the classification result by minimizing an energy function of the graph Laplacian, defined as

$$E(p) = \frac{1}{2} \sum_{e_{ij} \in E} w_{ij} \|p_i - p_j\|^2 + \frac{\gamma}{2} \sum_{v_i \in V} \lambda_i \|p_i - p_i^{0}\|^2, \quad (11)$$

where $p_i$ and $p_j$ are now probability vectors (of dimension $K$) associated with vertices (tiles) $v_i$ and $v_j$, respectively, and $p_i^{0}$ is the initial estimate. The first term is related to image smoothness, since it computes the norm of the difference between neighbouring tiles. The second term is related to data fidelity, since it computes a norm between the initial estimation (provided by the fully connected NN) and the new label estimates. Finally, $\lambda_i$ and $\gamma$ are local and global weights, respectively, enforcing that fidelity. Moreover, the local weights $\lambda_i$ could be set individually per tile, but for simplicity, we suppose $\lambda_i = 1$ for all tiles in the image.

The set of vertices of the graph can be written as $V = V_L \cup V_U$, where $V_L$ and $V_U$ are the sets of labelled and unlabelled vertices, respectively. In this case, the Laplacian matrix can be written in block form as

$$L = \begin{bmatrix} L_L & B \\ B^{T} & L_U \end{bmatrix}.$$

After differentiating (11), setting the result to 0, and solving the matrix equation, we found that the probabilities associated with the unlabelled tiles are obtained by the following closed-form solution:

$$(L_U + \gamma I)\, P_U = \gamma P_U^{0} - B^{T} P_L, \quad (12)$$

where $P_U^{0}$ contains the initial estimated probabilities (the posterior probabilities of the fully connected NN classifier) associated with the unlabelled tiles and $P_L$ contains the known labels of the labelled tiles. Note that $L_U + \gamma I$ is a positive semidefinite matrix and $L$ is a sparse (Laplacian) matrix. Thus, the linear system of equations in (12) can be solved in linear time.
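A sketch of this closed-form solve using SciPy's sparse LU factorization is shown below, under the assumption that the block matrices follow the partitioning above; the default γ = 5 anticipates the value selected in the experiments.

```python
import numpy as np
from scipy.sparse import csc_matrix, identity
from scipy.sparse.linalg import splu

def rw_regularize(L_u, B, P_l, P0_u, gamma=5.0):
    """Closed-form random-walker regularization of the output labels.

    Solves (L_u + gamma * I) P_u = gamma * P0_u - B.T @ P_l, where L_u is
    the Laplacian block over unlabelled vertices, B the labelled/unlabelled
    coupling block (n_l x n_u), P_l the one-hot labels of labelled tiles,
    and P0_u the NN posteriors used as the data-fidelity prior.
    """
    n_u, K = P0_u.shape
    A = csc_matrix(L_u) + gamma * identity(n_u, format="csc")
    rhs = gamma * P0_u - B.T @ P_l       # (n_u, K) right-hand side
    lu = splu(A)                         # one sparse factorization
    return np.column_stack([lu.solve(rhs[:, k]) for k in range(K)])
```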

3. Results

3.1. Dataset Description

We tested the proposed solution on two large-scale VHR RS images. The first image, of size 6500 × 10400 pixels, was acquired over the city of Jeddah in Saudi Arabia by the IKONOS-2 sensor in July 2004 (see Figure 3(a)). The image has three spectral bands with a spatial resolution of 1 m. It has been divided into 3977 tiles of size 128 × 128, of which 1949 tiles are unlabelled, while 2028 are labelled into nine land-cover types: asphalt (73 tiles), bare soil (61 tiles), beach (60 tiles), cargo (174 tiles), dense residential (203 tiles), medium residential (193 tiles), sparse residential (47 tiles), trees (74 tiles), and water (1216 tiles).

The second image, of size 14000 × 14000 pixels, was acquired over the city of Riyadh by the GeoEye sensor in June 2010 (see Figure 4(a)). The image has three spectral bands with a spatial resolution of 0.5 m. It has been divided into 4900 tiles, of which 3860 are unlabelled, while 1040 are labelled into five land-cover types: bare soil (163 tiles), freeway (185 tiles), parking (109 tiles), residential (486 tiles), and sparse residential (97 tiles).

3.2. Experimental Setup

Both large images have been subdivided into a grid of equal-size tiles. The Jeddah image is divided into tiles of size 128 × 128 pixels, while the Riyadh image tiles are of size 196 × 196 pixels. We start the division from the top left corner, and any leftover pixels (in case the size of the large image is not a multiple of the tile size) appear at the right and bottom sides of the image. The ground truth classification maps are created manually by inspecting a random set of tiles (the test set) one by one and assigning one of the class labels to each tile in the test set. Testing maps for the Jeddah and Riyadh images are shown in Figures 5(b) and 6(b), respectively. For both images, we extract feature representations from all tiles, including HOG, LBP, BOW, and several pretrained CNN architectures. The set of pretrained CNNs includes the imagenet-vgg-very-deep-16 and imagenet-vgg-m models by the VGG group at the University of Oxford [33, 34], the GoogleNet model developed by a team from Google Inc. [35], and Microsoft's ResNet-152 model [36]. These CNNs were selected because they won first and second places in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2014 and 2015. Table 1 shows the size of each feature representation.

As for the assessment method, we use the overall accuracy (OA) and average accuracy (AA) measures. To assess the statistical significance of the improvements achieved by applying semisupervised learning, we computed McNemar's test, which is based on the standardized normal test statistic

$$Z_{ab} = \frac{f_{ab} - f_{ba}}{\sqrt{f_{ab} + f_{ba}}}, \quad (13)$$

where $Z_{ab}$ measures the pairwise statistical significance of the difference between the accuracies of the $a$th and $b$th classifiers. The variable $f_{ab}$ stands for the number of samples classified correctly and wrongly by the $a$th and $b$th classifier, respectively. Thus, $f_{ab}$ and $f_{ba}$ are the counts of the classified samples on which the two classifiers disagree. At the commonly used 5% level of significance, the improvement of classifier $b$ over classifier $a$ is said to be significant if $Z_{ab} < -1.96$. Finally, all experiments are repeated 10 times with different random training tiles, and the average accuracy values are presented.
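For reference, the statistic can be computed directly from the two classifiers' per-tile hit vectors. This helper is our own illustration and assumes the classifiers disagree on at least one sample (otherwise the denominator is zero).

```python
import numpy as np

def mcnemar_z(correct_a, correct_b):
    """Standardized McNemar statistic Z_ab between two classifiers.

    correct_a, correct_b : boolean arrays marking which test tiles each
    classifier labelled correctly. f_ab counts tiles classifier a got
    right and b got wrong; Z_ab < -1.96 means b improves significantly
    over a at the 5% level.
    """
    f_ab = np.sum(correct_a & ~correct_b)
    f_ba = np.sum(~correct_a & correct_b)
    return (f_ab - f_ba) / np.sqrt(f_ab + f_ba)
```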

3.3. Comparison of Various Pretrained CNNs and Finding the Best Hidden Layer Size

In this first experiment, we compare the performance of the feature representations obtained by the pretrained CNNs with those obtained by BOW, LBP, and HOG. We also investigate the sensitivity of the results with respect to the hidden layer size for all types of feature descriptors considered in this study. We train a fully connected NN with hidden layer sizes of 32, 64, 128, 256, 512, and 1024. We perform the experiment 10 times, each time selecting a different random set of training sample tiles per class, and average the accuracy results.

Figure 7 shows the overall accuracies (OA) achieved for each type of feature descriptor and for different hidden layer sizes. It is clear from this experiment that CNN features are superior to all other types of handcrafted features, regardless of the hidden layer size. This confirms results about CNNs reported in the object recognition literature.

Furthermore, considering CNN features only, the VGG very-deep-16 pretrained CNN in particular provided the best OA. We can also conclude that, in general, the range [64, 256] contains the best values for the hidden layer size. For the remainder of this paper, a hidden layer size of 256 and the VGG very-deep-16 pretrained CNN are used in all experiments.

3.4. Semisupervised Learning Results

In this set of experiments, we present the results of applying semisupervised learning via the graph Laplacian. We first implement a spatial graph Laplacian as an output regularizer based on the RW algorithm, without embedding any graph in the hidden layer. This step has a regularization parameter $\gamma$, as shown in (11). In Figure 5, we show the Z score achieved for different values of this regularization parameter. From these results, we learn that output regularization does not provide significant improvement for the Riyadh dataset, as the Z score is greater than −1.96.

However, the improvement for the Jeddah dataset is significant ($Z < -1.96$). We also learn that the best value for the parameter $\gamma$ is around 4 to 6; thus, we decided to fix this parameter to 5 for optimal performance of the method.

In the second experiment, we present the results of embedding a graph Laplacian in the hidden layer. We investigated embedding both a spectral graph Laplacian and a spatial graph Laplacian. The cost function to be optimized, shown in (10), has a regularization parameter $\lambda$. Figure 6 shows the Z score achieved for different values of this regularization parameter. From these results, we learn that embedding a spatial graph Laplacian in the hidden layer degrades the classification accuracy significantly, as the Z scores in Figures 6(c) and 6(d) are all above −1.96. On the other hand, embedding a spectral graph Laplacian in the hidden layer does provide significant improvement for both datasets, as the Z score is less than −1.96. As for the best value of $\lambda$, it is clear from Figures 6(a) and 6(b) that it lies around 0.08 to 0.1; thus, we decided to fix this parameter to 0.1 for optimal performance of the method.

Next, we present the results of the complete system shown in Figure 1. The results for the Jeddah and Riyadh images are shown in Tables 2 and 3, respectively. These tables show the OA, AA, and per-class accuracy for five different scenarios: (1) using the fully connected NN only, (2) using the NN followed by spatial regularization with the RW algorithm, (3) using the NN with a spatial graph Laplacian in the hidden layer, (4) using the NN with a spectral graph Laplacian in the hidden layer, and, finally, (5) using the NN with a spectral graph Laplacian in the hidden layer followed by spatial regularization with the RW algorithm, which is the complete solution proposed in this paper. These tables also show execution times in seconds, which include both training and testing times.

The results show clearly that semisupervised learning has improved the accuracy significantly for all types of CNN. However, in agreement with an earlier experiment, the best results achieved are with the vgg-very-deep-16 pretrained CNN, with an OA of 91.23% and an AA of 83.68%. Also, the semisupervised learning has significantly improved the accuracy of the NN as the Z score reached −4.67. Similarly for the Riyadh image, the results shown in Table 3 indicate an even better improvement induced by semisupervised learning with a Z score below −5 for all types of CNN. Again, the best results are achieved using the vgg-very-deep-16 pretrained CNN, with an OA of 95.26%, an AA of 92.91%, and a Z score of −5.40.

Finally, qualitative results are shown in Figures 3 and 4 for the Jeddah and Riyadh datasets, respectively. Here, we show the classification maps that achieved the highest OA after semisupervised learning among the 10 runs of the algorithm. In Figures 3(e) and 4(e), we show the tiles whose labels are corrected by semisupervised learning. Comparing Figure 3(d) to Figure 3(c), one can see the type of improvement produced by semisupervised learning, namely, the reduced noise, particularly in the water and residential classes. As for the Riyadh classification maps shown in Figures 4(c) and 4(d), the improvements are clearly visible in the freeway class, which becomes more recognizable as a road network.

In this final experiment, we compare the proposed method to a state-of-the-art method in the literature called the Laplacian extreme learning machine (LapELM) [37]. To make the comparison fairer, we replaced the features used in the original method [37] with the same CNN features used here. Thus, the two methods now use the same CNN features extracted with the vgg-very-deep-16 pretrained CNN. As for the other parameters of the LapELM method, we set the range for the regularization parameter C to [0.001, 1000] and that of the RBF kernel parameter γ to [0.01, 5]. To estimate the best parameter values in these ranges, we use the differential evolution technique, which outperforms grid search as shown in [37]. As for the proposed semisupervised method, we note that the batch size parameter is usually set to 100 in the literature. However, here, we should take care not to set the batch size larger than the total number of training samples (the number of samples per class × the number of classes). For this experiment, we set it equal to the number of samples per class times the number of classes.

We also study the sensitivity of the method to reduced training samples (less than 10). Figure 8 shows the OA for sample sizes from 2 to 10 for both methods being compared. Figure 8(a) shows the results for the Jeddah dataset, while Figure 8(b) shows the result for the Riyadh dataset. The first observation is that the classification accuracy for both methods degrades gracefully when the number of training samples per class decreases. However, the results confirm again the superiority of the proposed method compared to Laplacian ELM.

4. Conclusions and Future Work

In this paper, we proposed a practical and efficient solution to the semisupervised classification of large-scale VHR RS images. The proposed method is based on subdividing the image into square tiles and then extracting feature descriptors from these tiles using a pretrained CNN. These descriptors are in turn fed into a fully connected neural network trained with semisupervised learning, which exploits the structural information contained in the unlabelled tiles. Semisupervised learning is implemented in our proposed method in two ways: first, by embedding a graph Laplacian in the hidden layer and, second, by spatial regularization of the output layer using the random walker algorithm.

Experimental results have shown that our proposed method can provide coarse classification maps for very large RS images with high accuracy. Furthermore, we have shown that our proposed semisupervised learning that combines a graph Laplacian in the hidden layer and spatial regularization of the output layer always provides significant improvement in terms of the overall accuracy, while having reasonable computational times for such large VHR images.

As a future research direction, we could employ a self-paced selection strategy for unlabelled tiles, similar to the techniques used in [38, 39], to further improve the performance. One could also study different techniques for fusing the feature descriptors of multiple pretrained CNNs. Another direction is to research ways of combining data from different datasets in order to improve the accuracy of the learning algorithm.

Conflicts of Interest

The authors declare no conflict of interest.

Acknowledgments

The authors would like to extend their sincere appreciation to the Deanship of Scientific Research at King Saud University for funding this research group (no. RG-1435-050).