Abstract

Convolutional Neural Network- (CNN-) based land cover classification algorithms have recently been applied in the hyperspectral image (HSI) field. However, the large number of training parameters imposes a heavy computational burden on the CNN, and the spatial variability of spectral signatures leads to relatively low classification accuracy. In this paper, we propose a CNN-based classification framework that extracts square matrix representation-based spectral-spatial features and performs land cover classification. Numerical results on popular datasets show that our framework outperforms sparsity-based approaches such as the basic thresholding classifier with weighted least squares filtering (BTC-WLS) and other deep learning-based methods in terms of both classification accuracy and computational cost.

1. Introduction

Different from traditional images (e.g., RGB images), hyperspectral images (HSI) with hundreds of spectral channels have been widely applied in remote sensing [1]. Land cover classification is an important way to extract useful information from an HSI [2-4], where the task is to predict the type of land cover present at the location of each pixel. Several challenges are associated with the predictive analysis of HSI classification: the heavy computation resulting from large-scale training parameters and the large spatial variability of spectral signatures.

Recently, supervised classification has probably been the most active research area in hyperspectral data analysis. There is a vast literature on this topic using supervised machine-learning models, such as decision trees [5], random forests [6], and support vector machines (SVM) [7-9]. A random forest [10] is an ensemble learning approach that operates by constructing several decision trees during training and outputting the classes of the input hyperspectral pixels by integrating the predictions of the individual trees. The SVM-based methods [8, 11] generally follow a two-step approach: first, complex handcrafted features are computed from the raw input data; second, the obtained features are used to learn classifiers. However, these approaches are regarded as "shallow" models, and the invariance and abstractness of the features they extract are limited compared with "deep" ones. It is believed that, compared with "shallow" models, deep learning architectures are able to extract high-level, hierarchical, and abstract features, which are generally more robust in the presence of nonlinearities.

The Convolutional Neural Network (CNN) is regarded as an important branch of the deep learning family and is especially good at image classification [12]. If designed properly, a CNN provides a hierarchical description of the input data in terms of relevant and easily interpreted features at every layer in image categorization tasks. For example, W. Hu et al. [13] trained a simple one-dimensional (1D) five-layer CNN that directly classifies hyperspectral images in the spectral domain. D. Guidici et al. [14] carried out 1D CNN classification of land cover from multi-seasonal hyperspectral imagery, with the features extracted from the spectral domain through the training of the network. To avoid overfitting, S. Mei et al. [15] suggested a spectral-spatial-feature-based classification framework, which jointly makes use of batch normalization, dropout, a parametric rectified linear unit activation function, and a 1D CNN. However, the above-mentioned algorithms all feed the 1D vector column corresponding to each pixel into the CNN framework, which entails a huge number of parameters when dealing with hyperspectral data and hence burdens the training framework with a heavy computational load and considerable information redundancy.

To address the imbalance between classification accuracy and computation, this paper establishes the framework of the Fast Matrix Representation-Based Spectral-Spatial CNN (FMRSS Net), which uses a matrix representation of every pixel, obtained by format conversion, as the input feature fed to the proposed deep model. Compared with the vector-column input, this approach reduces the large-scale training parameters and thus decreases the computational burden. In addition, we explore the spatial context in the spectral domain to reduce the disturbance of interclass spectral variation. Furthermore, the learned features can be transferred to different data or tasks [16].

The rest of this paper is organized as follows. An introduction to the existing methods is briefly given in Section 2. The details of the proposed FMRSS Net are described in Section 3. The description of the datasets, network analysis, experimental results, and a comparison with state-of-the-art algorithms are provided in Section 4. Finally, Section 5 concludes this paper.

2. Existing Methods

In this section, two kinds of state-of-the-art HSI classification methods are described: deep learning-based methods and sparsity-based methods.

2.1. HSI Classification via Simple CNN

At a broad level, a CNN is a deep network topology that typically combines convolutional filter layers with a classification network, which in this work is a fully connected neural network (NN). Through the standard back-propagation training process, the convolutional filters are trained to capture salient structural features from the sequential input data. W. Hu et al. [13] refer to these structural features as the "intraclass appearance and shape variation" within spectra and applied them to HSI classification for the first time. The architecture of their proposed classifier contains five layers with weights: the input layer, the convolutional layer, the max pooling layer, the fully connected layer, and the output layer. They utilized single-season hyperspectral data and a simple 1D CNN across the full spectral dimension to classify land cover with 90 to 93% overall accuracy, and the CNN outperformed SVM by 1 to 3%. This strategy has some drawbacks. First, the proposed CNN classifies HSI only in the spectral domain, which ignores the spatial information and leads to low accuracy. Second, the number of training parameters is large, resulting in a huge computational burden.

2.2. HSI Classification via BTC-WLS

The basic thresholding classifier (BTC) is a lightweight sparsity-based algorithm for HSI classification proposed by M. A. Toksöz et al. [17]. BTC is derived from the basic thresholding algorithm, which can be considered one of the simplest techniques in compressed sensing theory [18, 19]. BTC is a pixelwise classifier that uses only the spectral features of a given test pixel. It performs the classification using a predetermined dictionary consisting of labeled training pixels and produces the class label and residual vector of the test pixel. In addition, the proposal is extended to a three-step spectral-spatial framework to improve the classification accuracy. First, every pixel of a given HSI is classified using BTC. The resulting residual vectors form a cube that can be interpreted as a stack of images representing residual maps. Second, each residual map is filtered using an averaging filter. Finally, the class label of each test pixel is determined by the minimal residual. In the spectral-spatial proposal, the same kind of filtering can also be performed with a weighted least squares (WLS) filter [20] to smooth the resulting residual maps, and this version is called BTC-WLS. The WLS filter is included because it does not cause halo artifacts at the edges of an image as the degree of smoothing increases. The proposal outperforms well-known SVM-based methods and sparsity-based greedy approaches, such as simultaneous orthogonal matching pursuit, in terms of both classification accuracy and computational cost. In the spectral-spatial case, although the BTC-WLS algorithm achieves the best results in terms of all metrics, it cannot distinguish small targets in the hyperspectral scene because training samples of the desired class are generally lacking.
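To make the pixelwise step concrete, the sketch below implements a simplified variant of basic thresholding with support size M = 1 per class; the function name and the reduction to M = 1 are our simplifying assumptions, as the full BTC of [17] selects larger supports.

```python
import numpy as np

def btc_m1(y, A, classes):
    """Simplified thresholding classifier with support size M = 1:
    for each class, pick the dictionary atom (l2-normalized training
    pixel) most correlated with the test pixel y, and compute the
    residual of projecting y onto that atom."""
    A = A / np.linalg.norm(A, axis=0)        # normalize dictionary columns
    corr = np.abs(A.T @ y)                   # correlation with every atom
    residuals = {}
    for c in np.unique(classes):
        in_c = classes == c
        a = A[:, in_c][:, np.argmax(corr[in_c])]   # best atom of class c
        residuals[c] = np.linalg.norm(y - (a @ y) * a)
    label = min(residuals, key=residuals.get)      # minimal-residual class
    return label, residuals
```

The class-wise residuals returned here are what the spectral-spatial extension stacks into residual maps before filtering.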

2.3. HSI Classification via SAE-LR

Chen et al. [21] employed a deep learning method to handle HSI classification for the first time, adopting a stacked autoencoder (SAE) followed by logistic regression (LR) to extract deep features from the HSI. Their method is optimized with mini-batch stochastic gradient descent, which derives the partial differentials of the cost function, and the weight-updating rule is redefined accordingly. A representative spectral pixel vector and the corresponding spatial vector, obtained by applying principal component analysis (PCA) to the hyperspectral data over the spectral dimension, are acquired separately from a local region and then jointly used as input to the SAE. While the SAE can extract deep features hierarchically in a layer-wise training fashion, the training samples composed of image patches have to be flattened to one dimension to meet the input requirement of such models. Unfortunately, SAEs are unsupervised and do not directly make use of the label information when learning the features.
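As a compressed sketch of the greedy layer-wise pretraining described above (full-batch gradient descent is used here for brevity, whereas [21] uses mini-batch stochastic gradient descent; the function name is ours):

```python
import torch
import torch.nn as nn

def pretrain_layer(x, n_hidden, epochs=500, lr=1e-2):
    """Unsupervised pretraining of one sigmoid autoencoder layer;
    returns the trained encoder and the encoded data for stacking."""
    enc = nn.Sequential(nn.Linear(x.shape[1], n_hidden), nn.Sigmoid())
    dec = nn.Linear(n_hidden, x.shape[1])
    opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(dec(enc(x)), x)  # reconstruction error
        loss.backward()
        opt.step()
    return enc, enc(x).detach()

# Stacking several such layers (cf. the 196-60-...-16 settings in
# Section 4.2) and appending a logistic-regression layer yields SAE-LR,
# which is then fine-tuned with the labels.
```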

3. HSI Classification via Proposed FMRSS Net

As compared with the simple CNN and BTC-WLS algorithms, we can expect our algorithm to exhibit the following advantages:
(1) It provides high classification accuracy.
(2) It reduces the large number of training parameters and decreases the computational cost.
(3) It enables us to incorporate spatial information.

In this context, we propose the FMRSS Net framework for HSI classification, which satisfies these properties and consists of four steps, as described briefly below.
(1) HSI preprocessing: this step carries out band screening, normalization, band sorting, and extraction.
(2) Joint spatial-spectral information processing: a mean filter is applied to the data of each band so that the spectral information of each pixel is correlated with that of its adjacent pixels.
(3) Format conversion: in this step, the one-dimensional (1D) vector column is converted to a square matrix representation, which reduces the number of training parameters of the CNN.
(4) Classification by a four-layer CNN: this classifier includes four learning layers: convolution layer 1, max pooling layer, convolution layer 2, and fully connected layer.

The flow chart of our FMRSS Net framework is shown in Figure 1.

3.1. Preprocessing for HSI

Suppose the original HSI includes T frequency bands, and each band corresponds to a two-dimensional (2D) image of size M × N. The noise bands and water absorption bands need to be removed from the T original bands in the beginning, resulting in U remaining bands.

After normalizing the data to [-1, 1], the remaining U bands are sorted by the energy of the 2D image corresponding to each band. In order to extract informative bands, the top P bands with the largest energy values are retained, where P is determined by the following formula:

P = ⌊√U⌋²,  (1)

where ⌊·⌋ indicates rounding to the nearest integer towards minus infinity.
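As a concrete illustration, the following Python sketch performs this band screening; defining a band's energy as the sum of its squared pixel values is our assumption, since the energy parameter is not spelled out above, and the function name is hypothetical.

```python
import numpy as np

def select_informative_bands(cube):
    """Keep the top P = floor(sqrt(U))**2 highest-energy bands of an
    HSI cube of shape (M, N, U) whose noise and water absorption
    bands have already been removed."""
    M, N, U = cube.shape
    # Assumed energy measure: sum of squared pixel values per band.
    energies = (cube.astype(np.float64) ** 2).sum(axis=(0, 1))
    P = int(np.floor(np.sqrt(U))) ** 2            # formula (1)
    keep = np.sort(np.argsort(energies)[-P:])     # P largest, original order
    return cube[:, :, keep]

# e.g., Indian Pines: U = 200 gives P = 14**2 = 196 retained bands.
```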

3.2. Joint Spectral-Spatial Processing

This processing applies a mean filter to the 2D image of each of the P bands. The selection of the filter template is related to the image size as well as the number of samples of every land cover type. For example, in the Indian Pines dataset, some classes have very few samples. For these classes, if the filter template is too small, the spectral information of the neighborhood cannot be fully exploited. On the other hand, if a bigger template is used, the image blur is enlarged. Therefore, choosing a suitable filter template is of great importance. Here, we adapt the size of the filter template to the experimental datasets.

For the P bands of hyperspectral data, the collected data of each band is a 2D image of size M × N. Let H = (1/a²)·J denote the mean filter template applied per band, where J is a matrix of size a × a and every value in the matrix is 1. For different datasets, the values M and N represent the length and width of the 2D image corresponding to each band, and the template size a is set by formula (2) as a function of the image size and the class sample counts, where n_i indicates the number of samples of the ith class, i = 1, 2, …, C, and C is the number of land cover types.
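A minimal sketch of the per-band mean filtering follows; the boundary treatment ("nearest" replication) is an assumption, as it is not specified above.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def smooth_bands(cube, a):
    """Apply the a x a mean filter (the all-ones template J scaled by
    1/a**2) to the 2D image of every band, so each pixel's spectrum is
    blended with those of its spatial neighbors."""
    out = np.empty(cube.shape, dtype=np.float64)
    for b in range(cube.shape[2]):                 # filter band by band
        out[:, :, b] = uniform_filter(cube[:, :, b].astype(np.float64),
                                      size=a, mode='nearest')
    return out
```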

3.3. Format Conversion

According to the parameter settings given in Section 3.4, the number of training parameters is related to the input data size, kernel size, and feature map size. Input data in the form of a 1D vector column result in larger kernels and feature maps, which brings a huge computational cost to the CNN. We therefore reduce the number of training parameters before the input data are fed into the network by a format conversion to a 2D matrix representation feature. The multiband data of each pixel undergo this format conversion, as shown in Figure 2. In the format conversion, the 1D array of length P corresponding to a single pixel is converted to a 2D square matrix of size √P × √P, filled column by column: the entries of the 1D array from position 1 to √P are placed in the first column of the square matrix, the next entries from √P + 1 to 2√P are placed in the second column, and so on. From (1), it can be inferred that √P is an integer, so the 1D array of length P exactly fills the √P × √P square matrix.
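In code, this conversion is a single column-major reshape; a sketch (with a hypothetical function name) is given below.

```python
import numpy as np

def to_square(spectrum):
    """Fold the length-P spectral vector of one pixel into a
    sqrt(P) x sqrt(P) matrix, filled column by column."""
    k = int(round(np.sqrt(spectrum.size)))
    assert k * k == spectrum.size, "P must be a perfect square; see (1)"
    return spectrum.reshape(k, k, order='F')   # 'F': fill columns first

# For P = 196, entries 1..14 form column 1, entries 15..28 column 2,
# and so on, yielding the 14 x 14 input of Figure 2.
```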

3.4. Architecture of the Proposed CNN

A four-learning-layer based CNN classifier is proposed to extract features and classify hyperspectral data, as shown in Figure 3.

The spectral data to be processed consist of a set of pixel vectors, which are divided into training samples and testing samples according to a given proportion. To be more concrete, we randomly select training samples according to the batch size (the number of training samples per training session) fed into the network and obtain the parameters of the classifier by training on these samples. Classification results are then obtained by feeding the testing samples into the trained network. The parameter settings of each layer are given in Table 1.
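For illustration only, the sketch below instantiates such a four-learning-layer classifier in PyTorch for a 14 × 14 input; the kernel counts and sizes here are our assumptions, not the values of Tables 1 and 3, which the reader should consult for the actual settings.

```python
import torch
import torch.nn as nn

class FMRSSNetSketch(nn.Module):
    """Conv1 -> max pool -> conv2 -> fully connected, as in Section 3.4.
    Layer widths are illustrative assumptions."""
    def __init__(self, num_classes=16):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=3)   # 14x14 -> 6 @ 12x12
        self.pool = nn.MaxPool2d(2)                   # -> 6 @ 6x6
        self.conv2 = nn.Conv2d(6, 12, kernel_size=3)  # -> 12 @ 4x4
        self.fc = nn.Linear(12 * 4 * 4, num_classes)

    def forward(self, x):                # x: (batch, 1, 14, 14)
        x = torch.sigmoid(self.conv1(x))
        x = self.pool(x)
        x = torch.sigmoid(self.conv2(x))
        return self.fc(x.flatten(1))

# Batches of matrix-converted pixels train against the class labels with
# a standard loss, e.g., nn.CrossEntropyLoss() on the returned logits.
```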

4. Experimental Results and Discussion

All programs are implemented with the deep learning toolbox [22] in MATLAB R2016a; the toolbox offers deep learning templates that allow researchers in remote sensing to tackle large-scale image classification. The experimental results are generated on a PC equipped with a 3.6 GHz Intel Core i7-7700 CPU and an Nvidia GTX 1060 6 GB graphics card.

4.1. Experimental Datasets Description and Parameter Settings

The experiments are performed on two popular HSI classification datasets: Indian Pines [23] and Salinas (http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes). Both datasets were captured by an airborne visible/infrared imaging spectrometer (AVIRIS) sensor in the 0.4-2.5 μm wavelength range.

4.1.1. Indian Pines Dataset

The classification methods are first applied to the Indian Pines dataset. Only 200 bands (U = 200) are adopted after the removal of the spectral bands affected by atmospheric absorption. By sorting the energy values, we finally select the P = 196 bands with the largest energy for evaluation, where ⌊√U⌋ takes the value 14. The dataset contains 145 × 145 (M = N = 145) pixels with a ground resolution of 20 m. Since the samples in the dataset are not evenly distributed over the 16 land cover types (C = 16), we randomly select 10% of the land cover pixels as training samples and the remaining 90% as testing samples, and we set the minimum number of training samples to 10 for each class. Table 2 shows each class of the experimental land cover and the corresponding numbers of training and testing samples; we get a total of 10198 samples with 1000 training samples and 9198 testing samples. In addition, each pixel must be scaled to [-1, 1]. The layer parameters of this dataset in the proposed four-layer CNN classifier are given in Parameter Settings for Indian Pines/Salinas (Table 3).
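A sketch of this per-class split follows; the function name and the use of a flattened ground-truth map are our assumptions.

```python
import numpy as np

def split_per_class(labels, ratio=0.10, min_train=10, seed=0):
    """Randomly take `ratio` of each class's labeled pixels for training
    (at least `min_train` per class); the rest become testing samples.
    `labels` is the flattened ground-truth map with 0 = unlabeled."""
    rng = np.random.default_rng(seed)
    train = []
    for c in np.unique(labels[labels > 0]):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n = max(int(round(ratio * idx.size)), min_train)
        train.extend(idx[:n])
    train = np.sort(np.asarray(train))
    test = np.setdiff1d(np.flatnonzero(labels > 0), train)
    return train, test
```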

4.1.2. Salinas Dataset

For the Salinas dataset, 20 noisy water absorption bands were discarded before the experiments, and only 204 bands (U = 204) remained for evaluation. By sorting the energy values, we discard some bands and finally select the P = 196 bands with the largest energy for evaluation, where ⌊√U⌋ again takes the value 14. The dataset contains 512 × 217 (M = 512 and N = 217) pixels with a ground resolution of 3.7 m and a total of 16 land cover types. We randomly pick up half of all ground-truth samples for the experiment, since using all samples would greatly increase the computation burden; we then randomly select 10% of them as training samples and the remaining 90% as testing samples, and we set the minimum number of training samples to 10 for each class. Table 4 shows each class of the experimental land cover and the corresponding numbers of training and testing samples. A total of 2700 training samples and 24349 testing samples are selected as the original data for the experiment. In addition, each pixel is normalized to [-1, 1]. The layer parameters of this dataset in the proposed classifier are also given in Parameter Settings for Indian Pines/Salinas.

4.2. Comparison with Other Classification Algorithms

The test accuracies of the FMRSS Net framework (FMRSS) on the Indian Pines and Salinas datasets are compared with sparsity-based and deep learning-based algorithms in Table 5. All the classifiers are trained on the same training set and tested on the same test set for a fair comparison. The filter template sizes for the Indian Pines and Salinas datasets are chosen according to (2). Our framework is compared with the deep learning-based SAE [22] and SAE-LR [21] and with the sparsity-based BTC-WLS [17]. For SAE, a 196-100-16 network structure is used, with 500 epochs of unsupervised training of the SAE and 10,000 epochs of supervised training of the entire classification network, using sigmoid as the activation function [22]. Note that, because deep learning-based methods may perform poorly when training samples are insufficient, in the comparison experiments the number of training samples for SAE-LR is set to 60% of all labeled pixels for the Indian Pines dataset, while, for the Salinas dataset, we randomly pick up half of all ground-truth samples and then randomly select 60% of them as training samples, 20% as validation samples, and the remaining 20% as testing samples. The SAE-LR networks are constructed as 196-60-60-60-60-60-16 for the Indian Pines dataset and 196-30-30-30-30-30-16 for the Salinas dataset. The experiments considering joint spectral-spatial information are carried out with 3 principal components, 5000 epochs of pretraining, and 100,000 epochs of fine-tuning [21]. The parameters of the BTC-WLS learning algorithm, as well as the number of output classes, are set to the values mentioned in [17], with the only exception that the testing set size for Salinas is set to half of that in the original paper. Among deep learning-based classifiers, the simple 1D CNN is implemented with the same architecture and hyperparameter values as in [13]. Both the proposed FMRSS and the 1D CNN are trained for 10,000 iterations with sigmoid as the activation function. Compared with the other algorithms, our proposal clearly achieves better performance on the two datasets in terms of the overall accuracy (OA) of classification. In addition, the classification maps produced by our framework for the Indian Pines and Salinas scenes are presented in Figures 4 and 5.

We present the confusion matrix of our framework for Indian Pines in Table 6, where the index numbers (1, 2, …, 16) represent the corresponding classes in Table 2. The cell in the ith row and jth column gives the percentage of ith-class samples (according to ground truth) that are classified to the jth class. For example, 99.77% of class 2 (Corn-notill) samples are classified correctly, but 0.16% of class 2 (Corn-notill) samples are wrongly classified to class 3 (Fallow). The percentages on the diagonal are the classification accuracies of the corresponding classes. Furthermore, this performance verifies that the proposed framework has the discriminative capability to extract subtle visual features, which is even superior to human vision in classifying complex curve shapes.
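The row-normalized percentages of Table 6 can be computed as sketched below; the labels are assumed to be integers 1 through 16, and the function name is ours.

```python
import numpy as np

def confusion_percent(y_true, y_pred, n_classes=16):
    """Cell (i, j): percentage of class-(i+1) ground-truth samples that
    were classified as class (j+1); the diagonal holds the per-class
    accuracies."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t - 1, p - 1] += 1          # count ground truth t predicted as p
    return 100.0 * cm / cm.sum(axis=1, keepdims=True)
```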

Compared with the BTC-WLS algorithm, the proposed framework has higher classification accuracy not only on the overall dataset but also on certain specific classes (classes 2, 3, 5, 10, 11, and 15) of the Indian Pines dataset, as shown in Figure 6. However, Figure 6 only shows accuracy in percent and ignores the number of testing samples. To illustrate the advantage of the proposed algorithm more clearly, Figure 7 compares the numbers of samples misclassified by FMRSS and by BTC-WLS. The statistical results indicate that the total number of samples misclassified by FMRSS (208) is far smaller than that by BTC-WLS (332).

4.3. Comparison of Different Parameter Settings

Table 7 shows the classification results under different filter templates on the two datasets. The results show that the template sizes selected by (2) are appropriate for the Indian Pines and Salinas datasets, respectively. For the Indian Pines dataset, there exist some classes (1, 7, 9, 16) with a low number of samples: when the filter template is too small, the spectral information of the neighborhood is not fully utilized, while an overly large template enlarges the image blur; the selected size is the most suitable filter template, utilizing the spatial and spectral information properly. Based on the above analysis, the method achieves the best classification accuracies of 97.74% and 99.29% on the two datasets, respectively.

The depth of the network is also an important factor, affecting the network structure and determining the quality of the extracted features. Table 8 tests the effect of the network depth on the classification results for the different datasets. The two-learning-layer network includes a convolution layer and a fully connected layer, while the three-learning-layer network includes a convolution layer, a max pooling layer, and a fully connected layer. The four-learning-layer network is the classifier proposed in Section 3.4. The experiments show the effectiveness of the proposed four-layer structure.

4.4. Analysis of Computational Cost

In addition, CNNs share weights, which significantly decreases the number of parameters to be trained in comparison with other deep approaches. However, the number of parameters is still high when one needs to deal with hyperspectral data. The proposed format conversion addresses this issue by converting the 1D vector column into a matrix representation, thus reducing the large-scale parameters and the network complexity. The unified numbers of input bands and output classes are set to 196 and 16. From the "Parameter Settings for Indian Pines/Salinas", the total number of training parameters of the proposed framework is obtained by summing the parameters of the two convolution layers and the fully connected layer.

Different from the proposed FMRSS Net, each input fed into the CNN of [13] is a 1D vector column, so the input layer has size 1 × 196. The first hidden convolutional layer C1 filters the input data with 20 one-dimensional kernels. The max pooling layer M2 is the second hidden layer and contains no trainable parameters. The fully connected layer F3 has 100 nodes, and the output layer has 16 nodes. Because F3 is fully connected to the flattened output of all 20 feature maps of M2, it dominates the parameter count, and the total number of training parameters in [13] is therefore far larger than that of the proposed framework.
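The gap can be made concrete with a small counting sketch; since the exact kernel sizes appear only in the tables, the values below (a length-21 1D kernel with pooling width 4, and the 2D layer sizes of the earlier sketch) are assumptions chosen purely to illustrate why the fully connected layer dominates.

```python
def conv_params(n_kernels, k_h, k_w, in_ch=1):
    return n_kernels * (in_ch * k_h * k_w + 1)   # weights + one bias each

def fc_params(n_in, n_out):
    return n_in * n_out + n_out

# 1D design of [13] on 196 bands (assumed k1 = 21, pooling width 4):
# C1: 20 maps of length 196 - 21 + 1 = 176; M2: 20 maps of length 44.
total_1d = conv_params(20, 1, 21) + fc_params(20 * 44, 100) + fc_params(100, 16)

# 2D design on a 14 x 14 input (assumed layers as in the earlier sketch):
total_2d = (conv_params(6, 3, 3) + conv_params(12, 3, 3, in_ch=6)
            + fc_params(12 * 4 * 4, 16))

print(total_1d, total_2d)   # F3's 20*44*100 weights dwarf everything else
```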

Table 9 compares the classification accuracy and computational cost on the two datasets to certify the effectiveness of the format conversion within our framework, contrasting the variant with 1D vector column input against that with 2D matrix representation input. Compared with the 1D vector column input, the 2D matrix representation reduces the number of parameters significantly (by 60772). In addition, whereas a 1D convolution kernel only extracts information from adjacent entries of each pixel vector, the proposed CNN model employs 2D convolution kernels that can exploit the neighborhood information within the matrix representation of each pixel. Moreover, the proposed CNN model with four learning layers can complete the layer-by-layer feature extraction of the image through multiple convolution layers. As a result, the proposed framework improves the classification accuracy by 7.30% and 6.54% on Indian Pines and Salinas, respectively.

5. Conclusions

In this paper, a CNN-based classification framework has been proposed to directly address the problems of the training-parameter burden and the spatial variability of spectral signatures in HSI classification. The matrix representation-based spectral-spatial feature learning and the extensive parameter sharing in the neural network help achieve superior classification performance at a faster speed than other popular methods on benchmark HSI classification datasets. Experimental results show that, compared with the BTC-WLS algorithm, the proposed framework achieves classification accuracy improvements of 1.35% on the Indian Pines dataset and 0.45% on the Salinas dataset. Likewise, compared with the SAE-LR algorithm, the framework improves OA by 6.39% on Indian Pines and 6.56% on Salinas. Future work will focus on adaptive filtering-parameter techniques and on semi-supervised algorithms combined with CNN.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

This research was funded by the National Natural Science Foundation of China (nos. 61102139 and 61572524) and the Fundamental Research Funds for the Central Universities of Central South University (2018zzts181). The authors would like to thank Deliang Xiang at Beijing Academy of Military Sciences (China) for greatly helping to improve the quality of this paper.