Abstract

Hyperspectral image (HSI) data possess inherent spectral characteristics that need to be deeply mined. A two-dimensional spectrum (2D-spectrum) convolutional neural network (CNN) model is proposed, exploiting the feature extraction capability of deep learning to classify HSIs. First, the traditional data processing methods, which use a small-area pixel block or a one-dimensional spectral vector as the input unit, introduce considerable heterogeneous noise. The 2D-spectrum image method is proposed to solve this problem and make full use of both spectral values and spatial information. Furthermore, a batch normalization (BN) algorithm is introduced to address the internal covariate shift caused by changes in the distribution of input data and to expedite network training. Finally, a Softmax loss model is proposed to induce competition among the outputs and improve the performance of the CNN model. The HSI datasets used in the experiments are Indian Pines, Salinas, Kennedy Space Center (KSC), and Botswana. Experimental results show that the overall accuracies of the 2D-spectrum CNN model reach 98.26%, 97.28%, 96.22%, and 93.64%, respectively, which are higher than the accuracies of the other methods described in this paper. The proposed model thus achieves high target classification accuracy and efficiency.

1. Introduction

Hyperspectral images (HSIs) are typically composed of hundreds of spectral data channels of the same scene. Through combined imaging and spectroscopy, HSIs provide data that are continuous in both space and spectrum. Hyperspectral data are important for monitoring the Earth's surface because the spectral information provided by hyperspectral sensors improves the ability to resolve target materials and thus increases classification accuracy [1].

Early on, scholars mainly used handcrafted image features for object identification and classification in remote sensing images, such as the local binary pattern, the histogram of oriented gradients [2], and Gabor filters [3]. However, such methods become ineffective as the dimensionality of hyperspectral data increases. Thus, feature extraction was combined with classifiers, yielding a satisfactory classification effect. Methods for feature extraction include principal component analysis (PCA) [4], independent component analysis (ICA) [5], linear discriminant analysis (LDA) [6], and robust PCA. Classifiers evolved from the fuzzy K-nearest neighbor algorithm [7], naive Bayes with deep feature weighting [8], and logistic regression [9] to the support vector machine (SVM) [10]. The SVM improves classification performance by extending the classification kernel [11]. However, these combinatorial methods have two significant limitations. (1) Feature extraction uses linear transformations to extract potentially useful features from the input data, whereas hyperspectral data are essentially nonlinear considering the complex light-scattering mechanism [12]. (2) Most traditional classification methods consider only single-layer processing, which reduces the capability for feature learning and is unsuitable for high-dimensional data.

Neural networks (NNs) with multiple layers and hidden nodes are more suitable than shallow classifiers, such as the SVM, for building an HSI data model [13]. NNs including the multilayer perceptron [14] and radial basis function networks [15] have been studied for classifying remote sensing data. Researchers have proposed a semisupervised NN framework for large-scale HSI classification [16]. Various deep NNs (DNNs) have been developed according to system architecture and activation functions; these networks include the deep belief network (DBN) [17], the deep Boltzmann machine [18], and the autoencoder (AE) [19]. In 2014, a stacked AE (SAE) was used for HSI classification [20]. An improved AE based on sparsity constraints was then proposed [21]. The DBN is another DNN model, applied to HSI classification in 2015 [22]. These deep models can extract robust features and are superior to other methods in terms of classification accuracy.

The convolutional NN (CNN) [23] uses local receptive fields to efficiently extract spatial information and shares weights to significantly reduce the number of parameters. CNNs have been used to extract the spatial-spectral features of hyperspectral images for classification [24], and their performance is better than that of traditional classifiers such as the SVM. In addition, a virtual sample enhancement method for limited labeled samples was proposed in [25]. A previous study proposed greedy layerwise unsupervised pretraining to form a CNN model [26]. However, the application of CNNs to hyperspectral classification remains imperfect, and several shortcomings, such as easy saturation of the training gradient, low classification accuracy, and poor model generalization, should be addressed.

The spectral values of an HSI along the third dimension are approximately continuous, and each class possesses a unique spectral curve that differs from those of other classes. In traditional classification methods, one-dimensional spectral vectors are used as the input data [27, 28], or neighboring pixels are combined to form small regional pixel blocks as input data [29, 30]. Although the former simplifies the complexity of training a deep network, it discards the spatial-dimension information of the spectral values. The latter combines multiple pixels into one sample, which introduces heterogeneous noise and aggravates the scarcity of labeled hyperspectral data.

Compared with traditional CNN methods, this study designs a 2D-spectrum CNN model as follows:
(i) Hyperspectral pixels have rich spectral information, whereas the traditional data processing methods that use a small-area pixel block or a one-dimensional spectral vector as the input unit introduce considerable heterogeneous noise. In this paper, we convert the spectral value vector into a 2D-spectrum image, so that the optimization of all CNN model parameters (including the BN parameters) is based on both the spectral values of the pixel and the spectral spatial information. Spectral spatial information is thus fully extracted while heterogeneous noise is avoided. In addition, a multilevel BN algorithm is realized for the first time, and its network acceleration effect is obvious.
(ii) A BN algorithm is introduced to reduce the vanishing gradient problem and accelerate the training of the DNN by reducing the dependence on parameter scaling and initialization. Liu et al. [30] applied the BN algorithm to a CNN for HSI classification, with a small-area pixel block selected as the input unit. However, the introduction of heterogeneous noise and the waste of scarce samples weaken the role of the BN algorithm in network regularization and accelerated training.
(iii) A Softmax loss model is used instead of the combination of Softmax regression and multinomial logistic loss models; thus, the outputs of the last layer compete with one another to improve the classification accuracy. The experimental results show that the proposed CNN-based HSI classification model exhibits high accuracy and efficiency on the HSI datasets.

2. CNN-Based Classification Model

With the rapid development of modern neuroscience, researchers have found that the human visual system effectively solves the problems of classification, detection, and identification. This finding motivates researchers to draw on biological visual systems to establish advanced data processing methods [31]. Cells in the visual cortex are susceptible only to small regions, and their receptive fields exploit the local spatial correlation in an image.

The CNN architecture uses two special methods, namely, local receptive fields and shared weights. The activation value of each convolution neuron is calculated by multiplying the local input with a weight $w$ that is shared across the entire input space (Figure 1). Neurons that belong to the same layer share the same weights. The use of these specific architectural features, local receptive fields and shared weights, reduces the total number of training parameters and facilitates the development of an efficient training model.

The complete CNN architecture consists of convolution and pooling layers. The convolution layer alternates with the pooling layer, thereby mimicking the properties of complex and simple cells in the mammalian visual cortex [32]. In the CNN, the input data are a matrix or tensor with a 3D spatial structure, $x \in \mathbb{R}^{H \times W \times D}$, where $(H, W)$, $(H', W')$, and $(H'', W'')$ represent the spatial sizes of the input data, convolution kernel, and output data, respectively, and $D$ represents the number of feature channels of the convolution kernel. Accordingly,

$$x \in \mathbb{R}^{H \times W \times D}, \quad w \in \mathbb{R}^{H' \times W' \times D}, \quad y \in \mathbb{R}^{H'' \times W''},$$

where $x$ is the input data, $w$ is the convolution filter, and $y$ is the output data. The signal $x$ is convoluted by filter $w$ to calculate signal $y$ as follows:

$$y_{i''j''} = b + \sum_{i'=1}^{H'} \sum_{j'=1}^{W'} \sum_{d=1}^{D} w_{i'j'd} \, x_{i''+i'-1,\; j''+j'-1,\; d},$$

where $b$ is the neuron offset and $w$ is the convolution kernel matrix.
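To make the indexing above concrete, the following is a minimal NumPy sketch of this convolution (valid padding, unit stride); the function name and test shapes are illustrative assumptions, not part of the model described in this paper:

```python
import numpy as np

def conv3d_valid(x, w, b):
    """Convolve input x (H x W x D) with one filter w (H' x W' x D)
    plus bias b; valid padding, unit stride. Returns y (H'' x W'')."""
    H, W, D = x.shape
    Hf, Wf, _ = w.shape
    Ho, Wo = H - Hf + 1, W - Wf + 1   # H'' = H - H' + 1, W'' = W - W' + 1
    y = np.empty((Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            # sum over the local receptive field and all D channels
            y[i, j] = b + np.sum(w * x[i:i + Hf, j:j + Wf, :])
    return y

x = np.random.rand(15, 15, 1)          # e.g., a 15 x 15 single-channel input
w = np.random.rand(3, 3, 1)            # a 3 x 3 kernel
print(conv3d_valid(x, w, 0.1).shape)   # (13, 13)
```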

The pooling layer is typically obtained after the convolution layer. The most common pooling function is max pooling. This function calculates the maximum response of each feature channel in the region. The feature map becomes robust to the distortion of the data and achieves a high invariance through the pooling. The pooling layer can also decrease the size of the feature map, thereby reducing computational burden.
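For illustration, a minimal sketch of max pooling with a nonoverlapping window (the reshape trick and the divisibility assumption are ours, not from the paper):

```python
import numpy as np

def max_pool(x, p=2):
    """Max pooling with a p x p window and stride p on a single
    feature channel x (H x W); H and W are assumed divisible by p."""
    H, W = x.shape
    return x.reshape(H // p, p, W // p, p).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(x))  # [[ 5.  7.] [13. 15.]]
```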

The traditional CNN processing flow is designed in accordance with the data structure characteristics of the HSI classification task (Figure 2). The training data are passed through forward propagation to determine the actual output, which is then compared with the ground-truth label. Stochastic gradient descent (SGD) is used to modify the synapses and parameters of the network structure, and several training iterations are conducted to form a network model. Test data are then input into the network model. The output obtained by feature extraction is matched against the ground truth, and the result is classified by the competitive output model.

The number of spectral bands along the third dimension of an HSI reaches the hundreds, and the values are approximately continuous. The curve of each pixel possesses a unique spectral plot that differs from those of other categories, and these plots are difficult to distinguish by the human eye. However, the CNN outperforms human vision in several respects. Therefore, this study exploits the labeled spectral curves to improve the classification performance of CNN models on HSIs.

3. Proposed CNN-Based HSI Classification Model

3.1. BN

The SGD is used to train the NN in CNN training. This method is simple and effective, but the model parameters must be carefully adjusted; in particular, the learning rate and the initialization of the model parameters must be tuned during optimization, which significantly reduces the speed of CNN training. Moreover, the entire network must adapt to a new data distribution whenever the distribution of the input data at a network layer changes, resulting in decreased training speed and a saturated gradient. If the distribution of the nonlinear inputs is kept stable, the probability of nonlinear saturation is minimal and the training of the network can be accelerated. Therefore, the BN algorithm is introduced to eliminate this phenomenon and expedite the training of the network.

In the BN algorithm, for each hidden-layer neuron, the input distribution, whose value interval is gradually pushed toward the saturation region of the nonlinear mapping, is forced back to a standard normal distribution with a mean of 0 and a variance of 1. Thus, the input value of the nonlinear transformation function falls into the region that is sensitive to the input data, which avoids the gradient disappearance problem. The most mature technique in early DNN normalization is whitening; however, whitening the input of each layer would incur excessive computational costs and is not differentiable everywhere. Thus, BN adopts two simplifications.

The first simplification is that zero-mean, unit-variance normalization of each scalar feature is used instead of whitening, and the inputs and outputs of a layer are normalized simultaneously. For an NN with $d$-dimensional input $x = (x^{(1)}, \ldots, x^{(d)})$, each dimension is normalized as

$$\hat{x}^{(k)} = \frac{x^{(k)} - \mathrm{E}[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}},$$

where the expected value $\mathrm{E}[x^{(k)}]$ and variance $\mathrm{Var}[x^{(k)}]$ are calculated over the corresponding batches.

The primitive activation value $x^{(k)}$ of a neuron is thus converted by subtracting the corresponding batch mean and dividing by the standard deviation. If the input of an NN layer is simply normalized in this way, the characterization capability of the layer is reduced. Thus, BN introduces a pair of parameters $\gamma^{(k)}$ and $\beta^{(k)}$ for each activation value $x^{(k)}$, which scale and translate the normalized input as

$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}.$$

The two parameters are similar to the network parameters and are trained and modified in the same way, so that the characterization capability of the model is restored. When $\gamma^{(k)} = \sqrt{\mathrm{Var}[x^{(k)}]}$ and $\beta^{(k)} = \mathrm{E}[x^{(k)}]$, the original activation value can be recovered from the output; that is, the network can restore the feature distribution to be learned by the original network.

The second simplification concerns the fact that NN training is based on the entire training dataset. Theoretically, the entire training dataset could be used to normalize the activation values; however, this cannot be applied under SGD because of the large amount of computation over the whole dataset. BN therefore introduces mini-batches into SGD and calculates the corresponding mean and variance over each mini-batch. Let $\mathcal{B} = \{x_1, \ldots, x_m\}$ denote a mini-batch of size $m$. Because each dimension of the activation is normalized independently, we consider a particular activation $x^{(k)}$ and omit the superscript $k$ for clarity; the mini-batch then contains $m$ values of this activation. The BN transform is given by Formula (5).

The mean and variance of the mini-batch are defined as

$$\mu_{\mathcal{B}} = \frac{1}{m} \sum_{i=1}^{m} x_i, \quad \sigma_{\mathcal{B}}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2,$$

from which the normalized values $\hat{x}_i$ and the BN outputs $y_i$ are obtained:

$$\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \quad y_i = \gamma \hat{x}_i + \beta,$$

where $\epsilon$ is a small constant (tending to 0) that ensures the numerical stability of the variance calculation.

The computation becomes large and complex when the BN algorithm is applied to the CNN if every neuron in every layer is normalized separately. Thus, following the idea of weight sharing, the mean and variance of the activation values are computed over an entire feature map. The input $x \in \mathbb{R}^{H \times W \times D \times T}$, output $y \in \mathbb{R}^{H \times W \times D \times T}$, and parameters $\gamma, \beta \in \mathbb{R}^{D}$ are defined such that $H$, $W$, $D$, and $T$ represent the length and width of the input and output data, the number of feature channels, and the mini-batch size, respectively. We explicitly define the input and output arrays as 4D data to process the feature maps by batch. The output feature map is expressed by the following formula:

$$y_{ijdt} = \gamma_d \frac{x_{ijdt} - \mu_d}{\sqrt{\sigma_d^2 + \epsilon}} + \beta_d, \quad
\mu_d = \frac{1}{HWT} \sum_{i,j,t} x_{ijdt}, \quad
\sigma_d^2 = \frac{1}{HWT} \sum_{i,j,t} (x_{ijdt} - \mu_d)^2.$$
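As a sketch of this per-channel normalization (the shape convention H x W x D x T follows the text; the NumPy implementation details are assumptions):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization over a 4D feature map x of shape
    (H, W, D, T): one mean/variance per feature channel d, computed
    over the spatial positions and the mini-batch, as in the text."""
    mu = x.mean(axis=(0, 1, 3), keepdims=True)    # mu_d
    var = x.var(axis=(0, 1, 3), keepdims=True)    # sigma_d^2
    x_hat = (x - mu) / np.sqrt(var + eps)         # normalize
    # gamma, beta have one entry per channel: broadcast as (1, 1, D, 1)
    return gamma.reshape(1, 1, -1, 1) * x_hat + beta.reshape(1, 1, -1, 1)

x = np.random.randn(15, 15, 8, 100)       # H=W=15, D=8 channels, T=100 batch
y = batch_norm(x, np.ones(8), np.zeros(8))
print(y.mean(axis=(0, 1, 3)).round(6))    # per-channel means ~ 0
```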

BN is performed on the activation value of each hidden-layer neuron and can be regarded as an additional operation layer. It is located after the activation value is obtained and before the nonlinear activation function, as depicted in Figure 3.

After theoretical analysis and experimental verification, the proposed BN algorithm possesses the following characteristics. (1) A large initial learning rate can be selected to improve the training speed. (2) Dropout can be removed and the L2 weight attenuation coefficient can be reduced. (3) Local response normalization (LRN) can be replaced. (4) The training data can be completely shuffled (i.e., a sample is not frequently reselected in successive training batches).

3.2. Softmax Loss Models

The Softmax regression model is a generalization of the logistic regression model to multiclass problems. Following the data structure of Section 3.1, the output of the regression function is

$$y_{ijk} = \frac{e^{x_{ijk}}}{\sum_{d=1}^{D} e^{x_{ijd}}}.$$

This formula is not limited by the number of feature channels and is applied at all spatial positions in a convolutional manner to translate linear predictions into categorical probabilities. The linear prediction results of the categories serve as input to the regression model, and probability values (likelihoods), which represent the possibility of the data belonging to different categories, are obtained. The Softmax regression model can be regarded as the combination of an exponential activation function and a normalization operator. The classification loss function aims to compare the prediction $x$ with the real class label $c$. The classification loss is defined as follows:

$$\ell(x, c), \quad x \in \mathbb{R}^{1 \times 1 \times C}, \quad c \in \{1, 2, \ldots, C\},$$

where $1 \times 1 \times C$ is the size of the 3D data, the 1D vector $x$ represents the class scores, and $c$ represents the real class label. The logarithmic loss function (logarithmic likelihood loss function), which is based on the maximum likelihood principle, is commonly used in logistic regression. Thus, the vector $y$ represents the posterior probabilities of the different classes, and the output of the loss function is the negative logarithmic probability of the real label:

$$\ell(y, c) = -\log y_c,$$

where $y$ is the output of the Softmax regression model; this computation is numerically unstable. On the one hand, the score $x_c$ should compete with the other scores to yield a meaningful logarithmic loss; otherwise, the minimization of Formula (13) could be achieved by maximizing all $x_k$, whereas the desired effect is that $x_c$ is larger than $x_k$, $k \neq c$. On the other hand, the Softmax regression model lets the scores compete through the normalization factor. This study therefore adopts the Softmax loss model, which combines the computation module of the regression model and that of the logarithmic loss into a single module.

The combined module yields a numerically stable output for the score $x$. By combining the logarithmic loss with Softmax, the loss becomes

$$\ell(x, c) = -x_c + \log \sum_{k=1}^{C} e^{x_k},$$

which automatically makes the scores compete: $\ell(x, c) \approx 0$ when $x_c \gg \sum_{k \neq c} x_k$. Although this model is similar to the final output of the logarithmic loss function, the experimental results show that the Softmax loss model has the following advantages. (1) The calculation steps are few and the computational cost is small. (2) The numerical gradients are relatively stable. (3) The competitive output improves the classification accuracy.
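The numerical stability gained by the combined module can be illustrated with a short sketch (the max-subtraction trick is the standard way to stabilize the log-sum-exp term; the code itself is illustrative):

```python
import numpy as np

def softmax_loss(x, c):
    """Combined Softmax + log-loss for one sample: x is the vector of
    C raw class scores, c the true class index. Subtracting max(x)
    before exponentiation keeps the computation numerically stable."""
    x = x - x.max()                          # shift: exp() cannot overflow
    return -x[c] + np.log(np.exp(x).sum())   # -x_c + log sum_k exp(x_k)

scores = np.array([2.0, 1.0, 1000.0])   # a large score would overflow naive exp
print(softmax_loss(scores, 2))          # ~0: the correct class dominates
```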

3.3. Two-Dimensional Spectrum CNN Model

Hyperspectral data contain a small total number of labeled pixels, whereas a single pixel has rich spectral values. Accordingly, the 2D-spectrum method proposed in this paper converts the pixel spectral vectors into two-dimensional spectral images that serve as the input data of the CNN. The convolution network can thus fully utilize the spatial position information among the different spectral values to improve the classification accuracy, as shown in Figure 4. The network adopts multiple alternating convolution and max pooling layers. BN layers, which regulate the data distribution and accelerate network training, are inserted after the first, fourth, and seventh layers. The Softmax loss model is then used to control the output. The 2D-spectrum CNN structure is shown in Figure 5.
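The paper does not spell out the exact vector-to-image mapping, so the following sketch assumes a row-major reshape with zero padding to the next perfect square (e.g., a 200-band pixel becomes a 15 x 15 image); treat these details as assumptions:

```python
import numpy as np

def spectrum_to_2d(pixel):
    """Convert a 1D spectral vector into a square 2D-spectrum image.
    The exact reshaping rule is not specified in the paper; this
    sketch zero-pads to the next perfect square and reshapes row-wise."""
    b = pixel.size
    s = int(np.ceil(np.sqrt(b)))               # side length of the square image
    padded = np.zeros(s * s, dtype=pixel.dtype)
    padded[:b] = pixel
    return padded.reshape(s, s)

pixel = np.random.rand(200)       # e.g., a 200-band Indian Pines pixel
print(spectrum_to_2d(pixel).shape)  # (15, 15)
```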

4. Experimental Results

In this section, we first describe the datasets used in the experiments. We then use the BN algorithm and the Softmax loss model to verify the HSI classification performance of the CNN model. Finally, the proposed method is compared with other similar methods to determine the advantages and disadvantages of the 2D-spectrum CNN model. The overall accuracy (OA) and the Kappa coefficient are used as the performance measurements for each type of classification accuracy (percentage). The mathematical relationship between the error and the OA is

$$\mathrm{error} = 1 - \mathrm{OA}.$$
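For reference, a short sketch computing the OA and Kappa coefficient from a confusion matrix (these are the standard definitions; the variable names are ours):

```python
import numpy as np

def oa_and_kappa(cm):
    """Overall accuracy and Kappa coefficient from a confusion
    matrix cm (rows: true class, columns: predicted class)."""
    n = cm.sum()
    oa = np.trace(cm) / n                        # OA = correct / total
    pe = (cm.sum(0) * cm.sum(1)).sum() / n**2    # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, kappa                             # error = 1 - OA

cm = np.array([[50, 2, 1],
               [ 3, 45, 2],
               [ 1, 1, 48]])
print(oa_and_kappa(cm))
```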

Each classification result is averaged over 10 runs to avoid any deviation caused by random sampling. All of the experiments are tested on a desktop with NVIDIA GeForce GTX 1070 8G GPU, 16 GB memory, and 64-bit Windows 7 OS using MATLAB 2014b.

4.1. Hyperspectral Datasets

In the experiments, the hyperspectral datasets used for the network model include the Indian Pines, Salinas, and Kennedy Space Center (KSC) datasets; the model is also applied to a new dataset, Botswana. Table 1 summarizes the detailed information of the four datasets.

Figures 6–8 exhibit the characteristics of the four datasets. Tables 2–5 list the numbers of samples trained and tested for the corresponding datasets, which are assigned according to a 1 (train) : 3 (test) ratio. According to previous experiments, too little training data produces underfitting. The samples of labels 1, 7, 9, and 16 of the Indian Pines dataset are insufficient; for these labels, the total samples are instead allocated according to 3 (train) : 1 (test) to avoid underfitting and ensure that the training samples are sufficient.
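A sketch of this stratified split might look as follows (the function name, seed handling, and rounding rule are assumptions; only the 1 : 3 ratio and the 3 : 1 exception for the scarce classes come from the text):

```python
import numpy as np

def split_labels(labels, scarce_classes=(1, 7, 9, 16), seed=0):
    """Stratified split: 1 (train) : 3 (test) per class, except the
    scarce classes listed, which are split 3 (train) : 1 (test)."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        frac = 0.75 if c in scarce_classes else 0.25   # train fraction
        n_train = int(round(frac * idx.size))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)
```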

The learning rates of the CNN model are [1e-03, 5e-04, 1e-05], with corresponding numbers of training iterations of 50, 30, and 20, and the number of samples per batch is 100.
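Expressed as a per-epoch schedule, this training configuration might be sketched as follows (the list-based representation is an assumption; only the rates, epoch counts, and batch size come from the text):

```python
# Staged learning rate: 50 epochs at 1e-3, 30 at 5e-4, 20 at 1e-5,
# with mini-batches of 100 samples (a sketch of the schedule used here).
schedule = [1e-3] * 50 + [5e-4] * 30 + [1e-5] * 20
batch_size = 100

for epoch, lr in enumerate(schedule, start=1):
    pass  # train one epoch with learning rate lr
```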

4.2. BN Performance Verification

The BN operation layer is added to the CNN to solve the problem of the modified distribution of the internal nodes caused by changes in the input data of the traditional CNN. The addition of this layer solves the problems of training saturation and gradient disappearance, significantly increases the training speed, and improves the classification performance. As discussed in Section 3.1, the BN operation layer can replace the traditional dropout layer and the local response normalization (LRN) layer and achieve a more satisfactory result, as presented in Figure 9 and Table 6.

Table 6 indicates that dropout and LRN are applied during network training to accelerate training and prevent overfitting. However, the data comparison shows that the BN operation layer outperforms both in training acceleration and classification accuracy, and the experimental results depict obvious differences. Figure 10 presents the test error results of Table 6.

4.3. Softmax Loss Models

The CNN in this study uses the Softmax loss model in the output to replace the traditional Softmax regression and multinomial logistic loss models. Table 7 compares the classification accuracy among the three models.

Figure 10 presents a more intuitive histogram. The results of the four hyperspectral data experiments show that, among the three models, the Softmax loss model achieves the highest classification accuracy.

4.4. Comparison with Other Approaches

The proposed method is compared with recent works on HSI classification. The training accuracy evaluation indicators, namely, the OA and Kappa coefficients, are important criteria for assessing the classification performance of a network model. In Table 8, the proposed 2D-spectrum CNN model is compared with other advanced deep learning methods. The comparative data are derived from the work of Hu et al. [27], the CNN data based on PPFs are from the research of Li et al. [28], and the data of the Band-Adaptive Spectral-Spatial (BASS) network architecture are from the paper of Santara et al. [29].

Figures 11 and 12 display the ground truth, training set, test set, and classification maps of Indian Pines and Salinas. The comparison shows that the classification performance of the 2D-spectrum CNN model is superior to that of the other methods on the different hyperspectral datasets in terms of overall accuracy and Kappa coefficient.

5. Conclusion

This paper proposes a 2D-spectrum CNN model with multilevel BN operating layers for HSI classification. The output uses a Softmax loss model for classification. Many relevant factors, including the number of iterations, the learning rate, the size of the input data, and the sizes of the filters and feature maps, affect the final classification performance; in the experiments, these parameters were tuned to their optimal values and achieved the desired effect. The experimental results show that the CNN model proposed in this paper provides excellent performance. The training process converges quickly, indicating that the method can be applied to large datasets. Furthermore, high classification accuracy can be achieved with sufficient iterations and data.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Authors’ Contributions

Hongmin Gao and Shuo Lin contributed equally to this work and should be considered co-first authors.

Acknowledgments

This study is supported by the National Natural Science Foundation of China (no. 61701166), Projects in the National Science & Technology Pillar Program during the Twelfth Five-Year Plan Period (no. 2015BAB07B01), the Fundamental Research Funds for the Central Universities (no. 2018B16314), the China Postdoctoral Science Foundation (no. 2018M632215), National Science Foundation for Young Scientists of China (no. 51709271), and Young Elite Scientists Sponsorship Program by China Association for Science and Technology (no. 2017QNRC001).