Abstract

In recent years, deep learning has been widely used in hyperspectral image (HSI) classification and has shown strong performance. In particular, convolutional neural network (CNN) based methods have achieved attractive results. However, HSI contains a large amount of redundant information, and CNN-based models are constrained by the limited receptive field of convolutions, making it difficult to balance performance against model depth. Furthermore, since HSI can be regarded as sequence data, CNN-based models cannot mine sequential features well. In this paper, we propose a model named SSA-Transformer to address these problems and extract the spectral-spatial features of HSI more efficiently. The SSA-Transformer combines a modified CNN-based spectral-spatial attention mechanism with a self-attention-based transformer that uses dense connections, allowing it to fuse the local and global features of HSI and improve classification performance. A series of experiments showed that the SSA-Transformer achieves competitive classification accuracy compared with other CNN-based methods on three HSI datasets: University of Pavia (PU), Salinas (SA), and Kennedy Space Center (KSC).

1. Introduction

Hyperspectral image (HSI) contains rich information in both spectral and spatial dimensions with high correlation [1, 2]. Based on such advantages, HSI has been applied in many fields, such as mineral exploration, environmental monitoring, and urban development. So far, much effort has been made in the field of HSI analysis and processing, including classification [3], anomaly detection [4, 5], and dimensionality reduction [6]. Previous studies of HSI classification mostly used support vector machines (SVM) [7–9], k-nearest neighbor (k-NN) [10], and multinomial logistic regression (MLR) [11]. However, these models heavily rely on experts’ domain knowledge and engineering experience.

With the development of deep learning methodologies, multiple HSI classification methods have been developed and widely used in the past few years, including the Stacked Autoencoder (SAE) [12], Deep Belief Network (DBN) [13–15], and Recurrent Neural Network (RNN) [16, 17]. In addition, CNN has the advantages of directly processing 3D image patches and extracting a large amount of spatial context information, so a large number of CNN-based models have appeared in the field of HSI classification [18–23]. Hu et al. [18] used convolutional neural networks for HSI classification for the first time. However, that model contains only 1D convolution kernels, so it uses only the spectral information of HSI and ignores the spatial context, which limits its accuracy. Later, various models emerged that use both spatial and spectral information for classification. He et al. [19] used multiple 3D convolution kernels of different sizes to build the M3D-DCNN model, which can extract multiscale spectral-spatial information from HSI. Gao et al. [20] used small convolutions and dense connections in their model to extract spectral-spatial features. However, these models are shallow, and their performance is not good enough. CNN-based models can extract richer features by increasing their depth. Paoletti et al. [21] proposed a deep residual network (pResNet) by stacking the pyramid bottleneck residual units derived from the pyramid residual network [24]. The depth of the model can reach more than 30 layers, which can extract rich spectral and spatial features. This model contains a large number of 2D convolution kernels, and its performance is much better than that of the above models, but as the model becomes deeper, the training time grows. Li et al. [22] proposed a dual-branch network to extract spectral and spatial information separately; however, stacking multiple 3D convolution kernels also made the training time too long, and the performance of the model did not improve much. The receptive field of a CNN is limited by its small convolution kernels. As a result, CNN-based models cannot extract global features, which creates a performance bottleneck for them.

To solve the above problems, many transformer-based models have emerged. Considering the large spectral dimension of HSI, HSI can be regarded as sequence data. Just as a word vector in the NLP field represents the meaning of a word, the spectral vector of an HSI pixel represents land cover information. Moreover, the spatial information of HSI is similar to the context of the target word in the NLP field [25]. The transformer model was originally used in natural language processing (NLP) [26–28]; its self-attention mechanism can mine the global features of a sequence, which made it a great success in the field of NLP. A self-attention-based transformer can make better use of the correlation within HSI and can extract the global features of neighborhood pixel blocks. However, the performance of many transformer-based models is still not good enough. The model proposed by Hu et al. [29] combines 1D-CNN and Vision Transformer (ViT) [30], but its overall accuracy on the PU and SA datasets is only 93.77% and 96.15%, respectively. The reason is that ViT directly segments the input image and cannot handle the low-level features of the input image well [31]. Inspired by the model proposed by Yuan et al. [31], we use a spectral-spatial attention mechanism to process neighborhood pixel blocks into feature maps and then segment the feature maps. In this study, we propose a model that combines a CNN-based spectral-spatial attention mechanism and a self-attention-based transformer (SSA-Transformer). The advantage of the SSA-Transformer is that it can extract both local and global features of HSI data and improve classification results. Specifically, the spectral-spatial attention mechanism is used to extract the local features of the neighborhood pixel blocks and reduce redundant information; the feature maps are then processed into sequences, and finally the global features are extracted by the self-attention-based transformer encoder blocks. We compared the proposed SSA-Transformer with other CNN-based methods on three public HSI datasets, revealing its competitive classification performance.

The main contributions of this work can be summarized as follows:
(1) In our proposed model, we use a spectral-spatial attention mechanism to extract the low-level features of neighborhood pixel blocks, which overcomes the inability of the transformer part of the model to extract rich low-level features from the input image.
(2) Our model combines the advantages of both CNN and transformer: the CNN part extracts the local features of HSI, while the transformer part extracts the global features. Therefore, the model can effectively exploit both local and global features of HSI for more efficient classification.
(3) We apply dense connections that feed the features extracted by each transformer encoder block directly to all subsequent transformer encoder blocks, which further improves the flow of information among blocks and reduces information loss.

2. Methodology

In this section, we first explain the details of the spectral-spatial attention mechanism. Next, we introduce the principles of linear embedding and transformer encoder. Finally, we discuss the overall architecture of the proposed HSI classification method.

2.1. Spectral-Spatial Attention

A transformer-only model directly segments the input image into patches, which prevents it from extracting rich low-level features. Therefore, we use an attention mechanism to extract the rich low-level features of the neighborhood pixel blocks fed to the model. Specifically, we use a modified CBAM [32] as the spectral-spatial attention mechanism for feature extraction from neighborhood pixel blocks. The spectral-spatial attention module consists of a spectral attention module (SeAM) and a spatial attention module (SaAM) [33]. SeAM selects spectral features that are useful for classification, while SaAM selects spatial features that are useful for classification.

Figure 1 shows the structure of the entire spectral-spatial attention mechanism. We first apply convolution operations to the HSI neighborhood pixel block $X \in \mathbb{R}^{H \times H \times C}$, where H and C represent the spatial size and the spectral dimension, respectively. Next, we extract the features of the input data that contribute to classification by SeAM and SaAM, respectively. Note that none of these steps change C. Finally, a 1 × 1 convolution is used to extract discriminative features from the neighborhood pixel blocks and reduce the spectral dimension. During this process, useless information is discarded to avoid degrading classification performance. We introduce the detailed process of the spectral-spatial attention mechanism in the following subsections. The overall attention calculation can be summarized as
$$F = F_{\text{conv}}(X), \quad F' = M_{\text{se}}(F) \otimes F, \quad F'' = M_{\text{sa}}(F') \otimes F', \quad Y = F_{1\times 1}(F''),$$
where $\otimes$ denotes elementwise multiplication, $F_{\text{conv}}$ consists of two 3 × 3 convolution layers, $M_{\text{se}}$ denotes the spectral attention module, $M_{\text{sa}}$ denotes the spatial attention module, and $F_{1\times 1}$ consists of one 1 × 1 convolution layer.

2.2. Spectral Attention Module

For different classes of pixels in HSI, the spectral bands that contribute to the classification are different, and some spectral bands will reduce the accuracy of the classification [34]. Therefore, the role of SeAM is to strengthen the contribution of the spectral bands that are helpful to the classification and weaken the contribution of the spectral bands that are useless or even harmful to the classification. This module maps the input into a weight vector to indicate the contribution of each spectral band to the classification result.

The structure of SeAM is shown in Figure 2. The module first generates two 1 × 1 × C vectors, $v_{\text{avg}}$ and $v_{\text{max}}$: $v_{\text{avg}}$ is generated by global average pooling, and $v_{\text{max}}$ by global max pooling. The two vectors are then passed through the fully connected layer $F_1$ for dimensionality reduction, and their dimensionality is restored through the fully connected layer $F_2$. Next, the spectral weight vector $M_{\text{se}}$ is generated by adding the two resulting vectors and applying the ReLU activation function:
$$M_{\text{se}} = \sigma\big(F_2(F_1(v_{\text{avg}})) + F_2(F_1(v_{\text{max}}))\big),$$
where $\sigma$ denotes the ReLU activation function.

Finally, the spectral weight vector $M_{\text{se}}$ is multiplied spectralwise by the input $F$ to obtain the output $F' = M_{\text{se}} \otimes F$.
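The following PyTorch sketch shows one way to realize SeAM as described above; the reduction ratio of the F1 layer and the ReLU between F1 and F2 are assumptions borrowed from CBAM, since the text does not specify them.

```python
import torch
import torch.nn as nn

class SpectralAttention(nn.Module):
    """Sketch of SeAM: global average- and max-pooled spectral vectors pass
    through a shared two-layer MLP (F1, F2), are summed, and a ReLU produces
    the per-band weight vector."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)   # -> (B, C, 1, 1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)   # -> (B, C, 1, 1)
        self.mlp = nn.Sequential(                 # shared F1/F2 layers
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.act = nn.ReLU(inplace=True)          # ReLU on the summed vectors

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        v_avg = self.mlp(self.avg_pool(x).view(b, c))
        v_max = self.mlp(self.max_pool(x).view(b, c))
        w = self.act(v_avg + v_max).view(b, c, 1, 1)   # 1 x 1 x C weight vector
        return x * w                                    # spectralwise multiplication
```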

2.3. Spatial Attention Module

All pixels of a neighborhood block are initially considered to belong to the class of the center pixel; that is, the contribution of all neighbor pixels to the class of the center pixel is initially the same [33]. However, this gives no way to distinguish the contributions of different pixels in the neighborhood, which may affect the classification of pixels located on the boundary between two different categories. In addition, not all pixels in the neighborhood contribute to the classification of the center pixel, and some of them may even reduce classification accuracy [33]. Therefore, the role of SaAM is to enhance the contribution of pixels that are helpful for classification and weaken the contribution of pixels that are useless or even interfere with the classification. The structure of SaAM is shown in Figure 3. This module first calculates the average value and maximum value of the elements along the spectral dimension, respectively, obtaining the outputs $F'_{\text{avg}}$ and $F'_{\text{max}}$ with the shape $H \times H \times 1$.

Next, we concatenate these two outputs and apply a convolution operation and a sigmoid activation function to obtain a new output, which represents the contribution of each pixel. The specific calculation is
$$M_{\text{sa}} = \delta\big(f_{\text{conv}}([F'_{\text{avg}}; F'_{\text{max}}])\big),$$
where $[\cdot;\cdot]$ denotes concatenation along the spectral dimension, $f_{\text{conv}}$ denotes the convolution layer, and $\delta$ denotes the sigmoid activation function.

Finally, $M_{\text{sa}}$ is multiplied spatialwise by the input $F'$ to obtain the output $F'' = M_{\text{sa}} \otimes F'$.
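A corresponding sketch of SaAM is given below; the 7 × 7 kernel size is an assumption borrowed from CBAM, since the text only mentions "a convolution operation".

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of SaAM: per-pixel average and maximum over the spectral dimension
    are concatenated, convolved, and passed through a sigmoid to yield an
    H x H contribution map."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f_avg = torch.mean(x, dim=1, keepdim=True)     # (B, 1, H, H)
        f_max, _ = torch.max(x, dim=1, keepdim=True)   # (B, 1, H, H)
        m = self.sigmoid(self.conv(torch.cat([f_avg, f_max], dim=1)))
        return x * m                                    # spatialwise multiplication
```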

2.4. Linear Embedding

The transformer abandons the sequential dependency of RNNs and introduces a self-attention mechanism. Self-attention can capture the global information (long-term correlation) of the input patch at any location [35]. However, self-attention alone discards the positional relationships among the input vectors. ViT solves this problem by processing the input into a sequence of linear embeddings [30]. The overall process is shown in Figure 4: the input data are first segmented into patches, each patch is flattened into a vector, an extra vector is prepended for classification, and finally a position code is added to each vector.
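The following sketch illustrates this linear embedding step for a feature map of the shape produced by the spectral-spatial attention module; the concrete sizes (h = 9, p = 3, k = 64) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LinearEmbedding(nn.Module):
    """Sketch of linear embedding: split a (B, k, h, h) feature map into p x p
    patches, flatten each patch, prepend a learnable class token, and add
    learnable position embeddings."""

    def __init__(self, h: int = 9, p: int = 3, k: int = 64):
        super().__init__()
        self.p = p
        num_patches = (h // p) ** 2           # N in the text
        dim = p * p * k                        # D in the text
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.shape[0]
        # (B, k, h, h) -> (B, k*p*p, N) -> (B, N, D): non-overlapping p x p patches
        patches = nn.functional.unfold(x, kernel_size=self.p, stride=self.p)
        patches = patches.transpose(1, 2)
        cls = self.cls_token.expand(b, -1, -1)
        tokens = torch.cat([cls, patches], dim=1)      # (B, N + 1, D)
        return tokens + self.pos_embed                 # add position codes
```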

2.5. Transformer Encoder Block

Inspired by the transformer [36], the Vision Transformer [30] successfully applied self-attention to the field of computer vision. The self-attention mechanism in the transformer model can extract global features, which is the key to its attractive performance [30].

Figure 5 shows the architecture of the transformer encoder block, each of which consists of a multihead self-attention sublayer and a feedforward network sublayer. Residual connections are used around each sublayer, and the input of each sublayer is normalized using LayerNorm (LN). The self-attention mechanism can be defined as
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$
where Q, K, V, and the output are matrices. Q, K, and V are obtained by multiplying the input matrix by the weight matrices $W^{Q}$, $W^{K}$, and $W^{V}$, and $d_k$, the dimension of Q (and K), is used for scaling.

Note that this is not the only self-attention mechanism in each transformer encoder block. There are multiple such self-attention mechanisms operating in parallel, which constitute the multihead self-attention mechanism. Finally, we define the multihead self-attention mechanism as
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^{O},$$
where $W^{O}$ is a weight matrix and h is the number of heads.

The feedforward network in each transformer encoder block consists of two fully connected layers and a GeLU activation function, which can be defined as
$$\text{FFN}(x) = \text{GeLU}(xW_1 + b_1)W_2 + b_2,$$
where $\text{GeLU}(\cdot)$ denotes the GeLU activation function and $W_1$, $b_1$, $W_2$, $b_2$ are the weights and biases of the two fully connected layers.
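Putting the sublayers together, a minimal PyTorch sketch of one transformer encoder block is shown below; the 4× hidden width of the feedforward network is an assumption, since the text does not report it.

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """Sketch of one encoder block: LayerNorm -> multihead self-attention with a
    residual connection, then LayerNorm -> two-layer feedforward network with
    GeLU and another residual connection."""

    def __init__(self, dim: int, heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual over self-attention
        x = x + self.ffn(self.norm2(x))                    # residual over feedforward
        return x
```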

2.6. Overview of the Proposed Model

Figure 6 shows the overall architecture of our proposed model. First, we take each labeled pixel as the center to extract a neighborhood pixel block of size h × h × c, where h is the length and width of the pixel block and c is the spectral dimension of different HSIs. We use padding operations for edge pixels that cannot be directly extracted into pixel blocks. Finally, we get sample data of shape (n, h, h, c), where n is the total number of samples.
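A sketch of this neighborhood block extraction is given below, assuming a NumPy HSI cube of shape (rows, cols, c) and a label map in which 0 marks unlabeled pixels; the reflect padding mode is an assumption, since the text only mentions padding.

```python
import numpy as np

def extract_blocks(cube: np.ndarray, labels: np.ndarray, h: int = 9):
    """Pad the HSI cube so edge pixels also get an h x h neighborhood,
    then cut one block per labeled pixel."""
    half = h // 2
    padded = np.pad(cube, ((half, half), (half, half), (0, 0)), mode="reflect")
    rows, cols = np.nonzero(labels)                    # positions of labeled pixels
    blocks = np.stack(
        [padded[r:r + h, c:c + h, :] for r, c in zip(rows, cols)]
    )                                                  # shape (n, h, h, c)
    targets = labels[rows, cols] - 1                   # 0-based class indices
    return blocks, targets
```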

Then, we use the spectral-spatial attention module to extract the spatial and spectral features of the sample data and reduce redundant information. The spectral-spatial attention mechanism reduces the redundant information of the input data in the spectral band, and the shape of the output data is (n, h, h, k), where k is the number of spectral bands retained by the data after processing through a 1 × 1 convolutional layer.

Next, we segment each output of shape (h, h, k) into patches of shape (p, p, k); we set p to 3. Each patch of shape (p, p, k) is reshaped into a one-dimensional vector of length p × p × k. The data can then be redefined as a sequence of shape (N, D), where N = (h/p)² is the length of the sequence and D = p × p × k is the dimension of each vector in the sequence.

Finally, by adding the class embedding vector and the position code, we create a matrix of size (batch size, N + 1, D) to use as the input to the transformer part of our model. We use multiple transformer encoder blocks to continuously extract image features and use a dense connection structure to reduce information loss, as sketched below.
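The sketch below shows one plausible way to wire dense connections among transformer encoder blocks: each block receives the concatenation of the embedded input and all previous block outputs, projected back to dimension D by a linear layer. The concatenate-and-project fusion and the depth of four blocks are assumptions, since the text only states that each block's output is fed to all subsequent blocks.

```python
import torch
import torch.nn as nn

class DenselyConnectedEncoder(nn.Module):
    """Sketch of densely connected transformer encoder blocks."""

    def __init__(self, dim: int, depth: int = 4, heads: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList()
        self.fuse = nn.ModuleList()
        for i in range(depth):
            self.fuse.append(nn.Linear(dim * (i + 1), dim))  # fuse all earlier outputs
            self.blocks.append(
                nn.TransformerEncoderLayer(
                    d_model=dim, nhead=heads, dim_feedforward=dim * 4,
                    activation="gelu", batch_first=True,
                )
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]                                   # x: (batch, N + 1, D)
        for fuse, block in zip(self.fuse, self.blocks):
            out = block(fuse(torch.cat(features, dim=-1)))
            features.append(out)
        return features[-1]
```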

3. Experiments

In this section, we first introduce the three HSI datasets used to measure the performance of the model: Kennedy Space Center (KSC), University of Pavia (PU), and Salinas (SA), as illustrated in Figures 7–9. The details of the datasets are shown in Table 1. Next, we specify the model configuration process. Finally, we analyze four factors that affect the performance of the proposed model. We choose overall accuracy (OA), average accuracy (AA), and the KAPPA coefficient (κ) as the measurement indices of SSA-Transformer performance.

For the Salinas dataset, we randomly selected 10% of the labeled samples for training. For the Kennedy Space Center dataset, we randomly selected 200 samples per class as the training set. For the University of Pavia dataset, we randomly selected 400 samples per class as the training set. When a class contains too few labeled samples to meet this quota, 80% of the labels in that class are selected as the training set. We randomly set aside 25% of the training set to serve as the validation set; a sketch of this sampling procedure is given below. For fairness, all subsequent experiments were repeated ten times with randomly selected training data, and we report the mean and standard deviation of the results. A detailed experimental analysis is presented in this section.
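A minimal sketch of the per-class sampling used for KSC and PU follows; the function name and the fixed random seed are illustrative, not the authors' implementation.

```python
import numpy as np

def split_per_class(targets: np.ndarray, per_class: int, val_frac: float = 0.25, seed: int = 0):
    """Draw `per_class` samples from each class (or 80% of the class if it is
    too small), then hold out `val_frac` of the training indices as validation."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for c in np.unique(targets):
        idx = rng.permutation(np.flatnonzero(targets == c))
        n_train = min(per_class, int(0.8 * len(idx)))   # 80% fallback for small classes
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    train_idx = rng.permutation(np.array(train_idx))
    n_val = int(val_frac * len(train_idx))
    return train_idx[n_val:], train_idx[:n_val], np.array(test_idx)
```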

3.1. Experimental Datasets

Kennedy Space Center (KSC). This dataset was collected by Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensors in the Kennedy Space Center. It has a total of 224 spectral bands. After removing the water absorption and low signal-to-noise ratio (SNR) bands, the remaining 176 bands are used for experiments. Its size is 512 × 614 pixels, with a total of 5,211 marked pixels and 13 land cover categories. Table 2 lists the specific division of the dataset.

University of Pavia (PU). This dataset was collected by ROSIS sensors in the urban area of the University of Pavia in northern Italy. It has 115 spectral bands. After removing the bands affected by noise, 103 bands are left for experiments. It has a size of 610 × 340 pixels, a total of 42,776 marked pixels, and 9 land cover categories. Table 3 lists the specific division of the dataset.

Salinas (SA). This dataset was acquired by the AVIRIS sensor. It has 224 spectral bands; after removing 20 water absorption bands, 204 bands remain for experiments. The size of Salinas is 512 × 217 pixels. There are a total of 54,128 marked pixels and 16 land cover categories. Table 4 lists the specific division of the dataset.

3.2. Experimental Configuration

To evaluate the performance of the model proposed in this paper, the experiments were run on a computer with an AMD R7-4800 CPU at 2.9 GHz, 16 GB of memory, and an RTX 2060 graphics processing unit (GPU).

The model proposed in this paper was implemented with Python 3.7.0 and the PyTorch 1.2.0 deep learning framework. Optimization is performed with the SGD optimizer [37], and the loss function of our proposed model is the cross-entropy function. In the experiment on the PU dataset, the learning rate is set to 0.01 and decays to 0.001 at the 41st epoch. In the experiment on the SA dataset, the learning rate is set to 0.005, decays to 0.001 at the 41st epoch, and decays to 0.0001 at the 81st epoch. In the experiment on the KSC dataset, the learning rate is set to 0.01, decays to 0.001 at the 41st epoch, and decays to 0.0001 at the 81st epoch.
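A minimal sketch of the PU training configuration described above follows; `model` and `train_loader` are assumed to be the SSA-Transformer and its data loader, and the SGD momentum value is an assumption, since the text does not report it.

```python
import torch

def train_pu(model: torch.nn.Module, train_loader, epochs: int = 80):
    """Sketch of the PU setup: SGD with cross-entropy loss, learning rate 0.01
    decayed to 0.001 at the 41st epoch."""
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[40], gamma=0.1)
    for _ in range(epochs):
        for x, y in train_loader:            # (neighborhood blocks, labels)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        scheduler.step()                     # 0.01 -> 0.001 after the milestone
```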

3.3. Parameter Setting

Some factors have a significant impact on the classification performance of the model, and we analyze their influence in this subsection. These factors are batch size, spatial size, training sample size, and the number of heads in the multihead self-attention mechanism. The total number of epochs for the PU, SA, and KSC experiments is set to 80, 120, and 120, respectively.

3.3.1. Batch Size

A batch size matched to the model can effectively improve its accuracy and memory utilization. We test the accuracy of the model with batch sizes of 16, 32, and 64, with results shown in Figure 10. Our experiments show that a batch size of 64 gives the best classification performance on all three datasets (SA, PU, and KSC).

3.3.2. Spatial Sizes

The spatial size determines the spatial information that the model can use for classification and has a great impact on the performance of the model. To evaluate the impact of the spatial size on the performance of the SSA-Transformer, we choose spatial sizes of 9, 15, and 21 for the experiment. Figure 11 shows the effect of the spatial size on the performance (OA) of the SSA-Transformer. We observed that accuracy does not necessarily increase with spatial size, because a larger neighborhood also introduces more pixels that interfere with the classification.

Considering that larger spatial sizes will lead to higher computational costs, the spatial sizes of SA, PU, and KSC datasets are set to 9×9, 15×15, and 9×9, respectively.

3.3.3. Training Sample

We consider utilizing 5%, 10%, and 15% of the sample data in SA and 200, 300, and 400 samples per class in KSC and PU as the training set, respectively. The rest of the data is used as the test set. Table 5 shows the results obtained by training our proposed model on the corresponding training sets. From Table 5, we can see that, in the KSC experiment, the accuracy of the three settings differs little. For the other two datasets, the larger the training set, the higher the accuracy, because a larger training set alleviates the overfitting problem of the model.

There is a trade-off between the performance and training time of the model. For the University of Pavia dataset, we use 400 samples per class. For KSC, we use 200 samples per class. For Salinas, we use 10% of the sample data.

3.3.4. The Number of Heads of the Multihead Self-Attention Mechanism

The multihead self-attention mechanism can attend to different positions and can more effectively mine the relationships among the vectors of the sequence. We therefore set the number of heads to 4, 6, and 8 for the experiments. Figure 12 shows the effect of different numbers of heads on the accuracy of the model. There is a trade-off between the performance and training time of the model, and the number of heads for the SA, PU, and KSC datasets is set to 8, 6, and 8, respectively.

4. Results and Discussion

In this section, we compare our proposed model with several recently developed typical CNN-based models, including 1D-CNN [18], M3D-DCNN [19], SC-FR [20], pResNet [21], and DBDA [22]. We repeat all experiments on the three datasets ten times to ensure fairness. We uniformly use the spatial size, training sample size, and batch size determined in Section 3 as the input settings for both the comparison models and our proposed model. The evaluation indicators OA, AA, and KAPPA coefficient are expressed in the form of “mean ± standard deviation.” In addition, we also use the variance of OA and the variance of AA to express the volatility of accuracy.

4.1. Comparing with Other Methods

The classification results for each of the methods are shown in Tables 6–8. Experimental results demonstrate that our proposed model achieves the best performance on the PU and SA datasets. On the KSC dataset, compared with DBDA, our proposed model is 0.01%, 0.01%, and 0.02% lower in OA, AA, and KAPPA, respectively, but the gap is not significant. On the Salinas dataset, compared with 1D-CNN, the OA, AA, and KAPPA of our model are 10.02%, 9.36%, and 13.42% higher, respectively. This is because 1D-CNN extracts only the spectral features of HSI and does not use spatial features. M3D-DCNN, pResNet, and SC-FR are models based on 3D pixel blocks that do not use an attention mechanism. M3D-DCNN uses a variety of convolution kernels of different sizes to obtain multiscale information; although it uses 3D convolution, it does not use an attention mechanism, so its OA, AA, and KAPPA on the Salinas dataset are 3%, 1.55%, and 3.34% lower than those of our proposed model, respectively. pResNet and SC-FR are 2D-CNN-based models, and their OA on the Salinas dataset is 0.04% and 1.18% lower than our proposed model, respectively. This result shows that it is difficult to fully extract features relying only on 2D-CNN. DBDA is a 3D-CNN-based model; although it uses spatial and spectral attention mechanisms to extract spatial and spectral features, its OA on the PU and SA datasets is 0.17% and 0.43% lower than our proposed model, respectively. The reason is that this model cannot use the global information of neighborhood pixel blocks for classification. Although the OA of DBDA on the KSC dataset is 0.01% higher than that of our proposed model, the accuracy of our proposed model on class 5 (Oak/Broadleaf) reaches 100%, while the accuracy of DBDA on class 5 is only 99.52%. The performance of our proposed model on the three datasets shows that a model combining transformer and CNN can reach a level of accuracy competitive with CNN-based models.

Figures 13–15 visualize the classification results of our proposed model and the other five models on the three datasets. The classification map of 1D-CNN clearly contains many noise points because the model does not extract the spatial features of HSI. The remaining comparison models and our proposed model all use the spatial information of HSI to aid classification, so this noise problem is alleviated. Moreover, since M3D-DCNN, SC-FR, and pResNet do not use an attention mechanism, these models are more likely to be disturbed by pixels and spectral bands that do not contribute to the classification. For example, on the SA dataset, none of these models can accurately mark class 15 (Vinyard_untrained), while our proposed model marks class 15 most accurately. Although DBDA also uses an attention mechanism, its classification maps show that its classification effect is not as good as that of our proposed model. Comparison with the ground-truth images shows that our proposed model achieves a more accurate and smoother classification result.

The above experiments show that our proposed model achieves competitive performance compared with CNN-based models. Balancing performance and efficiency is also important. Table 9 shows the training time and test time of pResNet, DBDA, and our proposed model on PU, KSC, and SA. Our proposed model requires less training time than DBDA and pResNet. Although DBDA performs better on KSC than our proposed model, the training time of our proposed model is only 65% of that of DBDA, which shows that our model achieves a better balance between efficiency and accuracy.

4.2. Computing Time for Selecting Different Numbers of Bands

When we introduced the spectral-spatial attention module in Section 2, we mentioned that this module will select appropriate bands in the last layer (i.e., the 1 × 1 convolution layer) to reduce redundant information, which can also reduce the time required for the model to train and test. Table 10 shows the training time and test time of the model when the spectral-spatial attention module selects 16, 32, and 64 bands. We can find that as the number of selected bands decreases, both the training time and the testing time of the model decrease.

4.3. Effectiveness of the Dense Connection

Dense connection can improve the flow of information between transformer encoder blocks and reduce the loss of information. To prove the effectiveness of dense connection, we removed dense connection and compared the performance of these two models.

Figure 16 shows the improvement of model performance by dense connections. A model with a dense connection can achieve higher accuracy. We conclude that dense connection can improve the performance of model classification.

4.4. Effectiveness of the Spectral-Spatial Attention Module

In Section 2, we explain the role of the CNN-based spectral-spatial attention module. To prove the effectiveness of the spectral-spatial attention module, we removed the spectral-spatial attention module, spectral attention, and spatial attention, respectively, and compared the performance of these four models.

The impact of the spectral-spatial attention module on the model performance is shown in Figure 17. The performance of the model is greatly improved by extracting low-level local features from neighborhood pixel blocks. This shows that combining the local features extracted by CNN and the global features extracted by transformer can more effectively improve the performance of the model. It is worth noting that, on the SA dataset, the accuracy of our proposed model is improved after removing the spectral attention. The reason is that many pixels with different labels in the SA dataset have similar spectral characteristics. After adding spectral attention, the model pays too much attention to the spectral features. We conclude that spectral-spatial attention module can improve the performance of model classification.

5. Conclusion

In this paper, we propose a model that combines a transformer and a CNN-based spectral-spatial attention mechanism. This model can separately extract the local and global features of HSI. The experimental results show that the model combining transformer and CNN performs competitively with CNN-based models. The model first uses the spectral-spatial attention mechanism to extract local features and reduce the impact of redundant information on classification, then converts neighborhood pixel blocks into sequences and extracts global features through the transformer part of the model. Finally, classification is performed through a fully connected layer.

In the experiments, we first analyzed the influence of batch size, spatial size, training samples, and the number of heads of the multihead self-attention mechanism on classification accuracy. Next, we compared the experimental results of our proposed model with the other five models on three public datasets. The experimental results show that, compared to several models based entirely on CNN, the model combining CNN and transformer also achieves competitive accuracy. The experiments also show that it is feasible to use transformers for HSI classification.

Future research should focus more on efficient transformer encoder blocks and attention mechanisms for processing HSI information. By combining the local and global features of HSI more effectively, the accuracy of HSI classification models can be further improved, and a more effective HSI classifier can be constructed.

Data Availability

The data that support the findings of this study are openly available in Hyperspectral Remote Sensing Scenes at https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grants nos. 62176087 and 41801310) and the Technology Development Plan Project of Henan Province, China (Grants nos. 202102210160, 202102110121, and 202102210368).