Journal of Sensors

Volume 2016 (2016), Article ID 3632943, 10 pages

http://dx.doi.org/10.1155/2016/3632943

## Stacked Denoise Autoencoder Based Feature Extraction and Classification for Hyperspectral Images

^{1}Faculty of Mechanical and Electronic Information, China University of Geosciences, Wuhan, Hubei 430074, China

^{2}Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China

Received 25 December 2014; Revised 11 May 2015; Accepted 21 June 2015

Academic Editor: Jonathan C.-W. Chan

Copyright © 2016 Chen Xing et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Deep learning methods have been successfully applied to learn feature representations for high-dimensional data, where the learned features are able to reveal the nonlinear properties exhibited in the data. In this paper, a deep learning method is exploited for feature extraction of hyperspectral data, and the extracted features provide good discriminability for the classification task. Training a deep network for feature extraction and classification includes unsupervised pretraining and supervised fine-tuning. We utilized the stacked denoise autoencoder (SDAE) method, which is robust to noise, to pretrain the network. In the top layer of the network, the logistic regression (LR) approach is utilized to perform supervised fine-tuning and classification. Since sparsity of features might improve the separation capability, we utilized the rectified linear unit (ReLU) as the activation function in SDAE to extract high-level and sparse features. Experimental results using Hyperion, AVIRIS, and ROSIS hyperspectral data demonstrated that SDAE pretraining in conjunction with LR fine-tuning and classification (SDAE_LR) can achieve higher accuracies than the popular support vector machine (SVM) classifier.

#### 1. Introduction

Hyperspectral remote sensing images are becoming increasingly available and potentially provide greatly improved discriminant capability for land cover classification. Popular classification methods like $k$-nearest-neighbor [1], support vector machine [2], and semisupervised classifiers [3] have been successfully applied to hyperspectral images. In addition, some feature matching methods from the computer vision area can also be generalized for spectral classification [4, 5].

Feature extraction is very important for classification of hyperspectral data, and the learned features may increase the separation between spectrally similar classes, resulting in improved classification performance. Commonly used linear feature extraction methods such as principal component analysis (PCA) and linear discriminant analysis (LDA) are simple and easily implemented. However, these methods fail to model the nonlinear structures of data. Manifold learning methods, which were proposed for nonlinear feature extraction, are able to characterize the nonlinear relationships between data points [1, 6, 7]. However, they can only process a limited number of data points due to their high computational complexity. Deep learning methods, which can also learn nonlinear features, are capable of processing large-scale data sets. Therefore, we utilized deep learning for feature extraction of hyperspectral data in this paper.

Deep learning trains a deep neural network for feature extraction and classification. The training process includes two steps: unsupervised layer-wise pretraining and supervised fine-tuning. The layer-wise pretraining [8] can alleviate the difficulty of training a deep network, since the learned network weights, which encode the data structure, are used as the initial weights of the whole deep network. The supervised fine-tuning, which is performed by the logistic regression (LR) approach, aims to further adjust the network weights by minimizing the classification errors of the labeled data points. Training the network yields both high-level features and a classifier simultaneously. Popular deep learning methods include autoencoders (AE) [9], denoise autoencoders (DAE) [10], convolutional neural networks (CNN) [11], deep belief networks (DBN) [12], and convolutional restricted Boltzmann machines (CRBM) [13]. In the field of hyperspectral data analysis, Chen utilized AE for data classification [14], and Zhang utilized CNN for feature extraction [15].

In this paper, we focus on the stacked DAE (SDAE) method [16], since DAE is robust to noise and SDAE can obtain higher level features. Moreover, since sparsity of features might improve the separation capability, we utilized the rectified linear unit (ReLU) as the activation function in SDAE to extract high-level and sparse features. After the layer-wise pretraining by SDAE, an LR layer is used for fine-tuning the network and performing classification. The features of the deep network that are obtained by SDAE pretraining and LR fine-tuning are called tuned-SDAE features, and the classification approach that applies the LR classifier to the tuned-SDAE features is hereafter denoted as SDAE_LR.

The organization of the paper is as follows. Section 2 describes the DAE, SDAE, and SDAE_LR approaches. Section 3 discusses the experimental results. Conclusions are summarized in Section 4.

#### 2. Methodology

Given a neural network, AE [14] trains the network by constraining the output values to be equal to the input values, which also means that the output layer has as many nodes as the input layer. The reconstruction error between the input and the output of the network is used to adjust the weights of each layer. Therefore, the features learned by AE can represent the input data well. Moreover, the training of AE is unsupervised, since it does not require label information. DAE is developed from AE but is more robust, since DAE assumes that the input data contain noise and is therefore suitable for learning features from noisy data. As a result, the generalization ability of DAE is better than that of AE. Moreover, DAEs can be stacked to obtain high-level features, resulting in the SDAE approach. The training of the SDAE network is layer-wise, since each DAE with one hidden layer is trained independently. After training the SDAE network, the decoding layers are removed and the encoding layers that produce features are retained. For the classification task, a logistic regression (LR) layer is added as the output layer. Moreover, LR is also used to fine-tune the network. Therefore, the features are learned by SDAE pretraining in conjunction with LR fine-tuning.

##### 2.1. Denoise Autoencoder (DAE)

DAE contains three layers: an input layer, a hidden layer, and an output layer, where the hidden layer and output layer are also called the encoding layer and decoding layer, respectively. Suppose the original data is $x \in \mathbb{R}^{d}$, where $d$ is the dimension of the data. DAE first produces a corrupted vector $\tilde{x}$ by setting some of the elements of $x$ to zero or by adding Gaussian noise to $x$. DAE uses $\tilde{x}$ as input data. The number of units in the input layer is $d$, which is equal to the dimension of the input data $\tilde{x}$. The encoding of DAE is obtained by a nonlinear transformation function:

$$y = f_e(W\tilde{x} + b), \tag{1}$$

where $y \in \mathbb{R}^{h}$ denotes the output of the hidden layer and can also be called the feature representation or code, $h$ is the number of units in the hidden layer, $W \in \mathbb{R}^{h \times d}$ is the input-to-hidden weight matrix, $b \in \mathbb{R}^{h}$ denotes the bias, $W\tilde{x} + b$ stands for the input of the hidden layer, and $f_e$ is called the activation function of the hidden layer. We chose the ReLU function [17] as the activation function in this study, which is formulated as

$$f_e(W\tilde{x} + b) = \max(0, W\tilde{x} + b). \tag{2}$$

If the value of $W\tilde{x} + b$ is smaller than zero, the output of the hidden layer will be zero. Therefore, the ReLU activation function is able to produce a sparse feature representation, which may have better separation capability. Moreover, ReLU can train the neural network for large-scale data faster and more effectively than other activation functions.
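The corruption and encoding steps can be illustrated with a minimal NumPy sketch. It assumes masking noise (zeroing elements at random) as the corruption; the names `corrupt`, `encode`, and `drop_prob`, as well as the layer sizes and random weights, are illustrative choices and do not come from the paper.

```python
import numpy as np

def corrupt(x, drop_prob, rng):
    """Masking noise: set each element of x to zero with probability drop_prob."""
    keep = rng.random(x.shape) >= drop_prob
    return x * keep

def encode(x_tilde, W, b):
    """ReLU encoding y = max(0, W x_tilde + b)."""
    return np.maximum(0.0, W @ x_tilde + b)

rng = np.random.default_rng(0)
d, h = 8, 5                                 # illustrative input and hidden sizes
x = rng.random(d)                           # stand-in for one spectral vector
W = rng.normal(scale=0.5, size=(h, d))      # hypothetical untrained weights
b = np.zeros(h)

x_tilde = corrupt(x, drop_prob=0.3, rng=rng)
y = encode(x_tilde, W, b)
print("corrupted input:", x_tilde)
print("fraction of zero features:", np.mean(y == 0.0))
```

Every hidden unit whose pre-activation is negative outputs exactly zero, which is the source of the sparsity discussed above.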

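The decoding or reconstruction of DAE is obtained by using a mapping function $f_d$:

$$z = f_d(W'y + b'), \tag{3}$$

where $z \in \mathbb{R}^{d}$ is the output of DAE, which is also the reconstruction of the original data $x$. The output layer has the same number of nodes as the input layer. $W' = W^{T}$ is referred to as tied weights. If $x$ ranges from 0 to 1, we choose the softplus function as the decoding function $f_d$; otherwise we preprocess $x$ by zero-phase component analysis (ZCA) whitening and use a linear function as the decoding function:

$$f_d(a) = \begin{cases} \log(1 + e^{a}), & x \in [0, 1], \\ a, & \text{otherwise}, \end{cases} \tag{4}$$

where $a = W'y + b'$. DAE aims to train the network by requiring the output data $z$ to reconstruct the input data $x$, which is also called reconstruction-oriented training. Therefore, the reconstruction error should be used as the objective function or cost function, which is defined as follows:

$$\text{Cost} = \begin{cases} -\dfrac{1}{m}\displaystyle\sum_{i=1}^{m}\sum_{j=1}^{d}\left[x_j^{(i)}\log z_j^{(i)} + \left(1 - x_j^{(i)}\right)\log\left(1 - z_j^{(i)}\right)\right] + \dfrac{\lambda}{2}\|W\|^{2}, & x \in [0, 1], \\ \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}\left\|x^{(i)} - z^{(i)}\right\|^{2} + \dfrac{\lambda}{2}\|W\|^{2}, & \text{otherwise}, \end{cases} \tag{5}$$

where the cross-entropy function is used when the input $x$ ranges from 0 to 1; the squared error function is used otherwise. $x_j^{(i)}$ denotes the $j$th element of the $i$th sample and $\|W\|^{2}$ is the L2-regularization term, which is also called the weight decay term. Parameter $\lambda$ controls the importance of the regularization term. This optimization problem is solved by using the minibatch stochastic gradient descent (MSGD) algorithm [18], and $m$ in (5) denotes the size of the minibatch.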
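As a rough illustration of reconstruction-oriented training, the sketch below runs minibatch SGD on the squared-error branch of the cost with a linear decoder and tied weights. It is a sketch under stated assumptions, not the authors' implementation: the data are synthetic and simply standardized (ZCA whitening is omitted), masking noise is used for corruption, and the function name `dae_cost_and_grads` and all hyperparameter values (`lam`, `lr`, `m`, layer sizes) are hypothetical.

```python
import numpy as np

def dae_cost_and_grads(X, X_tilde, W, b, b_prime, lam):
    """Squared-error DAE cost with tied weights, plus its gradients.

    X, X_tilde: (m, d) clean and corrupted minibatch; W: (h, d).
    """
    m = X.shape[0]
    A = X_tilde @ W.T + b              # hidden pre-activation, shape (m, h)
    Y = np.maximum(0.0, A)             # ReLU code
    Z = Y @ W + b_prime                # linear decoder with W' = W^T, shape (m, d)
    cost = np.sum((X - Z) ** 2) / m + 0.5 * lam * np.sum(W ** 2)

    dZ = 2.0 * (Z - X) / m
    dA = (dZ @ W.T) * (A > 0)          # backpropagate through the ReLU
    # W is shared by encoder and decoder, so its gradient has two parts.
    dW = Y.T @ dZ + dA.T @ X_tilde + lam * W
    return cost, dW, dA.sum(axis=0), dZ.sum(axis=0)

rng = np.random.default_rng(0)
n, d, h = 200, 6, 8                    # illustrative sizes
X_all = rng.normal(size=(n, 3)) @ rng.normal(size=(3, d))   # synthetic data
X_all = (X_all - X_all.mean(axis=0)) / X_all.std(axis=0)    # standardize

W = rng.normal(scale=0.1, size=(h, d))
b, b_prime = np.zeros(h), np.zeros(d)
lam, lr, m = 1e-4, 0.005, 20           # hypothetical hyperparameters

initial_cost = dae_cost_and_grads(X_all, X_all, W, b, b_prime, lam)[0]
for epoch in range(200):
    order = rng.permutation(n)
    for start in range(0, n, m):
        X = X_all[order[start:start + m]]
        X_tilde = X * (rng.random(X.shape) >= 0.2)   # masking noise
        _, dW, db, db_prime = dae_cost_and_grads(X, X_tilde, W, b, b_prime, lam)
        W -= lr * dW
        b -= lr * db
        b_prime -= lr * db_prime
final_cost = dae_cost_and_grads(X_all, X_all, W, b, b_prime, lam)[0]
print("cost before:", initial_cost, "after:", final_cost)
```

Note that the cost always compares the reconstruction with the clean minibatch `X`, while only the corrupted `X_tilde` is fed to the encoder; this is what distinguishes a DAE from a plain AE.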

##### 2.2. Stacked Denoise Autoencoder (SDAE)

DAEs can be stacked to build a deep network which has more than one hidden layer [16]. Figure 1 shows a typical instance of the SDAE structure, which includes two encoding layers and two decoding layers. In the encoding part, the output of the first encoding layer acts as the input data of the second encoding layer. Supposing there are $L$ hidden layers in the encoding part, we have the activation function of the $k$th encoding layer:

$$y^{(k+1)} = f_e\left(W^{(k+1)}y^{(k)} + b^{(k+1)}\right), \quad k = 0, \ldots, L - 1, \tag{6}$$

where the input $y^{(0)}$ is the original data $x$. The output $y^{(L)}$ of the last encoding layer is the high-level features extracted by the SDAE network. In the decoding part, the output of the first decoding layer is regarded as the input of the second decoding layer. The decoding function of the $k$th decoding layer is

$$z^{(k+1)} = f_d\left(W^{(L-k)T}z^{(k)} + b'^{(k+1)}\right), \quad k = 0, \ldots, L - 1, \tag{7}$$

where the input $z^{(0)}$ of the first decoding layer is the output $y^{(L)}$ of the last encoding layer. The output $z^{(L)}$ of the last decoding layer is the reconstruction of the original data $x$.
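The chained encoding and decoding passes can be sketched as follows for a network with two encoding and two decoding layers, as in the Figure 1 description. This is a forward pass only, with random untrained weights; layer-wise pretraining is omitted. The function names `sdae_encode` and `sdae_decode`, the layer sizes, the use of tied (transposed) weights in the decoder, and the default linear decoding function are assumptions made for illustration.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def sdae_encode(x, weights, biases):
    """Feed x through every encoding layer: y <- relu(W y + b)."""
    y = x
    for W, b in zip(weights, biases):
        y = relu(W @ y + b)
    return y

def sdae_decode(y, weights, dec_biases, f_d=lambda a: a):
    """Mirror the encoder with tied weights, starting from the last encoding layer."""
    z = y
    for W, b_prime in zip(reversed(weights), dec_biases):
        z = f_d(W.T @ z + b_prime)
    return z

rng = np.random.default_rng(0)
sizes = [10, 6, 3]                     # input dimension, then two hidden layers
weights = [rng.normal(scale=0.5, size=(sizes[k + 1], sizes[k])) for k in range(2)]
biases = [np.zeros(sizes[k + 1]) for k in range(2)]
dec_biases = [np.zeros(sizes[1]), np.zeros(sizes[0])]

x = rng.random(sizes[0])
features = sdae_encode(x, weights, biases)        # output of the last encoding layer
x_rec = sdae_decode(features, weights, dec_biases)
print(features.shape, x_rec.shape)
```

Once pretraining is finished, only `sdae_encode` would be kept for feature extraction, in line with the removal of the decoding layers described in Section 2.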