Abstract

It has become common to employ multiple features in the classification of images acquired by hyperspectral remote sensing sensors, since more features provide more information and often have complementary properties. However, few studies have discussed strategies for combining multiple feature groups. This study investigates the problem systematically. We extracted different groups of features from the initial hyperspectral images and tested different combination scenarios, integrating spectral features with different textural features and employing different dimensionality reduction algorithms. Experimental results on three widely used hyperspectral remote sensing images suggest that “dimensionality reduction before combination” performs better, especially when the textural features themselves perform well. The study further compares different combination frameworks for multiple feature groups, including direct combination, manifold learning, and the multiple kernel method. The experimental results demonstrate the effectiveness of direct combination with an automatic weight calculation.

1. Introduction

The analysis of hyperspectral images has received increasing attention in recent years. In the classification problem, it is widely accepted that features from different views help to recognize objects better, and it is common to extract multiple features before the classification procedure. Researchers have therefore made great efforts in several directions. First, for a long time, people have been striving to extract desirable and suitable features from hyperspectral remote sensing images for better representation [1–6]. It has been recognized that linear features are less effective than nonlinear features [7], although obtaining nonlinear features reduces efficiency in many cases, and some efforts have been made to combine both types of features simultaneously [8]. Second, more recently, many researchers have successfully designed different frameworks [9–12] to organize different types of features, such as texture features, shape features, and spectral features, because views from different feature spaces have particular statistical properties [13]. Finally, great efforts have been made in the classification process itself. Classifiers based on kernels [14] have dealt well with the Hughes phenomenon [15], which has led to a trend towards multiple kernel learning (MKL) [16–18], in which each group of features has its own kernel matrix and a composite kernel is finally produced.

In general, the above approaches combine features extracted from particular bands or features (e.g., the top principal component) with the initial hyperspectral bands and then feed the different groups of features into dimensionality reduction (DR) frameworks or classifiers [19, 20]. However, the complementary properties of multiple features have not been widely considered or analyzed. Few approaches have taken global features, which can be transformed or converted from the complete set of hyperspectral bands, as a group of input features. More frequently, approaches such as principal component analysis (PCA) [21], linear discriminant analysis (LDA) [22], isometric feature mapping (ISOMAP) [23], and Laplacian eigenmaps (LE) [24] have only been exploited as global dimensionality reduction techniques, and their low-dimensional outputs have reduced the contribution of each group of features [9, 10].

In this paper, we first conducted a systematic study of two schemes for hyperspectral image classification based on multiple features, utilizing different DR tools (linear and nonlinear) and different types of features. One scheme combines spectral features with other features before the DR process; the other reduces the dimensionality of the original hyperspectral features before combining them with the other extracted features. Based on the experimental results on three hyperspectral datasets, we suggest that different decisions should be made in different circumstances. Building on this, we further compared different combination frameworks for hyperspectral remote sensing image classification. We selected complementary features, including linear and nonlinear global features (in this paper, features converted from all the bands are referred to as “global features”) and two kinds of textural features (extracted from certain bands or layers). Three combination frameworks were tested on three frequently used hyperspectral datasets, comprising two scenes collected by the airborne visible/infrared imaging spectrometer (AVIRIS) over the Indian Pine region and Salinas Valley, and one scene collected by the reflective optics system imaging spectrometer (ROSIS) over Pavia University.

The remainder of the paper is organized as follows. Section 2 describes the three hyperspectral remote sensing datasets employed in the experiments and the proposed method. The experimental results are then reported in detail in Section 3, including the comparison of the two classification schemes based on multiple features and the evaluation of the combination frameworks from both accuracy and visual perspectives; discussions based on the classification results are also included in that section. Finally, a general summary of the paper is presented in Section 4.

2. Methodology

2.1. Multiple Feature Extraction
2.1.1. Dimensionality Reduction of Spectral Features

In this paper, we consider features derived from all the spectral bands as global features (GF). Generally, GF fall into two categories. The first is linear features, which are commonly derived from the original spectral bands by multiplying them by a transformation matrix; among these, PCA is a conventional linear transformation that requires no class label information and has high time efficiency. The second is nonlinear features, including manifold learning features and kernel features [25, 26]. More and more investigators have focused on exploiting nonlinear features in classification problems because linear features do not account for the underlying nonlinear class boundaries [27]. However, linear and nonlinear global features have rarely been combined in previous studies, and the complementary properties of the two types of features have not been widely discussed.

Representative manifold learning algorithms for DR comprise locally linear embedding (LLE) [28], ISOMAP, Laplacian eigenmaps (LE), and local tangent space alignment (LTSA) [29]. This paper extracts ISOMAP and LE features from the hyperspectral datasets.
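To make the global feature extraction concrete, the following is a minimal sketch of how the three kinds of GF used later (10 PCs plus 10 ISOMAP or LE features) could be computed with scikit-learn. The estimator choices, the neighborhood size, and the pixel-wise reshaping are illustrative assumptions rather than the exact implementation used in our experiments.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, SpectralEmbedding

def extract_global_features(cube, n_components=10, n_neighbors=12):
    """Global features from a hyperspectral cube of shape (rows, cols, bands).

    Returns three (rows, cols, n_components) arrays: PCA, ISOMAP, and LE
    (Laplacian eigenmaps, here via SpectralEmbedding).  Running manifold
    learning on every pixel is expensive; in practice a subset of pixels or
    a landmark/out-of-sample variant is typically used instead.
    """
    rows, cols, bands = cube.shape
    X = cube.reshape(-1, bands).astype(np.float64)

    pc = PCA(n_components=n_components).fit_transform(X)          # linear GF
    iso = Isomap(n_components=n_components,
                 n_neighbors=n_neighbors).fit_transform(X)         # nonlinear GF
    le = SpectralEmbedding(n_components=n_components,
                           n_neighbors=n_neighbors).fit_transform(X)

    reshape = lambda F: F.reshape(rows, cols, n_components)
    return reshape(pc), reshape(iso), reshape(le)
```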

2.1.2. Textural Feature Extraction

In this paper, we take textural features derived from certain spectral bands or features as TF (textural features). Many approaches have supplemented spectral features with TF because textural features complement them from a different perspective and provide some detailed spatial information. Two frequently used types of textural features are exploited in our scheme: filter-based Gabor features and statistical GLCM features.

(1) 2D Gabor Textural Feature. We follow the procedure in [30]. A Gabor function is defined as
$$\psi_{u,v}(z) = \frac{\|k_{u,v}\|^2}{\sigma^2} \exp\!\left(-\frac{\|k_{u,v}\|^2 \|z\|^2}{2\sigma^2}\right)\left[\exp\!\left(i\,k_{u,v}\cdot z\right) - \exp\!\left(-\frac{\sigma^2}{2}\right)\right],$$
where $z$ is the image location in the spatial domain and the frequency vector $k_{u,v}$ determines the scales and directions of the Gabor functions. It is defined as
$$k_{u,v} = k_v e^{i\phi_u}, \qquad k_v = \frac{k_{\max}}{f^{\,v}}, \qquad \phi_u = \frac{\pi u}{8}.$$

In our experiment, the parameter $f$ is fixed to 2. The scale parameter $v$ ranges from 0 to 3 and the direction parameter $u$ ranges from 0 to 7, corresponding to 4 scales and 8 directions; $u$ and $v$ are both integers. The parameter $\sigma$, which controls the number of oscillations under the Gaussian envelope, is fixed to a constant. According to [10], the textural images derived from the Gabor filters are the real part of the convolution of the image with the filters for different $u$ and $v$.
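As an illustration, the sketch below builds a 4-scale, 8-direction Gabor filter bank with scikit-image and keeps the real part of each response, as described above. The frequency schedule `base_freq / f**v` and the base frequency value are illustrative assumptions, not the exact parameterization of [30].

```python
import numpy as np
from skimage.filters import gabor

def gabor_features(band, n_scales=4, n_directions=8, base_freq=0.25, f=2.0):
    """Real parts of a Gabor filter bank applied to one 2-D image band.

    Returns an array of shape (rows, cols, n_scales * n_directions), i.e.
    32 textural images for the default 4 scales and 8 directions.
    """
    features = []
    for v in range(n_scales):                      # scale index v
        frequency = base_freq / (f ** v)           # assumed frequency schedule
        for u in range(n_directions):              # direction index u
            theta = np.pi * u / n_directions
            real, _imag = gabor(band, frequency=frequency, theta=theta)
            features.append(real)                  # keep the real part only
    return np.stack(features, axis=-1)
```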

(2) GLCM Textural Feature. The gray-level cooccurrence matrix (GLCM) [31] textural feature is a widely used statistical feature. Given a certain distance and direction, a gray-level cooccurrence matrix is built by calculating the probability of co-occurrence of two gray levels at that offset from each pixel. Various features can be computed from the GLCM; we extract 8 of them for the combination in the experiment, namely the mean, variance, homogeneity, contrast, dissimilarity, entropy, second moment, and correlation (details of the calculation can be found in [31]). In the experiment, the grayscale quantization level is fixed to 64 and the processing window is 3 × 3.
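A minimal sketch of the per-pixel GLCM feature computation (64 gray levels, 3 × 3 window, the 8 statistics listed above) using scikit-image is given below. The single distance/angle pair and the pixel-by-pixel loop are simplifying assumptions; an efficient implementation would vectorize the window processing and may average over several directions.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(band, levels=64, win=3, distance=1, angle=0.0):
    """Eight GLCM statistics per pixel from a sliding window (illustrative sketch)."""
    # Quantize the band to `levels` gray levels
    q = np.floor((band - band.min()) / (np.ptp(band) + 1e-12) * (levels - 1)).astype(np.uint8)
    pad = win // 2
    padded = np.pad(q, pad, mode='edge')
    out = np.zeros(band.shape + (8,), dtype=np.float64)
    gi, _gj = np.meshgrid(np.arange(levels), np.arange(levels), indexing='ij')
    for r in range(band.shape[0]):
        for c in range(band.shape[1]):
            patch = padded[r:r + win, c:c + win]
            P4 = graycomatrix(patch, [distance], [angle], levels=levels,
                              symmetric=True, normed=True)
            P = P4[:, :, 0, 0]
            mean = np.sum(gi * P)                          # GLCM mean
            var = np.sum((gi - mean) ** 2 * P)             # GLCM variance
            ent = -np.sum(P[P > 0] * np.log(P[P > 0]))     # entropy
            out[r, c] = [mean, var,
                         graycoprops(P4, 'homogeneity')[0, 0],
                         graycoprops(P4, 'contrast')[0, 0],
                         graycoprops(P4, 'dissimilarity')[0, 0],
                         ent,
                         graycoprops(P4, 'ASM')[0, 0],      # second moment
                         graycoprops(P4, 'correlation')[0, 0]]
    return out  # shape: (rows, cols, 8)
```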

2.2. Combination Scenarios
2.2.1. Weight Estimation by Average Distance Measurement

Different types of features make different contributions in different classification problems. The paper automatically estimates the weights of the different feature groups based on an average distance measurement in Euclidean space. The procedure is as follows:
(1) Concatenate and normalize the different groups of features.
(2) Calculate the mean values and variances of all classes for each feature from the sample set and compute the standardized distance between each pair of classes for each feature. The standardized distance is calculated as
$$d_{ij} = \frac{|\mu_i - \mu_j|}{\sigma_i + \sigma_j},$$
where $d_{ij}$ is the standardized distance for the feature, $\mu_i$ and $\mu_j$ are the mean values of class $i$ and class $j$, and $\sigma_i$ and $\sigma_j$ are the standard deviations of class $i$ and class $j$. According to the standardized distances, if a set of selected features yields the smallest distances within classes and the largest distances between classes, the best classification result is likely to be obtained. We then extend the measure to multiple classes and multiple feature groups.
(3) Calculate the standardized distances between each pair of classes for each feature and allocate the weight multipliers to the different feature groups. A sample is expressed as $x = [x_1, x_2, \ldots, x_G]$, where $x_g$ is the $g$-th group of features and $G$ is the number of feature groups ($G = 4$: PC, LE or ISOMAP, Gabor, and GLCM). We define $w_g$ as the weight multiplier of the $g$-th group, calculated by
$$w_g = \frac{D_g}{\frac{1}{G}\sum_{h=1}^{G} D_h}, \qquad D_g = \frac{2}{C(C-1)} \sum_{i<j} \bar{d}_g^{\,(i,j)},$$
in which $\bar{d}_g^{\,(i,j)}$ is the average standardized distance of the features in the $g$-th group between classes $i$ and $j$, and $C$ is the number of classes.
(4) Renew the representation of the sample as $x' = [w_1 x_1, w_2 x_2, \ldots, w_G x_G]$.

The advantages of the weight estimation method are as follows:
(i) Each type of feature has its own weight, which preserves the specific properties of the different features.
(ii) The weight multipliers are around 1, so the method does not strongly influence the normalized values of the features.
(iii) The method has high time efficiency because no iterative procedure is required.
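The weight estimation procedure above can be sketched as follows. The normalization by the mean group score, which keeps the multipliers around 1, and the group boundaries in the usage example are assumptions consistent with the text rather than the exact implementation.

```python
import numpy as np

def group_weights(X, y, group_slices):
    """Estimate one weight multiplier per feature group from standardized
    class-pair distances.  `group_slices` maps group name -> column slice."""
    classes = np.unique(y)
    scores = {}
    for name, sl in group_slices.items():
        Xg = X[:, sl]
        dists = []
        for a in range(len(classes)):
            for b in range(a + 1, len(classes)):
                Xa, Xb = Xg[y == classes[a]], Xg[y == classes[b]]
                d = np.abs(Xa.mean(0) - Xb.mean(0)) / (Xa.std(0) + Xb.std(0) + 1e-12)
                dists.append(d.mean())            # average over features in the group
        scores[name] = np.mean(dists)             # average over class pairs
    mean_score = np.mean(list(scores.values()))
    return {name: s / mean_score for name, s in scores.items()}  # multipliers ~ 1

# Example usage (hypothetical column layout for the 60-feature stack):
# slices = {'PC': slice(0, 10), 'ISOMAP': slice(10, 20),
#           'Gabor': slice(20, 52), 'GLCM': slice(52, 60)}
# weights = group_weights(X_train, y_train, slices)
# for name, sl in slices.items():
#     X[:, sl] *= weights[name]
```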

2.2.2. Combination and Classification

As discussed above, we have obtained several global features and two kinds of textural features. For GF, we select the first 10 PCs, 10 ISOMAP features, and 10 LE features; for TF, 32 Gabor features (4 scales combined with 8 directions) and 8 GLCM features are extracted from the top PCA component. The two kinds of textural features have been recognized to have complementary properties [32]. For GF, 10 features not only avoid a large amount of calculation but also provide sufficient information; for TF, the 8 GLCM features are employed as frequently as the Gabor features. Although different studies have selected different numbers of scales and directions for Gabor features, we use relatively few features to increase the efficiency of the calculation procedure and do not analyze the effects of the different parameters in detail in this paper. For the first two datasets, ISOMAP features are employed as the only manifold-learning-based features, while for the third dataset we utilize LE features instead of ISOMAP. As described above, the features thus comprise 4 different types: 10 linear GF (PC), 10 nonlinear GF (ISOMAP or LE), 32 filter-based TF (2D Gabor), and 8 statistical TF (GLCM). The features are normalized before being input into the following approaches.
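For clarity, a small sketch of the final 60-dimensional feature stack is given below; the [0, 1] min-max scaling is one plausible reading of “normalized” and is an assumption.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def stack_features(pc, manifold, gabor, glcm):
    """Stack the four feature groups (10 + 10 + 32 + 8 = 60 per pixel) and
    scale each column to [0, 1] before combination and classification."""
    rows, cols = pc.shape[:2]
    X = np.concatenate([g.reshape(rows * cols, -1)
                        for g in (pc, manifold, gabor, glcm)], axis=1)
    return MinMaxScaler().fit_transform(X)   # shape: (rows * cols, 60)
```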

The paper considers three scenarios for combining multiple features.

(i) Scenario 1: direct combination. We directly concatenate the different types of feature vectors, such as GF and TF, to form a longer feature vector.

(ii) Scenario 2: dimensionality reduction framework. The state-of-the-art dimensionality reduction algorithm LE has been widely discussed and utilized in many studies [33]; we adopt the method reported in [9].

(iii) Scenario 3: multiple kernel method. Existing multiple kernel learning algorithms calculate the weight factors through an iterative process. To avoid this complex iterative calculation, we estimate the weights of the different features before classification using the distance measurement described above. The basis kernels are multiplied by their relevant weight factors and summed, which converts the problem into a simple single-kernel classifier.

Once the combinations are decided, the SVM classifier [34] is employed to test these features, with its parameters confirmed through cross-validation on the training samples [35].
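The following sketch illustrates Scenario 3 as described above: per-group RBF basis kernels are weighted by the distance-based multipliers and summed into a single precomputed kernel for a standard SVM. The shared RBF kernel and its gamma value are illustrative assumptions; in practice the kernel and SVM parameters would be confirmed by cross-validation on the training samples.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

def composite_kernel(XA, XB, group_slices, weights, gamma=1.0):
    """Weighted sum of per-group RBF basis kernels (single shared gamma assumed)."""
    K = np.zeros((XA.shape[0], XB.shape[0]))
    for name, sl in group_slices.items():
        K += weights[name] * rbf_kernel(XA[:, sl], XB[:, sl], gamma=gamma)
    return K

# Usage with a precomputed kernel (X_train, y_train, X_test, slices, weights
# are assumed to exist from the previous steps):
# K_train = composite_kernel(X_train, X_train, slices, weights)
# K_test  = composite_kernel(X_test,  X_train, slices, weights)
# clf = SVC(kernel='precomputed', C=100).fit(K_train, y_train)
# y_pred = clf.predict(K_test)
```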

3. Experimental Section

3.1. Hyperspectral Image Data

Three commonly used datasets were tested in the experiments. All of them were acquired by hyperspectral sensors with different spatial resolutions. Researchers have long been trying to improve the classification performance on these scenes. In this paper, experiments are carried out on these scenes under similar experimental conditions.

3.1.1. Indian Pine Scene

The Indian Pine (IP) dataset, acquired by the airborne visible/infrared imaging spectrometer (AVIRIS), is one of the most commonly used hyperspectral test images. The spatial resolution is 30 m, and the image size is 145 × 145 pixels. The sensor provides 220 bands, of which 62 are discarded due to water absorption or noise, leaving 158 valid bands for this area. The dataset mainly covers agricultural land, with 10171 labeled pixels divided into 12 classes. In the experiment, 5% of the labeled pixels of each class are used as training samples. Table 1 lists the detailed class information and the number of samples, and Figure 1 shows the image and the labeled ground truth.

3.1.2. Salinas Scene

The Salinas (SL) dataset (Figure 2) was also acquired by AVIRIS. 204 valid bands are selected from the total of 224 bands. The spatial resolution is 3.7 m, and the image size is 512 × 217 pixels, of which 54129 pixels are labeled. 1% of the labeled pixels of each class are used as training samples. The dataset is divided into 16 classes, with details listed in Table 2.

3.1.3. Pavia University Scene

The Pavia University (PU) dataset (Figure 3) was acquired by ROSIS over Pavia University, Italy. The spatial resolution is 1.3 m, the highest among the three datasets. The image size is 610 × 340 pixels, comprising 207400 pixels in total. 113 valid bands are selected from the total of 115 bands, with 2 noisy bands removed. Unlike the former scenes, PU mainly covers man-made land cover. 2% of the 42776 labeled pixels of each class are used as training samples. The land cover details are listed in Table 3.

3.2. Research on Different Dimensionality Reduction Scenarios

A systematic study is made of two DR scenarios for hyperspectral image classification based on multiple features, employing different DR tools (linear and nonlinear) and different types of textural features. One scenario is the conventional procedure, in which the hyperspectral bands are combined with other features before the dimensionality is reduced; the other reduces the dimensionality of the original hyperspectral bands before combining them with the other extracted features. In addition, classification using only spectral features and only textural features is included to aid comparison and analysis. The resulting four scenarios are listed in Table 4. In scenario 3, we search for the best output dimension between 5 and 45 with an interval of 5 for each dataset. In scenario 4, the first 10 features are selected after the DR process.

We repeat ten independent experiments for each case. In each trial, the training samples are randomly selected from all the labeled pixels, and the selection is stratified by class. We report the average overall accuracy, the kappa index, and the best output dimension for the DR scheme in scenario 3. The parameters are optimized on the training samples.
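The evaluation protocol (ten stratified random trials, overall accuracy and kappa averaged over the trials, parameters tuned on the training samples) can be sketched as follows; the SVM parameter grid shown is an illustrative assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, cohen_kappa_score

def evaluate(X, y, train_fraction=0.05, n_trials=10, seed=0):
    """Mean overall accuracy and kappa over repeated stratified random splits."""
    oas, kappas = [], []
    for t in range(n_trials):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=train_fraction, stratify=y, random_state=seed + t)
        grid = {'C': [1, 10, 100, 1000], 'gamma': [0.01, 0.1, 1]}   # assumed grid
        clf = GridSearchCV(SVC(kernel='rbf'), grid, cv=5).fit(X_tr, y_tr)
        y_pred = clf.predict(X_te)
        oas.append(accuracy_score(y_te, y_pred))
        kappas.append(cohen_kappa_score(y_te, y_pred))
    return np.mean(oas), np.mean(kappas)
```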

Three datasets (IP, SL, and PU) are tested, and the performance is reported in Table 5. For each method, the reported output dimension corresponds to the best performance in scenario 3. Table 5 shows that the classification accuracies in scenario 4 are generally higher than those in scenario 3, especially when the textural features perform well by themselves. Even when the textural features do not perform well, scenario 3 is not always superior to scenario 4. The reason can be explored with reference to scenarios 1 and 2. When the textural features outperform the spectral features and are fewer in number, scenario 4 clearly outperforms scenario 3: the group with more features tends to dominate a global DR transformation applied after feature combination, regardless of whether the DR tool is linear or nonlinear. As a result, a higher accuracy is obtained in scenario 4, owing to the good performance and sufficient number of the textural features, such as the Gabor features. Conversely, the initial hyperspectral features, which are both more numerous and weaker, pull down the accuracy in scenario 3. However, when the spectral features outperform the textural features, the accuracies in scenario 3 are close or superior to those in scenario 4, as shown by the “PC-GLCM” and “LE-GLCM” methods in Table 5. In addition, Table 5 shows that, regardless of whether the DR algorithm is linear or nonlinear, it is not easy to determine an appropriate output dimension empirically during the DR procedure.

With the development of textural feature extraction techniques, the textural features researchers exploit often outperform the initial hyperspectral features while using fewer dimensions. It is therefore appropriate to reduce the dimensionality of the hyperspectral features before combining them with textural or other features. In general, we cannot exactly predict the performance of the different groups of features, so we simply reduce the dimensionality of the hyperspectral features to a certain extent and select a moderate number of low-dimensional representations.

3.3. Combination Frameworks

In the experiment, we design 5 feature selection scenarios for comparison: hyperspectral bands only, GF only, TF only, GF combined with TF, and GF combined with TF and the hyperspectral bands. The results of the 5 groups of features under the 3 combination strategies are listed. In addition, for the IP scene, details of the DR framework (with dimensions of no more than 60) are presented; for the SL scene, the classification accuracies with different groups of features are shown in an intuitive way; for the PU scene, we investigate the complementary properties of the different feature groups. The complementary properties of linear and nonlinear features are also discussed. Finally, classification results with and without weight estimation are compared in terms of accuracy.

10 independent experiments are repeated for each case. In each trial, the training samples are randomly selected from all the labeled pixels, and the selection is stratified by class. The overall accuracy (OA) is averaged over the ten trials. We also report the average kappa index, the execution times, and the best output dimension for the DR scheme in scenario 3. The parameters are again optimized on the training samples.

3.3.1. IP Scenes

Figures 4–6 present the classification results based on the different groups of features under the 3 combination strategies; the accuracies are given in Table 6. In each figure, GF + TF performs best from both the accuracy and visual perspectives. Figures 4(a), 5(a), and 6(a) yield the most misclassifications, while Figures 4(b), 5(b), and 6(b) show better results with fewer features. Figures 4(c), 5(c), and 6(c) are much better than Figures 4(b), 5(b), and 6(b), owing to the better textural features. With the integration of GF and TF, Figures 4(d), 5(d), and 6(d) show the fewest classification errors. Notably, there is no need to add the initial hyperspectral features to the proposed method, because the accuracy decreases when the spectral features are added, as seen in Figures 4(e), 5(e), and 6(e). As a result, 60 input features yield the best result. It can also be observed from Table 6 that the nonlinear LE dimensionality reduction tool takes more time when dealing with the combined features; comparing the three organization schemes, the overall computation of the nonlinear DR procedure exceeds that of the classification procedure when the input dimension is relatively high. We also find that Figure 4 outperforms Figures 5 and 6, so neither the dimensionality reduction framework nor the multiple kernel method gives rise to a better classification result than the direct combination. In addition, it is not easy to find a desirable dimension in the dimensionality reduction process (Figure 7).

3.3.2. SL Scenes

The SL dataset has more bands and a higher resolution. Extremely high accuracy is obtained with the proposed scenario, and patterns similar to those of the IP dataset can be found. The results are shown in Figures 8–10, and Table 7 lists the accuracies of the different scenarios. Among the 16 classes, class 8 and class 15 are the most challenging to distinguish, whereas the GF + TF strategy separates them with nearly perfect accuracy.

However, no matter what kind of combination strategy is employed, the results do not vary much for a fixed group of features and remain within a certain range, whereas the accuracies change more obviously and regularly with the variation of the input features (Figure 11). This suggests that the selection of features is more important than the combination approach in hyperspectral image classification. In addition, we can conclude that the proposed strategy applies to different combination schemes.

3.3.3. PU Scenes

Among the three datasets, PU has the highest resolution and the most pixels. Table 8 lists the accuracies of the different scenarios. In this high-resolution dataset, the complementary properties of GF and TF are more apparent. For example, in Figures 12, 13, 14(a), and 14(b), with GF the misclassification occurs frequently between class 2 (grass) and class 6 (bare soil), while TF discriminates this class pair well; conversely, class 1 (asphalt road) and class 4 (tree) are challenging classes for TF because of their close spatial locations, but GF or spectral features separate these two classes well. Figure 15 gives a direct view of the complementary properties of the different feature groups.

As discussed above, different groups of features have their own specific properties, and different feature extraction algorithms yield different numbers of features, so it seems necessary to assign weight factors to the different groups of features under different circumstances. The improvement brought by the autoweighting method in the proposed GF + TF strategy can be seen in Table 9.

3.4. Discussions

We now discuss the experimental results on the three hyperspectral datasets.

The experiments with the different DR schemes can be summarized as follows. Combining textural features with low-dimensional spectral features proves to be the more appropriate strategy according to the experiments in Section 3.2, especially when the textural features perform well and are fewer in number; when the textural features do not work well, this is not always the case. However, great efforts have been made to extract favorable textural features for hyperspectral image representation, so in most circumstances the selected textural features are empirically superior. As a result, scenario 4 in Table 4 is recommended.

The experiments with the combination scenarios can be summarized as follows. For all the datasets, the GF + TF method performs best from both the accuracy and visual perspectives; the “global features” and “textural features” show clear complementary properties in the classification results. The GF + TF combination with only 60 input features works well without incorporating the original hyperspectral bands, and the weight estimation method proves effective when dealing with multiple features. In addition, the number of input features of the GF + TF strategy is independent of the number of bands of the images and of the type of sensor.

The experiments with the different feature combination frameworks can be summarized as follows. The results yielded by the different combination strategies do not vary much once the input features are fixed; compared with the “framework” strategy, the “direct combination” strategy with only the 60 GF + TF input features not only performs better but also avoids a large amount of computation and a large number of dimensions; the multiple kernel method and the nonlinear DR framework sometimes lead to desirable results, but the best output dimension is hard to determine, which is reflected both in our experiments and in other studies.

From the experimental results and the discussions above, we can conclude that which features are combined influences the classification results more than the strategy used to integrate the multiple features. In addition, multiple features with complementary properties lead to good classification results. In this paper, we exploit 4 different types of features for combination, namely linear and nonlinear “global features” and filter-based and statistics-based “textural features,” which ensures complementary and diverse properties.

Pattern recognition has been applied in many fields, and hyperspectral remote sensing image classification is a particular application with its own characteristics. Compared with image recognition (as in [9, 13]) and the medical field (such as gene and protein classification [7]), remote sensing image classification problems have a relatively lower dimension, or at least a smaller d/n ratio (where d denotes the dimension and n the number of labeled samples). According to the overall statistics in [15], when d is large relative to n (as in some cases of body, face, or object image recognition and gene classification), the accuracy appears low (Figure 16); in this case, a DR algorithm helps to improve both the results and the classification efficiency. However, when d is small relative to n, the accuracy is already fairly high, and a DR framework does not lead to evidently better results. In addition, it is not easy to find the best output dimension in different cases, according to our experimental results and Figure 7; in fact, researchers have not found an effective way of determining the most desirable output dimension. The latter case (a smaller d/n) is likely the common situation for hyperspectral remote sensing image classification at the current stage of development (Figure 16). As hyperspectral sensors develop, the number of bands may increase, and thus the area of the red frame in Figure 16 may grow. However, the proposed GF + TF strategy should remain practical because it is independent of the number of sensor bands.

4. Conclusions

Hyperspectral sensors provide more spectral detail; however, challenges arise along with the advantages. One is the high-dimensionality problem, and another is the question of how to exploit the multiple features that can be extracted from the hyperspectral images. In this paper, a systematic study is made to find an appropriate strategy for classification problems with multiple features, and we further exploit complementary features to improve hyperspectral image classification performance. Experiments on 3 hyperspectral datasets suggest that the GF + TF scheme is effective. The main contributions can be summarized as follows. First, based on the experiments, we suggest that DR algorithms work better as feature extraction methods than as mere dimensionality reducers. The paper further selects features from a different perspective on the multiview problem: previous work has proposed feature combinations characterized as “spectral and nonspectral” or, more recently, “linear and nonlinear,” whereas we present a “global and nonglobal” strategy and take linear and nonlinear global features (such as PC, ISOMAP, and LE) as a portion of the features for the first time. We also conclude that features of different types are more likely to have complementary properties. Second, the experimental results show that feature selection is more important than how the multiple features are organized in hyperspectral image classification problems. Third, a systematic study has been made of the combination frameworks for multiple features, including direct combination, manifold learning, and the multiple kernel method. We find that complex methods such as the manifold framework and the multiple kernel method do not increase the accuracy; instead, the direct combination strategy with an autoweight calculation performs best. Finally, we have compared hyperspectral image classification with other applications of pattern recognition and analyzed the characteristics of the former.

Building on these experiments, in future work we will continue to search for more complementary features to integrate in hyperspectral remote sensing image classification. For example, shape features have been widely developed recently but were not considered in this study. In addition, we will further examine the internal redundancy of each group of features.

Conflicts of Interest

The authors declare no conflicts of interest.

Authors’ Contributions

Yuntao Ma is the main author who proposed the basic idea, completed the experiments, and carefully revised this manuscript. Ruren Li and Guang Yang provided the useful suggestions on designing the approaches involved in our proposed strategy. Lishuang Sun and Jingli Wang helped to modify the manuscript.

Acknowledgments

The study was funded by the National Natural Science Foundation of China “Study on environmental impact mechanism of surface deformation monitoring in open pit mine and inversion methods under low coherence environment based on GB-SAR” (no. 51774204) and “TGRA Forest Cover and Dynamic Change Detection Based on Time Series Remote Sensing Images” (no. 2014QC018).