Abstract

Classification is one of the most challenging tasks in remotely sensed data processing, particularly for hyperspectral imaging (HSI). Dimension reduction is widely applied as a preprocessing step for classification; however, reducing the dimension with conventional methods does not always guarantee a high classification rate. Principal component analysis (PCA) and its nonlinear version, kernel PCA (KPCA), are traditional dimension reduction algorithms. In a previous work, a variant of KPCA, denoted Adaptive KPCA (A-KPCA), was suggested to obtain a robust unsupervised feature representation for HSI. That technique employs several KPCAs simultaneously, each built on a different candidate kernel, to obtain better features. Nevertheless, A-KPCA neglects the individual influence of the subkernels by employing an unweighted combination. Furthermore, if there is at least one weak kernel in the set of kernels, the classification performance may be reduced significantly. To address these problems, in this paper we propose an Ensemble Learning (EL) based multiple kernel PCA (M-KPCA) strategy. M-KPCA constructs a weighted combination of kernels with high discriminative ability from a predetermined set of base kernels and then extracts features in an unsupervised fashion. Experiments on two different AVIRIS hyperspectral data sets show that the proposed algorithm achieves satisfactory feature extraction performance on real data.

1. Introduction

Hyperspectral imaging (HSI) simultaneously provides spatial and high-resolution spectral data and helps to classify/recognize materials that are challenging to discriminate with conventional imaging techniques [1]. However, it suffers from the curse of dimensionality, which increases the cost of storage, transmission, and processing of hyperspectral images. To overcome such challenges, dimensionality reduction techniques have been applied to hyperspectral data in the existing literature [2]. In general, HSI exhibits spectral redundancy across many spectral channels. For this reason, dimension reduction or compression is possible and even necessary, especially for these bands.

Even though there are several dimension reduction approaches in the literature, including manifold learning [3, 4] and tensor-based methods [5], principal component analysis (PCA) [6] is among the most popular techniques [7–9]. PCA is the discrete form of the continuous Karhunen-Loève Transform; it projects the data onto a subspace so that the retained variance is maximized and the least-squares reconstruction error is minimized [10]. Using PCA for dimensionality reduction in HSI is a computationally convenient approach, and it helps preserve most of the variance of the raw data. Although PCA has some theoretical inadequacies [11, 12] for use on remote sensing data, particularly hyperspectral images [13], practical applications show that the results obtained using PCA are still competitive for classification purposes [14, 15]. The ability of PCA is limited for high-dimensional data since it relies only on second-order statistical information. The nonlinear version of PCA, denoted kernel PCA (KPCA), has been proposed to overcome these limitations [16].

Since KPCA involves higher-order statistics, it extracts more information from the original data [17], and it is therefore employed in many applications, including remote sensing, owing to its satisfactory performance. In [18], an artificial neural network fed with kernel principal components was demonstrated to outperform the classical approach. Fauvel et al. [19] showed that KPCA is better than classical PCA in terms of classification accuracy. A general overview of feature reduction techniques for classification of hyperspectral images is presented in [9], where comparative experiments are performed between unsupervised techniques, e.g., PCA and KPCA, and supervised techniques, e.g., double nearest proportion (DNP) [20] and kernel nonparametric weighted feature extraction (KNWFE) [21]. Since supervised learning techniques generally focus on improving class separability, these methods are expected to produce better results in terms of classification performance. Nevertheless, the comparative results with KNWFE indicate that PCA and KPCA are still preferable for reducing the dimensionality of hyperspectral images.

Fundamentally, KPCA is a version of PCA whose performance is greatly affected by the choice of kernel and its parameters. Namely, the selection of the optimal kernel and parameters is crucial for KPCA to achieve good performance. However, practical results show that no single kernel function is best for all kinds of machine learning problems [22]; therefore, learning optimal kernels over a kernel set is currently an active research area [23–27]. Li and Yang presented an ensemble KPCA method with a Bayesian inference strategy in [28]. They exploited only Gaussian radial basis function (RBF) kernels with different scale parameters as subkernels. Zhang et al. [29] developed a method for unsupervised kernel learning in KPCA, dubbed A-KPCA, and applied it to object recognition problems.

A-KPCA learns the kernels via an unsupervised learning approach. The 1D input vectors, e.g., feature vectors, are transformed into 2D feature matrices by different kernels; each column of a feature matrix is the image of the corresponding 1D input vector under one of the kernel mappings. Nonlinear feature extraction (FE) is obtained from one set of projective vectors corresponding to the column direction of the feature matrices, while the set of projective vectors corresponding to the row direction of the 2D feature matrices is simultaneously utilized for searching the optimal kernel combination. Despite having superior performance compared to KPCA, A-KPCA has some critical limitations. Specifically, A-KPCA works completely unsupervised, so it is incapable of enhancing class separability, and it has no kernel preselection process. These are the main motivations of our work.

In this paper, a novel framework is introduced for hyperspectral FE and classification based on multiple KPCA models with an Ensemble Learning (EL) strategy in a semisupervised manner. EL is the process of combining multiple models, called experts, to build a strong model for a specific machine learning problem [30]. Strong discriminative ability of the individual experts and high diversity among them are required to produce satisfactory models [31, 32]. An acceptable classification performance highly depends on the class separability of the features, which is directly related to discriminative ability. Inspired by EL, we extend the A-KPCA method by employing multiple kernels such that subkernels possessing higher discrimination ability are emphasized. The proposed approach, multiple kernel PCA (M-KPCA), learns an ensemble of multiple kernel principal components on an available labeled data set, and the final features are extracted via a weighted combination of all subkernels according to their separability performance. The initial purpose of this paper is to apply KPCA and A-KPCA to hyperspectral images and to determine the impact of using nonlinear versions of PCA on classification performance. The further contributions and novelties of this paper can be summarized as follows: (1) a novel multikernel PCA strategy is presented that exploits Ensemble Learning to evaluate and select the kernels; (2) M-KPCA achieves superior classification results compared to PCA, KPCA, and A-KPCA by emphasizing the subkernels through a class-separability-based weighting strategy; (3) M-KPCA produces classification performance that is better than or competitive with other popular unsupervised FE methods such as locality preserving projections (LPP) [33], random projections (RP) [34], and t-distributed stochastic neighbor embedding (t-SNE) [35]. After FE with all of the mentioned methods, the popular and robust support vector machine (SVM) classifier is used for supervised classification. Since SVMs consider samples close to the class boundary, called support vectors, they show great performance even on high-dimensional data with small training samples [36, 37].

The paper is organized as follows. Section 2 reviews the related work. In Section 3, the proposed M-KPCA framework is presented. Next, a series of experiments on real data sets is carried out in Section 4 to verify the effectiveness of our method. Finally, Section 5 concludes the paper.

2.1. KPCA Background

The raw data is projected into the feature space by a nonlinear mapping function, and the useful information is concentrated in the principal components corresponding to the larger eigenvalues [19]. Define a learning set as $X = \{x_i\}_{i=1}^{M}$, $x_i \in \mathbb{R}^{d}$. Let $\phi : \mathbb{R}^{d} \rightarrow \mathcal{F}$ be a nonlinear mapping from the input space to a high-dimensional feature space $\mathcal{F}$. The inner product in feature space is calculated by the kernel function in the original input space:

$$k(x_i, x_j) = \phi(x_i)^{T}\phi(x_j) \tag{1}$$

where the superscript $T$ represents the transpose operation. Denote $\Phi = [\phi(x_1), \ldots, \phi(x_M)]$ and $\bar{\phi} = \frac{1}{M}\sum_{i=1}^{M}\phi(x_i)$. Assuming $\bar{\phi} = 0$, i.e., data are centered in $\mathcal{F}$, the total scatter matrix can be defined as $S = \sum_{i=1}^{M}\phi(x_i)\phi(x_i)^{T} = \Phi\Phi^{T}$. To compute the projective vector $w$ for the optimal solution, KPCA maximizes the following norm:

$$\max_{\|w\|=1} \; \left\| w^{T}\Phi \right\|^{2} = w^{T}Sw. \tag{2}$$

Computation of the optimal projective vector provides the solution of the eigenvalue problem $Sw = \lambda w$, in which $\lambda \geq 0$ and the eigenvectors lie in the span of the mapped samples, i.e., $w = \Phi\alpha = \sum_{i=1}^{M}\alpha_i\phi(x_i)$. Hence, (2) can be rewritten as an equivalent problem:

$$K\alpha = \lambda\alpha \tag{3}$$

where $K = \Phi^{T}\Phi \in \mathbb{R}^{M \times M}$ is the kernel matrix with entries $K_{ij} = k(x_i, x_j)$. The solutions $\alpha$ of (3) correspond to the largest eigenvalues; then $w = \Phi\alpha$ is the solution vector of (2). Like any kernel method, KPCA-based FE does not require the nonlinear mapping explicitly; it only needs a kernel function evaluated in the input space. To obtain better performance with KPCA, the parameters of the kernel are optimized. However, this optimization cannot produce adequate solutions for every application or data set because of the nature of the kernel itself [22]. To overcome this drawback, an adaptive kernel combination technique is introduced in [29].
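To make the steps above concrete, a minimal Python/NumPy sketch of single-kernel KPCA is given below: it builds the kernel matrix, centers it in the feature space, solves the eigenvalue problem of (3), and projects new samples onto the leading kernel principal components. The RBF kernel and the function names are illustrative assumptions; this is not the toolbox implementation used in the experiments of Section 4.

import numpy as np

def rbf_kernel(A, B, sigma):
    # Gaussian RBF kernel matrix between the rows of A and the rows of B.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2 * sigma**2))

def kpca_fit(X, sigma, n_components):
    # Build and center the training kernel matrix, then solve (3).
    M = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    one = np.ones((M, M)) / M
    Kc = K - one @ K - K @ one + one @ K @ one          # centering in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)                # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:n_components]
    alphas = eigvecs[:, idx] / np.sqrt(np.maximum(eigvals[idx], 1e-12))  # unit-norm w
    return X, K, alphas

def kpca_transform(model, Xtest, sigma):
    # Project test samples onto the extracted kernel principal components.
    Xtrain, Ktrain, alphas = model
    M = Xtrain.shape[0]
    Kt = rbf_kernel(Xtest, Xtrain, sigma)
    one_M = np.ones((M, M)) / M
    one_t = np.ones((Xtest.shape[0], M)) / M
    Ktc = Kt - one_t @ Ktrain - Kt @ one_M + one_t @ Ktrain @ one_M
    return Ktc @ alphas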

2.2. Adaptive KPCA (A-KPCA)

As pointed out in Section 1, the performance of KPCA is notably affected by the selection of the kernel and its parameters. Therefore, it needs some extensions. Let $\{\phi_j\}_{j=1}^{m}$, $\phi_j : \mathbb{R}^{d} \rightarrow F_j$, be a set of nonlinear mappings. As mentioned in Section 2.1, the inner products in the $F_j$'s are described by the kernels $k_j(x, y) = \phi_j(x)^{T}\phi_j(y)$. Using the definition of the $\phi_j$'s,

$$\phi(x) = \left( \phi_1(x), \phi_2(x), \ldots, \phi_m(x) \right) \in H \tag{4}$$

can be written. In this equation, $H = F_1 \oplus F_2 \oplus \cdots \oplus F_m$ is the Hilbert space formed as the direct sum of the $F_j$'s, and the inner product in $H$ can be defined as

$$\left\langle \phi(x), \phi(y) \right\rangle_{H} = \sum_{j=1}^{m} \phi_j(x)^{T}\phi_j(y) = \sum_{j=1}^{m} k_j(x, y). \tag{5}$$

To construct a 2D feature matrix, a sample $x_i$ of the learning set is transformed to the high-dimensional feature space and then $\Phi(x_i) = [\phi_1(x_i), \phi_2(x_i), \ldots, \phi_m(x_i)]$ is obtained. Here, each column of $\Phi(x_i)$ corresponds to a nonlinear mapping generated by one of the $\phi_j$'s. Thus, vector-based data is converted to a matrix-based format. Assuming the $\phi_j$'s have zero means, i.e., $\sum_{i=1}^{M}\phi_j(x_i) = 0$ for $j = 1, \ldots, m$, the criterion

$$\max_{L,\,N} \; \sum_{i=1}^{M} \left\| L^{T}\Phi(x_i)N \right\|_{F}^{2} \tag{6}$$

can be written. Equation (6) includes the $M$ generated feature matrices. Appropriate $L$ and $N$ matrices must be determined to optimize (6), where $L$ contains the projective vectors corresponding to the columns of $\Phi(x_i)$ while $N$ contains those corresponding to its rows. The purpose of $L$ is to extract features, while the purpose of $N$ is kernel selection. In other words, unsupervised kernel learning and nonlinear FE are realized simultaneously through the projective vectors collected in $L$ and $N$. $\|\cdot\|_{F}$ is the Frobenius norm of a matrix, i.e., $\|A\|_{F}^{2} = \mathrm{tr}(A^{T}A)$, where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. Since the size of the original $L$ is very large, i.e., it lives in the high-dimensional space $H$, $L$ is expressed as a linear combination of the mapped training samples (the kernel trick), so that $L^{T}\Phi(x_i)$ can be computed as $L^{T}K(x_i)$ with $L$ now denoting an $M \times d_1$ coefficient matrix. These calculations allow us to rewrite (6) as (7):

$$\max_{L,\,N} \; \sum_{i=1}^{M} \left\| L^{T}K(x_i)N \right\|_{F}^{2} \tag{7}$$

where the constraints of (7) are $L^{T}L = I$ and $N^{T}N = I$. Here $K(x_i)$ is the $M \times m$ sized kernel matrix and it is constructed as follows:

$$K(x_i) = \begin{bmatrix} k_1(x_1, x_i) & k_2(x_1, x_i) & \cdots & k_m(x_1, x_i) \\ k_1(x_2, x_i) & k_2(x_2, x_i) & \cdots & k_m(x_2, x_i) \\ \vdots & \vdots & \ddots & \vdots \\ k_1(x_M, x_i) & k_2(x_M, x_i) & \cdots & k_m(x_M, x_i) \end{bmatrix}. \tag{8}$$

To solve this optimization problem, an iterative procedure inspired by Ye's work [38] is presented through the following theorem [29].

Theorem 1. Let $L$ and $N$ be the optimal solution to (7); then (i) the $d_1$ eigenvectors corresponding to the largest eigenvalues of the matrix $\sum_{i=1}^{M} K(x_i)\,N N^{T} K(x_i)^{T}$ form $L$ for a given $N$; (ii) the $d_2$ eigenvectors corresponding to the largest eigenvalues of the matrix $\sum_{i=1}^{M} K(x_i)^{T} L L^{T} K(x_i)$ create $N$ for a given $L$.

After computing $L$ and $N$, these matrices can be used to extract the nonlinear features of a test instance $x$. The kernel matrix $K(x)$ is constructed as in (8) and then projected according to $L$ and $N$, so the nonlinear features are contained in $L^{T}K(x)N$. The A-KPCA method is given in Algorithm 1.

Input: Given training set $X = \{x_i\}_{i=1}^{M}$.
(a) Create the kernel matrix $K(x_i)$ of (8) for each $x_i$.
(b) Get initial $L$ and $N$.
(c) For the given $N$, calculate the eigenvectors of $\sum_{i=1}^{M} K(x_i)NN^{T}K(x_i)^{T}$ corresponding to the $d_1$ largest eigenvalues; they form $L$.
(d) For the given $L$, calculate the eigenvectors of $\sum_{i=1}^{M} K(x_i)^{T}LL^{T}K(x_i)$ corresponding to the $d_2$ largest eigenvalues; they form $N$.
(e) Go to step (c) until convergence.
Output: $L$ and $N$.
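For illustration, a hedged Python/NumPy sketch of Algorithm 1 (as reconstructed above) is given below: it takes one M × m kernel matrix per training sample, alternates the two eigendecompositions of Theorem 1, and stops when the objective of (7) no longer improves. The initialization of N, the convergence test, and the function names are assumptions rather than the authors' original code.

import numpy as np

def akpca_fit(K_list, d1, d2, n_iter=20, tol=1e-6):
    # K_list: list of M kernel matrices K(x_i), each of size M x m with
    # entries K(x_i)[k, j] = k_j(x_k, x_i), as in (8).
    M, m = K_list[0].shape
    N = np.eye(m)[:, :d2]                     # simple initialization of N
    prev_obj = -np.inf
    for _ in range(n_iter):
        # step (c): L <- top d1 eigenvectors of sum_i K_i N N^T K_i^T
        SL = sum(K @ N @ N.T @ K.T for K in K_list)
        _, vecs = np.linalg.eigh(SL)
        L = vecs[:, ::-1][:, :d1]
        # step (d): N <- top d2 eigenvectors of sum_i K_i^T L L^T K_i
        SN = sum(K.T @ L @ L.T @ K for K in K_list)
        _, vecs = np.linalg.eigh(SN)
        N = vecs[:, ::-1][:, :d2]
        # step (e): repeat until the objective of (7) stops increasing
        obj = sum(np.linalg.norm(L.T @ K @ N, 'fro') ** 2 for K in K_list)
        if obj - prev_obj < tol:
            break
        prev_obj = obj
    return L, N

def akpca_features(K_x, L, N):
    # Nonlinear features of a test instance with kernel matrix K(x) of size M x m.
    return (L.T @ K_x @ N).ravel()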

3. Multiple Kernel PCA (M-KPCA)

In Section 2, we have seen that A-KPCA manipulates more than one subkernel. Each mapping rule transforms the input data samples into a corresponding Reproducing Kernel Hilbert Space. Each kernel thus captures a particular type of information from a given data set, thereby providing a partial view of the data. The value of this specific information may vary according to the machine learning task, such as classification, clustering, or dimensionality reduction. For instance, in a classification problem, kernels with high discrimination ability yield better results. Hence, we add this capability to A-KPCA using ideas from EL. Our proposed technique learns a new representation of a hyperspectral image exploiting all available training data. It is thus independent of the classifier.

As seen in formulation (8) and Theorem 1, there are no coefficients to quantify the contribution of the subkernels to classification. In other words, A-KPCA utilizes an unweighted summation. Nevertheless, the discriminative ability of the kernels used for FE plays a significant role in the separability achieved by the classifier. If we add weighting coefficients $\eta_j \geq 0$, $j = 1, \ldots, m$, to the right side of (8), then it becomes

$$K(x_i) = \begin{bmatrix} \eta_1 k_1(x_1, x_i) & \eta_2 k_2(x_1, x_i) & \cdots & \eta_m k_m(x_1, x_i) \\ \vdots & \vdots & \ddots & \vdots \\ \eta_1 k_1(x_M, x_i) & \eta_2 k_2(x_M, x_i) & \cdots & \eta_m k_m(x_M, x_i) \end{bmatrix}. \tag{9}$$

The discriminative ability of a kernel can be measured against an ideal kernel in a given classification task. Cristianini et al. [23] introduced a measure of similarity between two arbitrary kernels, or between a kernel and an ideal kernel, called kernel alignment (KA). The alignment between two regular kernels $K_1$ and $K_2$ is given as

$$A(K_1, K_2) = \frac{\left\langle K_1, K_2 \right\rangle_{F}}{\sqrt{\left\langle K_1, K_1 \right\rangle_{F}\left\langle K_2, K_2 \right\rangle_{F}}} \tag{10}$$

where the Frobenius product of two Gram matrices $K_1$ and $K_2$ is defined as $\langle K_1, K_2\rangle_{F} = \sum_{i,j=1}^{M} K_1(x_i, x_j)K_2(x_i, x_j)$ [23, 39]. This measure can be viewed as the cosine of the angle between $K_1$ and $K_2$, so it fluctuates between $-1$ and $1$ for arbitrary matrices. However, since we consider only positive semidefinite Gram matrices in KA, the score is lower bounded by zero. The alignment can also be adopted to capture the degree of agreement between a kernel and the target label matrix, also considered as the ideal kernel. A larger value of KA indicates higher discriminative ability, which is one of the main strengths of a subclassifier, improving the ensemble effect in an EL strategy [40, 41]. An idealized kernel for a binary classification problem can be composed of the dot product of the target labels $y \in \{-1, +1\}^{M}$, i.e., $K^{*} = yy^{T}$, and the alignment between a kernel $K$ and the ideal kernel is written as

$$A\!\left(K, yy^{T}\right) = \frac{\left\langle K, yy^{T} \right\rangle_{F}}{\sqrt{\left\langle K, K \right\rangle_{F}\left\langle yy^{T}, yy^{T} \right\rangle_{F}}} = \frac{y^{T}Ky}{M\sqrt{\left\langle K, K \right\rangle_{F}}}. \tag{11}$$
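The two alignment scores can be evaluated directly from the Gram matrices; the short NumPy sketch below is an illustrative reading of (10) and (11) and is not tied to any particular toolbox.

import numpy as np

def kernel_alignment(K1, K2):
    # Alignment (10): cosine of the angle between two Gram matrices
    # under the Frobenius inner product.
    num = np.sum(K1 * K2)
    den = np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))
    return num / den

def ideal_alignment(K, y):
    # Alignment (11) with the ideal kernel yy^T for binary labels y in {-1, +1}.
    y = np.asarray(y, dtype=float)
    return kernel_alignment(K, np.outer(y, y))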

Our goal is to construct an A-KPCA based algorithm with improved separability of multiclass patterns. Here, a kernel class separability (KCS) measure based on scatter matrices is employed to measure the class separability of the training samples in the feature space. The KCS is a general form of KA and it can be written in the form [42]

$$J = \frac{\mathrm{tr}\!\left(S_{b}^{\phi}\right)}{\mathrm{tr}\!\left(S_{w}^{\phi}\right)} \tag{12}$$

where $S_{b}^{\phi}$ and $S_{w}^{\phi}$, respectively, stand for the between-class scatter matrix and the within-class scatter matrix in the kernel space, and their traces are obtained as

$$\mathrm{tr}\!\left(S_{b}^{\phi}\right) = \sum_{c=1}^{C} n_c \left\| \bar{\phi}_c - \bar{\phi} \right\|^{2}, \qquad \mathrm{tr}\!\left(S_{w}^{\phi}\right) = \sum_{c=1}^{C}\sum_{i=1}^{n_c} \left\| \phi\!\left(x_i^{c}\right) - \bar{\phi}_c \right\|^{2} \tag{13}$$

where $n_c$ denotes the number of training samples in the $c$th class, $c = 1, \ldots, C$, and $x_i^{c}$ is the $i$th sample in the related class. $\bar{\phi}_c$ and $\bar{\phi}$ are the mean vector of the $c$th class and the mean vector of all training samples in the feature space, respectively. $\phi$ is the mapping function from the input space to the feature space, as described at the beginning of Section 2.1. A larger value of $J$ signifies superior class separability in the training set. A maximization problem may thus be formulated to obtain optimal kernels and their parameters or to eliminate weak kernels [43]; in this paper, however, we directly exploit the value of (12) as the measure of discriminability, hence

$$\eta_j = J_j, \qquad j = 1, \ldots, m \tag{14}$$

where $J_j$ is the KCS value of (12) computed with kernel $k_j$.
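Because every norm in (13) expands into kernel evaluations, the trace ratio (12) can be computed from a Gram matrix and the class labels alone. The NumPy sketch below does exactly that; it is an illustrative reading of the reconstructed (12)-(14), with the per-kernel weight obtained by evaluating the function on the Gram matrix of each candidate kernel.

import numpy as np

def kernel_class_separability(K, labels):
    # Trace ratio J = tr(S_b) / tr(S_w) in the feature space induced by K,
    # computed with the kernel trick: tr(S_w) = tr(K) - sum_c S_cc / n_c and
    # tr(S_b) = sum_c S_cc / n_c - sum(K) / M, where S_cc sums K over class c.
    labels = np.asarray(labels)
    M = K.shape[0]
    within = np.trace(K)
    between = -K.sum() / M
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        S_cc = K[np.ix_(idx, idx)].sum()
        within -= S_cc / len(idx)
        between += S_cc / len(idx)
    return between / within

# Per-kernel weights as in (14): eta_j = kernel_class_separability(K_j, labels),
# where K_j is the training Gram matrix of the j-th candidate kernel.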

Finally, we extend the noniterative A-KPCA algorithm using the kernel class separability measure within a semisupervised strategy. The proposed noniterative M-KPCA technique is given in Algorithm 2.

Input: Given training set $X = \{x_i\}_{i=1}^{M}$ with labels $\{y_i\}_{i=1}^{M}$.
(a) Obtain the weights $\eta_j$ for the corresponding preselected kernels using Eqs. (12) and (13).
(b) Create the kernel matrix $K(x_i)$ for each $x_i$ as in Eq. (9).
(c) Calculate the eigenvectors and eigenvalues of $\sum_{i=1}^{M} K(x_i)K(x_i)^{T}$. Sort the eigenvectors according to
the decreasing order of the eigenvalues and select the first $d_1$ eigenvectors as $L$.
(d) Calculate the eigenvectors and eigenvalues of $\sum_{i=1}^{M} K(x_i)^{T}K(x_i)$. Sort the eigenvectors according to
the decreasing order of the eigenvalues and select the first $d_2$ eigenvectors as $N$.
(e) The final subspaces are $L$ and $N$.
Output: $L$ and $N$.
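Putting the pieces together, the sketch below follows Algorithm 2: it computes a separability weight per preselected kernel, builds the weighted per-sample kernel matrices of (9), and obtains L and N from two single eigendecompositions. The function names and data layout are illustrative assumptions based on the reconstructed equations, not the authors' released code; kernel_class_separability refers to the helper sketched after (14).

import numpy as np
# assumes kernel_class_separability from the sketch following Eq. (14)

def mkpca_fit(X, y, kernel_funcs, d1, d2=1):
    # kernel_funcs: list of callables k_j(A, B) -> Gram matrix between rows of A and B.
    M, m = X.shape[0], len(kernel_funcs)
    grams = [k(X, X) for k in kernel_funcs]                   # full M x M Gram per kernel
    eta = np.array([kernel_class_separability(G, y) for G in grams])   # weights (14)
    # weighted per-sample kernel matrices of (9): K(x_i)[k, j] = eta_j * k_j(x_k, x_i)
    K_list = [np.column_stack([eta[j] * grams[j][:, i] for j in range(m)])
              for i in range(M)]
    SL = sum(K @ K.T for K in K_list)                         # step (c)
    _, vecs = np.linalg.eigh(SL)
    L = vecs[:, ::-1][:, :d1]
    SN = sum(K.T @ K for K in K_list)                         # step (d)
    _, vecs = np.linalg.eigh(SN)
    N = vecs[:, ::-1][:, :d2]
    return L, N, eta

def mkpca_features(Xtrain, x, kernel_funcs, eta, L, N):
    # Features of a test sample x: build its weighted kernel matrix and project it.
    K_x = np.column_stack([eta[j] * kernel_funcs[j](Xtrain, x[None, :]).ravel()
                           for j in range(len(kernel_funcs))])
    return (L.T @ K_x @ N).ravel()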

4. Experiments

In this section, we investigate the performance of the proposed M-KPCA algorithm compared with a number of conventional and state-of-the-art techniques on two Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) hyperspectral data sets. Our experiments are conducted on a machine with an Intel Core i5-2410M CPU at 2.30GHz and 8GB DDR-III RAM.

4.1. Data Sets and Experimental Setup

The first data set is airborne remote sensing data captured by the AVIRIS sensor over northwest Indiana on June 12, 1992. The Indian Pines data set has 16 labeled classes, 145 lines per scene, and 145 pixels per line. Originally, the scene has 220 spectral bands (10 nm spectral bandwidth from 0.4 to 2.5 μm); after discarding the water absorption and noise bands, based on [44, 45], only 159 bands were used in the experiments.

The second data source is airborne hyperspectral data acquired by the AVIRIS sensor at 18 m spatial resolution over the Kennedy Space Center (KSC) in March 1996. Noisy bands and water absorption bands are removed, and the remaining HSI data has 176 bands for 13 wetland and upland classes. Figure 1 shows the RGB image of Indian Pines and a false color image of the KSC. Table 1 summarizes the data sets used in our experiments. All samples in each data set are scaled to a common range, as suggested in [46].

PCA and the KPCAs are implemented using the SIMFEAT toolbox [47]. In each experiment, a single kernel is selected for the KPCA. In addition to the Gaussian radial basis function (RBF) kernel, formulated as $k(x, y) = \exp\!\left(-\left\| x - y \right\|^{2} / 2\sigma^{2}\right)$, we have employed three more kernels (see Table 2). Before solving the eigenvalue problem, the parameter $\sigma$ in the RBF, Laplacian, and Cauchy kernels should be selected or optimized. Unless otherwise stated, the kernel parameter is set to

$$\sigma = \frac{1}{M}\sum_{i=1}^{M}\left\| x_i - \bar{x} \right\| \tag{15}$$

where $\bar{x}$ is the centroid of the total training data [48]. The kernel parameter is not optimized and is the same for each kernel given in Table 2. Nevertheless, the aim of combining nonoptimized kernels is to yield a better FE technique for classification.
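For reference, a hedged Python/NumPy sketch of the heuristic (15) and of four candidate kernels of the type listed in Table 2 (RBF, Laplacian, Cauchy, and histogram intersection) is given below. Since the exact functional forms of Table 2 are not reproduced in this text, the definitions here are standard textbook versions offered as assumptions.

import numpy as np

def sigma_heuristic(X):
    # Kernel scale as in (15): mean distance of the training samples to their centroid.
    centroid = X.mean(axis=0)
    return np.linalg.norm(X - centroid, axis=1).mean()

def sq_dists(A, B):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.maximum(d2, 0.0)

def rbf(A, B, s):
    # Gaussian RBF kernel.
    return np.exp(-sq_dists(A, B) / (2 * s**2))

def laplacian(A, B, s):
    # Laplacian kernel based on the L1 distance.
    d1 = np.abs(A[:, None, :] - B[None, :, :]).sum(-1)
    return np.exp(-d1 / s)

def cauchy(A, B, s):
    # Cauchy kernel.
    return 1.0 / (1.0 + sq_dists(A, B) / s**2)

def hist_intersection(A, B, s=None):
    # Histogram intersection (HIST) kernel; assumes nonnegative features.
    return np.minimum(A[:, None, :], B[None, :, :]).sum(-1)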

The most informative portions of the cumulative eigenvalue curves obtained after eigendecomposition for each method are shown in Figure 2. According to the cumulative eigenvalues of PCA, two principal components reach 99% of the total variance in the Indian Pines case, whereas three principal components are needed to reach 99% of the information in the KSC case. According to these results, the new dimensions of Indian Pines and KSC for the classification experiment are defined as 2 and 3, respectively. However, hyperspectral information cannot be represented using only second-order statistics, as pointed out in Section 1. From Figure 2, it can be seen that more kernel principal components (KPCs) are needed to capture the same amount of variance as with PCA. Note that the total number of components with PCA is equal to the number of bands, i.e., 159 for Indian Pines, while for KPCA it is equal to the number of training samples, i.e., 2594, which is significantly higher. For the Indian Pines data set, the first 11, 51, 31, and 33 KPCs are needed to reach 99% of the cumulative variance with the RBF, Laplacian, Cauchy, and histogram intersection (HIST) kernels, respectively. For the KSC data set, 8 KPCs are needed with the RBF, 56 with the Laplacian, 18 with the Cauchy, and 55 with the HIST kernel to achieve the same amount of information. In the case of A-KPCA and M-KPCA, the kernel-selection dimension p is set to 1. For the Indian Pines data set, 35 adaptive KPCs and 20 multiple KPCs contain 99% of the information, while only 14 adaptive KPCs and 12 multiple KPCs are needed for the KSC.
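The component counts reported above follow directly from the cumulative eigenvalue curves; a small helper of the kind sketched below (an assumption matching the 99% criterion used in this section) returns that count for any eigenvalue spectrum.

import numpy as np

def n_components_for_variance(eigvals, ratio=0.99):
    # Smallest number of leading components whose eigenvalues account for
    # `ratio` of the total (eigvals are sorted in descending order first).
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    cum = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cum, ratio) + 1)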

In order to display the first principal components (PCs) more clearly, a subimage of size 100 × 100 from the KSC hyperspectral cube is selected. The first PCs for all of the methods are depicted in Figure 3.

After FE, an SVM classifier is employed for classification. For the nonlinear SVMs, we use the RBF kernel formulated in Table 2. The classification experiments and the optimization of the SVM parameters, C and σ, are carried out using LIBSVM [49] with a 5-fold cross-validation technique. Since SVMs are designed to solve binary problems, various approaches have been proposed for multiclass situations such as remote sensing applications. The most popular approaches for multiclass classification are one-against-all (1AA) and one-against-one (1A1). In this paper, we apply the 1AA strategy for each class. Each test sample is finally labeled as the class whose output score is maximum.
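A hedged sketch of this classification stage is given below using scikit-learn's SVC, which wraps LIBSVM, combined with an explicit one-against-all decomposition and 5-fold cross-validation. The parameter grids are illustrative assumptions, not the values searched in the paper.

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

def train_svm_1aa(features, labels):
    # RBF-SVM with one-against-all decomposition; C and gamma are tuned by
    # 5-fold cross-validation over an illustrative grid.
    param_grid = {
        "estimator__C": [1, 10, 100, 1000],
        "estimator__gamma": [1e-3, 1e-2, 1e-1, 1],
    }
    ovr = OneVsRestClassifier(SVC(kernel="rbf"))
    search = GridSearchCV(ovr, param_grid, cv=5, scoring="accuracy")
    search.fit(features, labels)
    return search.best_estimator_

# Usage: clf = train_svm_1aa(train_feats, train_labels); pred = clf.predict(test_feats)
# Each test sample receives the class whose decision score is maximum.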

Finally, we compare the proposed M-KPCA algorithm against five state-of-the-art dimension reduction algorithms, i.e., linear discriminant analysis (LDA) [50], LPP, probabilistic PCA (pPCA) [51], RP, and t-SNE. LDA, LPP, pPCA, and t-SNE are implemented using the MATLAB toolbox [52] for dimensionality reduction, and RP algorithm is designed based on Wang’s work [53].

4.2. Comparison with KPCA and A-KPCA

The original data sets, termed raw, are also classified for comparison. Tables 3 and 4 compare the performance of all models numerically (class accuracies and overall accuracy (OA) in percentages) and statistically (kappa test) for the Indian Pines and KSC data sets, respectively.

Inspection of Table 3 reveals that A-KPCA outperforms PCA and all four KPCAs. Further analysis shows that KPCA performs significantly better than conventional PCA. Regarding the OAs, it is clear that M-KPCA based classification produces more accurate results than A-KPCA based classification. Among the kernel functions, the RBF kernel gives the best KPCA results, as seen in Table 3.

The results for the KSC data set are reported in Table 4. Regarding the PCA and KPCA results, FE does not improve the accuracies significantly. The comparison between KPCA and PCA shows that KPCA performs better than PCA in terms of classification accuracy. Moreover, classification with the A-KPCA features is more precise than that obtained using all the KPCs. As in the previous experiment, the best results are obtained with M-KPCA. Figures 4 and 5 present the available labeled scenes and the classification maps of all models for the Indian Pines and KSC data sets, respectively.

In the last experiment, we increase the number of KPCAs in both A-KPCA and M-KPCA by using different scale parameters within the same kernel. Tables 3 and 4 show that the best single kernel differs between the two data sets. Therefore, we adopt seven RBF kernels for Indian Pines and seven Cauchy kernels for KSC, with their scale parameters spread over a range around a central value; the central parameter is determined by (15). The SVM is again employed for classification after FE. The number of retained eigenvalues for each method is again chosen to capture 99% of the cumulative variance, as in Section 4.1. Table 5 summarizes the classification accuracies of this experiment. The results show that the M-KPCA based features are better than the individual KPCA and A-KPCA features on both data sets, no matter which kernel parameter is applied.

4.3. M-KPCA versus Other Dimension Reduction Algorithms

In this section, we compare our method (M-KPCA) with five FE methods, i.e., LDA, LPP, pPCA, RP, and t-SNE. M-KPCA is constructed with the subkernels indicated in Table 2, and the kernel parameters are determined from (15). Different values of the dimensionality of the new subspaces are tested with the SVM classifier on the two data sets; a set of candidate values is generated independently for the subspace dimension. The classification accuracy is reported for each model, and the results are plotted in Figure 6.

Inspection of Figure 6 reveals that the proposed method consistently outperforms the competing FE methods for multiclass classification in higher dimensions. For instance, when the number of extracted features is set to 50, M-KPCA improves over the best competing method, RP, by 5.19% in terms of OA on Indian Pines and by 3.78% on KSC. It can also be seen from Figure 6 that the t-SNE method is highly stable against changes in dimensionality. Comparing the methods, we also observe that the performance of LDA and pPCA is limited on both data sets. In the lower dimensions (i.e., when the number of new dimensions is smaller than 10), the best features are produced by t-SNE, which is also the most time-consuming method. The remaining methods are ordered as RP, LDA, LPP, pPCA, and M-KPCA in ascending order of average computation time.

5. Conclusion

In this paper, a novel semisupervised KPCA framework named multiple KPCA (M-KPCA) is proposed for effective feature extraction from hyperspectral images. It applies an ensemble strategy to favor good candidate kernels during the nonlinear projections. A noniterative algorithm is developed to perform feature extraction and kernel combination simultaneously, based on a kernel class separability criterion. In terms of the number of kernels, KPCA uses only one base kernel with predefined parameter(s) (if any). In terms of kernel quality, A-KPCA has no measurement procedure to evaluate the efficiency of the kernels. M-KPCA overcomes these drawbacks of both KPCA and A-KPCA.

The dimension-reduced HSI data is classified by nonlinear SVMs to compare the classification performance of the various models. Experiments on two real HSI data sets demonstrate that the best kernel type varies with the data (see Tables 3 and 4). In the first test, KPCA presents better performance than conventional PCA. The overall evaluation of the dimension reduction performance of the PCA, KPCA, A-KPCA, and M-KPCA techniques shows that M-KPCA is more successful than the others. In the second experiment, we employ seven candidate kernel functions with different kernel parameters for each data set. These KPCAs are then utilized to construct the A-KPCA and M-KPCA. Experiments on the AVIRIS data sets confirm that M-KPCA outperforms the individual KPCAs and the A-KPCA in terms of both OA and the kappa coefficient. Moreover, the comparative results in Section 4.3 demonstrate that M-KPCA achieves superior or competitive classification accuracy with respect to the other state-of-the-art unsupervised FE methods.

The results clearly validate that semisupervised learning of kernels with M-KPCA increases the robustness of nonoptimized KPCAs. One, and probably the most important, limitation of M-KPCA is its computational complexity, which is related to the number of samples used for constructing the kernel matrices. Therefore, our future work aims to address the problem of reducing this complexity. It is also possible to extend the proposed method to a selective approach which eliminates weak kernels before feature extraction.

Data Availability

The Indian Pines and KSC data that support the findings of this study are, respectively, available at https://engineering.purdue.edu/~biehl/ and http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes.

Conflicts of Interest

The author declares that they have no conflicts of interest.

Acknowledgments

The authors would like to thank Daoqiang Zhang and Zhi-Hua Zhou for providing a part of the source code for the A-KPCA.