Abstract
Classification is one of the most challenging tasks in remotely sensed data processing, particularly for hyperspectral imaging (HSI). Dimension reduction is widely applied as a preprocessing step for classification; however, reducing the dimension with conventional methods does not always guarantee a high classification rate. Principal component analysis (PCA) and its nonlinear version, kernel PCA (KPCA), are traditional dimension reduction algorithms. In a previous work, a variant of KPCA, denoted Adaptive KPCA (AKPCA), was suggested to obtain a robust unsupervised feature representation for HSI. That technique employs several KPCAs simultaneously, each with a different candidate kernel, to obtain better feature points. Nevertheless, AKPCA neglects the relative influence of the subkernels because it employs an unweighted combination. Furthermore, if there is at least one weak kernel in the set of kernels, the classification performance may be reduced significantly. To address these problems, in this paper we propose an Ensemble Learning (EL) based multiple kernel PCA (MKPCA) strategy. MKPCA constructs a weighted combination of kernels with high discriminative ability from a predetermined set of base kernels and then extracts features in an unsupervised fashion. Experiments on two different AVIRIS hyperspectral data sets show that the proposed algorithm achieves satisfactory feature extraction performance on real data.
1. Introduction
Hyperspectral imaging (HSI) simultaneously provides spatial and high-resolution spectral data and helps to classify or recognize materials that are challenging to discriminate with conventional imaging techniques [1]. However, it suffers from the curse of dimensionality, which, for instance, increases the cost of storing, transmitting, and processing hyperspectral images. To overcome such challenges, dimensionality reduction techniques have been applied to hyperspectral data in the existing literature [2]. In general, HSI exhibits spectral redundancy across many spectral channels. For this reason, dimension reduction or compression is possible and even necessary, especially for these bands.
Even though there are several dimension reduction approaches in the literature, including manifold learning [3, 4] and tensors [5], principal component analysis (PCA) [6] is among the most popular techniques [7–9]. PCA is the discrete form of the continuous Karhunen-Loève Transform, and it projects the data into a subspace so that the retained variance is maximized and the least-squares reconstruction error is minimized [10]. Using PCA for dimensionality reduction in HSI is computationally convenient and preserves most of the variance of the raw data. Although PCA has some theoretical inadequacies [11, 12] for use on remote sensing data, particularly hyperspectral images [13], practical applications show that the results obtained using PCA are still competitive for the purpose of classification [14, 15]. The ability of PCA is limited for high-dimensional data since it relies only on second-order statistical information. The nonlinear version of PCA, denoted kernel PCA (KPCA), has been proposed to overcome these limitations [16].
Since KPCA involves higher-order statistics, it extracts more information from the original data [17], and it is employed in many applications, including remote sensing, due to its satisfactory performance. In [18], an artificial neural network fed with kernel principal components was shown to outperform the classical approach. Fauvel et al. [19] showed that KPCA is better than classical PCA in terms of classification accuracy. A general overview of feature reduction techniques for classification of hyperspectral images is presented in [9]. The authors of [9] performed comparative experiments between unsupervised techniques, e.g., PCA and KPCA, and supervised techniques, e.g., double nearest proportion (DNP) [20] and kernel nonparametric weighted feature extraction (KNWFE) [21]. Since supervised learning techniques generally focus on improving class separability, these methods are expected to produce better results in terms of classification performance. The comparative results with KNWFE indicate that PCA and KPCA are still preferable for reducing the dimensionality of hyperspectral images.
Fundamentally, KPCA is a version of PCA whose performance is greatly affected by the choice of the kernel and its parameters. Namely, the selection of the optimal kernel and parameters is crucial for KPCA to achieve good performance. However, application results show that no single kernel function is best for all kinds of machine learning problems [22], and, therefore, learning optimal kernels over a kernel set is an active research area [23–27]. Li and Yang presented an ensemble KPCA method with a Bayesian inference strategy in [28]. They exploited only the Gaussian radial basis function (RBF) with different scale parameters as subkernels. Zhang et al. [29] developed a method for unsupervised kernel learning in KPCA, dubbed AKPCA, and applied it to object recognition problems.
AKPCA learns the kernels via an unsupervised learning approach. The 1D input vectors, e.g., feature vectors, are transformed into 2D feature matrices by different kernels. Each column of a feature matrix comes from the corresponding 1D input vector under one kernel. Nonlinear feature extraction (FE) is obtained from one set of projective vectors corresponding to the column direction of the feature matrices. The set of projective vectors corresponding to the row direction of the 2D feature matrices is simultaneously used to search for the optimal combination of kernels. Despite its superior performance compared to KPCA, AKPCA has some critical limitations. Specifically, AKPCA works in a completely unsupervised way, so it cannot enhance class separability, and it has no kernel preselection process. These are the main motivations of our work.
In this paper, a novel framework is introduced for hyperspectral FE and classification based on multiple KPCA models with an Ensemble Learning (EL) strategy in a semisupervised manner. EL is the process of combining multiple models, called experts, to build a strong model for a specific machine learning problem [30]. Strong discriminative ability of the individual experts and high diversity among them are required to produce satisfactory models [31, 32]. Acceptable classification performance depends heavily on the class separability of the features, which is directly related to their discriminative ability. Inspired by EL, we extend the AKPCA method by employing multiple kernels such that subkernels possessing higher discrimination ability are emphasized. The proposed approach, multiple kernel PCA (MKPCA), learns an ensemble of multiple kernel principal components on an available labeled data set, and the final features are extracted via a weighted combination of all subkernels according to their separability performance. The initial purpose of this paper is to apply KPCA and AKPCA to hyperspectral images and to determine the impact of nonlinear versions of PCA on classification performance. The further contributions and novelties of this paper can be summarized as follows: (1) a novel multikernel PCA strategy is presented that exploits Ensemble Learning to evaluate and select the kernels; (2) MKPCA achieves better classification results than PCA, KPCA, and AKPCA by emphasizing the subkernels with a class-separability-based weighting strategy; (3) MKPCA produces better or competitive classification performance compared with other popular unsupervised FE methods such as locality preserving projections (LPP) [33], random projections (RP) [34], and t-distributed stochastic neighbor embedding (t-SNE) [35]. After FE with all the mentioned methods, the popular and robust support vector machine (SVM) classifier is used for supervised classification. Since SVMs consider samples close to the class boundary, called support vectors, they perform well even on high-dimensional data with few training samples [36, 37].
The paper is organized as follows. Section 2 reviews the related work. In Section 3, the proposed MKPCA framework is presented. Next, a series of experiments on real data sets is carried out in Section 4 to verify the effectiveness of our method. Finally, Section 5 concludes this paper.
2. Related Work
2.1. KPCA Background
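In KPCA, the raw data is projected into the feature space by a nonlinear mapping function, and the useful information is concentrated into the principal components corresponding to the largest eigenvalues [19]. Define a learning set as $\{x_i\}_{i=1}^{N}$, $x_i \in \mathbb{R}^{d}$. Let $\phi: \mathbb{R}^{d} \to \mathcal{F}$ be a nonlinear mapping from the input space to a high-dimensional feature space $\mathcal{F}$. The inner product in feature space is calculated by the kernel function in the original input space:
$$k(x_i, x_j) = \phi(x_i)^{T}\phi(x_j), \tag{1}$$
where the superscript $T$ represents the transpose operation. Denote $\Phi = [\phi(x_1), \ldots, \phi(x_N)]$ and $\bar{\phi} = \frac{1}{N}\sum_{i=1}^{N}\phi(x_i)$. Assuming $\bar{\phi} = 0$, i.e., the data are centered in $\mathcal{F}$, the total scatter matrix can be defined as $S_t = \sum_{i=1}^{N}\phi(x_i)\phi(x_i)^{T} = \Phi\Phi^{T}$. To compute the projective vector $v$ for the optimal solution, KPCA maximizes the following norm-based criterion:
$$J(v) = \sum_{i=1}^{N}\left\|v^{T}\phi(x_i)\right\|^{2} = v^{T}S_t v, \quad \|v\| = 1. \tag{2}$$
Computing the optimal projective vector amounts to solving the eigenvalue problem $S_t v = \lambda v$, in which $\lambda \geq 0$ are the eigenvalues and $v \in \mathcal{F}$ the eigenvectors. Since every eigenvector with $\lambda > 0$ lies in the span of the mapped samples, i.e., $v = \Phi\alpha$, (2) can be rewritten as an equivalent problem:
$$K\alpha = \lambda\alpha, \tag{3}$$
where $K = \Phi^{T}\Phi$ is the $N \times N$ kernel matrix. The solutions $\alpha$ of (3) corresponding to the largest eigenvalues are retained; then $v = \Phi\alpha$ is the solution vector of (2). Like any kernel method, KPCA-based FE does not require the nonlinear mapping explicitly; it only needs a kernel function in the input space. To obtain better performance with KPCA, the parameters of the kernel are optimized. However, this optimization cannot produce adequate solutions for every application or data set because of the nature of the kernel itself [22]. To overcome this drawback, an adaptive kernel combination technique is introduced in [29].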
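The procedure above can be illustrated with a minimal, standard-library sketch. It is not the SIMFEAT implementation used in the experiments: the function names are hypothetical, the RBF scaling by $2\sigma^2$ is an assumed convention, and power iteration is swapped in for a full eigendecomposition so that only the leading kernel principal component is recovered.

```python
import math

def rbf_kernel(x, y, sigma):
    """Gaussian RBF kernel; the 2*sigma^2 scaling is an assumed convention."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def kernel_matrix(samples, kernel):
    """N x N Gram matrix with entries k(x_i, x_j)."""
    return [[kernel(a, b) for b in samples] for a in samples]

def center_kernel(K):
    """Center the mapped data in feature space (K is symmetric)."""
    n = len(K)
    row = [sum(r) / n for r in K]
    total = sum(row) / n
    return [[K[i][j] - row[i] - row[j] + total for j in range(n)]
            for i in range(n)]

def leading_eigenpair(K, iters=500):
    """Power iteration for the largest eigenvalue of a symmetric PSD matrix."""
    n = len(K)
    alpha = [1.0 + 0.1 * i for i in range(n)]  # arbitrary non-constant start
    lam = 0.0
    for _ in range(iters):
        w = [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]
        lam = math.sqrt(sum(v * v for v in w))
        if lam == 0.0:
            break
        alpha = [v / lam for v in w]
    return lam, alpha

def first_kpc_scores(samples, sigma):
    """Projections of the training samples onto the first kernel PC."""
    K = center_kernel(
        kernel_matrix(samples, lambda a, b: rbf_kernel(a, b, sigma)))
    lam, alpha = leading_eigenpair(K)
    # v = Phi * alpha / sqrt(lam) has unit norm, so the score of sample i
    # is (K alpha)_i / sqrt(lam) = sqrt(lam) * alpha_i.
    return lam, [math.sqrt(lam) * a for a in alpha]
```

For two well-separated groups of one-dimensional samples, such as `[[0.0], [0.1], [3.0], [3.1]]` with `sigma=1.0`, the scores of the two groups take opposite signs, which is the behavior expected from the first component.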
2.2. Adaptive KPCA (AKPCA)
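As pointed out in Section 1, the performance of KPCA is notably affected by the selection of the kernel and its parameters, so an extension to several kernels is needed. Let $\{\phi_m\}_{m=1}^{M}$, $\phi_m: \mathbb{R}^{d} \to \mathcal{F}_m$, be a set of $M$ nonlinear mappings. As mentioned in Section 2.1, the inner products in $\mathcal{F}_m$ are described by the kernels $k_m(x_i, x_j) = \phi_m(x_i)^{T}\phi_m(x_j)$. Using the definition of $\phi_m$, the joint mapping $\phi: x \mapsto [\phi_1(x)^{T}, \ldots, \phi_M(x)^{T}]^{T} \in \mathcal{F}$ can be written. In this equation, $\mathcal{F} = \mathcal{F}_1 \oplus \cdots \oplus \mathcal{F}_M$ is the Hilbert space formed as the direct sum of the $\mathcal{F}_m$, and the inner product in $\mathcal{F}$ can be defined as
$$k(x_i, x_j) = \sum_{m=1}^{M} k_m(x_i, x_j). \tag{4}$$
To construct a 2D feature matrix, a sample $x_i$ of the learning set is transformed to the high-dimensional feature spaces and then
$$\Phi(x_i) = [\phi_1(x_i), \ldots, \phi_M(x_i)] \tag{5}$$
is obtained. Here, each column of $\Phi(x_i)$ corresponds to a nonlinear mapping generated by one of the $\phi_m$'s, which are treated as having a common dimension $f$. Thus, vector-based data is converted to a matrix-based format. The $\phi_m$'s are assumed to have zero means, i.e., $\sum_{i=1}^{N}\phi_m(x_i) = 0$ for every $m$. Appropriate $U$ and $V$ matrices must be determined to optimize
$$\max_{U, V} \sum_{i=1}^{N}\left\|U^{T}\Phi(x_i)V\right\|_F^{2}, \tag{6}$$
where $U = [u_1, \ldots, u_l]$ contains the projective vectors corresponding to the columns of $\Phi(x_i)$ while $V = [v_1, \ldots, v_p]$ contains those corresponding to the rows of $\Phi(x_i)$. Equation (6) runs over all $N$ generated feature matrices. The purpose of $U$ is to extract features, while the purpose of $V$ is kernel selection. In other words, unsupervised kernel learning and nonlinear FE are realized simultaneously through the projective vectors contained in $U$ and $V$. $\|\cdot\|_F$ is the Frobenius norm of a matrix, i.e., $\|A\|_F^{2} = \mathrm{tr}(AA^{T})$, where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix. Since the size of the original $U$ is very large, i.e., $f \times l$, each projective vector is expressed through the mapped training samples, as in Section 2.1, and an $N \times l$ coefficient matrix $L$ is used instead of $U$. Hence, $U^{T}\Phi(x_i) = L^{T}K_i$ is obtained. These calculations allow us to rewrite (6) as (7):
$$\max_{L, V} \sum_{i=1}^{N}\left\|L^{T}K_i V\right\|_F^{2}, \tag{7}$$
where the constraints of (7) are $L^{T}L = I_l$ and $V^{T}V = I_p$. Here $K_i$ is an $N \times M$ kernel matrix and it is constructed as follows:
$$K_i = [\mathbf{k}_1(x_i), \ldots, \mathbf{k}_M(x_i)], \quad \mathbf{k}_m(x_i) = [k_m(x_1, x_i), \ldots, k_m(x_N, x_i)]^{T}. \tag{8}$$
To solve this optimization problem, inspired by Ye's work [38], an iterative procedure is presented by the following theorem [29].
Theorem 1. Let $L$ and $V$ be the optimal solution to (7); then (i) the eigenvectors corresponding to the $l$ largest eigenvalues of the matrix $\sum_{i=1}^{N} K_i V V^{T} K_i^{T}$ form $L$ for a given $V$; (ii) the eigenvectors corresponding to the $p$ largest eigenvalues of the matrix $\sum_{i=1}^{N} K_i^{T} L L^{T} K_i$ form $V$ for a given $L$.
After computing $L$ and $V$, these matrices can be used to extract the nonlinear features for a test instance $x_t$. The kernel matrix $K_t$ is constructed as in (8) and then projected according to $L$ and $V$, so the nonlinear features are contained in $L^{T}K_t V$. The AKPCA method is given in Algorithm 1.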
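As a rough illustration of the test-time step described above, the sketch below builds a per-sample kernel matrix with one column per subkernel and projects it from both sides. It is only a sketch under stated assumptions: the layout of the matrices (training samples along rows, subkernels along columns), the helper names, and the toy kernels are hypothetical and not taken from the AKPCA source code.

```python
def sample_kernel_matrix(train, x, kernels):
    """N x M matrix whose (j, m) entry is k_m(train[j], x)."""
    return [[k(xj, x) for k in kernels] for xj in train]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def extract_features(L, K_t, V):
    """Two-sided projection: feature-extraction matrix L (N x l) on the left,
    kernel-selection matrix V (M x p) on the right."""
    return matmul(matmul(transpose(L), K_t), V)
```

With `p = 1`, as used later in the experiments, `V` is a single column, so the result is an `l`-dimensional feature vector in which the subkernels have been merged according to the entries of `V`.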

3. Multiple Kernel PCA (MKPCA)
In Section 2, we have shown that AKPCA operates on more than one subkernel. A mapping rule transforms the input data samples into the corresponding Reproducing Kernel Hilbert Space. Each kernel thus captures a particular type of information from a given data set, thereby providing a partial view of the data. The value of this specific information may vary across machine learning tasks such as classification, clustering, and dimensionality reduction. For instance, in a classification problem, kernels with high discrimination ability yield better results. Hence, we add this capability to AKPCA using ideas from EL. Our proposed technique learns a new representation for a hyperspectral image by exploiting all available training data. It is thus independent of the classifier.
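As seen in formulation (8) and Theorem 1, there are no coefficients that quantify the contribution of the individual subkernels to classification. In other words, AKPCA uses an unweighted summation. Nevertheless, the discriminative ability of the kernels used in FE plays a significant role in the separability achieved by the classifier. If we add a weighting coefficient $\mu_m \geq 0$ for the $m$th subkernel on the right side of (8), then it becomes
$$K_i = [\mu_1\mathbf{k}_1(x_i), \ldots, \mu_M\mathbf{k}_M(x_i)], \tag{9}$$
where $\mathbf{k}_m(x_i) = [k_m(x_1, x_i), \ldots, k_m(x_N, x_i)]^{T}$ collects the values of the $m$th kernel between the $N$ training samples and $x_i$.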
The discriminative ability of a kernel can be measured against an ideal kernel in a given classification task. Cristianini et al. [23] introduced kernel alignment (KA), a measure of similarity between two arbitrary kernels or between a kernel and an ideal kernel. The alignment between two regular kernels is given as
$$A(K_1, K_2) = \frac{\langle K_1, K_2\rangle_F}{\sqrt{\langle K_1, K_1\rangle_F\,\langle K_2, K_2\rangle_F}}, \tag{10}$$
where the Frobenius product of two Gram matrices $K_1$ and $K_2$ is defined as $\langle K_1, K_2\rangle_F = \sum_{i,j=1}^{N} k_1(x_i, x_j)\,k_2(x_i, x_j)$ [23, 39]. This measure can be viewed as the cosine of the angle between $K_1$ and $K_2$, so it varies between $-1$ and $1$ for arbitrary matrices. However, since we consider only positive semidefinite Gram matrices in KA, the score is lower bounded by zero. The alignment can also be adopted to capture the degree of agreement between a kernel and the target label matrix, which is regarded as the ideal kernel. A larger KA value indicates higher discriminative ability, which is one of the main strengths of a subclassifier and improves the ensemble effect in an EL strategy [40, 41]. An idealized kernel for a binary classification problem can be composed of the dot product of target labels, i.e., $K^{*} = yy^{T}$ with $y \in \{-1, +1\}^{N}$, and the alignment between a kernel $K$ and the ideal kernel is written as
$$A(K, yy^{T}) = \frac{\langle K, yy^{T}\rangle_F}{\sqrt{\langle K, K\rangle_F\,\langle yy^{T}, yy^{T}\rangle_F}} = \frac{y^{T}Ky}{N\,\|K\|_F}. \tag{11}$$
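Kernel alignment is straightforward to compute from Gram matrices. The following minimal sketch uses plain nested lists; the function names are hypothetical, and labels are assumed to be coded as $-1$ and $+1$.

```python
import math

def frobenius_product(K1, K2):
    """Sum of element-wise products of two equally sized matrices."""
    return sum(a * b for r1, r2 in zip(K1, K2) for a, b in zip(r1, r2))

def alignment(K1, K2):
    """Cosine of the angle between two Gram matrices."""
    return frobenius_product(K1, K2) / math.sqrt(
        frobenius_product(K1, K1) * frobenius_product(K2, K2))

def ideal_kernel(labels):
    """Ideal kernel y y^T for binary labels coded as -1 / +1."""
    return [[yi * yj for yj in labels] for yi in labels]
```

For example, with labels `[1, 1, -1, -1]` the ideal kernel is perfectly aligned with itself, while the 4 × 4 identity matrix, which carries no class information beyond self-similarity, has an alignment of 0.5 with it.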
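Our goal is to construct an AKPCA-based algorithm with improved separability of multiclass patterns. Here, a kernel class separability (KCS) measure based on scatter matrices is employed to measure the class separability of the training samples in feature space. The KCS is a general form of KA and it can be written in the form [42]:
$$J = \frac{\mathrm{tr}(S_b^{\phi})}{\mathrm{tr}(S_w^{\phi})}, \tag{12}$$
where $S_b^{\phi}$ and $S_w^{\phi}$, respectively, stand for the between-class scatter matrix and the within-class scatter matrix in kernel space, and their traces are obtained as
$$\mathrm{tr}(S_b^{\phi}) = \sum_{c=1}^{C} n_c\,(m_c^{\phi} - m^{\phi})^{T}(m_c^{\phi} - m^{\phi}), \qquad \mathrm{tr}(S_w^{\phi}) = \sum_{c=1}^{C}\sum_{i=1}^{n_c}\left(\phi(x_i^{c}) - m_c^{\phi}\right)^{T}\left(\phi(x_i^{c}) - m_c^{\phi}\right), \tag{13}$$
where $n_c$ denotes the number of training samples in the $c$th class, $c = 1, \ldots, C$, and $x_i^{c}$ is the $i$th sample in the related class. $m_c^{\phi}$ and $m^{\phi}$ are the mean vector for the $c$th class and the mean vector for all training samples, respectively. $\phi$ is the mapping function from the input space to the feature space, as described at the beginning of Section 2.1. A larger value of $J$ signifies superior class separability in the training set. A maximization problem may thus be created to obtain optimal kernels and their parameters or to eliminate weak kernels [43], but, in this paper, we directly exploit the value of (12) as the measure of discriminability; hence
$$\mu_m = J_m, \tag{14}$$
where $J_m$ denotes the value of (12) for the $m$th subkernel, $m = 1, \ldots, M$.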
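The scatter-matrix traces never require the mapping explicitly, because squared distances to class means in feature space expand into sums of kernel values. The sketch below computes the ratio of the between-class trace to the within-class trace from a Gram matrix alone; the function name is hypothetical, and the sketch assumes the within-class trace is nonzero.

```python
def class_separability(K, labels):
    """tr(S_b) / tr(S_w) in feature space, computed from the Gram matrix K."""
    n = len(K)
    diag = sum(K[i][i] for i in range(n))        # sum of ||phi(x_i)||^2
    grand = sum(sum(row) for row in K) / n       # N * ||overall mean||^2
    per_class = 0.0                              # sum_c n_c * ||class mean||^2
    for c in set(labels):
        idx = [i for i, y in enumerate(labels) if y == c]
        per_class += sum(K[i][j] for i in idx for j in idx) / len(idx)
    tr_sw = diag - per_class
    tr_sb = per_class - grand
    return tr_sb / tr_sw
```

As a sanity check with a linear kernel on the one-dimensional samples 0, 1, 4, 5 (the first two in one class, the last two in another), the class means are 0.5 and 4.5 around an overall mean of 2.5, which gives a between-class trace of 16, a within-class trace of 1, and a ratio of 16.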
Finally, we extend the noniterative AKPCA algorithm using the kernel class separability measure with a semisupervised strategy. The proposed noniterative MKPCA technique is given in Algorithm 2.
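The weighting idea can be sketched as follows, assuming that the separability score of each subkernel has already been computed on the labeled training samples. Normalizing the scores so that the weights sum to one is an assumption made here for readability, and the function name is hypothetical; this is not Algorithm 2 itself.

```python
def combine_kernels(kernels, scores):
    """Weight each Gram matrix by its (normalized) separability score."""
    total = sum(scores)
    weights = [s / total for s in scores]
    n = len(kernels[0])
    combined = [[sum(w * K[i][j] for w, K in zip(weights, kernels))
                 for j in range(n)]
                for i in range(n)]
    return weights, combined
```

A subkernel with a low score then contributes little to the combined matrix, which is how a weak kernel is prevented from dominating the extracted features.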

4. Experiments
In this section, we investigate the performance of the proposed MKPCA algorithm compared with a number of conventional and state-of-the-art techniques on two Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) hyperspectral data sets. Our experiments are conducted on a machine with an Intel Core i5-2410M CPU at 2.30 GHz and 8 GB DDR-III RAM.
4.1. Data Sets and Experimental Setup
The first data set is an airborne remote sensing scene captured by the AVIRIS sensor over northwest Indiana on June 12, 1992. The Indian Pines data has 16 labeled classes, with 145 lines per scene and 145 pixels per line. Originally, the scene has 220 spectral bands (10 nm spectral bandwidth from 0.4 to 2.5 μm); after discarding the water absorption and noise bands, based on [44, 45], only 159 bands were used in the experiments.
The second data set is airborne hyperspectral data acquired by the AVIRIS sensor at 18 m spatial resolution over Kennedy Space Center (KSC) during March 1996. Noisy bands and water absorption bands are removed. The remaining HSI data has 176 bands for 13 wetland and upland classes. Figure 1 shows the RGB image of Indian Pines and the false color image of the KSC. Table 1 summarizes the data sets used in our experiments. All samples in each data set are rescaled to a fixed range, as suggested in [46].
Figure 1: (a) RGB image of Indian Pines; (b) false color image of the KSC.
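A rescaling step of this kind could look like the sketch below. The target range of 0 to 1 and the choice to scale each spectral band independently are assumptions made for illustration; the function name is hypothetical.

```python
def minmax_scale(samples, lo=0.0, hi=1.0):
    """Rescale every band (column) of the sample matrix to [lo, hi]."""
    bands = list(zip(*samples))
    mins = [min(b) for b in bands]
    maxs = [max(b) for b in bands]
    scaled = []
    for s in samples:
        scaled.append([
            # Constant bands carry no information and are mapped to lo.
            lo + (hi - lo) * (v - mn) / (mx - mn) if mx > mn else lo
            for v, mn, mx in zip(s, mins, maxs)
        ])
    return scaled
```

In practice the minima and maxima would be estimated on the training pixels and then reused for the test pixels, so that both sets share the same scale.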
The PCA and KPCAs are implemented using the SIMFEAT toolbox [47]. In each experiment, a single kernel is selected for the KPCA. In addition to the Gaussian radial basis function (RBF) kernel, we have employed three more kernels (see Table 2). Before solving the eigenvalue problem, the parameter $\sigma$ in the RBF, Laplacian, and Cauchy kernels should be selected or optimized. Unless otherwise stated, the kernel parameter is set according to (15), which is computed from the centroid of the total training data [48]. The kernel parameter is not optimized and is the same for each kernel given in Table 2. The aim of combining non-optimized kernels is nevertheless to yield a better FE technique for classification.
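For orientation, the four kernel families named in this section can be written as below. These are commonly used textbook forms, not the exact parameterizations of Table 2, and the centroid-based bandwidth (mean Euclidean distance of the training samples to their centroid) is only one plausible reading of the heuristic; all function names are hypothetical.

```python
import math

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def rbf(x, y, sigma):
    # Assumed form: exp(-||x - y||^2 / (2 sigma^2))
    return math.exp(-sq_dist(x, y) / (2.0 * sigma ** 2))

def laplacian(x, y, sigma):
    # Assumed form: exp(-||x - y|| / sigma)
    return math.exp(-math.sqrt(sq_dist(x, y)) / sigma)

def cauchy(x, y, sigma):
    # Assumed form: 1 / (1 + ||x - y||^2 / sigma^2)
    return 1.0 / (1.0 + sq_dist(x, y) / sigma ** 2)

def histogram_intersection(x, y):
    # Parameter-free: sum of band-wise minima (for non-negative inputs)
    return sum(min(a, b) for a, b in zip(x, y))

def centroid_sigma(samples):
    """Assumed heuristic: mean distance of the samples to their centroid."""
    n = len(samples)
    centroid = [sum(col) / n for col in zip(*samples)]
    return sum(math.sqrt(sq_dist(s, centroid)) for s in samples) / n
```

Only the histogram intersection kernel has no scale parameter, which is consistent with the statement that a parameter must be chosen for the RBF, Laplacian, and Cauchy kernels.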
Figure 2 shows the leading part of the cumulative eigenvalue curves obtained after eigendecomposition for each method. According to the cumulative eigenvalues of PCA, two principal components reach 99% of the total variance for Indian Pines, whereas three principal components are needed to reach 99% for KSC. According to these results, the reduced dimensions of Indian Pines and KSC for the classification experiment are set to 2 and 3, respectively. However, as pointed out in Section 1, hyperspectral information cannot be represented using only second-order statistics. Figure 2 shows that more kernel principal components (KPCs) are needed than with PCA to capture the same proportion of variance. Note that the total number of components with PCA is equal to the number of bands, i.e., 159 for Indian Pines, while for KPCA it is equal to the number of training samples, i.e., 2594, which is significantly higher. For the Indian Pines data set, the first 11, 51, 31, and 33 KPCs are needed to reach 99% of the cumulative variance with the RBF, Laplacian, Cauchy, and histogram intersection (HIST) kernels, respectively. For the KSC data set, 8 KPCs are needed with the RBF, 56 with the Laplacian, 18 with the Cauchy, and 55 with the HIST kernel to reach the same level. In the case of AKPCA and MKPCA, p is set to 1 for kernel selection. For the Indian Pines data set, 35 adaptive KPCs and 20 multiple KPCs contain 99% of the information, compared with only 14 adaptive KPCs and 12 multiple KPCs for the KSC.
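Choosing the number of components from the cumulative eigenvalue curve can be sketched in a few lines. The function name is hypothetical, and the eigenvalues in the usage note are made-up illustrative numbers, not values from the experiments.

```python
def components_for_variance(eigenvalues, threshold=0.99):
    """Smallest number of leading components whose eigenvalues reach the
    given share of the total (non-positive eigenvalues are ignored)."""
    vals = sorted((v for v in eigenvalues if v > 0), reverse=True)
    total = sum(vals)
    running = 0.0
    for count, v in enumerate(vals, start=1):
        running += v
        if running / total >= threshold:
            return count
    return len(vals)
```

For instance, a spectrum of `[95.0, 4.5, 0.3, 0.2]` needs two components, since the first alone covers 95% and the first two cover 99.5% of the total.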
In order to demonstrate the first principal components (PCs) more efficiently, a subimage of size 100 × 100 in the KSC hyperspectral cube is selected. The first PCs for all of the methods are depicted in Figure 3.
Figure 3: First components for a 100 × 100 subimage of the KSC scene. (a) Original; (b) 1st PC; (c) 1st KPC-RBF; (d) 1st KPC-Lap; (e) 1st KPC-Cau; (f) 1st KPC-HIST; (g) 1st AKPC; (h) 1st MKPC.
After FE, an SVM classifier is employed for classification. For nonlinear SVMs, we have used the RBF kernel, which is formulated in Table 2. The classification experiments and the optimization of the SVM parameters, C and σ, are carried out using LIBSVM [49] with 5-fold cross validation. Since SVMs are designed to solve binary problems, various approaches have been proposed for multiclass situations such as remote sensing applications. The most popular approaches for multiclass classification are one-against-all (1AA) and one-against-one (1A1). In this paper, we have applied the 1AA strategy for each class. Each test sample is finally assigned to the class whose output score is maximum.
Finally, we compare the proposed MKPCA algorithm against five state-of-the-art dimension reduction algorithms, i.e., linear discriminant analysis (LDA) [50], LPP, probabilistic PCA (pPCA) [51], RP, and t-SNE. LDA, LPP, pPCA, and t-SNE are implemented using the MATLAB toolbox for dimensionality reduction [52], and the RP algorithm is designed based on Wang's work [53].
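The final labeling rule of the one-against-all strategy reduces to picking the class with the largest decision score. The sketch below assumes the per-class scores have already been produced by the binary classifiers; the function name and the class names in the usage note are hypothetical.

```python
def predict_one_against_all(scores_by_class):
    """Return the class whose binary classifier gives the largest score."""
    return max(scores_by_class, key=scores_by_class.get)
```

Given scores such as `{"corn": -0.3, "woods": 1.2, "hay": 0.4}`, the sample is labeled `"woods"`.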
4.2. Comparison with KPCA and AKPCA
The original data sets, termed as raw, are also classified for comparisons. Tables 3 and 4 compare the performance of all models numerically (class accuracies and overall accuracy (OA) in percentages) and statistically (kappa test) for the Indian Pines and the KSC data sets, respectively.
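Both reported summary measures follow directly from a confusion matrix. The sketch below computes overall accuracy and Cohen's kappa; the function name is hypothetical, and the matrix in the usage note is a made-up two-class example rather than data from Tables 3 and 4.

```python
def overall_accuracy_and_kappa(confusion):
    """confusion[i][j] = number of samples of true class i predicted as j."""
    n = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / n
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Agreement expected by chance from the marginal totals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa
```

For the matrix `[[45, 5], [10, 40]]`, 85 of 100 samples lie on the diagonal, the chance agreement is 0.5, and kappa is therefore (0.85 − 0.5) / (1 − 0.5) = 0.7.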
Inspection of Table 3 reveals that AKPCA outperforms PCA and all four KPCAs. Further analysis shows that KPCA performs significantly better than conventional PCA. Regarding the OAs, it is clear that MKPCA-based classification produces more accurate results than AKPCA-based classification. Among the kernel functions, the RBF kernel gives the best results for KPCA, as seen in Table 3.
The results for the KSC data set are reported in Table 4. Regarding the PCA and KPCA results, FE does not improve the accuracies significantly. The comparison between KPCA and PCA shows that KPCA performs better than PCA in terms of classification accuracy. Moreover, classification of the AKPCA features is more accurate than that obtained with any of the individual KPCs. As with the previous experiment, the best results are obtained with MKPCA. Figures 4 and 5 show the available labeled scenes and the classification maps of all models for the Indian Pines and KSC data sets, respectively.
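Figure 4: Labeled scene and classification maps for Indian Pines. (a) Labeled scene; (b) Raw; (c) PCA; (d) KPCA-RBF; (e) KPCA-Lap; (f) KPCA-Cau; (g) KPCA-HIST; (h) AKPCA; (i) MKPCA.
Figure 5: Labeled scene and classification maps for KSC. (a) Labeled scene; (b) Raw; (c) PCA; (d) KPCA-RBF; (e) KPCA-Lap; (f) KPCA-Cau; (g) KPCA-HIST; (h) AKPCA; (i) MKPCA.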
In the last experiment, we increase the number of KPCAs in both AKPCA and MKPCA by using different scale parameters with the same kernel. Tables 3 and 4 show that the best single kernel differs between the data sets. Therefore, we adopt seven RBF kernel functions for Indian Pines and seven Cauchy kernel functions for KSC, with scale parameters spread over a range around a central value. The central parameter is determined by (15). The SVM is again employed for classification after FE. The number of retained eigenvalues for each method is again chosen at the 99% cumulative variance level. Table 5 summarizes the classification accuracies of this experiment. The results show that the MKPCA-based features are better than the individual KPCA and AKPCA features on all data sets, no matter which kernel parameter is applied.
4.3. MKPCA versus Other Dimension Reduction Algorithms
In this section, we compare our method (MKPCA) with five other FE methods, i.e., LDA, LPP, pPCA, RP, and t-SNE. MKPCA is constructed with the subkernels listed in Table 2, and the kernel parameters are determined from (15). Different dimensionalities of the new subspaces are tested with the SVM classifier across the two data sets. A set of values is generated independently for the subspace dimension. The classification accuracy is reported for each model, and the results are plotted in Figure 6.
Figure 6: Classification accuracy versus subspace dimension for the compared FE methods. (a) Indian Pines; (b) KSC.
Inspection of Figure 6 reveals that the proposed method consistently outperforms the competing FE methods for multiclass classification in higher dimensions. For instance, if the number of extracted features is set to 50, MKPCA improves over the best competing method, RP, by 5.19% in terms of OA on Indian Pines and by 3.78% on KSC. Figure 6 also shows that the t-SNE method is highly stable with respect to changes in dimension. Comparing the methods, we also observe that the performance of LDA and pPCA is limited for both data sets. For lower dimensions (i.e., when the new dimension is smaller than 10), the best features are produced by t-SNE, which is also the most time-consuming method. The rest of the methods are sorted as RP, LDA, LPP, pPCA, and MKPCA in ascending order of average computation time.
5. Conclusion
In this paper, a novel semisupervised KPCA framework named multiple KPCA (MKPCA) is proposed for effective feature extraction of hyperspectral images. It applies an ensemble strategy to favor good candidate kernels during nonlinear projections. A noniterative algorithm is developed to perform feature extraction and kernel combination simultaneously, based on a kernel class separability criterion. In terms of the number of kernels, KPCA uses only one base kernel with predefined parameter(s) (if any). In terms of kernel quality, AKPCA has no procedure to evaluate the effectiveness of the kernels. MKPCA overcomes these drawbacks of both KPCA and AKPCA.
The dimension-reduced HSI data is classified by nonlinear SVMs to compare the classification performance of several models. Experiments on two real HSI data sets demonstrate that the best kernel type varies with the data (see Tables 3 and 4). In the first test, KPCA performs better than conventional PCA. An overall evaluation of the dimension reduction performance of the PCA, KPCA, AKPCA, and MKPCA techniques shows that MKPCA is more successful than the others. In the second experiment, we have employed seven candidate kernel functions using different kernel parameters for each data set. These KPCAs are then used to construct the AKPCA and MKPCA. Experiments on the AVIRIS data sets confirm that MKPCA outperforms the individual KPCAs and AKPCA in terms of both OA and Kappa coefficient. Moreover, the comparative results in Section 4.3 demonstrate that MKPCA achieves classification accuracy that is superior to or competitive with the other state-of-the-art unsupervised FE methods.
The results clearly show that semisupervised learning of kernels with MKPCA increases the robustness of non-optimized KPCAs. The most important limitation of MKPCA is probably its computational complexity, which is related to the number of samples used for constructing the kernel matrix. Therefore, our future work aims to reduce this complexity. It is also possible to extend the proposed method to a selective approach which eliminates weak kernels before feature extraction.
Data Availability
The Indian Pines and KSC data that support the findings of this study are, respectively, available at https://engineering.purdue.edu/~biehl/ and http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes.
Conflicts of Interest
The author declares that they have no conflicts of interest.
Acknowledgments
The authors would like to thank Daoqiang Zhang and Zhi-Hua Zhou for providing a part of the source code for the AKPCA.