Research Article  Open Access
Huiwu Luo, Yuan Yan Tang, Lina Yang, "Subspace Learning via Local Probability Distribution for Hyperspectral Image Classification", Mathematical Problems in Engineering, vol. 2015, Article ID 145136, 17 pages, 2015. https://doi.org/10.1155/2015/145136
Subspace Learning via Local Probability Distribution for Hyperspectral Image Classification
Abstract
Processing hyperspectral images (HSI) is computationally demanding, owing both to the high dimensionality of the data and to its highly correlated structure, so effective processing and analysis of HSI remains difficult. Dimensionality reduction has proved to be a powerful tool for high dimensional data analysis, and local Fisher’s linear discriminant analysis (LFDA) is an effective method for HSI processing. In this paper, a novel approach, called PDLFDA, is proposed to overcome the weakness of LFDA. PDLFDA emphasizes the probability distribution (PD) in LFDA: the maximum distance is replaced with the local variance in the construction of the weight matrix, and the class prior probability is applied to compute the affinity matrix. The proposed approach increases the discriminant ability of the transformed features in the low dimensional space. Experimental results on the Indian Pines 1992 data indicate that the proposed approach significantly outperforms the traditional alternatives.
1. Introduction
With the rapid advancement of remote sensing technology, techniques for high dimensional data analysis have moved forward. Driven by the demand for automatic processing of remote sensing data in very high dimensional spaces, a series of analytical methods and toolkits have emerged one after another. A hyperspectral image (HSI) typically has hundreds or even thousands of electromagnetic spectral bands for each pixel, and these bands are often highly correlated. To make full use of the rich spectrum and to enable effective processing of HSI data, it is often crucial to extract useful features and thereby avoid the negative effects caused by redundant data. Dimensionality reduction is an efficient technique for eliminating the redundancy among data samples. It also removes the effects of irrelevant features and simultaneously “selects” or “extracts” the features that are beneficial to precise classification. To be specific, the aim of dimensionality reduction is to decrease computational complexity and ameliorate statistical ill-conditioning by discarding redundant features that potentially deteriorate classification performance [1]. Nevertheless, how to suppress redundancy while preserving the most valuable features remains an open topic in the community of high dimensional data analysis.
Dimensionality reduction techniques can be roughly categorized into linear and nonlinear approaches. The linear approaches include principal component analysis (PCA) [2], random projection (RP) [3], linear discriminant analysis (LDA) [4], and locality preserving projection (LPP), whereas the nonlinear approaches include isometric mapping (Isomaps) [5, 6], diffusion maps (DMaps) [7], and locally linear embedding (LLE) [8].
The common drawback of nonlinear embedding methods is that they are too expensive to compute on HSI data when the number of samples becomes large. For instance, Isomaps measures the distance between data samples by the geodesic distance rather than the Euclidean distance, that is, the classical straight-line distance. However, the theory of Isomaps is established on the training samples and relies heavily on the assumption that the data are distributed on a manifold. Meanwhile, the mapping found by Isomaps is implicit. For new data points, the geodesic distances have to be recomputed on the new training set to obtain the low dimensional embedding, since Isomaps provides no explicit expression for mapping new data points. Such computation is clearly complex and inapplicable to the large volume of HSI data. For this reason, Isomaps is impractical for the dimensionality reduction of HSI data. A similar drawback occurs in the construction of LLE [7]. Discovering the intrinsic manifold of the data structure has recently become a trend, and although the theory is still under development [9], some achievements have been reported in many research articles [10].
In contrast, the linear approaches deal with this issue efficiently [11, 12]. PCA, an unsupervised approach, takes the directions of largest global scatter as the best projection directions, with the aim of minimizing the least square reconstruction error of the data points [13]. Due to its “unsupervised” nature, the learning procedure is often blind and the projection direction found by PCA is usually not optimal for classification [14]. LDA is a supervised methodology that takes advantage of the learning purpose [15]: it seeks the direction that minimizes the classification error. However, the within-class scatter matrix of LDA is often singular when only a small number of samples is available [16]. Consequently, the optimal solution of LDA cannot be computed and the projection direction cannot be obtained. These drawbacks limit the wide adoption of LDA [4]. To cope with this issue, derived discriminant analyses that put additional constraints on the objective function [17] have been proposed in several papers [18–20], for example, Joint Global and Local Discriminant Analysis (JGLDA) [21]. What these methods have in common is that they are easy to compute and implement, and the mapping is explicit. Despite their simple models, they have shown efficiency in most cases.
In general, linear algorithms therefore have more advantages for the dimensionality reduction of HSI data. However, conventional linear approaches, such as PCA, LDA, and LPP, assume that the data samples follow a Gaussian or Gaussian mixture distribution. This assumption often fails [22], since the distribution of real HSI data tends to be multimodal rather than unimodal. To be specific, the distribution of HSI data is usually unknown [23], and neither a single Gaussian model nor a Gaussian mixture model can capture the distribution of all landmarks of the HSI data, since the landmarks from different classes are multimodal [24]. In this case, the conventional methodologies work poorly. In view of this, some methods extend the idea of LDA; for example, Sugiyama [25] proposed Local Fisher Linear Discriminant Analysis (LFDA) for multimodal clusters. LFDA combines the supervised nature of LDA with the local description of LPP, so that the optimal projection is obtained under the constraint of multimodal samples. Li et al. [1] applied LFDA with maximum likelihood estimation (MLE), support vector machine (SVM), and Gaussian mixture model (GMM) classifiers to HSI data. As reported in their paper, LFDA is superior not only in computational time but also in classification accuracy. In short, LFDA is especially appropriate for the landmark classification of HSI data. Nevertheless, the conventional LFDA ignores the distribution of data samples in the construction of the affinity matrix.
In LFDA, the computation of the affinity matrix is important. There are clearly many different ways to define an affinity matrix, but the heat kernel derived from LPP has been shown to result in very effective locality preserving properties [26]. In this way, the local scaling of the data samples in the $K$-nearest neighborhood is utilized, where $K$ is a self-tuning predefined parameter. To simplify the calculation of the parameters, [1, 27, 28] employ a fixed value of $K$ in their experiments. Such a calculation may ignore the distribution of data samples in the construction of the affinity matrix. In fact, summarizing the local distribution by the distance between a sample and its $K$th nearest neighbor may be unreasonable, and results based on this simplification may contain errors.
Thus, in this paper, a novel approach is proposed to overcome the weakness of conventional LFDA. By adopting the local variance of the local patch instead of the farthest distance for the weight matrix, and the class prior probability for the affinity matrix, the weight matrix of the proposed algorithm takes into account both the distribution of the HSI data samples and the objective function of the HSI data after dimension reduction. This approach is called PDLFDA because the probability distribution (PD) is used in the LFDA algorithm. To be specific, PDLFDA incorporates two key points: (1) the class prior probability is applied to compute the affinity matrix; (2) the distribution of the local patch is represented by the “local variance” instead of the “farthest distance” to construct the weight matrix. The proposed approach essentially increases the discriminant ability of the transformed features in the low dimensional space. The pattern found by PDLFDA is expected to be more accurate, to coincide with the character of HSI data, and to be conducive to HSI classification.
The rest of this paper is organized as follows. Section 2 introduces the basic concepts of the conventional linear approaches related to our work, namely Fisher’s linear discriminant analysis (LDA), locality preserving projection (LPP), and local Fisher discriminant analysis (LFDA). The proposed algorithm, which is the core of this paper, is developed and formalized in Section 3. Experimental results with comparisons on a real HSI dataset are provided in Section 4. Finally, we conclude our work in Section 5.
2. Related Work
The purpose of linear approaches is to find an optimal projection direction along which the information of the embedded features is preserved as much as possible. To formulate our problem, let $x_i \in \mathbb{R}^{d}$ be a $d$-dimensional feature in the original space and let $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$ be the samples. For supervised learning, let $y_i \in \{1, 2, \ldots, c\}$ be the label of $x_i$; the label set of all samples is then $Y = \{y_1, y_2, \ldots, y_n\}$. Suppose that there are $c$ classes in all and that the number of samples in the $l$th class is $n_l$, which fulfils $\sum_{l=1}^{c} n_l = n$; that is, the total number of samples is the sum over all classes. Let $x_i^{(l)}$ be the $i$th sample of the $l$th class. The sample mean of class $l$ is then $\mu_l = \frac{1}{n_l} \sum_{i=1}^{n_l} x_i^{(l)}$, and the center of all samples is $\mu = \frac{1}{n} \sum_{i=1}^{n} x_i$. Suppose that the data set in the $d$-dimensional space is distributed on a low dimensional subspace. A general linear discriminant problem is to find a transformation $T \in \mathbb{R}^{d \times r}$ that maps the $d$-dimensional data into an $r$-dimensional subspace ($r < d$) by $z_i = T^{\top} x_i$, such that each $z_i$ represents $x_i$ without losing useful information. Different methods pursue the transformation matrix $T$ with different objective functions, resulting in different algorithms.
2.1. Fisher’s Linear Discriminant Analysis (LDA)
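LDA introduces the within-class scatter matrix $S^{(w)}$ and the between-class scatter matrix $S^{(b)}$ to describe the distribution of the data samples:
$$S^{(w)} = \sum_{l=1}^{c} \sum_{i:\, y_i = l} (x_i - \mu_l)(x_i - \mu_l)^{\top}, \tag{1}$$
$$S^{(b)} = \sum_{l=1}^{c} n_l\, (\mu_l - \mu)(\mu_l - \mu)^{\top}. \tag{2}$$
The Fisher criterion seeks a transformation $T$ that maximizes the between-class scatter while minimizing the within-class scatter. This can be achieved by optimizing the following objective function:
$$T_{\mathrm{LDA}} = \arg\max_{T \in \mathbb{R}^{d \times r}} \operatorname{tr}\left( (T^{\top} S^{(w)} T)^{-1}\, T^{\top} S^{(b)} T \right). \tag{3}$$
It is implicitly assumed that $T^{\top} S^{(w)} T$ is full rank. Under this assumption, the problem reduces to finding the generalized eigenvectors that solve
$$S^{(b)} \varphi = \lambda\, S^{(w)} \varphi. \tag{4}$$
Finally, the solution is given by $T_{\mathrm{LDA}} = [\varphi_1, \varphi_2, \ldots, \varphi_r]$, the eigenvectors associated with the $r$ largest eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r$. Since the rank of the between-class scatter matrix $S^{(b)}$ is at most $c - 1$, there are at most $c - 1$ meaningful features in conventional LDA. To deal with this issue, a regularization procedure is essential in practice.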
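The LDA procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the synthetic three-class data, and the small `ridge` regularizer are assumptions introduced here, and the data matrix is assumed to hold one sample per column.

```python
import numpy as np

def lda_scatter(X, y):
    """Return (Sw, Sb) for data X of shape (d, n) and integer labels y of shape (n,)."""
    d = X.shape[0]
    mu = X.mean(axis=1, keepdims=True)             # center of all samples
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for l in np.unique(y):
        Xl = X[:, y == l]                          # samples of class l
        mu_l = Xl.mean(axis=1, keepdims=True)      # class mean
        Sw += (Xl - mu_l) @ (Xl - mu_l).T
        Sb += Xl.shape[1] * (mu_l - mu) @ (mu_l - mu).T
    return Sw, Sb

def lda_directions(Sw, Sb, r, ridge=1e-8):
    """Top-r eigenpairs of Sw^{-1} Sb. `ridge` is an assumed regularizer, not from the text."""
    d = Sw.shape[0]
    Sw_reg = Sw + ridge * np.trace(Sw) / d * np.eye(d)
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw_reg, Sb))
    order = np.argsort(-vals.real)[:r]             # largest eigenvalues first
    return vals.real[order], vecs.real[:, order]

# Synthetic toy data: 3 Gaussian classes in 5 dimensions (illustration only).
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 30)
X = rng.normal(size=(5, 90)) + 3.0 * rng.normal(size=(5, 3))[:, y]
Sw, Sb = lda_scatter(X, y)
lam, T = lda_directions(Sw, Sb, r=2)
Z = T.T @ X                                        # embedded samples, shape (2, 90)
```

With three classes only two directions carry discriminant information, which is why the sketch asks for `r=2`.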
2.2. Locality Preserving Projection (LPP)
A drawback of LDA is that it does not consider the local structure among data points [29], whereas the distribution of real HSI data is often multimodal. Locality preserving projection (LPP) meets this requirement [30]. The goal of LPP is to preserve the local structure of neighboring points. Toward this goal, a graph is modeled explicitly to describe the relationship among $K$-nearest neighbors. Let $A \in \mathbb{R}^{n \times n}$ denote the affinity matrix, where $A_{ij}$ represents the similarity between points $x_i$ and $x_j$. The larger the value of $A_{ij}$, the closer the relationship between $x_i$ and $x_j$. A simple and effective way to define the affinity matrix is given by
$$A_{ij} = \begin{cases} \exp\left( -\|x_i - x_j\|^2 / \sigma^2 \right), & x_i \in \mathrm{KNN}(x_j, K) \text{ or } x_j \in \mathrm{KNN}(x_i, K), \\ 0, & \text{otherwise}, \end{cases} \tag{5}$$
where $\|\cdot\|^2$ denotes the squared Euclidean distance, $\sigma$ is a tuning parameter, and $\mathrm{KNN}(x, K)$ represents the $K$ nearest neighbors of $x$ under parameter $K$.
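The transformation matrix of LPP is obtained from the following criterion [31]:
$$T_{\mathrm{LPP}} = \arg\min_{T \in \mathbb{R}^{d \times r}} \frac{1}{2} \sum_{i,j=1}^{n} A_{ij}\, \|T^{\top} x_i - T^{\top} x_j\|^2 \quad \text{subject to } T^{\top} X D X^{\top} T = I, \tag{6}$$
where $D$ is a diagonal matrix whose entries are the column sums (or, equivalently, the row sums, since $A$ is symmetric) of $A$; that is, $D_{ii} = \sum_{j} A_{ij}$. The constraint in (6) removes the arbitrary scaling of the solution and excludes degenerate solutions.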
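The solution of the LPP problem can be obtained by solving the generalized eigenvalue problem
$$X L X^{\top} \varphi = \lambda\, X D X^{\top} \varphi, \tag{7}$$
where $L = D - A$ denotes the graph Laplacian matrix of spectral analysis, which can be viewed as a discrete version of the Laplace-Beltrami operator on a compact Riemannian manifold [29]. Finally, the transformation matrix is given by $T_{\mathrm{LPP}} = [\varphi_1, \varphi_2, \ldots, \varphi_r]$, the eigenvectors that correspond to the $r$ smallest eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_r$.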
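As a rough sketch of the LPP steps above (a neighborhood heat-kernel affinity, the graph Laplacian, and the generalized eigenproblem), the following NumPy code may help. The function names are illustrative, samples are assumed to be stored as columns without duplicates, and the heat kernel is assumed to be scaled by $\sigma^2$; the eigenproblem is solved by Cholesky whitening, which presumes that $X D X^{\top}$ is positive definite.

```python
import numpy as np

def knn_heat_affinity(X, K, sigma):
    """Symmetric K-nearest-neighbor heat-kernel affinity for X of shape (d, n)."""
    sq = np.sum(X**2, axis=0)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)  # squared distances
    np.fill_diagonal(D2, 0.0)
    n = X.shape[1]
    # Column 0 of each sorted row is the point itself, so columns 1..K are its neighbors.
    nn = np.argsort(D2, axis=1)[:, 1:K + 1]
    mask = np.zeros((n, n), dtype=bool)
    mask[np.repeat(np.arange(n), K), nn.ravel()] = True
    mask |= mask.T                                  # x_i in KNN(x_j) or x_j in KNN(x_i)
    return np.exp(-D2 / sigma**2) * mask

def lpp(X, K, sigma, r):
    """Directions for the r smallest eigenvalues of X L X^T phi = lambda X D X^T phi."""
    A = knn_heat_affinity(X, K, sigma)
    D = np.diag(A.sum(axis=1))
    L = D - A                                       # graph Laplacian
    XLX, XDX = X @ L @ X.T, X @ D @ X.T
    C = np.linalg.cholesky(XDX)                     # assumes X D X^T is positive definite
    M = np.linalg.solve(C, np.linalg.solve(C, XLX).T)   # C^{-1} (X L X^T) C^{-T}
    vals, U = np.linalg.eigh((M + M.T) / 2)         # eigenvalues in ascending order
    T = np.linalg.solve(C.T, U[:, :r])              # satisfies T^T X D X^T T = I
    return T, vals[:r], A
```

A call such as `lpp(X, K=5, sigma=2.0, r=2)` returns the projection, the selected eigenvalues, and the affinity matrix; the parameter values are arbitrary examples.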
2.3. Local Fisher Discriminant Analysis (LFDA)
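Local Fisher discriminant analysis (LFDA) [32] measures the “weights” of two data points by their distance, and the affinity matrix is then calculated from these weights. The “pairwise” representation of the within-class and between-class scatter matrices is very important for LFDA. Following simple algebraic steps, the within-class scatter matrix (1) of LDA can be transformed into the following form:
$$S^{(w)} = \frac{1}{2} \sum_{i,j=1}^{n} W^{(w)}_{ij}\, (x_i - x_j)(x_i - x_j)^{\top}, \tag{8}$$
where
$$W^{(w)}_{ij} = \begin{cases} 1/n_l, & y_i = y_j = l, \\ 0, & y_i \ne y_j. \end{cases} \tag{9}$$
Let $S^{(m)} = \sum_{i=1}^{n} (x_i - \mu)(x_i - \mu)^{\top}$ be the total mixed scatter matrix of LDA; then we obtain
$$S^{(b)} = S^{(m)} - S^{(w)} = \frac{1}{2} \sum_{i,j=1}^{n} W^{(b)}_{ij}\, (x_i - x_j)(x_i - x_j)^{\top}, \tag{10}$$
where
$$W^{(b)}_{ij} = \begin{cases} 1/n - 1/n_l, & y_i = y_j = l, \\ 1/n, & y_i \ne y_j. \end{cases} \tag{11}$$
LFDA is obtained by weighting the pairwise data points:
$$\tilde{S}^{(w)} = \frac{1}{2} \sum_{i,j=1}^{n} \tilde{W}^{(w)}_{ij}\, (x_i - x_j)(x_i - x_j)^{\top}, \qquad \tilde{S}^{(b)} = \frac{1}{2} \sum_{i,j=1}^{n} \tilde{W}^{(b)}_{ij}\, (x_i - x_j)(x_i - x_j)^{\top}, \tag{12}$$
where $\tilde{W}^{(w)}$ and $\tilde{W}^{(b)}$ denote the weight matrices of the pairwise points for the within-class samples and the between-class samples, respectively:
$$\tilde{W}^{(w)}_{ij} = \begin{cases} A_{ij}/n_l, & y_i = y_j = l, \\ 0, & y_i \ne y_j, \end{cases} \tag{13}$$
$$\tilde{W}^{(b)}_{ij} = \begin{cases} A_{ij}\,(1/n - 1/n_l), & y_i = y_j = l, \\ 1/n, & y_i \ne y_j, \end{cases} \tag{14}$$
where $A_{ij}$ indicates the affinity matrix. The construction of $A$ is critical for classification accuracy; it is therefore elaborated further in the following section.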
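The pairwise representation can be checked numerically. The sketch below builds the LDA pairwise weights from the labels, scales them by an affinity matrix in the way LFDA is usually stated (Sugiyama's formulation is assumed here), and evaluates the weighted pairwise sum directly. All function names are illustrative, and the direct evaluation is only meant for small sample sizes.

```python
import numpy as np

def lda_pairwise_weights(y):
    """Weights W^(w), W^(b) of the pairwise form of the LDA scatter matrices."""
    n = y.size
    same = y[:, None] == y[None, :]                  # True where y_i == y_j
    counts = np.array([np.sum(y == l) for l in y])   # n_l for the class of each sample
    Ww = np.where(same, 1.0 / counts[:, None], 0.0)
    Wb = np.where(same, 1.0 / n - 1.0 / counts[:, None], 1.0 / n)
    return Ww, Wb

def lfda_weights(A, y):
    """LFDA weights: same-class pairs are additionally scaled by the affinity A."""
    Ww, Wb = lda_pairwise_weights(y)
    same = y[:, None] == y[None, :]
    return A * Ww, np.where(same, A * Wb, Wb)

def pairwise_scatter(X, W):
    """0.5 * sum_ij W_ij (x_i - x_j)(x_i - x_j)^T, evaluated directly."""
    diff = X[:, :, None] - X[:, None, :]             # shape (d, n, n)
    return 0.5 * np.einsum("ij,aij,bij->ab", W, diff, diff)
```

With an all-ones affinity the LFDA weights coincide with the LDA weights, so `pairwise_scatter` then reproduces the ordinary within-class and between-class scatter matrices.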
3. Proposed Scheme
The calculation of (13) and (14) is very important to the performance of LFDA. There are many ways to compute the affinity matrix $A$. The simplest is to set every entry of $A$ to a constant; that is,
$$A_{ij} = a, \tag{15}$$
where $a$ is a nonnegative real number. Under this construction, however, (13) and (14) reduce to the classical Fisher’s linear discriminant analysis.
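Another construction adopts the heat kernel derived from LPP:
$$A_{ij} = \exp\left( -\frac{\|x_i - x_j\|^2}{\sigma^2} \right), \tag{16}$$
where $\sigma$ is a tuning parameter. Here the affinity is determined solely by the distance between data points, and this computation is too simple to represent the locality of data patches. A more adaptive version [26] of (16) is proposed as follows:
$$A_{ij} = \begin{cases} \exp\left( -\|x_i - x_j\|^2 / \sigma^2 \right), & x_i \in \mathrm{KNN}(x_j, K) \text{ or } x_j \in \mathrm{KNN}(x_i, K), \\ 0, & \text{otherwise}. \end{cases} \tag{17}$$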
Compared with the former computation, (17) involves only the $K$ nearest data points, which makes it computationally fast and light. Moreover, the property of local patches can be characterized by (17). However, the affinity defined in (16) and (17) is computed with a single global parameter $\sigma$; thus, it may be apt to overfit the training points and be sensitive to noise. Furthermore, the density of HSI data points may vary from patch to patch. Hence, a local scaling technique is proposed in LFDA to cope with this issue [29], where the affinity is given by
$$A_{ij} = \exp\left( -\frac{\|x_i - x_j\|^2}{\sigma_i \sigma_j} \right), \tag{18}$$
where $\sigma_i$ denotes the local scaling around the corresponding sample $x_i$ with the following definition:
$$\sigma_i = \|x_i - x_i^{(K)}\|, \tag{19}$$
where $x_i^{(K)}$ represents the $K$th nearest neighbor of $x_i$, $\|\cdot\|^2$ denotes the squared Euclidean distance, and $K$ is a self-tuning predefined parameter.
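The local scaling construction can be sketched as follows. The code assumes the usual form of local scaling, in which each sample is scaled by the Euclidean distance to its $K$th nearest neighbor and the squared pairwise distance is divided by the product of the two scales; the function name is illustrative and the samples (columns of `X`) are assumed to contain no duplicates, so that every scale is positive.

```python
import numpy as np

def local_scaling_affinity(X, K):
    """A_ij = exp(-||x_i - x_j||^2 / (sigma_i * sigma_j)) with sigma_i the K-th neighbor distance."""
    sq = np.sum(X**2, axis=0)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)  # squared distances
    np.fill_diagonal(D2, 0.0)
    # Column 0 of each sorted row is the point itself (distance 0), so column K
    # holds the squared distance to the K-th nearest neighbor.
    sigma = np.sqrt(np.sort(D2, axis=1)[:, K])
    return np.exp(-D2 / np.outer(sigma, sigma)), sigma
```

Because each pair is normalized by its own scales, dense and sparse regions of the data receive comparable affinities without a global kernel width.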
To simplify the calculation, many studies use a fixed value of $K$, and a recommended value of $K$ is studied in [1, 28]. Note that $\sigma_i$ is used to represent the distribution of the local data around sample $x_i$. However, the above work ignored the actual distribution around each individual sample. Adjacent HSI pixels differ only slightly; thus, the spectra of neighboring landmarks are highly similar. That is, HSI pixels with similar spectra tend to belong to the same landmark. This phenomenon indicates that the adjacency of local patches lies not only in the spectral space but also in the spatial space. For a local point, calculating $\sigma_i$ from the distance to its $K$th nearest neighbor alone is therefore not fully correct.
An evident example is illustrated in Figure 1, where two groups of points have different distributions. In group (a), most neighbor points are close to point $x_i$, while, in group (b), most neighbor points are far from point $x_i$. However, the two cases yield the same measurement according to (19), because the distance between point $x_i$ and its $K$th nearest neighbor is the same in both distributions, as shown in Figures 1(a) and 1(b). This example indicates that summarizing the local distribution by the distance between the sample and its $K$th nearest neighbor is unreasonable; results based on this simplification may contain errors.
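Figure 1: Two groups of neighbor points with different distributions around the same point: (a) most neighbors are close to the point; (b) most neighbors are far from the point.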
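The situation in Figure 1 can be reproduced with a small numerical example. The neighbor distances below are synthetic values chosen for illustration (they are not taken from the figure), with seven neighbors per patch as an assumed neighborhood size. Both patches share the same distance to the farthest listed neighbor, yet the spread of their squared distances differs clearly.

```python
import numpy as np

# Distances from a point to its 7 nearest neighbors in two synthetic patches.
# Both patches have the same distance to the 7th neighbor (2.0) by construction.
dist_a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 2.0])   # most neighbors close to the point
dist_b = np.array([1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0])   # most neighbors far from the point

def kth_distance_scale(dist):
    """Local scaling taken as the distance to the K-th (farthest listed) neighbor."""
    return np.max(dist)

def std_scale(dist):
    """Local scaling taken as the standard deviation of the squared neighbor distances."""
    return np.std(dist**2)

same_kth = kth_distance_scale(dist_a) == kth_distance_scale(dist_b)
different_std = not np.isclose(std_scale(dist_a), std_scale(dist_b))
```

Here `same_kth` and `different_std` both evaluate to `True`: the farthest-neighbor distance cannot tell the two patches apart, whereas a statistic of the whole neighborhood can.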
Based on the discussion above, a novel approach, called PDLFDA, is proposed to overcome the weakness of LFDA. To be specific, PDLFDA incorporates two key points: (1) the class prior probability is applied to compute the affinity matrix; (2) the distribution of the local patch is represented by the “local variance” instead of the “farthest distance” to construct the weight matrix. The proposed approach essentially increases the discriminant ability of the transformed features in the low dimensional space. The pattern found by PDLFDA is expected to be more accurate, to coincide with the character of HSI data, and to be conducive to HSI classification.
In this way, a more sophisticated construction of the affinity matrix, derived from [29], is proposed in (20), where $p(l)$ stands for the class prior probability of class $l$ and $\sigma_i$ indicates the local variance. Note that the denominator of (13) is $n_l$, which would cancel out the prior effect if $p(l)$ itself were used as the factor (the construction of $p(l)$ is given in (21)). Each part of this construction plays the same role as in the original formulation. For example, the last item plays the role of the intraclass discriminating weight; however, the product of the factors may reach zero if the squared Euclidean distance is very small for some data points. For this case, an extra item is added to the intraclass discriminating weight to prevent truncation of accuracy. In this way, our construction can be viewed as an integration of the class prior probability, the local weight, and the discriminating weight. It is expected to preserve both the local neighborhood structure and the class information, and to share the advantages detailed in the original work.
It is clear that (20) contains two new factors compared with the LFDA method: the class prior probability $p(l)$ and the local variance $\sigma_i$.
Suppose that sample $x_i$ belongs to class $l$, that is, $y_i = l$; the probability of class $l$ can then be calculated by
$$p(l) = \frac{n_l}{n}, \tag{21}$$
where $n_l$ is the number of samples in class $l$, $n$ denotes the total number of samples, and $\sum_{l=1}^{c} p(l) = 1$.
Please note that the extra item in (20) is used to prevent the rounding error produced by the first two items and to keep the total value of $A_{ij}$ from reaching the minimum. Here, $\sigma_i$ denotes the local scaling around $x_i$. In this paper, the local scaling is measured by the standard deviation of the local squared distances. Assume that $x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(K)}$ are the $K$ nearest samples of $x_i$; the squared distance between $x_i$ and $x_i^{(k)}$ is then given by
$$d_i^{(k)} = \|x_i - x_i^{(k)}\|^2, \quad k = 1, 2, \ldots, K.$$
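The corresponding mean can be defined as
$$\bar{d}_i = \frac{1}{K} \sum_{k=1}^{K} d_i^{(k)},$$
where $\|\cdot\|^2$ represents the squared Euclidean distance and $K$ is a predefined parameter for which a fixed value is recommended. The standard deviation can be calculated as
$$\sigma_i = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left( d_i^{(k)} - \bar{d}_i \right)^2}.$$
Note that, in the above equation, the factor $1/K$ is a constant that can be moved outside the sum. Thus, an equivalent formula is given by
$$\sigma_i = \frac{1}{\sqrt{K}} \sqrt{\sum_{k=1}^{K} \left( d_i^{(k)} - \bar{d}_i \right)^2}.$$
A similar procedure applies to $\sigma_j$.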
Comparing (19) with (27), it is noticeable that (28) holds. Compared with the former definitions, our definition has at least the following advantages. (i) By incorporating the prior probability of each class with the local technique, the proposed scheme is expected to benefit classification accuracy. (ii) The local patches in (26) are described by the local standard deviation rather than the absolute diversity in (19), which measures the local variance of the data samples more accurately. (iii) Compared with the global calculation, the proposed calculation is carried out on local patches, which helps to avoid overfitting. (iv) The proposed local scaling technique matches the character of HSI data and is therefore more applicable to hyperspectral image processing in real applications.
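The two ingredients discussed above, the class prior probability and the local standard deviation of the squared neighbor distances, can be sketched as follows. Only these two ingredients are shown; how they are combined into the final affinity is not reproduced here. The function names are illustrative, samples are assumed to be the columns of `X` without duplicates, and the population form of the standard deviation (division by $K$) is an assumption.

```python
import numpy as np

def class_priors(y):
    """p(l) = n_l / n for every class label l occurring in y."""
    labels, counts = np.unique(y, return_counts=True)
    return dict(zip(labels.tolist(), (counts / y.size).tolist()))

def local_std_scale(X, K):
    """sigma_i = standard deviation of the squared distances from x_i to its K nearest neighbors."""
    sq = np.sum(X**2, axis=0)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)  # squared distances
    np.fill_diagonal(D2, 0.0)
    knn_d2 = np.sort(D2, axis=1)[:, 1:K + 1]         # drop column 0 (the point itself)
    mean = knn_d2.mean(axis=1, keepdims=True)        # mean squared neighbor distance
    return np.sqrt(np.mean((knn_d2 - mean) ** 2, axis=1))
```

Unlike the distance to a single neighbor, this scale changes whenever any of the $K$ neighbors moves, so it reflects how the whole neighborhood is distributed.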
Based on the affinity defined above, an extended affinity matrix can be defined in a similar way. Our definition only provides a heuristic exploration for reference. The affinity can be made sparser, for example, by introducing the idea of $K$-nearest neighborhoods [31].
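The optimal solution of the improved scheme is obtained by maximizing the following criterion:
$$T_{\mathrm{PDLFDA}} = \arg\max_{T \in \mathbb{R}^{d \times r}} \operatorname{tr}\left( (T^{\top} \tilde{S}^{(w)} T)^{-1}\, T^{\top} \tilde{S}^{(b)} T \right). \tag{29}$$
Evidently, (29) has a form similar to (3). This suggests that the transformation can be obtained simply by solving the generalized eigenvalue problem $\tilde{S}^{(b)} \varphi = \lambda\, \tilde{S}^{(w)} \varphi$. Moreover, let $M$ be an $r$-dimensional invertible square matrix. It is clear that $TM$ is also an optimal solution of (29), so the optimal solution is not uniquely determined. Let $\varphi_i$ be the generalized eigenvector corresponding to eigenvalue $\lambda_i$; that is, $\tilde{S}^{(b)} \varphi_i = \lambda_i\, \tilde{S}^{(w)} \varphi_i$. To cope with the non-uniqueness, a rescaling procedure is adopted [25]. Each eigenvector is rescaled to satisfy the following constraint:
$$\varphi_i^{\top} \tilde{S}^{(w)} \varphi_j = \begin{cases} 1, & i = j, \\ 0, & i \ne j. \end{cases} \tag{30}$$
Then, each eigenvector is weighted by the square root of its associated eigenvalue. The transformation matrix of the proposed scheme is finally given by
$$T_{\mathrm{PDLFDA}} = \left[ \sqrt{\lambda_1}\, \varphi_1, \sqrt{\lambda_2}\, \varphi_2, \ldots, \sqrt{\lambda_r}\, \varphi_r \right], \tag{31}$$
with the eigenvalues in descending order: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r$.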
For a new test point $x'$, the projected point in the new feature space is obtained by $z' = T^{\top} x'$; it can then be analyzed further in the transformed space.
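The solution step (generalized eigenproblem, rescaling, eigenvalue weighting, and projection of new points) can be sketched as below. The sketch takes the two scatter matrices as given and assumes the within-class matrix is positive definite, so a Cholesky factor can be used for whitening; this choice of solver and the function names are assumptions of the example.

```python
import numpy as np

def fisher_projection(Sb, Sw, r):
    """Solve Sb phi = lambda Sw phi, rescale so that phi_i^T Sw phi_j = delta_ij,
    then weight each phi_i by sqrt(lambda_i). Assumes Sw is positive definite."""
    C = np.linalg.cholesky(Sw)                        # Sw = C C^T
    M = np.linalg.solve(C, np.linalg.solve(C, Sb).T)  # C^{-1} Sb C^{-T}
    vals, U = np.linalg.eigh((M + M.T) / 2)
    order = np.argsort(-vals)[:r]                     # descending eigenvalues
    lam = np.clip(vals[order], 0.0, None)             # guard against tiny negative values
    Phi = np.linalg.solve(C.T, U[:, order])           # Phi^T Sw Phi = I
    return Phi * np.sqrt(lam), lam

def project(T, x):
    """Embed one sample of shape (d,) or a batch of shape (d, m) with z = T^T x."""
    return T.T @ x
```

Because the mapping is an explicit matrix, a new sample is embedded with a single matrix product and no retraining is required.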
According to the above analysis, we can design an algorithm, called the PDLFDA algorithm, to perform the proposed method. A detailed description of this algorithm can be found in the appendix (Algorithm 2). A summary of its calculation steps is presented in Algorithm 1.


The advantage of PDLFDA is discussed as follows.
Firstly, to investigate the rank of the between-class scatter matrix of LDA, $S^{(b)}$ can be rewritten as
$$S^{(b)} = \sum_{l=1}^{c} v_l v_l^{\top}, \qquad v_l = \sqrt{n_l}\, (\mu_l - \mu),$$
where the vectors satisfy $\sum_{l=1}^{c} \sqrt{n_l}\, v_l = \sum_{l=1}^{c} n_l (\mu_l - \mu) = 0$ and therefore span at most $c - 1$ dimensions. Thereby,
$$\operatorname{rank}\left( S^{(b)} \right) \le c - 1.$$
It is easy to infer that the rank of the between-class scatter matrix is at most $c - 1$; thus, at most $c - 1$ meaningful features can be extracted by LDA. Thanks to the affinity matrix $A$, the reduced subspace of the proposed PDLFDA can, in contrast with conventional LDA, have any dimension. On the other hand, the classical local Fisher’s linear discriminant only weights sample pairs in the same class, while our method also takes into account the sample pairs in different classes. Hence, the proposed method is more flexible and its results are more adaptive. The objective function of the proposed method is quite similar to that of conventional LDA; hence, the optimal solution is obtained in almost the same way as for conventional LDA, which indicates that the method is also simple to implement and easy to revise.
To further explore the relationship of LDA and PDLFDA, we now rewrite the objective functions of LDA and PDLFDA, respectively:
$$\max_{T} \operatorname{tr}\left( T^{\top} S^{(b)} T \right) \quad \text{subject to } T^{\top} S^{(w)} T = I, \tag{34}$$
$$\max_{T} \operatorname{tr}\left( T^{\top} \tilde{S}^{(b)} T \right) \quad \text{subject to } T^{\top} \tilde{S}^{(w)} T = I. \tag{35}$$
This implies that LDA tries to maximize the between-class scatter while constraining the within-class scatter to a certain level. However, this restriction is rigid, and no relaxation is imposed. When the data are multimodal or of unknown modality, LDA often fails. On the other hand, benefiting from the flexible design of the affinity matrix $A$, PDLFDA gains more freedom in (35). That is, the separability of PDLFDA is more distinct and it retains more degrees of freedom than conventional LDA; thus, our method is expected to be more robust and to perform significantly better.
For large scale data sets, we discuss a scheme that can accelerate the computation of the within-class scatter matrix $\tilde{S}^{(w)}$. Because our algorithm puts a penalty on the affinity matrix for samples of different classes when constructing the between-class scatter matrix, the acceleration of that matrix is left for further discussion.
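The within-class scatter matrix can be reformulated as
$$\tilde{S}^{(w)} = X \tilde{L}^{(w)} X^{\top}. \tag{36}$$
Here,
$$\tilde{L}^{(w)} = \tilde{D}^{(w)} - \tilde{W}^{(w)}, \qquad \tilde{D}^{(w)}_{ii} = \sum_{j=1}^{n} \tilde{W}^{(w)}_{ij}. \tag{37}$$
$\tilde{W}^{(w)}$ becomes block diagonal if all samples are sorted according to their labels. This property implies that $\tilde{D}^{(w)}$ and $\tilde{L}^{(w)}$ are also block diagonal matrices. Hence, computing $\tilde{S}^{(w)}$ through (36) is much more efficient. Similarly, $\tilde{S}^{(b)}$ can also be formulated as
$$\tilde{S}^{(b)} = X \tilde{L}^{(b)} X^{\top}, \qquad \tilde{L}^{(b)} = \tilde{D}^{(b)} - \tilde{W}^{(b)}, \qquad \tilde{D}^{(b)}_{ii} = \sum_{j=1}^{n} \tilde{W}^{(b)}_{ij}.$$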
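The acceleration described above rests on two facts that are easy to verify numerically: a weighted pairwise scatter equals a Laplacian-type matrix product when the weights are symmetric, and weights that vanish across classes allow the product to be accumulated one class block at a time. The sketch below illustrates both; the function names are illustrative and the weight matrix is assumed to be symmetric.

```python
import numpy as np

def scatter_from_laplacian(X, W):
    """X (D - W) X^T with D_ii = sum_j W_ij.
    For symmetric W this equals 0.5 * sum_ij W_ij (x_i - x_j)(x_i - x_j)^T."""
    return X @ (np.diag(W.sum(axis=1)) - W) @ X.T

def within_scatter_blockwise(X, y, W):
    """Same result when W is zero for pairs from different classes:
    each class block is processed separately, so no n x n product is formed."""
    d = X.shape[0]
    S = np.zeros((d, d))
    for l in np.unique(y):
        idx = np.flatnonzero(y == l)                 # sample indices of class l
        S += scatter_from_laplacian(X[:, idx], W[np.ix_(idx, idx)])
    return S
```

The blockwise routine does not require the samples to be sorted by label, since each class block is extracted by index.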
Nevertheless, $\tilde{L}^{(b)}$ is dense and cannot be simplified further. Even so, the simplified computation of $\tilde{S}^{(w)}$ saves part of the running time. In this paper, we adopt the above procedure to accelerate $\tilde{S}^{(w)}$ and compute $\tilde{S}^{(b)}$ in the normal way. In addition to the locality structure, some papers show that other properties, for example, marginal information, are also important and should be preserved in the reduced space. The theory of extended LDA and LPP algorithms has developed rapidly in recent years. Yan et al. [33] summarized these algorithms in a graph embedding framework and also proposed the marginal Fisher analysis (MFA) algorithm under this framework.
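In MFA, the criterion is characterized by intraclass compactness and interclass marginal separability, which replace the within-class scatter and the between-class scatter, respectively. The intraclass relationship is reflected by an intrinsic graph constructed from the $K$-nearest neighbor data points in the same class, while the interclass separability is mirrored by a penalty graph computed for marginal points from different classes. Following this idea, the intraclass compactness is given as follows:
$$\tilde{S}_c = \sum_{i} \sum_{i \in N_K^{+}(j) \text{ or } j \in N_K^{+}(i)} \|T^{\top} x_i - T^{\top} x_j\|^2 = 2 \operatorname{tr}\left( T^{\top} X (D - W) X^{\top} T \right),$$
where
$$W_{ij} = \begin{cases} 1, & i \in N_K^{+}(j) \text{ or } j \in N_K^{+}(i), \\ 0, & \text{otherwise}. \end{cases}$$
Here, $N_K^{+}(i)$ represents the index set of the $K$ nearest neighbors of $x_i$ from the same class and $D$ is the diagonal matrix of the row sums (or column sums) of $W$: $D_{ii} = \sum_{j} W_{ij}$. Interclass separability is indicated by a penalty graph whose term is expressed as follows: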