Abstract

A generalized linear discriminant analysis based on trace ratio criterion algorithm (GLDA-TRA) is derived to extract features for classification. With the proposed GLDA-TRA, a set of orthogonal features can be extracted in succession. Each newly extracted feature is the optimal feature that maximizes the trace ratio criterion function in the subspace orthogonal to the space spanned by the previously extracted features.

1. Introduction

Linear discriminant analysis (LDA) [1–3] has been proposed as a class separability measure and has been intensively used to reduce the dimensionality of a classification problem as well as to improve the generalization capability of a pattern classifier. Generally speaking, the LDA method optimizes a ratio criterion of the between-class distance to the within-class distance constructed from the available learning data. Such optimization can be realized by solving a generalized eigenvalue problem of the between-class and within-class scatter matrices [4].

In our opinion, there are three main problems with LDA methods. The first problem is its difficulty in dealing with high-dimensional data, where the number of observed samples is much lower than the samples' feature dimension [5]. Many methods have been studied and proposed to address this problem; see, for example, the linear discriminant feature selection (LDFS) [6], Sparse Discriminant Analysis [5, 7], and Sparse Tensor Discriminant Analysis [8].

The second problem is the well-known undersampled problem [9] in the LDA method, in which the scatter matrices may become singular due to insufficient samples. The solutions to this problem have also been well investigated, for example, in the Regularized LDA [10, 11] using regularization techniques [12, 13], the Penalized LDA [14], the Pseudo Fisher Linear Discriminant [15], the Generalized Singular Value Decomposition [16], the Uncorrelated LDA [17], and the Orthogonal LDA [17].

Basically the above two problems are quite similar and can be unified as the same problem, which has also been extensively investigated in the above schemes. The third problem, however, is that the LDA method can only extract a quite limited number of features for classification [4]. For example, in two-class classification, one can only find one nonzero eigenvalue (extracted feature), as the between-class scatter matrix is a rank-one matrix. To the best of our knowledge, there is currently no good way to deal with this problem.

We focus on the third problem in this paper. A generalized LDA based on the trace ratio criterion [18–23] is proposed to overcome this problem, and an algorithm called the generalized linear discriminant analysis based on trace ratio criterion algorithm (GLDA-TRA) is derived to extract features from the input feature space. The algorithm first extracts a feature which maximizes the trace ratio criterion by solving a generalized eigenvalue problem. It is shown that such a generalized eigenvalue problem is the same as the generalized eigenvalue problem of LDA. Then, the learning data are projected onto a subspace orthogonal to the space spanned by the extracted features. In that orthogonal subspace, the algorithm continues to extract a feature which maximizes the proposed trace ratio criterion. This process continues and, in this way, a set of orthogonal features is obtained iteratively. It is proven that each newly extracted feature is the optimal feature that maximizes the trace ratio criterion in the subspace orthogonal to the space spanned by the previously extracted features. Finally, the extracted features are shown to give a sequence of trace ratios with monotonically decreasing magnitudes.

2. Problem Formulation

Let $(x, y)$ be a sample, where $x \in \mathbb{R}^{n}$ denotes the $n$-dimensional feature space and $y \in \{1, 2, \ldots, c\}$ is a label set. Let $x_{ij}$ denote the $j$th sample in the $i$th class. The within-class scatter matrix $S_w$ and the between-class scatter matrix $S_b$ [24] are, respectively, defined as
$$S_w = \sum_{i=1}^{c} \sum_{j=1}^{n_i} (x_{ij} - m_i)(x_{ij} - m_i)^{T}, \qquad S_b = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^{T}, \tag{1}$$
where $n_i$ is the number of samples in the $i$th class, $N = \sum_{i=1}^{c} n_i$ is the total number of samples, $m_i$ is the sample mean of the $i$th class, $m$ is the sample mean of all classes, and the notation $(\cdot)^{T}$ means matrix transpose. Without loss of generality, we assume that $N > n$. Then $S_w$ can be constructed as a full rank matrix of rank $n$, while $S_b$ is of rank at most $c-1$. So, in this paper, $S_w$ is considered as a symmetrical and positive definite matrix and $S_b$ is a symmetrical and nonnegative definite matrix. The LDA method extracts features, that is, the column vectors in a matrix $W$, in such a way that the ratio of the between-class scatter to the within-class scatter is maximized [25]. Consider
$$W^{*} = \arg\max_{W} \frac{\left|W^{T} S_b W\right|}{\left|W^{T} S_w W\right|}, \tag{2}$$
where $|\cdot|$ is the determinant of a matrix and $W^{*} = [w_1, w_2, \ldots, w_k]$ is the set of generalized eigenvectors of $S_b$ and $S_w$ corresponding to the $k$ largest generalized eigenvalues $\lambda_i$ such that
$$S_b w_i = \lambda_i S_w w_i, \quad i = 1, 2, \ldots, k. \tag{3}$$
Unfortunately, there are at most $c-1$ nonzero generalized eigenvalues, as the rank of $S_b$ is at most $c-1$. As such, for a two-class classification problem, LDA can only extract one feature.
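For concreteness, the scatter matrices in (1) can be computed directly from the labeled samples. The following is a minimal numpy sketch; the function name scatter_matrices and the array layout are our own illustrative choices, not part of the original formulation.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class (Sw) and between-class (Sb) scatter matrices as in (1).

    X : (N, n) array of samples, y : (N,) array of class labels.
    """
    m = X.mean(axis=0)                      # overall sample mean
    n = X.shape[1]
    Sw = np.zeros((n, n))
    Sb = np.zeros((n, n))
    for c in np.unique(y):
        Xc = X[y == c]                      # samples of the i-th class
        mc = Xc.mean(axis=0)                # class mean m_i
        Sw += (Xc - mc).T @ (Xc - mc)       # sum of (x_ij - m_i)(x_ij - m_i)^T
        d = (mc - m).reshape(-1, 1)
        Sb += Xc.shape[0] * (d @ d.T)       # n_i (m_i - m)(m_i - m)^T
    return Sw, Sb
```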

To overcome this problem, we hope that one can continue to extract features after the first $c-1$ features are extracted. Generally speaking, we hope to extract $w_{k+1}$ after the features $w_1, w_2, \ldots, w_k$ are extracted, with $k$ starting from $0$. We now use the trace ratio criterion function in [24, 26] to formulate the above-mentioned feature extraction problem. The optimization model we propose is as follows:
$$w_{k+1} = \arg\max_{w:\; W_k^{T} w = 0,\; \|w\| = 1} \frac{\operatorname{tr}\!\left([W_k, w]^{T} S_b [W_k, w]\right)}{\operatorname{tr}\!\left([W_k, w]^{T} S_w [W_k, w]\right)}, \tag{4}$$
where $W_k = [w_1, w_2, \ldots, w_k]$ is a matrix denoting all the features extracted so far (with $W_0$ being empty), $w$ is the feature to be determined in the space orthogonal to $\operatorname{span}(W_k)$, and $\operatorname{tr}(\cdot)$ denotes the trace of a matrix.

Remark 1. Later it can be shown that the first extracted vector $w_1$, which maximizes (4) in the whole space $\mathbb{R}^{n}$, is the same as the one found by LDA in (2). Here it is also necessary to point out that the orthogonality constraint $W_k^{T} w = 0$ is needed in our formulated problem (4). In (2), the numerator is the determinant of the matrix $W^{T} S_b W$, so implicitly there is a constraint that the columns of $W$ cannot be the same, because when two columns coincide the numerator would be zero. But in (4), the numerator is the trace of $[W_k, w]^{T} S_b [W_k, w]$, which does not exclude the possibility that the extracted features are the same. With the orthogonality constraint in (4), such a possibility is avoided.
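The criterion in (4) is easy to evaluate for a candidate feature. The following sketch (illustrative names only) computes the trace ratio of the augmented matrix $[W_k, w]$, assuming the scatter matrices from the earlier sketch:

```python
import numpy as np

def trace_ratio(Wk, w, Sb, Sw):
    """Trace ratio criterion of (4) for the augmented matrix [Wk, w].

    Wk : (n, k) matrix of already extracted features (k may be 0),
    w  : (n,) candidate feature, assumed orthogonal to the columns of Wk.
    """
    W = np.column_stack([Wk, w]) if Wk.size else w.reshape(-1, 1)
    return np.trace(W.T @ Sb @ W) / np.trace(W.T @ Sw @ W)
```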

3. Proposed Algorithm and Analysis

In this section, we present and analyze the proposed feature extraction algorithm. Our idea is summarized as follows. We first extract a feature $w_1$ by maximizing the trace ratio criterion function involving $S_b$ and $S_w$ in (4). When the currently extracted features become $W_k = [w_1, \ldots, w_k]$, let $\operatorname{span}(W_k)$ denote the space spanned by all linear combinations of the columns of $W_k$ and let $\operatorname{span}(W_k)^{\perp}$ denote the space orthogonal to $\operatorname{span}(W_k)$. Then, $S_b$ and $S_w$ are projected onto the subspace $\operatorname{span}(W_k)^{\perp}$ by using the projection operator $I - W_k W_k^{+}$, where $W_k^{+} = (W_k^{T} W_k)^{-1} W_k^{T}$ is the generalized matrix inverse of the column full rank matrix $W_k$. We continue the process to find $w_{k+1}$ by optimizing (4), until all $n$ features are extracted.
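The projection operator used above, together with its complement $W_k W_k^{+}$ (both formalized in Lemma 4 below), can be written compactly. The following sketch is illustrative only:

```python
import numpy as np

def orth_projectors(W):
    """Projectors for a full column rank matrix W (cf. Lemma 4):
    P projects onto span(W); I - P projects onto its orthogonal complement."""
    P = W @ np.linalg.inv(W.T @ W) @ W.T    # W (W^T W)^{-1} W^T = W W^+
    return P, np.eye(W.shape[0]) - P
```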

We first present the algorithm in Section 3.1. In Section 3.2, we will show how this algorithm is derived and then analyze its properties.

3.1. Proposed Algorithm

For convenience, we present the definition of a generalized eigenvalue as follows.

Definition 2. A number $\lambda$ is called a generalized eigenvalue of a matrix $A$ with respect to $B$ if it satisfies $A v = \lambda B v$ for a nonzero vector $v$, where $B$ is a positive definite symmetrical matrix. When $B = I$, $\lambda$ is a normal eigenvalue of $A$.

Now the algorithm is given as follows.

Initialization Step: construct the symmetrical matrices $S_b$ and $S_w$ as shown in (1) based on the available learning data, set $\tilde S_{b,1} = S_b$ and $\tilde S_{w,1} = S_w$, and take $W_0$ as an empty matrix.

Step $k$ ($k = 1, 2, \ldots, n$) is as follows.

(a) Calculation stage: find a unit vector $w_k$ in the direction of a generalized eigenvector which corresponds to the maximum eigenvalue $\lambda_k$ of the generalized eigenvalue problem of $\tilde S_{b,k}$ with respect to $\tilde S_{w,k}$. This can be achieved by the following process: do the Cholesky decomposition $\tilde S_{w,k} = Q_k Q_k^{T}$. Let $C_k = Q_k^{-1} \tilde S_{b,k} Q_k^{-T}$ and obtain its maximum eigenvalue $\lambda_k$ together with its corresponding eigenvector $u_k$. Choose $w_k = Q_k^{-T} u_k / \|Q_k^{-T} u_k\|$ and $W_k = [W_{k-1}, w_k]$.

(b) Update stage is as follows:
$$\tilde S_{b,k+1} = \left(I - W_k W_k^{+}\right) S_b \left(I - W_k W_k^{+}\right), \qquad \tilde S_{w,k+1} = \left(I - W_k W_k^{+}\right) S_w \left(I - W_k W_k^{+}\right) + \epsilon\, W_k W_k^{+}, \tag{6}$$
where $\epsilon$ is a sufficiently small positive number.

Replace $k$ by $k+1$ and repeat Step $k$ until all $n$ features are extracted in $W_n = [w_1, w_2, \ldots, w_n]$.
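The steps above translate directly into a short numerical routine. The following numpy sketch is our own illustrative implementation of the procedure (the function name glda_tra and the default value of the small constant $\epsilon$ are assumptions, not taken from the paper); it assumes $S_w$ is positive definite, as stated in Section 2.

```python
import numpy as np

def glda_tra(Sb, Sw, n_features, eps=1e-6):
    """Sketch of GLDA-TRA: extract orthonormal features one by one."""
    n = Sb.shape[0]
    I = np.eye(n)
    Sb_k, Sw_k = Sb.copy(), Sw.copy()       # S~_{b,1} = S_b, S~_{w,1} = S_w
    W = np.zeros((n, 0))                    # W_0 is empty
    ratios = []
    for _ in range(n_features):
        # Calculation stage: generalized eigenproblem solved via Cholesky
        Q = np.linalg.cholesky(Sw_k)        # S~_{w,k} = Q Q^T
        Qi = np.linalg.inv(Q)
        C = Qi @ Sb_k @ Qi.T                # C_k = Q^{-1} S~_{b,k} Q^{-T}
        vals, vecs = np.linalg.eigh(C)
        lam, u = vals[-1], vecs[:, -1]      # maximum eigenvalue and eigenvector
        w = Qi.T @ u
        w -= W @ (W.T @ w)                  # numerical safeguard: stay orthogonal to W
        w /= np.linalg.norm(w)              # unit feature vector w_k
        W = np.column_stack([W, w])
        ratios.append(lam)
        # Update stage (6); W has orthonormal columns, so W^+ = W^T
        P = I - W @ W.T
        Sb_k = P @ Sb @ P
        Sw_k = P @ Sw @ P + eps * (W @ W.T)
    return W, ratios
```

Here `ratios` collects the maximum generalized eigenvalues $\lambda_k$, which Theorem 11 below shows form a decreasing sequence.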

Remark 3. (1) As $W_k$ is orthonormal, (6) can be rewritten in the following recursive form:
$$\tilde S_{b,k+1} = \left(I - w_k w_k^{T}\right) \tilde S_{b,k} \left(I - w_k w_k^{T}\right), \qquad \tilde S_{w,k+1} = \left(I - w_k w_k^{T}\right) \tilde S_{w,k} \left(I - w_k w_k^{T}\right) + \epsilon\, w_k w_k^{T}. \tag{7}$$
Later in Lemma 5, it will be shown that $\tilde S_{w,k}$ for $k = 1, 2, \ldots, n$ can still be positive definite for a sufficiently small positive $\epsilon$. At each step $k$, $\tilde S_{w,k}$ is positive definite and the generalized eigenvalue of $\tilde S_{b,k}$ with respect to $\tilde S_{w,k}$ exists.
(2) The reason why we need to find the maximum eigenvalue of $\tilde S_{b,k}$ with respect to $\tilde S_{w,k}$ in the Calculation stage is shown in Lemma 8.
(3) Suppose that $k$ features $W_k$ have been extracted. Then Theorem 12 shows that $w_{k+1}$ is found to ensure that the trace ratio criterion function in (4) attains its maximum value.
(4) When $k = 1$, LDA and GLDA-TRA extract the same feature.
(5) GLDA-TRA extracts features one by one. When $k = n$, $\tilde S_{b,n+1}$ becomes a zero matrix, and the algorithm will not extract any more features.

3.2. Derivation and Analysis of GLDA-TRA

Lemma 4. For an arbitrary full column rank matrix $W$, $W W^{+}$ and $I - W W^{+}$ are projection operators which project a vector onto $\operatorname{span}(W)$ and $\operatorname{span}(W)^{\perp}$, respectively.

Proof. Suppose that $x$ belongs to $\operatorname{span}(W)$; then there exists a vector $a$ such that $x = W a$. It can be obtained that $W W^{+} x = W (W^{T} W)^{-1} W^{T} W a = W a = x$ and $(I - W W^{+}) x = x - W W^{+} x = 0$. This lemma holds.

Lemma 5. The matrices $\tilde S_{w,k}$ in (6) for $k = 1, 2, \ldots, n$ are positive definite for a sufficiently small positive $\epsilon$.

Proof. Note that $x^{T} S_w x > 0$ for any nonzero $x$, as $S_w$ is positive definite. Let $x_1 = W_{k-1} W_{k-1}^{+} x$ and $x_2 = (I - W_{k-1} W_{k-1}^{+}) x$. Then $x^{T} \tilde S_{w,k} x = x_2^{T} S_w x_2 + \epsilon\, x_1^{T} x_1$. As $x = x_1 + x_2 \neq 0$, $x_1$ and $x_2$ cannot be zero at the same time. Thus $x^{T} \tilde S_{w,k} x > 0$ for a sufficiently small positive number $\epsilon$.

Lemma 6. For matrices $A \in \mathbb{R}^{n \times m}$ and $B \in \mathbb{R}^{m \times n}$, define $f(A) = \operatorname{tr}(AB)$. Then $\partial f / \partial A = B^{T}$.

Proof. Let $A = [a_{ij}]$ and $B = [b_{ij}]$, where $a_{ij}$ is defined for $i = 1, \ldots, n$, $j = 1, \ldots, m$, and $b_{ij}$ for $i = 1, \ldots, m$, $j = 1, \ldots, n$. We have
$$f(A) = \operatorname{tr}(AB) = \sum_{i=1}^{n} \sum_{l=1}^{m} a_{il} b_{li}.$$
Then, for $i = 1, \ldots, n$, $j = 1, \ldots, m$, we get
$$\frac{\partial f}{\partial a_{ij}} = b_{ji}. \tag{10}$$
So (10) gives $\partial f / \partial A = B^{T}$.

Lemma 7. Let $A$ and $B$ be symmetrical matrices with $B$ positive definite, and let the trace ratio function be $r(w) = (w^{T} A w)/(w^{T} B w)$. The gradient is $\nabla r(w) = \dfrac{2}{w^{T} B w}\left(A w - r(w)\, B w\right)$.

Proof. Let $u(w) = w^{T} A w$ and $v(w) = w^{T} B w$. From Lemma 6 and the chain rule, $\nabla u = 2 A w$ and $\nabla v = 2 B w$. Then $\nabla r = \dfrac{v\, \nabla u - u\, \nabla v}{v^{2}} = \dfrac{2 A w\,(w^{T} B w) - 2 B w\,(w^{T} A w)}{(w^{T} B w)^{2}} = \dfrac{2}{w^{T} B w}\left(A w - r(w)\, B w\right)$.

Define $g_k(w) = \dfrac{w^{T} \tilde S_{b,k} w}{w^{T} \tilde S_{w,k} w}$, and we have the following result, which is used in the Calculation stage of step $k$.

Lemma 8. Assume that the maximum eigenvalue of $C_k = Q_k^{-1} \tilde S_{b,k} Q_k^{-T}$ is $\lambda_k$ with an eigenvector $u_k$, where $\tilde S_{w,k} = Q_k Q_k^{T}$. Then the unit vector $w_k = Q_k^{-T} u_k / \|Q_k^{-T} u_k\|$ is an extracted feature ensuring that $g_k(w)$ attains its maximum value, which is equal to $\lambda_k$.

Proof. Note that $g_k(w) \geq 0$ and $g_k(\alpha w) = g_k(w)$ for any $\alpha \neq 0$. Thus maximizing $g_k(w)$ amounts to finding a stationary point of $g_k$ with the largest value. From Lemma 7, we have
$$\nabla g_k(w) = \frac{2}{w^{T} \tilde S_{w,k} w}\left(\tilde S_{b,k} w - g_k(w)\, \tilde S_{w,k} w\right). \tag{11}$$
Then one can maximize $g_k$ by using an iterative method. Let $w(t+1) = w(t) + \mu\, \nabla g_k(w(t))$, where $t$ denotes the $t$th iteration and $\mu$ is a small positive constant. Then $g_k(w(t+1)) = g_k(w(t)) + \mu \left\|\nabla g_k(w(t))\right\|^{2} + o(\mu)$. Thus $\{g_k(w(t))\}$ is a nondecreasing sequence bounded above, and its limit exists. We have $g_k(w(t+1)) - g_k(w(t)) \to 0$, which implies that $\nabla g_k(w(t)) \to 0$. Hence $w(t)$ converges to an accumulation point $w^{*}$ that satisfies $\nabla g_k(w^{*}) = 0$. This gives
$$\tilde S_{b,k} w^{*} = g_k(w^{*})\, \tilde S_{w,k} w^{*}. \tag{12}$$
Let $\lambda = g_k(w^{*})$. From (12), we have $\tilde S_{b,k} w^{*} = \lambda \tilde S_{w,k} w^{*}$. That is, $\lambda$ is a generalized eigenvalue obtained from
$$\tilde S_{b,k} w = \lambda\, \tilde S_{w,k} w. \tag{13}$$
Note that each eigenvector in (13) is an accumulation point of $w(t)$, since it satisfies (12). Now we hope to convert the generalized eigenvalue problem of (13) to a normal eigenvalue problem. This is achieved by the Cholesky decomposition of $\tilde S_{w,k}$, which is given as $\tilde S_{w,k} = Q_k Q_k^{T}$, where $Q_k$ is a full rank lower triangular matrix. By doing this, (13) becomes
$$Q_k^{-1} \tilde S_{b,k} Q_k^{-T} \left(Q_k^{T} w\right) = \lambda\, Q_k^{T} w. \tag{14}$$
Defining $u = Q_k^{T} w$ and substituting it into (14), it can be obtained that $C_k u = \lambda u$, where $C_k = Q_k^{-1} \tilde S_{b,k} Q_k^{-T}$ is a symmetrical nonnegative definite matrix. Let the maximum eigenvalue and its corresponding eigenvector of $C_k$ be $\lambda_k$ and $u_k$. Finally, the optimal $w_k$ is the unit vector in the direction of $Q_k^{-T} u_k$, and $g_k(w_k) = \lambda_k$.
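As a quick numerical illustration of the Cholesky reduction used in the proof (with synthetic matrices, purely for illustration), the maximum eigenvalue of $Q^{-1} \tilde S_{b} Q^{-T}$ coincides with the maximum generalized eigenvalue of $\tilde S_{b}$ with respect to $\tilde S_{w}$; scipy's generalized symmetric eigensolver is used here only as a cross-check:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Sw = A @ A.T + 5 * np.eye(5)                 # synthetic positive definite "within" matrix
M = rng.standard_normal((5, 2))
Sb = M @ M.T                                 # synthetic low-rank "between" matrix

Q = np.linalg.cholesky(Sw)                   # Sw = Q Q^T
Qi = np.linalg.inv(Q)
lam_normal = np.linalg.eigvalsh(Qi @ Sb @ Qi.T)[-1]       # top eigenvalue of (14)
lam_generalized = eigh(Sb, Sw, eigvals_only=True)[-1]     # top eigenvalue of (13)
print(np.isclose(lam_normal, lam_generalized))            # True
```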

Theorem 9. Suppose that $w_k$ is the feature extracted at step $k$; then the trace ratio $g_{k+1}(w_k) = \dfrac{w_k^{T} \tilde S_{b,k+1} w_k}{w_k^{T} \tilde S_{w,k+1} w_k} = 0$ at step $k+1$.

Proof. Note that $(I - W_k W_k^{+}) w_k = 0$. We have $w_k^{T} \tilde S_{w,k+1} w_k = \epsilon\, w_k^{T} W_k W_k^{+} w_k = \epsilon > 0$, since $\tilde S_{w,k+1}$ is positive definite by Lemma 5. Also, $w_k^{T} \tilde S_{b,k+1} w_k = 0$, as $I - W_k W_k^{+}$ is a projection operator which projects a vector onto $\operatorname{span}(W_k)^{\perp}$ based on Lemma 4. Thus, $g_{k+1}(w_k) = 0$.

Remark 10. As seen in Theorem 9, if the feature $w_k$ has been extracted at step $k$, then $w_k$ is supposed to make the trace ratio $g_k(w)$ attain its maximum in this step, while at step $k+1$, after $\tilde S_{b,k}$ and $\tilde S_{w,k}$ are updated to $\tilde S_{b,k+1}$ and $\tilde S_{w,k+1}$, respectively, we have $g_{k+1}(w_k) = 0$. This means that the algorithm needs to find $w_{k+1}$ which can maximize $g_{k+1}(w)$, and, obviously, $w_{k+1}$ must be different from $w_k$.

Theorem 11. The sequence $\{\lambda_k\}$ produced by GLDA-TRA is a decreasing sequence.

Proof. Let $\Phi_k = \operatorname{span}(W_{k-1})^{\perp}$ denote the space to which the $k$th feature belongs; that is, $w_k$ is the feature extracted from $\Phi_k$ such that $g_k(w)$ attains its maximum. Now we consider $g_k(w)$ at step $k$ and $g_{k+1}(w)$ at step $k+1$ with the respective corresponding maximum eigenvalues $\lambda_k$ and $\lambda_{k+1}$. From Lemma 4 it can be obtained that $(I - W_k W_k^{+}) w = w$ when $w \in \Phi_{k+1}$, and $\tilde S_{w,k+1}$ is positive definite by Lemma 5. If $w \in \Phi_{k+1}$, then $w \in \Phi_k$ since $\Phi_{k+1} \subset \Phi_k$, and $g_{k+1}(w) = \dfrac{w^{T} \tilde S_{b,k+1} w}{w^{T} \tilde S_{w,k+1} w} = \dfrac{w^{T} S_b w}{w^{T} S_w w} = g_k(w)$. If $w \notin \Phi_{k+1}$, decompose $w = w_1 + w_2$ with $w_1 \in \operatorname{span}(W_k)$ and $w_2 \in \Phi_{k+1}$; then $g_{k+1}(w) = \dfrac{w_2^{T} S_b w_2}{w_2^{T} S_w w_2 + \epsilon\, w_1^{T} w_1} \leq g_{k+1}(w_2)$. Also, as $\lambda_k$ and $\lambda_{k+1}$ correspond to the maximum eigenvalues at steps $k$ and $k+1$, it is obvious that
$$\lambda_{k+1} = \max_{w \in \Phi_{k+1}} g_{k+1}(w) = \max_{w \in \Phi_{k+1}} g_k(w) \leq \max_{w \in \Phi_k} g_k(w) = \lambda_k. \tag{15}$$
Thus, it can be concluded that $\lambda_{k+1} \leq \lambda_k$ as $k$ increases from $1$ to $n-1$. In addition, we have $\lambda_{k+1} = g_{k+1}(w_{k+1}) = g_k(w_{k+1}) \leq g_k(w_k) = \lambda_k$. Thus, $\{\lambda_k\}$ is a decreasing sequence.

Theorem 12. The feature $w_{k+1}$ extracted by GLDA-TRA is the optimal feature maximizing the trace ratio criterion in (4) when $W_k$ has been extracted.

Proof. Note that $W_k^{T} w_{k+1} = 0$, as $w_{k+1}$ is orthonormal to $W_k$. Consider an arbitrary unit vector $w$ with $W_k^{T} w = 0$. Construct $V = [W_k, w]$ and $V_{k+1} = [W_k, w_{k+1}]$. Note that $V$ is an orthonormal matrix. Then, as long as $W_k^{T} w = 0$, we have
$$w^{T} S_b w = w^{T} \tilde S_{b,k+1} w, \qquad w^{T} S_w w = w^{T} \tilde S_{w,k+1} w. \tag{16}$$
Denote $a_k = \operatorname{tr}(W_k^{T} S_b W_k)$ and $b_k = \operatorname{tr}(W_k^{T} S_w W_k)$. Then $\operatorname{tr}(V^{T} S_b V) = a_k + w^{T} S_b w$ and $\operatorname{tr}(V^{T} S_w V) = b_k + w^{T} S_w w$. Thus, we have
$$\frac{\operatorname{tr}\left(V^{T} S_b V\right)}{\operatorname{tr}\left(V^{T} S_w V\right)} = \frac{a_k + w^{T} \tilde S_{b,k+1} w}{b_k + w^{T} \tilde S_{w,k+1} w}, \tag{17}$$
where $w$ is an arbitrary unit vector orthogonal to $\operatorname{span}(W_k)$. Similarly, it can be obtained that
$$\frac{\operatorname{tr}\left(V_{k+1}^{T} S_b V_{k+1}\right)}{\operatorname{tr}\left(V_{k+1}^{T} S_w V_{k+1}\right)} = \frac{a_k + w_{k+1}^{T} \tilde S_{b,k+1} w_{k+1}}{b_k + w_{k+1}^{T} \tilde S_{w,k+1} w_{k+1}}, \tag{18}$$
where $w_{k+1}$ is the feature extracted at step $k+1$. From Lemma 8, it is noted that $g_{k+1}(w)$ attains its maximum if and only if $w$ is parallel to the eigenvector corresponding to the maximum generalized eigenvalue of $\tilde S_{b,k+1}$ with respect to $\tilde S_{w,k+1}$. Since $w_{k+1}$ is parallel to this generalized eigenvector, as long as $W_k^{T} w = 0$ we have
$$g_{k+1}(w_{k+1}) \geq g_{k+1}(w). \tag{19}$$
That is to say, $w_{k+1}$ maximizes the ratio $w^{T} \tilde S_{b,k+1} w / w^{T} \tilde S_{w,k+1} w$ over all unit vectors orthogonal to $\operatorname{span}(W_k)$. So we have $\operatorname{tr}(V_{k+1}^{T} S_b V_{k+1}) / \operatorname{tr}(V_{k+1}^{T} S_w V_{k+1}) \geq \operatorname{tr}(V^{T} S_b V) / \operatorname{tr}(V^{T} S_w V)$; that is, $w_{k+1}$ maximizes the criterion in (4). Thus, this theorem holds.

Remark 13. Our main theoretical conclusions of this paper are shown in Theorems 9–12. The implication of Theorem 9 is summarized in Remark 10. In Theorem 11, the decrease of the trace ratio sequence means that the separability of the data set along each newly extracted feature decreases compared with the previously extracted features. This is the main result and contribution of this paper. Note that the trace ratio cost function is used to evaluate whether a feature is good or not. By employing our proposed method, one can obtain a number of orthogonal extracted features that can be greater than $c-1$, in which the separability along each earlier feature, measured by the trace ratio, is better than or at least equal to that along the next one. Theorem 12 gives a more concrete conclusion: if we already have $k$ features denoted as $W_k$, then the $(k+1)$th feature extracted by GLDA-TRA is the optimal feature in the space orthogonal to the space spanned by $W_k$. We present and prove Lemmas 4–8 as they are needed in proving Theorems 9–12. The theorems will be illustrated and verified in Section 4.

4. Simulation and Discussion

To illustrate our idea, three experiments are done. The first one is on the Iris data set, the second one is on the Ionosphere radar data set, and the third one is on an artificial data set.

Example 14. The Iris data set is a standard data set used to verify the performance of classification algorithms. There are 150 data points belonging to three classes: Setosa, Versicolor, and Virginica. Each class has 50 samples with four features: Sepal length, Sepal width, Petal length, and Petal width.
One can construct $S_b$ and $S_w$ based on the Iris data set. It can be checked that $\operatorname{rank}(S_w) = 4$ and $\operatorname{rank}(S_b) = 2$, so LDA can extract at most two features, while by applying GLDA-TRA to the Iris data set, we can extract features one by one, denoted as $w_1, w_2, w_3, w_4$. When $k = 4$, we obtain the matrix $W_4 = [w_1, w_2, w_3, w_4]$.
Obviously, $W_4^{T} W_4 = I$; that is, the extracted features are orthonormal. Figure 1 shows the distribution of the three classes of data points after projecting them onto each extracted feature $w_1$, $w_2$, $w_3$, and $w_4$. As seen in Figure 1, the separability of the data decreases in the directions of $w_1$, $w_2$, $w_3$, and $w_4$. One can also notice that the sequence $\{\lambda_k\}$ is strictly decreasing, as shown in Figure 2.
When using the extracted features to do classification via a support vector machine (SVM), both LDA and GLDA-TRA obtain good results in this example, as the data points are well separated even when they are projected onto a one-dimensional space. When we extract one feature, the accuracy is 98% for both LDA and GLDA-TRA, since they extract the same feature when $k = 1$ (see Remarks 1 and 3). When we extract $k$ features with our method for $k = 2, 3$, and $4$, the accuracies are, respectively, 98% when $k = 2$, 98.6667% when $k = 3$, and 98% when $k = 4$. More details are shown in Table 1, and even when the data points are well separable, it is still observed that GLDA-TRA can be better than LDA in this case.
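For readers who wish to reproduce an experiment of this kind, a hypothetical script built on the scatter_matrices and glda_tra sketches given earlier might look as follows (sklearn is assumed only for loading Iris and for the SVM; the evaluation protocol and the exact accuracies of Table 1 are not claimed to be reproduced by this sketch):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
Sw, Sb = scatter_matrices(X, y)              # from the earlier sketch
W, ratios = glda_tra(Sb, Sw, n_features=4)   # extract w1, ..., w4 one by one

for k in range(1, 5):
    Z = X @ W[:, :k]                         # project onto the first k features
    acc = cross_val_score(SVC(), Z, y, cv=5).mean()
    print(k, round(ratios[k - 1], 4), round(acc, 4))
```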

Example 15. This radar data is a 34-dimensional data set ($n = 34$, $c = 2$) with 351 samples, collected by a system in Goose Bay, Labrador (http://archive.ics.uci.edu/ml/datasets/Ionosphere). This system consists of a phased array of 16 high-frequency antennas with a total transmitted power on the order of 6.4 kilowatts. The targets were free electrons in the ionosphere. “Good” radar returns are those showing evidence of some type of structure in the ionosphere. “Bad” returns are those that do not; their signals pass through the ionosphere.
The comparisons of LDA and GLDA-TRA in this example are shown in Figure 3. It is seen that LDA and GLDA-TRA are the same when $k = 1$. But LDA cannot continue to extract more features, while the proposed GLDA-TRA still extracts features one by one. For a certain number of extracted features (see Figure 3), the classification accuracy attains its maximum, which means that the dimension of the radar data set can be reduced to that number of dimensions without losing any classification information.
In Theorem 11, it is shown that the trace ratio sequence over the extracted features, taken in order, is decreasing; this can also be verified in Figure 4, where both the trace ratio and its logarithm are plotted. From Theorem 12, we know that each newly extracted feature is the optimal feature that maximizes the trace ratio function in the subspace orthogonal to the space spanned by the previously extracted features, which is exactly what we claimed in the abstract.

Example 16. In the previous examples, both LDA and GLDA-TRA perform quite well even when only one feature is extracted. In this example, we construct a problem in which the data points are inseparable if they are projected onto a one-dimensional space. The two classes are generated from the norm of the sample vector perturbed by a variable following a normal distribution, so that most data points of one class locate around the surface of a sphere with a smaller radius while data points of the other class mostly locate around the surface of a sphere with a larger radius.
To do the experiment, two hundred data points that are equally distributed between the above two classes are generated. It is observed that LDA does not perform well if it is used to extract features for this problem. This is because LDA can only extract one feature and the data points are not well separable in any one-dimensional feature space. By using GLDA-TRA, when we extract two features ($k = 2$), we obtain a projection matrix $W_2 = [w_1, w_2]$. By employing $W_2$, one can then visualize the data points by projecting the data onto the two extracted features $w_1$ and $w_2$. Figure 5 shows the classification using SVM on this two-dimensional extracted feature space. With GLDA-TRA, the data points can be better separated if we extract more features. The comparisons of LDA and GLDA-TRA regarding the classification accuracy rate are shown in Table 2 and Figure 6.
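Since the exact radii and noise level of this example are not restated above, the following sketch generates a structurally similar two-class “concentric spheres” data set with assumed values (3 dimensions, radii 1 and 2, Gaussian radial noise), purely for illustration; it could then be fed to the scatter_matrices and glda_tra sketches from Section 3.

```python
import numpy as np

rng = np.random.default_rng(1)

def sphere_class(n_points, radius, noise=0.1, dim=3):
    """Points scattered around the surface of a sphere of the given radius."""
    d = rng.standard_normal((n_points, dim))
    d /= np.linalg.norm(d, axis=1, keepdims=True)        # random unit directions
    r = radius + noise * rng.standard_normal((n_points, 1))
    return r * d

# two hundred points, equally split between the two classes (assumed radii)
X = np.vstack([sphere_class(100, 1.0), sphere_class(100, 2.0)])
y = np.repeat([1, 2], 100)
```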

5. Conclusion

In this paper, a generalized linear discriminant analysis based on trace ratio criterion (GLDA-TRA) algorithm has been proposed. This is to overcome the problem that linear discriminant analysis (LDA) can only extract a limited number of features for classification. It is shown that, with GLDA-TRA, a set of orthogonal features can be extracted one by one. Each newly extracted feature is the optimal feature that maximizes the trace ratio criterion function in the subspace orthogonal to the space spanned by the previously extracted features. Finally, the extracted features are such that the trace ratio sequence of these features is decreasing in order. Experimental results also show the effectiveness of the proposed algorithm.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.