Mathematical Problems in Engineering

Volume 2013, Article ID 396780, 7 pages

http://dx.doi.org/10.1155/2013/396780

## Unsupervised Optimal Discriminant Vector Based Feature Selection Method

^{1}Faculty of Mechanical Engineering, Huaiyin Institute of Technology, Huai’an 223003, China

^{2}Department of Electrical and Electronic Engineering, University of Melbourne, Victoria, VIC 3010, Australia

Received 22 July 2013; Accepted 21 September 2013

Academic Editor: Baochang Zhang

Copyright © 2013 Su-Qun Cao and Jonathan H. Manton. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

An efficient unsupervised feature selection method based on the unsupervised optimal discriminant vector is developed to find the important features without using class labels. Features are ranked by an importance measurement derived from the unsupervised optimal discriminant vector in two steps. First, the fuzzy Fisher criterion is adopted as the objective function to derive the optimal discriminant vector without class labels. Second, a feature importance measurement based on the elements of this vector is defined to determine the importance of each feature. Features with low importance are removed from the feature subset. Experiments on a UCI dataset and on fault diagnosis data show that the proposed method is efficient and delivers reliable results.

#### 1. Introduction

Feature selection (FS) has become an active research topic in pattern recognition, machine learning, data mining, and intelligent fault diagnosis. It chooses a subset of the original features by removing redundant and noisy features from high-dimensional datasets in order to reduce computational cost, increase classification accuracy, and improve the comprehensibility of results.

In supervised FS algorithms, class labels are available, so feature subsets are evaluated using some function of prediction accuracy, and only those features that are related to or lead to the decision classes of the data under consideration are selected. There are numerous supervised feature selection methods [1–7], such as the Fisher criterion [1, 2], Relief [3], and Relief-F [4].

However, for many existing datasets, class labels are unknown or incomplete, because the large amount of data makes it difficult for humans to label each instance manually. Moreover, human labeling is expensive and subjective. This makes unsupervised dimensionality reduction important. Principal component analysis (PCA) [8] is often used in the unsupervised setting. However, PCA creates new features, the principal components, which are functions of the original features, and it is difficult to obtain an intuitive understanding of the data from the new features alone. Some unsupervised feature selection methods [8–14] have been proposed, such as SUD [9]. SUD, a sequential backward selection algorithm that determines the relative importance of variables for Unsupervised Data, uses an entropy similarity measurement to determine the importance of features with respect to the underlying clusters.

It is well known that the Fisher criterion, which yields an optimal discriminant vector, is commonly used for feature dimension reduction in the supervised setting. How to overcome the lack of class information and realize feature selection in the unsupervised setting is a question worth studying.

#### 2. An Overview of Optimal Discriminant Vector

The Fisher criterion is a discriminant criterion function that was first proposed by Fisher. It is based on the between-class scatter and the within-class scatter. By maximizing this criterion, one can obtain an optimal discriminant vector. After the samples are projected onto this vector, the within-class scatter is minimized and the between-class scatter is maximized [15].

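Given $c$ pattern classes $X^{(i)}$ ($i = 1, 2, \ldots, c$) in the pattern set $X$, which contains $N$ $d$-dimensional patterns, let $N_i$ be the number of patterns in the $i$th class, so that $N = \sum_{i=1}^{c} N_i$. The Fisher criterion is defined as follows:

$$J_F(\omega) = \frac{\omega^T S_b\, \omega}{\omega^T S_w\, \omega}, \tag{1}$$

where $S_b$ is the between-class scatter matrix denoted by

$$S_b = \sum_{i=1}^{c} \frac{N_i}{N} (m_i - \bar{x})(m_i - \bar{x})^T, \tag{2}$$

and $S_w$ is the within-class scatter matrix denoted by

$$S_w = \frac{1}{N} \sum_{i=1}^{c} \sum_{j=1}^{N_i} \left(x_j^{(i)} - m_i\right)\left(x_j^{(i)} - m_i\right)^T, \tag{3}$$

where $x_j^{(i)}$ is the $j$th pattern of the $i$th class, $m_i$ denotes the mean of the $i$th class, and $\bar{x}$ denotes the mean of all the patterns in the pattern set.

The optimal discriminant vector $\omega$ that maximizes the Fisher criterion can be obtained by solving the following eigen-system equation:

$$S_b W = S_w W \Lambda, \tag{4}$$

where the columns of $W$ are the generalized eigenvectors and $\Lambda$ is diagonal and consists of the corresponding eigenvalues. When the inverse of $S_w$ exists, $\omega$ is the eigenvector of $S_w^{-1} S_b$ belonging to its maximum eigenvalue.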
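As an illustration of this section, the following minimal NumPy sketch builds the two scatter matrices for labeled data and takes the leading eigenvector of $S_w^{-1} S_b$. The function name `fisher_discriminant_vector` and the synthetic two-class data are assumptions made for this example only, and the sketch assumes $S_w$ is invertible. The $1/N$ scaling of the scatter matrices is one common convention; it does not change the direction of the resulting vector.

```python
import numpy as np

def fisher_discriminant_vector(X, y):
    """Return a unit vector maximizing the Fisher criterion for labeled data."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape
    x_bar = X.mean(axis=0)                      # mean of all patterns
    S_b = np.zeros((d, d))
    S_w = np.zeros((d, d))
    for label in np.unique(y):
        Xi = X[y == label]
        m_i = Xi.mean(axis=0)                   # class mean
        diff = (m_i - x_bar)[:, None]
        S_b += (len(Xi) / n) * (diff @ diff.T)  # between-class scatter
        centered = Xi - m_i
        S_w += centered.T @ centered / n        # within-class scatter
    # Assumption: S_w is invertible, so we can work with S_w^{-1} S_b.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return w / np.linalg.norm(w)

# Hypothetical data: two classes that differ only along the second feature.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], [3.0, 0.3], size=(50, 2)),
               rng.normal([0, 2], [3.0, 0.3], size=(50, 2))])
y = np.repeat([0, 1], 50)
w = fisher_discriminant_vector(X, y)
```

On data of this kind the second component of `w` dominates, which reflects that the second feature carries the class separation.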

#### 3. Unsupervised Optimal Discriminant Vector Based Feature Selection Method

The Fisher criterion described above can only be used in the supervised setting, which means that the traditional optimal discriminant vector cannot be calculated directly from unlabeled samples. Cao et al. [16] introduced fuzzy theory into the Fisher criterion and defined the fuzzy Fisher criterion. Maximizing this criterion not only realizes clustering but also yields an optimal discriminant vector.

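Suppose that the membership function $u_{ij} \in [0, 1]$ satisfies $\sum_{i=1}^{c} u_{ij} = 1$ for all $j$ and that the fuzzy index $m > 1$ is a given real value, where $u_{ij}$ denotes the degree to which the $j$th $d$-dimensional pattern $x_j$ belongs to the $i$th class. With $m_i$ the center of the $i$th class and $\bar{x}$ the mean of all $N$ patterns, we can define the following fuzzy within-class scatter matrix $S_{fw}$:

$$S_{fw} = \sum_{i=1}^{c} \sum_{j=1}^{N} u_{ij}^{m} (x_j - m_i)(x_j - m_i)^T, \tag{5}$$

and the following fuzzy between-class scatter matrix $S_{fb}$:

$$S_{fb} = \sum_{i=1}^{c} \sum_{j=1}^{N} u_{ij}^{m} (m_i - \bar{x})(m_i - \bar{x})^T. \tag{6}$$

Thus, we can derive the fuzzy Fisher criterion as follows:

$$J_{FFC} = \frac{\omega^T S_{fb}\, \omega}{\omega^T S_{fw}\, \omega}. \tag{7}$$

Maximizing $J_{FFC}$ directly in (7) is not a trivial task due to the existence of its denominator. However, we can reasonably relax this problem by applying the Lagrange multipliers $\lambda$ and $\beta_j$ ($j = 1, 2, \ldots, N$) together with the constraint $\sum_{i=1}^{c} u_{ij} = 1$ to (7):

$$L = \omega^T S_{fb}\, \omega - \lambda\, \omega^T S_{fw}\, \omega + \sum_{j=1}^{N} \beta_j \left( \sum_{i=1}^{c} u_{ij} - 1 \right). \tag{8}$$

Setting $\partial L / \partial \omega$ to be zero, we have

$$S_{fb}\, \omega = \lambda S_{fw}\, \omega, \tag{9}$$

where $\omega$ is the eigenvector belonging to the largest eigenvalue $\lambda$ of $S_{fw}^{-1} S_{fb}$.

Setting $\partial L / \partial m_i$ to be zero, we have

$$m_i = \frac{\sum_{j=1}^{N} u_{ij}^{m} \left( x_j - \frac{1}{\lambda} \bar{x} \right)}{\sum_{j=1}^{N} u_{ij}^{m} \left( 1 - \frac{1}{\lambda} \right)}. \tag{10}$$

Here, $m_i$ is a local maximum of $L$ [17], as proved in the Appendix.

Setting $\partial L / \partial u_{ij}$ to be zero, we have

$$u_{ij} = \frac{\left[ \omega^T (x_j - m_i)(x_j - m_i)^T \omega - \frac{1}{\lambda}\, \omega^T (m_i - \bar{x})(m_i - \bar{x})^T \omega \right]^{-1/(m-1)}}{\sum_{k=1}^{c} \left[ \omega^T (x_j - m_k)(x_j - m_k)^T \omega - \frac{1}{\lambda}\, \omega^T (m_k - \bar{x})(m_k - \bar{x})^T \omega \right]^{-1/(m-1)}}. \tag{11}$$

When (11) is used, as stated previously, $u_{ij}$ should satisfy $u_{ij} \in [0, 1]$; hence, in order to satisfy this constraint, we let $u_{ij} = 1$ and $u_{i'j} = 0$ for all $i' \neq i$, if

$$\omega^T (x_j - m_i)(x_j - m_i)^T \omega \le \frac{1}{\lambda}\, \omega^T (m_i - \bar{x})(m_i - \bar{x})^T \omega. \tag{12}$$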

With the above discussion, we can obtain the optimal discriminant vector without class labels and then perform feature selection based on it. We illustrate this with the following experiment on a 2-dimensional artificial dataset.

Figure 1 contains 168 2-dimensional samples. By maximizing the fuzzy Fisher criterion, we obtain the 2-class clustering result shown as red points and blue points, respectively, and also the vector shown as a line in Figure 2. We project all samples to -axis and -axis. The projected points on the -axis from different classes overlap, while those on the -axis are well separated. It means that feature is more important than feature for leading to the decision classes. This is consistent with , which suggests that we can apply the vector for feature selection.

Suppose ; we define as the single feature importance measurement for comparison:

To the above artificial dataset, is the importance measurement of feature and is the importance measurement of feature.

*Proposed Method*

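*Step 1.* Set the convergence threshold $\varepsilon$ or the maximum number of iterations $T$; initialize the class centers $m_i$ and the memberships $u_{ij}$ using $k$-means.

*Step 2.* Compute the fuzzy scatter matrices $S_{fw}$ and $S_{fb}$ using (5) and (6), respectively.

*Step 3.* Compute the largest eigenvalue $\lambda$ and the corresponding eigenvector $\omega$ using (9).

*Step 4.* Update $m_i$ and $u_{ij}$ using (10), (11), and (12), respectively.

*Step 5.* Compute the fuzzy Fisher criterion value $J_{FFC}$ using (7).

*Step 6.* If the change in $J_{FFC}$ between two successive iterations is smaller than $\varepsilon$ or the number of iterations reaches $T$, go to Step 7; otherwise go to Step 2.

*Step 7.* Compute the feature importance measurements $f_k$ ($k = 1, 2, \ldots, d$), normalized so that they sum to 1. Then sort them in descending order.

*Step 8.* Set the feature importance threshold $\theta$.

*Step 9.* Find the feature subset size $d'$, the minimum number for which the sum of the $d'$ largest importance measurements is no less than the threshold $\theta$.

*Step 10.* Choose the $d'$ features corresponding to the $d'$ largest importance measurements as the selected features and then terminate.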
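Steps 1 to 6 can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the update formulas are the stationarity conditions of a Lagrangian of the form $L = \omega^T S_{fb}\omega - \lambda\,\omega^T S_{fw}\omega + \sum_j \beta_j(\sum_i u_{ij} - 1)$, the names `kmeans_init` and `fuzzy_fisher_vector` are hypothetical, the default values of `m`, `eps`, and `max_iter` are arbitrary, and the code assumes that $S_{fw}$ is invertible and that the leading eigenvalue stays above 1.

```python
import numpy as np

def kmeans_init(X, c, rng, n_iter=20):
    """Plain k-means, used only to initialize centers and hard memberships."""
    centers = X[rng.choice(len(X), size=c, replace=False)].copy()
    for _ in range(n_iter):
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist2.argmin(axis=1)
        for i in range(c):
            if np.any(labels == i):
                centers[i] = X[labels == i].mean(axis=0)
    U = np.zeros((c, len(X)))
    U[labels, np.arange(len(X))] = 1.0
    return centers, U

def fuzzy_fisher_vector(X, c=2, m=2.0, eps=1e-6, max_iter=100, seed=0):
    """Iterate Steps 1-6 and return (omega, memberships, centers)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    x_bar = X.mean(axis=0)
    M, U = kmeans_init(X, c, np.random.default_rng(seed))        # Step 1
    J_prev = None
    for _ in range(max_iter):
        Um = U ** m
        # Step 2: fuzzy within-class and between-class scatter matrices.
        S_fw = sum((Um[i][:, None] * (X - M[i])).T @ (X - M[i]) for i in range(c))
        S_fb = sum(Um[i].sum() * np.outer(M[i] - x_bar, M[i] - x_bar) for i in range(c))
        # Step 3: leading eigenpair of S_fw^{-1} S_fb.
        vals, vecs = np.linalg.eig(np.linalg.solve(S_fw, S_fb))
        k = int(np.argmax(vals.real))
        lam = float(vals.real[k])
        w = vecs[:, k].real
        w = w / np.linalg.norm(w)
        # Step 4a: update the class centers.
        for i in range(c):
            M[i] = Um[i] @ (X - x_bar / lam) / (Um[i].sum() * (1.0 - 1.0 / lam))
        # Step 4b: update the memberships; fall back to a hard assignment
        # when the bracketed term is not positive.
        a = np.stack([((X - M[i]) @ w) ** 2 for i in range(c)])      # shape (c, n)
        b = np.array([((M[i] - x_bar) @ w) ** 2 for i in range(c)])  # shape (c,)
        g = a - b[:, None] / lam
        U = np.zeros((c, n))
        for j in range(n):
            if np.any(g[:, j] <= 0):
                U[np.argmin(g[:, j]), j] = 1.0
            else:
                inv = g[:, j] ** (-1.0 / (m - 1.0))
                U[:, j] = inv / inv.sum()
        # Steps 5-6: criterion value and stopping test.
        J = float(w @ S_fb @ w) / float(w @ S_fw @ w)
        if J_prev is not None and abs(J - J_prev) < eps:
            break
        J_prev = J
    return w, U, M

# Hypothetical unlabeled data: two groups separated along the second feature.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], [0.5, 0.3], size=(60, 2)),
               rng.normal([0.0, 4.0], [0.5, 0.3], size=(60, 2))])
w, U, M = fuzzy_fisher_vector(X)
```

When several classes violate the positivity condition for the same pattern, the sketch assigns the pattern to the class with the most negative term; this tie-breaking rule is an assumption of the example.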

Different values of the feature importance threshold $\theta$ lead to different feature subset sizes $d'$. In Step 7 of the proposed method, the features have already been sorted in descending order of importance. If the feature subset size $d'$ is given from the start, we can simply select the first $d'$ features. If $d'$ is not given, we can use $\theta$ to determine it: the bigger $\theta$ is, the larger $d'$ is. The recommended range of $\theta$ is from 0.8 to 0.95.
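The threshold rule of Steps 7 to 10 can be sketched as below. The importance values are taken as given, since how they are computed from the discriminant vector is outside this example; the function name `select_features` and the sample numbers are hypothetical.

```python
import numpy as np

def select_features(importance, theta=0.9):
    """Return indices of the smallest top-ranked feature set whose
    normalized importances sum to at least theta."""
    f = np.asarray(importance, dtype=float)
    f = f / f.sum()                          # normalize so the values sum to 1
    order = np.argsort(-f, kind="stable")    # feature indices, descending importance
    cumulative = np.cumsum(f[order])
    # First position where the cumulative importance reaches theta
    # (small tolerance guards against floating-point round-off).
    size = int(np.searchsorted(cumulative, theta - 1e-12)) + 1
    return order[:min(size, f.size)]

# Hypothetical importance values for four features.
selected = select_features([0.05, 0.5, 0.15, 0.3], theta=0.9)
```

With these sample values the cumulative sums of the sorted importances are 0.5, 0.8, 0.95, and 1.0, so three features are kept for a threshold of 0.9.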

#### 4. Experimental Results

##### 4.1. Feature Selection on UCI Dataset *Wine*

In this experiment, the benchmark UCI dataset *Wine* [18] was chosen to test the feature selection effectiveness of SUD, Relief-F, and our method. We use the following Rand index [19] to evaluate the clustering performance on the dimension-reduced data:

$$R(P_1, P_2) = \frac{a + b}{N(N-1)/2},$$

where $P_1$, $P_2$ denote the clustering results for the original dataset without noise and the corresponding noisy dataset, $a$ denotes the number of pairs of patterns in the original dataset belonging to the same cluster in both $P_1$ and $P_2$, $b$ denotes the number of pairs of patterns in the original dataset belonging to two different clusters in both $P_1$ and $P_2$, and $N$ is the number of all patterns in the original dataset. Obviously, $0 \le R(P_1, P_2) \le 1$, and $R(P_1, P_2) = 1$ when $P_1$ is the same as $P_2$. The smaller $R(P_1, P_2)$ is, the bigger the difference between $P_1$ and $P_2$; in other words, the corresponding algorithm is less robust in this case.
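For concreteness, the Rand index can be computed by counting agreeing pairs directly. The sketch below is a straightforward pair-counting implementation; the function name `rand_index` is chosen for this example, and the quadratic loop is meant for clarity rather than speed.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of pattern pairs on which two clusterings agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same patterns")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two patterns are required")
    agree = 0
    for i, j in combinations(range(n), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        # A pair agrees if it is together in both clusterings
        # or separated in both clusterings.
        if same_a == same_b:
            agree += 1
    return agree / (n * (n - 1) / 2)

identical = rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # same partition, relabeled
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index depends only on the partitions, not on the label names, so relabeling the clusters leaves the value unchanged.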

Table 1 gives the basic information of the dataset. We choose the 130 samples which belong to class 1 and class 2 as the testing dataset. The parameters for the proposed method are set as follows:

Table 2 lists the importance measurement of every feature computed by the proposed method. Given the chosen threshold, 6 features are selected from the original features. Figure 3 shows the Rand index values corresponding to the number of features using SUD, Relief-F, and our method.

From Figure 3, we can see that the data restricted to the features selected by the proposed method gives the best clustering result among the three algorithms.

##### 4.2. Feature Selection for Fault Diagnosis

The steel plates faults dataset used in this experiment was donated by Semeion, Research Center of Sciences of Communication, Via Sersale 117, Rome, Italy [20, 21]. It classifies steel plates’ faults into 7 different types: Pastry, Z_Scratch, K_Scratch, Stains, Dirtiness, Bumps, and Other_Faults. The dataset includes 1941 samples, and every sample has 27 independent features.

Table 3 shows the class distribution and the list of features. We choose the 348 samples which belong to the Pastry and Z_Scratch faults as the testing dataset. The parameters for the proposed method are set as in the previous experiment.

Table 4 lists the importance measurement of every feature computed by the proposed method. Given the chosen threshold, 11 features are selected from the original features. Figure 4 shows the Rand index values corresponding to the number of features using SUD, Relief-F, and our method.

Figure 4 shows that the proposed method is able to find the important features. It also shows that the performance of the proposed method, which does not use class labels, is very close to and sometimes better than that of SUD and of Relief-F, which ranks the original features using the class labels.

#### 5. Conclusions

An efficient unsupervised feature selection method based on the unsupervised optimal discriminant vector is developed to find the important features without using class labels. It adopts the fuzzy Fisher criterion to derive the optimal discriminant vector without class labels, and it defines a single-feature importance measurement based on this vector to determine the importance of every feature. Two experiments, on the *Wine* dataset and on fault diagnosis data, show that the proposed method is able to find important features and is a reliable and efficient feature selection methodology compared to SUD and Relief-F. In the future, we will investigate how to introduce kernel techniques into the proposed method to enhance its applicability.

#### Appendix

#### Proof of (10)

According to [22], we have

Premultiply by on both sides, We cannot solve from the above equation. But it is obvious that the following equation is the particular solution of (A.3): that is, Now we prove that (A.5) is a local maximum of .

We find

As is a positive semidefinite matrix, we have And it is obvious that We track value in the experiments shown in Figure 5 and give the empirical evidence to prove .

Thus, According to [23], (A.5) is the local maximum of .

#### Conflict of Interests

The authors declare that they have no conflict of interests.

#### Acknowledgments

This work is supported by the Major Program of the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (11KJA460001, 13KJA460001), China Spark Program (2013GA690404), Technology Innovation Project of Science and Technology Enterprises at Jiangsu Province in China (BC2012429), the Open Project from the Key Laboratory of Digital Manufacture Technology at Jiangsu Province in China (HGDML-1005), Huaian Science and Technique Program in China (HAG2010062), Huaian 533 Talents Project, Huaian International Science and Technology Cooperation Project (HG201309), Qing Lan Project of Jiangsu Province, and Jiangsu Overseas Research & Training Program for University Prominent Young & Middle-Aged Teachers and Presidents, China.

#### References

1. R. A. Fisher, “The use of multiple measurements in taxonomic problems,” *Annals of Eugenics*, vol. 7, pp. 178–188, 1936.
2. J. Kittler, “On the discriminant vector method of feature selection,” *IEEE Transactions on Computers*, vol. 26, no. 6, pp. 604–606, 1977.
3. K. Kira and L. A. Rendell, “The feature selection problem: traditional methods and a new algorithm,” in *Proceedings of the 10th National Conference on Artificial Intelligence*, pp. 129–134, July 1992.
4. I. Kononenko, “Estimating attributes: analysis and extension of RELIEF,” in *Proceedings of the European Conference on Machine Learning*, pp. 171–182, 1994.
5. L. Yu and H. Liu, “Efficiently handling feature redundancy in high-dimensional data,” in *Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, pp. 685–690, August 2003.
6. G. V. Lashkia and L. Anthony, “Relevant, irredundant feature selection and noisy example elimination,” *IEEE Transactions on Systems, Man, and Cybernetics B*, vol. 34, no. 2, pp. 888–897, 2004.
7. Q. Gu, Z. Li, and J. Han, “Generalized fisher score for feature selection,” in *Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI '11)*, pp. 266–273, Barcelona, Spain, July 2011.
8. A. R. Webb, *Statistical Pattern Recognition*, John Wiley & Sons, New York, NY, USA, 2nd edition, 2002.
9. M. Dash, H. Liu, and J. Yao, “Dimensionality reduction of unsupervised data,” in *Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence*, pp. 532–539, November 1997.
10. J. Basak, R. K. De, and S. K. Pal, “Unsupervised feature selection using a neuro-fuzzy approach,” *Pattern Recognition Letters*, vol. 19, no. 11, pp. 997–1006, 1998.
11. P. Mitra, C. A. Murthy, and S. K. Pal, “Unsupervised feature selection using feature similarity,” *IEEE Transactions on Pattern Analysis and Machine Intelligence*, vol. 24, no. 3, pp. 301–312, 2002.
12. M. Dash, K. Choi, P. Scheuermann, and H. Liu, “Feature selection for clustering - a filter solution,” in *Proceedings of the 2nd IEEE International Conference on Data Mining*, pp. 115–122, December 2002.
13. S. Y. M. Shi and P. N. Suganthan, “Unsupervised similarity-based feature selection using heuristic Hopfield neural networks,” in *Proceedings of the International Joint Conference on Neural Networks*, vol. 3, pp. 1838–1843, July 2003.
14. J. G. Dy and C. E. Brodley, “Feature selection for unsupervised learning,” *Journal of Machine Learning Research*, vol. 5, pp. 845–889, 2004.
15. S. Z. Li and A. K. Jain, *Encyclopedia of Biometrics*, Springer, 2009.
16. S.-Q. Cao, S.-T. Wang, X.-F. Chen, Z.-P. Xie, and Z.-H. Deng, “Fuzzy fisher criterion based semi-fuzzy clustering algorithm,” *Journal of Electronics & Information Technology*, vol. 30, no. 9, pp. 2162–2165, 2008.
17. X.-B. Zhi and J.-L. Fan, “Fuzzy fisher criterion based adaptive dimension reduction fuzzy clustering algorithm,” *Journal of Electronics & Information Technology*, vol. 31, no. 11, pp. 2653–2658, 2009.
18. C. L. Blake and C. J. Merz, “UCI repository of machine learning databases,” Department of Information and Computer Science, University of California, Irvine, Calif, USA, 1998, http://archive.ics.uci.edu/ml/.
19. W. Rand, “Objective criteria for the evaluation of clustering methods,” *Journal of the American Statistical Association*, vol. 66, no. 336, pp. 846–850, 1971.
20. Center for Machine Learning and Intelligent Systems, the University of California, Irvine, Calif, USA, 2011, http://archive.ics.uci.edu/ml/datasets/Steel+Plates+Faults.
21. M. Buscema, S. Terzi, and W. Tastle, “A new meta-classifier,” in *Proceedings of the Annual North American Fuzzy Information Processing Society Conference (NAFIPS '10)*, pp. 1–7, IEEE Press, Toronto, Canada, July 2010.
22. J. H. Manton, “Differential calculus, tensor products and the importance of notation,” 2013, http://arxiv.org/abs/1208.0197.
23. J. Wilde, *Unconstrained Optimization*, 2011, http://www.econ.brown.edu/students/Takeshi_Suzuki/Math_Camp_2011/Unconstrained_Optimization-2011.pdf.