#### Abstract

An efficient unsupervised feature selection method based on the unsupervised optimal discriminant vector is developed to find important features without using class labels. Features are ranked by an importance measurement derived from this vector in two steps. First, the fuzzy Fisher criterion is adopted as the objective function to derive the optimal discriminant vector in the unsupervised setting. Second, a feature importance measurement based on the elements of the unsupervised optimal discriminant vector is defined to determine the importance of each feature. Features with low importance are removed from the feature subset. Experiments on a UCI dataset and on fault diagnosis show that the proposed method is efficient and delivers reliable results.

#### 1. Introduction

Feature selection (FS) has become an active research topic in areas such as pattern recognition, machine learning, data mining, and intelligent fault diagnosis. It chooses a subset of the original features by removing redundant and noisy features from high-dimensional datasets in order to reduce computational cost, increase classification accuracy, and improve the comprehensibility of results.

In supervised FS algorithms, class labels are available, so candidate feature subsets are evaluated using some function of prediction accuracy, and only those features that are related to or lead to the decision classes of the data under consideration are selected. There are numerous supervised feature selection methods [1–7], such as Fisher criterion [1, 2], Relief [3], and Relief-F [4].

However, for many existing datasets, class labels are unknown or incomplete because the large amount of data makes it difficult for humans to label each instance manually. Moreover, human labeling is expensive and subjective. This motivates unsupervised dimensionality reduction. Principal component analysis (PCA) [8] is often used in the unsupervised setting. However, PCA creates new features, the principal components, which are functions of the original features, and it is difficult to obtain an intuitive understanding of the data from these new features alone. Some unsupervised feature selection methods [8–14] have been proposed, such as SUD [9]. SUD is a sequential backward selection algorithm that determines the relative importance of variables for unsupervised data; it uses an entropy-based similarity measurement to determine the importance of features with respect to the underlying clusters.

The well-known Fisher criterion, from which an optimal discriminant vector can be derived, is commonly used for feature dimension reduction in the supervised setting. In the unsupervised setting, how to overcome the lack of class information in order to perform feature selection is a topic worth studying.

#### 2. An Overview of Optimal Discriminant Vector

The Fisher criterion is a discriminant criterion function first proposed by Fisher. It is based on the between-class scatter and the within-class scatter. By maximizing this criterion, one can obtain an optimal discriminant vector. After the samples are projected onto this vector, the within-class scatter is minimized and the between-class scatter is maximized [15].

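Given $c$ pattern classes $X^{(i)} = \{x_j^{(i)}\}$ ($i = 1, \ldots, c$; $j = 1, \ldots, N_i$) in a pattern set that contains $N$ $d$-dimensional patterns, where $N_i$ is the number of patterns in the $i$th class, we have $N = \sum_{i=1}^{c} N_i$. The Fisher criterion is defined as follows:

$$J_{FC}(\omega) = \frac{\omega^T S_b\, \omega}{\omega^T S_w\, \omega}, \tag{1}$$

where $S_b$ is the between-class scatter matrix

$$S_b = \frac{1}{N} \sum_{i=1}^{c} N_i\, (m_i - \bar{x})(m_i - \bar{x})^T \tag{2}$$

and $S_w$ is the within-class scatter matrix

$$S_w = \frac{1}{N} \sum_{i=1}^{c} \sum_{j=1}^{N_i} \left(x_j^{(i)} - m_i\right)\left(x_j^{(i)} - m_i\right)^T, \tag{3}$$

where $m_i$ denotes the mean of the $i$th class and $\bar{x}$ denotes the mean of all the patterns in the pattern set.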

The optimal discriminant vector $\omega$ that maximizes the Fisher criterion can be obtained by solving the following eigen-system equation:

$$S_b\, \omega = \lambda\, S_w\, \omega, \tag{4}$$

where $\lambda$ is the corresponding eigenvalue. When the inverse of $S_w$ exists, $\omega$ is the eigenvector associated with the maximum eigenvalue of $S_w^{-1} S_b$.
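As a minimal sketch of the supervised case described above, the following NumPy code builds the two scatter matrices from labeled data and takes the eigenvector of the largest eigenvalue. The function name `fisher_discriminant_vector` and the toy data are illustrative assumptions, not part of the original method description; the code also assumes the within-class scatter matrix is invertible.

```python
import numpy as np

def fisher_discriminant_vector(X, y):
    """Unit vector maximizing the Fisher criterion for labeled data (a sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_samples, n_features = X.shape
    overall_mean = X.mean(axis=0)
    S_b = np.zeros((n_features, n_features))  # between-class scatter
    S_w = np.zeros((n_features, n_features))  # within-class scatter
    for label in np.unique(y):
        X_c = X[y == label]
        class_mean = X_c.mean(axis=0)
        diff = (class_mean - overall_mean)[:, None]
        S_b += len(X_c) * (diff @ diff.T) / n_samples
        centered = X_c - class_mean
        S_w += centered.T @ centered / n_samples
    # Eigen-decomposition of S_w^{-1} S_b; assumes S_w is invertible.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    best = np.argmax(eigvals.real)
    w = eigvecs[:, best].real
    return w / np.linalg.norm(w)

# Toy data (hypothetical): two classes separated along the first feature.
X = np.array([[0, 0], [1, 1], [0, 2], [4, 0], [5, 1], [4, 2]])
y = np.array([0, 0, 0, 1, 1, 1])
w = fisher_discriminant_vector(X, y)
```

On this toy data the returned vector points along the first feature (up to sign), which is the only direction in which the class means differ.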

#### 3. Unsupervised Optimal Discriminant Vector Based Feature Selection Method

The Fisher criterion described above can only be used in the supervised setting; that is, the traditional optimal discriminant vector cannot be calculated directly from unlabeled samples. Cao et al. [16] introduce fuzzy theory into the Fisher criterion and define the fuzzy Fisher criterion. Maximizing this criterion not only realizes clustering but also yields an optimal discriminant vector.

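Suppose that the membership function $u_{ij} \in [0, 1]$ with $\sum_{i=1}^{c} u_{ij} = 1$ for all $j$ and that the fuzzy index $m > 1$ is a given real value, where $u_{ij}$ denotes the degree to which the $j$th $d$-dimensional pattern $x_j$ belongs to the $i$th class, $m_i$ denotes the center of the $i$th class, and $\bar{x}$ denotes the mean of all patterns. We can define the following fuzzy within-class scatter matrix $S_{fw}$:

$$S_{fw} = \sum_{i=1}^{c} \sum_{j=1}^{N} u_{ij}^{m}\, (x_j - m_i)(x_j - m_i)^T \tag{5}$$

and the following fuzzy between-class scatter matrix $S_{fb}$:

$$S_{fb} = \sum_{i=1}^{c} \sum_{j=1}^{N} u_{ij}^{m}\, (m_i - \bar{x})(m_i - \bar{x})^T. \tag{6}$$

Thus, we can derive the fuzzy Fisher criterion as follows:

$$J_{FFC} = \frac{\omega^T S_{fb}\, \omega}{\omega^T S_{fw}\, \omega}. \tag{7}$$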
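The fuzzy scatter matrices referred to as (5) and (6) can be sketched in NumPy as follows. This is an illustrative reading that weights each outer product by the membership raised to the fuzzy index; the names `fuzzy_scatter_matrices` and `fuzzy_fisher_criterion`, the `(classes x samples)` layout of `U`, and the toy data are assumptions made for the example.

```python
import numpy as np

def fuzzy_scatter_matrices(X, U, means, m=2.0):
    """Fuzzy within-class and between-class scatter matrices (a sketch).

    X: (N, d) patterns; U: (c, N) memberships; means: (c, d) class centers.
    """
    X = np.asarray(X, dtype=float)
    U = np.asarray(U, dtype=float)
    means = np.asarray(means, dtype=float)
    x_bar = X.mean(axis=0)          # mean of all patterns
    W = U ** m                      # memberships raised to the fuzzy index
    d = X.shape[1]
    S_fw = np.zeros((d, d))
    S_fb = np.zeros((d, d))
    for i in range(means.shape[0]):
        diff_w = X - means[i]                       # (N, d)
        S_fw += (W[i][:, None] * diff_w).T @ diff_w
        diff_b = (means[i] - x_bar)[:, None]        # (d, 1)
        S_fb += W[i].sum() * (diff_b @ diff_b.T)
    return S_fw, S_fb

def fuzzy_fisher_criterion(w, S_fw, S_fb):
    """Ratio of projected between-class to within-class fuzzy scatter."""
    w = np.asarray(w, dtype=float)
    return float(w @ S_fb @ w) / float(w @ S_fw @ w)

# Toy data (hypothetical) with crisp memberships, so the result can be
# checked by hand against the classical scatter matrices.
X = np.array([[0, 0], [1, 1], [0, 2], [4, 0], [5, 1], [4, 2]], dtype=float)
U = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
means = np.array([[1 / 3, 1.0], [13 / 3, 1.0]])
S_fw, S_fb = fuzzy_scatter_matrices(X, U, means)
J = fuzzy_fisher_criterion([1.0, 0.0], S_fw, S_fb)
```

With crisp (0/1) memberships the fuzzy matrices coincide with the unnormalized classical scatter matrices, which is a convenient sanity check.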

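Maximizing $J_{FFC}$ in (7) directly is not a trivial task because of its denominator. However, we can reasonably relax the problem by applying the Lagrange multipliers $\lambda$ and $\beta_j$ ($j = 1, \ldots, N$), together with the constraint $\sum_{i=1}^{c} u_{ij} = 1$, to (7):

$$L = \omega^T S_{fb}\, \omega - \lambda\, \omega^T S_{fw}\, \omega + \sum_{j=1}^{N} \beta_j \left( \sum_{i=1}^{c} u_{ij} - 1 \right). \tag{8}$$

Setting $\partial L / \partial \omega$ to zero, we have

$$S_{fw}^{-1} S_{fb}\, \omega = \lambda\, \omega, \tag{9}$$

where $\omega$ is the eigenvector belonging to the largest eigenvalue $\lambda$ of $S_{fw}^{-1} S_{fb}$.

Setting $\partial L / \partial m_i$ to zero, we have

$$m_i = \frac{\sum_{j=1}^{N} u_{ij}^{m} \left( x_j - \frac{1}{\lambda} \bar{x} \right)}{\sum_{j=1}^{N} u_{ij}^{m} \left( 1 - \frac{1}{\lambda} \right)}. \tag{10}$$

Here, $m_i$ is a local maximum of $L$ [17], as proved in the Appendix.

Setting $\partial L / \partial u_{ij}$ to zero, we have

$$u_{ij} = \frac{\left[ \omega^T (x_j - m_i)(x_j - m_i)^T \omega - \frac{1}{\lambda}\, \omega^T (m_i - \bar{x})(m_i - \bar{x})^T \omega \right]^{-1/(m-1)}}{\sum_{k=1}^{c} \left[ \omega^T (x_j - m_k)(x_j - m_k)^T \omega - \frac{1}{\lambda}\, \omega^T (m_k - \bar{x})(m_k - \bar{x})^T \omega \right]^{-1/(m-1)}}. \tag{11}$$

When (11) is used, as stated previously, $u_{ij}$ should satisfy $u_{ij} \in [0, 1]$; hence, in order to satisfy this constraint, we let $u_{ij} = 1$ and $u_{kj} = 0$ for all $k \neq i$, if

$$\omega^T (x_j - m_i)(x_j - m_i)^T \omega \leq \frac{1}{\lambda}\, \omega^T (m_i - \bar{x})(m_i - \bar{x})^T \omega. \tag{12}$$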
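One way to read the update rules (10), (11), and (12) is sketched below in NumPy. It is an illustration under stated assumptions rather than the authors' implementation: the function name `update_means_and_memberships` is hypothetical, the multiplier `lam` is assumed to be greater than 1, and when several classes satisfy the crisp condition for the same pattern the sketch picks the class with the smallest value of the bracketed term.

```python
import numpy as np

def update_means_and_memberships(X, U, w, lam, m=2.0):
    """One update of class centers and memberships (a sketch; assumes lam > 1)."""
    X = np.asarray(X, dtype=float)
    U = np.asarray(U, dtype=float)
    w = np.asarray(w, dtype=float)
    x_bar = X.mean(axis=0)
    W = U ** m                                          # (c, N)
    # Center update: weighted mean shifted away from the overall mean.
    numer = W @ (X - x_bar / lam)                       # (c, d)
    denom = (1.0 - 1.0 / lam) * W.sum(axis=1, keepdims=True)
    means = numer / denom
    # Bracketed term of the membership update, one value per (class, pattern).
    within = ((X[None, :, :] - means[:, None, :]) @ w) ** 2   # (c, N)
    between = ((means - x_bar) @ w) ** 2                      # (c,)
    D = within - between[:, None] / lam
    # Fuzzy memberships where every term is positive.
    inv = np.where(D > 0, D, 1.0) ** (-1.0 / (m - 1.0))
    U_new = inv / inv.sum(axis=0, keepdims=True)
    # Crisp assignment for patterns where some term is non-positive.
    crisp = np.flatnonzero((D <= 0).any(axis=0))
    U_new[:, crisp] = 0.0
    U_new[D[:, crisp].argmin(axis=0), crisp] = 1.0
    return means, U_new

# Toy data (hypothetical): crisp start, projection onto the first feature.
X = np.array([[0, 0], [1, 1], [0, 2], [4, 0], [5, 1], [4, 2]], dtype=float)
U = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
means, U_new = update_means_and_memberships(X, U, w=[1.0, 0.0], lam=18.0)
```

Patterns that lie close to a center in the projected space receive a crisp membership, while the remaining patterns keep fuzzy memberships that still sum to one per pattern.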

With the above discussion, we can obtain the optimal discriminant vector $\omega$ in the unsupervised setting and then perform feature selection based on $\omega$. We illustrate this with the following experiment on a 2-dimensional artificial dataset.

Figure 1 contains 168 2-dimensional samples. By maximizing the fuzzy Fisher criterion, we obtain the 2-class clustering result shown as red points and blue points, respectively, together with the optimal discriminant vector shown as a line in Figure 2. We then project all samples onto each of the two coordinate axes. The projections of the two classes overlap on one axis, whereas they are well separated on the other. This means that the feature associated with the latter axis is more important for determining the decision classes. This is consistent with the elements of the optimal discriminant vector, which suggests that the vector can be applied to feature selection.

Suppose ; we define as the single feature importance measurement for comparison:

To the above artificial dataset, is the importance measurement of feature and is the importance measurement of feature.

*Proposed Method*

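*Step 1.* Set the convergence threshold $\varepsilon$ or the maximum number of iterations $T$; initialize the memberships $u_{ij}$ and the class centers $m_i$ using $k$-means.

*Step 2.* Compute the fuzzy within-class and between-class scatter matrices $S_{fw}$ and $S_{fb}$ using (5) and (6), respectively.

*Step 3.* Compute the largest eigenvalue $\lambda$ and the corresponding eigenvector $\omega$ using (9).

*Step 4.* Update $m_i$ and $u_{ij}$ using (10), (11), and (12), respectively.

*Step 5.* Compute the fuzzy Fisher criterion $J_{FFC}$.

*Step 6.* If the change in $J_{FFC}$ between two successive iterations is smaller than $\varepsilon$ or the number of iterations reaches $T$, go to Step 7; otherwise go to Step 2.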

*Step 7.* Compute the feature importance measurements and normalize them. Then sort them in descending order.

*Step 8.* Set the feature importance threshold.

*Step 9.* Find the feature subset size, defined as the smallest number of top-ranked features whose cumulative normalized importance is no less than the threshold.

*Step 10.* Choose the features corresponding to the leading entries of the sorted importance measurements, up to the subset size found in Step 9, as the selected features, and then terminate.
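Steps 1 to 6 above can be sketched as a single loop. The code below is a simplified illustration, not the authors' implementation: the function name `ffc_discriminant_vector` is hypothetical, the initial partition is passed in as `init_labels` (for example from any k-means run) instead of being computed inside the function, the criterion is evaluated before the update in each iteration rather than after it, and the largest eigenvalue is assumed to stay above 1.

```python
import numpy as np

def ffc_discriminant_vector(X, init_labels, m=2.0, eps=1e-6, max_iter=100):
    """Iterate the fuzzy Fisher criterion updates (a sketch under assumptions)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(init_labels)
    x_bar = X.mean(axis=0)
    classes = np.unique(labels)
    # Step 1: crisp initial memberships (c x N) and class centers.
    U = (labels[None, :] == classes[:, None]).astype(float)
    means = (U @ X) / U.sum(axis=1, keepdims=True)
    J_prev = None
    for _ in range(max_iter):
        W = U ** m
        # Step 2: fuzzy within-class and between-class scatter matrices.
        diff = X[None, :, :] - means[:, None, :]
        S_fw = np.einsum("cn,cni,cnj->ij", W, diff, diff)
        dev = means - x_bar
        S_fb = np.einsum("c,ci,cj->ij", W.sum(axis=1), dev, dev)
        # Step 3: largest eigenvalue and its eigenvector (S_fw assumed invertible).
        eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_fw, S_fb))
        best = np.argmax(eigvals.real)
        lam = eigvals.real[best]
        w = eigvecs[:, best].real
        w = w / np.linalg.norm(w)
        # Steps 5 and 6: stop when the criterion no longer changes.
        J = (w @ S_fb @ w) / (w @ S_fw @ w)
        if J_prev is not None and abs(J - J_prev) < eps:
            break
        J_prev = J
        # Step 4: update centers, then memberships (crisp where required).
        means = (W @ (X - x_bar / lam)) / (
            (1.0 - 1.0 / lam) * W.sum(axis=1, keepdims=True))
        within = ((X[None, :, :] - means[:, None, :]) @ w) ** 2
        between = ((means - x_bar) @ w) ** 2
        D = within - between[:, None] / lam
        inv = np.where(D > 0, D, 1.0) ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=0, keepdims=True)
        crisp = np.flatnonzero((D <= 0).any(axis=0))
        U[:, crisp] = 0.0
        U[D[:, crisp].argmin(axis=0), crisp] = 1.0
    return w, U

# Synthetic data (hypothetical): two clusters separated along feature 0 only.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(40, 3)),
               rng.normal(size=(40, 3)) + np.array([10.0, 0.0, 0.0])])
init_labels = np.array([0] * 40 + [1] * 40)
w, U = ffc_discriminant_vector(X, init_labels)
```

On this synthetic data the first element of the returned vector dominates, which matches the intuition that only the first feature separates the two clusters.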

Different feature importance thresholds lead to different feature subset sizes. In Step 7 of the proposed method, the features have already been sorted in descending order of importance. If the feature subset size is given from the start, we can simply select that many top-ranked features. If it is not given, we can use the feature importance threshold to determine the feature subset size: the larger the threshold, the larger the feature subset. The recommended range of the threshold is from 0.8 to 0.95.
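The ranking and thresholding in Steps 7 to 10 can be sketched as follows. The exact importance formula is not reproduced here; the sketch assumes, purely for illustration, that the importance of a feature is the absolute value of its element in the discriminant vector, normalized so that all importances sum to one. The function name `select_features` and the example vector are likewise hypothetical.

```python
import numpy as np

def select_features(w, threshold=0.9):
    """Rank features and keep the smallest top-ranked set reaching the threshold.

    Assumption: importance is |w_k| normalized to sum to one.
    """
    importance = np.abs(np.asarray(w, dtype=float))
    importance = importance / importance.sum()
    order = np.argsort(-importance, kind="stable")   # descending importance
    cumulative = np.cumsum(importance[order])
    # Smallest subset size whose cumulative importance reaches the threshold
    # (a small tolerance guards against floating-point rounding).
    size = int(np.searchsorted(cumulative, threshold - 1e-12)) + 1
    size = min(size, len(order))
    return order[:size], importance

# Example with a hypothetical 3-element discriminant vector.
selected, importance = select_features([0.1, -0.6, 0.3], threshold=0.85)
```

Raising the threshold can only keep the selected set the same or enlarge it, which is the behavior described in the paragraph above.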

#### 4. Experimental Results

##### 4.1. Feature Selection on UCI Dataset *Wine*

In this experiment, the benchmark UCI dataset *Wine* [18] was chosen to compare the feature selection effectiveness of SUD, Relief-F, and our method. We use the following Rand index [19] to evaluate the clustering performance on the dimension-reduced data:

$$\mathrm{Rand}(P_1, P_2) = \frac{a + b}{n(n-1)/2},$$

where $P_1$, $P_2$ denote the clustering results for the original dataset without noise and the corresponding noisy dataset, $a$ denotes the number of pairs of patterns in the original dataset that belong to the same cluster in both $P_1$ and $P_2$, $b$ denotes the number of pairs of patterns in the original dataset that belong to two different clusters in both $P_1$ and $P_2$, and $n$ is the number of all patterns in the original dataset. Obviously, $\mathrm{Rand}(P_1, P_2) \in [0, 1]$, and $\mathrm{Rand}(P_1, P_2) = 1$ when $P_1$ is the same as $P_2$. The smaller $\mathrm{Rand}(P_1, P_2)$ is, the bigger the difference between $P_1$ and $P_2$; in other words, the corresponding algorithm is less robust in this case.
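For concreteness, the Rand index between two partitions can be computed by counting agreeing pairs, as in the short sketch below. The function name `rand_index` and the example label lists are illustrative; the quadratic pair loop is chosen for clarity rather than speed.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of pattern pairs on which two partitions agree."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("both partitions must label the same patterns")
    agreeing = 0
    for i, j in combinations(range(n), 2):
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        # A pair agrees if it is together in both or apart in both partitions.
        if same_in_a == same_in_b:
            agreeing += 1
    return agreeing / (n * (n - 1) / 2)

# Identical partitions up to a relabeling of the clusters.
identical = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Two partitions that disagree on most pairs.
different = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because only pair co-membership is compared, the index does not depend on how the clusters are numbered.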

Table 1 gives the basic information of the dataset. We use the 130 samples that belong to class 1 and class 2 as the test dataset. The parameters for the proposed method are set as follows:

Table 2 lists the importance measurement of every feature computed by the proposed method. With the chosen threshold, 6 of the original features are selected. Figure 3 shows the Rand index values as a function of the number of features for SUD, Relief-F, and our method.

Figure 3 shows that the data restricted to the features selected by the proposed method give the best clustering result among the three algorithms.

##### 4.2. Feature Selection for Fault Diagnosis

The steel plates faults dataset used in this experiment was donated by Semeion, Research Center of Sciences of Communication, Via Sersale 117, Rome, Italy [20, 21]. It classifies steel plate faults into 7 different types: Pastry, Z_Scratch, K_Scratch, Stains, Dirtiness, Bumps, and Other_Faults. The dataset includes 1941 samples, and each sample has 27 independent features.

Table 3 shows the class distribution and the list of features. We use the 348 samples that belong to the Pastry and Z_Scratch faults as the test dataset. The parameters for the proposed method are set as in the previous experiment.

Table 4 lists the importance measurement of every feature computed by the proposed method. With the chosen threshold, 11 of the original features are selected. Figure 4 shows the Rand index values as a function of the number of features for SUD, Relief-F, and our method.

Figure 4 shows that the proposed method is able to find the important features. It also shows that the performance of the proposed method, which does not use class labels, is very close to and sometimes better than that of SUD or of Relief-F, which ranks the original features using the class labels.

#### 5. Conclusions

An efficient unsupervised feature selection method based on the unsupervised optimal discriminant vector has been developed to find important features without using class labels. It adopts the fuzzy Fisher criterion to derive the optimal discriminant vector in the unsupervised setting and defines a single-feature importance measurement based on this vector to determine the importance of every feature. Two experiments, on the *Wine* dataset and on fault diagnosis, show that the proposed method is able to find important features and is a reliable and efficient feature selection methodology compared with SUD and Relief-F. In future work, we will investigate how to introduce kernel techniques into the proposed method to enhance its applicability.

#### Appendix

#### Proof of (10)

According to [22], we have

Premultiply by on both sides, We cannot solve from the above equation. But it is obvious that the following equation is a particular solution of (A.3): that is, Now we prove that (A.5) is a local maximum of .

We find

As is a positive semidefinite matrix, we have And it is obvious that We track value in the experiments shown in Figure 5 and give the empirical evidence to prove .

Figure 5: (a) *Wine* dataset; (b) steel plates dataset.

Thus, According to [23], (A.5) is the local maximum of .

#### Conflict of Interests

The authors declare that they have no conflict of interests.

#### Acknowledgments

This work is supported by the Major Program of the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (11KJA460001, 13KJA460001), China Spark Program (2013GA690404), Technology Innovation Project of Science and Technology Enterprises at Jiangsu Province in China (BC2012429), the Open Project from the Key Laboratory of Digital Manufacture Technology at Jiangsu Province in China (HGDML-1005), Huaian Science and Technique Program in China (HAG2010062), Huaian 533 Talents Project, Huaian International Science and Technology Cooperation Project (HG201309), Qing Lan Project of Jiangsu Province, and Jiangsu Overseas Research & Training Program for University Prominent Young & Middle-Aged Teachers and Presidents, China.