Abstract

We propose a two-step preprocessing method to improve the performance of Principal Component Analysis (PCA) in classification problems. In the first step, the weight of each feature is calculated by a feature weighting method, and the features with weights larger than a predefined threshold are selected. The selected relevant features are then passed to the second step, in which the variances of the features are adjusted so that they correspond to the importance of the features. Because the second step reveals the class structure, we expect the performance of PCA in classification problems to increase. Experimental results confirm the effectiveness of the proposed method.

1. Introduction

In many real-world applications, we face databases with large sets of features. Unfortunately, in high-dimensional spaces, data become extremely sparse and far apart from each other. Experiments show that, in this situation, when the number of features increases linearly, the number of examples required for learning increases exponentially. This phenomenon is commonly known as the curse of dimensionality. Dimensionality reduction is an effective solution to this problem [1, 2]. Dimensionality reduction extracts or selects a subset of features to describe the target concept. Selection is based on finding a relevant subset of the original features, whereas extraction generates a new feature space through a transformation [1, 3]. Proper design of the selection or extraction process improves both the complexity and the performance of learning algorithms [4].

Feature selection represents the data by a small subset of its features in their original form [5]. The role of feature selection is critical, especially in applications involving many irrelevant features. Given a criterion function, feature selection reduces to a search problem [4, 6]. Exhaustive search is infeasible when the number of features is large, so heuristic search is employed instead. Such algorithms, for example sequential forward and/or backward selection [7, 8], have shown successful results in practical applications, but none of them can guarantee optimality. This problem can be alleviated by feature weighting, which assigns a real-valued number to each feature to indicate its relevance to the learning problem [6]. Among the existing feature weighting algorithms, ReliefF [5] is considered one of the most successful due to its simplicity and effectiveness [9]. A major shortcoming of feature weighting is its inability to capture the interaction of correlated features [4, 10]. This drawback can be addressed by feature extraction techniques.

The basis of feature extraction is a mathematical transformation that maps data from a higher-dimensional space into a lower-dimensional one. Feature extraction algorithms are generally effective [11], but their effectiveness degrades when they are used to process large-scale datasets [12]. In addition, the features produced by the transformation usually involve all of the original features, so they may contain information originating from irrelevant features in the original space [3, 13].

Principal Component Analysis (PCA) is an effective feature extraction approach and has been successfully applied in recognition applications such as face, handprint, and human-made object recognition [14–16] and industrial robotics [17]. Traditional PCA is an orthogonal linear transformation that operates directly on a whole pattern represented as a vector and acquires a set of projections to extract global features from a given training pattern [18]. PCA reduces the dimension such that the representation is as faithful as possible to the original data [2]. PCA employs all features in the original space, regardless of their relevance, to produce new features. The new features may therefore contain information originating from irrelevant features in the original space, with misclassification as a side effect. Some work has been done to improve the performance of PCA via feature weighting. In [19, 20], feature weighting is used to eliminate irrelevant features or to incorporate feature weights into the PCA computation. In [19], ranks are used instead of the original data to cope with outliers and noise. Honda et al. used feature weights in a PCA-guided formulation, whereas our proposed method uses the feature weights to modify the dataset itself.

The main objective of this paper is to improve the accuracy of classification using features extracted by PCA. PCA is the best-known unsupervised linear feature extraction algorithm, but it is also used for classification tasks. Since PCA does not pay any particular attention to the underlying class structure, it is not always an optimal dimensionality-reduction procedure for classification, and the projection axes chosen by PCA might not provide good discrimination power. However, the study in [21] illustrates that PCA can outperform LDA, one of the best supervised dimensionality reduction methods, when the number of samples per class is small or when the training data nonuniformly sample the underlying distribution. In the present work, we propose a novel preprocessing method composed of two steps. In the first step, the quality of each feature is computed via a feature weighting algorithm, and the relevant features, those with weights larger than a predefined threshold, are selected and passed to the second step. In the second step, the variances of the features are modified so that the most relevant features become the most important ones for PCA. Finally, PCA is performed on the modified data to generate uncorrelated features.

The rest of this paper is organized as follows. Section 2 briefly reviews ReliefF, PCA, and the problems associated with PCA. Section 3 describes the proposed algorithm. Section 4 presents our experiments on both synthetic and real data, and Section 5 concludes the paper.

2. Review of the ReliefF and PCA Methods

This section reviews ReliefF and PCA briefly and presents the drawbacks of PCA.

2.1. ReliefF

Relief [5] is one of the most successful algorithms for assessing the quality of features. The main idea of Relief is to iteratively estimate the weights of features according to how well their values distinguish among instances that are near each other. The original Relief is limited to two-class problems and deals only with complete data [22]; in particular, it has no mechanism to eliminate redundant features [23]. This paper utilizes an extension of Relief called ReliefF [22] that solves the first two problems of Relief. In contrast to Relief, which uses the 1-nearest-neighbor algorithm, ReliefF uses an approach based on the k-nearest-neighbor algorithm. Pseudocode 1 presents the pseudocode of this algorithm. It is assumed that X denotes a training dataset with n samples, in which each sample consists of m features and a known class label. In each iteration, ReliefF randomly selects a sample (pattern) x and then searches for its k nearest neighbors from the same class, termed nearest hits, and the k nearest neighbors from each of the other classes, called nearest misses. To compute the weight of each feature, ReliefF uses the contributions of all the hits and misses.

ReliefF Algorithm
Initialization: given a training set X with n samples and m features, class labels y in {1, ..., C},
   the number of neighbors k, and the number of iterations T; set all weights W[j] := 0, j = 1, ..., m;
for t := 1 to T
   randomly select a pattern x from X with class y_x;
   find the k nearest hits H_1, ..., H_k from class y_x;
   for each class c != y_x
      find the k nearest misses M_1(c), ..., M_k(c) from class c;
   for j := 1 to m
      compute:
      W[j] := W[j] - sum_{i=1..k} diff(j, x, H_i) / (T*k)
                   + sum_{c != y_x} [ P(c) / (1 - P(y_x)) * sum_{i=1..k} diff(j, x, M_i(c)) ] / (T*k)
   end
end

In the ReliefF algorithm, T is a user-defined parameter that determines how many times the process is repeated to estimate the weight of each feature, diff(j, x, x') is the normalized difference between two samples on the j-th feature, and P(c) is the prior probability of class c.
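For concreteness, the following NumPy sketch implements the weight update described above. It is a minimal illustration rather than the authors' implementation: the function name relieff, the Euclidean neighbor search, and the default parameters are our own assumptions, and diff is realized as the range-normalized absolute difference.

import numpy as np

def relieff(X, y, n_iters=100, k=10, rng=None):
    # Minimal ReliefF sketch: estimate one relevance weight per feature.
    # X: (n_samples, n_features) array; y: integer class labels.
    rng = np.random.default_rng(rng)
    n, m = X.shape
    classes, counts = np.unique(y, return_counts=True)
    priors = dict(zip(classes, counts / n))
    span = X.max(axis=0) - X.min(axis=0)   # normalizes diff() to [0, 1]
    span[span == 0] = 1.0
    W = np.zeros(m)
    for _ in range(n_iters):
        i = rng.integers(n)
        x, c = X[i], y[i]
        def k_nearest(mask):
            idx = np.flatnonzero(mask)
            d = np.linalg.norm(X[idx] - x, axis=1)
            return X[idx[np.argsort(d)[:k]]]
        # Nearest hits pull the weight down, nearest misses push it up.
        hits = k_nearest((y == c) & (np.arange(n) != i))
        W -= (np.abs(hits - x) / span).sum(axis=0) / (n_iters * k)
        for other in classes[classes != c]:
            misses = k_nearest(y == other)
            scale = priors[other] / (1.0 - priors[c])
            W += scale * (np.abs(misses - x) / span).sum(axis=0) / (n_iters * k)
    return W

Features whose values separate a sample from its nearest misses more than from its nearest hits accumulate large weights, which matches the intuition given at the start of this subsection.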

2.2. Principal Component Analysis

PCA is a very effective feature extraction approach and has been successfully applied to various pattern recognition applications such as face classification [18]. As mentioned above, n and m denote the number of samples in the dataset X and their dimension, respectively. PCA finds a subspace whose basis vectors correspond to the maximum-variance directions of the original space. As mentioned before, PCA is a linear transform. Let W represent the linear transformation that maps the original m-dimensional space into an l-dimensional feature space, where normally l < m. The new feature vectors are given by

\[ y_i = W^{T} x_i, \qquad i = 1, \ldots, n. \tag{1} \]

The columns of W are the eigenvectors e_i obtained by solving

\[ C e_i = \lambda_i e_i, \tag{2} \]

where C is the covariance matrix of the data and λ_i is the eigenvalue associated with the eigenvector e_i. The eigenvectors are sorted from high to low according to their corresponding eigenvalues. The eigenvector associated with the largest eigenvalue is the most important one and reflects the direction of greatest variance [21].
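The transformation in (1) and (2) can be sketched in a few lines of NumPy. The helper name pca_transform and its interface are our own, but the computation is the standard covariance eigendecomposition.

import numpy as np

def pca_transform(X, n_components):
    # Project the data onto the eigenvectors of the covariance matrix
    # with the largest eigenvalues, as in (1) and (2).
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)       # covariance matrix C
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # sort from high to low
    W = eigvecs[:, order[:n_components]]         # columns are the top eigenvectors
    return X_centered @ W                        # y_i = W^T x_i for every sample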

PCA employs all of the features and acquires a set of projection vectors to extract global features from the given training samples. The performance of PCA degrades when there are more irrelevant features than relevant ones. Moreover, PCA has no prior knowledge about the classes in a given dataset, so it is not well suited to separating the classes in the resulting subspace.

We present an example to confirm these points. This example uses a dataset with five variables and 300 records. The number of classes is three and each class has 100 points. The last two variables are uniformly distributed noise and are therefore irrelevant features. Table 1 shows the centroids and the standard deviations of the three classes [24].

Unlike the other three variables, the centroids of the two noise variables (the fourth and fifth) are very close across the classes, and their standard deviations are larger than those of the other three variables. Figure 1 illustrates the 300 points in different two-dimensional subspaces; no class structure can be found in the subspaces involving the two noisy features. Now, PCA is applied to the dataset described in Table 1. Figure 2 shows the result obtained by projecting onto the two most significant eigenvectors extracted by PCA.

Figure 2 shows that the obtained result is not suitable for classification, because the PCA algorithm has no mechanism for identifying irrelevant features. As mentioned before, PCA finds the projections of the data with maximum variance, and in this example the two irrelevant features have the largest variances. Now, PCA is performed only on the three relevant variables. Figure 3 illustrates the resulting data; notice that the class structure is now visible. Because the irrelevant features are removed, the result is suitable for classification. The next section presents the proposed algorithm to solve this problem.
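The effect illustrated in Figures 2 and 3 is easy to reproduce with the pca_transform sketch above. The class centroids, standard deviations, and noise range used below are illustrative placeholders, not the values of Table 1.

import numpy as np

rng = np.random.default_rng(0)
# Three Gaussian classes in the first three variables (placeholder values).
centroids = np.array([[0.0, 0.0, 0.0], [3.0, 3.0, 3.0], [6.0, 0.0, 6.0]])
relevant = np.vstack([c + rng.normal(0.0, 0.5, size=(100, 3)) for c in centroids])
# Two uniformly distributed noise variables with a much larger spread.
noise = rng.uniform(-10.0, 10.0, size=(300, 2))
X = np.hstack([relevant, noise])

proj_all = pca_transform(X, 2)              # dominated by the noisy variables
proj_relevant = pca_transform(X[:, :3], 2)  # class structure is preserved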

3. RPCA Feature Extraction

As shown in Figure 2, the directions found by PCA are not suitable for classification if the variances of the features do not correspond to their importance. For example, if the variances of irrelevant features are large, then the features extracted via PCA are not suitable for classification. Conversely, if the importance of the features agrees with their variances, then the features extracted using PCA are more likely to be suitable for classification. In this paper, a new preprocessing method is proposed which consists of two connected steps, relevance analysis and variance adjustment, as shown in Figure 4.

In the relevance analysis step, the weights of the features are calculated by a feature weighting approach (such as Relief or its extension for multiclass datasets, ReliefF). Assume that w_1, ..., w_m are the weights, estimated using ReliefF, of the variables in the original space. Since the weights indicate the level of relevance, the feature with the largest weight is the most relevant, and the weight is close to zero or negative when the feature is irrelevant [5]. In this work, only features whose weights are larger than a user-defined threshold δ are passed to the next step; the set of selected features is therefore

\[ S = \{\, j \mid w_j > \delta \,\}. \tag{3} \]

After removing the irrelevant features, only the selected features in S are kept. In the variance adjustment step, the variances of these features are changed so that the most relevant feature becomes the most important feature for PCA. The key idea of this step comes from a basic characteristic of PCA: the feature with the maximum variance is the most important one for PCA. The new variance of the j-th feature is calculated as

\[ \sigma'^{2}_{j} = \frac{r_j}{p}\, w_{\max}, \tag{4} \]

where p is the number of features whose weights exceed the threshold (the number of relevant features), w_max is the weight of the most important feature, and r_j is the weight rank of the j-th feature (1 is the least important and p is the most important). Since r_j ≥ 1, p ≥ 1, and w_max > 0, σ'^2_j is always positive. It is important to mention that the most important feature receives the new variance w_max because w_max is the largest weight. Then, to change the variance of the j-th feature to σ'^2_j, its values are multiplied by a scaling factor α_j obtained as

\[ \alpha_j = \sqrt{ \frac{\sigma'^{2}_{j}}{ \frac{1}{n} \sum_{i=1}^{n} \left( x_{ij} - \mu_j \right)^{2} } }, \tag{5} \]

where σ'^2_j is the new variance of the j-th feature calculated using (4), n is the number of samples, and x_{ij} and μ_j are the j-th feature of the i-th sample and the mean of the j-th feature, respectively. After this adjustment, PCA is applied to the data. We call the proposed method RPCA, which refers to applying ReliefF in the first step for weighting the features.
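The two steps can be combined into a single preprocessing routine. The sketch below reuses the relieff and pca_transform sketches given earlier; the routine name, its default parameters, and the rank-based target variance of (4) as reconstructed above are our own assumptions, not the authors' code.

import numpy as np

def rpca_preprocess(X, y, threshold=0.0, n_components=2, **relieff_kwargs):
    # Step 1 (relevance analysis): keep features whose ReliefF weight
    # exceeds the user-defined threshold.
    weights = relieff(X, y, **relieff_kwargs)
    keep = np.flatnonzero(weights > threshold)
    Xr, wr = X[:, keep], weights[keep]
    p = len(keep)
    # Step 2 (variance adjustment): rescale each kept feature so that its
    # variance equals the rank-based target of Eq. (4).
    ranks = np.argsort(np.argsort(wr)) + 1        # 1 = least relevant, p = most relevant
    target_var = ranks / p * wr.max()             # Eq. (4)
    alpha = np.sqrt(target_var / Xr.var(axis=0))  # Eq. (5): per-feature scaling factor
    # Finally, ordinary PCA on the adjusted data.
    return pca_transform(Xr * alpha, n_components)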

Note that any feature weighting method can be used in the first step. Since the output of the first step is the input of the second step (variance adjustment), a more effective feature weighting method leads to better results; in particular, using a feature weighting method more effective than ReliefF would give better results than using ReliefF. The type of feature weighting also matters: if ReliefF is replaced with an unsupervised feature weighting method such as SUD [25], the proposed method can be used for dimensionality reduction on unlabeled datasets. The advantages of our preprocessing method are summarized as follows.
(i) The extracted features are formed only from relevant features.
(ii) The preprocessing steps have low time complexity.
(iii) The preprocessing steps approximately reveal the underlying class structure to PCA.

4. Simulation Results

This section presents experimental results showing the effectiveness of RPCA on four UCI datasets and on the synthetic data introduced in Section 2.2. Table 2 summarizes the four UCI datasets. In our experiments we applied ReliefF, which uses k nearest hits and misses instead of just one; the value of k was set to 10, as suggested in [22].

In order to provide a platform on which PCA and RPCA can be compared, KNN classification errors are used. The number of nearest neighbors is chosen by trial and error. To reduce statistical variation, each algorithm is run 20 times on each dataset, and in each run the dataset is randomly partitioned into training and testing sets. In addition, 50 irrelevant Gaussian features are added to each UCI dataset; the mean of the Gaussian distribution is zero and its standard deviation is set per dataset.
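A rough sketch of this protocol is given below, using scikit-learn for the KNN classifier and the train/test split. The noise standard deviation, the number of neighbors, and the helper names are placeholders, and rpca_preprocess and pca_transform are the sketches from the previous sections.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def noisy_copy(X, n_noise=50, noise_std=1.0, seed=0):
    # Append irrelevant zero-mean Gaussian features, as described above.
    rng = np.random.default_rng(seed)
    return np.hstack([X, rng.normal(0.0, noise_std, size=(X.shape[0], n_noise))])

def knn_test_error(Z, y, k_nn=3, seed=0):
    # Random train/test split followed by a KNN test-error estimate.
    Z_tr, Z_te, y_tr, y_te = train_test_split(Z, y, test_size=0.5, random_state=seed)
    clf = KNeighborsClassifier(n_neighbors=k_nn).fit(Z_tr, y_tr)
    return 1.0 - clf.score(Z_te, y_te)

# Example comparison on one noisy copy of a labeled dataset (X, y).
# For brevity the preprocessing is fitted on the whole noisy matrix here;
# a faithful replication would fit it on the training split only.
# Xn = noisy_copy(X)
# err_pca  = knn_test_error(pca_transform(Xn, 5), y)
# err_rpca = knn_test_error(rpca_preprocess(Xn, y, n_components=5), y)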

Table 3 shows the testing errors. The number of extracted features is five, except for the synthetic dataset, where it is two. The numbers of training and testing instances for the synthetic dataset are 100 and 200, respectively. The performance of KNN degrades significantly in the presence of a large number of irrelevant features [6]. Figure 5 illustrates the average testing errors of PCA and RPCA over 20 runs as a function of the number of extracted features. The figure reveals that RPCA significantly outperforms PCA in terms of classification error and effectiveness in reducing dimensionality. These results show that RPCA can significantly improve the performance of KNN. As discussed in Section 3, using a feature weighting method better than ReliefF in the first step could lead to even better results.

5. Conclusion

We propose a new preprocessing method comprising two steps to improve the performance of PCA in classification tasks. After weighting the features and selecting the relevant ones in the first step, the variances of the features are adjusted according to their importance in the second step, so that the most important feature has the largest variance. Finally, PCA is applied to the modified data. Since ReliefF is used for feature weighting in the first step, we name the proposed preprocessing technique RPCA. Other feature weighting methods can be used instead of ReliefF; for example, SUD [25] can be employed for unlabeled data. The simulation results show that RPCA significantly improves the effectiveness of PCA for classification purposes.

Acknowledgment

This research is supported by Iran Telecommunication Research Center (ITRC).