Abstract

Ensemble data mining methods, also known as classifier combination, are often used to improve classification performance. Various classifier combination methods such as bagging, boosting, and random forest have been devised and have received considerable attention. However, data dimensionality keeps increasing rapidly, which poses challenges because these methods are not well suited to direct application to high-dimensional datasets. In this paper, we propose an ensemble method for classification of high-dimensional data, with each classifier constructed from a different set of features determined by partitioning of redundant features. In our method, the redundancy among features is used to divide the original feature space. Then, a support vector machine is trained on each generated feature subset, and the results of the classifiers are combined by majority voting. The efficiency and effectiveness of our method are demonstrated through comparisons with other ensemble techniques, and the results show that our method outperforms the other methods.

1. Introduction

The ultimate goal of supervised learning for classification is to mine previously unknown knowledge from existing data in order to predict future events with the best possible classification performance [1]. Classification algorithms typically deal with a set of records, each of which consists of a fixed number of features along with a class label that denotes its target. The algorithm then outputs a decision boundary that represents underlying patterns in the data. Many useful classification algorithms such as decision trees [2], neural networks [3, 4], and support vector machines (SVMs) [5] have been presented in the past. However, increasing data dimensionality may cause several issues with respect to scalability and learning performance in these classification algorithms. Moreover, the classification ability of a single classifier is limited.

In general, ensembles of classifiers provide better classification accuracy than a single predictor. To improve the classification accuracy, ensemble methods, also known as classifier combination, first generate a set of base classifiers from training data and then perform the actual classification by combining the results of the base classifiers. To achieve better accuracy with the combined set of classifiers, the base classifiers should be diverse and independent. With respect to how each base classifier is built, ensemble generation methods can be broadly categorized into four groups [6]: (i) selecting different subsets of instances of the training set to build each base classifier, (ii) choosing different subsets of the input features to construct each base classifier, (iii) using different categories of the class labels to build each base classifier, and (iv) manipulating the learning algorithm. Among many methods, bagging [7] and boosting [8] are two widely used ensemble methods. They resample the original data to create multiple training sets based on some sampling distribution and build a base classifier from each bootstrap sample. However, these methods are not guaranteed to generate fully independent base classifiers [9]. The theoretical and empirical results in [10, 11] indicate that the most effective way of achieving independence is to train base classifiers on different feature subsets [12]. The basic idea of a feature subset-based ensemble is simply to give each classifier a different projection of the training set [13]. In particular, for high-dimensional data, adopting independent feature subsets for ensemble generation has been shown to be more efficient [14] than manipulating the training samples. This may be due to the following: (i) a feature subset-based ensemble can run faster because of the reduced size of the input space; (ii) it can reduce the correlation among the classifiers. Among the feature subset-based ensemble methods, random forest [15] is a widely used approach that employs a decision tree as the base classifier. It achieves diversity by randomly partitioning the original feature space instead of using all features. However, random partitioning of the input space may increase the risk that irrelevant and redundant features are included in the selected subset. Furthermore, decision tree methods suffer from the so-called fragmentation problem, as fewer and fewer training data are available to search for the root nodes of subtrees. If the training data do not have enough instances compared with dimensions, the performance of a decision tree typically becomes very poor.

In this paper, we propose an ensemble framework for classifying high-dimensional data, with each classifier constructed from a different set of features determined by partitioning of redundant features. First, we suggest a multiple subset generation method based on feature relevance and redundancy to construct each classifier. Then, a number of classifiers are built from the generated subsets. Finally, the classification results of the classifiers are combined by majority voting. In our experiments, the proposed ensemble method outperforms other ensemble methods by up to 6% in terms of classification accuracy.

2. Previous Work

2.1. Bagging

Bagging [7] is a method for generating multiple versions of a classifier and using them to obtain an aggregated classifier. Each base classifier is trained on a different bootstrap sample. Algorithm 1 shows the bagging algorithm [14]. The algorithm takes training data $S$, an inducer $I$, and the number of bootstrap samples $T$ as input and then produces an ensemble classifier $C^*$, which is the combination of the classifiers trained on the multiple bootstrap samples. A bootstrap sample $S'$ is obtained by repeatedly sampling instances from the dataset according to a probability distribution (line 2). Since the sampling is done with replacement, some instances may appear several times in the same training set, while others may not appear at all. Consequently, $T$ bootstrap samples are generated, and a classifier $C_i$ is trained on each bootstrap sample (line 3). Finally, a combined classifier $C^*$ is built from $C_1, \ldots, C_T$, and it predicts the class label of a given instance by counting votes (line 5).

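Input: training data $S$, Inducer $I$, number of bootstrap samples $T$
Output: Aggregated classifier $C^*$
Begin:
(1)  for $i = 1$ to $T$ {
(2)    $S'$ = bootstrap sample from $S$ (sample with replacement)
(3)    $C_i = I(S')$
(4)  }
(5)  $C^*(x) = \arg\max_{y \in Y} \sum_{i : C_i(x) = y} 1$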
End
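The bagging procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names, the one-dimensional toy data, and the `nearest_centroid_inducer` (a stand-in for a real base learner such as an SVM) are all hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) instances uniformly at random, with replacement."""
    return [rng.choice(data) for _ in range(len(data))]

def bagging(data, inducer, n_samples, seed=0):
    """Train one classifier per bootstrap sample; return a vote-counting classifier."""
    rng = random.Random(seed)
    classifiers = [inducer(bootstrap_sample(data, rng)) for _ in range(n_samples)]

    def aggregated(x):
        # Each base classifier casts one vote; the most frequent label wins.
        return Counter(clf(x) for clf in classifiers).most_common(1)[0][0]

    return aggregated

def nearest_centroid_inducer(sample):
    """Hypothetical toy inducer for 1-D inputs: predict the class with the closest mean."""
    groups = {}
    for x, y in sample:
        groups.setdefault(y, []).append(x)
    centroids = {y: sum(v) / len(v) for y, v in groups.items()}
    return lambda x: min(centroids, key=lambda y: abs(x - centroids[y]))

# Toy data: (feature value, class label) pairs, invented for illustration.
toy = [(0.0, "a"), (1.0, "a"), (2.0, "a"), (0.5, "a"), (1.5, "a"),
       (10.0, "b"), (11.0, "b"), (12.0, "b"), (10.5, "b"), (11.5, "b")]
ensemble = bagging(toy, nearest_centroid_inducer, n_samples=11)
```

Using an odd number of bootstrap samples avoids ties in a two-class vote.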

2.2. Boosting

Boosting [8] is also a widely used ensemble method developed to improve the performance of learning algorithms; it generates multiple classifiers and votes on them. Unlike bagging, boosting assigns a weight to each training instance and may adaptively change the weight at the end of each boosting round. AdaBoost is an improved boosting algorithm whose pseudocode is shown in Algorithm 2 [14]. The algorithm takes as input training data $S$ containing $m$ instances, an inducer $I$, and an iteration parameter $T$ and then outputs a combined classifier $C^*$. Initially, all of the instances are assigned the same weight (line 1). Then, the algorithm gradually constructs classifiers by modifying the weights of training instances based on the previous classifier's performance (lines 2–9). This is accomplished by computing the new classifier while putting more emphasis on those instances previously found to be difficult to classify accurately. After generating each classifier $C_i$, its weighted error $\epsilon_i$ is calculated (line 4). If the weighted error is larger than 0.5, the current training set $S'$ is set to a bootstrap sample with weight 1 for every instance. Otherwise, the weights of correctly classified instances are multiplied by the factor $\beta_i = \epsilon_i / (1 - \epsilon_i)$ and all weights are then renormalized (lines 6–8), which raises the relative weight of the misclassified instances. In other words, if the current classifier finds a certain instance difficult to classify, then that instance will be assigned a greater weight for the next iteration. Conversely, if an instance is found to be easy to classify, then it will have a smaller weight in the next iteration. Finally, the classifiers are combined using a weighted voting scheme (line 10).

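Input: training data $S$ of size $m$, Inducer $I$, number of iterations $T$
Output: Aggregated classifier $C^*$
Begin:
(1)  $S' = S$ with instance weights assigned to be 1
(2)  for $i = 1$ to $T$ {
(3)    $C_i = I(S')$
(4)    $\epsilon_i = \frac{1}{m} \sum_{x_j \in S' : C_i(x_j) \neq y_j} \mathrm{weight}(x_j)$
(5)    If $\epsilon_i > 1/2$, set $S'$ to a bootstrap sample from $S$ with weight 1 for every instance and go to Step 3
(6)    $\beta_i = \epsilon_i / (1 - \epsilon_i)$
(7)    For each $x_j \in S'$, if $C_i(x_j) = y_j$ then $\mathrm{weight}(x_j) = \mathrm{weight}(x_j) \cdot \beta_i$
(8)    Normalize the weights of instances so the total weight of $S'$ is $m$
(9)  }
(10) $C^*(x) = \arg\max_{y \in Y} \sum_{i : C_i(x) = y} \log (1/\beta_i)$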
End
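As an illustration of the weight-update loop, the following Python sketch follows the AdaBoost.M1 formulation (weighted error, reweighting of correctly classified instances, normalization, and log-weighted voting). It is a minimal sketch under stated assumptions: the decision stump `train_stump`, the cap `max_resamples`, the clamp that avoids a zero error, and the toy data are hypothetical choices made for the example and are not taken from the experiments.

```python
import math
import random

def train_stump(xs, ys, weights):
    """Weighted decision stump on 1-D inputs: predict `left` if x <= thr, else `right`."""
    labels = sorted(set(ys))
    best = None
    for thr in sorted(set(xs)):
        for left in labels:
            for right in labels:
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if (left if x <= thr else right) != y)
                if best is None or err < best[0]:
                    best = (err, thr, left, right)
    _, thr, left, right = best
    return lambda x: left if x <= thr else right

def adaboost_m1(xs, ys, rounds, seed=0, max_resamples=25):
    rng = random.Random(seed)
    m = len(xs)
    cur_x, cur_y, weights = list(xs), list(ys), [1.0] * m    # all weights start at 1
    ensemble, resamples = [], 0
    while len(ensemble) < rounds and resamples <= max_resamples:
        clf = train_stump(cur_x, cur_y, weights)
        wrong = [clf(x) != y for x, y in zip(cur_x, cur_y)]
        eps = sum(w for w, bad in zip(weights, wrong) if bad) / m   # weighted error
        if eps > 0.5:
            # Too weak: restart from a fresh bootstrap sample with unit weights.
            idx = [rng.randrange(m) for _ in range(m)]
            cur_x, cur_y = [xs[i] for i in idx], [ys[i] for i in idx]
            weights, resamples = [1.0] * m, resamples + 1
            continue
        beta = max(eps, 1e-10) / (1.0 - eps)   # clamp avoids beta = 0 when eps = 0
        # Shrink the weights of correctly classified instances, then renormalize to m.
        weights = [w if bad else w * beta for w, bad in zip(weights, wrong)]
        scale = m / sum(weights)
        weights = [w * scale for w in weights]
        ensemble.append((clf, beta))

    def aggregated(x):
        # Weighted vote: each classifier contributes log(1 / beta) to its predicted label.
        scores = {}
        for clf, beta in ensemble:
            label = clf(x)
            scores[label] = scores.get(label, 0.0) + math.log(1.0 / beta)
        return max(scores, key=scores.get)

    return aggregated

# Toy data that no single stump can separate (invented for illustration).
xs = list(range(1, 11))
ys = ["+"] * 3 + ["-"] * 4 + ["+"] * 3
model = adaboost_m1(xs, ys, rounds=3)
predictions = [model(x) for x in xs]
```

On this toy set a single stump misclassifies at least three points, while the weighted vote of three stumps reproduces all ten training labels.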

2.3. Random Forest

Random forest is an ensemble classification method consisting of multiple unpruned decision trees. Unlike bagging, random forest does not use the whole set of input features for every base classifier: in addition to drawing bootstrap samples, it randomly selects a subspace of the original feature space. As shown in Algorithm 3, to construct the individual decision trees, bootstrap samples are drawn from the training instances with replacement (line 2). Then, the classification and regression tree (CART) algorithm is applied to grow the decision tree. At the node selection stage, it chooses the best split from a randomly selected subspace of features (lines 3–4).

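Input: training data $S$, number of selected variables $k$, number of trees $T$
Output: Aggregated classifier $C^*$
Begin:
(1) for $i = 1$ to $T$ {
(2)   $S'$ = bootstrap sample from $S$ (sample with replacement)
(3)   $F'$ = feature subset of size $k$ ($F'$ will be randomly selected from original input space)
(4)   $C_i = \mathrm{CART}(S', F')$ (CART: Classification and regression tree)
(5) }
(6) $C^*(x) = \arg\max_{y \in Y} \sum_{i : C_i(x) = y} 1$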
End
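The two sources of randomness in the algorithm above, bootstrap sampling of instances and random selection of a feature subspace, can be sketched as follows. This is only an illustration of the sampling steps as listed per tree in the pseudocode; the tree learner itself is omitted, and the function names and parameters (`draw_tree_inputs`, `project`, `k`) are hypothetical.

```python
import random

def draw_tree_inputs(n_instances, n_features, k, n_trees, seed=0):
    """For each tree, draw bootstrap row indices and k distinct feature indices."""
    rng = random.Random(seed)
    plans = []
    for _ in range(n_trees):
        rows = [rng.randrange(n_instances) for _ in range(n_instances)]  # with replacement
        cols = sorted(rng.sample(range(n_features), k))                  # without replacement
        plans.append((rows, cols))
    return plans

def project(data, rows, cols):
    """Build the training view for one tree: selected rows, restricted to selected columns."""
    return [[data[r][c] for c in cols] for r in rows]

# Toy 4 x 6 data matrix, invented for illustration.
data = [[r * 10 + c for c in range(6)] for r in range(4)]
plans = draw_tree_inputs(n_instances=4, n_features=6, k=2, n_trees=5)
first_view = project(data, *plans[0])
```

Each view would then be passed to a decision tree learner such as CART, and the resulting trees combined by voting.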

2.4. Feature Subset-Based Ensembles

Bagging and boosting are ensemble methods that manipulate the original instances. However, ensemble methods of this kind have difficulty accurately classifying high-dimensional data such as image or gene expression data. The reason is that image or gene expression data generally have a very small number of samples compared with the number of dimensions. Therefore, sampling the training instances leads to a lack of representative instances, so that bagging and boosting become susceptible to overfitting. In this case, a feature subset-based ensemble method is more efficient [14] than manipulating the training samples. This may be due to the following: (i) a feature subset-based ensemble can run faster because of the reduced size of the input space; (ii) it can reduce the correlation among the classifiers. Besides random forest, various feature partitioning-based ensemble methods have been proposed. Ahn et al. [16] proposed an ensemble method that uses mutually exclusive subspaces to achieve diversity. The authors applied their method to bioinformatics and chemical domains and showed that it can achieve better performance than random forest. Ming Ting et al. [17] also introduced a feature subset-based ensemble method that employs a support vector machine as the base classifier. The feature space was divided into nonoverlapping local regions according to a user-defined number of features. de Bock and Poel [18] proposed a rotation-based ensemble classifier that applies feature extraction methods such as principal component analysis and independent component analysis to generate subspaces of features. However, these methods did not consider the correlation among features [19].

3. Proposed Ensemble Method

We propose an ensemble method with multiple independent feature subsets to better classify high-dimensional data. The framework of the proposed ensemble is shown in Figure 1. The proposed method mainly consists of two phases: (i) generating multiple feature subsets based on the correlation among features and (ii) constructing the model from each feature subset using a machine learning algorithm as the base classifier and combining the results of all classifiers by majority voting. Next, we will illustrate each step in detail.

3.1. Feature Subset Generation

Generating feature subsets for an ensemble can be viewed as multiple iterations of a feature selection procedure. In the past, various feature selection techniques have been proposed, such as the chi-square test, mutual information, Pearson correlation coefficients, and Relief [20]. Although these methods are fast, they lack robustness when interactions among features exist. To select a relevant and nonredundant feature subset, the Fast Correlation-Based Filter (FCBF) [21] approach was proposed to remove redundant as well as irrelevant features, and the Symmetrical Uncertainty (SU) was used to measure the correlation, where

$$SU(X, Y) = 2 \left[ \frac{IG(X \mid Y)}{H(X) + H(Y)} \right].$$

Here $IG(X \mid Y)$ is the information gain of $X$ after observing variable $Y$, and $H(X)$ and $H(Y)$ are the entropies of variables $X$ and $Y$, respectively. FCBF removes irrelevant features by ranking the correlation between features and the class. To remove redundant features, the authors introduced the concept of a predominant feature. A feature is said to be predominant if it does not have any approximate Markov Blanket in the current set. For two relevant features $F_i$ and $F_j$ ($i \neq j$), $F_j$ forms an approximate Markov Blanket for $F_i$ if

$$SU_{j,c} \geq SU_{i,c} \quad \text{and} \quad SU_{i,j} \geq SU_{i,c},$$

where $SU_{i,c}$ is the correlation between feature $F_i$ and the class; $SU_{j,c}$ is the correlation between feature $F_j$ and the class; and $SU_{i,j}$ is the correlation between features $F_i$ and $F_j$. Thus, FCBF is a process in which all predominant features are identified. They are searched for as follows. First, the feature with the largest $SU_{i,c}$ value is selected as a starting point. Next, all features that are redundant with respect to this feature are removed. Then the features that are redundant with respect to the feature with the next largest SU in the remaining set are removed. The algorithm repeats this procedure until no redundant features remain.
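To make the SU measure concrete, the sketch below computes it for discrete-valued variables as twice the information gain divided by the sum of the two entropies. It is a minimal illustration: the function names are hypothetical, and continuous features would first have to be discretized, a step that is assumed here rather than shown.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """H(X | Y): entropy of x within each group of equal y, weighted by group size."""
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    return sum(len(g) / len(x) * entropy(g) for g in groups.values())

def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * IG(X | Y) / (H(X) + H(Y)); defined as 0 when both entropies are 0."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    information_gain = hx - conditional_entropy(x, y)
    return 2.0 * information_gain / (hx + hy)

su_identical = symmetrical_uncertainty([0, 0, 1, 1], [0, 0, 1, 1])    # fully dependent
su_independent = symmetrical_uncertainty([0, 0, 1, 1], [0, 1, 0, 1])  # independent
```

SU is normalized to the range [0, 1]: identical variables give 1 and independent variables give 0, which is what makes it usable both for feature-class relevance and for feature-feature redundancy.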

Although FCBF performs well on high-dimensional data, it is not suitable for ensemble learning because it was originally designed to select a single feature subset. Thus, we extend FCBF to generate multiple feature subsets. First, all the features are sorted in descending order of their correlation with the class (i.e., SU). Then a relevant subset of features is derived using a predefined threshold $\delta$. If the SU value of a feature is larger than the threshold, the feature is considered to be relevant. Generally, we recommend setting the threshold to 0 so that all of the features are considered in the redundancy analysis step except "waste-features," which have an SU value of 0 with respect to the class. After that, redundancy analysis is conducted on the relevant subset. The main difference between our method and FCBF is that our method also makes use of the features that FCBF removes. We hypothesize that these removed features deserve attention: of two redundant features, FCBF removes the less relevant one, yet in some cases low-ranked features can also play an important role when combinations of features are considered. Thus, the features not selected in one iteration become the input of the next iteration. For example, in the first iteration, the redundant features are removed from the original space as is done in FCBF. In the second iteration, the same analysis is performed, but on the subset removed in the first iteration rather than on the whole feature space. Then, the third subset is selected from the features removed in the second iteration. This is repeated until a user-defined number of subsets has been selected.
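The iterative procedure described above can be sketched as follows. This is one possible reading of the description, not the authors' implementation: the redundancy test uses the approximate Markov blanket condition of FCBF, features are assumed to be discrete, and all names (`su`, `fcbf_pass`, `generate_subsets`, the toy features) are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def su(x, y):
    """Symmetrical uncertainty of two discrete sequences."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    h_x_given_y = sum(len(g) / len(x) * entropy(g) for g in groups.values())
    return 2.0 * (hx - h_x_given_y) / (hx + hy)

def fcbf_pass(candidates, su_class, su_pair):
    """One redundancy analysis over candidates sorted by class relevance (descending).

    Returns (selected, removed): the predominant features and the redundant ones.
    """
    remaining, selected, removed = list(candidates), [], []
    while remaining:
        fj = remaining.pop(0)            # most relevant remaining feature
        selected.append(fj)
        kept = []
        for fi in remaining:
            # fj is at least as relevant as fi; fi is redundant if SU(fi, fj) >= SU(fi, class).
            (removed if su_pair(fi, fj) >= su_class[fi] else kept).append(fi)
        remaining = kept
    return selected, removed

def generate_subsets(features, labels, n_subsets, threshold=0.0):
    """Split relevant features into subsets by re-running the pass on removed features."""
    su_class = {name: su(col, labels) for name, col in features.items()}
    by_relevance = lambda f: -su_class[f]
    pool = sorted((f for f in features if su_class[f] > threshold), key=by_relevance)
    subsets = []
    while pool and len(subsets) < n_subsets:
        selected, removed = fcbf_pass(
            pool, su_class, lambda a, b: su(features[a], features[b]))
        subsets.append(selected)
        pool = sorted(removed, key=by_relevance)   # removed features feed the next round
    return subsets

# Toy data: f2 duplicates f1; f3 and f4 carry no class information.
labels = [0, 0, 1, 1]
features = {"f1": [0, 0, 1, 1], "f2": [0, 0, 1, 1], "f3": [0, 1, 0, 1], "f4": [5, 5, 5, 5]}
subsets = generate_subsets(features, labels, n_subsets=3)
```

In this toy case the duplicated feature is removed in the first pass and recovered as the second subset, while the two features with zero SU are discarded by the threshold; generation stops early because no removed features remain.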

3.2. Model Learning

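Over the past few years, SVM has been widely used for classification because of its good performance on high-dimensional data [22]. SVM was developed by Vapnik and has been applied to problems such as handwritten digit recognition [23], object recognition [24], text classification [25], cancer diagnosis [23], and bioinformatics [26]. Hence, we use SVM as the base classifier in our ensemble method. The goal of SVM is to find a hyperplane with a maximal margin (distance between two groups of data points) as defined and illustrated in Figure 2. Given some data points that are assumed to be divided into two groups, circles and squares, the hyperplane can be written as

$$\mathbf{w} \cdot \mathbf{x} + b = 0,$$

where $\mathbf{w}$ and $b$ are parameters of the model: $\mathbf{w}$ denotes a vector orthogonal to the hyperplane and $b$ refers to a bias. SVM separates the data points into two groups by means of two parallel hyperplanes chosen such that they divide the data, no data points lie between them, and their distance, defined as the margin, is maximized. Figure 2 shows these two hyperplanes placed at the boundaries of the two groups. The hyperplanes and the margin can be written as follows:

$$\mathbf{w} \cdot \mathbf{x} + b = 1, \qquad \mathbf{w} \cdot \mathbf{x} + b = -1, \qquad \text{margin} = \frac{2}{\|\mathbf{w}\|}.$$

Hence, the learning task in SVM can be formalized as the following constrained optimization problem:

$$\min_{\mathbf{w}, b} \; \frac{\|\mathbf{w}\|^2}{2} \quad \text{subject to} \quad y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1, \quad i = 1, \ldots, N,$$

where $y_i \in \{-1, +1\}$ is the class label of training point $\mathbf{x}_i$ and $N$ is the number of training points. This is a convex optimization problem, which can be solved by using the standard Lagrange multiplier method:

$$L_P = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{N} \lambda_i \left( y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right),$$

where the parameters $\lambda_i$ are called the Lagrange multipliers. With the Lagrange multipliers, the decision function can be written as follows:

$$f(\mathbf{x}) = \operatorname{sign} \left( \sum_{i=1}^{N} \lambda_i y_i \, \mathbf{x}_i \cdot \mathbf{x} + b \right).$$

Additionally, the results of the classifiers are combined by majority voting, and an unknown instance is assigned the class label that receives the most votes. The mathematical function of our ensemble method with $K$ classifiers $C_1, \ldots, C_K$ can be written as

$$C^*(\mathbf{x}) = \arg\max_{y} \sum_{k=1}^{K} \mathbb{1} \left( C_k(\mathbf{x}) = y \right).$$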
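The voting step described above can be sketched in Python as follows. Each base classifier sees only the features in its own subset, and the final label is the most frequent vote. This is a minimal sketch under stated assumptions: `threshold_classifier` is a hypothetical stand-in for a trained SVM, and the feature subsets and input vector are invented for illustration.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label that occurs most often among the votes."""
    return Counter(votes).most_common(1)[0][0]

def ensemble_predict(classifiers, feature_subsets, x):
    """Project x onto each classifier's feature subset, collect votes, and combine them."""
    votes = [clf([x[j] for j in subset])
             for clf, subset in zip(classifiers, feature_subsets)]
    return majority_vote(votes)

def threshold_classifier(threshold):
    """Hypothetical stand-in for a trained base classifier."""
    return lambda values: "pos" if sum(values) / len(values) > threshold else "neg"

feature_subsets = [[0, 1], [2], [3, 4]]          # one index list per base classifier
classifiers = [threshold_classifier(0.5) for _ in feature_subsets]
x = [0.9, 0.8, 0.1, 0.7, 0.6]
label = ensemble_predict(classifiers, feature_subsets, x)
```

Here the first and third stand-in classifiers vote "pos" and the second votes "neg", so the ensemble returns "pos". With an even number of classifiers a tie-breaking rule would be needed; `Counter.most_common` simply returns the label encountered first among tied labels.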

4. Experimental Results

4.1. Dataset

To evaluate the effectiveness of our method, we used six publicly available datasets from two different domains, four from face recognition and two from DNA microarray data classification. The purpose of using the first four datasets, namely, AR10P [27], ORL10P [28], PIE10P [29], and PIX10P [30], is to show how well our method can classify image data. Each dataset has a large number of features compared with the number of instances. The last two datasets are the Leukemia and Prostate datasets from DNA microarray experiments and are used to show how well our method can distinguish different types of cancer. The Leukemia dataset [31] contains a total of 72 samples in three classes, acute lymphoblastic leukemia (ALL), acute myeloid leukemia (AML), and mixed-lineage leukemia gene (MLL), which have 24, 28, and 20 samples, respectively. The number of features is 12,582. The Prostate dataset, first published in [32], poses a two-class classification problem and contains 102 samples and 12,600 genes. One of the tasks addressed by the authors was to build a model that can distinguish between normal and tumorous prostate tissues. The summary of the datasets is shown in Table 1.

4.2. Performance Evaluation

We compared our method with widely used ensemble methods: bagging, AdaBoost, and random forest. For bagging, AdaBoost, and our method, SVM was used as the base classifier for fair comparison. The number of classifiers for each ensemble method was set to 20. To obtain a statistically reliable predictive measurement, we performed 10 runs of 10-fold cross validation on all the datasets. In 10-fold cross validation, each dataset was randomly partitioned into ten parts. Nine parts were used as the training set, and the remaining one was used as the testing set. Selecting the kernel and appropriate parameters plays an important role in SVM classification performance. The RBF kernel is a commonly used kernel for three reasons [33]. First, the RBF kernel can handle a nonlinear relationship between class labels and attributes. Second, it has fewer hyperparameters that influence the complexity of model selection than the polynomial kernel. Third, the RBF kernel has fewer numerical difficulties. In our experiments, we chose the RBF kernel function, whose parameters $C$ and $\gamma$ must be optimized for each dataset. To determine the best values of $C$ and $\gamma$, we conducted a grid search using 10-fold cross validation. A number of $(C, \gamma)$ pairs were tried within the searched ranges of $C$ and $\gamma$, and the pair with the best cross-validation accuracy was picked.
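The evaluation protocol, k-fold partitioning combined with a grid search over two kernel parameters, can be sketched as follows. This is a minimal illustration only: the exponential grids, the function names, and the `toy_score` function are hypothetical and do not reproduce the search ranges or the SVM training used in the experiments.

```python
import itertools
import math
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k folds whose sizes differ by at most one."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_score(n, k, fit_and_score, seed=0):
    """Average fit_and_score(train_idx, test_idx) over the k train/test splits."""
    folds = k_fold_indices(n, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(fit_and_score(train_idx, test_idx))
    return sum(scores) / k

def grid_search(c_values, gamma_values, score):
    """Return the (C, gamma) pair with the highest score."""
    return max(itertools.product(c_values, gamma_values), key=lambda p: score(*p))

# Hypothetical exponentially spaced grids (assumed values, for illustration only).
c_grid = [2.0 ** e for e in range(-5, 16, 2)]
gamma_grid = [2.0 ** e for e in range(-15, 4, 2)]

def toy_score(c, gamma):
    """Stand-in for cross-validated accuracy; peaks at C = 2^3, gamma = 2^-5."""
    return -((math.log2(c) - 3) ** 2) - ((math.log2(gamma) + 5) ** 2)

best_c, best_gamma = grid_search(c_grid, gamma_grid, toy_score)
folds = k_fold_indices(23, 10)
```

In a real experiment, `toy_score` would be replaced by a call to `cross_val_score` whose `fit_and_score` callback trains an RBF-kernel SVM on the training indices and returns its accuracy on the held-out fold.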

Tables 2 through 7 show the performance of our proposed method on the six datasets in terms of TP rate, FP rate, precision, recall, $F$-measure, and ROC area. TP rate and FP rate refer to the proportion of actual positive instances correctly predicted as positive and the proportion of actual negative instances wrongly predicted as positive [34–36], respectively. Precision is computed as the number of true positive instances divided by the total number of instances labelled as belonging to the positive class. Recall is defined as the number of true positive instances divided by the total number of instances actually belonging to the positive class. $F$-measure is an evaluation metric that combines precision and recall as follows: $F$-measure = 2 × (precision × recall)/(precision + recall). ROC area is defined as the area under the Receiver Operating Characteristic (ROC) curve. Each row of the tables indicates the performance on one class, and the last row shows the averaged performance. From the tables, the average TP rate is found to be 0.969, 1, 0.99, 0.99, 0.986, and 0.961, and the average FP rate is found to be 0.003, 0, 0.001, 0.001, 0.009, and 0.038 on the six datasets. Hence, our method makes good predictions. Moreover, the ROC area on ORL10P, PIE10P, and PIX10P is almost 100%. Overall, the tables show that our method performs well on many different evaluation measures.
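The per-class measures defined above follow directly from the four confusion-matrix counts. The sketch below computes them for one class treated as positive; the function name and the example counts are hypothetical, and undefined ratios (a zero denominator) are reported as 0.0 by assumption.

```python
def binary_metrics(tp, fp, tn, fn):
    """TP rate, FP rate, precision, recall, and F-measure from confusion counts."""
    def ratio(num, den):
        return num / den if den else 0.0   # assumed convention for empty denominators

    tp_rate = ratio(tp, tp + fn)           # actual positives predicted as positive
    fp_rate = ratio(fp, fp + tn)           # actual negatives predicted as positive
    precision = ratio(tp, tp + fp)
    recall = tp_rate                       # recall and TP rate are the same quantity
    f_measure = ratio(2 * precision * recall, precision + recall)
    return {"tp_rate": tp_rate, "fp_rate": fp_rate, "precision": precision,
            "recall": recall, "f_measure": f_measure}

# Hypothetical counts for one class, invented for illustration.
metrics = binary_metrics(tp=8, fp=2, tn=88, fn=2)
```

For a multiclass dataset, the counts are obtained one class at a time by treating that class as positive and all other classes as negative, which corresponds to one row of a per-class results table.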

Figures 3 and 4 show box plots of the classification accuracies of our method, bagging, AdaBoost, and random forest. On the AR10P dataset, the proposed method shows the best average prediction accuracy, 96.152%. On the Leukemia dataset, the proposed method again yields the best average prediction accuracy, 98.75% (standard deviation = 0.44), while bagging, AdaBoost, and random forest achieve 94.30% (standard deviation = 1.38), 96.80% (standard deviation = 0.67), and 82.08% (standard deviation = 2.81), respectively. Similar results can be seen for the other datasets. One interesting observation is that random forest has relatively poor performance. This may be because random forest uses decision trees as base classifiers, while the other methods use SVM, which is generally reported to perform better than decision trees on high-dimensional data. The classification accuracies of each run are given in the Appendix.

To test the statistical significance of the differences among classifiers, a paired-samples $t$-test was performed between the proposed method and bagging. We selected bagging because it showed the best average classification accuracy among the existing methods in most cases. On the ORL10P dataset, we selected random forest instead of bagging, because the performance of our method and bagging is exactly the same. From Table 8, the hypothesis that the mean accuracy of the proposed method is equal to the mean accuracy of the compared method is rejected at the 5% significance level on every dataset ($p$ value = 0.007 on AR10P, 0.000 on ORL10P, 0.002 on PIE10P, 0.000 on PIX10P, 0.000 on Leukemia, and 0.000 on Prostate). This means that the differences between the proposed method and the compared methods are statistically significant.
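For reference, the paired-samples test statistic is the mean of the per-run accuracy differences divided by its standard error. The sketch below computes it with the Python standard library; the function name and the accuracy values are hypothetical and are not the results reported in the tables.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b of equal length."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)   # stdev uses the n - 1 denominator
    return mean(diffs) / standard_error, n - 1

# Hypothetical per-run accuracies (%) of two methods over 10 runs, for illustration only.
method_a = [98.6, 98.9, 98.2, 99.0, 98.8, 98.5, 99.1, 98.7, 98.4, 99.3]
method_b = [94.1, 95.0, 93.2, 96.1, 94.8, 92.9, 95.5, 94.0, 93.7, 95.2]
t_value, dof = paired_t_statistic(method_a, method_b)
```

With 10 paired runs there are 9 degrees of freedom, and the null hypothesis of equal means is rejected at the two-sided 5% level when the absolute value of the statistic exceeds the critical value of about 2.262.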

5. Conclusion

In this paper, we presented a feature partitioning-based ensemble method to better classify high-dimensional data. In our method, each base classifier was trained from different feature space by dividing redundant features into different subsets. SVM was used as the base classifier and the results of each SVM were merged by a majority voting method. For the experiments, we used six publicly available datasets in two different domains. Through the experiments, we demonstrated that dividing the redundant features into several parts for ensemble construction can achieve better performance for classification on high-dimensional data and that our proposed algorithm has higher prediction accuracies than other ensemble classification algorithms.

Appendix

See Tables 9, 10, 11, 12, 13, and 14.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (No. 2013R1A2A2A01068923) and (No. 2013R1A1A206518), Export Promotion Technology Development Program, Ministry of Agriculture, Food and Rural Affairs (No. 114083-3), and Special Research Program of Chonnam National University, 2009 (2009-0413).