The Scientific World Journal
Volume 2015 (2015), Article ID 471371, 18 pages
http://dx.doi.org/10.1155/2015/471371
Research Article

Unbiased Feature Selection in Learning Random Forests for High-Dimensional Data

1 Shenzhen Key Laboratory of High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 School of Computer Science and Engineering, Water Resources University, Hanoi 10000, Vietnam
4 College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
5 Faculty of Information Technology, Vietnam National University of Agriculture, Hanoi 10000, Vietnam

Received 20 June 2014; Accepted 20 August 2014

Academic Editor: Shifei Ding

Copyright © 2015 Thanh-Tung Nguyen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Random forests (RFs) have been widely used as a powerful classification method. However, because both the bagging samples and the candidate features are chosen at random, the trees in the forest tend to select uninformative features for node splitting. As a result, RFs have poor accuracy on high-dimensional data. In addition, the feature selection process in RFs is biased in favor of multivalued features. To debias feature selection, we propose a new RF algorithm, called xRF, that selects good features when learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and a subset of unbiased features is then selected based on statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach generates more accurate trees while reducing dimensionality and the amount of data needed for learning RFs. An extensive set of experiments was conducted on 47 high-dimensional real-world datasets, including image datasets. The experimental results show that RFs built with the proposed approach outperformed existing random forests in both accuracy and AUC.
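
The three steps named in the abstract (screen features by p-value, split the survivors into two subsets, then sample from both subsets for each tree) can be illustrated with a minimal sketch. This is not the xRF procedure itself: the abstract does not state the test statistic, the partition rule, or the sampling weights, so the sketch assumes an absolute Pearson correlation score, a label-permutation p-value, a split at the median score, and an even draw from the two subsets. All function names and parameters (`abs_corr`, `permutation_p_values`, `partition_features`, `sample_features`, `alpha`, `mtry`) are hypothetical.

```python
import random


def abs_corr(xs, ys):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def permutation_p_values(columns, labels, n_perm=200, rng=None):
    """Assumed screening step: permutation p-value of each feature's score.

    columns is a list of feature columns, each as long as labels.
    """
    rng = rng or random.Random(0)
    observed = [abs_corr(col, labels) for col in columns]
    exceed = [0] * len(columns)
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real feature-label association
        for j, col in enumerate(columns):
            if abs_corr(col, shuffled) >= observed[j]:
                exceed[j] += 1
    # The +1 terms keep every p-value strictly positive.
    p_values = [(1 + e) / (1 + n_perm) for e in exceed]
    return observed, p_values


def partition_features(scores, p_values, alpha=0.05):
    """Drop features with p > alpha, then split the rest at the median score."""
    kept = [j for j, p in enumerate(p_values) if p <= alpha]
    kept.sort(key=lambda j: scores[j], reverse=True)
    half = (len(kept) + 1) // 2
    return kept[:half], kept[half:]  # (strong, weak), an assumed naming


def sample_features(strong, weak, mtry, rng):
    """Draw up to mtry distinct feature indices, taking from both subsets."""
    mtry = min(mtry, len(strong) + len(weak))
    n_strong = min(len(strong), (mtry + 1) // 2)
    n_weak = min(len(weak), mtry - n_strong)
    n_strong = min(len(strong), mtry - n_weak)  # top up if weak ran short
    return rng.sample(strong, n_strong) + rng.sample(weak, n_weak)
```

In a forest, `sample_features` would be called once per tree (or per node) to supply the candidate split features, so that uninformative features never reach the split search.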