The Scientific World Journal
Volume 2015, Article ID 471371, 18 pages
http://dx.doi.org/10.1155/2015/471371
Research Article

Unbiased Feature Selection in Learning Random Forests for High-Dimensional Data

1Shenzhen Key Laboratory of High Performance Data Mining, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
2University of Chinese Academy of Sciences, Beijing 100049, China
3School of Computer Science and Engineering, Water Resources University, Hanoi 10000, Vietnam
4College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
5Faculty of Information Technology, Vietnam National University of Agriculture, Hanoi 10000, Vietnam

Received 20 June 2014; Accepted 20 August 2014

Academic Editor: Shifei Ding

Copyright © 2015 Thanh-Tung Nguyen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

References

  1. L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
  2. L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen, Classification and Regression Trees, CRC Press, Boca Raton, Fla, USA, 1984.
  3. H. Kim and W.-Y. Loh, “Classification trees with unbiased multiway splits,” Journal of the American Statistical Association, vol. 96, no. 454, pp. 589–604, 2001.
  4. A. P. White and W. Z. Liu, “Technical note: bias in information-based measures in decision tree induction,” Machine Learning, vol. 15, no. 3, pp. 321–329, 1994.
  5. T. G. Dietterich, “An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization,” Machine Learning, vol. 40, no. 2, pp. 139–157, 2000.
  6. Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” in Computational Learning Theory, pp. 23–37, Springer, 1995.
  7. T.-T. Nguyen and T. T. Nguyen, “A real time license plate detection system based on boosting learning algorithm,” in Proceedings of the 5th International Congress on Image and Signal Processing (CISP '12), pp. 819–823, IEEE, October 2012.
  8. T. K. Ho, “Random decision forests,” in Proceedings of the 3rd International Conference on Document Analysis and Recognition, vol. 1, pp. 278–282, 1995.
  9. T. K. Ho, “The random subspace method for constructing decision forests,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.
  10. L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.
  11. R. Díaz-Uriarte and S. Alvarez de Andrés, “Gene selection and classification of microarray data using random forest,” BMC Bioinformatics, vol. 7, article 3, 2006.
  12. R. Genuer, J.-M. Poggi, and C. Tuleau-Malot, “Variable selection using random forests,” Pattern Recognition Letters, vol. 31, no. 14, pp. 2225–2236, 2010.
  13. B. Xu, J. Z. Huang, G. Williams, Q. Wang, and Y. Ye, “Classifying very high-dimensional data with random forests built from small subspaces,” International Journal of Data Warehousing and Mining, vol. 8, no. 2, pp. 44–63, 2012.
  14. Y. Ye, Q. Wu, J. Zhexue Huang, M. K. Ng, and X. Li, “Stratified sampling for feature subspace selection in random forests for high dimensional data,” Pattern Recognition, vol. 46, no. 3, pp. 769–787, 2013.
  15. X. Chen, Y. Ye, X. Xu, and J. Z. Huang, “A feature group weighting method for subspace clustering of high-dimensional data,” Pattern Recognition, vol. 45, no. 1, pp. 434–446, 2012.
  16. D. Amaratunga, J. Cabrera, and Y.-S. Lee, “Enriched random forests,” Bioinformatics, vol. 24, no. 18, pp. 2010–2014, 2008.
  17. H. Deng and G. Runger, “Gene selection with guided regularized random forest,” Pattern Recognition, vol. 46, no. 12, pp. 3483–3489, 2013.
  18. C. Strobl, “Statistical sources of variable selection bias in classification trees based on the Gini index,” Tech. Rep. SFB 386, 2005, http://epub.ub.uni-muenchen.de/archive/00001789/01/paper_420.pdf.
  19. C. Strobl, A.-L. Boulesteix, and T. Augustin, “Unbiased split selection for classification trees based on the Gini index,” Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 483–501, 2007.
  20. C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, “Bias in random forest variable importance measures: illustrations, sources and a solution,” BMC Bioinformatics, vol. 8, article 25, 2007.
  21. C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis, “Conditional variable importance for random forests,” BMC Bioinformatics, vol. 9, no. 1, article 307, 2008.
  22. T. Hothorn, K. Hornik, and A. Zeileis, “party: a laboratory for recursive partytioning,” R package version 0.9-9999, 2011, http://cran.r-project.org/package=party.
  23. F. Wilcoxon, “Individual comparisons by ranking methods,” Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945.
  24. T.-T. Nguyen, J. Z. Huang, and T. T. Nguyen, “Two-level quantile regression forests for bias correction in range prediction,” Machine Learning, 2014.
  25. T.-T. Nguyen, J. Z. Huang, K. Imran, M. J. Li, and G. Williams, “Extensions to quantile regression forests for very high-dimensional data,” in Advances in Knowledge Discovery and Data Mining, vol. 8444 of Lecture Notes in Computer Science, pp. 247–258, Springer, Berlin, Germany, 2014.
  26. A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, “From few to many: illumination cone models for face recognition under variable lighting and pose,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 6, pp. 643–660, 2001.
  27. F. S. Samaria and A. C. Harter, “Parameterisation of a stochastic model for human face identification,” in Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, pp. 138–142, IEEE, December 1994.
  28. M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71–86, 1991.
  29. H. Deng, “Guided random forest in the RRF package,” http://arxiv.org/abs/1306.0237.
  30. A. Liaw and M. Wiener, “Classification and regression by randomForest,” R News, vol. 2, no. 3, pp. 18–22, 2002.
  31. R. Diaz-Uriarte, “varSelRF: variable selection using random forests,” R package version 0.7-1, 2009, http://ligarto.org/rdiaz/Software/Software.html.
  32. J. H. Friedman, T. J. Hastie, and R. J. Tibshirani, “glmnet: Lasso and elastic-net regularized generalized linear models,” R package, 2010, http://CRAN.R-project.org/package=glmnet.