Special Issue: Artificial Evolution Methods in the Biological and Biomedical Sciences
Research Article | Open Access
Classification of Oncologic Data with Genetic Programming
Discovering models that explain the hidden relationship between genetic material and tumor pathologies is one of the most important open challenges in biology and medicine. Given the large amount of data made available by the DNA Microarray technique, Machine Learning is becoming a popular tool for this kind of investigation. In the last few years, we have been particularly involved in the study of Genetic Programming for mining large sets of biomedical data. In this paper, we present a comparison between four variants of Genetic Programming for the classification of two different oncologic datasets: the first contains data from healthy colon tissues and colon tissues affected by cancer; the second contains data from patients affected by two kinds of leukemia (acute myeloid leukemia and acute lymphoblastic leukemia). We report experimental results obtained using two different fitness criteria: the receiver operating characteristic and the percentage of correctly classified instances. These results, compared with those obtained on the same data by three nonevolutionary Machine Learning methods (Support Vector Machines, MultiBoosting, and Random Forests), suggest that Genetic Programming is a promising technique for this kind of classification.
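To make the two fitness criteria concrete, the following minimal Python sketch shows one way they could be computed for a candidate classifier that returns a real-valued output per sample. It is an illustration under stated assumptions, not the implementation used in the paper: the function names, the decision threshold of 0, and the use of the area under the ROC curve as the scalar summary of the receiver operating characteristic are all assumptions introduced here.

```python
from typing import Sequence


def accuracy(outputs: Sequence[float], labels: Sequence[int],
             threshold: float = 0.0) -> float:
    """Fraction of correctly classified instances.

    Assumption: an output above `threshold` is read as class 1,
    anything else as class 0.
    """
    correct = sum(int(out > threshold) == lab
                  for out, lab in zip(outputs, labels))
    return correct / len(labels)


def roc_auc(outputs: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve, computed pairwise.

    Equals the probability that a randomly chosen positive sample
    receives a higher output than a randomly chosen negative one
    (ties count as one half).
    """
    pos = [o for o, lab in zip(outputs, labels) if lab == 1]
    neg = [o for o, lab in zip(outputs, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical outputs of one evolved program on six samples.
outputs = [0.9, 0.4, -0.2, 0.1, -0.7, 0.3]
labels = [1, 1, 1, 0, 0, 0]
print(accuracy(outputs, labels))  # threshold-dependent criterion
print(roc_auc(outputs, labels))   # threshold-free criterion
```

The accuracy-style criterion depends on a fixed decision threshold, whereas the ROC-based one only depends on how the outputs rank positive samples relative to negative ones, so the two can rank the same candidate programs differently.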
Copyright © 2009 Leonardo Vanneschi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.