Special Issue: Artificial Evolution Methods in the Biological and Biomedical Sciences
Classification of Oncologic Data with Genetic Programming
Discovering models that explain the hidden relationship between genetic material and tumor pathologies is one of the most important open challenges in biology and medicine. Given the large amount of data made available by the DNA microarray technique, Machine Learning is becoming a popular tool for this kind of investigation. In the last few years, we have been particularly involved in the study of Genetic Programming for mining large sets of biomedical data. In this paper, we compare four variants of Genetic Programming for the classification of two oncologic datasets: the first contains data from healthy colon tissues and colon tissues affected by cancer; the second contains data from patients affected by two kinds of leukemia (acute myeloid leukemia and acute lymphoblastic leukemia). We report experimental results obtained using two different fitness criteria: the receiver operating characteristic and the percentage of correctly classified instances. These results, and their comparison with those obtained by three nonevolutionary Machine Learning methods (Support Vector Machines, MultiBoosting, and Random Forests) on the same data, suggest that Genetic Programming is a promising technique for this kind of classification.
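The two fitness criteria named above can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not state: that each evolved program returns one real-valued output per sample, that the receiver operating characteristic is summarized by the area under the ROC curve, and that a fixed threshold of 0 separates the two classes for the accuracy criterion. The function names and the sample values are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def accuracy_fitness(outputs: Sequence[float], labels: Sequence[int],
                     threshold: float = 0.0) -> float:
    """Fraction of correctly classified instances.

    Assumption: an output above `threshold` is read as class 1.
    """
    correct = sum((o > threshold) == bool(y) for o, y in zip(outputs, labels))
    return correct / len(labels)


def auc_fitness(outputs: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve, computed by pairwise comparison.

    This equals the probability that a randomly chosen positive sample
    receives a higher output than a randomly chosen negative one
    (ties count as one half).
    """
    pos = [o for o, y in zip(outputs, labels) if y == 1]
    neg = [o for o, y in zip(outputs, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one instance of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical outputs of one candidate program on five samples.
outputs = [0.9, 0.4, -0.2, -0.7, 0.1]
labels = [1, 1, 0, 0, 0]

acc = accuracy_fitness(outputs, labels)
auc = auc_fitness(outputs, labels)
```

On these made-up values the accuracy is 0.8 while the area under the ROC curve is 1.0: the program ranks every positive sample above every negative one, but the fixed threshold misclassifies one negative. This is one way the two criteria can score the same candidate differently.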