BioMed Research International
Volume 2015 (2015), Article ID 748681, 12 pages
http://dx.doi.org/10.1155/2015/748681
Research Article

An Effective Big Data Supervised Imbalanced Classification Approach for Ortholog Detection in Related Yeast Species

1 Departamento de Ciencias de la Computación, Universidad Central “Marta Abreu” de Las Villas (UCLV), 54830 Santa Clara, Cuba
2 Department of Computer Science and Artificial Intelligence, Research Center on Information and Communications Technology (CITIC-UGR), University of Granada, 18071 Granada, Spain
3 Centro de Bioactivos Químicos, Universidad Central “Marta Abreu” de Las Villas (UCLV), 54830 Santa Clara, Cuba
4 Centro Interdisciplinar de Investigação Marinha e Ambiental (CIMAR/CIIMAR), Universidade do Porto, Rua dos Bragas 177, 4050-123 Porto, Portugal
5 Departamento de Biologia, Faculdade de Ciências, Universidade do Porto, Rua do Campo Alegre, 4169-007 Porto, Portugal

Received 7 April 2015; Revised 26 July 2015; Accepted 20 August 2015

Academic Editor: Shigehiko Kanaya

Copyright © 2015 Deborah Galpert et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Linked References

  1. W. M. Fitch, “Distinguishing homologous from analogous proteins,” Systematic Biology, vol. 19, no. 2, pp. 99–113, 1970.
  2. R. L. Tatusov, E. V. Koonin, and D. J. Lipman, “A genomic perspective on protein families,” Science, vol. 278, no. 5338, pp. 631–637, 1997.
  3. A. Alexeyenko, I. Tamas, G. Liu, and E. L. L. Sonnhammer, “Automatic clustering of orthologs and inparalogs shared by multiple proteomes,” Bioinformatics, vol. 22, no. 14, pp. e9–e15, 2006.
  4. L. Li, C. J. Stoeckert, and D. S. Roos, “OrthoMCL: identification of ortholog groups for eukaryotic genomes,” Genome Research, vol. 13, no. 9, pp. 2178–2189, 2003.
  5. C. Dessimoz, G. Cannarozzi, M. Gil et al., “OMA, a comprehensive, automated project for the identification of orthologs from complete genome data: introduction and first achievements,” in Comparative Genomics: RECOMB 2005 International Workshop, RCG 2005, Dublin, Ireland, September 18-20, 2005. Proceedings, A. McLysaght and D. H. Huson, Eds., vol. 3678 of Lecture Notes in Computer Science, pp. 61–72, Springer, Berlin, Germany, 2005.
  6. B. Linard, J. D. Thompson, O. Poch, and O. Lecompte, “OrthoInspector: comprehensive orthology analysis and visual exploration,” BMC Bioinformatics, vol. 12, article 11, 2011.
  7. T. F. DeLuca, J. Cui, J.-Y. Jung, K. C. St. Gabriel, and D. P. Wall, “Roundup 2.0: enabling comparative genomics for over 1800 genomes,” Bioinformatics, vol. 28, no. 5, Article ID bts006, pp. 715–716, 2012.
  8. M. Lechner, M. Hernandez-Rosales, D. Doerr et al., “Orthology detection combining clustering and synteny for very large datasets,” PLoS ONE, vol. 9, no. 8, Article ID e105015, 2014.
  9. J. C. Chiu, E. K. Lee, M. G. Egan, I. N. Sarkar, G. M. Coruzzi, and R. DeSalle, “OrthologID: automation of genome-scale ortholog identification within a parsimony framework,” Bioinformatics, vol. 22, no. 6, pp. 699–707, 2006.
  10. J. Muller, D. Szklarczyk, P. Julien et al., “eggNOG v2.0: extending the evolutionary genealogy of genes with enhanced non-supervised orthologous groups, species and functional annotations,” Nucleic Acids Research, vol. 38, no. 1, pp. D190–D195, 2009.
  11. K. M. Kim, S. Sung, G. Caetano-Anollés, J. Y. Han, and H. Kim, “An approach of orthology detection from homologous sequences under minimum evolution,” Nucleic Acids Research, vol. 36, no. 17, article e110, 2008.
  12. L. P. Pryszcz, J. Huerta-Cepas, and T. Gabaldón, “MetaPhOrs: orthology and paralogy predictions from multiple phylogenetic evidence using a consistency-based confidence score,” Nucleic Acids Research, vol. 39, no. 5, article e32, 2011.
  13. J. Huerta-Cepas, S. Capella-Gutierrez, L. P. Pryszcz et al., “PhylomeDB v3.0: an expanding repository of genome-wide collections of trees, alignments and phylogeny-based orthology and paralogy predictions,” Nucleic Acids Research, vol. 39, pp. D556–D560, 2011.
  14. G. Shi, L. Zhang, and T. Jiang, “MSOAR 2.0: incorporating tandem duplications into ortholog assignment based on genome rearrangement,” in Proceedings of the 8th LSS Computational Systems Bioinformatics Conference (CSB '09), pp. 12–24, 2009.
  15. F. Towfic, M. H. W. Greenlee, and V. Honavar, “Detection of gene orthology based on protein-protein interaction networks,” in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM '09), pp. 48–53, IEEE, Washington, DC, USA, November 2009.
  16. S. F. Altschul, T. L. Madden, A. A. Schäffer et al., “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, 1997.
  17. R. Overbeek, M. Fonstein, M. D'Souza, G. D. Pusch, and N. Maltsev, “The use of gene clusters to infer functional coupling,” Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 6, pp. 2896–2901, 1999.
  18. A. E. Hirsh and H. B. Fraser, “Protein dispensability and rate of evolution,” Nature, vol. 411, no. 6841, pp. 1040–1049, 2001.
  19. D. P. Wall, H. B. Fraser, and A. E. Hirsh, “Detecting putative orthologs,” Bioinformatics, vol. 19, no. 13, pp. 1710–1711, 2003.
  20. M. K. Kamvysselis, Computational comparative genomics: genes, regulation, evolution [Ph.D. thesis], Massachusetts Institute of Technology, Cambridge, Mass, USA, 2003.
  21. A. C. J. Roth, G. H. Gonnet, and C. Dessimoz, “Algorithm of OMA for large-scale orthology inference,” BMC Bioinformatics, vol. 9, article 518, 2008.
  22. D. M. Kristensen, Y. I. Wolf, A. R. Mushegian, and E. V. Koonin, “Computational methods for Gene Orthology inference,” Briefings in Bioinformatics, vol. 12, no. 5, pp. 379–391, 2011.
  23. A. Kuzniar, R. C. H. J. van Ham, S. Pongor, and J. A. M. Leunissen, “The quest for orthologs: finding the corresponding gene across genomes,” Trends in Genetics, vol. 24, no. 11, pp. 539–551, 2008.
  24. L. Salichos and A. Rokas, “Evaluating ortholog prediction algorithms in a Yeast Model Clade,” PLoS ONE, vol. 6, no. 4, Article ID e18755, 2011.
  25. M. Rasmussen and M. Kellis, Multi-BUS: An Algorithm for Resolving Multi-Species Gene Correspondence and Gene Family Relationships, CSAIL Research, 2005.
  26. X. H. Zheng, F. Lu, Z.-Y. Wang, F. Zhong, J. Hoover, and R. Mural, “Using shared genomic synteny and shared protein functions to enhance the identification of orthologous gene pairs,” Bioinformatics, vol. 21, no. 6, pp. 703–710, 2005.
  27. X. Chen, J. Zheng, Z. Fu et al., “Assignment of orthologous genes via genome rearrangement,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 4, pp. 302–315, 2005.
  28. Z. Fu, X. Chen, V. Vacic, P. Nan, Y. Zhong, and T. Jiang, “MSOAR: a high-throughput ortholog assignment system based on genome rearrangement,” Journal of Computational Biology, vol. 14, no. 9, pp. 1160–1175, 2007.
  29. T.-W. Chen, T. H. Wu, W. V. Ng, and W.-C. Lin, “DODO: an efficient orthologous genes assignment tool based on domain architectures. Domain based ortholog detection,” BMC Bioinformatics, vol. 11, supplement 7, article S6, 2010.
  30. E. L. L. Sonnhammer, T. Gabaldón, A. W. S. da Silva et al., “Big data and other challenges in the quest for orthologs,” Bioinformatics, vol. 30, no. 21, pp. 2993–2998, 2014.
  31. A. Fernández, S. del Río, V. López et al., “Big Data with Cloud Computing: an insight on the computing environment, MapReduce, and programming frameworks,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 4, no. 5, pp. 380–409, 2014.
  32. M. Beyer and D. Laney, “3D data management: Controlling data volume, velocity and variety,” 2001, http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf.
  33. C. L. P. Chen and C.-Y. Zhang, “Data-intensive applications, challenges, techniques and technologies: a survey on Big Data,” Information Sciences, vol. 275, pp. 314–347, 2014.
  34. S. del Río, V. López, J. M. Benítez, and F. Herrera, “On the use of MapReduce for imbalanced big data using Random Forest,” Information Sciences, vol. 284, pp. 112–137, 2014.
  35. J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
  36. M. Zaharia, M. Chowdhury, T. Das et al., “Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing,” in Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (NSDI '12), pp. 1–14, USENIX Association, San Jose, Calif, USA, April 2012.
  37. E. N. Koch, M. Costanzo, J. Bellay et al., “Conserved rules govern genetic interaction degree across species,” Genome Biology, vol. 13, no. 7, article R57, 2012.
  38. T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” Journal of Molecular Biology, vol. 147, no. 1, pp. 195–197, 1981.
  39. S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” Journal of Molecular Biology, vol. 48, no. 3, pp. 443–453, 1970.
  40. E. Deza, Dictionary of Distances, Elsevier, 2006.
  41. A. E. Darling, B. Mau, and N. T. Perna, “progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement,” PLoS ONE, vol. 5, no. 6, Article ID e11147, 2010.
  42. S. Miyazawa and R. L. Jernigan, “Self-consistent estimation of inter-residue protein contact energies based on an equilibrium mixture approximation of residues,” Proteins: Structure, Function, and Genetics, vol. 34, no. 1, pp. 49–68, 1999.
  43. D. Galpert, R. Millo, M. M. García, G. Casas, R. Grau, and L. Arco, “Rough sets in ortholog gene detection,” in Rough Sets and Intelligent Systems Paradigms, vol. 8537 of Lecture Notes in Computer Science, Springer, Basel, Switzerland, 2014.
  44. R. Millo, D. Galpert, G. Casas et al., “Agregación de medidas de similitud para la detección de ortólogos, validación con medidas basadas en la teoría de conjuntos aproximados,” Computación y Sistemas, vol. 18, no. 1, pp. 19–35, 2014.
  45. T. White, Hadoop: The Definitive Guide, O'Reilly Media, Sebastopol, Calif, USA, 2012.
  46. S. Owen, R. Anil, T. Dunning, and E. Friedman, Mahout in Action, 2011.
  47. D. A. Hakim, “Partial Data MapReduce Random Forests,” 2015, https://mahout.apache.org/users/classification/partial-implementation.html.
  48. S. Krishnan and V. Smith, “Linear Support Vector Machines (SVMs),” 2013, https://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-support-vector-machines-svms.
  49. R. Barandela, J. S. Sánchez, V. García, and E. Rangel, “Strategies for learning in class imbalance problems,” Pattern Recognition, vol. 36, no. 3, pp. 849–851, 2003.
  50. A. P. Bradley, “The use of the area under the ROC curve in the evaluation of machine learning algorithms,” Pattern Recognition, vol. 30, no. 7, pp. 1145–1159, 1997.
  51. H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 9, pp. 1263–1284, 2009.
  52. K. P. Byrne and K. H. Wolfe, “The Yeast Gene Order Browser: combining curated homology and syntenic context reveals gene fate in polyploid species,” Genome Research, vol. 15, no. 10, pp. 1456–1461, 2005.
  53. W. R. Pearson, “Selecting the right similarity-scoring matrix,” Current Protocols in Bioinformatics, vol. 43, pp. 3.5.1–3.5.9, 2013.
  54. I. Triguero, S. del Río, V. López, J. Bacardit, J. M. Benítez, and F. Herrera, “ROSEFW-RF: the winner algorithm for the ECBDL’14 big data competition: an extremely imbalanced big data bioinformatics problem,” Knowledge-Based Systems, vol. 87, pp. 69–79, 2015.
  55. T. F. DeLuca, I.-H. Wu, J. Pu et al., “Roundup: a multi-genome repository of orthologs and evolutionary distance,” Bioinformatics, vol. 22, no. 16, pp. 2044–2046, 2006.
  56. V. Wood and P. J. Piskur, “Schizosaccharomyces pombe comparative genomics; from sequence to systems,” in Comparative Genomics, vol. 15 of Topics in Current Genetics, pp. 233–285, Springer, Berlin, Germany, 2005.
  57. J. G. Moreno-Torres, X. Llorà, D. E. Goldberg, and R. Bhargava, “Repairing fractures between data using genetic programming-based feature extraction: a case study in cancer diagnosis,” Information Sciences, vol. 222, pp. 805–823, 2013.
  58. G. M. Hagelsieb and K. Latimer, “Choosing BLAST options for better detection of orthologs as reciprocal best hits,” Bioinformatics, vol. 24, no. 3, pp. 319–324, 2008.
  59. C. A. Del Carpio-Muñoz and J. C. Carbajal, “Folding pattern recognition in proteins using spectral analysis methods,” Genome Informatics, vol. 13, pp. 163–172, 2002.
  60. A. C. E. Darling, B. Mau, F. R. Blattner, and N. T. Perna, “Mauve: multiple alignment of conserved genomic sequence with rearrangements,” Genome Research, vol. 14, no. 7, pp. 1394–1403, 2004.
  61. S. Kumar and A. Filipski, “Multiple sequence alignment: in pursuit of homologous DNA positions,” Genome Research, vol. 17, no. 2, pp. 127–135, 2007.