Advances in Bioinformatics
Volume 2008 (2008), Article ID 205969, 12 pages
doi:10.1155/2008/205969
Research Article

Metagenome Fragment Classification Using N-Mer Frequency Profiles

1 Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA 19104, USA
2 Spoken Language Systems Laboratory, INESC-ID, 1000 Lisbon, Portugal
3 Department of Electrical and Computer Engineering, Rowan University, Glassboro, NJ 08028, USA
4 School of Biomedical Engineering, Science & Health Systems, Drexel University, Philadelphia, PA 19130, USA

Received 5 June 2008; Revised 19 September 2008; Accepted 30 September 2008

Academic Editor: Rita Casadio

Copyright © 2008 Gail Rosen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
