The Scientific World Journal
Volume 2014, Article ID 173869, 12 pages
http://dx.doi.org/10.1155/2014/173869
Research Article

Efficient Feature Selection and Classification of Protein Sequence Data in Bioinformatics

1Computer and Information Sciences Department, Universiti Teknologi PETRONAS, Bandar Seri Iskandar, 31750 Tronoh, Perak, Malaysia
2Fundamental and Applied Sciences Department, Universiti Teknologi PETRONAS, Bandar Seri Iskandar, 31750 Tronoh, Perak, Malaysia
3College of Sciences, Alfaisal University, P.O. Box 50927, Riyadh 11533, Saudi Arabia

Received 3 March 2014; Revised 16 May 2014; Accepted 19 May 2014; Published 19 June 2014

Academic Editor: Loris Nanni

Copyright © 2014 Muhammad Javed Iqbal et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Linked References

  1. N. M. Luscombe, D. Greenbaum, and M. Gerstein, “What is bioinformatics? A proposed definition and overview of the field,” Methods of Information in Medicine, vol. 40, no. 4, pp. 346–358, 2001.
  2. D. R. Bentley, “The human genome project—an overview,” Medicinal Research Reviews, vol. 20, pp. 189–196, 2000.
  3. J.-M. Claverie and C. Notredame, Bioinformatics for Dummies, 2nd edition, 2007.
  4. W. R. Pearson and D. J. Lipman, “Improved tools for biological sequence comparison,” Proceedings of the National Academy of Sciences of the United States of America, vol. 85, no. 8, pp. 2444–2448, 1988.
  5. W. Pearson, “Finding protein and nucleotide similarities with FASTA,” Current Protocols in Bioinformatics, chapter 3, unit 3.9, 2004.
  6. S. F. Altschul, T. L. Madden, A. A. Schäffer et al., “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402, 1997.
  7. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.
  8. W. R. Pearson, “Using the FASTA program to search protein and DNA sequence databases,” Methods in Molecular Biology, vol. 25, pp. 365–389, 1994.
  9. T. Plötz and G. A. Fink, “A new approach for HMM based protein sequence family modeling and its application to remote homology classification,” in Proceedings of the IEEE/SP 13th Workshop on Statistical Signal Processing, pp. 1008–1013, Bordeaux, France, July 2005.
  10. K. Karplus, C. Barrett, and R. Hughey, “Hidden Markov models for detecting remote protein homologies,” Bioinformatics, vol. 14, no. 10, pp. 846–856, 1998.
  11. D. W. Mount, “Comparison of the PAM and BLOSUM amino acid substitution matrices,” Cold Spring Harbor Protocols, vol. 3, no. 6, 2008.
  12. E. G. Mansoori, M. J. Zolghadri, and S. D. Katebi, “Protein superfamily classification using fuzzy rule-based classifier,” IEEE Transactions on Nanobioscience, vol. 8, no. 1, pp. 92–99, 2009.
  13. J. C. Jeong, X. Lin, and X.-W. Chen, “On position-specific scoring matrix for protein function prediction,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 2, pp. 308–315, 2011.
  14. R. Saidi, M. Maddouri, and E. Mephu Nguifo, “Protein sequences classification by means of feature extraction with substitution matrices,” BMC Bioinformatics, vol. 11, article 175, 2010.
  15. X. M. Zhao, D. S. Huang, Y. M. Cheung, H. Q. Wang, and X. Huang, “A novel hybrid GA/SVM system for protein sequences classification,” in Intelligent Data Engineering and Automated Learning, vol. 3177 of Lecture Notes in Computer Science, pp. 11–16, 2004.
  16. M. Zamani and S. C. Kremer, “Amino acid encoding schemes for machine learning methods,” in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW '11), pp. 327–333, Atlanta, Ga, USA, November 2011.
  17. J. T. L. Wang, Q. Ma, D. Shasha, and C. H. Wu, “New techniques for extracting features from protein sequences,” IBM Systems Journal, vol. 40, no. 2, pp. 426–441, 2001.
  18. N. Ahmad, D. Alahakoon, and R. Chau, “Classification of protein sequences using the growing self-organizing map,” in Proceedings of the 4th International Conference on Information and Automation for Sustainability (ICIAFS '08), pp. 167–172, Colombo, Sri Lanka, December 2008.
  19. A. L. D. Rossi and M. A. de Oliveira Camargo-Brunetto, “Protein classification using artificial neural networks with different protein encoding methods,” in Proceedings of the 7th International Conference on Intelligent Systems Design and Applications (ISDA '07), pp. 169–174, Rio de Janeiro, Brazil, October 2007.
  20. D. Wang and G.-B. Huang, “Protein sequence classification using extreme learning machine,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN '05), pp. 1406–1411, Montreal, Canada, August 2005.
  21. S. Bandyopadhyay, “An efficient technique for superfamily classification of amino acid sequences: feature extraction, fuzzy clustering and prototype selection,” Fuzzy Sets and Systems, vol. 152, no. 1, pp. 5–16, 2005.
  22. M. N. Davies, A. Secker, A. A. Freitas, J. Timmis, E. Clark, and D. R. Flower, “Alignment-independent techniques for protein classification,” Current Proteomics, vol. 5, no. 4, pp. 217–223, 2008.
  23. U. B. Angadi and M. Venkatesulu, “Structural SCOP superfamily level classification using unsupervised machine learning,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 2, pp. 601–608, 2012.
  24. S. Vipsita and S. Ku. Rath, “Two-stage approach for protein superfamily classification,” Computational Biology Journal, vol. 2013, Article ID 898090, 12 pages, 2013.
  25. M. J. Iqbal, I. Faye, A. Md Said, and B. B. Samir, “Data mining of protein sequences with amino acid position-based feature encoding technique,” in Proceedings of the 1st International Conference on Advanced Data and Information Engineering (DaEng '13), vol. 285 of Lecture Notes in Electrical Engineering, pp. 119–126, Kuala Lumpur, Malaysia, 2014.
  26. S. Vipsita, B. K. Shee, and S. K. Rath, “An efficient technique for protein classification using feature extraction by artificial neural networks,” in Proceedings of the Annual IEEE India Conference: Green Energy, Computing and Communication (INDICON '10), Kolkata, India, December 2010.
  27. C. Leslie, E. Eskin, and W. S. Noble, “The spectrum kernel: a string kernel for SVM protein classification,” in Proceedings of the Pacific Symposium on Biocomputing, pp. 564–575, 2002.
  28. C. Caragea, A. Silvescu, and P. Mitra, “Protein sequence classification using feature hashing,” in Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM '11), pp. 538–543, Atlanta, Ga, USA, November 2011.
  29. C. Yu, R. L. He, and S. S.-T. Yau, “Protein sequence comparison based on K-string dictionary,” Gene, vol. 529, pp. 250–256, 2013.
  30. H. M. Berman, J. Westbrook, Z. Feng et al., “The protein data bank,” Nucleic Acids Research, vol. 28, no. 1, pp. 235–242, 2000.
  31. W. C. Barker, J. S. Garavelli, H. Huang et al., “The Protein Information Resource (PIR),” Nucleic Acids Research, vol. 28, no. 1, pp. 41–44, 2000.
  32. A. Bairoch and R. Apweiler, “The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000,” Nucleic Acids Research, vol. 28, no. 1, pp. 45–48, 2000.
  33. R. Apweiler, A. Bairoch, C. H. Wu et al., “UniProt: the universal protein knowledgebase,” Nucleic Acids Research, vol. 32, pp. D115–D119, 2004.
  34. A. Solovyov and W. I. Lipkin, “Centroid based clustering of high throughput sequencing reads based on n-mer counts,” BMC Bioinformatics, vol. 14, p. 268, 2013.
  35. R. Caruana and A. Niculescu-Mizil, “An empirical comparison of supervised learning algorithms using different performance metrics,” 2006.
  36. J. R. Vergara and P. A. Estévez, “A review of feature selection methods based on mutual information,” Neural Computing and Applications, vol. 24, pp. 175–186, 2014.
  37. I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.
  38. M. Dash and H. Liu, “Feature selection for classification,” Intelligent Data Analysis, vol. 1, pp. 131–156, 1997.