Semi-Supervised Learning for Classification of Protein Sequence Data

King, Brian R.; Guda, Chittibabu

doi:https://doi.org/10.3233/SPR-2008-0241

Scientific Programming

Special Issue: Biological Data Mining

Open Access

Volume 16 | Article ID 795010 | https://doi.org/10.3233/SPR-2008-0241

Brian R. King¹ and Chittibabu Guda²

Abstract

Protein sequence data continue to become available at an exponential rate. Annotation of the functional and structural attributes of these data lags far behind, with only a small fraction of the data understood and labeled by experimental methods. Classification methods based on semi-supervised learning can increase the overall accuracy of classifying partly labeled data in many domains, but very few methods have demonstrated this effect on protein sequence classification. We show how proven methods from text classification can be applied to protein sequence data, consider both existing and novel extensions to the basic methods, and demonstrate the restrictions and differences that must be taken into account. We compare our results against the transductive support vector machine and show superior results on the most difficult classification problems. Our results show that large repositories of unlabeled protein sequence data can indeed be used to improve predictive performance, particularly when fewer labeled protein sequences are available, when the data are highly unbalanced, or both.

Copyright

Copyright © 2008 Hindawi Publishing Corporation. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
