Research Article | Open Access
PROCARB: A Database of Known and Modelled Carbohydrate-Binding Protein Structures with Sequence-Based Prediction Tools
Understanding the three-dimensional structures of proteins that interact with carbohydrates, whether covalently (glycoproteins) or noncovalently (protein-carbohydrate complexes), is essential to the study of many biological processes, because these interactions play a significant role in normal and disease-associated functions. It is therefore important to have a central repository of knowledge about these protein-carbohydrate complexes, together with preprocessed data on predicted structures. Such a resource can be significantly enhanced by tools that predict carbohydrate-binding sites de novo for proteins that lack a structure or an experimentally known binding site. PROCARB is an open-access database comprising three independently working components, namely, (i) the Core PROCARB module, consisting of three-dimensional structures of protein-carbohydrate complexes taken from the Protein Data Bank (PDB), (ii) the Homology Models module, consisting of manually developed three-dimensional models of N-linked and O-linked glycoproteins of unknown three-dimensional structure, and (iii) the CBS-Pred prediction module, consisting of web servers that predict carbohydrate-binding sites using a single sequence or a server-generated PSSM. Several precomputed structural and functional properties of the complexes are also included in the database for quick analysis. In particular, information about function, secondary structure, solvent accessibility, hydrogen bonds, and literature references is included. In addition, each protein in the database is mapped to UniProt, Pfam, PDB, and other resources.
Carbohydrates play a key role in a variety of important biological recognition processes such as infection, immune response, cell differentiation, and neuronal development. All of these biological phenomena may be regulated by the interaction of these carbohydrates with proteins [1–4]. One area of therapeutic significance in protein-carbohydrate interactions relies on the role of carbohydrates as cell surface receptors enabling adherence of bacteria, parasites, and viruses by a process known as bioadhesion [5–10]. Bacteria are often able to adhere efficiently to the surface membranes of host cells via lectin binding, thus enabling subsequent colonization and progression of the disease. Irregular structures and levels of certain tumor cell surface sugars may also present opportunities for therapeutic intervention. On the other hand, the ubiquitous use of carbohydrates in nature potentially poses severe specificity issues. Understanding the molecular basis of carbohydrate recognition might offer the essential basis for rationally designing biologically active saccharide analogues.
In spite of their numerous important biological roles, there is no appropriate database dedicated to these protein-carbohydrate complexes. Although the Protein Data Bank (PDB) stores all the experimentally determined protein-carbohydrate complexes, it is not easy to identify a protein-carbohydrate complex in the PDB. The GLYCOSCIENCES.de web resource provides numerous tools and databases which aid in searching the PDB for various carbohydrates. Moreover, available databases dedicated to protein-carbohydrate complexes, such as the Lectines and Glycoconjugate databanks, do not have detailed information on the functionally important carbohydrate-binding residues and proteins. Hence, there is a need for a single resource where all the relevant information about a pair of interacting protein and carbohydrate would be available. PROCARB (Figure 1) has therefore been developed to provide not only a single source of annotated complexes but also a number of precomputed features of these carbohydrate-binding proteins, such as solvent accessibility, secondary structure, and hydrogen bonding information. The role of carbohydrates in the complex is also provided in the database wherever possible. The core module consists of 604 protein-carbohydrate complexes, each with at least one carbohydrate molecule. The total number of carbohydrate molecules is thus 4240, and these are bound to 5360 residues in proteins.
The structure-based approach to drug design has become a standard protocol in the pharmaceutical industry, where large databases of potential small drug candidates may be docked into an active site of a particular target molecule. Structures of many glycoproteins of interest have not been solved yet but can be modeled because suitable templates of matching structures are available. Therefore, we have also attempted to generate the three-dimensional structures of different types of glycoproteins (both N- and O-linked) with unknown structures by using homology modelling. This module of PROCARB consists of 26 N-linked and 20 O-linked modelled structures.
Finally, functional annotation of proteins and understanding of their functions in cases where only the amino acid sequence is available require predicting potential carbohydrate-binding sites, which experimentalists can then verify. Based on our previous work in this direction, we developed a web server that takes a user-provided amino acid sequence and predicts carbohydrate-binding sites. Given the difficulty of sequence-based prediction, the success rate is modest, but the predictions nonetheless provide useful clues for experiments.
2. Database Description
The overall organization of the database is illustrated in Figures 2(a) and 2(b). As shown in the figures and stated above, PROCARB is composed of three modules, which work largely independently. These modules are described in the following sections.
2.1. PROCARB Core Module
The PROCARB core module was developed by systematically locating protein-carbohydrate complexes in the Protein Data Bank (PDB) and manually verifying the existence and identity of the carbohydrate ligand. An amino acid residue of the protein is considered carbohydrate binding if any of its atoms is within a 3.5 Å cutoff distance of any atom of the sugar in the protein-carbohydrate complex. Various structural and contact properties, such as secondary structure, hydrogen bonds, van der Waals contacts, and solvent accessibility, are computed for all entries and stored in this core module of the database. In addition, a Jmol visualisation is provided with preloaded scripts that help identify the location and nature of carbohydrate-binding sites. All structures found by keyword search were validated manually for the presence of carbohydrate ligands. Specifically, at the time of the last update, 914 hits were obtained using keyword search in the PDB, of which only 604 proteins were found to have a carbohydrate attached, making it important that these ligands be manually annotated. The compiled data sets are also available for free download, both as the full set of raw PDB files and as a nonredundant subset of representative structures selected at 25% sequence similarity. For each complex, the carbohydrate details were retrieved from PDBsum; to confirm that a bound ligand is a carbohydrate, all ligands were manually checked either in the PDBeChem database, which classifies sugars as saccharides, or against the literature reference.
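The 3.5 Å contact criterion can be illustrated with a minimal Python sketch. It is not the PROCARB implementation: the function name, the input layout (residue identifier plus coordinates), and the toy coordinates are assumptions made here for illustration only.

```python
import math

CUTOFF = 3.5  # Å, the contact cutoff stated in the text


def binding_residues(protein_atoms, sugar_atoms, cutoff=CUTOFF):
    """Return sorted residue ids with any atom within `cutoff` of any sugar atom.

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples (hypothetical layout)
    sugar_atoms:   iterable of (x, y, z) tuples
    """
    sugar_atoms = list(sugar_atoms)
    found = set()
    for res_id, xyz in protein_atoms:
        if res_id in found:
            continue  # residue already flagged by one of its other atoms
        if any(math.dist(xyz, s) <= cutoff for s in sugar_atoms):
            found.add(res_id)
    return sorted(found)


# Toy coordinates, invented for this example
protein = [
    ("ASN14", (0.0, 0.0, 0.0)),
    ("ASN14", (1.5, 0.0, 0.0)),
    ("TRP62", (10.0, 0.0, 0.0)),
    ("ASP83", (4.0, 3.0, 0.0)),
]
sugar = [(4.0, 0.0, 0.0), (5.0, 1.0, 0.0)]

print(binding_residues(protein, sugar))  # ['ASN14', 'ASP83']
```

A single atom inside the cutoff is enough to flag the whole residue, which is why ASN14 is reported even though only its second atom lies within 3.5 Å of the sugar.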
FASTA-formatted sequences and 3D coordinates for both raw and nonredundant datasets are also stored in the database. These data sets are scheduled to be regularly updated as new entries become available from the PDB. For quick analysis, a set of four residue-wise structural features is included, namely, contact with carbohydrate, secondary structure, solvent accessibility, and hydrogen bonds. Secondary structure, solvent accessibility, and hydrogen bonds are computed using the standard programs DSSP, ASAView, and HBPlus, respectively.
2.2. Homology Models Module
In this module, we have attempted to generate the three-dimensional structures of a large number of glycoproteins (both N- and O-linked) with hitherto unknown structure, using automated web-based homology modeling. As a case study, a detailed model-based 3D structure of Hev b 4, a latex allergen N-glycoprotein, has also been completed and is described in our earlier work.
To select proteins for modeling, a Swissprot search was performed for N-linked glycoproteins using the keyword “N-linked”. O-linked glycoprotein sequences were collected from the O-GLYCBASE database. To have at least one model for each protein family, the sequence data were grouped into families at 30% sequence identity, and one member from each family was selected for modeling. In all cases, at least one glycosylation site was identified and annotated in Swissprot. The data set has two groups, corresponding to O-linked and N-linked glycoproteins.
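The text does not state which tool was used for the 30% identity grouping, so the following Python sketch only illustrates the general idea with a greedy scheme. The identity function assumes pre-aligned sequences of equal length, which is a simplification; all names and sequences are hypothetical.

```python
def percent_identity(a, b):
    """Toy identity for two pre-aligned, equal-length sequences (an assumption)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)


def group_into_families(seqs, threshold=30.0, identity=percent_identity):
    """Greedy grouping: a sequence joins the first family whose representative
    it matches at >= threshold percent identity; otherwise it founds a new family.

    Returns a dict mapping representative name -> list of member names.
    """
    families = {}
    for name, seq in seqs.items():
        for rep in families:
            if identity(seq, seqs[rep]) >= threshold:
                families[rep].append(name)
                break
        else:
            families[name] = [name]  # no family matched: new representative
    return families


# Invented sequences for illustration
toy = {
    "A": "ACDEFGHIKL",
    "B": "ACDEFGHIKV",  # 90% identical to A
    "C": "WWWWWWWWWW",  # 0% identical to A
    "D": "ACDWWWWWWY",  # 30% identical to A
}
families = group_into_families(toy)
print(families)             # {'A': ['A', 'B', 'D'], 'C': ['C']}
print(sorted(families))     # one representative per family: ['A', 'C']
```

The dictionary keys are the sequences that would be carried forward for modeling, one per family. A production workflow would replace the toy identity function with alignment-based identities.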
Selected glycoprotein sequences, having at least one experimentally verified glycosylation site, were used as input for the web server 3D-JIGSAW. This server builds three-dimensional models for proteins based on homologues of known 3D structure. The automated mode of the 3D-JIGSAW web server resulted in 50 homology-based models of N-linked glycoproteins out of 73 N-glycoprotein sequences and 104 structure models of O-linked glycoproteins out of an initial 173 O-glycoprotein sequences. After careful examination of each model, it was noted that there were only 26 N-linked and 20 O-linked models in which at least one experimentally verified glycosylation site was modeled. Optimization of these models was carried out via CHARMm all-atom force field minimization. Energy was minimized to a gradient of 1.0 kcal/mol using the conjugate gradient protocol available in Discovery Studio version 2.0 (Accelrys Software Inc.) to remove any steric clashes and stabilize the models. The initial potential energy and the potential, van der Waals, and electrostatic energies of the N- and O-glycoprotein models after energy minimization are listed in Tables 1 and 2. Additionally, Ramachandran analysis was performed for subsequent optimization on all 46 models using the SAVES web server (Tables 3 and 4). In the remaining models, the 3D-JIGSAW server was not able to model the experimentally determined glycosylation site owing to the absence of a suitable template, so they were not included in the web resource. Graphics highlighting the experimentally determined glycosylation sites were generated for the modeled structures using VMD; they form part of the database and can also be visualized in Jmol.
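Ramachandran analysis rests on the backbone dihedral angles phi and psi of each residue. The Python sketch below shows how these angles can be computed from backbone atom coordinates. It is an illustration only and does not reproduce the SAVES server; function names and test points are assumptions introduced here.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond.

    Zero corresponds to the cis (eclipsed) arrangement of p0 and p3.
    """
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal of the plane through p0, p1, p2
    n2 = _cross(b1, b2)  # normal of the plane through p1, p2, p3
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    return math.degrees(math.atan2(y, x))


def phi_psi(prev_c, n, ca, c, next_n):
    """phi = C(i-1)-N-CA-C and psi = N-CA-C-N(i+1) for residue i."""
    return dihedral(prev_c, n, ca, c), dihedral(n, ca, c, next_n)


# Simple geometric checks with invented points
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
plus90 = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
print(round(cis, 6), round(plus90, 6), round(abs(trans), 6))
```

Each (phi, psi) pair is one point on the Ramachandran plot. Classifying points into favoured, allowed, and disallowed regions requires reference distributions that are outside the scope of this sketch.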
Although these models were built using automated web-based homology modeling, most of them are within the acceptable ranges of Ramachandran score (Tables 3 and 4), which may provide some initial encouragement to use them for understanding structure-function relationships and for designing mutagenesis and drug design experiments. Protein structure models can be of enormous help in functional genomics, where they provide structural insights for understanding protein function. 3D models have already been employed to identify the enzymatic activities and ligand-binding functions of proteins. Additionally, it is well known that homology modeling requires a high-quality sequence alignment between the target and the template proteins; therefore, human intervention may be a possible solution for models with low scores. In spite of various limitations, homology modelling will remain an essential tool for predicting the 3D structures of proteins, as the number of protein sequences will keep increasing and it is impracticable to solve the 3D structure of each sequence experimentally.
2.3. CBS-Pred Module
Many proteins that interact with carbohydrates (either covalently or noncovalently) are known, but the residues that participate in these interactions often are not. Only a few computational methods have been described to date for predicting covalently attached glycosylation sites [40, 41] in proteins. Similarly, only three methods have been reported for the prediction of carbohydrate-binding sites in proteins based on the 3D structure of the complex [42–44]. In view of this, we earlier developed an algorithm to identify carbohydrate-binding residues from single sequences or their evolutionary profiles. CBS-Pred is an implementation of these algorithms in PROCARB. The module is made up of two submodules, namely, CBS-SS and CBS-PSSM, which use single sequences or alignment profiles, respectively, in the backend to make a residue-wise prediction. Although PSSM-based predictions are more accurate, the single-sequence submodule is provided as a high-speed alternative because generating a PSSM is time consuming. The exact performance scores of these submodules are likely to change as we update the neural network parameters used for prediction with every update of the training data sets. Therefore, prediction performance scores are returned with the server output and can be used to estimate the degree of false predictions.
We also evaluated CBS-Pred by the area under the ROC curve (AUC) (Table 5) on protein-carbohydrate complexes that were submitted to the PDB between January 2007 and November 2008. In this way we obtained ROC plots (Figures 3(a) and 3(b)) for the following two datasets:
(a) PROCARB30: a nonredundant dataset of protein-carbohydrate complexes submitted to the PDB between January 2007 and November 2008.
(b) PROCARB61: a redundant dataset of protein-carbohydrate complexes submitted to the PDB between January 2007 and November 2008.
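For readers unfamiliar with the AUC, the following Python sketch computes it from residue-wise labels and prediction scores using the rank interpretation: the probability that a randomly chosen binding residue receives a higher score than a randomly chosen nonbinding residue. The labels and scores are invented and are unrelated to the values in Table 5.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5).

    labels: 1 for binding residues, 0 for nonbinding residues
    scores: predicted scores, higher meaning more likely binding
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented example: five residues, two of them binding
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1]
print(round(roc_auc(labels, scores), 4))  # 0.6667
```

The pairwise loop is quadratic in the number of residues, which is acceptable for a sketch; a sort-based rank formulation gives the same value more efficiently on large datasets.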
3. Additional Tools
3.1. PROCARB BLAST
A BLAST sequence similarity search is provided, which accepts a user-submitted query sequence and searches it against the databases described above. This may be helpful in identifying homologous sequences in the PROCARB database on the basis of sequence similarity.
3.2. Carbohydrate Finder
Due to the enormous diversity of carbohydrates, it is always difficult to identify whether a given ligand in a PDB coordinate file is a carbohydrate or not. Carbohydrate Finder identifies diverse types of carbohydrates in a given protein-carbohydrate complex. Currently, it can recognize 100 different types of carbohydrates.
3.3. Contact Calculator
Contact Calculator calculates the contacting pairs in a given protein-carbohydrate complex at different cutoff distances and can also recognize 100 different types of carbohydrates that may be in contact with the amino acid residues (Table 6).
A database of protein-carbohydrate complexes and models of unknown glycoprotein structures was developed, and an associated sequence-based prediction module was compiled. We expect that PROCARB will facilitate functional annotation, designing of site-directed mutagenesis experiments, and modeling protein-carbohydrate interactions which in turn will help the experimental and bioinformatics research on understanding protein-carbohydrate interactions.
Availability and Requirements
The financial support from the Indian Council of Medical Research (ICMR) is gratefully acknowledged.
- K.-A. Karlsson, J. Angstrom, J. Bergström, and B. Lanne, “Microbial interaction with animal cell surface carbohydrates,” APMIS, Supplement, vol. 100, no. 27, pp. 71–83, 1992.
- T. A. Springer, “The sensation and regulation of interactions with the extracellular environment: the cell biology of lymphocyte adhesion receptors,” Annual Review of Cell Biology, vol. 6, pp. 359–402, 1990.
- E. G. Bremer, “Glycosphingolipids as effectors of growth and differentiation,” Current Topics in Membranes, vol. 40, pp. 387–411, 1994.
- F. B. Jungalwala, “Expression and biological functions of sulfoglucuronyl glycolipids (SGGLs) in the nervous system—a review,” Neurochemical Research, vol. 19, no. 8, pp. 945–957, 1994.
- E. Rands, M. R. Candelore, A. H. Cheung, W. S. Hill, C. D. Strader, and R. A. F. Dixon, “Mutational analysis of β-adrenergic receptor glycosylation,” Journal of Biological Chemistry, vol. 265, no. 18, pp. 10759–10764, 1990.
- A. Dorato, S. Raguet, H. Okamura, J. J. M. Bergeron, P. A. Kelly, and B. I. Posner, “Characterization of the structure and glycosylation properties of intracellular and cell surface rat hepatic prolactin receptors,” Endocrinology, vol. 131, no. 4, pp. 1734–1742, 1992.
- C. Garcia Rodriguez, D. R. Cundell, E. I. Tuomanen, L. F. Kolakowski Jr., C. Gerard, and N. P. Gerard, “The role of N-glycosylation for functional expression of the human platelet-activating factor receptor. Glycosylation is required for efficient membrane trafficking,” Journal of Biological Chemistry, vol. 270, no. 42, pp. 25178–25184, 1995.
- J. H. Musser, “Carbohydrates as drug discovery leads,” Annual Reports in Medicinal Chemistry, vol. 27, pp. 301–310, 1992.
- R. L. Schnaar, “Complex carbohydrates in drug development,” Advances in Pharmacology, vol. 23, pp. 35–84, 1992.
- S. J. Williams and G. J. Davies, “Protein-carbohydrate interactions: learning lessons from nature,” Trends in Biotechnology, vol. 19, no. 9, pp. 356–362, 2001.
- J. Adam, M. Pokorná, C. Sabin, E. P. Mitchell, A. Imberty, and M. Wimmerová, “Engineering of PA-IIL lectin from Pseudomonas aeruginosa—unravelling the role of the specificity loop for sugar preference,” BMC Structural Biology, vol. 7, article 36, 2007.
- S. Hakomori, “Possible functions of tumor-associated carbohydrate antigens,” Current Opinion in Immunology, vol. 3, no. 5, pp. 646–653, 1991.
- J. H. Naismith and R. A. Field, “Structural basis of trimannoside recognition by concanavalin A,” Journal of Biological Chemistry, vol. 271, no. 2, pp. 972–976, 1996.
- H. M. Berman, T. Battistuz, T. N. Bhat et al., “The protein data bank,” Acta Crystallographica Section D, vol. 58, no. 6, pp. 899–907, 2002.
- GLYCOSCIENCES.de, http://www.glycosciences.de/.
- LECTINES, http://www.cermav.cnrs.fr/lectines/.
- GLYCOCONJUGATE DATABANK, http://www.glycostructures.jp/.
- H. Kubinyi, “Molecular similarity. 2. The structural basis of drug design,” Pharmazie in unserer Zeit, vol. 27, no. 4, pp. 158–172, 1998.
- A. Malik and S. Ahmad, “Sequence and structural features of carbohydrate binding in proteins and assessment of predictability using a neural network,” BMC Structural Biology, vol. 7, article 1, 2007.
- “Jmol: an open-source Java viewer for chemical structures in 3D,” http://www.jmol.org/.
- R. A. Laskowski, E. G. Hutchinson, A. D. Michie, A. C. Wallace, M. L. Jones, and J. M. Thornton, “PDBsum: a web-based database of summaries and analyses of all PDB structures,” Trends in Biochemical Sciences, vol. 22, no. 12, pp. 488–490, 1997.
- PDBeChem, http://www.ebi.ac.uk/msd-srv/msdchem/cgi-bin/cgi.pl.
- W. Kabsch and C. Sander, “Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features,” Biopolymers, vol. 22, no. 12, pp. 2577–2637, 1983.
- S. Ahmad, M. M. Gromiha, H. Fawareh, and A. Sarai, “ASAView: database and tool for solvent accessibility representation in proteins,” BMC Bioinformatics, vol. 5, article 51, 2004.
- I. K. McDonald and J. M. Thornton, “Satisfying hydrogen bonding potential in proteins,” Journal of Molecular Biology, vol. 238, no. 5, pp. 777–793, 1994.
- A. Bateman, E. Birney, L. Cerruti et al., “The pfam protein families database,” Nucleic Acids Research, vol. 30, no. 1, pp. 276–280, 2002.
- C. H. Wu, R. Apweiler, A. Bairoch et al., “The Universal Protein Resource (UniProt): an expanding universe of protein information,” Nucleic Acids Research, vol. 34, pp. D187–D191, 2006.
- A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, “SCOP: a structural classification of proteins database for the investigation of sequences and structures,” Journal of Molecular Biology, vol. 247, no. 4, pp. 536–540, 1995.
- A. Bairoch and R. Apweiler, “The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000,” Nucleic Acids Research, vol. 28, no. 1, pp. 45–48, 2000.
- A. Malik, S. A. M. Arif, S. Ahmad, and E. Sunderasan, “A molecular and in silico characterization of Hev b 4, a glycosylated latex allergen,” International Journal of Biological Macromolecules, vol. 42, no. 2, pp. 185–190, 2008.
- R. Gupta, H. Birch, K. Rapacki, S. Brunak, and J. E. Hansen, “O-GLYCBASE version 4.0: a revised database of O-glycosylated proteins,” Nucleic Acids Research, vol. 27, no. 1, pp. 370–372, 1999.
- P. A. Bates, L. A. Kelley, R. M. MacCallum, and M. J. E. Sternberg, “Enhancement of protein modeling by human intervention in applying the automatic programs 3D-JIGSAW and 3D-PSSM,” Proteins: Structure, Function and Genetics, vol. 45, no. 5, pp. 39–46, 2001.
- “Discovery Studio Version 2,” http://accelrys.com/products/discovery-studio/.
- SAVES, http://nihserver.mbi.ucla.edu/SAVES/.
- W. Humphrey, A. Dalke, and K. Schulten, “VMD: visual molecular dynamics,” Journal of Molecular Graphics, vol. 14, no. 1, pp. 33–38, 1996.
- M. C. Peitsch, “About the use of protein models,” Bioinformatics, vol. 18, no. 7, pp. 934–938, 2002.
- M. C. Peitsch and M. S. Boguski, “The first enzyme among the lipocalin family,” Trends in Biochemical Sciences, vol. 16, p. 363, 1991.
- M. C. Peitsch and M. S. Boguski, “Is apolipoprotein D a mammalian bilin-binding protein?” New Biologist, vol. 2, no. 2, pp. 197–206, 1990.
- H. Venselaar, R. P. Joosten, B. Vroling et al., “Homology modelling and spectroscopy, a never-ending love story,” European Biophysics Journal, vol. 39, no. 4, pp. 551–563, 2009.
- C. Caragea, J. Sinapov, A. Silvescu, D. Dobbs, and V. Honavar, “Glycosylation site prediction using ensembles of Support Vector Machine classifiers,” BMC Bioinformatics, vol. 8, article 438, 2007.
- K. Julenius, A. Mølgaard, R. Gupta, and S. Brunak, “Prediction, conservation analysis, and structural characterization of mammalian mucin-type O-glycosylation sites,” Glycobiology, vol. 15, no. 2, pp. 153–164, 2005.
- C. Taroni, S. Jones, and J. M. Thornton, “Analysis and prediction of carbohydrate binding sites,” Protein Engineering, vol. 13, no. 2, pp. 89–98, 2000.
- C. Shionyu-Mitsuyama, T. Shirai, H. Ishida, and T. Yamane, “An empirical approach for structure-based prediction of carbohydrate-binding sites on proteins,” Protein Engineering, vol. 16, no. 7, pp. 467–478, 2003.
- M. Kulharia, S. J. Bridgett, R. S. Goody, and R. M. Jackson, “InCa-SiteFinder: a method for structure-based prediction of inositol and carbohydrate binding sites on proteins,” Journal of Molecular Graphics and Modelling, vol. 28, no. 3, pp. 297–303, 2009.
- S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” Journal of Molecular Biology, vol. 215, no. 3, pp. 403–410, 1990.
Copyright © 2010 Adeel Malik et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.