Review Article

Protein Bioinformatics Infrastructure for the Integration and Analysis of Multiple High-Throughput “omics” Data

Table 1

Commonly used molecular biology databases for functional analysis of gene and protein expression data.

Database name | Database content | Data access and analysis support | URL

Protein Sequence

UniProtKB/Swiss-Prot and UniProtKB/TrEMBL, UniProt Archive (UniParc) [26] | UniProt protein sequences and functional information; a comprehensive, non-redundant database that contains most of the publicly available protein sequences | Text search; BLAST sequence similarity search; Sequence alignment; Batch retrieval; Database ID mapping; FTP download | http://www.uniprot.org/

NCBI Reference Sequence (RefSeq) [27] | Non-redundant collection of richly annotated DNA, RNA, and protein sequences | Entrez query access; Searching Nucleotide or Protein; Searching Genome; BLAST; FTP download; Sequence Homology searches and retrieval | http://www.ncbi.nlm.nih.gov/

Gene and Genome

GenBank [28] | Genetic sequence database, an annotated collection of all publicly available DNA sequences | Database query; Phylogenetics; Genome Analyses; FTP download | http://www.ncbi.nlm.nih.gov/Genbank/
EMBL [29] | (as for GenBank) | (as for GenBank) | http://www.ebi.ac.uk/embl/
DDBJ [30] | (as for GenBank) | (as for GenBank) | http://www.ddbj.nig.ac.jp/

UniGene [31] | Non-redundant set of eukaryotic gene-oriented clusters of transcript sequences, together with information on protein similarities, gene expression, cDNA clone reagents, and genomic location | Entrez query; Library browse; Digital Differential Display; FTP download | http://www.ncbi.nlm.nih.gov/unigene

FlyBase [32] | Drosophila sequences and genomic information | Aberration Maps; Batch download; BLAST; Chromosome Maps; Coordinate Converter; CytoSearch; GBrowse; ID Converter; ImageBrowse; Interactions Browser; QueryBuilder; TermLink; FTP download | http://flybase.bio.indiana.edu/

Mouse Genome Database (MGD) [33] | Gene characterization, nomenclature, mapping, gene homologies among mammals, sequence links, phenotypes, allelic variants and mutants, and strain data | Genes & Markers Query; Sequence Query; MouseBLAST; Graphical Map Tools; Mouse Genome Browser; Batch Query; MGI Web Service | http://www.informatics.jax.org/

Saccharomyces Genome Database (SGD) [34] | Genetic and molecular biological information about Saccharomyces cerevisiae | Search Gene function information and Protein information; Specialized Gene and Sequence Searches; Search Yeast Literature; BLAST; Batch download; Pattern Matching; Genome Restriction Analysis; PDB Homology Query; Yeast Protein Motif Query; Yeast Biochemical Pathways; Gene Expression Connection | http://www.yeastgenome.org/

WormBase [35] | Data repository for C. elegans and C. briggsae | Gene, Phenotype, Protein, and Genetics Search; Microarray Expression download and Pattern search; Ontology Search | http://www.wormbase.org/

The Arabidopsis Information Resource (TAIR) [36] | Genetic and molecular biology information resource for Arabidopsis | Synteny Viewer; MapViewer; Pattern Matching; Motif Analysis; Bulk Data Retrieval; Chromosome Map Tool; Restriction Analysis | http://www.arabidopsis.org/

Taxonomy

NCBI Taxonomy [37] | Names of all organisms that are represented in the genetic databases with at least one nucleotide or protein sequence | Browse; Retrieve; FTP download | http://www.ncbi.nlm.nih.gov/Taxonomy/

UniProt Taxonomy [26] | UniProt taxonomy database, which integrates taxonomy data compiled in the NCBI database and data specific to the UniProt Knowledgebase | Query the database by keywords (species name) or NCBI taxonomic identifier | http://www.uniprot.org/taxonomy/

Gene Expression

Gene Expression Omnibus (GEO) [38] | Public repository for high-throughput microarray experimental data | Search by accession number; Search Entrez GEO DataSets or Entrez GEO Profiles with keywords; Visualize cluster heat map images; Retrieve other genes with similar expression patterns; Retrieve chromosomally closest 20 genes; FTP download | http://www.ncbi.nlm.nih.gov/geo/

CleanEx [39] | Expression reference database that facilitates joint analysis and cross-dataset comparisons | Search by ID, Gene symbol and target ID; List expression datasets; Text search in expression datasets description lines; Extract all features of common genes between datasets; Experiments pools comparison; Batch retrieval; FTP download | http://www.cleanex.isb-sib.ch/

SOURCE [40] | Functional genomics resource for human, mouse and rat to facilitate the analysis of large sets of data using genome-scale experimental approaches | Search by CloneID, Database Accession, Gene name/Symbol, UniGene ClusterID, Probe ID, and Entrez GeneID; Batch retrieval | http://source.stanford.edu/

ArrayExpress [41] | Public repository for well-annotated data from array-based platforms, including gene expression, comparative genomic hybridization (CGH), chromatin-immunoprecipitation (ChIP), and tiling array experiments | Web-based query interface; REST and Web-services access; FTP download; Web-based online microarray analysis tool (Expression Profiler) | http://www.ebi.ac.uk/microarray-as/ae

Proteomic Peptide ID Databases

Global Proteome Machine Database (GPMDB) [42] | Database that uses the information obtained by GPM servers to aid in peptide validation and in examining protein coverage patterns | Search by protein description keywords and data set keywords | http://gpmdb.thegpm.org/

PRoteomics IDEntifications Database (PRIDE) [43] | Public data repository for proteomics data | Search by PRIDE Experiment accession number and Protein accessions; Browse experiments by project name or categories such as species, tissue, cell type, GO terms and disease; Ontology Lookup Service (OLS); Protein Identifier Cross Reference (PICR) service; Database on Demand (DOD) | http://www.ebi.ac.uk/pride/

Peptidome [44] | Public repository that archives and freely distributes tandem mass spectrometry peptide and protein identification data | Search by Accession, Author, Description, MeSH Terms, Organism, Peptide Count, Platform, Protein Count, Protein GI, Publication Date, Search Engine, Spectra Count, Submitter Institute, Title, Update Date | http://www.ncbi.nlm.nih.gov/peptidome

PeptideAtlas [45] | Database of peptides identified in tandem mass spectrometry proteomics experiments | Search by Protein/Gene Name, Protein/Gene ID, Protein/Gene Symbol, Accession, RefSeq, Sequence and Peptide Accession; Browse Peptides; Browse Proteins; FTP download | http://www.peptideatlas.org/

Protein Expression

Swiss-2DPAGE [46] | Annotated 2D gel electrophoresis database containing data on proteins identified on various 2D PAGE and SDS-PAGE reference maps | Search by description, accession number, author, spot serial number, experimental pI/Mw range and experimental identification methods; Retrieve all the protein entries identified on a given reference map; Compute estimated location on reference maps for a user-entered sequence; FTP download | http://ca.expasy.org/ch2d

Function and Pathway

Kyoto Encyclopedia of Genes and Genomes (KEGG) [47] | Integrated database resource consisting of 16 main databases, broadly categorized into systems information, genomic information, and chemical information | Access by KEGG object identifier; KEGG Web Services and KEGG FTP download; Pathway Mapping; Brite Mapping; KegHier for browsing and searching functional hierarchies in KEGG BRITE; KegArray for analysis of transcriptome data (gene expression profiles) and metabolome data (compound profiles) | http://www.genome.jp/kegg/

BioCyc [48] | Microbial pathway/genome databases | Visualize individual metabolic pathways; View the complete metabolic map of an organism; Genome browsing capabilities and comparative analysis tools | http://biocyc.org/

Genetic Variation and Disease

Online Mendelian Inheritance in Man (OMIM) [49] | A catalog of human genetic and genomic phenotypes | Entrez search at basic, advanced, or complex Boolean levels; Browse entries; Build query; Combine search results; Store search results in Clipboard; FTP download | www.ncbi.nlm.nih.gov/sites/entrez?db=omim

HapMap [50] | Resource for human genetic variation | Browse data; Bulk data download; HapMart, a data mining tool for retrieving data from the HapMap database | http://www.hapmap.org/

Ontology

Gene Ontology (GO) [51] | Controlled vocabulary of terms describing the Biological process, Cellular component, and Molecular function of genes and gene products, together with annotation data | Browsers; Microarray tools; Annotation tools; Mapping to other databases; FTP download in Flat file, MySQL or RDF XML format | http://www.geneontology.org/

Interaction

IntAct [52] | Protein-protein interaction data | Browse by UniProt Taxonomy, Gene Ontology, InterPro Domain, Reactome Pathway, Chromosomal Location, and mRNA expression; FTP download in PSI-MI and PSI-MI TAB format | http://www.ebi.ac.uk/intact

Database of Interacting Proteins (DIP) [53] | Database of experimentally determined interactions between proteins, with annotations generated by curators or computational methods | Search by protein entry, BLAST, Motif, Article and pathBLAST; Data analysis services include Expression Profile Reliability Index, Paralogous Verification, and Domain Pair Verification | http://dip.doe-mbi.ucla.edu/

Modification

RESID [54] | Collection of annotations and structures for protein pre-, co- and post-translational modifications | Web-based search interface; FTP download of database entries in XML format, and associated files containing XML DTD, graphic images, and molecular models | http://www.ebi.ac.uk/RESID

Phosphosite [55] | Database of phosphorylation sites and other post-translational modifications | Search by Protein, Sequence, or Reference; Browse MS data by Disease, Cell Line, and Tissue | http://www.phosphosite.org/

Structure

Protein Data Bank (PDB) [56] | Database of experimentally determined structures of proteins, nucleic acids, and complex assemblies | Web-based search and browsing interface; File download via HTTP and FTP services in PDB, mmCIF, and PDBML/XML format | http://www.pdb.org/pdb/home/home.do

Structural Classification of Proteins (SCOP) [57] | Comprehensive ordering of all proteins of known structure according to their evolutionary and structural relationships | Keywords-based search | http://scop.mrc-lmb.cam.ac.uk/

CATH [58] | Protein domain structures database | Search by ID/Keywords and FASTA sequence; BLAST; Cathedral and SSAP servers for querying and analyzing CATH data; FTP download | http://www.cathdb.info/

Molecular Modeling Database (MMDB) [59] | Database of 3D structures | Search by UID/text term, protein sequence and 3D coordinates; FTP download | http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml

PDBsum [60] | Summaries and analyses of PDB structures | Search by text or sequence; Browse by Highlights, List of PDB codes, Het Groups, Ligands, Enzymes, ProSite and Species; Download data file for protein names, protein sequences, protein annotations, Enzymes, Het Groups, and Ligands | http://www.ebi.ac.uk/pdbsum

Protein Structure Model Database (Modbase) [61] | Annotated comparative protein structure models and related resources | Search by model or sequence similarity and properties | http://modbase.compbio.ucsf.edu/modbase-cgi/index.cgi

Classification

PIRSF [62] | Family/superfamily classification of whole proteins | Batch retrieval using UniProtKB AC, PIRSF ID, Pfam ID, COG ID, EC Number, GO ID, KEGG Pathway ID, PDB ID; PIRSF scan by sequence or UniProtKB identifier; FTP download | http://pir.georgetown.edu/pirwww/dbinfo/pirsf.shtml

UniProt Reference Clusters (UniRef) [26] | UniProt non-redundant reference clusters | Searches on various attributes of the UniRef clusters, including UniRef cluster ID, protein names, organism names and database identifiers; Direct web access in HTML, XML and FASTA format; FTP download in XML format | http://www.uniprot.org/help/uniref

Pfam [63] | Protein domain families, each represented by multiple sequence alignments and hidden Markov models (HMMs) | Search by Sequence, Functional similarity, Keyword, Domain, DNA, and Taxonomy; Browse by Families, Clans, Proteomes; FTP download | http://pfam.sanger.ac.uk

InterPro [64] | Integrated resource of protein families, domains, and functional sites | Text search; SRS text search; InterProScan; InterPro BioMart; Web services; FTP download | http://www.ebi.ac.uk/interpro

Protein ANalysis THrough Evolutionary Relationships (PANTHER) Classification System [65] | Gene products organized by biological function | Search; Browse; Batch search; Gene expression data analysis; Evolutionary analysis of coding SNPs; HMM sequence scoring; FTP download | http://www.pantherdb.org/panther

Simple Modular Architecture Research Tool (SMART) [66] | Resource for protein domain identification and the analysis of protein domain architectures | Sequence analysis; Architecture analysis; Domain detection | http://smart.embl-heidelberg.de/
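
Several resources in Table 1 offer bulk FTP downloads and database ID mapping, and a common first step when integrating expression data is to map protein accessions to gene symbols from a downloaded sequence file. The following is a minimal Python sketch of that step, not a description of any tool in the table. It assumes the UniProtKB FASTA header layout (`>db|ACCESSION|ENTRY_NAME description OS=... OX=... GN=... PE=... SV=...`). The function names (`read_fasta`, `parse_header`, `accession_to_gene`) are hypothetical, and the example records are illustrative: the second is an invented toy entry, and both sequences are shortened placeholders.

```python
import re
from typing import Dict, Iterator, Optional, Tuple

# Assumed UniProtKB FASTA header layout (an assumption of this sketch):
# >db|ACCESSION|ENTRY_NAME Protein name OS=Organism OX=TaxID [GN=Gene] PE=n SV=n
HEADER_RE = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s+(.*)$")
GENE_RE = re.compile(r"\bGN=(\S+)")
TAXID_RE = re.compile(r"\bOX=(\d+)")


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header: Optional[str] = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_header(header: str) -> Optional[Dict[str, Optional[str]]]:
    """Split a UniProtKB-style header into fields; None if it does not match."""
    match = HEADER_RE.match(header)
    if match is None:
        return None
    db, accession, entry_name, rest = match.groups()
    gene = GENE_RE.search(rest)
    taxid = TAXID_RE.search(rest)
    return {
        "db": db,
        "accession": accession,
        "entry_name": entry_name,
        "gene": gene.group(1) if gene else None,
        "taxid": taxid.group(1) if taxid else None,
    }


def accession_to_gene(text: str) -> Dict[str, str]:
    """Map accession -> gene symbol for records that carry a GN= field."""
    mapping: Dict[str, str] = {}
    for header, _sequence in read_fasta(text):
        fields = parse_header(header)
        if fields and fields["gene"]:
            mapping[fields["accession"]] = fields["gene"]
    return mapping


# Illustrative input only: sequences are shortened placeholders and the
# second record is an invented toy entry without a gene name.
EXAMPLE = """\
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
MEEPQSDPSV
EPPLSQETFS
>tr|TOY001|TOY001_TOY Hypothetical toy protein OS=Toy organism OX=0 PE=4 SV=1
MKTAYIAK
"""

if __name__ == "__main__":
    print(accession_to_gene(EXAMPLE))  # {'P04637': 'TP53'}
```

Records without a `GN=` field are skipped rather than mapped to an empty value, so downstream joins against expression tables do not silently match on blanks. For production use, a dedicated ID mapping service or a maintained parser would be the safer choice.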