VibrioBase: A Model for Next-Generation Genome and Annotation Database Development
To facilitate the ongoing research of Vibrio spp., a dedicated platform for the Vibrio research community is needed to host the fast-growing amount of genomic data and facilitate the analysis of these data. We present VibrioBase, a useful resource platform, providing all basic features of a sequence database with the addition of unique analysis tools which could be valuable for the Vibrio research community. VibrioBase currently houses a total of 252 Vibrio genomes developed in a user-friendly manner and useful to enable the analysis of these genomic data, particularly in the field of comparative genomics. Besides general data browsing features, VibrioBase offers analysis tools such as BLAST interfaces and JBrowse genome browser. Other important features of this platform include our newly developed in-house tools, the pairwise genome comparison (PGC) tool, and pathogenomics profiling tool (PathoProT). The PGC tool is useful in the identification and comparative analysis of two genomes, whereas PathoProT is designed for comparative pathogenomics analysis of Vibrio strains. Both of these tools will enable researchers with little experience in bioinformatics to get meaningful information from Vibrio genomes with ease. We have tested the validity and suitability of these tools and features for use in the next-generation database development.
Vibrio is a genus of Gram-negative bacteria possessing a curved rod shape, of which several species are pathogenic. Vibrio cholerae  is one of the well-studied pathogenic species that causes foodborne infection and the cholera disease. Cholera is a diarrheal disease  that can kill within hours if it remains untreated. There are about 3 to 5 million cholera cases with 100,000 to 120,000 fatalities due to cholera infections every year . Other notable members of pathogenic Vibrio spp. include the Vibrio parahaemolyticus , Vibrio vulnificus , and Vibrio harveyi . These microorganisms are naturally found in saltwater and they are a leading cause of seafood-borne infections such as gastroenteritis and septicemia, carried by organisms such as crabs, prawns, or oysters. Infections with the noncholera Vibrio are commonly associated with eating undercooked seafood [7, 8] or open wounds infection . Many other Vibrio spp. can also cause infections in sea-living organisms and are common causes of mortality in domestic marine life.
Many new Vibrio genomes have been sequenced and the availability of these genome sequences from different sources has made it possible to perform genome-wide comparative analyses [9–11]. Such comparative analysis may enhance our understanding of the biology, diversity, evolution, and virulence of the Vibrio bacteria, which may be useful towards combating the Vibrio pathogens. Recently, an online genome database, MabsBase, has been developed specifically for the Mycobacterium abscessus, a genus of Actinobacteria . However, a similar genome database is not available for Vibrio spp. MBGD , IMG , PATRIC , and SEED database  do provide a wide array of microbial genomes including some of the Vibrio strains for comparative genomics, but without much emphasis on virulence factors for comparative pathogenomics and lacking user-friendly web interfaces. PATRIC  does provide virulence factors information but lack functionalities to cluster/compare and visualize user-selected Vibrio strains based on their virulence gene profiles, which are useful for comparative pathogenomics analysis.
To facilitate the ongoing research of Vibrio spp., we have developed VibrioBase, a dedicated platform for the Vibrio research community to host the fast-growing amount of genomic data and facilitate the analysis of these data. This online platform is empowered by advanced web technologies and in-house analysis tools for the Vibrio research community. The comprehensive set of genomic datasets in VibrioBase will facilitate analyses in comparative genomics and pathogenomics among different Vibrio strains or species. Here, we describe the overview and some key features of VibrioBase.
2. Database Content and Organization
VibrioBase currently houses a total of 252 genome sequences from 31 Vibrio spp. obtained from the NCBI (Supplementary Table 1 in Supplementary Material available online at http://dx.doi.org/10.1155/2014/569324). All the genome sequences in VibrioBase were annotated using the rapid annotation subsystem technology (RAST) pipeline . The RAST pipeline is a fully automated annotation engine and helps in recognizing protein encoding genes, rRNAs, and tRNAs and subsequently ascribes functions to the genes. The protein assignments in the RAST pipeline are based on the functional properties within the subsystems (see the subsystems approach to genome annotation and its use in the project to annotate 1000 genomes) and FIGfams  maintained in the SEED system. The annotations include the functional prediction, protein translation of contigs, RNA and CDS classification, subsystem function prediction, and the start and stop positions of sequences. The PSORTb standalone version 3.0  was used to predict the subcellular localization of each putative protein in the VibrioBase datasets. Protein subcellular localization describes the spatial arrangement of proteins within cells, thereby providing important functional information of proteins. The objective of this database is to provide resources for whole-genome annotations and state-of-the-art tools to support the expanding Vibrio research community and to collectively gather information on the strains of Vibrio spp. into a single database. With this annotation information, users are a step closer to the understanding of Vibrio spp. Information of all the strains in this database is linked to the NCBI database. For instance, by following the ORF ID, users will be taken to the NCBI database for further information on that specific region of each particular ORF. We present here some of the key features in our database which would be quintessential for the next-generation genome and annotation database development.
3. Real-Time Data Searching Feature
4. Pairwise Genome Comparison (PGC) Tool: Information Aesthetic for Comparative Genomics
The pairwise genome comparison (PGC) tool is a newly designed in-house comparative analysis tool integrated to allow users to compare two different Vibrio genomes (Figure 1). Two main software components used in the automated pipeline were NUCmer in the MUMmer package  and Circos . This visualization tool is useful in the identification and comparative analysis of two genomes which is reported in a circular ideogram layout, depicting the structural variations and positional relationships between two genomic intervals. The input form of PGC has three parameters which are the minimum percent identity, merge threshold, and link threshold. By default, the Circos in VibrioBase is set to 95% minimum percentage identity and 1,000 bp link threshold; users may change the parameter freely to get different comparative results.
As an example, we compared the genomes of two closely related Vibrio species: V. vulnificus CMCP6 and V. vulnificus YJ016. The two genomes are very similar or conserved in general. Interestingly, Circos plot has shown that there are two putative genomic translocations in the genome of V. vulnificus YJ016 and indels in the two genomes (Figure 2). Further analysis on the sequences of these regions revealed that the two regions in the genome of V. vulnificus CMCP6 are associated with putative prophages as predicted by PHAST , suggesting that these prophages might be inserted into V. vulnificus CMCP6. The insertion of prophages might be due to previous lysogenic bacteriophage infection and this probably confers pathogenicity to the bacterial host and changes its diversity. Here, we have demonstrated that PGC can be used to study the genetic differences between two genomes.
Although a similar tool named Circoletto  is available, there are many differences between this tool and our newly developed PGC. For instance, Circoletto aligns sequences using BLAST (local alignment), but PGC uses the NUCmer (global alignment) package in MUMmer 3.0 , which is advantageous for large-scale and rapid genome alignment. Moreover, PGC also allows users to adjust settings such as minimum percent identity, merge threshold, and link threshold through the provided web interface.
5. Pathogenomics Profiling Tool (PathoProT)
We developed a unique pathogenomics profiling tool (PathoProT) that could identify virulence genes and perform a comparative pathogenomics analysis. PathoProT predicts the virulence genes based on a sequence homology search of the putative protein sequences in each strain against the virulence factors database (VFDB) , depending on the user-defined cut-offs for protein identity and completeness. This tool clusters (agglomerative hierarchical cluster analysis) the predicted virulence genes across different strains using their virulence gene profiles. A user can visualize them in the form of heat map with dendrograms. Furthermore, it allows users to examine the similarities and differences of the virulence gene profiles between different groups of strains, for instance, nonpathogenic versus pathogenic strains. We compared the virulence gene profiles of all 134 strains of the highly pathogenic V. cholera in VibrioBase using PathoProT (Figure 3). Our analysis showed that all strains shared at least 58 conserved virulence genes, amongst which is the exopolysaccharide genes group (epsC-epsF-epsG-epsI-epsJ-epsK-epsL-epsM) [29–31] responsible for carrying out different virulence functions such as protection of bacteria from phagocytosis, phage invasion, and protection of cell against water absorption . Besides showing the conserved virulence genes, the heat map is also able to show strain-specific or group-specific virulence genes.
Interestingly, three strains (V. cholerae HENC 01, V. cholerae HENC 02, and V. cholera HENC 03) have very different virulence profiles as compared to other V. cholerae strains (Figure 4). One of the possible explanations is that these genes were not completely sequenced because the genomes of these strains are not complete. However, this may be unlikely because we observed that many virulence genes are also absent in the three genomes. Another possibility is that these strains were inappropriately classified as V. cholerae. To investigate the second possibility, we constructed a phylogenetic tree based on the 16S rRNA genes from different Vibrio spp. with MEGA4 using the maximum likelihood methods . We found that these three strains are clustered into a group (Figure 4), but not with other V. cholerae strains, supporting our view that these three strains are not of V. cholera. It should be noted that, at the end of our analysis, these strains were recently renamed in the NCBI database to uncategorized Vibrio spp. Taken together, we have shown that PathoProT can be a useful tool for comparative pathogenomics analysis of the Vibrio strains or species, which may enhance our understanding of their virulence, evolution, and classification.
6. Implementation and Future Development
VibrioBase is a HTML5 web application developed in PHP language using CodeIgniter and Twitter Bootstrap as back-end and front-end frameworks, respectively. MySQL database is used for storing information and the webserver is configured using Open Panel. To achieve higher level of performance, a job management and scheduler were developed to manage and monitor jobs, which include Circos plots and BLAST search. AJAX calls were emerged in different components of the application to provide a better user experience.
VibrioBase will be updated from time to time as more genome annotations and genomic sequences of Vibrio become available. To accelerate the development of this database for the use of the scientific community, we encourage other research groups to contact us if they would like to share annotations and related datasets with us. Suggestions on improving VibrioBase and requests for additional functions are also welcome.
Although there are many existing valuable genomic data and analysis tools from different sources, it is important to pool these resources together in order to allow rapid searching and browsing and also to facilitate the analysis of these data. We believe that a specialized resource such as the VibrioBase that is empowered with different analysis tools will prove to be invaluable to the Vibrio researchers both in basic science and clinical research areas.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work was supported by research Grants UM.C/HIR/MOHE/08 and PG077-2012B from the University of Malaya, Kuala Lumpur, Malaysia.
Supplementary Table 1 shows the list of Vibrio species and strains available in VibrioBase. VibrioBase currently hosts a total of 252 genome sequences or strains of Vibrio species retrieved from the NCBI database. There are 24 complete genomes and 228 draft genomes.