Abstract

A metabolome—the collection of comprehensive quantitative data on metabolites in an organism—has been increasingly utilized for applications such as data-intensive systems biology, disease diagnostics, biomarker discovery, and assessment of food quality. A considerable number of tools and databases have been developed to date for the analysis of data generated by various combinations of chromatography and mass spectrometry. We report here a web portal named KOMICS (The Kazusa Metabolomics Portal), where the tools and databases that we developed are available for free to academic users. KOMICS includes the tools and databases for preprocessing, mining, visualization, and publication of metabolomics data. Improvements in the annotation of unknown metabolites and dissemination of comprehensive metabolomic data are the primary aims behind the development of this portal. For this purpose, PowerGet and FragmentAlign include a manual curation function for the results of metabolite feature alignments. A metadata-specific wiki-based database, Metabolonote, functions as a hub of web resources related to the submitters' work. This feature is expected to increase citation of the submitters' work, thereby promoting data publication. As an example of the practical use of KOMICS, a workflow for a study on Jatropha curcas is presented. The tools and databases available at KOMICS should contribute to enhanced production, interpretation, and utilization of metabolomic Big Data.

1. Introduction

A metabolome, which comprises comprehensive data on quantification of metabolites in an organism calculated using metabolomic technologies [9, 10], has been increasingly used for the analysis and practical applications of biological and environmental systems. Within the data-intensive systems biology discipline, metabolomics is particularly important compared to other “omics” (genome, transcriptome, and proteome) disciplines since metabolomes are more closely related to phenotype and regulate gene and protein expression networks [11–13]. Mass spectrometry (MS) and nuclear magnetic resonance spectroscopy (NMR) are complementary techniques often used for the detection and identification of metabolites. MS technology has integrated separation techniques and is used in most cases because of its sensitivity, selectivity, speed, and broad applicability [14–16]. Owing to the wide range of chemical diversity, there is no ideal apparatus that is capable of analyzing all possible metabolites. Combinations of separation techniques with MS, such as liquid chromatography- (LC-) MS, gas chromatography- (GC-) MS, and capillary electrophoresis- (CE-) MS, are chosen according to a study’s purpose [17–19]. Metabolomics technology, including instrumental analysis, detection and identification of metabolites, statistical interpretation, and generation of hypotheses with computational support, is used for a variety of studies, such as functional analysis of biological systems [20–22], biomarker discovery [23, 24], medical diagnostics [14, 25], quality assessment of foods [26, 27], evaluation of genetically modified crops [28, 29], and assessment of environmental pollution [30, 31].

A considerable number of software tools and databases have been developed for processing the complicated and multidimensional metabolome datasets generated by various types of MS-based instruments [32–35]. A typical workflow of metabolomic data analysis includes the following processes: (a) preprocessing of raw data for extraction of metabolite features, annotation of the metabolites, and finally generation of metabolome data; (b) mining and visualization of metabolome data for statistical interpretation of its nature and hypothesis generation; (c) storing and disseminating the data for further utilization and comparison (Figure 1). XCMS2 [36], MzMine2 [37], MathDAMP [38], MetAlign [39], and MET-IDEA [40] are typical tools for preprocessing including detection, alignment, and annotation of metabolite features. Some of these tools also provide statistical analysis functions for data interpretation. MassBank [41], METLIN [42], PRIMe [43], and HMDB [44] are available as references of mass spectra for metabolite annotation. The metabolite data are interpreted by means of the genome information from compound databases such as KNApSAcK [45], PubChem [46], and Chemspider (http://www.chemspider.com/) and by means of metabolic pathway databases including KEGG [47], BioCyc [48], and Reactome [49], which enable data visualization on pathway maps. The raw and processed data are stored publicly in databases such as PlantMetabolomics.org [50], [email protected] [51], SetupX (currently not available), MetabolomeExpress [52], MetaboLights [53], and Metabolomics Workbench (http://www.metabolomicsworkbench.org/).

We report here a portal website named KOMICS (The Kazusa Metabolomics Portal, http://www.kazusa.or.jp/komics/), which hosts tools and databases that we developed for metabolomics. Although an increasing number of tools and databases have become available, two major issues remain to be resolved, that is, comprehensiveness of metabolites [54, 55] and data dissemination [53, 56, 57]. Our primary aim in developing data preprocessing tools is to help researchers with the manual annotation process that remains essential for nontarget metabolomics [54]. PowerGet for LC-high-resolution-MS and FragmentAlign for GC-MS are tools that enable curation of peak alignment results. SpiceHit is a high-throughput metabolite identification tool for CE-MS analysis using the selected ion monitoring (SIM) method. We have also developed data mining and visualization tools for the generation of working hypotheses (KaPPA-View and RnR). Real data is indispensable for comparative analysis and for the development and improvement of preprocessing tools [53, 58]. MassBase is one of the largest raw data repositories, and KomicMarket is a database of metabolic profiling data. We developed a metadata-specific database, Metabolonote, to promote data publication by researchers. These resources for a wide range of metabolome data processing are expected to contribute to improved production and utilization of metabolomic data.

2. Materials and Methods

The standalone tools for metabolome data production, PowerGet, FragmentAlign, and SpiceHit, were developed in Java (Oracle Corporation). The web-based tools and databases were developed with, and run on, Apache, PHP, Perl, MySQL, Java, and Tomcat on Linux servers. The KOMICS website was constructed using the content management system “Joomla!” running on a Linux server with Apache, PHP, and MySQL. Details of the development and license information are given on the individual introduction pages of KOMICS, in the manuals, or in other relevant help resources. The tools and databases are freely available to academic users.

Details of the analytical methods for the evaluation of preprocessing tools are described in the Supplementary Material (see Supplementary Material available online at http://dx.doi.org/10.1155/2014/194812).

3. Results and Discussion

The tools and databases we have developed and provided at the KOMICS web portal are classified into three categories according to the typical workflow of metabolomic data analysis, namely, (a) preprocessing tools, (b) data mining tools, and (c) databases for data dissemination (Figure 1). Here we describe several representative examples. All the currently available tools and databases are listed in Table 1. The number of records in each metabolomics-related database is shown in Table 2. The formats of input and output files and the availability of sample datasets are summarized in Table 3.

3.1. Data Preprocessing Tools
3.1.1. PowerGet

PowerGet is a standalone Java software package for detection, alignment, and annotation of metabolite features from data obtained using LC-high-resolution-MS (HRMS). Accurate mass values measured by HRMS, such as Fourier transform ion cyclotron resonance MS and Orbitrap MS (Thermo Fisher), allow users to predict the elemental composition of a metabolite. The intensity ratio of 13C to 12C isotopic ion peaks is useful for estimating the number of carbon atoms in a molecule. Estimation of ion adducts attached to the metabolites by coeluted ions is helpful for calculating elemental composition and for search of compound databases by mass values of nonionized molecules. The PowerFT module in PowerGet attaches these data automatically to all metabolite features in the LC-HRMS data. In the PowerMatch module, the metabolite features are aligned among the samples taking into account the similarity of MS/MS fragmentation patterns. A tool for refining the alignment results, MatchedIonsFinder [1], is also available via KOMICS.
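The carbon-number estimate mentioned above can be illustrated with a minimal sketch. It is not PowerGet's implementation; it assumes that the M+1 peak arises only from molecules carrying a single 13C atom (contributions from 2H, 15N, 17O, and 33S are ignored), and the function name and example intensities are hypothetical.

```python
# Approximate natural isotopic abundances of carbon (widely tabulated values).
C12_ABUNDANCE = 0.9893
C13_ABUNDANCE = 0.0107


def estimate_carbon_count(mono_intensity: float, m_plus_1_intensity: float) -> int:
    """Estimate the number of carbon atoms from the M and M+1 peak intensities.

    Simplifying assumption: the M+1 peak is produced only by molecules that
    carry exactly one 13C atom, so I(M+1) / I(M) ~= n * (13C / 12C).
    """
    if mono_intensity <= 0:
        raise ValueError("monoisotopic intensity must be positive")
    ratio_per_carbon = C13_ABUNDANCE / C12_ABUNDANCE
    observed_ratio = m_plus_1_intensity / mono_intensity
    return round(observed_ratio / ratio_per_carbon)


if __name__ == "__main__":
    # Hypothetical feature whose M+1 peak is 16.5% as intense as its M peak.
    print(estimate_carbon_count(1_000_000, 165_000))
```

In practice the estimate is best treated as a constraint (for example, plus or minus one or two carbons) when filtering candidate elemental compositions, because measured isotope ratios carry their own error.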

To evaluate the accuracy of mass values of the peaks detected using PowerGet, the mass differences between a theoretical mass and a detected mass were compared to those of the peaks detected using the commercial software, Xcalibur (see Supplementary Method S1). PowerGet exhibited greater accuracy (0.579 ± 0.481 ppm (mean ± SD)) than Xcalibur (0.783 ± 0.563 ppm) in the evaluation of 143 standard compounds (Supplementary Table ).
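For readers who wish to reproduce this kind of comparison, the following sketch shows how a mass error in ppm and its mean and SD can be computed. Whether the reported values are based on absolute errors and on the sample (rather than population) SD is an assumption made here; the function names are illustrative only.

```python
from statistics import mean, stdev


def ppm_error(measured_mz: float, theoretical_mz: float) -> float:
    """Signed mass error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def summarize_abs_ppm(pairs):
    """Return (mean, sample SD) of the absolute ppm errors.

    `pairs` is an iterable of (measured m/z, theoretical m/z) tuples and
    must contain at least two entries for the SD to be defined.
    """
    errors = [abs(ppm_error(measured, theoretical)) for measured, theoretical in pairs]
    return mean(errors), stdev(errors)


if __name__ == "__main__":
    # Hypothetical measurements, not values from the Supplementary Material.
    example = [(200.0002, 200.0), (100.0003, 100.0)]
    print(summarize_abs_ppm(example))
```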

One of the unique functions of PowerGet is that the alignment results are manually editable: through a graphical user interface (GUI), a user can promptly check a metabolite feature’s characteristics, such as mass chromatogram shape, existence of adjacent features, and MS/MS fragmentation patterns (Figure 2). Alignment is essential for preparing sample-by-metabolite intensity matrices for further comparison and statistical analysis. It is also highly valuable when users need to annotate metabolites, especially unknown features. By comparing the features from several replicate samples, (1) errors in the estimation of ion adducts can be detected, (2) the accuracy of mass measurement can be improved, and (3) reproducibly detected features can be prioritized for further annotation. Therefore, alignment errors should be assessed and corrected during detailed annotation of unknown metabolites. PowerGet is utilized in preparing data for KomicMarket and Bio-MassBank (http://bio.massbank.jp/).
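The prioritization of reproducibly detected features can be sketched as follows. This is an illustration of the idea rather than PowerGet's procedure; the matrix layout (feature ID mapped to one intensity per replicate, with None for a missing detection) and the threshold parameter are assumptions.

```python
def prioritize_features(matrix, min_detections):
    """Rank aligned features by the number of replicates in which they were detected.

    `matrix` maps a feature ID to a list of intensities, one per replicate;
    None marks a replicate in which the feature was not detected.
    Features detected in fewer than `min_detections` replicates are dropped.
    """
    counts = {
        feature: sum(value is not None for value in intensities)
        for feature, intensities in matrix.items()
    }
    kept = [feature for feature, n in counts.items() if n >= min_detections]
    # Most reproducible first; ties are broken by feature ID for a stable order.
    return sorted(kept, key=lambda feature: (-counts[feature], feature))


if __name__ == "__main__":
    # Hypothetical alignment result for three replicates.
    aligned = {
        "F1": [10.0, None, 12.0],
        "F2": [5.0, 6.0, 7.0],
        "F3": [None, None, 3.0],
    }
    print(prioritize_features(aligned, min_detections=2))
```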

3.1.2. FragmentAlign

This is a standalone Java tool designed for GC-MS data analysis with functions for alignment and annotation of metabolite features. A GUI for editing the alignment results is also implemented in this software (Figure 3). The similarity of fragment ion patterns generated by electron ionization (EI) is taken into account in the alignment of metabolite features. The metabolite features can also be annotated based on EI fragment patterns, by comparing to patterns from standard compounds. The fragment pattern data of standard compounds can be imported and utilized when the data is written in the format defined by the National Institute of Standards and Technology (NIST), USA.
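To make the comparison of EI fragment patterns concrete, the sketch below reads a small library written in the NIST MSP text layout and scores two spectra by cosine similarity. Both parts are simplifications: the reader interprets only the "Name:" and "Num Peaks:" fields and rounds m/z to nominal mass, and cosine similarity is used here as a common stand-in, since the similarity measure actually used by FragmentAlign is not described in this section.

```python
import math
import re


def parse_msp(text):
    """Parse NIST MSP-style text into {compound name: {nominal m/z: intensity}}.

    Minimal reader: only the 'Name:' and 'Num Peaks:' fields are interpreted;
    other header fields are skipped, and peak annotations are not supported.
    """
    library = {}
    name, peaks, in_peaks = None, None, False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            in_peaks = False  # a blank line ends the current record
            continue
        lowered = line.lower()
        if lowered.startswith("name:"):
            name = line.split(":", 1)[1].strip()
            peaks = library.setdefault(name, {})
            in_peaks = False
        elif lowered.startswith("num peaks:"):
            in_peaks = True
        elif in_peaks and name is not None:
            # Peak pairs may be separated by spaces, tabs, commas, or semicolons.
            numbers = re.findall(r"[-+]?\d*\.?\d+", line)
            for mz, intensity in zip(numbers[0::2], numbers[1::2]):
                peaks[round(float(mz))] = float(intensity)
    return library


def cosine_similarity(a, b):
    """Cosine similarity between two spectra given as {m/z: intensity} dicts."""
    shared = set(a) & set(b)
    dot = sum(a[mz] * b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A score of 1 indicates identical relative intensities on all fragment ions, and 0 indicates no shared ions; any acceptance threshold would need to be chosen empirically.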

To evaluate the applicability of data matrices generated by FragmentAlign for further statistical analyses, a principal component analysis (PCA) was conducted using the GC-MS data obtained from 3 biological sources: Arabidopsis leaves, Lotus japonicus leaves, and Arabidopsis cultured cells. Five replicates of each source were mapped to similar positions, whereas the 3 sources were mapped separately from one another on the score plot (Supplementary Figure ). High-correlation coefficients for peak intensity within the replicates were observed (Supplementary Figure ). These results suggest that appropriate feature extraction and generation of data matrices can be performed successfully using FragmentAlign.
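The within-replicate correlation check can be reproduced on any aligned data matrix with a few lines of code. The sketch below computes Pearson correlation coefficients for every pair of replicates; it assumes that each replicate is a list of peak intensities in the same feature order, and it does not reproduce any transformation (such as log scaling) that may have been applied in the Supplementary Material.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long intensity vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (sd_x * sd_y)


def pairwise_correlations(replicates):
    """Return {(i, j): r} for every pair of replicate intensity vectors."""
    return {
        (i, j): pearson(replicates[i], replicates[j])
        for i, j in combinations(range(len(replicates)), 2)
    }
```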

3.1.3. SpiceHit

The standalone Java tool SpiceHit is intended for high-throughput identification of metabolite features detected using the selected ion monitoring (SIM) method in CE-MS. The metabolite features are identified based on migration times relative to internal standard compounds and are compared to those of the standard compound library prepared in-house. The tool is designed for processing a large number of data files in a high-throughput manner; it requires checking and correcting the assignment errors manually.
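Identification by relative migration time can be sketched as below. This is a hypothetical illustration, not SpiceHit's algorithm: the library values, the tolerance, and the function names are assumptions, and a practical matcher would also restrict candidates to the m/z channel monitored in SIM.

```python
def relative_migration_time(peak_time, internal_standard_time):
    """Migration time of a peak relative to that of the internal standard."""
    if internal_standard_time <= 0:
        raise ValueError("internal standard migration time must be positive")
    return peak_time / internal_standard_time


def identify(peak_time, internal_standard_time, library, tolerance=0.01):
    """Return names of library compounds matching the peak, closest first.

    `library` maps a compound name to its reference relative migration time.
    """
    rmt = relative_migration_time(peak_time, internal_standard_time)
    hits = [
        (abs(rmt - reference), name)
        for name, reference in library.items()
        if abs(rmt - reference) <= tolerance
    ]
    return [name for _, name in sorted(hits)]


if __name__ == "__main__":
    # Hypothetical in-house library of relative migration times.
    reference_library = {"A": 0.80, "B": 0.95, "C": 0.955}
    print(identify(9.52, 10.0, reference_library))
```

When more than one compound falls within the tolerance, as in this example, the candidates still need to be checked manually.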

To ascertain whether SpiceHit is applicable to practical data analysis, the accuracy of peak quantification was compared to that acquired using the commercial software ChemStation (Agilent Technologies, Palo Alto, CA). In the detection of amino acids, the results from SpiceHit were strongly correlated with those from ChemStation, as well as with theoretical concentrations (Supplementary Table ). Similar relative standard deviation (RSD) values for each amino acid in triplicate analyses were observed for SpiceHit and ChemStation (Supplementary Figure ). Good linearity of peak areas common to SpiceHit and ChemStation was observed in the amino acid peaks detected in the biological samples (Supplementary Figure ). These results suggest that the accuracy and the sensitivity of peak quantification by SpiceHit are similar to those of ChemStation and that SpiceHit is suitable for practical use.
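The relative standard deviation used for this comparison is the standard deviation divided by the mean, expressed as a percentage. A minimal sketch follows; the use of the sample SD for triplicates is an assumption, and the example peak areas are invented.

```python
from statistics import mean, stdev


def relative_standard_deviation(values):
    """RSD (%) = sample standard deviation / mean * 100."""
    average = mean(values)
    if average == 0:
        raise ValueError("RSD is undefined when the mean is zero")
    return stdev(values) / average * 100.0


if __name__ == "__main__":
    # Hypothetical peak areas of one amino acid from a triplicate analysis.
    print(relative_standard_deviation([98.0, 100.0, 102.0]))
```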

3.1.4. MFSearcher

This is a web service that allows for rapid prediction of elemental composition from accurate mass values and for rapid searching of compound databases [3]. A GUI tool for MFSearcher queries is also provided as a module in PowerGet. PowerGet has a batch search function for querying thousands of detected metabolite features via MFSearcher.

Because MFSearcher is a RESTful web service, the query parameters are specified directly in the request URL. Numerous sample queries are available as URL links on the MFSearcher website.
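The general pattern of encoding a query in a URL is illustrated below. The base URL and the parameter names (mass, ppm, db) are placeholders invented for this sketch and are not the actual MFSearcher interface; the sample queries on the MFSearcher website show the real syntax.

```python
from urllib.parse import urlencode


def build_query_url(base_url, mass, ppm_tolerance, database):
    """Build a REST-style query URL for a mass search.

    All parameter names are hypothetical placeholders, not MFSearcher's own.
    """
    params = {
        "mass": f"{mass:.5f}",       # query mass, five decimal places
        "ppm": str(ppm_tolerance),   # allowed mass error in ppm
        "db": database,              # compound database to search
    }
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_query_url("https://example.org/search", 180.06339, 2, "KNApSAcK"))
```

Keeping the whole query in the URL makes each search easy to bookmark, share, and issue in batches from a script.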

3.2. Data Mining Tools
3.2.1. KaPPA-View

This is a web-based tool for the visualization of metabolomic data on metabolic pathway maps from the Kyoto Encyclopedia of Genes and Genomes (KEGG) [5]. The degree of change in metabolite abundance between several samples is expressed as hue of the compound symbols drawn on the KEGG pathway maps, based on the compound IDs assigned. Alterations in transcriptome data can be simultaneously depicted on the maps. This tool can be used for the integrated analysis of metabolomic, transcriptomic, and possibly proteomic data.
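One way to express a change in abundance as a hue is sketched below. The color scale is an assumption chosen for illustration (blue for a decrease, green for no change, red for an increase, saturating at a fourfold change) and is not the scale used by KaPPA-View.

```python
import colorsys
import math


def fold_change_to_rgb(ratio, max_abs_log2=2.0):
    """Map an abundance ratio (treated / control) to an (R, G, B) tuple, 0-255.

    The log2 ratio is clamped to [-max_abs_log2, +max_abs_log2] and mapped
    linearly onto hues from 240 degrees (blue) down to 0 degrees (red).
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    log_ratio = max(-max_abs_log2, min(max_abs_log2, math.log2(ratio)))
    fraction = (log_ratio + max_abs_log2) / (2 * max_abs_log2)  # 0 to 1
    hue = (1.0 - fraction) * (240.0 / 360.0)
    red, green, blue = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return tuple(round(channel * 255) for channel in (red, green, blue))


if __name__ == "__main__":
    for example_ratio in (0.25, 1.0, 4.0):
        print(example_ratio, fold_change_to_rgb(example_ratio))
```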

Sample data for testing the color representations on the pathway maps are available on the “Analysis” page of KaPPA-View. Users can select the items according to the directions presented on the page. Sample files for input data are available on the “Download” page.

3.2.2. RnR

This database contains data on the relationship between metabolites and genes; these data were generated via simultaneous measurement of the metabolome and transcriptome of approximately 200 transgenic cultured cell lines of the model plant Arabidopsis. The gene expression patterns and metabolite changes resulting from specific transgenes are compiled in the database. Users can search, for example, for genes that can affect the abundance of a queried metabolite, and vice versa. The database should contribute to knowledge discovery related to gene-to-metabolite regulatory networks in Arabidopsis cells.

To view an example dataset, a clickable pie chart of metabolites is presented on the main page of the RnR website. Clicking on a section of the pie chart will show a list of metabolites. After choosing a metabolite name, users will be able to view candidate genes that are related to the metabolite.

3.3. Databases for Data Dissemination
3.3.1. MassBase

The primary purpose of MassBase is the distribution of raw data generated by analytical instruments. Dissemination of raw data would enable the development and improvement of data analysis tools by bioinformatics researchers [53]. Binary raw data and near-raw text data exported from raw binary results are provided.

Users can search records by sample name, sample description, instrument type, and ionization mode on the “Advanced Search” page (Figure 4(a)). A summary of the records classified by species and instrument type is available on the “Summary” page.

3.3.2. KomicMarket

KomicMarket is a sample-centric database aimed at the distribution of metabolic profiles with and without metabolite annotations (Figure 4(b)). Previous results on the detection and annotation of metabolite features in certain samples can serve as good references for future metabolite annotations [56].

The records can be queried by keywords in the sample descriptions, including peak characteristics such as mass values, and in annotations of the peaks via the GUI on the KomicMarket website. The system provides application programming interfaces (APIs) for performing software-based querying of the data. Using the APIs, we employed the MFSearcher module of PowerGet to search metabolites in KomicMarket.

3.3.3. Metabolonote

This Semantic MediaWiki-based database is intended for managing metadata, that is, the detailed information on the experimental procedures by which data are generated. Metabolonote is expected to accelerate publication of metabolomics data. The raw data obtained from the experiments and the processed data derived from them are not the target of Metabolonote; these “actual data” are normally stored in other databases specifically built for a given purpose. Separating the management of complicated metabolomics metadata from the management of actual data makes it possible to share the same metadata among multiple actual databases such as raw data repositories, metabolic profile databases, reference libraries of MS/MS, and research papers. One-stop-shop management of metadata eliminates its redundant management in several databases and reduces labor for data submitters. We defined a simple data format named Togo metabolomics data format (TogoMD) for easier description of metadata. Specifications of the TogoMD format are documented on the Metabolonote website (http://metabolonote.kazusa.or.jp/TogoMetabolomeDataFormat). Metabolonote provides application programming interfaces (APIs) for semantic searching of the records and retrieval of metadata. Because Metabolonote is a wiki system, it allows the submitters to attach additional free-form information to the metadata, such as images of the samples, video recordings of tricky analytical procedures, and links to a journal’s website where the results are published. The metadata written by a submitter therefore function as a hub of the web data resources related to the submitter’s work. The increased presence of the submitters’ published work on the web should increase citations by others [58], and the wiki system is thus expected to facilitate the dissemination of data to the public. Metabolonote is already linked to seven actual databases, including MassBase, KomicMarket, Bio-MassBank, and Riken PRIMe.

The metadata deposited in Metabolonote are listed on the “Public Pages.” The registered metadata are semantically searchable by various items (and combinations thereof) on the “Metadata Search” page.

3.4. Practical Use

Here we present a workflow for a metabolomics study of Jatropha curcas L. [59, 60], a biofuel plant, to illustrate an example of the practical use of the KOMICS resources (Figure 5). LC-Orbitrap-MS analyses were conducted using 4 developmental stages of J. curcas fruit samples. The acquired data were primarily recorded as a binary raw file with commercial software (Xcalibur, Thermo Fisher). To analyze the data with PowerGet, the chromatogram data were exported to text files using the MSGet tool, which is also available on the KOMICS website. The raw files and extracted text files were published on MassBase. The text data were then processed using the PowerGet tool to generate the metabolomic data. MatchedIonsFinder was used to refine the alignment results. MFSearcher was used for high-throughput search of elemental composition and compound databases. MS-MS Fragment Viewer was used for interpreting MS/MS fragments in the metabolite annotations. The peak information, profile data (in the TogoMD format), and MS spectrum data were stored on KomicMarket (on the New KomicMarket temporary website) and on Bio-MassBank, respectively. These data were recursively used for metabolite annotations during the preprocessing step. Subsequently, the nature of the metabolomic data was interpreted by visualization on pathway maps using KaPPA-View4 and other statistical analyses. Consequently, a drastic change in metabolites during the maturation of J. curcas fruit was detected, and these data should contribute to further analysis of oil production by J. curcas. The record in Metabolonote (metadata ID: SE5) is a good guidepost for finding data resources deposited in various databases on the web.

4. Conclusions

We have developed various tools and databases for a wide range of processes related to metabolomic studies: preprocessing, data mining, and publication. To our knowledge, PowerGet and FragmentAlign are the first tools to allow users to curate alignment results via a GUI. The unique concept of a metadata-specific database should accelerate data publication and dissemination. This infrastructure is expected to help researchers make better use of metabolomic Big Data. Nonetheless, annotation of novel metabolites (the so-called unknown unknowns) remains a serious bottleneck in building comprehensive metabolomic datasets [16, 54]. Continuous efforts are needed to improve and automate annotation tasks. In addition, a systematic collection of annotation skills from experts will be necessary in the near future, as will the analysis and transfer of these skills to the public domain for the education of fledgling annotators.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was partly supported by a grant from the New Energy and Industrial Technology Development Organization (NEDO, Japan) as part of the project “Development of Fundamental Technologies for Controlling the Material Production Process of Plants,” a grant from the National Bioscience Database Center (NBDC) of Japan Science and Technology Agency (JST) as part of the project titled “The Life-Science Database Integration Project,” and a grant from the Kazusa DNA Research Institute.

Supplementary Materials

The Supplementary Material includes 3 supplementary methods and 4 supplementary figures as follows.

Supplementary Method S1: The method for the evaluation of the accuracy of mass values of the peaks detected using the PowerGet tool.

Supplementary Method S2: The method for the evaluation of the applicability of a data matrix resulting from peak alignment in the FragmentAlign tool to a comparative metabolomics analysis.

Supplementary Method S3: The method for the evaluation of the accuracy of peak areas quantified using the SpiceHit tool.

Supplementary Figure S1: Principal component analysis (PCA) of metabolite peaks in Arabidopsis leaves, Lotus japonicus leaves, and Arabidopsis cultured cells detected using GC-MS and aligned by means of FragmentAlign.

Supplementary Figure S2: Correlation coefficients of the values of peak intensity aligned in FragmentAlign.

Supplementary Figure S3: Reproducibility of peak area quantification in SpiceHit and ChemStation.

Supplementary Figure S4: Comparison of values of the peak area quantified using SpiceHit and ChemStation.

Four supplementary tables are as follows.

Supplementary Table S1: The standard compounds used for the comparison of accuracy of mass values obtained using PowerGet and Xcalibur.

Supplementary Table S2: The data used for peak alignment of GC-MS data through FragmentAlign.

Supplementary Table S3: The data used for the evaluation of quantification of peaks using SpiceHit.

Supplementary Table S4: Comparison of accuracy of peak area quantification in SpiceHit and in ChemStation.
