MarVis-Filter: Ranking, Filtering, Adduct and Isotope Correction of Mass Spectrometry Data
Statistical ranking, filtering, adduct detection, isotope correction, and molecular formula calculation are essential tasks in processing mass spectrometry data in metabolomics studies. In order to obtain high-quality data sets, a framework which incorporates all these methods is required. We present the MarVis-Filter software, which provides well-established and specialized methods for processing mass spectrometry data. For the task of ranking and filtering multivariate intensity profiles, MarVis-Filter provides the ANOVA and Kruskal-Wallis tests with adjustment for multiple hypothesis testing. Adduct and isotope correction are based on a novel algorithm which takes the similarity of intensity profiles into account and allows user-defined ionization rules. The molecular formula calculation utilizes the results of the adduct and isotope correction. For a comprehensive analysis, MarVis-Filter provides an interactive interface to combine data sets deriving from positive and negative ionization mode. The software is exemplarily applied in a metabolic case study, where octadecanoids could be identified as markers for wounding in plants.
A central aim of untargeted Metabolomics and Metabonomics studies is the identification of marker metabolites which play a crucial role in the experimental context [1, 2]. Mass spectrometry combined with either gas chromatography (GC/MS) or liquid chromatography (LC/MS) has become a key technology for metabolome analysis under different experimental conditions [3, 4]. A typical data set after peak detection and sample alignment [5–7] consists of several thousand marker candidates which are characterized by a retention time (RT), a mass-to-charge value (), and a multivariate intensity profile of abundance levels per condition, respectively . The experimental conditions are represented by replicate samples and may correspond to environmental disease or genetic perturbations [9–11]. In order to obtain a high-quality data set of experiment-related marker candidates, the raw data set is usually ranked and filtered using supervised machine learning techniques such as Random Forest classification [12, 13] or statistical analysis based on ANOVA or Kruskal-Wallis tests [14–16]. The filtered marker candidates are then annotated according to known metabolites from public biological and biomedical compound databases [17–21]. A central task of annotation is the calculation of actual molecular masses corresponding to each marker candidate by correcting the ratios according to the ionization mode, potential adduct formation, and included natural isotopes . This problem can be addressed by applying the ionization rules , where denotes the number of combined target molecules, the mass of attached molecules (adduct formation), and the degree of ionization (e.g., single or double). Additionally, the number of included isotopes has to be estimated in order to query databases which contain monoisotopic compound masses. Based on a potential ionization rule with parameters , , and the number of included isotopes, the corresponding compound mass can be calculated.
For the corrected masses which cannot be assigned to particular compounds, the identification can be supported by calculating possible molecular formulas. The number of considered formulas can be significantly reduced by incorporating information from preprocessing as well as rules for heuristic filtering of molecular formulas , respectively. A major step in this process is the estimation of the number of included carbon atoms based on the intensity profiles of previously detected isotopologues.
There are a great number of software packages available, which provide tools for statistical analysis of multivariate experimental data [25, 26]. A number of tools for peak detection and sample alignment of mass spectrometry data, such as MetAlign or OpenMS, also support the deconvolution of isotopologues and statistical analysis [27, 28]. For the XCMS platform , a package for the annotation of LC/ESI-MS mass signals based on adduct rules has been implemented . The calculation of possible ionization products and the rule-based heuristic filtering of molecular formulas is provided by several software packages [22, 24]. However, to the best of our knowledge, there is no software available which incorporates all of these methods in a single user-friendly tool as offered by MarVis-Filter.
2. Materials and Methods
In the following sections, the algorithm for adduct/isotope correction and the implementation of MarVis-Filter are described in detail.
2.1. Algorithm for Adduct and Isotope Correction
The algorithm is based on the input of the retention times, ratios, and raw intensity profiles of all marker candidates in a data set and calculates as output the potential monoisotopic mass, ionization rule, and number of included -isotopes for every candidate. The approach is based on a greedy strategy which minimizes the number of potential molecular masses and simultaneously maximizes the similarity of intensity profiles between candidates with a similar retention time and actual mass. This concept follows the paradigm that in mass spectrometry analysis a metabolite is usually represented by several marker candidates with a similar retention time and intensity profile, but different ratios according to the various possibilities of ionization and number of included isotopes. As parameters, the algorithm expects a list of ionization/adduct rules sorted according to their relevance, the assumed maximal number of -isotopes per marker candidate, a mass tolerance, an RT tolerance, and a minimal cosine similarity of intensity profiles. The isotopologues correction is restricted to the detection of .
For storage of pairwise cosine similarities between candidate profiles, the algorithm utilizes a five-dimensional matrix . Each entry corresponds to the maximal cosine similarity between the intensity profile of candidate , assuming ionization rule and -isotopes, and another candidate, which has a similar retention time (within tolerance) and corrected mass (within tolerance) assuming ionization rule and -isotopes. For each candidate , the algorithm then chooses the ionization rule and number of -isotopes which is supported by the highest sum of cosine similarities. In the following, the algorithm is described in detail.(1)Initialize with zeros.(2)Calculate all possible masses by applying all ionization rules and number of -isotopes to all candidate ratios.(3)Consider all pairs of potential masses under the following constraints and fill with pairwise cosine similarities of corresponding candidate profiles.(i)Consider only pairs of different marker candidates.(ii)Consider only pairs within the mass and RT tolerance.(iii)Consider only pairs with at least the requested cosine similarity.(iv) Consider only pairs with different combinations of adduct rules and number of isotopes.(v) For each entry in hold only the maximum cosine similarity.(4) Calculate the reduced three-dimensional matrix with summed entries: (5)Choose for each candidate : the adduct rule and isotope number with the maximal sum of similarities . If , use the first ionization rule and zero -isotopes as default.(6)Calculate the masses according to chosen rules and isotope numbers.
In order to avoid apparently false associations between marker candidates, negative cosine similarities are disregarded. If for a given candidate different selections of the ionization rule and the number of isotopes maximize the sum of cosine similarities, the ionization rules with the highest relevance and the minimal number of -isotopes are selected.
Following the annotation of the ionization rules and -isotopes, the number of carbon atoms per candidate is estimated by comparing the raw intensities of marker candidates with zero predicted -isotopes () and the respective marker candidates including one -isotope () according to the following formula: corresponding to the natural abundances of carbon isotopes. Given a pair of candidates, annotated as isotopologues ( and ) and with the same ionization rule, a robust estimation of the number of carbon atoms is obtained by calculating the median over all samples included in both intensity profiles.
MarVis-Filter is implemented in the Matlab and C programming language and has been compiled together with the MarVis-Cluster tool  for Microsoft Windows XP/Vista/7. Execution of the software requires installation of the Matlab Compiler Runtime, which is provided with the software. The installation packages, the documentation, and example data sets can be downloaded from the project home page http://marvis.gobics.de/.
For data import and export MarVis-Filter uses the CSV (Comma Separated Values) file format, which can easily be processed by statistical analysis software and spreadsheet applications. MarVis-Filter also supports the direct import of aligned mass spectrometry samples from MarkerLynx Application Manager of MassLynx (Waters Corporation, Milford). For interactive analysis, ranking and filtering of multivariate intensity profiles MarVis-Filter provides the well-known one-way ANOVA and Kruskal-Wallis tests  combined with methods for value adjustment for multiple-hypothesis testing [30, 31]. Based on customizable lists of ionization rules, the adduct/isotope correction can be performed on raw or filtered data sets. The ionization rules are imported as text files and can easily be adapted or extended.
Figure 1 shows the main window of MarVis-Filter after import and ranking. The “Ranking plot” (1) displays the adjusted values (-axis) of all candidate intensity profiles in the current data set sorted in ascending order. The data set can interactively be filtered according to a user-defined significance level by selecting a marker, sliding the red separator line or jumping to a predefined level. The “Profile plot” (2) shows the raw intensity profile of the currently selected marker candidate. Intensity values of replicated samples belonging to the same experimental conditions are marked in the same color. The “Marker information box” (3) displays information about all marker candidates of the data set arranged according to the values and characterized by the ratio, RT and additional user-defined scores, which can be imported along with the data set. After adduct and isotope correction, the additional annotations are displayed in this listbox as well. The “Data set clipboard listbox” (4) shows data sets which are currently held in the MarVis clipboard. The current (filtered or unfiltered) data set can simply be added or removed to/from this list. The data set clipboard supports an adduct and isotope correction of selected data sets in a batch mode. Data sets which were corrected based on different sets of ionization rules (e.g., positive and negative ionization) may be combined into one single data set.
For selected candidate profiles, bar plots, standard error plots, and boxplots can easily be inspected and exported in various image formats. For detailed analysis, the user can zoom into all plots. Additionally, MarVis-Filter provides a convenient interface for quick candidate search based on the ID, RT, , or mass value.
MarVis-Filter also provides a molecular formula calculator, which is based on the Seven Golden Rules  and utilizes the estimated number of carbon atoms per marker candidate obtained after adduct and isotope correction.
MarVis-Filter and MarVis-Cluster  are combined in the MarVis-Suite which features the direct data exchange between preprocessing in MarVis-Filter and convenient visualization of multivariate intensity profiles and high-level cluster analysis in MarVis-Cluster.
3. Results and Discussion
The functionality of MarVis-Filter is demonstrated using two data sets of a metabolomic case study for plant wounding experiments . The data sets are available on the project homepage http://marvis.gobics.de/ together with a detailed description of the extraction and UPLC-TOF method. Additionally, the data sets are available for import in MarVis-Filter after installation of the MarVis-Suite (wound_neg_raw.csv and wound_pos_raw.csv in the examples directory).
3.1. Case Study and Data Sets
The case study reflects a wounding time course of Arabidopsis thaliana wild-type (WT) plants as well as of mutant plants (dde 2-2), which are deficient in the biosynthesis of the plant wound hormone jasmonic acid and its derivatives . The wounding time course represents eight experimental conditions. The first four conditions reflect the metabolic situation within a wounding time course of wild-type (WT) plants, starting with the unwounded control plants (abbreviation wt_0) followed by the plants harvested 0.5 (wt_30), 2 (wt_2), and 5 hours past wounding (wt_5). The conditions 5 to 8 represent the analogous time course for the jasmonate deficient mutant plant dde 2-2 (aos_0, aos_30, aos_2, aos_5). Each condition contains nine replicate samples.
3.2. Data Import and Analysis in MarVis-Filter
The two data sets are imported sequentially in MarVis-Filter using the “Import raw CSV data” entry in the “File” menu with the following options: Delimiter: “,”; Start row: 5, Start column: 3; ID label: “id”; Generate IDs: activated; column: 2; label: “rt”; column: 3; label: “”; Condition identifiers: “wt_0, wt_30, wt_2, wt_5, aos_0, aos_30, aos_2, aos_5”.
After data import, the marker candidates are sorted and ranked according to the values of a Kruskal-Wallis test and the Bonferroni-Holm adjustment for multiple hypothesis testing  by selecting the corresponding checkboxes in the “Filter dialog” and the “Adjustment for multiple testing” dialog.
Adduct and isotope correction are performed on the full data sets separately using predefined sets of adduct rules for the negative (Table 2) and positive ionization mode (Table 3), an RT tolerance of 0.04 minutes, a mass tolerance of 0.005 Da, a minimal cosine similarity of 0.75, and a maximum number of two -isotopes per candidate. The adduct rules had been determined in previous targeted UPLC-TOF-MS experiments. After correction, the data sets are filtered according to a significance level for adjusted values of 0.01 (“Goto level” entry in “Selection” menu) and added to the MarVis data set clipboard. Table 1 shows the initial number of imported marker candidates and the number of high-quality marker candidates after filtering. Finally, the two data sets in the MarVis clipboard are concatenated using the “combine” button. The combined data set can be sorted according to a user-defined method once again and is then presented in a new MarVis-Filter window. After selecting the whole data set, the combined subset of 3504 high-quality marker candidates can be exported as a CSV file, and clustered as well as visualized using MarVis-Cluster (“Goto MarVis-Cluster” entry in the “MarVis-Suite” menu). Figure 2 shows the results from clustering of the filtered and combined data in MarVis-Cluster.
3.3. Identification of Metabolites
The corrected, filtered, and combined data sets were used to identify metabolites which show a significant change of abundance in the wound time course in WT and/or jasmonate deficient mutant plants. First, the corrected masses of marker candidates were matched to molecular masses of all compounds recorded in the KEGG  and AraCyc  database or literature  based on a tolerance of 0.005 Da. The identity of marker candidates was confirmed based on the isotopic pattern and coelution with identical standards or MS/MS fragmentation . Thus, a number of oxylipins could be identified as wound-induced metabolite markers (see Table 4). Oxylipins are metabolites deriving from lipid peroxidation and are involved in regulating developmental processes as well as environmental responses, like the inflammatory or wound response, in nearly every organism. Among these bioactive lipids, the mammalian and plant oxylipins are the best characterized ones. Mammals use predominantly C20 fatty acids (eicosanoids), while in plants C18 fatty acids are most abundantly used for the biosynthesis of oxylipins or so-called octadecanoids . The identified oxylipins (see Table 4) are part of the -linolenic acid metabolism or members of the compound class of mono- and digalactosyldiacylglycerols. They are described in the context of plant wounding [33, 34, 36]. Thirteen of the fifteen identified oxylipins could only be detected in either the negative or the positive ionization mode. On average, five ions/marker candidates could be assigned per compound. The findings are supported by very low adjusted values from the Kruskal-Wallis test of the intensity profiles (see previous section and Table 4).
MarVis-Filter combines essential preprocessing tools for mass spectrometry data analysis within a single user-friendly tool. Large data sets from the negative and positive ionization mode can easily be imported, corrected, filtered, and combined. Lists of ionization rules for adduct correction can be customized, extended, and commented in a convenient way using a standard text editor. Within the MarVis-Suite filtered and combined data sets can directly be clustered, visualized, and analyzed in detail using the MarVis-Cluster tool. In a case study 75 high-quality marker candidates could be clearly assigned to fifteen compounds of the oxylipin class based on the adduct and isotope correction in MarVis-Filter. The combination of data sets deriving from the negative and positive ionization mode is an important step for further data analysis. In the case study, most of the identified metabolites could only be detected in either the negative or the positive mode. The significance of the selected wound markers is supported by a high number of annotated and assigned ions/marker candidates and by very low adjusted values from the Kruskal-Wallis test. The statistical filtering of marker candidates reduced the complexity of the data sets from about 48000 to 3500 significant candidates (about 7 percent).
This work was partially supported by the DFG FOR-546-FE 446/2-3 to I. Feussner and by the Federal Ministry of Education and Research (BMBF 0315595A) to P. Meinicke and I. Feussner. A. Kaever and M. Landesfeind were supported by the Biomolecules program of the Göttingen Graduate School for Neurosciences, Biophysics, and Molecular Biosciences (GGNB). The authors are grateful to Lars Söder for support in software development and testing, to Farina Schill for critical reading the manuscript and to Pia Meyer for expert technical assistance.
J. K. Nicholson, J. C. Lindon, and E. Holmes, “'Metabonomics': understanding the metabolic responses of living systems to pathophysiological stimuli via multivariate statistical analysis of biological NMR spectroscopic data,” Xenobiotica, vol. 29, no. 11, pp. 1181–1189, 1999.View at: Google Scholar
K. Nahlik, M. Dumkow, Ö Bayram et al., “The COP9 signalosome mediates transcriptional and metabolic response to hormones, oxidative stress protection and cell wall rearrangement during fungal development,” Molecular Microbiology, vol. 78, no. 4, pp. 964–979, 2010.View at: Publisher Site | Google Scholar
M. Beckmann, D. P. Enot, D. P. Overy, and J. Draper, “Representation, comparison, and interpretation of metabolome fingerprint data for total composition analysis and quality trait investigation in potato cultivars,” Journal of Agricultural and Food Chemistry, vol. 55, no. 9, pp. 3444–3451, 2007.View at: Publisher Site | Google Scholar
J. Gibbons and S. Chakraborti, Nonparametric Statistical Inference, CRC Press, 2003.
M. Kanehisa and S. Goto, “KEGG: Kyoto encyclopedia of genes and genomes,” Nucleic Acids Research, vol. 28, no. 1, pp. 27–30, 2000.View at: Google Scholar
R. Tautenhahn, C. Böttcher, and S. Neumann, “Annotation of LC/ESI-MS mass signals,” in 1st International Conference on Bioinformatics Research and Development (BIRD '07), vol. 4414 of Lecture Notes in Computer Science, pp. 371–380, Berlin, Germany, March 2007.View at: Google Scholar
R. Gentleman, R. Ihaka et al., http://www.r-project.org/.
B. Jones, MATLAB: Statistics Toolbox User's Guide, MathWorks, 1993.
S. Holm, “A simple sequentially rejective multiple test procedure,” Scandinavian Journal of Statistics, vol. 6, no. 2, pp. 65–70, 1979.View at: Google Scholar
Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society. Series B, vol. 57, pp. 289–300, 1995.View at: Google Scholar
B. Von Malek, E. Van Der Graaff, K. Schneitz, and B. Keller, “The Arabidopsis male-sterile mutant dde2-2 is defective in the ALLENE OXIDE SYNTHASE gene encoding one of the key enzymes of the jasmonic acid biosynthesis pathway,” Planta, vol. 216, no. 1, pp. 187–192, 2002.View at: Publisher Site | Google Scholar
A. Ibrahim, A. Schütz, J. Galano et al., “The alphabet of galactolipids in Arabidopsis thaliana,” Frontiers in Plant Physiology, vol. 2, article 95, 2011.View at: Google Scholar