Abstract

A vast amount of microbial sequencing data is being generated through large-scale projects in ecology, agriculture, and human health. Efficient high-throughput methods are needed to analyze the mass amounts of metagenomic data, all DNA present in an environmental sample. A major obstacle in metagenomics is the inability to obtain accuracy using technology that yields short reads. We construct the unique -mer frequency profiles of 635 microbial genomes publicly available as of February 2008. These profiles are used to train a naive Bayes classifier (NBC) that can be used to identify the genome of any fragment. We show that our method is comparable to BLAST for small 25 bp fragments but does not have the ambiguity of BLAST's tied top scores. We demonstrate that this approach is scalable to identify any fragment from hundreds of genomes. It also performs quite well at the strain, species, and genera levels and achieves strain resolution despite classifying ubiquitous genomic fragments (gene and nongene regions). Cross-validation analysis demonstrates that species-accuracy achieves 90% for highly-represented species containing an average of 8 strains. We demonstrate that such a tool can be used on the Sargasso Sea dataset, and our analysis shows that NBC can be further enhanced.

1. Introduction

While pattern recognition methods have been used in intron/exon identification [1], motif-finding [2], and microRNA prediction [3], these methods have not been applied to whole-genome identification and taxonomical relationships until recently. Now, there are a rapidly growing number and diversity of sequenced genomes across the evolutionary spectrum enabling a systematic study. This makes it possible to use these methods combined with biological insight to identify meaningful features and patterns to reveal relationships among DNA sequences that are not just limited to specific 16S rRNA genes but to any random fragment. A direct parallel between text classification and DNA classification can be made and seen in Figure 1. Until recently, bioinformatics approaches to metagenomics have been limited due to their lack of available data. Because of the lack of knowledge about genome diversity, most phylogenetic studies of metagenomic samples examine 16S ribosomal RNA genes for diversity [4]. This is because 16S rRNA sequences produce the fundamental protein needed for transcription, and therefore they are highly conserved across all species of life. Also, they contain insertion and deletion variation that makes their information content unique to various genera and species [4]. However, it has been shown that organisms that are identical or cluster tightly under 16S criterion cannot be concluded to share all or, in some cases, essential physiological similarities [5]. Thus, definition of species on this basis is not adequate for assessing the functional diversity of prokaryotic communities. In fact, it has been noted that the hot-spring microbes have ecologically important differences that have less than 1% 16S rRNA sequence divergence [5]. This has led scientists to consider new ways to identify the species/strain content of a clinical or environmental sample.

Unfortunately, in a less than ideal metagenomic sample, scientists do not always have the luxury of extracting these 16S genes. If blind methods existed to assess the taxonomical content of the sample from these random fragments, it would yield a high-throughput analysis especially when combined with short-read next-generation sequencing technology. Next-generation sequencing promises extremely high throughput, but at a price, it yields short reads. Currently, many metagenomic tools use BLAST as a first step to identify a sample's content [68]. But BLAST's [9] ability to assign short reads to strains in the database yields many ambiguous results, and it has been recently reported that BLAST breaks down when going from long 600–900 bp reads to short 100–200 bp reads for metagenomics data [7]. Huson et al. suggest that a “sweet" spot may exist around the 200 bp threshold for accuracy rates [6]. Wang et al. verify that with 16s rRNA sequences, one can get 83.2% accuracy (200 bp fragments) and 51.5% (50 bp) on the genus level via a leave-one-out cross-validation test set [10]. Krause et al. suggest that with 80–120 bp reads, the superkingdom can be classified with 81% accuracy and the order can be classified with 64% accuracy [11]. Of course, most of these techniques use different sets of corpora and therefore the methods are difficult to compare although the main goal in identification is to gain as good of accuracy rates as possible. In general, researchers have deduced that fragments longer than 200 bp are needed in metagenomic applications. Yet, newer and faster sequencing technologies yield 20–35 bp reads in order to parallel the process, and scientists are questioning whether the technology is worth it due to the short reads [7]. Therefore, the holy grail of high throughput metagenomics is short-read DNA classification with reasonable accuracy.

In this paper, we construct the unique -mer frequency profiles of 635 microbial genomes (including 470 unique species and 260 unique genera), publicly available as of February 2008. These profiles are used to train a naive Bayes classifier (NBC) that can be used to identify the genome from which a fragment may have been sequenced as part of a metagenomic data set. In Section 3, the methodology for naive Bayes classifier is presented, an example is given, the word frequency computations are discussed, and the methodology to obtain the confidence of our classifier validation is presented. In Section 4, NBC for the small (25 bp) fragment case is compared to the most widely used identification method, BLAST. We then assess the method's cross-validation performance (unseen-strains) for species-level classification. Finally, we test the NBC on the Sargasso Sea set and compare the results to MEGAN, a BLAST-based taxonomy presentation. The preliminary results show that an -mer-count global perspective can yield a reasonable classification of metagenomic sequences that does not produce ambiguity. In Section 5, a discussion of the advantages and disadvantages of the method is shown. With further enhancements, this approach can yield a promising way to solve the strain resolution that BLAST has no chance to resolve with sequence identity scoring.

2. Background

Sequence classification methods have traditionally aligned two sequences (usually homologous genes) to compare their similarity. The progress has been slow due to lack of demand, with the Needleman-Wunsch [12] algorithm introduced in 1970 and the Smith-Waterman algorithm [13] following over a decade later. Multiple-sequence alignment is an extremely important tool for phylogeny but did not have viable tools until the late 80s [14] due to the lack of sequenced genomes. Counting on BLAST to find homologous genes and sequence phylogenies is feasible, but it would be simpler to identify characteristic features unique to a group, such as a genotype signature representing all pathogenic E. coli, independent of encoded genes. In fact, most comparative techniques focus on the comparison of genes because they signify conserved regions and functions related to a phenotype [15]. Also, they conveniently ignore horizontal transfer which can insert locally anomalous characteristics [16]. In bacteria, this is especially true and phylogenetic footprinting uses gene homologs although there has been mounting evidence of use of noncoding RNAs [17]. Also, standard methods that ignore horizontal gene transfer cannot analyze the complete evolution of a microbial community or identify characteristic markers that may exist. Therefore, we seek a framework that represents the entire DNA in a sample without prior knowledge of the genes, promoters, and so forth. in the DNA sequences. We propose such approach that uses the naive Bayes classifier, which is able to identify significant features in a blind and high-throughput manner.

Existing methods to identify metagenomic sample content involves profiling clones with microarrays that identify previously unknown genes in environmental samples [18], subtractive hybridization to eliminate all sequences that hybridize with another environment, or subtractive hybridization to identify differentially expressed genes [19], and genomic signature tags [20]. The latter method is a way to extract particular 21-22 bp tag sequences that can be used to examine intraspecific genomic variation and, if genome information is available, provide immediate species identity. Further, it pinpoints areas of a genome that might have undergone changes which add or delete restriction sites.

Our approach is to use -mer frequencies, or words, of sequences as features to classify genomic fragments. Using DNA words as genomic features for discrimination and phylogenetic measures has previously been explored. For example, when faced with a contig that originated from an unknown, and never-to-be-recovered bacterial cell, Glöckner described how multivariate analyses of small-scale DNA architecture (e.g., comparing tetra-nucleotide usage) revealed a reasonable measure of fragment relatedness [21]. For tetra-nucleotides, it has even been demonstrated that their frequencies carry an innate but weak phylogenetic signal [22]. Other researchers have explored observing the patterns in oligonucleotide frequencies and unusual features [2325]. A recent notable work is to construct a phylogenetic tree via variable-length segments and their frequency occurrence [26].

3. Methods

3.1. Naive Bayes Classifier

The term “classifier” is used in the sense of a statistical tool, trained using the full genomic data, to discriminate between “classes.” Each “class” is a strain, species, family, and so forth, which depends on the particular class label definitions. In our work, we examine the cases where the classes are strains, species, and genera. The classification method will provide us with a way to predict class labels from fragment features, and the results are assessed for varying length of features and fragment sizes.

A naive Bayes classifier (NBC) is based on applying Bayes's theorem assuming that each feature in the classification is independent of each other. This strong independence assumption is the naivity of the algorithm, but the NBC has been shown to perform well in complex situations [27].

In this case, our features are composed of DNA words (-mers). -mers are -length words of DNA that may or may not be overlapping. The foundation of our analysis is correlating the frequencies of these -mers in a sequence to its overall identity. It is analogous to predicting the genre of a book from its word content. For example, a book about law is more likely to contain high frequencies of “law,” “court,” and “ruling” than this article which contains high frequencies of “genome,” -mer,” and “fragment.”

Let be the feature vector, composed of a set of words (or -mers) in an -length fragment, . To label in one of the genome classes, , the posterior probability of a particular class, , given the feature vector, , is . The Bayes classifier chooses the predicted class, , with the largest posterior probability given that is observed This expression guarantees minimum error across the whole space spanned by the features in .

The posterior probability, , can be calculated by using the Bayes rule: In other words, the probability, , of the genome class given the word features is equal to the probability, , of the words given the class times the prior probability of observing that genome class, , divided by the unconditional probability of observing the words, , that compose a fragment, . The is constant given a particular fragment.

The naive Bayes classifier assumes conditional independence between the -mer features and calculates the class-conditional probability as a product of individual probabilities: where is the number of overlapping -mers in the fragment, .

The individual conditional probabilities, , are obtained by dividing the number of each fragment -mer in the genome, , by the total number of -mers in that genome , where is the length of .

In (2) , the prior probability of the genome, , is assumed to be in our hypothetical environmental sample. We make the assumption that our sample is uniform, or each genome is equally likely. In this case, our sample content is unknown, and in the absence of such prior knowledge, equal priors are typically used. With prior knowledge about the environment, a better estimate can be obtained. We also do not know the probability of obtaining a fragment with a set of words, , but this unconditional probability will be constant across the scoring function in (1), so it can be omitted. Therefore, we omit both and components in (1) and use the following scoring function for our work:

As increases, the score, , can become very small and introduce precision errors into the computation. Due to numerical precision errors, we take the log probability to obtain our final scoring function

3.2. Calculation of -mer—Frequencies,

Since we need to know the oligo words (or -mers) as genomic features in the naive Bayes algorithm, an efficient implementation is devised to compute the frequencies of all -mers. We denote their frequencies as for each genome. We have implemented two such methods, one optimized for short words and the other optimized for long ones.

The first one is a general method that works for any sequence length, which generates all possible -mers that overlap by nucleotides. Once all possible -mers are generated, they are sorted and then the cardinalities of recurring -mers are computed. An illustration of this method is seen in Figure 2.

An optimized count mechanism is used when is “small” (defined as or less). By storing a finite bit counter for each -mer in memory for , time and memory can be saved because the algorithm does not have to store each word in memory like the first method. We generate a list of the size of all possible combinations of -mers. Each entry in the table stores an -bit counter for each alphabetically-sorted -mers. is heuristically determined by examining the sequence length and the possible -mers—if the sequence length is much less than , then is low, otherwise, is increased accordingly. Then incrementing down the sequence with an -length window, the counter that corresponds to each -mer in the table is incremented. If a counter overflows, another -bit counter is mapped from the first counter to account for the extra counts. While this slows the algorithm down, it is unlikely to occur. This phenomenon is related to Zipf's law [28] which is a power law that states that the frequency of any word is inversely proportional to its rank, , where is the rank of the word, in the frequency table. Therefore, only a few -mers will have high frequencies that need additional counters. The algorithm is summarized in Figure 3.

To further illustrate Zipf's law, we illustrate the -mer frequencies of 3 different strains of E. coli in Figure 4. A trend close to Zipf's law (the inverse rank-frequency relationship). Zipf's law curve can be modeled with an exponent as where is the exponent order, is -mer length, and the -mer_rank is the order of the frequency rank on the -axis. We can see that the log-log E. coli curves tend to follow this law.

A comparison of the algorithm run times for and for various genomes can be seen in Table 1. The optimized run times are similar to those seen in other computational methods for frequencies [29], but other methods rarely calculate -mers larger than 20-mers [30]. We can compute any size, and one of the parameters we will be looking for is the optimum -mer size for separability among the data sets.

3.3. Confidence Intervals for Accuracy Calculations

To validate our model, we choose 100 random fragments from each training-set genome, totaling 63 500 fragments tested. Once we receive the result of the scoring algorithm, the genome that scores the highest is marked as correct or incorrect using prior knowledge of the true genome. This enables us to average the binomial distribution of correct(1)/incorrect(0) labels to produce an average accuracy per genome (as seen in Figure 5). The confidence of our average accuracy over 100 random fragments can be computed by using the formula for computing the confidence interval for a binomial distribution. The binomial distribution is approximated by a normal distribution. It has been shown that for over 30 trials, a binomial distribution obeys the normal distribution due to the central limit theorem. The true accuracy with its confidence interval is defined as where is the estimated average accuracy, is the critical value corresponding to the percentile of the standard normal distribution, is the sample size, and is the standard deviation of the binomial distribution.

4. Results

The naive Bayes classification is performed on all completed microbial sequences in the NCBI Genbank as of February 2008, which totaled 635 distinct microbial strains. The 635 microbes belong to 470 distinct species and 260 distinct genera in this data set. 404 strains are the sole member of their species class while 171 strains are the sole member of their genus in the data set. This shows that some knowledge will be lacking when it comes to species- and genus-class diversity. While 66 species contain more than one strain, 89 genera contain more than one strain. The microbial strains genome lengths range from for Candidatus Carsonella to for Sorangium Cellulosum.

4.1. The Naive Bayes Classification of the 635-Strain Genome Data Set
4.1.1. Matching to the Nearest Strain

To evaluate the performance of our classifier's ability to classify a given fragment in our database, we test over varying and fragment lengths used in the scoring function (5). These two parameters are varied and the scoring function is calculated for all 635 microbial genomes. The fragment length is chosen as 500 bp, 100 bp, and 25 bp to simulate long and short reads. The -mer lengths is varied for 3-, 6-, 9-, 10-, 11-, 12-, 13-, 14-, and, 15-mers to test performance improvement over these lengths.

To validate our model, we choose 100 random fragments from each training-set genome, totaling 63 500 fragments tested. The 100 fragments are averaged to obtain a strain accuracy per genome. Figure 5 demonstrates how increasing changes the individual strain accuracy rates. For , most strains have a very poor 0–5% classification rate, and interestingly various strains have performance across the board with mers.

For a 95% confidence interval, the critical value is 1.96. Therefore, for the strains that have 50% average accuracy, we are 95% confident that they are between and using (7), with , , , and . The interval is an upper bound. The interval has a quadratic drop-off as the binomial estimates tend towards 0% or 100%.

The accuracy per genome is then averaged and produces a composite “overall” accuracy for the genome strains in our data set. This overall accuracy is computed for each fragment and -mer length. The accuracy of each strain classification can be seen in Figure 6. The strain average accuracies are then averaged together to form an overall average of the 63,500 fragments. The upper bound on the confidence interval for the overall accuracy is %. To calculate this bound, the same parameters for (7) are used except . The accuracy seems to level off for mers for 500 bp and 100 bp fragments while 25 bp fragments do the best with -mer calculations (and probably beyond).

Because Sandberg et al. [31] never ventured past for the -mer size, the result of a jump in performance is never discovered. Again, we believe this is due to the fact that -mer sparsity begins at around because that is when the number of possible combinations surpasses the lengths of the microbial genomes.

4.1.2. Classification to Higher-Level Classes: Examples of Species and Genera

One of the reasons for misclassification of fragments is the sequence overlap between different strains within the same species, and possibly within species belonging to the same genus. In particular, for the case of strains, different strains may be characterized by the loss of genes, addition of genes, or possibly the addition of extrachromosomal genes through the addition of a plasmid. In those cases, there may be only random single base changes in the remainder of their genome, if any. Thus, fragments taken from them using our procedure described above may identically appear in multiple organism sequences. Moreover, if only one base differs, the -mer frequency profiles may be sufficiently similar for the NBC to misclassify the fragment. To study this issue, we consider the performance of fragment identification by pooling the results based on species and genus identity. In doing so, we define genus, species, and strain identifies based on the conventional NCBI taxonomy. For example, Yersinia pestis CO-92 and Yersinia pestis KIM-9 are two strains of the same species; Yersinia pestis and Yersinia pseudotuberculosis are two species of the same genus.

Therefore, the classification with pooling is performed for “species” and “genus” classes instead of individual strain classes. In other words, as long as the genome is classified to a genome within the same species or genera, it is considered correct for that classification. A comparison of the strains, species, and genera classifications can be seen for 500 bp fragments in Figure 7, 100 bp fragments in Figure 8, and 25 bp fragments in Figure 9, respectively. The accuracy for genera is better as expected but follows the general trends for increasing . For genera, the accuracy levels off at 99.8% for 500 bp, 99.3% for 100 bp, and 97.5% 25 bp fragments, respectively, and shows the potential power of the method.

4.2. Comparison Against BLAST Results for 25 bp Fragments for the 635 Genome Data Set

BLAST [9] is expected to do very well for long fragment lengths. In this section, a direct comparison of how the naive Bayes classifier compares to BLAST is shown. It has been reported that BLAST does not yield sufficient results for 25 bp because of ambiguity [6]. It looks for local and global alignments of sequences to score a particular fragment's identity. But there are a slew of parameters controlling the significance of this score, and when a scientist is looking for the closest matching genome to a particular sequence, we will see that in some rare cases, it is incorrect or does not provide an answer. In many cases, it provides too many of the same top scores, yielding ambiguous results. To conduct the comparison, we took all 63 500 fragments (100 fragments per database genome), and BLASTed them against our 635 genome database. The results were compared to the NBC case.

BLAST finds the significance of alignment via an -value, which is the number of highest scoring pairs (HSPs) expected by chance. Therefore, the higher the -value, the lower the significance. In our tests, we desired BLAST to give all tied HSPs despite the score; therefore, we desire an infinite -value. But too many hits were produced by the local BLAST program for an -value above 3000 causing memory errors. This limited us to use an -value of 3000, but because this means that 3000 HSPs may occur by chance, it is a reasonable -value to use in BLAST since it is likely to cause BLAST to produce insignificant scores and hits. More on the -value and BLAST is discussed in [9].

Despite the high -value, 287 or 0.5% of the fragments scored “No Hit” which can be interpreted that all matches in the database were insignificant. One must remember that all fragments BLASTed are from the database, so this is an unexpected result from BLAST. Many of these fragments are only found one time in one genome across the database. Because of this uniqueness, NBC is able to classify the correct genome that produced it, 71% of the time. There is also the issue of multiple top-scoring hits because BLAST only gave 66% of the fragments a unique top-scoring hit and is correct for all of them. Comparably, the naive Bayes classifier classifies 99% of those as well. Out of the multiple top-scoring hits, BLAST completely missed 13 of them, meaning that there are multiple top-hits but the correct one is not in that list. The remaining ones have the correct classification embedded in a list that could range from 2 to 200 top-scoring hits. If one “flips a coin” whenever multiple ambiguous choices occur for a top hit, the correct genome can be guessed 29% of the time overall. The NBC chooses the correct genome 31% of the time out of this set. A comparison between the (a) unique BLAST hits, (b) multiple top-hits, and (c) no hits cases can be seen in Table 2.

To summarize, BLAST is able to find the correct genome (even if ambiguous) in 63200 of the reads but can only resolve 41641 uniquely. With the top hits and flipping a coin for the ambiguous multihits, BLAST would get 47889 (75.4%) correct. The NBC scored 48118 (75.8%) correct which is shown in Figure 6. If is increased, the NBC can potentially get better strain resolution.

The primary issue with BLAST concerning small fragments is that the probability of a unique score becomes lower. Due to NBC's spatial independence, the algorithm can classify correctly 31% of the fragments that are ambiguous in BLAST. The NBC algorithm can be extended to exploit its top multiple scores to obtain better accuracy. With an intelligent examination of the scores, it may be able to get better performance than just predicting the genome with the maximum score. While BLAST gives the same score to multiple organisms, NBC ranks the organisms by score. Surprisingly, NBC never has a tied score for any of the 63 500 fragments. This means that each fragment combination yields a unique probability for the top-ranking genome. This outcome opens up further work in how to exploit the histogram of the genomes' NB prior and posterior probability scores to gain better accuracy. In any case, for 25 bp fragments, it is shown that NBC performs at least as well as BLAST with no augmentations.

4.3. Cross-Validation Performance of NBC Versus BLAST (Using a 9-Species Subset)

In order to fully assess the performance of both methods, we propose to leave some of the data set out for testing. When carefully partitioning the data so that each test set contains a unique subset, this is known as cross-validation and particularly -fold cross-validation for partitions. A major obstacle in conducting cross-validation for our data set is choosing the . We treat each genome as a single strain, training only on full genomes, and do not train on parts of genomes. Thus, for cross-validation, we wish to train on a subset of the example strains in a species and then classify test-strain fragments to the closest training-set species. If strains classify to a strain within their same species, it is marked as correct. As reported before, 66 species contain more than one strain, and many classes contain 2 example strains.

Cross-validation involves partitions. In many cases, the rule of thumb for cross-validation is to use training/test sets [32]. One of the many reasons for is to uniformly train on 90% of the data at a time in order to obtain a better estimate. This poses a difficulty for our sparse data set because only 4 species have 10 or more strains. 9 species have 5 or more example strains, and therefore we determine 5-fold cross-validation to be sufficient for this small data set. The 9 species classes, containing 77 strains, are selected. For each 5-fold cross-validation set, about 62 strains are trained on while about 15 strains are left out (approximately 1/5 of each class).

4.3.1. The NBC Species Cross-Validation Results

In Figure 10, the performance of the classifier using 5-fold cross-validation is shown. Each fragment size can be classified to over 90% accuracy. An interesting note is that while the maximum performance is for 15 mers for 500 bp and 100 bp fragments, 14 mers yield the maximum performance for the 25 bp fragments. The accuracy and standard deviation, respectively, for each fragment size is 97.3 1.0% for 500 bp fragments, for 100 bp fragments, and for 25 bp fragments.

4.3.2. Comparison Against 25 bp BLAST Cross-Validation

A BLAST database is built using each 60-strain training set, and the 15 strains are left out at a time to form a validation set. 25 bp validation fragments are BLASTed, and if the top matches contain only those strains belonging to fragment's species, the results are considered correct. We generate a list of the size of all possible combinations of -mers. On the other hand, if the top-scoring correct species is tied with an incorrect species, the classification is marked as incorrect. Both no hits cases and purely incorrect hits are marked as incorrect as well. 25 bp fragments are scored with an accuracy of , which is comparable performance to NBC's . There is also slightly less variance for NBC showing that it has the potential to be more stable classifier for species classification. As shown in the strain-level BLAST comparison, NBC performs at least as well as BLAST with no augmentations, and this holds true for species-level accuracy using never seen before strains in cross-validation.

4.4. 10 K Reads from the Sargasso Sea Set

The Sargasso Sea data set was published in 2004 [33] by Venter et al. Four geographic sampling sites' microbial cells were shotgun sequenced yielding 1.66 million reads of average length 818 bp. For our analysis, we selected the first 10 000 reads from Sample 1 for analysis which Huson et al. have also analyzed in their MEGAN analysis [6]. In this section, we wish to show how our classifier can be used to analyze this data and compare it to Huson's results which uses BLAST and the NCBI taxonomy database. In metagenomic applications, scientists seek the overall taxonomic content, or the evolutionary relationship of all the microorganisms in the sample. The first step is to identify different strains, or just to identify what phyla/genera an organism is from. In our results, we do an exact strain-matching test on the set (where species/genera can be inferred, such as the example of Yersinia pestis/pseudotuberculosis in Section 4.1.2. We evaluate the 10 K fragments through our classifier for mers and mers to see how different performed for strain recognition to our database and compared it with MEGAN's BLAST-based results.

A comparison of the results can be seen in Table 3. Venter's analysis of the Burkholderia genera in the Sargasso Sea sample 1 is around 38.5%. With the exact same first 10 K reads of sample 1, MEGAN found Burkholderia to be 25.2% of the sample. In our top 10 analysis, we find Burkholderia is 21% for 9 mers and 24.6% for 15 mers). Venter et al. estimated 14.4% for the Shewanella genera in Sample 1. MEGAN specifically finds 17.4% In our top 10 analysis, Shewanella composes 11.4% with 9 mers, and 17.4% with 15 mers.

As explained above, the NBC is able to find the classification rate comparable to BLAST methods of a genera within the top 10 content of the sample for 15-mer analysis. This leads us to a question: do higher -mer models overfit the unknown data? for example, Burkholderia 383 is shown to have a substantially greater percentage in the sample in the 15-mer set (20.4%) over the 9-mer set (6.93%). The same phenomenon occurs with Shewanella ANA-3.

5. Discussion

While the naive Bayes classifier works well on our training data set, is comparable to BLAST, and is able to classify some genomes in an environmental sample, it needs further refinement. For example, in Figure 5, one can see that the mers have consistently poor accuracy for 25 bp fragments, but for mers, the accuracy performs well. Although, one can see that the mer histogram is approaching a binomial distribution, because most strains perform near 100% but some strains never able to resolve and perform poorly near 10%. These fragments should be investigated further.

We compare our work to that of Sandberg et al. [31]. Sandberg used parts of 28 eubacterial and archaeal genomes to train a naive Bayes classifier that would classify segments into 25 species classes. The performance worked quite well and obtained 85% accuracy for more fragment sizes of more than 400 bp, and a promising result is that 35 bp reached 35% accuracy. An unintuitive result in the work of Sandberg et al. was that there seemed to be an upper threshold on how much the -mer (motif in the paper's terminology) size could help in the naive Bayes computation. In our computations, we show that for a large data set, the optimal -mer size increases as the length of the fragment decreases. Also, the mer length needed is larger than what Sandberg et al. needed due to the larger size of our database. On the training data, we show we can achieve 89% strain accuracy and 99.8% genus accuracy for 500 bp fragments. And a great result is that the NBC can resolve training data 25 bp fragments with 76% accuracy for strains and 98% for genera. Training on multistrain species, we show that this method can obtain over 90% for all fragment sizes on unseen strains, and we obtain comparable results to BLAST. In fact, there has been little analysis on the performance of BLAST for general organism recognition, and this paper opens the opportunity for further study of BLAST to metagenomic applications. The results demonstrate great promise for use of this classifier in metagenomic applications.

Our results are comparable to Huson et al.'s work [6] for metagenomic samples, and for comparison, Table 4 lists the top 10 of MEGAN and our method's side-by-side comparison. There are a few surprising differences. While MEGAN finds Candidatus pelagibacter as the second most abundant organism, the NBC finds it as a less common sequence. It has been shown in the literature to be a prolific organism and common in the Sargasso Sea [34]. However, about 20% of the reads that gave Candidatus pelagibacter in MEGAN correspond to Trichodesmium erythraeum in the naive Bayes method. While 20%, 50% (9 mers, 15 mers) of the pelagibacter reads end up being Clostridium beijerinckii. In addition, a surprising difference from MEGAN is that more reads, 533/584 (), are assigned to Trichodesmium erythraeum IMS101. This organism has been found in the Sargasso sea through gene expression studies [35], but MEGAN only shows 3 reads for this organism. The naive Bayes classifier finds this organism consistently in the top 10 organisms present. The NBC could signal some of these organisms that BLAST-like methods do not find, but further analysis should be conducted.

The differences of our Sargasso sea findings from the BLAST findings cause concern, especially since it has been shown that Candidatus pelagibacter is arguably the most abundant prokaryote in the ocean [36]. With further analysis, we find that the NBC gives preference to longer genomes for long fragments and high . Comparing Tables 4 and 3, we can see that pelagibacter is the 2nd most common taxa found from the reads in BLAST, but the NBC does not find it in its top 10. Instead, genomes that have 10–14 million bases show up high on the list. For example, when , there are 1 billion possible words, but all genome sizes are between 320 K and 26 million nucleotides (both sides). With those genome sizes, the mers that exist in them are usually singletons (one occurrence). Therefore, a long genome that is probabilistically more likely to have a mers from a fragment, is more likely to get a “hit” and have a higher score than a small genome. This is especially the case when a fragment is not from a genome in the database. Therefore, the scoring vector needs more intelligence for classifying unknown fragments in order to not penalize smaller genomes.

The analysis of -gram models may yield insight into ways to distribute the probability mass in a more effective manner. Overall, while the accuracy is quite good for fragments existing in our database, the method will need to be improved for unseen species and even genera, and how to assess if the fragment is from an unseen genome.

6. Conclusion

Our approach differs from sequence alignment-based methods because word composition of the sequences is taken into account instead of string matching and alignment. Counting the word-frequencies present in a genome represents global features of the genome as opposed to the local similarities and differences scored by alignment-based methods. More than ever, a method is needed to classify all fragments resulting from high-throughput sequencing technology. It is shown that a global classifier that utilizes -mer frequencies is able to achieve good results (90% for cross-validation species-resolution accuracy) and has great potential to be used in metagenomic applications. In our work, we demonstrate that this approach is viable for any fragment and is scalable to hundreds of genomes. It also performs well for strain and higher-class identifications. It also has the advantage of resolution despite classifying ubiquitous genomic fragments.

In conclusion, global -mer frequency-based profiling based on NBC is a general method for classifying organisms and their genomic content. It can be used for a broad range of applications for analyzing all data from a metagenomic set that will be generated through large-scale projects in ecology, agriculture, and human health. Given that the Human Genome Project is still at an early stage, these new kinds of massive data sets will require innovative informatics approaches for their analysis and translating them into useful knowledge.

Acknowledgment

The authors would like to thank Christopher Pearson for the -mer counting code.