Abstract

A major challenge facing metagenomics is the development of tools for characterizing the functional and taxonomic content of vast numbers of short metagenome reads. The efficacy of clustering methods depends on the number of reads in the dataset, the read length, and the relative abundances of source genomes in the microbial community. In this paper, we formulate an unsupervised naive Bayes multispecies, multidimensional mixture model for reads from a metagenome. We use the proposed model to cluster metagenomic reads by their species of origin and to characterize the abundance of each species. We model the distribution of word counts along a genome as a Gaussian for shorter, frequent words and as a Poisson for longer words that are rare. We employ either a mixture of Gaussians or a mixture of Poissons to model the reads within each bin. Further, we handle the high dimensionality and sparsity associated with the data by grouping the set of words comprising the reads, resulting in a two-way mixture model. Finally, we demonstrate the accuracy and applicability of this method on simulated and real metagenomes. Our method can accurately cluster reads as short as 100 bp and is robust to varying abundances, divergences, and read lengths.

1. Introduction

Metagenomics is defined as the study of the genomic content of microbial communities in their natural environments, bypassing the need for isolation and laboratory cultivation of individual species [1]. Its importance arises from the fact that over 99% of microbial species resist laboratory cultivation [2]. This limitation imposed by the cultivation of isolated clones has severely skewed our view of microbial diversity. Metagenomics promises to enable scientists to study the full diversity of the microbial world, its functions, and its evolution, in natural environments.

Next Generation Sequencing (NGS) technologies generate data more efficiently, more economically, and at a greater depth than ever before. NGS has opened up an array of possibilities for applications including whole-genome sequencing, epigenetics, and metagenomics. Among these, the characterization of the diversity of heterogeneous microbial environments, or metagenomes, has recently gained significant interest. Although a host of methods for whole-genome assembly have been developed, the reconstruction of individual clones from metagenomes remains a challenge. Compared to earlier technologies, reads produced by NGS are typically shorter and more error-prone, and the growth in the size of the datasets is fast outpacing the computational power needed to analyze them. Thus, many computational challenges arise in analyzing deep sequence data from heterogeneous populations [3]. The computational method we present here aims to quantify the microbial diversity within a metagenome from a set of deep sequencing reads.

In single-genome sequencing, we can be certain that all extracted DNA fragments belong to the same genome, which makes sequence assembly and annotation tractable. In the majority of metagenomic samples, however, it is not possible to isolate and culture individual clones. Analysis is further complicated by the fact that the data come from heterogeneous microbial communities, where the number of species as well as their relative abundances are unknown. Sequence data are usually partial and fragmentary, as environmental sequence sampling rarely produces all the sequences required for assembly. Many of these species do not have fully sequenced genomes available. Moreover, metagenomic datasets are beset with increased amounts of polymorphism and horizontal gene transfer. Sequences from closely related species will most likely share homologous regions, hindering their separation [4]. As a result, the reconstruction of a whole genome is generally not possible.

In the light of such data, we need to adapt the traditional approaches to analyzing metagenomic sequences. An additional step in metagenomics that is not required in single-genome assembly is binning, that is, associating each read with its source organism. Clustering methods aim to identify the species present in the sample, classify the reads by their species of origin, and quantify the abundance of each of these species. Clustering provides deeper insight into the structure of the community and can lead to faster and more robust assembly by reducing the search space [5].

Most of the existing classification methods are supervised and depend on the availability of reference data for training [4–9]. A metagenomic dataset may, however, contain reads from unexplored phyla that cannot be labeled into one of the existing classes. Moreover, most metagenomic analysis methods to date have been relatively inaccurate in classifying short reads. Poor performance on short fragments is mostly due to the high dimensionality and sparsity associated with the data [10]. To overcome the limitation imposed by read length, reads are often assembled into longer contigs and then clustered. However, this carries the danger of assembling reads from different species together, thereby creating interspecies chimeras. The presence of highly conserved sequences further occludes cluster boundaries between species. Moreover, the abundances of different species can be so skewed that the within-species variance overwhelms the between-species variance [10].

In this paper, we develop a method for clustering short metagenome reads that addresses the challenges posed by the nature of metagenomic data. We formulate an unsupervised two-way multispecies, multidimensional mixture model to represent reads from a metagenome. We model the distribution of word counts along a genome as a Gaussian for shorter, frequent words and as a Poisson for longer words that are rare. We employ either a mixture of Gaussians or a mixture of Poissons to model the reads within each bin. The proposed model is used to cluster metagenomic reads by their species of origin and to characterize the abundance of each species. Our method is unsupervised in that it does not require any training data. It is a composition-based method that seeks to distinguish between genomes based on their characteristic DNA compositional patterns. Our method handles the high dimensionality and sparsity associated with the data by grouping the set of words comprising the reads, which regularizes the parameters in the mixture model. This implies that, for every group, only one statistic of the words in that group is needed to cluster reads. We show that a high clustering accuracy can be obtained at a much lower dimension. We provide a framework that complements existing similarity-based methods. Later in the paper, we evaluate the applicability of the multidimensional mixture of distributions and its ability to estimate the parameters of genome abundance accurately, for simulated and real metagenomes. We compare the performance of our method with LikelyBin and Scimm, two other unsupervised composition-based methods, and we demonstrate the robustness of our method to changes in the relative abundances of different species.

2. Related Work

The last decade has seen an explosion in the number of computational methods developed to analyze metagenomic data. The literature abounds in methods for classifying (as opposed to clustering) metagenome reads into taxon-specific bins [5, 7, 8]. Current approaches to metagenome clustering fall into two main categories: similarity-based and composition-based methods.

The similarity-based approaches align the reads to close phylogenetic neighbors and hence depend on the availability of closely related genomes in existing databases [6, 11, 12]. MEGAN, a metagenome analysis software system [6], is a representative example of this kind. It uses sequence homology to assign reads to common ancestors based on best match as given by BLAST (Basic Local Alignment Search Tool) [13]. As most of the extant databases are highly biased in their representation of true diversity, such methods fail to find homologs for reads derived from novel species.

A second class of computational methods bin the reads based on DNA composition. These methods rely on the intrinsic features of the reads such as oligonucleotide distributions [7, 8, 14], codon usage preference [15], and GC composition [16] to differentiate between reads belonging to different species. These “genome signatures’’ are known to be fairly constant throughout the genome. A significant limitation of most composition-based methods developed so far is that they do not perform well on reads shorter than 500 bp. Composition-based clustering methods of metagenome reads complement those based on similarity.

PhyloPythia [7] is a supervised composition-based classification method that trains a support vector machine to classify sequences longer than 1 kbp. Phymm uses interpolated Markov models to characterize variable-length DNA sequences by their phylogenetic group [8]; its accuracy of assignment drops drastically (to just 7.1% at the genus level) for short reads and reads from unknown species. Nasser et al. [5] demonstrated that a k-means-based fuzzy classifier, trained using a maximal-order Markov chain, can separate fragments that are about 1 kbp long at the phylum level with high accuracy. Rosen et al. trained a Naive Bayes classifier using publicly available microbial genomes [9]. CompostBin is a semisupervised algorithm for grouping fragments that uses a novel weighted PCA (principal component analysis) and a normalized-cut clustering algorithm to classify the sequences [10]. Its authors demonstrated an error rate bounded by 10%, when guided by information from phylogenetic markers, on datasets of low complexity; however, the accuracy of this method on reads shorter than 1 kbp has not been shown. Recently, Chan et al. developed a semisupervised seeded growing self-organizing map (S-GSOM) [4] to cluster metagenomic sequences. It extracts 8–13 kbp of flanking sequences of highly conserved 16S rRNA from the metagenome and uses them as seeds to assign the remaining reads using composition-based clustering. The caveat with SOMs is that they have been shown to work well only on DNA fragments longer than 8 kbp and lose much accuracy for reads below 1 kbp. All the above supervised methods depend on the availability of reference data for training. A metagenomic dataset may, however, contain reads from unexplored phyla that cannot be labeled into one of the existing classes; the accuracy of these methods on datasets containing reads from unknown species is yet to be demonstrated. LikelyBin is an unsupervised method that clusters metagenomic sequences via a Markov chain Monte Carlo approach [17]. The method was tested on samples that were sufficiently divergent according to derived criteria. Scimm is a recently developed model-based approach to sequence clustering in which interpolated Markov models represent clusters and optimization is performed using a variant of the k-means algorithm [18]. In the results section, we compare the accuracy of our proposed method with LikelyBin and Scimm on datasets of different divergences. AbundanceBin can be used to classify reads from species with different abundance levels [19]; if it is known a priori that the reads differ widely in their abundances, then we recommend using AbundanceBin over other binning methods.

3. Methods

One of the most common genome signatures is the frequency of occurrence of words (or oligomers) in a DNA sequence [20]. In our method, we model each cluster, containing the reads from one species, as a function of the probability distributions of the words comprising them. The inherent basis of this method is that the set of reads sequenced from a species has a characteristic genome signature that distinguishes it from reads belonging to other species. The distribution of word counts along a genome can be approximated as a Gaussian for shorter, frequent words and as a Poisson for longer words that are rare [21]. We propose an unsupervised multidimensional Naive Bayes Poisson mixture model and derive an Expectation Maximization algorithm for it; the corresponding algorithm for the Gaussian mixture model can be derived similarly. Longer words tend to be more discriminatory than shorter ones [22]. However, as the length of the words increases, the dimensionality of the data grows exponentially while the word counts become sparse. To tackle the high dimensionality and sparsity of word counts, we impose a clustering structure on the word counts as well; such a model is called a two-way mixture model. In essence, the proposed method provides a general statistical framework for associating each read with its species of origin, based on its genome signature.

3.1. Need for Multidimensional Word Distribution

A genome signature is a compositional parameter reflecting the relative abundance of different words along the genome. In general, it is similar between closely related species and dissimilar between unrelated species. Some words that are deemed biologically significant are very common in a genome, while others may never be encountered [23]. Composition-based methods use genome signatures to ascertain the origin of DNA reads. The underlying basis is that the distribution of words in a genome is specific to each species and undergoes only slight variations along the genome. By establishing the dictionary of words used by a species and their frequencies of occurrence, one can point out the basic words of the genome [24].

The literature abounds in methods that study the statistical distribution of word locations and word frequencies along a sequence [21, 25]. The exact distribution of word counts is known under the hypothesis that the letters are independent (Bernoulli) or under a Markov model. In practice, however, it is extremely time consuming to compute the exact distribution for long sequences or for frequent words. Hence, two kinds of approximations exist: the distribution of word counts along a genome can be approximately modeled as a Gaussian for short words (which are more frequent) or as a Poisson for longer words (which are rare) [21].

A metagenomic dataset consists of reads from different species. The reads sampled along the genome of a species will reflect its genome signature. As different words occur with different frequencies along the genome, each word follows its own distribution. Thus, the reads belonging to a species can be modeled by a multidimensional distribution over the words comprising them (one dimension for each word). Figure 1 illustrates the distribution of dimer and pentamer counts across reads sampled from the genome of Haemophilus influenzae (for clarity, only a few distributions are shown). We see that the count of each dimer (a short word) across the reads tends to a Gaussian distribution with its own mean and standard deviation, while that of a pentamer tends to a Poisson distribution. Hence, the problem of clustering metagenomic reads can be cast as fitting a multidimensional mixture of Gaussians (or of Poissons for longer words), where the distribution of each word is modeled as a Gaussian (or Poisson). In other words, this corresponds to a multidimensional Naive Bayes model in which each dimension is modeled as a unimodal Gaussian (or Poisson) distribution. Such a general statistical model takes into account the compositional heterogeneity of words along the genome.

3.2. Multispecies Multidimensional Mixture of Distributions

In this paper, we formulate an unsupervised multidimensional Poisson mixture model for clustering the reads within a metagenome by their species of origin. We model the reads from a species as a multidimensional distribution over the words comprising them; therefore, each cluster is represented by the distribution of word counts within the species. The multidimensional model for Gaussian mixtures can be derived analogously. We present results for both models.

Mixture models cover the data well, that is, dominant patterns in the data are captured by the component distributions. They allow better approximations of the true distributions, and their parameters are relatively easy to estimate [26]. An additional advantage of using generative models is that they are flexible and can handle a large number of classes. For instance, a mixture of Poissons can be multimodal, while a Poisson distribution is always unimodal.

We begin with a metagenome, $\mathbf{X} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$, containing $N$ reads from $M$ species. Let $\alpha_m$ be the proportion of species $m$ in the dataset, with $\sum_{m=1}^{M} \alpha_m = 1$. We assume that $\mathbf{X}$ is observed and is governed by some density function $p(\mathbf{X} \mid \Theta)$ with parameter $\Theta$. Our goal is to cluster the reads by their species of origin, based on the frequency of words that appear in the reads. For every species $m$, we want to determine $\alpha_m$, its proportion in the dataset, and $\Theta$, the parameter governing the distribution of words within the reads. Let $\mathbf{Y} = \{y_1, y_2, \ldots, y_N\}$ be the cluster labels, where $y_i = m$ for $m \in \{1, \ldots, M\}$ if the $i$th read belongs to the $m$th species, and $p(y_i = m) = \alpha_m$. The cluster labels $\mathbf{Y}$ are unknown. We call $(\mathbf{X}, \mathbf{Y})$ the complete dataset.

For a word of length $l$, we obtain $p = 4^l$ different words (combinations of A, C, T, G), denoted by $W = \{w_1, w_2, \ldots, w_p\}$. Each read $\mathbf{x}_i$ is represented by a $p$-dimensional feature vector, $\mathbf{x}_i = \{x_{i1}, x_{i2}, \ldots, x_{ip}\}$, where $x_{ij}$ is the count of word $w_j$ in read $\mathbf{x}_i$. We model the distribution of words within every species $m$ by a multidimensional Poisson distribution with parameters $\boldsymbol{\lambda}_m = \{\lambda_{m1}, \lambda_{m2}, \ldots, \lambda_{mp}\}$. That is, given that read $\mathbf{x}_i$ belongs to species $m$, the distribution of each word $w_j$ is Poisson with parameter $\lambda_{mj}$, where $m = 1, 2, \ldots, M$ and $j = 1, 2, \ldots, p$:
\[
p(x_{ij} \mid \lambda_{mj}) = \phi(x_{ij} \mid \lambda_{mj}) = \frac{e^{-\lambda_{mj}} \lambda_{mj}^{x_{ij}}}{x_{ij}!}. \tag{1}
\]
We assume independence between the features of the read vector. The probability of a read $\mathbf{x}_i$, given that it belongs to species $m$, is
\[
p(\mathbf{x}_i \mid y_i = m, \Theta) = p(\mathbf{x}_i \mid \boldsymbol{\lambda}_m) = \prod_{j=1}^{p} \phi(x_{ij} \mid \lambda_{mj}). \tag{2}
\]
At first glance, it might seem imprudent to represent a read as a collection of the words comprising it, because this loses information about the sequencing read. Strictly speaking, even if the sequence of bases in a DNA is independently and identically distributed, the distributions of word occurrences are not independent, due to overlaps [21]. Bayesian networks (belief networks) can represent the conditional dependencies between the words comprising the reads [27], but, in practice, methods for exact inference in Bayesian networks are often computationally expensive. An attractive alternative to Bayesian networks is the Naive Bayes model, which assumes independence between the different features of the read. This assumption makes the otherwise complicated problem tractable. Naive Bayes is known to perform well on complex models and takes time that is linear in the number of components. In addition, lost information can be restored at later stages. In this paper, we formulate the mixture models under the assumption that the different features (word counts) of a read are independent of each other. We outline the Expectation Maximization (EM) algorithm below.
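To make equations (1) and (2) concrete, the following minimal sketch (plain Python with NumPy and SciPy; all function names are illustrative, not taken from our implementation) builds the word-count feature vector of a read and evaluates its Naive Bayes Poisson log likelihood:

```python
from itertools import product
import numpy as np
from scipy.special import gammaln

def word_index(l):
    """Map each of the p = 4**l words over {A, C, G, T} to a dimension j."""
    return {''.join(w): j for j, w in enumerate(product('ACGT', repeat=l))}

def count_words(read, l, index):
    """Feature vector x_i: x[j] = count of (overlapping) occurrences of w_j."""
    x = np.zeros(len(index))
    for s in range(len(read) - l + 1):
        w = read[s:s + l]
        if w in index:                # skips windows with ambiguous bases (N)
            x[index[w]] += 1
    return x

def log_likelihood(x, lam):
    """log p(x_i | lambda_m) from equation (2): a sum of Poisson log-pmfs."""
    return np.sum(x * np.log(lam) - lam - gammaln(x + 1))
```

Note that in the E-step below only differences of these log likelihoods across species matter, so the $\log x_{ij}!$ term, which does not depend on $m$, can be dropped.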

3.3. Parameter Estimation

To initialize the estimation algorithm, we randomly assign each read to a cluster $m$. The posterior probability $q_{i,m}$ is set to 1 if read $i$ is assigned to cluster $m$ and to 0 otherwise. With these initial posterior probabilities, a maximization step (M-step) is performed to obtain the initial parameters. The EM iterations then proceed as follows.

Expectation Step
We estimate the posterior probability $q_{i,m}$ of read $\mathbf{x}_i$ belonging to species $m$. By Bayes' theorem, we have
\[
q_{i,m} = p(y_i = m \mid \mathbf{x}_i, \Theta) = \frac{\alpha_m \, p(\mathbf{x}_i \mid \boldsymbol{\lambda}_m)}{\sum_{k=1}^{M} \alpha_k \, p(\mathbf{x}_i \mid \boldsymbol{\lambda}_k)}, \qquad
q_{i,m} \propto \alpha_m \prod_{j=1}^{p} \phi(x_{ij} \mid \lambda_{mj}), \quad \text{subject to } \sum_{m=1}^{M} q_{i,m} = 1. \tag{3}
\]

Maximization Step
The M-step uses $q_{i,m}$ to compute the expectation of the complete-data log likelihood,
\[
Q(\Theta^{(t+1)}, \Theta^{(t)}) = E_{p(\mathbf{Y} \mid \mathbf{X}, \Theta^{(t)})} \left[ \log p(\mathbf{X}, \mathbf{Y} \mid \Theta^{(t+1)}) \right]
= \sum_{m=1}^{M} \sum_{i=1}^{N} p(y_i = m \mid \mathbf{x}_i, \Theta^{(t)}) \log p(\mathbf{x}_i, y_i = m \mid \Theta^{(t+1)})
= \sum_{m=1}^{M} \sum_{i=1}^{N} q_{i,m} \log \left( \alpha_m \, p(\mathbf{x}_i \mid \boldsymbol{\lambda}_m) \right). \tag{4}
\]
We also take into account the constraint that the $\alpha_m$'s sum to 1 by adding a Lagrange multiplier:
\[
Q(\Theta^{(t+1)}, \Theta^{(t)}) = \sum_{m=1}^{M} \sum_{i=1}^{N} q_{i,m} \log \left( \alpha_m \, p(\mathbf{x}_i \mid \boldsymbol{\lambda}_m) \right) + \beta \left( \sum_{m=1}^{M} \alpha_m - 1 \right). \tag{5}
\]
We maximize the above expression with respect to the parameters, $\Theta^{(t+1)} = \arg\max_{\Theta} Q(\Theta^{(t+1)}, \Theta^{(t)})$, and obtain the updates
\[
\alpha_m^{(t+1)} = \frac{\sum_{i=1}^{N} q_{i,m}}{N}, \qquad
\lambda_{mj}^{(t+1)} = \frac{\sum_{i=1}^{N} q_{i,m} x_{ij}}{\sum_{i=1}^{N} q_{i,m}}. \tag{6}
\]
Finally, these two steps are repeated as necessary. Each iteration is guaranteed not to decrease the log likelihood, and the algorithm converges to a local maximum of the likelihood function.
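For illustration, here is a minimal NumPy sketch of iterations (3)–(6); X is the $N \times p$ word-count matrix, M the number of species, and the hard random initialization mirrors the one described above (names and defaults are ours, not a reference implementation):

```python
import numpy as np
from scipy.special import logsumexp

def em_poisson_mixture(X, M, n_iter=100, seed=0, eps=1e-10):
    """EM for the multidimensional Naive Bayes Poisson mixture.
    X: (N, p) word-count matrix; M: number of species (clusters)."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    # Random hard assignment initializes the posteriors q (Section 3.3).
    q = np.zeros((N, M))
    q[np.arange(N), rng.integers(M, size=N)] = 1.0
    for _ in range(n_iter):
        # M-step, equation (6): mixing proportions and Poisson means.
        alpha = q.sum(axis=0) / N                        # shape (M,)
        lam = (q.T @ X) / (q.sum(axis=0)[:, None] + eps) + eps  # (M, p)
        # E-step, equation (3), in log space; the log x_ij! term is
        # constant across m and has been dropped.
        log_like = X @ np.log(lam).T - lam.sum(axis=1)   # shape (N, M)
        log_post = np.log(alpha + eps) + log_like
        q = np.exp(log_post - logsumexp(log_post, axis=1, keepdims=True))
    return alpha, lam, q
```

Posteriors are computed in log space and normalized with logsumexp to avoid numerical underflow on longer reads.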

3.4. Word Grouping

Higher-order words are known to be more discriminative than shorter ones [22]. With the increase in word length, there are two major consequences that need to be addressed. First, the distribution of word counts tends to a Poisson rather than a Gaussian (by the law of rare events); see Figure 1. Second, the length of the read vector grows exponentially (e.g., for $l = 10$, $4^l \approx 10^6$). As the number of dimensions increases, many words will tend to have similar distributions and hence can be clustered together into a “word group.” At the same time, the number of distinct words in any read is usually substantially smaller than the number of dimensions; that is, the feature matrix becomes high dimensional and sparse. Hence, the model may fail to predict the true feature distribution of the different components. Therefore, dimension reduction becomes necessary before estimating the components of the model. However, the reduction of the number of words by feature selection cannot be too aggressive, or the clustering accuracy will suffer.

In this paper, we handle the above challenge by “word grouping.” A supervised two-way Poisson mixture model with word grouping was originally proposed by Li and Zha for simultaneous document classification and word clustering [28]. Such two-way clustering involves the simultaneous clustering of reads and of words. The cluster means are regularized by dividing the words into groups and constraining the parameters for the words within the same group to be identical. The grouping of the words is not predetermined but optimized as part of the model estimation. This implies that, for every group, only one statistic of all the words in this group is needed to cluster reads. For instance, in Figure 1, we observe that the distributions of the pentamers fall into three distinct groups. Therefore, words following similar distributions can be clustered together into a “word group.”

We extend our formulation to an equivalent two-way unsupervised Poisson mixture model in order to simultaneously cluster word features and classify reads, and we derive an Expectation Maximization algorithm to estimate its parameters. Figure 2 depicts the paradigm of the two-way mixture model of reads. Note that we use “cluster” to refer to the binning of reads belonging to the same species and “group” to refer to the binning of words within the reads of a cluster.

Recall that the genome signature is similar between closely related species and dissimilar between unrelated species. The parameter constraint implies that words in the same group have the same distribution within each cluster. Therefore, we can assume that, within each cluster, words in different reads have equal Poisson parameters, while, for reads in different clusters, words may follow different Poisson distributions. For simplicity, we assume that all clusters have the same number of word groups; it is trivial to extend to the case where different clusters have different numbers of word groups [29].

Let $l \in \{1, \ldots, L\}$ index the word groups. We define a group assignment function $c(m, j) \in \{1, 2, \ldots, L\}$, which denotes the group to which word $w_j$ belongs in class $m$. Words in the same word group have identical parameters, that is, $\lambda_{mk} = \lambda_{mj} = \theta_{m,l}$ if $c(m, k) = c(m, j) = l$. The group assignments of the words vary from cluster to cluster. Let the number of words in group $l$ of class $m$ be $\eta_{ml}$. The likelihood of $\mathbf{x}_i$ is now
\[
p(\mathbf{x}_i \mid \boldsymbol{\lambda}_m) = \prod_{j=1}^{p} p(x_{ij} \mid \lambda_{mj}) = \prod_{j=1}^{p} p(x_{ij} \mid \theta_{m, c(m,j)}). \tag{7}
\]
Now, we can perform clustering using no more than $ML$ dimensions. Word grouping leads to dimension reduction in this precise sense.

We can derive an EM algorithm similar to the one outlined above to estimate the Poisson parameters $\theta_{m,l}$, where $m \in \{1, \ldots, M\}$ and $l \in \{1, \ldots, L\}$; the group assignment function $c(m, j) \in \{1, \ldots, L\}$, where $m \in \{1, \ldots, M\}$ and $j \in \{1, \ldots, p\}$; and the prior mixture components $\alpha_m$, for $m \in \{1, \ldots, M\}$. We initialize by setting each value of the group assignment function $c(m, j)$ randomly to a number in $\{1, \ldots, L\}$. We start with the same word-group partition for all clusters; that is, the $c(m, j)$'s are initially identical over $m$. We update the parameters as follows:
\[
\alpha_m^{(t+1)} = \frac{\sum_{i=1}^{N} q_{i,m}}{N}, \qquad
\theta_{m,l}^{(t+1)} = \frac{\sum_{i=1}^{N} q_{i,m} \sum_{j : c(m,j) = l} x_{ij}}{\eta_{ml} \sum_{i=1}^{N} q_{i,m}}. \tag{8}
\]
Once $\theta_{m,l}^{(t+1)}$ is fixed, the word group index $c^{(t+1)}(m, j)$ can be found by a linear search over all components:
\[
c^{(t+1)}(m, j) = \arg\max_{l} \sum_{i=1}^{N} q_{i,m} \left( x_{ij} \log \theta_{m,l}^{(t+1)} - \theta_{m,l}^{(t+1)} \right). \tag{9}
\]
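A sketch of updates (8) and (9) under the same conventions as the earlier sketch, where c is an $M \times p$ integer array holding the group assignments $c(m, j)$ (the E-step is unchanged except that each $\lambda_{mj}$ is read off as $\theta_{m, c(m,j)}$):

```python
import numpy as np

def two_way_m_step(X, q, c, L, eps=1e-10):
    """Equations (8)-(9): tied Poisson parameters theta[m, l] and the
    re-optimized word-group assignments c[m, j]."""
    N, p = X.shape
    M = q.shape[1]
    alpha = q.sum(axis=0) / N
    s = q.T @ X                               # s[m, j] = sum_i q[i,m] x[i,j]
    theta = np.zeros((M, L))
    for m in range(M):
        for l in range(L):
            members = (c[m] == l)
            eta = max(members.sum(), 1)       # eta_{ml}: words in group l
            theta[m, l] = s[m, members].sum() / (eta * q[:, m].sum()) + eps
        # Equation (9): linear search over groups for every word j.
        score = np.outer(s[m], np.log(theta[m])) \
                - q[:, m].sum() * theta[m][None, :]    # shape (p, L)
        c[m] = score.argmax(axis=1)
    return alpha, theta, c
```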

3.5. Naive Bayes Mixture of Multinomials

Theorem 1. If $(X_1, X_2, \ldots, X_p)$ are independent Poisson variables with parameters $\lambda_1, \lambda_2, \ldots, \lambda_p$, respectively, then the conditional distribution of $(X_1, X_2, \ldots, X_p)$ given that $X_1 + X_2 + \cdots + X_p = n$ is multinomial, $\mathrm{Mult}(n, \pi)$, with $\pi = (\lambda_1/\lambda, \lambda_2/\lambda, \ldots, \lambda_p/\lambda)$, where $\lambda = \sum_j \lambda_j$ [30].

The above theorem implies that the unconditional distribution of $(X_1, X_2, \ldots, X_p)$ can be factored into a product of two distributions: a Poisson for the overall total $n$ and a multinomial for $X$ given the total, $X \sim \mathrm{Mult}(n, \pi)$. Therefore, likelihood-based inferences about $\pi$ are the same whether we regard $X_1, X_2, \ldots, X_p$ as sampled from $p$ independent Poissons or from a single multinomial. Here, $n$ refers to the length of the read, and our interest lies in the proportions of words in the reads. Any estimates, tests, or inferences about the proportions will be the same whether we regard $n$ as random or fixed.
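As a sketch of the reasoning behind this factorization, with $n = \sum_j x_j$ and $\lambda = \sum_j \lambda_j$:

```latex
\begin{align*}
P(X_1 = x_1, \dots, X_p = x_p)
  &= \prod_{j=1}^{p} \frac{e^{-\lambda_j}\lambda_j^{x_j}}{x_j!}
   = e^{-\lambda}\,\frac{\lambda^{n}}{n!} \cdot
     \frac{n!}{\prod_{j} x_j!}\prod_{j=1}^{p}
     \left(\frac{\lambda_j}{\lambda}\right)^{x_j} \\
  &= \underbrace{\mathrm{Poisson}(n \mid \lambda)}_{\text{total word count}}
     \,\times\,
     \underbrace{\mathrm{Mult}(x_1,\dots,x_p \mid n, \pi)}_{\text{word proportions}},
  \qquad \pi_j = \lambda_j/\lambda .
\end{align*}
```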

We can now derive the Naive Bayes mixture of multinomials as a standardized mixture of Poissons. We assume that the distribution of words within the reads of a species is governed by the parameters of a multinomial distribution $\Theta = (\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \ldots, \boldsymbol{\theta}_M)$, where $\boldsymbol{\theta}_m = (\theta_{m1}, \theta_{m2}, \ldots, \theta_{mp})$ is the parameter for species $m$. The likelihood of the data is then
\[
P(\mathbf{x}_i \mid y_i = m) = P(\mathbf{x}_i \mid \boldsymbol{\theta}_m) = \frac{n_i!}{\prod_{j=1}^{p} x_{ij}!} \prod_{j=1}^{p} \theta_{mj}^{x_{ij}}, \tag{10}
\]
where the probabilities satisfy the constraint $\sum_{j=1}^{p} \theta_{mj} = 1$. The EM algorithm for the Naive Bayes mixture of multinomials can be derived similarly, and we give only the final set of equations:
\[
\alpha_m = \frac{\sum_{i=1}^{N} q_{i,m}}{N}, \qquad
\theta_{mj} = \frac{\sum_{i=1}^{N} q_{i,m} x_{ij}}{\sum_{i=1}^{N} \sum_{j=1}^{p} q_{i,m} x_{ij}} = \frac{\sum_{i=1}^{N} q_{i,m} x_{ij}}{\sum_{i=1}^{N} q_{i,m} n_i}. \tag{11}
\]
If we assume the length of each read to be a constant $n$, we obtain the same results as with the Poisson distribution; hence, the two distributions are equivalent in modeling the distribution of words within the reads of a species. Also, since the multinomial is a single joint distribution over all words, we do not perform two-way dimension reduction on it.
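A minimal sketch of update (11); compared with the Poisson update (6), the only change is the normalization, which is by the total posterior-weighted word count (i.e., by read lengths) rather than by the posterior mass alone:

```python
import numpy as np

def multinomial_m_step(X, q, eps=1e-10):
    """Equation (11): mixing proportions and multinomial parameters.
    X: (N, p) word counts; q: (N, M) posteriors; n_i = X[i].sum()."""
    N = X.shape[0]
    alpha = q.sum(axis=0) / N
    s = q.T @ X                            # s[m, j] = sum_i q[i,m] x[i,j]
    theta = (s + eps) / (s + eps).sum(axis=1, keepdims=True)
    return alpha, theta                    # rows of theta sum to 1
```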

4. Results

4.1. Simulated Metagenomes

The algorithm has been implemented in Matlab and C. The space and time complexity scale linearly with the number of reads and species. Space complexity scales quadratically with the number of dimensions in the search space. Our method converged for all the cases we tested and was robust to the choice of initial conditions.

Metagenomics, being a relatively new field, lacks standard datasets for testing clustering algorithms [31]. As the “true solution” for sequence data generated by most metagenomic studies is still unknown, we focused on synthetic datasets for benchmarking; we also apply our method to the real Acid Mine Drainage dataset to identify the dominant species. To test the accuracy of our proposed method, we used MetaSim to simulate synthetic metagenomes [6]. MetaSim takes as input the sequencing technology to be used (Sanger, 454, Exact), a set of known genomes, the length of the reads, and an abundance/coverage profile that determines the relative abundance of each genome in the simulated dataset. The genomes used for generating the synthetic metagenomes were downloaded from the National Center for Biotechnology Information (NCBI). We generated datasets with reads of lengths between 50 and 1000 bp and various abundance ratios. In the first part of this section, we demonstrate the performance of the multidimensional Gaussian mixture model on several datasets; a default word length of 2 is used, and, as the number of dimensions is relatively small, we do not perform word grouping. Next, we describe the results using the two-way Poisson mixture model with a word length of 5. The method has been implemented for word lengths from 2 to 9. To calculate the clustering accuracy, we assign each cluster to the source species that is most frequent in the cluster; accuracy is the percentage of correct read assignments, as sketched below.
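The accuracy computation can be sketched as follows (illustrative names; pred holds the inferred cluster of each read, truth the integer-coded source species of each simulated read):

```python
import numpy as np

def clustering_accuracy(pred, truth):
    """Assign each cluster to its most frequent source species, then
    report the percentage of correctly assigned reads."""
    correct = 0
    for k in np.unique(pred):
        labels = truth[pred == k]
        correct += np.bincount(labels).max()   # majority species in cluster k
    return 100.0 * correct / len(truth)
```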

The number of species in each dataset is supplied as an input. Determining the number of clusters from a statistical perspective is a difficult problem and has been addressed in [32]. Previously, 16S/18S rDNA has been used for phylotyping and for assessing species diversity using a rarefaction curve [33]. Tools such as MetaPhyler and TreePhyler can be used to make an educated guess of the number of species [34, 35]. Estimating species diversity is still an active area of research, and we do not address it in this paper.

Experiments in the 1960s and 1970s have shown that the dinucleotide relative abundance of a genome is a remarkably stable property [36, 37]. Closely related organisms display more similar dinucleotide compositions than do distant organisms [20]. In [38], the authors proposed a measure of intergenomic difference between two sequences $f$ and $g$, called the average dinucleotide relative abundance difference,
\[
\delta(f, g) = \frac{1}{16} \sum_{X,Y} \left| \rho_{XY}(f) - \rho_{XY}(g) \right|, \tag{12}
\]
where $\rho_{XY}(f) = f_{XY} / (f_X f_Y)$ and $f_X$ denotes the frequency of nucleotide $X$ in $f$. A measure of intergenomic difference is thus obtained by comparing genome signatures. To assess the robustness of our method, we test it across datasets representative of $\delta$ values ranging from 34 to 340. In general, lower $\delta$ values correspond to “closely related species” and higher values correspond to “distant species.”
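A sketch of how such δ values can be computed from two sequences, assuming (as is conventional for reporting Karlin's δ* values, and consistent with the 34 to 340 range above) a scaling factor of 1000 and sequences in which all four bases occur:

```python
import numpy as np

def rho(seq):
    """rho_XY = f_XY / (f_X * f_Y) for all 16 dinucleotides XY."""
    bases = 'ACGT'
    n = len(seq)
    f1 = np.array([seq.count(b) / n for b in bases])
    f2 = np.array([[sum(seq[i:i + 2] == x + y for i in range(n - 1)) / (n - 1)
                    for y in bases] for x in bases])
    return f2 / np.outer(f1, f1)

def delta(f, g):
    """Average dinucleotide relative abundance difference, equation (12),
    scaled by 1000 (assumed convention for delta* values)."""
    return 1000.0 / 16.0 * np.abs(rho(f) - rho(g)).sum()
```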

In Figure 3, we plot the performance of our proposed multidimensional Poisson model over 450 datasets with $\delta$ values ranging from 34 to 340. We observed a positive correlation between the intergenomic difference and the accuracy of our method, as also noted in [17]. The initial increase in accuracy with word length is explained by the increased discriminative power of higher-order words. However, any further increase in word length must be accompanied by dimension reduction; otherwise, owing to the high-dimensional and sparse nature of the feature matrix, the accuracy begins to drop.

In Figure 4, we compare the accuracy of our proposed multidimensional Gaussian model with two other unsupervised composition-based methods, LikelyBin [17] and Scimm [18], on several datasets. Default parameters are used for these algorithms. We varied the read length between 200 and 500 bp, the $\delta$ values from 60 to 300, and the abundance ratio up to 1 : 5. Note that the distribution of dimer counts tends to a Gaussian, and, as the number of dimensions is relatively small ($4^2 = 16$), the algorithm performs well without word grouping. Our method clearly outperforms LikelyBin and performs as well as or better than Scimm in most instances. Another point worth noting from the figure is that our method's error rate is bounded by 10% for datasets with read lengths as short as 200 bp.

We analyzed the accuracy and applicability of our method in binning reads from low-complexity communities containing 3–5 species (see Table 1). With the increase in the number of species, there was a slight degradation in performance, though the accuracy was consistently above 85%. This is in agreement with the results from the two-species datasets, considering that the total coverage of each species is much lower in a multispecies dataset (reads from B. burgdorferi form only 6% of the fifth dataset).

Next, we evaluated the robustness of our method to changes in the abundance ratio between species as well as in the length of the reads. We simulated three sets of metagenomes, each with two species at different abundance ratios, varying the abundance ratio in stages from 10 : 1 to 1 : 10. From Figure 5, we note that there was only a slight drop in performance at extreme abundance ratios; therefore, the proposed method is suited for binning relatively rare species as well. It is worth pointing out that the abundance estimates are accurate at all ratios. To test the usefulness of the method for analyzing data produced by current NGS technologies (especially Solexa and SOLiD) that generate short reads, we tested three datasets of varying $\delta$ values for read lengths between 50 and 1000 bp. As the read length decreases from 1000 to 50 bp, the drop in accuracy of our method is bounded by 15%.

Recall that with the increase in the length of the words and the simultaneous increase in the number of dimensions, the distribution of word counts tends to a Poisson and word grouping becomes necessary. In this section, we present the clustering results obtained by estimating the two-way Poisson mixture model with different numbers of word groups $L$. We observed the variation in classification accuracy to be more prominent for lower values of $L$; therefore, in Table 2, we report the results for values of $L < 50$ on a two-species dataset. If word grouping is not performed, then clustering based on the mixture model is essentially the Naive Bayes algorithm with each dimension modeled by a Poisson distribution (last column of Table 2). From the results, we can infer that word grouping resulted in a considerable increase in accuracy compared to the Naive Bayes algorithm, with characteristic vectors of much lower dimension ($L \ll p$). A high clustering accuracy can be achieved using no more than $ML$ dimensions, significantly smaller than the original dimension of 1024. Note that it is difficult to know a priori the exact value of $L$ that yields the best clustering; however, among the values we tested, lower values of $L$ provided higher accuracy.

In Table 3, we compare the performance of our two-way Poisson mixture model with the Gaussian mixture model on datasets with low $\delta$ values. In real situations, it is difficult to know beforehand the most discriminating word length to use. However, from our experiments, we can infer that models based on higher-order words, in general, tend to be more discriminatory than those based on lower-order words. If it is known a priori that lower-order words (of length 2-3) are more discriminatory for the dataset, then we recommend using a Gaussian mixture model; for other datasets, we use a Poisson mixture model.

Our method's accuracy in classifying reads from datasets composed of species across various taxonomic ranks is reported in Table 4; here we used the Poisson mixture model without word grouping. The error rates are bounded by 10% on all datasets. We can infer that the accuracy is mostly correlated with the phylogenetic distances between the species; for example, reads from datasets containing species with taxonomic differences at the level of class were classified with very high accuracy.

4.2. Real Metagenome: Acid Mine Drainage (AMD) Dataset

The ultimate goal of binning methods is to cluster the reads in a real metagenome by their species of origin. Clustering in real situations is error-prone and affects the final estimates of species abundance. Moreover, evaluating clustering methods on real metagenomes can be problematic, as the true taxonomic composition of the data is mostly unknown. The accuracy of unsupervised clustering methods decreases with increasing metagenome complexity and for species present at very low abundances. However, the composition of the Acid Mine Drainage metagenome has been substantially characterized, and we used this dataset to evaluate the performance of our proposed method [39]. The AMD microbial community is reported to consist of two dominant populations (Ferroplasma sp. Type II and Leptospirillum sp. Group II) and three less abundant ones (Ferroplasma acidarmanus Type I, Leptospirillum sp. Group III, and Thermoplasmatales archaeon GpI). We downloaded the reads, as well as the scaffolds assembled from the reads, for the 5 species of the actual AMD dataset from NCBI. Only 58% of the AMD reads can be mapped back to the assembled scaffolds using BLAST [19]; therefore, in order to compute the accuracy of our method, we simulated a metagenome with reads sampled from the downloaded scaffolds. The simulated AMD dataset consisted of 110,000 reads of average length 732 bp (the average read length in the actual AMD dataset) from the 5 species, in the ratio 4 : 4 : 1 : 1 : 1. We characterized the dataset in two stages. Notice that the dataset contains reads at two distinct abundance levels; therefore, we can simplify the problem by first separating the reads into two bins based on their abundance. In the first stage, the reads were grouped into two bins using AbundanceBin, with a resulting accuracy of 93.3%; the bins corresponding to the abundance levels of 4 and 1 had cluster purities of 93.2% and 98.2%, respectively. In the second stage, we used the reads from each of the bins output by AbundanceBin as input to our proposed two-way Poisson mixture model, to further classify the reads by their species of origin, using a word length of 5. Our method clustered the reads from the bin containing the dominant species into two clusters corresponding to Ferroplasma sp. Type II and Leptospirillum sp. Group II, with an accuracy of 96.88% (with $L = 10$). The other bin consisted of very few reads from the remaining three species, Ferroplasma acidarmanus Type I, Leptospirillum sp. Group III, and Thermoplasmatales archaeon GpI; our method clustered the reads from this bin into three clusters with an accuracy of 70.34% (with $L = 10$). This decrease in accuracy can be attributed to the small number of reads in this bin.

5. Discussion

In this paper, we formulated an unsupervised two-way multispecies, multidimensional mixture model to represent reads from a metagenome. We used the proposed model to cluster metagenomic reads by their species of origin and to characterize the abundance of each species. The distribution of word counts along a genome can be approximated as a Gaussian for shorter, frequent words and as a Poisson for longer words that are rare; therefore, we use a multidimensional mixture of Gaussians or Poissons to model the reads in each bin. An additional reason to use these distributions is their flexibility, stability, and ease of parameter estimation. Our method is unsupervised and does not require any training data. This is critical, as most metagenomic datasets contain reads from unexplored phyla that cannot be labeled into one of the existing classes. Our probabilistic approach can also be used to identify reads that appear to belong to more than one species and occlude the cluster boundaries; such reads should be investigated further for the presence of conserved regions.

Note that our proposed method is primarily a composition-based method that seeks to distinguish between genomes based on their characteristic DNA compositional patterns. Therefore, it cannot distinguish between genomes unless their DNA compositions are sufficiently divergent (see Figure 4, dataset with B. burgdorferi and C. jejuni). It is unlikely that our method will be able to accurately distinguish between strains of the same species; for such datasets, the genome signature alone is insufficient for inferring taxonomic relationships reliably. Composition-based methods must be used in conjunction with similarity-based and abundance-based methods to yield better performance.

Note that the two-way Poisson mixture model was originally proposed for classification of documents. In this work, we demonstrate the relevance and applicability of such a general statistical framework for modeling metagenome reads. We have illustrated that the proposed method can accurately classify reads from low to medium complexity datasets into taxon-specific bins, based on genome signatures.

Our framework complements the existing similarity-based and abundance-based methods and hence can be combined with such methods to obtain a better performance. We intend to develop such hybrid methods in the future that can tackle the problem of classifying sequences in complex metagenomic communities.

Acknowledgment

The authors wish to thank Jia Li for her useful insights into the two-way Poisson mixture model problem and the reviewers for their valuable comments and suggestions.