Research Article  Open Access
2Way kMeans as a Model for Microbiome Samples
Abstract
Motivation. Microbiome sequencing allows defining clusters of samples with shared composition. However, this paradigm poorly accounts for samples whose composition is a mixture of clustercharacterizing ones and which therefore lie in between them in the cluster space. This paper addresses unsupervised learning of 2way clusters. It defines a mixture model that allows 2way cluster assignment and describes a variant of generalized kmeans for learning such a model. We demonstrate applicability to microbial 16S rDNA sequencing data from the Human Vaginal Microbiome Project.
1. Introduction
Microbiome analysis [1] by sequencing of ubiquitous genes, most commonly 16S rRNA, is a standard, costeffective way to characterize the composition of a microbial sample. Standard analysis tools facilitate quantifying the fraction of sequence reads from each bacterial species in a sample [2]. Interpretation of composition vectors across a collection of samples typically relies on dimensionality reduction followed by clustering in the lowerdimensionality space [3]. This allows identification of functionally meaningful subsets of samples with characteristic microbiota. The Human Microbiome Project [4] and its derivatives such as the Human Vaginal Microbiome Project [5] have collected and thus analyzed large numbers of samples towards elucidating the structure and composition of microbiota across physiological and pathological states.
Similar to variation in microbial genomes across different human individuals, variants along the nuclear genomes have been summarized by a small number of dimensions [6]. However, in contrast to analyses of microbiome samples, analyses of inherited genetic variation standardly assume and observe samples to be spread across a continuum in the reduced space, rather than be clustered [7]. Samples in between clusters are interpreted as originating from intermediate locales along a geographic cline [8] or as representing different levels of a mixture between clusterspecific populations.
In this paper, we formally tackle the problem of clustering while allowing elements to belong to two clusters. Specifically, we will describe in detail a model for clustering in . We construct a model that generalizes kmeans clustering by allowing data points to be assigned to a point in the space along the line between two assigned clusters [9]. Each cluster is still modeled as a Gaussian with uniform, spherical covariances; the key difference is the presence of a parameter for each 2wayassigned data point , which determines the proportional assignment of x_{i} between its two cluster representatives. We first describe the 2way model’s inputs, parameters, and outputs. We then give the objective function, an algorithmic description, and a series of performance metrics. Next, we evaluate the performance on simulated data, describing benchmarks for optimal performance. Finally, we apply the model to real data of 16S rDNA sequencing from 1500 midvaginal bacterial samples by the Vaginal Human Microbiome Project.
2. Methods
2.1. 2Way kMeans
The model characterizes a mixture where points are each sampled either from a kmixture of uniform, spherical Gaussian distributions or from pairwise weighted averages of these Gaussians.
Formally, we describe a generative model for a set of data points . The model involves clusters. The jth cluster is parametrized by its mean . To simulate , the model first chooses a pair of cluster indices () along with a weighting . is drawn from a Gaussian distribution whose parameters are u_{i}weighted averages of two representative clusters. Specifically, such that and is the given uniform, spherical covariance matrix.
The inference problem involves the inputs of data and number of clusters , seeking output of the generative model parameters, that is, the vectors of assignments and weights .
2.2. Generalized kMeans
Given input and cardinality , kmeans traditionally provides us with the following objective: where are the cluster representatives. The kmeans objective can be generalized as the following: where are the cluster assignments and are the cluster representatives.
A common generalization of kmeans is to permit each to have s nonzero entries (in our case, we set ). An algorithm for this generalized objective is simply to hold fixed while performing sparse regression on and then hold fixed and use ordinary least squares (OLS) to find .
In our case, because we only allow points to lie uniformly between two cluster representatives, the two nonzero entries in a given are restricted to some and . Our problem is instead the following: subject to
2.3. 2Way kMeans Algorithm
Our goal is to find a nonnegative 2sparse solution for each . To do so, we can minimize over all cluster representative possibilities. This 2sparse solution gives us indices which correspond with the two cluster representatives. This corresponds with the following objective: subject to
For a given and , minimizing with respect to reveals a global minimum at
After minimizing with respect to , we project to the region . We set if the minimizer is less than 0 and set if the minimizer is greater than 1. This allows us to achieve the minimum value of over the domain for .
After minimizing the assignment , we then use OLS to pick optimal as specified before. Formally, OLS produces a vector that minimizes the squared residual error between an input matrix Φ^{T} and vector .
Taking the gradient and setting equal to zero yields the following formula:
Thus, we perform OLS for all vectors at once with matrix multiplication:
Thus, this gives us representatives that minimize the residual error between the cluster representatives and data points subject to . We then alternate this process for rounds until convergence.
2.4. Performance Metrics
We use the 2way kmeans objective as a performance metric in measuring the accuracy of a model in unsupervised examples. where has at most two nonzero entries with values and .
Additionally, we also use four different error rates to measure the accuracy of 2way kmeans on test cases. Let and be the ground truth instance parameters, that is, respectively, true 2way cluster assignment of , center of cluster , and 2way weighting for between clusters .
defines the 01 error rate for 2way cluster assignment:
defines the squared deviation from optimal :
defines the squared deviation from optimal . WLOG, we assume , where u is the variable drawn from :
3. Results
3.1. Example Run for 2Way kMeans
We find it illuminating to demonstrate the performance of 2way kmeans versus vanilla kmeans on a cartoon example.
In Figure 1, we simulated data points in from three clusters, with respective means and covariance matrices . Data points are drawn into pairwise clusters by choosing two cluster representatives without replacement from the following prior probabilities:
We initialize the cluster representatives with vanilla kmeans. Vanilla kmeans achieves the results in Figure 2. Statistics for vanilla kmeans are given as follows:
kmeans predicts the 2way cluster assignments of points incorrectly due to skewing the cluster means toward the middle of the graph. 2way kmeans, however, significantly improves on all error rates. After 10 rounds of 2way kmeans, we achieve the results in Figure 3.
For every statistic, the results are clearly an improvement on standard kmeans. The error rate on cluster assignment still exists because 2way kmeans points closest to cluster representatives may be assigned to an incorrect secondary cluster.
3.2. Benchmarks
3.2.1. Sparsity (Avg. of 10 Trials, 10 Rounds Each)
Our sparsity test was conducted by keeping cluster prior probabilities and cluster centers μ constant while varying the number of data points (ratio of means ). From Figure 4, we see that the algorithm performs consistently well under a variety of conditions, but too few data points can hurt performance to an extent.
3.2.2. Cluster Separation (Avg. of 10 Trials, 10 Rounds Each)
We test the error rate as a function of the Euclidean distances of (ratio of means ). From the results in Figure 5, we can see that a certain threshold is required for proper performance of the algorithm. This makes sense, as when , the clusters are almost on top of each other and difficult to distinguish. Additionally, as the cluster centers are moved farther apart, the norm between the cluster representative determined by the algorithm and the actual cluster representative increases (but this is to be expected).
3.2.3. Variance (Avg. of 10 Trials, 10 Rounds Each)
We increase the variance of the clusters while fixing cluster prior probabilities, data points, and cluster centers (ratio of means ). From the results in Figure 6, we can see that large variance hurts proper performance of the algorithm. Analogous to cluster separation, as when , the clusters are too close to distinguish.
3.3. Real Data
Publicly available sequence data for the Human Microbiome Project (HMP) study SRP002462, described as metagenomic sequencing of 16S rDNA from vaginal and related samples from clinical and twin subjects, was downloaded from the NCBI SRA database [10]. The downloaded sets of data correspond to two separate submissions: SRA169809 (1608/1608 samples were downloaded) and SRA273234 (34/133 samples were downloaded), for a total of 1642 samples.
The SFF files were processed and cleaned using the microbial community analysis software mothur [11], based on a standard protocol developed for 454 sequence data processing and quality control [12]. The dissimilarities between the samples were calculated using the ClaytonYue dissimilarity measure. The data was subsampled to 5000 sequences per sample (this step results in dropping out 136 samples that had less than 5000 reads in total) 500 times to produce the distance file, which was used to calculate principal coordinates. Figure 7 shows the graph of ~1500 data points after PCoA. After implementing the 2way kmeans algorithm [13], we initialized with kmeans and ran 2way kmeans for 5 rounds on the data.
Unfortunately, the nonlinear arches between the clusters pushed the cluster representatives slightly outside the clusters. Nonetheless, the algorithm was still an improvement over kmeans. We note that after kmeans, the 2way objective had a value of 108.0 while our 2way kmeans algorithm converged on an objective of after 5 rounds. Additionally, the algorithm gives us a characterization of the samples lying between two clusters. The results can be seen in Figure 8.
3.4. Discussion
We first get the most abundant operational taxonomic unit (OTU) in each sample (down to the genus level) and the closest cluster assignment for each sample. We use this to observe which OTUs are the most common to each cluster. We can find the closest sample to each data point by simply taking the for each data point .
From Table 1, we see that four of the five clusters have a unique and most abundant OTU, while cluster c_{3} has a variety of abundant types. Aside from the top four OTUs, separating the data into discrete clusters obscures how the rest of the OTUs can be characterized.

By using each data point’s clusterpair assignment, we further separate the data into clusters. Let designate the data points that are between clusters and but are nearer to cluster than cluster . We take the most abundant OTUs in each sample and the cluster pair for each sample. We can then find the most abundant OTUs for each cluster pair.
Table 2 shows the structure of the most abundant OTU types for each 2way cluster defined before. Once again, we find that clusters , , , and are all dominated by the same single OTU from before. Yet, observing clusters provides us with a more indepth understanding of the diverse cluster c_{3}.

Interestingly, we see that the makeups of c_{31}, c_{32}, c_{34}, and c_{35} are remarkably different. We immediately see that the top four OTUs are all predominantly contained in cluster pairs that include their single main cluster. In addition, we notice that the samples with abundant Sneathia_{1}, Prevotella_{2}, and unclassified types are predominantly contained in c_{31}. c_{32} contains samples with a variety of abundant OTUs. Lactobacillus_{3}, Lactobacillus_{4}, Prevotella_{1}, Streptococcus_{1}, Streptococcus_{2}, and Bifodobacterium are abundant in samples that are predominantly contained in c_{34}. Finally, almost no samples are in the cluster pair c_{35}, aside from a few Sneathia_{1} types.
In this way, 2way kmeans also opens up a wealth of information on the relationships between samples. In particular, it now makes more sense to characterize the samples as being in 6 different clusters: c_{1}, c_{2}, c_{31}, c_{34}, and c_{5}. We also see that certain clusters have mixed relationships, while others have almost no interaction. Without 2way kmeans, this would not be immediately obvious.
4. Conclusion
The complexity of microbial populations is unfolding as microbiome data becomes increasingly available. Yet, standard methodologies oversimplify microbial compositions by pigeonholing them into discrete clusters. This paper further refines the models for microbial abundance across groups of samples. We allow samples to be presented as a weighted average of two clusters, rather than belonging to only one. This may be motivated biologically, as the sample often reflects a mixture of two sources of microbiota, each well represented by a cluster. An alternative explanation is that the averaged sample represents an intermediate, potentially temporary state of the microbial composition, between the more stable ones represented by the clusters themselves.
Technically, we formalize this model as a generalization of kmeans. We derive a simple algorithm to infer such a structure and validate its benchmarks on simulated data.
Applying our algorithm to real data from the Human Vaginal Microbiome Project provides empirical support to the 2way model. We showed that while most of the samples lie in six clusters: four welldefined clusters and two subclusters. Furthermore, while previously, a sizable fraction of samples in between clusters was ignored, the 2way model characterized the entire distribution. Using 2way kmeans, we can tell that a large portion of the previously unclustered samples, which lie in between two clusters, contains shared properties. In addition, we see that certain clusters have mixed relationships, while others have almost no interaction.
5. Further Research
In addition, this paper leaves several open questions and opportunities for further research: (i)How can we efficiently characterize a 2way distribution with nonspherical covariance matrices?(ii)How can we efficiently characterize a kway distribution?(iii)How can we efficiently characterize a 2way distribution with nonlinear paths between cluster representatives?
Addressing these questions will further help us understand the composition of microbial populations.
Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.
Acknowledgments
This work was supported by the National Science Foundation under CISE EAGER Grant no. 1547120.
References
 Human Microbiome Project Consortium, “Structure, function and diversity of the healthy human microbiome,” Nature, vol. 486, no. 7402, pp. 207–214, 2012. View at: Publisher Site  Google Scholar
 J. Qin, R. Li, J. Raes et al., “A human gut microbial gene catalog established by metagenomic sequencing,” Nature, vol. 464, no. 7285, pp. 59–65, 2010. View at: Publisher Site  Google Scholar
 A. Ramette and P. L. Buttigieg, “A guide to statistical analysis in microbial ecology: a communityfocused, living review of multivariate data analyses,” FEMS Microbiology Ecology, vol. 90, no. 3, pp. 543–550, 2014. View at: Publisher Site  Google Scholar
 Human Microbiome Project Consortium, “A framework for human microbiome research,” Nature, vol. 486, no. 7402, pp. 215–221, 2012. View at: Publisher Site  Google Scholar
 J. M. Fettweis, J. P. Brooks, M. G. Serrano et al., “Differences in vaginal microbiome in African American women versus women of European ancestry,” Microbiology, vol. 160, Part 10, pp. 2272–2282, 2014. View at: Publisher Site  Google Scholar
 R. Plenge, M. Weinblatt, N. Shadick, A. Price, N. Patterson, and D. Reich, “Principal components analysis corrects for stratification in genomewide association studies,” Nature Genetics, vol. 38, no. 8, pp. 904–909, 2006. View at: Publisher Site  Google Scholar
 J. Novembre, D. H. Alexander, and K. Lange, “Fast modelbased estimation of ancestry in unrelated individuals,” Genome Research, vol. 19, no. 9, pp. 1655–1664, 2009. View at: Publisher Site  Google Scholar
 J. Novembre, T. Johnson, K. Bryc et al., “Genes mirror geography within Europe,” Nature, vol. 456, no. 7219, p. 274, 2008. View at: Publisher Site  Google Scholar
 S. P. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982. View at: Publisher Site  Google Scholar
 National Center for Biotechnology Information, 2014, https://www.ncbi.nlm.nih.gov/sra/?term=SRP002462.
 P. D. Schloss, S. L. Westcott, T. Ryabin et al., “Introducing mothur: opensource, platformindependent, communitysupported software for describing and comparing microbial communities,” Applied and Environmental Microbiology, vol. 75, no. 23, pp. 7537–7541, 2009, http://aem.asm.org/content/75/23/7537.short?rss=1&ssource=mfc. View at: Publisher Site  Google Scholar
 P. D. Schloss, D. Gevers, and S. L. Westcott, “Reducing the effects of PCR amplification and sequencing artifacts on 16s rRNAbased studies,” PLoS One, vol. 6, no. 12, article e27310, 2011. View at: Publisher Site  Google Scholar
 W. J. Jackson, “2way cluster assignment,” 2016, https://github.com/westonjackson/2WayClusterAssignment. View at: Google Scholar
Copyright
Copyright © 2017 Weston J. Jackson et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.