#### Abstract

*Motivation*. Microbiome sequencing allows defining clusters of samples with shared composition. However, this paradigm poorly accounts for samples whose composition is a mixture of cluster-characterizing ones and which therefore lie in between them in the cluster space. This paper addresses unsupervised learning of 2-way clusters. It defines a mixture model that allows 2-way cluster assignment and describes a variant of generalized *k*-means for learning such a model. We demonstrate applicability to microbial 16S rDNA sequencing data from the Human Vaginal Microbiome Project.

#### 1. Introduction

Microbiome analysis [1] by sequencing of ubiquitous genes, most commonly 16S rRNA, is a standard, cost-effective way to characterize the composition of a microbial sample. Standard analysis tools facilitate quantifying the fraction of sequence reads from each bacterial species in a sample [2]. Interpretation of composition vectors across a collection of samples typically relies on dimensionality reduction followed by clustering in the lower-dimensionality space [3]. This allows identification of functionally meaningful subsets of samples with characteristic microbiota. The Human Microbiome Project [4] and its derivatives such as the Human Vaginal Microbiome Project [5] have collected and thus analyzed large numbers of samples towards elucidating the structure and composition of microbiota across physiological and pathological states.

Similar to variation in microbial genomes across different human individuals, variants along the nuclear genomes have been summarized by a small number of dimensions [6]. However, in contrast to analyses of microbiome samples, analyses of inherited genetic variation standardly assume and observe samples to be spread across a continuum in the reduced space, rather than be clustered [7]. Samples in between clusters are interpreted as originating from intermediate locales along a geographic cline [8] or as representing different levels of a mixture between cluster-specific populations.

In this paper, we formally tackle the problem of clustering while allowing elements to belong to two clusters. Specifically, we will describe in detail a model for clustering in . We construct a model that generalizes *k*-means clustering by allowing data points to be assigned to a point in the space along the line between two assigned clusters [9]. Each cluster is still modeled as a Gaussian with uniform, spherical covariances; the key difference is the presence of a parameter for each 2-way-assigned data point , which determines the proportional assignment of *x _{i}* between its two cluster representatives. We first describe the 2-way model’s inputs, parameters, and outputs. We then give the objective function, an algorithmic description, and a series of performance metrics. Next, we evaluate the performance on simulated data, describing benchmarks for optimal performance. Finally, we apply the model to real data of 16S rDNA sequencing from 1500 midvaginal bacterial samples by the Vaginal Human Microbiome Project.

#### 2. Methods

##### 2.1. 2-Way *k*-Means

The model characterizes a mixture where points are each sampled either from a *k*-mixture of uniform, spherical Gaussian distributions or from pairwise weighted averages of these Gaussians.

Formally, we describe a generative model for a set of data points . The model involves clusters. The *j*th cluster is parametrized by its mean . To simulate , the model first chooses a pair of cluster indices () along with a weighting . is drawn from a Gaussian distribution whose parameters are *u _{i}*-weighted averages of two representative clusters. Specifically, such that and is the given uniform, spherical covariance matrix.

The inference problem involves the inputs of data and number of clusters , seeking output of the generative model parameters, that is, the vectors of assignments and weights .

##### 2.2. Generalized *k*-Means

Given input and cardinality , *k*-means traditionally provides us with the following objective:
where are the cluster representatives. The *k*-means objective can be generalized as the following:
where are the cluster assignments and are the cluster representatives.

A common generalization of *k*-means is to permit each to have *s* nonzero entries (in our case, we set ). An algorithm for this generalized objective is simply to hold fixed while performing sparse regression on and then hold fixed and use ordinary least squares (OLS) to find .

In our case, because we only allow points to lie uniformly between two cluster representatives, the two nonzero entries in a given are restricted to some and . Our problem is instead the following: subject to

##### 2.3. 2-Way *k*-Means Algorithm

Our goal is to find a nonnegative 2-sparse solution for each . To do so, we can minimize over all cluster representative possibilities. This 2-sparse solution gives us indices which correspond with the two cluster representatives. This corresponds with the following objective: subject to

For a given and , minimizing with respect to reveals a global minimum at

After minimizing with respect to , we project to the region . We set if the minimizer is less than 0 and set if the minimizer is greater than 1. This allows us to achieve the minimum value of over the domain for .

After minimizing the assignment , we then use OLS to pick optimal as specified before. Formally, OLS produces a vector that minimizes the squared residual error between an input matrix **Φ**^{T} and vector .

Taking the gradient and setting equal to zero yields the following formula:

Thus, we perform OLS for all vectors at once with matrix multiplication:

Thus, this gives us representatives that minimize the residual error between the cluster representatives and data points subject to . We then alternate this process for rounds until convergence.

##### 2.4. Performance Metrics

We use the 2-way *k*-means objective as a performance metric in measuring the accuracy of a model in unsupervised examples.
where has at most two nonzero entries with values and .

Additionally, we also use four different error rates to measure the accuracy of 2-way *k*-means on test cases. Let and be the ground truth instance parameters, that is, respectively, true 2-way cluster assignment of , center of cluster , and 2-way weighting for between clusters .

defines the 0-1 error rate for 2-way cluster assignment:

defines the squared deviation from optimal :

defines the squared deviation from optimal . WLOG, we assume , where *u* is the variable drawn from :

#### 3. Results

##### 3.1. Example Run for 2-Way *k*-Means

We find it illuminating to demonstrate the performance of 2-way *k*-means versus vanilla *k*-means on a cartoon example.

In Figure 1, we simulated data points in from three clusters, with respective means and covariance matrices . Data points are drawn into pairwise clusters by choosing two cluster representatives without replacement from the following prior probabilities:

We initialize the cluster representatives with vanilla *k*-means. Vanilla *k*-means achieves the results in Figure 2. Statistics for vanilla *k*-means are given as follows:

*k*-means predicts the 2-way cluster assignments of points incorrectly due to skewing the cluster means toward the middle of the graph. 2-way *k*-means, however, significantly improves on all error rates. After 10 rounds of 2-way *k*-means, we achieve the results in Figure 3.

For every statistic, the results are clearly an improvement on standard *k*-means. The error rate on cluster assignment still exists because 2-way *k*-means points closest to cluster representatives may be assigned to an incorrect secondary cluster.

##### 3.2. Benchmarks

###### 3.2.1. Sparsity (Avg. of 10 Trials, 10 Rounds Each)

Our sparsity test was conducted by keeping cluster prior probabilities and cluster centers *μ* constant while varying the number of data points (ratio of means ). From Figure 4, we see that the algorithm performs consistently well under a variety of conditions, but too few data points can hurt performance to an extent.

###### 3.2.2. Cluster Separation (Avg. of 10 Trials, 10 Rounds Each)

We test the error rate as a function of the Euclidean distances of (ratio of means ). From the results in Figure 5, we can see that a certain threshold is required for proper performance of the algorithm. This makes sense, as when , the clusters are almost on top of each other and difficult to distinguish. Additionally, as the cluster centers are moved farther apart, the norm between the cluster representative determined by the algorithm and the actual cluster representative increases (but this is to be expected).

###### 3.2.3. Variance (Avg. of 10 Trials, 10 Rounds Each)

We increase the variance of the clusters while fixing cluster prior probabilities, data points, and cluster centers (ratio of means ). From the results in Figure 6, we can see that large variance hurts proper performance of the algorithm. Analogous to cluster separation, as when , the clusters are too close to distinguish.

##### 3.3. Real Data

Publicly available sequence data for the Human Microbiome Project (HMP) study SRP002462, described as metagenomic sequencing of 16S rDNA from vaginal and related samples from clinical and twin subjects, was downloaded from the NCBI SRA database [10]. The downloaded sets of data correspond to two separate submissions: SRA169809 (1608/1608 samples were downloaded) and SRA273234 (34/133 samples were downloaded), for a total of 1642 samples.

The SFF files were processed and cleaned using the microbial community analysis software mothur [11], based on a standard protocol developed for 454 sequence data processing and quality control [12]. The dissimilarities between the samples were calculated using the Clayton-Yue dissimilarity measure. The data was subsampled to 5000 sequences per sample (this step results in dropping out 136 samples that had less than 5000 reads in total) 500 times to produce the distance file, which was used to calculate principal coordinates. Figure 7 shows the graph of ~1500 data points after PCoA. After implementing the 2-way *k*-means algorithm [13], we initialized with *k*-means and ran 2-way *k*-means for 5 rounds on the data.

Unfortunately, the nonlinear arches between the clusters pushed the cluster representatives slightly outside the clusters. Nonetheless, the algorithm was still an improvement over *k*-means. We note that after *k*-means, the 2-way objective had a value of 108.0 while our 2-way *k*-means algorithm converged on an objective of after 5 rounds. Additionally, the algorithm gives us a characterization of the samples lying between two clusters. The results can be seen in Figure 8.

##### 3.4. Discussion

We first get the most abundant operational taxonomic unit (OTU) in each sample (down to the genus level) and the closest cluster assignment for each sample. We use this to observe which OTUs are the most common to each cluster. We can find the closest sample to each data point by simply taking the for each data point .

From Table 1, we see that four of the five clusters have a unique and most abundant OTU, while cluster *c*_{3} has a variety of abundant types. Aside from the top four OTUs, separating the data into discrete clusters obscures how the rest of the OTUs can be characterized.

By using each data point’s cluster-pair assignment, we further separate the data into clusters. Let designate the data points that are between clusters and but are nearer to cluster than cluster . We take the most abundant OTUs in each sample and the cluster pair for each sample. We can then find the most abundant OTUs for each cluster pair.

Table 2 shows the structure of the most abundant OTU types for each 2-way cluster defined before. Once again, we find that clusters , , , and are all dominated by the same single OTU from before. Yet, observing clusters provides us with a more in-depth understanding of the diverse cluster *c*_{3}.

Interestingly, we see that the makeups of *c*_{31}, *c*_{32}, *c*_{34}, and *c*_{35} are remarkably different. We immediately see that the top four OTUs are all predominantly contained in cluster pairs that include their single main cluster. In addition, we notice that the samples with abundant *Sneathia*_{1}, *Prevotella*_{2}, and unclassified types are predominantly contained in *c*_{31}. *c*_{32} contains samples with a variety of abundant OTUs. *Lactobacillus*_{3}, *Lactobacillus*_{4}, *Prevotella*_{1}, *Streptococcus*_{1}, *Streptococcus*_{2}, and *Bifodobacterium* are abundant in samples that are predominantly contained in *c*_{34}. Finally, almost no samples are in the cluster pair *c*_{35}, aside from a few *Sneathia*_{1} types.

In this way, 2-way *k*-means also opens up a wealth of information on the relationships between samples. In particular, it now makes more sense to characterize the samples as being in 6 different clusters: *c*_{1}, *c*_{2}, *c*_{31}, *c*_{34}, and *c*_{5}. We also see that certain clusters have mixed relationships, while others have almost no interaction. Without 2-way *k*-means, this would not be immediately obvious.

#### 4. Conclusion

The complexity of microbial populations is unfolding as microbiome data becomes increasingly available. Yet, standard methodologies oversimplify microbial compositions by pigeonholing them into discrete clusters. This paper further refines the models for microbial abundance across groups of samples. We allow samples to be presented as a weighted average of two clusters, rather than belonging to only one. This may be motivated biologically, as the sample often reflects a mixture of two sources of microbiota, each well represented by a cluster. An alternative explanation is that the averaged sample represents an intermediate, potentially temporary state of the microbial composition, between the more stable ones represented by the clusters themselves.

Technically, we formalize this model as a generalization of *k*-means. We derive a simple algorithm to infer such a structure and validate its benchmarks on simulated data.

Applying our algorithm to real data from the Human Vaginal Microbiome Project provides empirical support to the 2-way model. We showed that while most of the samples lie in six clusters: four well-defined clusters and two subclusters. Furthermore, while previously, a sizable fraction of samples in between clusters was ignored, the 2-way model characterized the entire distribution. Using 2-way *k*-means, we can tell that a large portion of the previously unclustered samples, which lie in between two clusters, contains shared properties. In addition, we see that certain clusters have mixed relationships, while others have almost no interaction.

#### 5. Further Research

In addition, this paper leaves several open questions and opportunities for further research:
(i)How can we efficiently characterize a 2-way distribution with nonspherical covariance matrices?(ii)How can we efficiently characterize a *k*-way distribution?(iii)How can we efficiently characterize a 2-way distribution with nonlinear paths between cluster representatives?

Addressing these questions will further help us understand the composition of microbial populations.

#### Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

#### Acknowledgments

This work was supported by the National Science Foundation under CISE EAGER Grant no. 1547120.