Journal of Healthcare Engineering

Volume 2017 (2017), Article ID 5284145, 7 pages

https://doi.org/10.1155/2017/5284145

## 2-Way *k*-Means as a Model for Microbiome Samples

^{1}Department of Computer Science, Columbia University, New York, NY 10027, USA

^{2}Department of Biological Sciences, Columbia University, New York, NY 10027, USA

Correspondence should be addressed to Weston J. Jackson; weston.j.jackson@gmail.com

Received 20 May 2017; Accepted 17 July 2017; Published 5 September 2017

Academic Editor: Ahmad P. Tafti

Copyright © 2017 Weston J. Jackson et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

*Motivation*. Microbiome sequencing allows defining clusters of samples with shared composition. However, this paradigm poorly accounts for samples whose composition is a mixture of cluster-characterizing ones and which therefore lie in between them in the cluster space. This paper addresses unsupervised learning of 2-way clusters. It defines a mixture model that allows 2-way cluster assignment and describes a variant of generalized *k*-means for learning such a model. We demonstrate applicability to microbial 16S rDNA sequencing data from the Human Vaginal Microbiome Project.

#### 1. Introduction

Microbiome analysis [1] by sequencing of ubiquitous genes, most commonly 16S rRNA, is a standard, cost-effective way to characterize the composition of a microbial sample. Standard analysis tools facilitate quantifying the fraction of sequence reads from each bacterial species in a sample [2]. Interpretation of composition vectors across a collection of samples typically relies on dimensionality reduction followed by clustering in the lower-dimensionality space [3]. This allows identification of functionally meaningful subsets of samples with characteristic microbiota. The Human Microbiome Project [4] and its derivatives such as the Human Vaginal Microbiome Project [5] have collected and thus analyzed large numbers of samples towards elucidating the structure and composition of microbiota across physiological and pathological states.

Similar to variation in microbial genomes across different human individuals, variants along the nuclear genomes have been summarized by a small number of dimensions [6]. However, in contrast to analyses of microbiome samples, analyses of inherited genetic variation standardly assume and observe samples to be spread across a continuum in the reduced space, rather than be clustered [7]. Samples in between clusters are interpreted as originating from intermediate locales along a geographic cline [8] or as representing different levels of a mixture between cluster-specific populations.

In this paper, we formally tackle the problem of clustering while allowing elements to belong to two clusters. Specifically, we describe in detail a model for clustering in $\mathbb{R}^d$. We construct a model that generalizes *k*-means clustering by allowing data points to be assigned to a point in the space along the line between two assigned clusters [9]. Each cluster is still modeled as a Gaussian with uniform, spherical covariances; the key difference is the presence of a parameter $u_i$ for each 2-way-assigned data point $x_i$, which determines the proportional assignment of $x_i$ between its two cluster representatives. We first describe the 2-way model’s inputs, parameters, and outputs. We then give the objective function, an algorithmic description, and a series of performance metrics. Next, we evaluate the performance on simulated data, describing benchmarks for optimal performance. Finally, we apply the model to real data of 16S rDNA sequencing from 1500 midvaginal bacterial samples by the Vaginal Human Microbiome Project.

#### 2. Methods

##### 2.1. 2-Way *k*-Means

The model characterizes a mixture where points are each sampled either from a *k*-mixture of uniform, spherical Gaussian distributions or from pairwise weighted averages of these Gaussians.

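Formally, we describe a generative model for a set of data points $x_1, \dots, x_n \in \mathbb{R}^d$. The model involves $k$ clusters. The *j*th cluster is parametrized by its mean $\mu_j \in \mathbb{R}^d$. To simulate $x_i$, the model first chooses a pair of cluster indices $(a_i, b_i)$ along with a weighting $u_i \in [0, 1]$. $x_i$ is drawn from a Gaussian distribution whose parameters are $u_i$-weighted averages of two representative clusters. Specifically, $x_i \sim \mathcal{N}(m_i, \Sigma)$ such that $m_i = u_i \mu_{a_i} + (1 - u_i) \mu_{b_i}$ and $\Sigma$ is the given uniform, spherical covariance matrix (that is, the same multiple of the identity for every cluster).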
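The generative process above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the simulation used in the paper: the function name `sample_two_way`, the uniform choice of cluster pairs, the uniform draw of the weight, and the `p_single` parameter (the fraction of points placed entirely on one cluster) are all assumptions introduced here.

```python
import numpy as np

def sample_two_way(means, n, sigma=0.1, p_single=0.5, seed=0):
    """Draw n points from a 2-way mixture (illustrative sketch).

    means: (k, d) array-like of cluster means.
    Returns X (n, d), pair indices a and b (n,), and weights u (n,).
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    k, d = means.shape
    a = np.empty(n, dtype=int)
    b = np.empty(n, dtype=int)
    for i in range(n):
        # Assumption: the two distinct clusters are chosen uniformly at random.
        a[i], b[i] = rng.choice(k, size=2, replace=False)
    # Assumption: with probability p_single the point is 1-way (u = 1, all
    # weight on cluster a); otherwise u is uniform on [0, 1].
    u = np.where(rng.random(n) < p_single, 1.0, rng.uniform(0.0, 1.0, size=n))
    # Mean of each point: u-weighted average of its two cluster means.
    centers = u[:, None] * means[a] + (1.0 - u[:, None]) * means[b]
    # Spherical Gaussian noise with the same sigma for every cluster.
    X = centers + sigma * rng.standard_normal((n, d))
    return X, a, b, u

X_sim, a_sim, b_sim, u_sim = sample_two_way([[0, 0], [4, 0], [0, 4]], n=50)
```

Setting `p_single=1.0` reduces the sketch to an ordinary *k*-component spherical Gaussian mixture, which is the 1-way special case of the model.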

The inference problem involves the inputs of data $x_1, \dots, x_n$ and number of clusters $k$, seeking output of the generative model parameters, that is, the cluster means $\mu_1, \dots, \mu_k$ together with the vectors of assignments $(a_i, b_i)$ and weights $u_i$.

##### 2.2. Generalized *k*-Means

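Given input $x_1, \dots, x_n \in \mathbb{R}^d$ and cardinality $k$, *k*-means traditionally provides us with the following objective:

$$\min_{\mu_1, \dots, \mu_k} \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert x_i - \mu_j \rVert_2^2,$$

where $\mu_1, \dots, \mu_k$ are the cluster representatives. The *k*-means objective can be generalized as the following:

$$\min_{\phi, \mu} \sum_{i=1}^{n} \Bigl\lVert x_i - \sum_{j=1}^{k} \phi_{ij}\, \mu_j \Bigr\rVert_2^2,$$

where $\phi_i = (\phi_{i1}, \dots, \phi_{ik})$ are the cluster assignments and $\mu_j$ are the cluster representatives. Ordinary *k*-means is recovered when each $\phi_i$ has exactly one nonzero entry, equal to 1.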
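To make the relationship between the two objectives concrete, the following sketch evaluates both on a toy data set and shows that they coincide when every assignment vector is one-hot. The matrix layout (data points as rows of `X`, representatives as rows of `M`, assignment vectors as columns of `Phi`) and the function names are conventions assumed here for illustration.

```python
import numpy as np

def generalized_objective(X, Phi, M):
    """Sum over i of ||x_i - sum_j Phi[j, i] * M[j]||^2.

    X: (n, d) data, Phi: (k, n) assignments, M: (k, d) representatives.
    """
    residual = X - Phi.T @ M
    return float(np.sum(residual ** 2))

def kmeans_objective(X, M):
    """Classical k-means cost and the nearest-representative labels."""
    sq_dist = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)  # (n, k)
    return float(sq_dist.min(axis=1).sum()), sq_dist.argmin(axis=1)

X = np.array([[0.0, 0.1], [0.2, 0.0], [3.9, 4.0], [4.1, 4.2]])
M = np.array([[0.0, 0.0], [4.0, 4.0]])

value, labels = kmeans_objective(X, M)

# One-hot assignment matrix: column i has a single 1 in row labels[i].
Phi = np.zeros((M.shape[0], X.shape[0]))
Phi[labels, np.arange(X.shape[0])] = 1.0

print(value, generalized_objective(X, Phi, M))
```

The two printed values agree up to floating-point rounding, since a one-hot column of `Phi` simply selects the nearest representative.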

A common generalization of *k*-means is to permit each assignment vector $\phi_i$ to have *s* nonzero entries (in our case, we set $s = 2$). An algorithm for this generalized objective is simply to hold the representatives $\mu$ fixed while performing sparse regression on each $\phi_i$ and then hold the $\phi_i$ fixed and use ordinary least squares (OLS) to find $\mu$.

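In our case, because we only allow points to lie between two cluster representatives, the two nonzero entries in a given assignment vector $\phi_i$ are restricted to some $u_i$ and $1 - u_i$. Our problem is instead the following:

$$\min_{\phi, \mu} \sum_{i=1}^{n} \Bigl\lVert x_i - \sum_{j=1}^{k} \phi_{ij}\, \mu_j \Bigr\rVert_2^2 \quad \text{subject to} \quad \lVert \phi_i \rVert_0 \le 2, \quad \phi_{ij} \ge 0, \quad \sum_{j=1}^{k} \phi_{ij} = 1 \quad \text{for all } i.$$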

##### 2.3. 2-Way *k*-Means Algorithm

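Our goal is to find a nonnegative 2-sparse solution for each assignment vector $\phi_i$. To do so, we can minimize over all possible pairs of cluster representatives. This 2-sparse solution gives us indices $(a_i, b_i)$ which correspond with the two cluster representatives. For each data point $x_i$, this corresponds with the following objective:

$$\min_{a_i,\, b_i,\, u_i} \bigl\lVert x_i - \bigl(u_i\, \mu_{a_i} + (1 - u_i)\, \mu_{b_i}\bigr) \bigr\rVert_2^2 \quad \text{subject to} \quad 0 \le u_i \le 1.$$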

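For a given $a_i$ and $b_i$, minimizing with respect to $u_i$ reveals a global minimum at

$$\hat{u}_i = \frac{(x_i - \mu_{b_i})^{T} (\mu_{a_i} - \mu_{b_i})}{\lVert \mu_{a_i} - \mu_{b_i} \rVert_2^2}.$$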
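For completeness, the unconstrained minimizer can be derived in one step. The derivation below drops the index $i$ and writes $\mu_a$, $\mu_b$ for the two fixed representatives and $\hat{u}$ for the minimizer; this shorthand is introduced here for readability.

```latex
\begin{align*}
f(u) &= \bigl\lVert x - \bigl(u\,\mu_a + (1-u)\,\mu_b\bigr) \bigr\rVert_2^2
      = \bigl\lVert (x - \mu_b) - u\,(\mu_a - \mu_b) \bigr\rVert_2^2, \\
f'(u) &= -2\,(\mu_a - \mu_b)^{T}\bigl((x - \mu_b) - u\,(\mu_a - \mu_b)\bigr), \\
f'(u) = 0 \;&\Longrightarrow\;
  \hat{u} = \frac{(x - \mu_b)^{T}(\mu_a - \mu_b)}{\lVert \mu_a - \mu_b \rVert_2^2}, \\
f''(u) &= 2\,\lVert \mu_a - \mu_b \rVert_2^2 > 0 \quad \text{whenever } \mu_a \neq \mu_b.
\end{align*}
```

Because the second derivative is positive, $f$ is a strictly convex quadratic in $u$ and the stationary point is its global minimum. As a small check, with $x = (1, 0)$, $\mu_a = (0, 0)$, and $\mu_b = (4, 0)$, the formula gives $\hat{u} = \frac{(-3, 0) \cdot (-4, 0)}{16} = 0.75$, and indeed $0.75\,\mu_a + 0.25\,\mu_b = (1, 0)$.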

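After minimizing with respect to $u_i$, we project to the region $[0, 1]$. We set $u_i = 0$ if the minimizer is less than 0 and set $u_i = 1$ if the minimizer is greater than 1. Because the objective is a convex quadratic in $u_i$, this projection achieves the minimum value of the objective over the domain $[0, 1]$ for $u_i$.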
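The assignment step can be sketched as a brute-force search over all pairs of representatives, applying the closed-form weight and the projection onto $[0, 1]$ for each pair. This is an illustrative sketch, not the authors' implementation; the function name `assign_two_way`, the row-wise layout of `X` and `M`, and the handling of coincident representatives are assumptions made here.

```python
import itertools
import numpy as np

def assign_two_way(X, M):
    """For each row x of X, find the pair (a, b) and weight u in [0, 1]
    that minimise ||x - (u * M[a] + (1 - u) * M[b])||^2.

    X: (n, d) data, M: (k, d) representatives with k >= 2.
    """
    n, k = X.shape[0], M.shape[0]
    best_cost = np.full(n, np.inf)
    a = np.zeros(n, dtype=int)
    b = np.zeros(n, dtype=int)
    u = np.zeros(n)
    for p, q in itertools.combinations(range(k), 2):
        diff = M[p] - M[q]
        denom = diff @ diff
        if denom == 0.0:
            # Coincident representatives: every u gives the same fit.
            u_pq = np.ones(n)
        else:
            # Closed-form minimiser, then projection onto [0, 1].
            u_pq = np.clip((X - M[q]) @ diff / denom, 0.0, 1.0)
        fitted = u_pq[:, None] * M[p] + (1.0 - u_pq[:, None]) * M[q]
        cost = ((X - fitted) ** 2).sum(axis=1)
        better = cost < best_cost
        best_cost[better] = cost[better]
        a[better], b[better], u[better] = p, q, u_pq[better]
    return a, b, u, best_cost

M_demo = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X_demo = np.array([[1.0, 0.0], [2.0, 2.0], [-1.0, 0.0]])
a_hat, b_hat, u_hat, cost_hat = assign_two_way(X_demo, M_demo)
```

In this toy example the first point lies on the segment between representatives 0 and 1 with weight 0.75 on representative 0, the second lies midway between representatives 1 and 2, and the third falls outside every segment, so its weight is clipped to 1 and it is assigned entirely to representative 0. The search costs $O(n k^2 d)$ per round, which is acceptable for the small $k$ typical of this setting.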

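After minimizing over the assignments $\phi_i$, we then use OLS to pick optimal $\mu$ as specified before. Formally, OLS produces a vector $w$ that minimizes the squared residual error between an input matrix **Φ**^{T} and vector $y$:

$$\min_{w} \lVert \Phi^{T} w - y \rVert_2^2.$$

Here $\Phi \in \mathbb{R}^{k \times n}$ has the assignment vectors $\phi_i$ as its columns, $y \in \mathbb{R}^{n}$ holds one coordinate of all $n$ data points, and $w \in \mathbb{R}^{k}$ holds the same coordinate of the $k$ cluster representatives.

Taking the gradient and setting equal to zero yields the following formula:

$$w = (\Phi \Phi^{T})^{-1} \Phi\, y.$$

Thus, we perform OLS for all vectors at once with matrix multiplication:

$$M = (\Phi \Phi^{T})^{-1} \Phi\, X,$$

where the rows of $X \in \mathbb{R}^{n \times d}$ are the data points and the rows of $M \in \mathbb{R}^{k \times d}$ are the cluster representatives.

This gives us representatives $\mu$ that minimize the residual error between the cluster representatives and data points subject to the assignments $\Phi$. We then alternate the assignment step and the OLS step over successive rounds until convergence.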
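The representative update can be sketched as a single least-squares solve. The helper names `build_phi` and `update_representatives` are hypothetical, and the sketch uses `numpy.linalg.lstsq` in place of an explicit matrix inverse, which solves the same least-squares problem but is better behaved numerically. The demonstration builds noiseless data from known representatives and checks that the update recovers them.

```python
import numpy as np

def build_phi(a, b, u, k):
    """Assemble the (k, n) assignment matrix: column i carries weight
    u[i] in row a[i] and weight 1 - u[i] in row b[i]."""
    a, b, u = np.asarray(a), np.asarray(b), np.asarray(u, dtype=float)
    cols = np.arange(len(u))
    Phi = np.zeros((k, len(u)))
    Phi[a, cols] += u
    Phi[b, cols] += 1.0 - u
    return Phi

def update_representatives(Phi, X):
    """Solve min_M ||Phi.T @ M - X||_F^2 for the (k, d) matrix M."""
    M, *_ = np.linalg.lstsq(Phi.T, X, rcond=None)
    return M

# Noiseless toy data generated from known representatives.
M_true = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
a = [0, 0, 1, 1, 0]
b = [1, 2, 2, 2, 2]
u = [1.0, 0.5, 0.25, 0.0, 0.3]
Phi = build_phi(a, b, u, k=3)
X = Phi.T @ M_true

M_hat = update_representatives(Phi, X)
# Same result from the normal equations when Phi @ Phi.T is invertible.
M_normal = np.linalg.solve(Phi @ Phi.T, Phi @ X)
```

When a cluster receives no weight from any point, `Phi @ Phi.T` is singular and the normal-equation form fails, whereas `lstsq` still returns a minimum-norm solution. How such empty clusters should be handled is not specified in the text, so a practical implementation would need its own policy (for example, reinitializing the representative).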

##### 2.4. Performance Metrics

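We use the 2-way *k*-means objective as a performance metric in measuring the accuracy of a model in unsupervised examples:

$$\sum_{i=1}^{n} \Bigl\lVert x_i - \sum_{j=1}^{k} \phi_{ij}\, \mu_j \Bigr\rVert_2^2,$$

where the assignment vector $\phi_i$ has at most two nonzero entries with values $u_i$ and $1 - u_i$.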

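Additionally, we use four different error rates to measure the accuracy of 2-way *k*-means on test cases. Let $(a_i^{*}, b_i^{*})$, $\mu_j^{*}$, and $u_i^{*}$ be the ground truth instance parameters, that is, respectively, true 2-way cluster assignment of $x_i$, center of cluster $j$, and 2-way weighting for $x_i$ between clusters $(a_i^{*}, b_i^{*})$.

$E_{0\text{-}1}$ defines the 0-1 error rate for 2-way cluster assignment:

$$E_{0\text{-}1} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\bigl[(a_i, b_i) \neq (a_i^{*}, b_i^{*})\bigr].$$

$E_{\mu}$ defines the squared deviation from optimal $\mu$:

$$E_{\mu} = \frac{1}{k} \sum_{j=1}^{k} \lVert \mu_j - \mu_j^{*} \rVert_2^2.$$

$E_{u}$ defines the squared deviation from optimal $u$. WLOG, we assume $a_i < b_i$, where *u* is the variable drawn from $[0, 1]$:

$$E_{u} = \frac{1}{n} \sum_{i=1}^{n} (u_i - u_i^{*})^2.$$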
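The error measures described above can be sketched as follows. Several details are assumptions made for this illustration rather than statements of the paper's exact definitions: each measure is averaged (over samples or over clusters), estimated cluster labels are taken to be already aligned with the ground-truth labels, and each pair is reoriented so that the smaller index comes first, with the weight flipped to match. The function names are hypothetical.

```python
import numpy as np

def orient(a, b, u):
    """Reorder each pair so the smaller index comes first; a swapped pair
    has its weight u replaced by 1 - u so the fitted point is unchanged."""
    a, b, u = np.asarray(a), np.asarray(b), np.asarray(u, dtype=float)
    swap = a > b
    return np.where(swap, b, a), np.where(swap, a, b), np.where(swap, 1.0 - u, u)

def two_way_errors(a, b, u, M, a_true, b_true, u_true, M_true):
    """Return (pair 0-1 error rate, mean squared deviation of the
    representatives, mean squared deviation of the weights).

    Assumption: cluster labels in the estimate and the ground truth
    refer to the same clusters (no label permutation is resolved here).
    """
    a, b, u = orient(a, b, u)
    a_true, b_true, u_true = orient(a_true, b_true, u_true)
    e_01 = float(np.mean((a != a_true) | (b != b_true)))
    diff = np.asarray(M, dtype=float) - np.asarray(M_true, dtype=float)
    e_mu = float(np.mean(np.sum(diff ** 2, axis=1)))
    e_u = float(np.mean((u - u_true) ** 2))
    return e_01, e_mu, e_u

M_ref = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
# Same fit as the ground truth, but the first pair is written in reverse order.
errors_same = two_way_errors([1, 1], [0, 2], [0.25, 0.5], M_ref,
                             [0, 1], [1, 2], [0.75, 0.5], M_ref)
# One of two pairs is wrong and every representative is shifted by (1, 0).
errors_off = two_way_errors([0, 0], [1, 2], [0.75, 0.5], M_ref + [1.0, 0.0],
                            [0, 1], [1, 2], [0.75, 0.5], M_ref)
```

In practice, cluster labels returned by an unsupervised method are only defined up to permutation, so a matching step between estimated and true clusters would normally precede these computations.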

#### 3. Results

##### 3.1. Example Run for 2-Way *k*-Means

We find it illuminating to demonstrate the performance of 2-way *k*-means versus vanilla *k*-means on a cartoon example.

In Figure 1, we simulated data points from three clusters, each with its own mean and covariance matrix. Data points are assigned to pairwise clusters by choosing two cluster representatives without replacement from the following prior probabilities: