BioMed Research International

Volume 2015 (2015), Article ID 918954, 10 pages

http://dx.doi.org/10.1155/2015/918954

## K-Profiles: A Nonlinear Clustering Method for Pattern Detection in High Dimensional Data

^{1}Department of Mathematics and Computer Science, Emory University, Atlanta, GA 30322, USA
^{2}School of Software Engineering, Tongji University, Shanghai 200092, China
^{3}The Advanced Institute of Translational Medicine and Department of Gastroenterology, Shanghai Tenth People’s Hospital, Tongji University, Shanghai 200092, China
^{4}Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322, USA

Received 5 November 2014; Accepted 18 December 2014

Academic Editor: Fang-Xiang Wu

Copyright © 2015 Kai Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

With modern technologies such as microarray, deep sequencing, and liquid chromatography-mass spectrometry (LC-MS), it is possible to measure the expression levels of thousands of genes/proteins simultaneously to unravel important biological processes. A first step towards elucidating hidden patterns and understanding the massive data is the application of clustering techniques. Nonlinear relations, which have been mostly unutilized in contrast to linear correlations, are prevalent in high-throughput data. In many cases, nonlinear relations can model the biological relationships more precisely and reflect critical patterns in the biological systems. Using the general dependency measure, Distance Based on Conditional Ordered List (DCOL), which we introduced before, we designed the nonlinear K-profiles clustering method, which can be seen as the nonlinear counterpart of the K-means clustering algorithm. The method has a built-in statistical testing procedure that ensures genes not belonging to any cluster do not impact the estimation of cluster profiles. Results from extensive simulation studies showed that K-profiles clustering not only outperformed the traditional linear K-means algorithm, but also showed significantly better performance than our previous General Dependency Hierarchical Clustering (GDHC) algorithm. We further analyzed a gene expression dataset, on which K-profiles clustering generated biologically meaningful results.

#### 1. Introduction

In recent years, large amounts of high dimensional data have been generated by high-throughput expression techniques, such as gene expression data from microarray or deep sequencing [1] and metabolomics and proteomics data from liquid chromatography-mass spectrometry (LC-MS) [2]. Mining the hidden patterns in these data leads to an enhanced understanding of functional genomics, gene regulatory networks, and so forth [3, 4]. However, the complexity of biological networks and the huge number of genes pose great challenges to analyzing such massive data [5, 6]. Clustering techniques have usually been applied as a first step in the data mining process to uncover hidden structures and reveal interesting patterns in the data [7].

Clustering algorithms have been studied extensively over the last three decades, with many traditional clustering techniques successfully applied or adapted to gene expression data, leading to the discovery of biologically relevant groups of genes or samples [6]. Traditional clustering algorithms usually operate on the full feature space, while growing attention has been paid to subspace clustering. Traditional clustering algorithms, such as K-means and expectation maximization (EM) based algorithms, mostly use linear associations or geometric proximity to measure the similarity/distance between data points [8].

When applying traditional clustering algorithms in the domain of bioinformatics, additional challenges arise from the prevalent existence of nonlinear correlations in the high dimensional space [9]. However, nonlinear correlations remain largely untouched, in contrast to the relatively mature literature on clustering with linear correlations [5, 10–12]. Several factors make nonlinear clustering difficult. First, a pair of nonlinearly associated data points may not be close to each other in the high-dimensional space. Second, it is difficult to effectively define a cluster profile (i.e., the “center” of a cluster) to summarize a cluster in the presence of nonlinear associations. Third, compared to measures that detect linear correlations, nonlinear association measures lose statistical power more quickly as random additive noise increases. Fourth, given the high dimensions, computationally expensive methods, for example, principal curves [13, 14], are hard to adopt even though they can model nonlinear relationships.

In this paper, we address these problems by developing a clustering method that can group data points with both linear and nonlinear associations. We name this method “K-profiles clustering.” Our method is based on the previously described nonlinear measure, the Distance Based on Conditional Ordered List (DCOL) [15, 16]. The key concept is to use the ordering of the data points in the sample space as the cluster profile. We previously described a hierarchical clustering scheme named General Dependency Hierarchical Clustering (GDHC); however, GDHC is computationally intensive. The new K-profiles clustering method is much more efficient, representing a ~20-fold reduction in computing time. Conceptually, it is the nonlinear counterpart of the popular K-means clustering method, while the existing GDHC is the nonlinear counterpart of traditional hierarchical clustering. Another key advantage of the K-profiles clustering method is that, by building statistical inference into the iterations, noise genes that do not belong to any cluster do not interfere with cluster profile estimation and are naturally left out of the final results.

#### 2. Methods

##### 2.1. Distance Based on Conditional Ordered List (DCOL)

We first consider the definition of Distance Based on Conditional Ordered List (DCOL) in two-dimensional space. Given two random variables $X$ and $Y$ and the corresponding data points $(x_i, y_i)$, $i = 1, \dots, n$, after sorting the points on the $x$-axis to obtain

$$(x_{(1)}, y_{(1)}), (x_{(2)}, y_{(2)}), \dots, (x_{(n)}, y_{(n)}), \quad x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}, \tag{1}$$

the DCOL is defined as

$$d_{\mathrm{DCOL}}(Y \mid X) = \sum_{i=2}^{n} \left| y_{(i)} - y_{(i-1)} \right|. \tag{2}$$

Intuitively, when $Y$ is less spread in the order sorted on $X$, $d_{\mathrm{DCOL}}(Y \mid X)$ is small. We can use $d_{\mathrm{DCOL}}(Y \mid X)$ to measure the spread of the conditional distribution $Y \mid X$ in a nonparametric manner [16].
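Computing DCOL takes only a few lines. The sketch below (Python with NumPy; function name ours) sorts the points along the $x$-axis and sums the absolute differences of adjacent $y$ values, following the definition above:

```python
import numpy as np

def dcol(x, y):
    """DCOL of y given x: sum of absolute differences between y values
    that are adjacent after sorting the points along the x-axis."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    y_sorted = y[np.argsort(x)]        # y values in the x-order
    return np.abs(np.diff(y_sorted)).sum()
```

For a perfectly monotone relationship, the x-order visits the $y$ values in sorted order and DCOL attains its minimum; a random pairing of $x$ and $y$ gives a larger value.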

The statistical inference on $d_{\mathrm{DCOL}}(Y \mid X)$ can be conducted using a permutation test. Under the null hypothesis that $X$ and $Y$ are independent of each other, the ordering of the data points based on $X$ is simply a random reordering of the $y$ values. Thus we can randomly permute the $y$ values $B$ times and calculate the sum of distances between adjacent $y$ values in each permutation. Then we can find the mean and standard deviation from the $B$ values sampled from the null distribution. The actual $d_{\mathrm{DCOL}}(Y \mid X)$ can then be compared to the estimated null distribution to obtain the $p$ value. Notice this process does not depend on $X$. The permutation can be done once for $Y$, and the resulting null distribution parameters apply to any $X$, which greatly saves computing time.
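The permutation step can be sketched as follows. This is an illustrative implementation, not the authors' code: it estimates the null mean and standard deviation from permutations of $y$ alone, and computes a one-sided $p$ value under a normal approximation to the null (small DCOL indicating association):

```python
import math
import numpy as np

def dcol_null_params(y, n_perm=500, seed=0):
    """Null mean and standard deviation of DCOL for a gene y, estimated
    by permuting y (equivalent to a random ordering of x). Depends only
    on y, so it is computed once per gene and reused for any x."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, float)
    stats = np.array([np.abs(np.diff(rng.permutation(y))).sum()
                      for _ in range(n_perm)])
    return stats.mean(), stats.std()

def dcol_pvalue(d_obs, mu, sigma):
    """One-sided p value under a normal approximation to the null;
    smaller DCOL values give smaller p values."""
    return 0.5 * (1 + math.erf((d_obs - mu) / (sigma * math.sqrt(2))))
```

Because the null parameters depend only on $y$, one pass over the genes suffices regardless of how many candidate profiles are tested later.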

##### 2.2. Defining a Cluster Profile and Generalizing DCOL to Higher Dimensions

Let $\mathbf{x}$ be an $m$-dimensional random vector $(x_1, \dots, x_m)$, where each $x_j$ is a random variable; then an instance of the random vector can be seen as a point in the $m$-dimensional space. Assuming the $n$ instances of the random vector are sorted in the $m$-dimensional space, then $d_{\mathrm{DCOL}}(y \mid \mathbf{x})$ can be computed according to (2) for any random variable $y$. Therefore, the key problem is to define the order of a series of $m$-dimensional points.

When $\mathbf{x}$ is one-dimensional, we can easily show that a list of numbers $(y_1, \dots, y_n)$ is sorted if and only if $\sum_{i=2}^{n} |y_i - y_{i-1}|$ is minimized. We generalize this to the $m$-dimensional space and define the $n$ instances $(\mathbf{x}_1, \dots, \mathbf{x}_n)$ as sorted if and only if the sum of distances between adjacent $m$-dimensional points is minimized. Sorting the points is then equivalent to finding the shortest Hamiltonian path through the $n$ points in $m$ dimensions, the solution of which is linked to the Traveling Salesman Problem (TSP) [17]. Many methods exist for solving the TSP [17].
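As a concrete illustration of this ordering step, the sketch below uses a greedy nearest-neighbor heuristic, a simple stand-in for the dedicated TSP solvers cited in the text (the function name is ours):

```python
import numpy as np

def profile_order(points):
    """Greedy nearest-neighbor approximation of the shortest Hamiltonian
    path through the rows of `points` (an n x m array). Returns the
    visiting order of the row indices."""
    pts = np.asarray(points, float)
    remaining = set(range(1, len(pts)))
    order = [0]                        # start arbitrarily at the first point
    while remaining:
        last = pts[order[-1]]
        nearest = min(remaining, key=lambda j: np.linalg.norm(pts[j] - last))
        order.append(nearest)
        remaining.remove(nearest)
    return order
```

In one dimension this recovers the ordinary sorted order (up to direction), matching the equivalence stated above.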

If we consider the random variables $x_1, \dots, x_m$ as genes, we have effectively defined a profile for the cluster made up of these genes: the ordering of the samples along the shortest Hamiltonian path. Using this profile, we can compute $d_{\mathrm{DCOL}}(y \mid \mathbf{x})$ for any gene $y$ and determine whether the gene is close to this cluster, which serves as the foundation of the K-profiles algorithm.

##### 2.3. The K-Profiles Algorithm

In this section, we outline the DCOL-based nonlinear K-profiles clustering algorithm. First, we define the gene expression data matrix $G \in \mathbb{R}^{p \times n}$, in which $n$ samples are measured for $p$ genes and each cell $g_{ij}$ is the measured expression level of gene $i$ in sample $j$. Each row represents the expression pattern of a gene, while each column represents the expression profile of a sample.

The K-profiles clustering process is analogous to the traditional K-means algorithm overall. However, there are two key differences: (1) instead of the mean vector of all data points belonging to a cluster, we use a data point ordering (a Hamiltonian path through the samples) as the cluster profile; (2) during the iterations, the association of each point with its closest cluster is tested for statistical significance, and points that are not significantly associated with any cluster do not contribute to the estimation of cluster profiles.

Due to the random initialization of clusters, we use a loose $p$ value cutoff at the beginning and tighten it iteration by iteration, as the updated cluster profiles become more stable and reflect the authentic clusters more reliably as the clustering process progresses.

(a) To start, compute the null distribution of DCOL distances for each gene (row) by permuting the columns of the matrix 500 times, obtaining two parameters for each gene: the null mean $\mu_i$ and standard deviation $\sigma_i$. These gene-specific null distribution parameters are used to compute the $p$ value of the DCOL whenever a gene is assigned to its closest cluster.

(b) Initialize $K$ clusters by generating $K$ random sample orders as cluster profiles; set the $p$ value cutoff to its upper bound.

(c) For each row vector, compute its DCOL distance $d_{\mathrm{DCOL}}(g_i \mid c_j)$ to each cluster according to the corresponding cluster profile, where $g_i$ is the $i$th gene and $c_j$ is the profile of the $j$th cluster. Assign the gene to the closest cluster if the DCOL is statistically significant in terms of its $p$ value. In this step, we are implicitly computing $K$ $p$ values for each gene and taking the minimum, so the $p$ value cutoff must be adjusted to address the multiple testing issue. We assume each cluster profile is independent of the others; then, for each gene, the $K$ $p$ values are independent. Under the null hypothesis that the gene is not associated with any of the clusters, all $K$ $p$ values are *i.i.d.* samples from the standard uniform distribution. Thus the nominal $p$ value cutoff $\alpha$ is transformed to $1 - (1 - \alpha)^{1/K}$.

(d) When all gene vectors have been assigned, recalculate the profile of each cluster using a TSP solver.

(e) Repeat steps (c) and (d) until the cluster profiles no longer change or the maximum number of iterations is reached. We start with a loose $p$ value cutoff and reduce it by a small amount in each iteration, until the target $p$ value cutoff is reached.

The above procedure is conditioned on a given $K$, the number of clusters. We used the gap statistic to determine $K$. Other options, such as prediction strength or finding the elbow of the variance versus cluster number plot, are also available; here we replace the sum of variances by the sum of negative $p$ values.
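Steps (a)–(e) can be sketched as a compact loop. This is a minimal illustration, not the authors' implementation: it uses a fixed $p$ value cutoff instead of a gradually tightened one, a normal approximation to the permutation null, and a greedy nearest-neighbor ordering in place of a full TSP solver; all names are ours.

```python
import math
import numpy as np

def greedy_order(points):
    """Nearest-neighbor sample ordering, standing in for a TSP solver."""
    pts = np.asarray(points, float)
    remaining, order = set(range(1, len(pts))), [0]
    while remaining:
        last = pts[order[-1]]
        nxt = min(remaining, key=lambda j: np.linalg.norm(pts[j] - last))
        order.append(nxt)
        remaining.remove(nxt)
    return np.array(order)

def k_profiles(G, K, n_iter=10, alpha=0.5, n_perm=200, seed=0):
    """Sketch of K-profiles clustering, steps (a)-(e).
    G is the p x n gene-by-sample matrix; returns a cluster label per
    gene, with -1 marking genes not significantly associated with any
    cluster."""
    rng = np.random.default_rng(seed)
    p, n = G.shape
    dcol = lambda row, order: np.abs(np.diff(row[order])).sum()
    # (a) gene-specific null parameters from column permutations
    null = np.array([[dcol(row, rng.permutation(n)) for _ in range(n_perm)]
                     for row in G])
    mu, sd = null.mean(axis=1), np.maximum(null.std(axis=1), 1e-12)
    # (b) K random sample orders as initial cluster profiles
    profiles = [rng.permutation(n) for _ in range(K)]
    cutoff = 1 - (1 - alpha) ** (1 / K)  # adjust for the min of K p values
    labels = np.full(p, -1)
    for _ in range(n_iter):
        # (c) assign each gene to its closest cluster if significant
        for i, row in enumerate(G):
            d = np.array([dcol(row, pr) for pr in profiles])
            j = int(d.argmin())
            pval = 0.5 * (1 + math.erf((d[j] - mu[i]) / (sd[i] * math.sqrt(2))))
            labels[i] = j if pval < cutoff else -1
        # (d) re-estimate each cluster profile from its member genes
        for j in range(K):
            members = G[labels == j]
            if len(members) >= 2:
                profiles[j] = greedy_order(members.T)
    # (e) the full method also tightens the cutoff across iterations
    return labels
```

Note how the noise genes simply retain the label -1: they never enter the `members` matrix and therefore never influence a profile update, which is the built-in screening property described above.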

##### 2.4. Simulation Study

We generated simulated datasets with 100 samples (columns) and $K$ gene clusters, each containing 100 genes (rows). Another 100 pure noise genes were added to the data. $K$ was set to 10 or 20 in separate simulation scenarios. Within each cluster, we set the genes (rows) to be either linearly or nonlinearly correlated using different link functions, including (1) linear, (2) sine curve, (3) box wave, and (4) absolute value (Figure 1).
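A cluster of this kind can be simulated by passing a shared latent signal through a link function and adding noise. The sketch below is illustrative: the link names follow the four patterns listed above, but the signal range, frequencies, and noise level are our assumptions, not the paper's exact settings.

```python
import numpy as np

def simulate_cluster(n_samples=100, n_genes=100, link="sine", noise=0.2, seed=0):
    """One simulated gene cluster: every gene is a link function of a
    shared latent signal plus Gaussian noise. Returns an
    n_genes x n_samples matrix."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(-1, 1, n_samples)              # shared latent signal
    links = {
        "linear":   lambda z: z,
        "sine":     lambda z: np.sin(2 * np.pi * z),
        "box":      lambda z: np.sign(np.sin(2 * np.pi * z)),  # box wave
        "absolute": lambda z: np.abs(z),
    }
    f = links[link]
    return np.vstack([f(t) + noise * rng.standard_normal(n_samples)
                      for _ in range(n_genes)])
```

Stacking several such blocks with different seeds and link functions, plus rows of pure noise, reproduces the overall design of the simulation study.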