BioMed Research International

Volume 2015, Article ID 462549, 13 pages

http://dx.doi.org/10.1155/2015/462549

## Identifying and Assessing Interesting Subgroups in a Heterogeneous Population

^{1}Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, 17177 Stockholm, Sweden^{2}Department of Statistics, Inha University, Incheon 402-751, Republic of Korea^{3}Department of Microbiology, Tumour and Cell Biology, Bioinformatics Infrastructure for Life Sciences, Science for Life Laboratory, Karolinska Institutet, 17177 Stockholm, Sweden^{4}Department of Oncology and Pathology, Science for Life Laboratory, Karolinska Institutet, 17121 Solna, Sweden^{5}Genomics, Institut Gustave Roussy, F-94805 Villejuif, France

Received 10 November 2014; Revised 1 March 2015; Accepted 3 March 2015

Academic Editor: Kristel van Steen

Copyright © 2015 Woojoo Lee et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Biological heterogeneity is common in many diseases and it is often the reason for therapeutic failures. Thus, there is great interest in classifying a disease into subtypes that have clinical significance in terms of prognosis or therapy response. One of the most popular methods to uncover unrecognized subtypes is cluster analysis. However, classical clustering methods such as *k*-means clustering or hierarchical clustering are not guaranteed to produce clinically interesting subtypes. This could be because the main statistical variability—the basis of cluster generation—is dominated by genes not associated with the clinical phenotype of interest. Furthermore, a strong prognostic factor might be relevant for a certain subgroup but not for the whole population; thus an analysis of the whole sample may not reveal this prognostic factor. To address these problems we investigate methods to identify and assess clinically interesting subgroups in a heterogeneous population. The identification step uses a clustering algorithm and to assess significance we use a false discovery rate- (FDR-) based measure. Under the heterogeneity condition the standard FDR estimate is shown to overestimate the true FDR value, but this is remedied by an improved FDR estimation procedure. As illustrations, two real data examples from gene expression studies of lung cancer are provided.

#### 1. Introduction

Biological heterogeneity is common in many diseases; heterogeneity complicates clinical management, as it is often the reason for prognostic and therapeutic failures. Thus, there have been many attempts to classify a disease into subtypes with anticipation that different subgroups are associated with different clinical significance in terms of prognosis or therapy response (e.g., [1, 2]). A significant progress in designing efficient specific treatments can be achieved if novel clinically relevant subtypes are found.

One of the most popular methods for finding unrecognized subtypes is cluster analysis. However, classical clustering methods such as -means clustering or hierarchical clustering are not guaranteed to produce clinically interesting subtypes because the main statistical variability could be dominated by genes not associated with interesting clinical phenotypes. Furthermore, it could be that prognostic factors shared within a subgroup do not have any important role in other subgroups. Thus, the association between prognostic factors and a clinical phenotype is attenuated and not detectable in the whole population. To address these problems, we extend the standard clustering algorithm to find interesting subgroups in the sense that within the subgroup we can find factors (in this paper: genes) strongly associated with the clinical phenotype. This idea can perhaps be illustrated more clearly as follows: suppose that is an outcome (e.g., relapse) and is a randomized treatment; it is common to search for a subgroup for which the treatment effect is largest. In effect we are searching for factors that have significant interactions with , such that a subgroup defined by will have a large treatment effect on . A unique point in our current application is that both and are given by the same set of gene expression data. Also, we allow complex subgroups to be discovered by a clustering method, which makes the process distinct from the standard interaction analysis.

Given a set of gene expression matrix, our goal of cluster analysis is to group patients and genes into subgroups that convey biological or clinical significance. This task can be translated to the biclustering problem. Biclustering methods attempt to simultaneously cluster both patients and genes with the goal of finding subsets of rows and columns in the expression matrix. Cheng and Church [3] firstly introduced biclustering to gene expression analysis. For reviewing the details of biclustering algorithms, see [4]. As Nowak and Tibshirani [5] noticed, however, most of biclustering algorithms tend to be dominated by groups of highly differentially expressed (DE) genes that may not be relevant to the biological process in question. In other words, irrelevant genes with strong signal can mask genes of highest biological relevance. Furthermore, iterative optimization methods adopted in biclustering algorithms depend on initial conditions. To overcome these limitations, we develop an extensive clustering search algorithm to find molecular subtypes (CAMS) based on clustering of patients with partially similar mRNA profile. CAMS is able to uncover the structures arising from relevant genes that may not be highly expressed but moderately expressed within each subtype.

CAMS produces many subtypes. For each subtype, -statistics comparing two distinct phenotype groups (e.g., relapsed/not relapsed) are computed for whole genes and false discovery rate (FDR) estimate is used to correct for multiple comparisons. The number of genes having small FDR estimates (say, less than 0.1) is the basis for assessing the importance of the subtypes. In real data analysis, however, it is a common occurrence in heterogeneous populations that value distributions of the two-sample -statistics show substantial shortage of small values compared to the uniform distribution [6]. If we ignore this effect we would miss potential discoveries by overestimating FDR. Since subtypes produced by CAMS still can be heterogeneous, it is crucial to study how the molecular heterogeneity of distinct subtypes affects the FDR estimate. In this paper, we introduce unobserved group (or latent group) variables into a simple model for gene expression and see how the heterogeneity induced by the unobserved group leads to the depletion of small values even when there are many significant signatures. Thus, without considering this underlying heterogeneity, the use of standard FDR estimate might hide promising discoveries. To resolve this problem, we develop an improved FDR estimation procedure to address the heterogeneity in a dataset.

In estimating FDR, the use of correct null density function is critical. Efron [7] considered three issues that substantially affect the null density estimate in computing FDR: (1) a large proportion of genuine but uninterestingly small effects, (2) hidden correlations, and (3) unobserved covariates. Many researchers have studied how they affect the standard FDR estimate [7–9]. In particular, possible connections between unobserved covariates and FDR have been explored in [6, 10]. Leek and Storey [6] showed numerically that the small values range from being inflated to depleted depending on the configuration of the unobserved covariates. They developed the so-called surrogate variable analysis (SVA) for capturing heterogeneity induced by the unobserved covariates and studied how SVA affects FDR estimate. Stegle et al. [10] considered a Bayesian method to account for hidden confounding variation in expression quantitative trait loci (QTLs) and showed that the method found additional expression QTLs in real datasets. However, their approaches were suggested to study the attenuated relationship by heterogeneity between a measured variable of interest and clinical outcomes, while we focus on finding submerged subtypes by heterogeneity. The novel contributions of this paper are (1) to explain how the heterogeneity induced by unobserved group leads to the depletion of small values analytically, (2) to analyze the bias of standard FDR estimates under the heterogeneity, and (3) to develop an improved FDR estimation procedure. With these in mind a FDR-based measure is considered to assess findings from a novel clustering procedure. This is illustrated using two datasets on lung cancer patients.

The rest of this paper is organized as follows. In Section 2, we describe the implementation details of CAMS. A brief review of notations and a standard FDR estimation method are given in Section 3, and it is analytically shown that the hidden subgroup in the population can induce a bias of standard FDR estimate in Section 4. We propose a FDR estimation procedure resolving the bias problem and show how to assess clustering results from CAMS with it in Sections 5 and 6. Section 7 includes two real data applications and is followed by concluding remarks.

#### 2. Clustering Algorithm for Finding Molecular Subtypes

Consider a set of gene expression profiles from a group of cancer patients. The premise behind CAMS is that the novel molecular information on cancer heterogeneity is hidden in the gene expression profiles. To uncover the heterogeneity, CAMS implements a two-dimensional clustering “patients versus genes" extensively. The full algorithm is given in Algorithm 1.