International Journal of Genomics

Volume 2018, Article ID 6591634, 9 pages

https://doi.org/10.1155/2018/6591634

## Detecting Differentially Variable MicroRNAs via Model-Based Clustering

^{1}Department of Mathematics and Statistics, York University, Toronto, ON, Canada^{2}Channing Division of Network Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, USA

Correspondence should be addressed to Yuejiao Fu; ac.ukroy.tatshtam@oaijeuy

Received 31 December 2017; Revised 12 June 2018; Accepted 26 June 2018; Published 12 July 2018

Academic Editor: Wilfred van IJcken

Copyright © 2018 Xuan Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Identifying differentially variable (DV) genomic probes is becoming a new approach to detect novel genomic risk factors for complex human diseases. The test is the standard equal-variance test in statistics. For high-throughput genomic data, the probe-wise test has been successfully used to detect biologically relevant DNA methylation marks that have different variances between two groups of subjects (e.g., cases versus controls). In addition to DNA methylation, microRNA (miRNA) is another important mechanism of epigenetics. However, to the best of our knowledge, no studies have identified DV miRNAs. In this article, we proposed a novel model-based clustering method to improve the power of the probe-wise test to detect DV miRNAs. We imposed special structures on covariance matrices for each cluster of miRNAs based on the prior information about the relationship between variances in cases and controls and about the independence among them. Simulation studies showed that the proposed method seems promising in detecting DV probes. Based on two real datasets about human hepatocellular carcinoma (HCC), we identified 7 DV-only miRNAs (hsa-miR-1826, hsa-miR-191, hsa-miR-194-star, hsa-miR-222, hsa-miR-502-3p, hsa-miR-93, and hsa-miR-99b) using the proposed method, one (hsa-miR-1826) of which has not yet been reported to be related to HCC in the literature.

#### 1. Introduction

Investigating the relationship between genomics and complex human diseases has greatly improved our understanding of the molecular mechanisms of, and the interplay of environmental factors and genomic factors to, the complex human diseases. High-throughput data from cutting-edge technologies have substantially facilitated the unbiased discovery of the genetic risk factors for many diseases. The standard approach to identify disease-associated genomic probes is to test if the mean level (e.g., DNA methylation) between cases and controls is significantly different. In addition to the mean, the variance is another important summary statistic. The larger the variance is, the more information the data could provide. However, the information about variance has not been directly used to detect disease-associated genomic probes until recent years.

Several groups of researchers have recently identified DNA methylation marks that have different variances between cases and controls [1–3]. They observed that (1) for differentially variable (DV) DNA methylation marks the variability in cases is usually higher than that in controls and (2) DV DNA methylation marks are biologically relevant. DNA methylation is an example of an epigenetic modification. Such modification leads to heritable changes via regulation of gene expression, without changing the genetic code. DNA methylation inhibits gene expression by adding a methyl group to the cytosine or adenine DNA nucleotides. Another example of an epigenetic modification is microRNAs (miRNAs) that are short noncoding 18–25-nucleotide-long RNA and negatively regulate mRNA translation [4, 5]. However, to the best of our knowledge, no studies have investigated differential variability for miRNAs. The main objective of this article is to develop statistical methods to detect DV miRNAs between cases and controls.

The test is the classical method to test for equal variance between two groups of subjects, which evaluates whether the ratio of sample variances between two groups is significantly different from one. For high-throughput genomic data, such as DNA methylation data, the probe-wise test could be used. That is, we first perform the test for each probe to test for equal variances between cases and controls. We then calculate FDR-adjusted value to control for multiple testing, where FDR stands for false discovery rate. If the FDR-adjusted value < 0.05 for a DNA methylation mark, we then claim that this DNA methylation mark is differentially variable between cases and controls. The advantages of this probe-wise approach include flexibility (one model per probe) and easy implementation. However, DV probes might be governed by the same underlying mechanism. Statistically speaking, DV probes might follow the same distribution. Similarly, non-DV probes might also follow the same distribution. We hypothesize that these underlying distributions of variances could help us improve the power of the test to detect DV probes.

A few methods have been proposed in the literature to borrow information across probes to detect differentially variable genomic probes. For example, Bar et al. [6] proposed a mixture-model approach for parallel detection of differential variances in genomic data analysis, by assuming that the ratio of sample variances between two groups for a given probe is drawn from a three-component mixture. Bar et al. [7] introduced a bivariate model (N3) to account for both differential expression and differential variation in high-throughput data analysis, by assuming that both means and variances follow three-component mixture distributions. Bar and Schifano [8] proposed a unified three-component mixture model, the L_{2}N model, that can be used to detect either differential expression (mean) or differential variation, by modeling the differences of means and variances (dispersions) between two groups of samples. In the L_{2}N model, one log-normal component is used to fit underexpressed (dispersed) probes, one log-normal component is used to fit overexpressed (dispersed) probes, and one normal component is used to fit nondifferentially expressed (dispersed) probes. These models characterize the distributions of the summary statistics (e.g., mean, variance, or difference of means), instead of the observed expression levels.

In this article, we propose a mixture of three-component multivariate normal distributions to fit the expression levels of miRNAs to identify DV miRNAs between cases and controls.

#### 2. Method

##### 2.1. Model

We assume that miRNAs belong to one and only one of the following three clusters: (1) miRNAs having higher variances in cases than in controls (denoted as the OV cluster), (2) miRNAs having equal variances between cases and controls (denoted as the EV cluster), and (3) miRNAs having smaller variances in cases (denoted as the UV cluster). We followed Qiu et al. [9] to directly model the marginal distributions of miRNAs in the 3 clusters. In this article, we modified Qiu et al.’s marginal model [9] to allow the detection of DV probes. We assume that (1) data have been normalized to remove the effects of confounding factors, such as chip effect and batch effect, and (2) data have been transformed so that the distributions of miRNA expressions are close to normal distributions.

For a given miRNA, we denote *X _{i}* as the preprocessed expression for the th subject, , where , is the number of cases, and is the number of controls. For the th cluster (, 2, or 3), we assume that the expressions of the cases are identically distributed with mean and variance . We assume that the expressions of the controls are identically distributed with mean and variance . According to Qiu et al. [9], ’s are marginally correlated with correlation for cases and for controls. We also assume that (1) cases and controls are independent, and (2) the random vector (

*X*

_{1},…,

*X*)

_{m}*follows a multivariate normal distribution. For the OV cluster, we require that . For the UV cluster, we require that . For the EV cluster, we require that . We allow the means and correlations to be different between cases and controls in the EV cluster.*

^{T}We used the EM algorithm [10] to estimate the model parameters , , , and . The posterior probability is used to assign the *g*th miRNA to one of the 3 clusters, where is the density function of the multivariate normal distribution for the th cluster. If is the largest posterior probability among , , and , then the *g*th miRNA will be assigned to the 1st cluster (i.e., OV cluster). The supplementary document gives the details about the model and the corresponding parameter estimation procedure.

##### 2.2. Real Datasets

We downloaded two miRNA datasets from NIH’s Gene Expression Omnibus (GEO) [11]: GSE67138 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67138) and GSE67139 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67139). Both datasets are from the same project that aims at detecting miRNAs differentially expressed between human hepatocellular carcinoma (HCC) tumor tissues with and without vascular invasion. GSE67138 is the first batch containing 57 samples (34 invasive tumor tissues and 23 noninvasive tumor tissues), while GSE67139 is the second batch containing 120 samples (60 invasive tumor tissues and 60 noninvasive tumor tissues). The expression levels of miRNAs in both GEO datasets were measured by using Affymetrix Multispecies miRNA-1 Array (GPL8786). Both datasets contain 847 miRNAs.

We checked the data quality by visualizing the plot (Figure A1) of percentiles across arrays and the scatterplot (Figure A2) of the first two principal components. Both plots indicate that the two datasets have been cleaned and have good quality (i.e., no apparent outlying miRNAs, outlying arrays, or technical batch effects). Hence, we directly used the two datasets in the further analyses. Since GSE67139 has a larger sample size than GSE61738, we regarded GSE67139 as the discovery set and GSE67138 as the validation set.

##### 2.3. Simulation

We conducted 4 sets of simulation studies. In the first set (denoted as SimI), we generated miRNA data from the proposed marginal mixture model, where estimated model parameters for GSE67139 (i.e., the discovery set) are used as the true values of the model parameters (, , , , , , , , , , , , , , , , , , , , ). We generated 100 datasets, each of which has 1000 miRNAs for 50 cases and 50 controls. Thirty one percent (310) of the 1000 miRNAs are in the OV cluster. Eleven percent (110) of the miRNAs are in the UV cluster. The remaining 58% (580) miRNAs are in the EV cluster.

In the second set (denoted as SimII), we generated miRNA data from a mixture of 3 multivariate distribution with the same mean vectors and covariance matrices as those in SimI and with 3 degrees of freedom. SimII is used to evaluate the performance of the proposed method when the normality assumption for any one of the three clusters (OV, EV, and UV) is violated.

In the third set (denoted as SimIII) of the simulation studies, we generated miRNA data from the same model as that in SimI, except that the marginal correlation within-subject groups were set to zero ( and ). SimIII is used to evaluate the performance of the proposed method when there are no marginal correlations.

In the fourth set (denoted as SimIV) of the simulation studies, we generated miRNA data from the same model as that in SimII, except that the marginal correlations within-subject groups were set to zero ( and ). SimIV is used to evaluate the performance of the proposed method when there are no marginal correlations and when the normality assumption for any one of the three clusters (OV, EV, and UV) is violated.

##### 2.4. Statistical Analysis

We compared the proposed method (denoted as gs) with sixteen existing differential-variance detecting methods by using both the real datasets and the simulated datasets. The ten equal variance tests are (1) the test (denoted as F), (2) Ahn and Wang’s score test [12] (denoted as AW), (3) Phipson and Oshlack’s AD test [13] (denoted as PO.AD), (4) Phipson and Oshlack’s SQ test [13] (denoted as PO.SQ), (5) Levene’s test [14] (denoted as L), (6) Brown and Forsythe’s test [15] (denoted as BF), (7) trimmed-mean-based Levene’s test [15] (denoted as Ltrim), (8) improved AW test based on Levene’s test [16] (denoted as iL), (9) improved AW test based on the BF test [16] (denoted as iBF), and (10) improved AW test based on the trimmed-mean-based Levene’s test [16] (denoted as iTrim). The remaining six methods are based on Bar et al.’s [7] N3 model and Bar and Schifano’s [8] L_{2}N model. Both N3 and L_{2}N models have been implemented in the R package DVX [8]. For both N3 and L_{2}N, DVX outputs raw values, values, and posterior probabilities that the probe *g* belongs to cluster *k* given its expression profile and estimated model parameters, . Hence, for both N3 and L_{2}N, we used three methods to assign probes to two clusters: DV probes and non-DV probes. The first method is based on the value. If a miRNA has a value < 0.05, we claim it is differentially variable; otherwise, we claim it is nondifferentially variable. The second method is based on the false discovery rate- (FDR-) adjusted value. If a miRNA has an FDR-adjusted value < 0.05, we claim it is differentially variable; otherwise, we claim it is nondifferentially variable. The third method is based on the posterior probabilities. We assign a miRNA to cluster if the posterior probability is the largest among the 3 posterior probabilities, , , and . We denote the 3-miRNA assignment methods as N3.q (L_{2}N.q), N3.f (L_{2}N.f), and N3 (L_{2}N), respectively.

In real data analysis, we followed Qiu et al.’s [9] data preprocessing steps. That is, we first performed the same Box-Cox transformation for each expression level, and then for each miRNA, we performed mean centering and scaling operations so that the mean expression level is 0 and the variance is 1. We then applied the 17 methods (the gs method and the 16 existing methods) to the discovery set (GSE67139) to detect DV miRNAs between invasive tumors and noninvasive tumors. For the 10 probe-wise tests (F, AW, PO.AD, PO.SQ, L, BF, Ltrim, iL, iBF, and iTrim), we obtained FDR-adjusted values. If a miRNA has an FDR-adjusted value < 0.05, we claim that this miRNA has significantly different variances between invasive tumors and noninvasive tumors. We then applied the same procedure to the validation set (GSE67138). We claim that a miRNA is a validated DV miRNA (1) if the miRNA is DV in both discovery and validation sets and (2) if the sign of the difference () is the same in both datasets, where and are sample variances for cases and controls, respectively. We next calculated the proportion of the validated DV miRNAs (i.e., validation rate) , where is the number of DV miRNAs in the discovery set (GSE67139) and *n*_{12} is the number of significant DV miRNAs sharing the same difference direction of variances in both data sets. To estimate the variation of the validation rate pValid, we obtained the 100 bootstrap validation rates based on 100 bootstrap discovery and validation sets. We then test if the median bootstrap validation rate of the gs method is the same as that of each of the other 16 methods by two-sided Wilcoxon signed rank tests.

For the validated DV miRNAs detected by the gs method, we also checked if they are validated differentially expressed (DE) miRNAs by using R Bioconductor package *limma* [17]. A miRNA is a validated DE miRNA if the FDR-adjusted value for testing equal mean expression between cases and controls is <0.05 in both the discovery and validation sets and if the sign of the mean difference is the same in both discovery and validation sets, where and are the sample means of the cases and controls, respectively. Denote as the set of miRNAs that are validated DV, but not validated DE. Denote as the set of miRNAs that are validated DE, but not validated DV. Denote as the set of miRNAs that are both validated DE and validated DV.

We applied the miRSystem [18] to predict the target genes of miRNAs in each of the 3 sets: , , and . The miRSystem also provides the enriched KEGG pathways for the predicted target genes.

For simulated datasets, we calculated the magnitude of agreement between the true cluster memberships of miRNAs and the detected cluster memberships by each of the 17 methods by using the Jaccard index [9, 19]. The maximum value of the Jaccard index is one, indicating perfect agreement. The minimum value of the Jaccard index is zero, indicating that the agreement is by chance. We also evaluate the performances using false positive rate (FPR) (i.e., the proportion of detected DV probes among the true non-DV probes) and false negative rate (FNR) (i.e., the proportion of detected non-DV probes among the true DV probes). The smaller the FPR (FNR) is, the better the performance is.

#### 3. Result

For the real data analyses, the numbers of the DV miRNAs in the discovery set (GSE67139), and the numbers and proportions of the validated DV miRNAs are shown in Table 1. The gs method detected 358 DV probes based on the discovery set (GSE67139), 67 of which are validated in the validation set (GSE67138). Among the 67 validated DV miRNAs, 66 miRNAs are OV and only one miRNA is UV. The proportion of the validated DV miRNAs is 0.19 for the gs method, which is higher than those of the N3 and L_{2}N methods but lower than those of the 10 probe-wise tests. However, the gs method had the highest median bootstrap validation rate among all 17 methods (Figure 1). For all the 17 methods, the number (nValid.OV) of the validated OV miRNAs is much larger than the number (nValid.UV) of the validated UV miRNAs. This observation is consistent with that observed by other researchers using DNA methylation data [3].