Abstract

Human diseases are abnormal medical conditions in which multiple biological components are complicatedly involved. Nevertheless, most contributions of research have been made with a single type of genetic data such as Single Nucleotide Polymorphism (SNP) or Copy Number Variation (CNV). Furthermore, epigenetic modifications and transcriptional regulations have to be considered to fully exploit the knowledge of the complex human diseases as well as the genomic variants. We call the collection of the multiple heterogeneous data “multiblock data.” In this paper, we propose a novel Multiblock Discriminant Analysis (MultiDA) method that provides a new integrative genomic model for the multiblock analysis and an efficient algorithm for discriminant analysis. The integrative genomic model is built by exploiting the representative genomic data including SNP, CNV, DNA methylation, and gene expression. The efficient algorithm for the discriminant analysis identifies discriminative factors of the multiblock data. The discriminant analysis is essential to discover biomarkers in computational biology. The performance of the proposed MultiDA was assessed by intensive simulation experiments, where the outstanding performance comparing the related methods was reported. As a target application, we applied MultiDA to human brain data of psychiatric disorders. The findings and gene regulatory network derived from the experiment are discussed.

1. Introduction

Human diseases involve complex processes that include interactive actions of biological multiple layers such as genetic, epigenetic, and transcriptional regulation. Conducting research based on a single type of biological data produces insufficient results to fully exploit the knowledge of the complex human diseases. The prior research shows that it is essential for the study to be based on a comprehensive consideration of the multiple biological data to grasp an in-depth understanding of the complex mechanisms of the human diseases and the identification of disease markers. The recent advances of high-throughput technologies such as DNA microarray and sequencing technologies efficiently profile various types of genomic data. The genomic data include Single Nucleotide Polymorphism (SNP), Copy Number Variation (CNV), DNA methylation (DM), and gene expression (GE). Integrative genomic analysis of the heterogeneous genomic data plays an important role in profiling a global view of a biological system as well as identifying significant markers of the human diseases.

However, most research has focused solely on investigations of a single type of the genomic data. Genome-Wide Association Studies (GWAS) examine genetic loci which are associated with a trait (e.g., major diseases) using the SNP data [1, 2]. GWAS normally compare the SNP arrays of two groups, disease (case) and normal (control) samples. If a genetic variation on a locus with the disease samples is statistically significant to the controls, the SNP is considered associated with the disease, whereas expression Quantitative Trait Loci (eQTL) studies have been actively done to identify genetic loci that regulate gene expression [3]. Combining the gene microarray data with GWAS not only enables the capture of gene regulatory interactions but also provides insight into the genetic mechanism that regulates gene expression variations. However, both GWAS and eQTL mapping studies still remain as a “missing heritability” problem [4].

In addition to SNP, Copy Number Variation (CNV) and DNA methylation (DM) have also been highlighted as key factors that affect the gene expression regulation. CNV is a structural alternation of DNA in which specific regions of the genome are deleted or duplicated on chromosomes. Although CNV is frequently observed even in healthy individuals, it is hypothesized that the variants may cause diseases by directly affecting gene dosage and gene expression [5, 6]. Specifically, whole-genome association studies of the relationship between CNV and diseases reported that gene expression levels in CNV regions are strongly related to the deletion or duplication of the regions [6]. Typically, the deletion of either particular regions within a gene or regulatory regions of a gene may result in a lower gene expression than what is normally expressed. DM is an epigenetic modification that occurred by the addition of methyl group to the cytosine or adenine of DNA. DM inhibits transcription of the genes with high levels of 5-methylcytosine in their promoter region or recruits proteins such as histone deacetylases that can modify histones [7, 8]. The functionality of DM consequently changes the gene expression levels even on the same DNA bases.

Thus, recent research has actively extended GWAS and eQTL mapping studies to the integrative association studies with multiple types of genomic data. Most integrative genomic research focuses on identifying genetic, epigenetic, or posttranscriptional factors that control gene expression regulation (or microRNA) by considering the complex interactions of SNP, CNV, and DM [911]. Specifically, the Cancer Genomic Atlas [9] conducted large-scale multidimensional analysis with SNP, CNV, DM, and GE to provide comprehensive genomic characterizations for brain cancer. In Aure et al.’s work [10], the combination effects of CNV and DM were examined to identify the association with alterations of miRNA expression in breast tumors. Wagner et al. [11] studied the relationship between SNP, DM, and GE via multiple eQTL analysis.

Most of the integration approaches have used step-by-step processes. Ordinarily, approaches filter candidate markers by using statistical techniques at the first step and find the final markers that satisfy certain criteria at the remaining stages [1215]. This type of integration method often makes increased “type II errors” at each step, that is, fails to find informative markers by incorrectly identifying them as insignificant. Moreover, they do not consider interaction effects of the multiblock data. Mechanism was not considered.

Hence, research has recently started to shift toward approaches using systematical models in order to integrate and analyze the heterogeneous data comprehensively rather than through simple step-wise processes [1618]. Multiblock methods of Partial Least Squares (PLS) and Generalized Canonical Correlation Analysis (GCCA) are representative methods. A derivative of a sparse version of PLS was proposed by penalizing both features and sample dimensions to identify “regulatory modules” [16]. Such PLS-based methods, which maximize the covariance between latent variables, often fail to detect significant factors when their intensities are weak. Furthermore, the method lacks the consideration of the discriminant analysis of the disease. A sparse multiblock analysis method derived from Generalized Canonical Correlation (SGCCA) was developed to identify multiblock association models while considering the relationship between the different data block such as cis-regulated mutations [17]. This work builds a hybrid model by combining both GWAS and eQTL models rather than a multiblock integration model. The data integration approach was suggested by utilizing multiple feature selection methods such as Principal Component Analysis (PCA), PLS, and LASSO [18]. They extracted the important factors using the dimensional reduction and feature selection methods and applied them on Cox survival models. However, combination effects of the multiblock data were ignored in this approach.

To tackle these limitations, we propose a novel Multiblock Discriminant Analysis (MultiDA) method for the integrative genomic study. The proposed method MultiDA makes the following main contributions.(i)A new integrative genomic model for the discriminant analysis is introduced by exploiting class information.(ii)A sophisticated optimal solution is developed to solve the discriminant analysis problem in the integrative genomic model.

First, we built a novel integrative genomic model for the discriminant analysis. The class data is considered as one block, and the total squared correlation including the class block is maximized. The introduction of the class block to the multiblock model enables us to perform discriminant analysis in the integrative genomic model. Secondly, we propose a sophisticated method to solve the discriminant analysis problem in the new integrative genomic model. The discriminant analysis is essential in identifying biomarkers of human diseases in computational biology. Regardless, it has been overlooked in the multiblock analysis. The efficient algorithm for the discriminant analysis and assessment of its performance are explored in this paper.

2. Methods

2.1. Notation

We suppose that there are multiblock data. The multiblock data are measured on numbers of the same set of observations. A block consists of a group of features that share common properties or represent one aspect of the sample. The multiblock data is denoted by . The th block data is -dimensional zero mean column vectors . A matrix is a binary matrix that determines the linkage between the multiblock, where if the block and the block are connected or 0 if otherwise. In the proposed integrative genomic model, SNP, CNV, DM, GE, and class label (case or control) of the samples are considered as the multiblock components. For simplicity, , , , , and represent SNP, CNV, DM, GE, and class label, respectively. Through this paper, we use for the index of the sample and for the multiblock. is used to denote a column vector of a matrix or an element of a vector. For instance, and represent the th column vector of the matrix and the th element of the vector , respectively. Figure 1 illustrates the conceptual overview of the multi-block data and framework.

2.2. Multiblock Discriminant Analysis

Multiblock Discriminant Analysis (MultiDA) builds a sparse association model by not only maximizing the total squared correlations between the multiblocks but also taking into account the discriminative factors in the model. MultiDA considers a linear subspace which is a construction of low-dimensional basis of the data. The linear subspaces of the multiblock, which maximize the total squared correlations, identify the significant factors of the association model with sparsity regularization. The linear subspace (or latent variable) of the th block is represented bywhere is a loading vector. Then, we introduce sparse regularization (elastic net penalization) on the loading vector to reduce the chance of including insignificant variables and to improve their interpretation. The sparse regularization has its advantage especially when the number of features is much larger than the sample number (). Therefore, the basic objective function can be represented as where and represent -norm and -norm of the vectors, respectively, and and are the shrinkage parameters that determine the sparsity. Note that the basic objective function is equivalent to the Sparse Generalized Canonical Correlation Analysis (SGCCA) [17]. Since the integrative genomic model aims to represent gene expression regulated by the combinations of SNP, CNV, and DM, the matrix can be defined as

We further consolidate the model by (1) introducing a weight matrix of the correlation for the balance of the model and (2) providing discriminant analysis in the integrative genomic model. We also provide the sophisticated solution of the model while SGCCA heuristically estimates the optimal solution by following Wold’s algorithm in the previous work [17].

2.2.1. Weight Matrix for the Balance of the Model

The weight matrix of the correlation between the multiblocks, , is introduced in the model. In the original multiblock model, the correlation between gene expression and class label block tends to be overlooked. Instead, the sum of the squared pairwise correlations of , , , and contributes large portions. The correlation weight matrix gives an equal balance of the total squared correlations. In this paper, the weight matrix is defined aswhere the correlation between gene expression and class label blocks is three times more weighted than others. Then, the matrix simply replaces the matrix .

2.2.2. Discriminant Analysis

In the proposed integrative genomic model, we need to find discriminative genes that characterize diseases. However, the integrative genomic model is comprised of combinations of multiple linear regression models. Thus, discriminant analysis such as Logistic Regression (LR) and Linear Discriminant Analysis (LDA) cannot be embedded into the integrative genomic model. To solve this problem, we adapted the Discriminative Least Squares Regression (DLSR) method proposed by Xiang et al. [19]. DLSR was developed based on the linear regression model, and it is proved that DLSR provides equal or superior performance compared to other discriminant methods. The basic concept of DLSR is to enlarge the distance between classes by introducing slack variables. Whereas they considered a multi-class problem and developed its sparse version with -norm regularization in their work, we reformulated its sparse method with elastic net penalization to suit our own needs. In DLSR, the slack variable is introduced into the ordinary linear regression problem: where is a dependent variable (), is a multivariate independent variable (), and is a coefficient vector (). is a direction of the class, where its element if or 1 if otherwise (). The Hadamard product operator of the direction vector and the slack variable vector determines the distance between classes (). The optimal solution will be covered in the next section.

2.2.3. The Objective Function of MultiDA

We finally obtain the objective function of MultiDA:where is defined asThis setting enables one to perform discriminant analysis between gene expression and disease blocks.

2.3. Optimization

The optimal solution of (6) can be obtained by the Lagrangian function:where and are the Lagrangian multipliers. The Lagrangian function (8) is convex, although not differentiable. Therefore, the local optimum of (8) provides a global solution. The partial derivatives of the Lagrangian function with respect to and are derived fromwhere is the vector of ’s sign. Although the stationary equations have no closed form solutions, the optimal solution can be estimated by an iterative algorithm.

We can make (9) simple with the inner component:Then, by introducing the inner component into (9), the solution of can be written asIn (11), is a squared correlation between the latent variables of the th and th block, which is a scalar. Therefore, the inner component is computed by of the previous iteration, and then new is updated in iterations.

Equation (12) is the normal equation of the regression of on with ridge and shrinkage parameter [20]. The final solution can be obtained by using the Univariate Soft-Thresholding (UST) method [21]:where returns a sign of , that is, if or if otherwise. returns only positive values of (i.e., if or if otherwise). can be obtained by -fold cross-validation that minimizes mean squared errors. The parameter can be ignored because the solution of is normalized by (10):

For the discriminant analysis between gene expression and disease data blocks, the optimum of the slack variable and the loading vector can be estimated by solving the following optimization problem:The Lagrangian function of (15) is . The derivative of the Lagrangian function with respect to is where is the sign of and . Thus, the equation of becomes Finally, the optimal solution of for the discriminative analysis is is also determined by -fold cross-validation that minimizes mean squared errors like other ’s. The optimal solutions of are simply derived from [19]The brief algorithm is described in Algorithm 1. In the algorithm, represents a rank of the subspace, which determines the dimension of the subspace. For instance, is th rank of . MultiDA optimizes the first rank subspace and iterates the optimization until the multiblock has no information. In lines 10–14 of Algorithm 1, Wold’s procedure guarantees the convergence [22].

(1) For all block, normalize loading vectors
   
(2)
(3) repeat
(4)  for to do
(5)   for to do
(6)    if block is binary class data then
(7)     estimate and by (18) and (19)
(8)     update
(9)    end if
(10)   if then
(11)     
(12)   else if then
(13)    
(14)   end if
(15)   Compute by UST
      
(16)   Normalize
      
(17)   
(18)  end for
(19) end for
(20) until converges

3. Experiment Results

The goal of the assessment is to identify significant factors of the integrative genomic model with the multiblock data, specifically the discriminative factors of human disease. The discriminant factors include disease-specific locations or regions of SNP, CNV, DNA methylation, and gene expression against normal patients.

3.1. Simulation Study

We assessed the performance of the proposed method MultiDA through simulated data. Simulation data of various complexities were considered. Generation’s schemes of the simulation data for the assessment were extended from the previous related works [16, 23].

Four generation functions of different complexity are defined as shown in Table 1. generates -dimensional normally distributed random variables of a given mean () and a variance (), where is an identity matrix. generates more complicated data than . In , a random model with a threshold () is implemented with the function . Given a uniform distributed random value (), if or 0 if otherwise. considers multicollinearity data in which more than two variables are highly correlated. The matrix data are generated by multivariate normal distribution . The covariance structure is built by the first order of autoregressive process. generates -dimensional normally distributed random variables from a given mean () and a variance ().

The first three multiblocks () were simulated by compounding the generation functions as defined in Table 2, where , , , and . For instance, the first five columns of were generated by and the following five columns were by . The next 30 columns were generated by the generation model with a threshold . The remaining columns of were generated by the multicollinearity random variables . Then, we considered the multiblock linear model, , where is a loading matrix and is a dimensional normally distributed noise matrix (). We assumed that only the first ten variables of each block are significant to explain . The fifth block is class label block. Given a coefficient vector (all zeros but the first ten), the probability of disease was computed by usingThen, the binary class label block was generated using the Bernoulli distribution with the probability .

The simulation study was examined with 50 replications to assess the reproducibility. We compared the performance of MultiDA with the related methods, Sparse Canonical Correlation Analysis (SCCA) [24] and Sparse Generalized Canonical Correlation Analysis (SGCCA) [17]. SCCA is a two-block method that maximizes the correlation between independent and response variable . In SCCA, the three blocks of data were combined into a single block (), and the block GE was considered as response (). The class label block was not considered in SCCA. The multiblock method SGCCA was tuned to be compatible with the proposed integrative genomic model. Note that the same matrix was used in SGCCA, but SGCCA did not take the discriminant analysis into account.

We examined the performance by how well they correctly identify significant factors of the integrative association model. Given a ground truth, we computed a confusion matrix and measured True Positive Rate (TPR), Positive Predictive Value (PPV), and Accuracy (ACCU). In the sparse setting, the true negatives are relatively much larger than false positives. Therefore, True Negative Rates (TNR) and Negative Predictive Values (NPV) were not included in this paper. The results of the simulation experiment are illustrated in Figure 2. The proposed method MultiDA () and the multiblock method SGCCA () outperformed SCCA () in terms of TPR. It supports that the multiblock methods reduce false negatives that incorrectly identify the significant as the insignificant. MultiDA appeared as the best performance in PPV and ACCU. MultiDA produced and for PPV and ACCU, respectively. Higher PPV values represent lower false positives that incorrectly identify the insignificant as the significant. The PPV and ACCU of SCCA were and and were and for SGCCA, respectively.

3.2. Human Brain Data of Schizophrenia

Human brain data were obtained from three major psychiatric disorders such as schizophrenia (SZ), bipolar disorder (BP), and major depression (DP) as well as from control group. Specifically, 39 samples of SZ, 35 samples of BP, 12 samples of DP, and 43 samples of control were provided from the Stanley Medical Research Institute. SNP, CNV, DNA methylation, and gene expression data were acquired from the human prefrontal cortex of the 129 samples in the preparation of this experiment. For each individual, 10,760 SNPs after removing highly correlated ones, 1,028 CNVs, 20,769 DNA methylations, and 19,767 gene expressions were examined. Due to the recent research that reported that genetic effects may be largely shared in major psychiatric disorders such as autism spectrum disorder, attention deficit-hyperactivity disorder, bipolar disorder, major depressive disorder, and schizophrenia, we considered those psychiatric diseases together and performed MultiDA to identify discriminate factors against the control [25, 26].

The multiblock data was analyzed by MultiDA. As a result of the analysis, 78 SNPs, 30 CNVs, 47 DNA methylations, and 35 genes were detected, where the high correlation between the connections was found. The potential gene markers of the psychiatric disorders were inferred from the result of the proposed method. The genes physically located near the selected SNPs and the genes corresponding to the result of CNV and the DNA methylation were chosen. Significantly observed genes among the results of MultiDA are listed in Table 3, where the data source of the gene and literature regarding the psychiatric disorders are described.

The gene regulatory network of the genes from the result was searched by STRING database [27]. Among a number of the retrieved interactions, we take note of one gene regulatory network illustrated in Figure 3. The interaction network consists of HTR7, ADCY8, HTR1F, NPY, CA2, RYR2, QDPR, AKR1D1, and CES1 gene. HTR7 is inferred from the gene expression set, HTR1F and CA2 are from the DNA methylation expression, NPY and CES1 are from the CNV, and the others are from the SNP data. The negative coefficient of HTR1F in the model may support the widely accepted notion that DNA methylation suppresses gene regulation impeding the binding of transcriptional proteins to the gene [34]. In particular, the HTR7 gene (5-hydroxytryptamine receptor 7) is a major neurotransmitter in the central nervous system, and a number of literatures related to bipolar and schizophrenia disorder are reported [28]. Interestingly, the HTR7 gene was found in the gene expression data block in this study, while the other previous researches reported the gene with GWAS on the SNP data block. The gene may have strong incorporated interactions with other heterogeneous data, which is consequently considered to be significant in the integrative model. It supports the strength of the integrative approach. Moreover, we found that HTR7 and NPY are in the same pathway, which is neuroactive ligand-receptor interaction, where the NPY gene is also a neurotransmitter in the brain and is known to play an important role in the emotional process [30]. A large number of psychiatric disorder susceptible genes were associated with this pathway [25]. ADCY8, which interacts with both HTR7 and NPY, may be potentially a susceptibility gene that causes the psychiatric disorders. In previous research [35], they found that ADCY8 is a susceptibility gene for avoidance behavior on mouse and also found that it indirectly induces the susceptibility on human mood disorders. Our result supports their claim.

4. Conclusion

In this paper, we developed the novel Multiblock Discriminant Analysis method in order to dissect the mechanism of complex human disease using multiple genetic data. The genomic association study with single type data may fall short of identifying the mechanisms of the diseases. On the other hand, MultiDA enables comprehensive analysis using multiple genetic data. Moreover, MultiDA provides analysis for the special setting of binary class data, where it greatly detects discriminative factors in the integrative genomic model. The simulation experiments support the outstanding performance of the proposed methods. As a target application, psychiatric disorder disease data, including SNP, CNV, DNA methylation, and gene expression, were analyzed in the integrative genomic model. Among the large number of variables of each block, candidate biomarkers were proposed as significant components of the disease mechanism. The proposed methods capture the global profile of the mechanism that conventional single or two block methods fail to detect. This promising tool for the integrative genomic study can provide flexible extensibility for new types of data in the era, superseding new high-throughput technologies.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.