Abstract
We developed a multinomial ordinal probit model with singular value decomposition for testing a large number of single nucleotide polymorphisms (SNPs) simultaneously for association with multidisease status when sample size is much smaller than the number of SNPs. The validity and performance of the method was evaluated via simulation. We applied the method to our real study sample recruited through the Mexican-American Coronary Artery Disease study. We found 3 genes (SORCS1, AMPD1, and PPARα) to be associated with the development of both IGT and IFG, while 5 genes (AMPD2, PRKAA2, C5, TCF7L2, and ITR) with the IGT mechanism only and 6 genes (CAPN10, IL4, NOS3, CD14, GCG, and SORT1) with the IFG mechanism only. These data suggest that IGT and IFG may indicate different physiological mechanism to prediabetes, via different genetic determinants.
1. Introduction
Genome-wide association studies (GWASs) examine genetic variants across the entire genome to improve the understanding of genetic components underlying complex human disease. With whole-genome genotyping techniques that allow GWAS to involve hundreds of thousands of single nucleotide polymorphisms (SNPs), many studies have successfully identified novel genetic components for many diseases or related quantitative traits. However, the sample size is often limited due to the difficulty of recruiting patients and/or the cost of research. This leads to the situation that the number of SNPs () is much larger than the samples () available, that is, , and makes the traditional statistical methods unsuitable for analyzing multiple SNPs simultaneously. Most of current GWASs deal with this shortcoming by performing single SNP association test, which analyzes one SNP at a time and results in a huge multiple testing problem. These motivated us to develop methods that can avoid the multiple testing problem, in other words, methods that can evaluate multiple SNPs simultaneously when . We first introduced the iterative Bayesian variable selection (IBVS) method [1], which analyzes all SNPs simultaneously when and uses the Bayesian variable selection [2] iteratively to find SNPs that are associated with disease. The method was successfully applied to the simulated rheumatoid arthritis data provided by the Genetic Analysis Workshop 15 (GAW15). We later introduced the Bayesian classification with singular value decomposition (BCSVD) method [3]. The method applies the singular value decomposition (SVD) to the covariate matrix, which is usually the genotype data in GWAS and reduces the dimension of parameters to be estimated to the number of samples. This makes the method feasible to handle multiple SNPs simultaneously when . The validation of the method was demonstrated by applying to the simulated data provided by GAW16. We now extend our method from binary disease to the analysis of polytomous ordinal response variables. We propose here a multinomial ordinal probit model with singular value decomposition method. We show the validity of the newly developed method by applying it to simulated data sets as well as to a real study sample to identify genes contributing to two different mechanisms for prediabetes, namely, impaired glucose tolerance (IGT) and impaired fasting glucose (IFG). With the simulated data sets, we demonstrate that this new method is superior to single SNP analysis method and, with the real data, identify different genes for each mechanism.
2. Method and Materials
2.1. Multinomial Probit Model with Singular Value Decomposition
Logit and probit models are statistical models that are widely used for the analysis of categorical (ordinal/nominal) data. The difference between these two models is the choice of the link function relating the linear predictor to the expected value; the probit model uses the inverse normal cumulative distribution, and the logit model uses the logit transformation. As discussed by Greene [4], in most cases, the choice of the link function is largely a matter of taste. We utilized the probit model here to analyze data with polytomous ordinal response variables. In general, the multinomial ordinal probit model can be expressed by latent (unobserved) continuous variables associated with categorical responses. Let us assume that responses are observed, where takes one of the -ordered categories and are real numbers of bin boundaries, which satisfy that . As discussed by Albert and Chib [5], we denote that are latent continuous random variables. We assume that the latent variable () associated with a categorical outcome () can be explained in terms of an underlying linear model, and that the observed response has category if and only if falls between and . The multinomial ordinal probit model is equivalent to the following model: where is a vector of the explanatory variables for the sample, and is a vector of parameters to be estimated. In vector-matrix notation, we can have the multinomial ordinal probit model where is the vector of latent variables, is the matrix of the explanatory variables, is the vector of unknown regression coefficients, and follows an independent standard multivariate normal distribution,. By applying SVD to the matrix in (2.2), when , the matrix can be expressed as , where is the singular value factor loading matrix with orthonormal columns so that , is the SVD orthogonal factor matrix with , and, the diagonal matrix of positive singular values, ordered as . When , the smallest singular values in are replaced with 0, that is, . Therefore, in the product , the last columns of both and for which are ignored since they interact with the block of zeros in . Hence, this leads to another form of SVD, , that is, the product of the first columns of , the upper block of , and the first columns of . Since the difference between the both scenario is only in dimension of matrices in SVD, we assume that in the rest part of the paper for convenience. Thus, the model in (2.2) with the SVD of can be written as follows: where and . Therefore, , the vector of latent variables in (2.3), has a multivariate normal distribution, that is, . As shown in (2.3), is expressed by a linear combination of the original parameters (). Hence, we call as the vector of superfactors. The model in (2.3) represents a massive dimension reduction from to parameters. The regression model with parameters reduced to that with parameters derived from the SVD of the covariate matrix . Therefore, the statistical inference on the original parameter turns into the superfactors. Let denote the vector of probabilities associated with the assignment of the sample into categories , where denote the probability that a sample falls into category . From (2.1) and (2.3), it follows that where and denote the probability density function and the cumulative density function of the standard normal distribution, respectively, and is the row of matrix . Let denote the vector of responses observed for all samples. Then, the probability of observing data is given as follow: From (2.4) and (2.5), the log likelihood function for can be written as
2.2. Model Fitting with Maximum Likelihood Estimation
The maximum likelihood estimates (MLEs) of the superfactors, , in (2.3) can be obtained by the iteratively reweighted least squares (IRLSs) procedure [6] using the log likelihood function for in (2.6). The procedure can be briefly described as follows. Let denote the vector of all model parameters, that is, . Note that and are not included in this vector because their values are assumed to be 0 and , respectively, for the purpose of model identifiability. Also note that the smallest singular values together with their corresponding factors are dropped from the parameters since the number of parameters must not exceed the number of samples. Assuming that, define that and , where denotes the derivative of the standard normal cumulative distribution function at . Take , where is the vector of probabilities that the individual falls in each category, that is, , and let be a vector of observation, that is, . After initialization of all elements, the iteration can be written as The MLE of can be found by performing the process recursively until the change between and is negligible.
2.3. General Solution for the Original Parameters
We have discussed how to estimate the superfactor () in (2.3) thus far. Since the primary interest is to find SNPs that are significantly associated with a disease, it is necessary to transform the superfactor () to the original parameters () in (2.1). The equation in (2.3) can be utilized for the transformation even though is nonsquare matrix. As discussed by Graybill [7], the unique solution for can be achieved by taking the generalized inverse matrix of as since . Therefore, the unique solution for SNP effect () can be calculated by .
2.4. Selection of Significant SNPs
Finding significant SNPs is the same as testing if each SNP effect is statistically significant, that is, testing the hypothesis: : versus : , . The simple method is to use Wald’s test statistic, which forms and assumes a normal distribution. However, when , it is hard to calculate directly from the data. We therefore utilized permutation test to select significant SNPs. The rationale behind the test is that, under the null hypothesis, the estimate of obtained from the raw (unpermuted) data is similar to the estimate of obtained from the permuted data. That is, the difference between two estimates is closed to zero under . With this idea, we can construct Wald’s test statistic as follows. Let () be the estimate of the SNP effect from the raw data and () be the estimate of the SNP effect from the -permuted data. Let us define as the difference between and , that is, . Then, Wald’s test statistic can be as follows: where is the sample mean of ’s, which is =, and se() is the standard error of . Under the null hypothesis, the statistic defined in (2.9) follows approximately standard normal distribution when is large. value for rejecting the null hypothesis at a significance level can be utilized to identify significant SNPs.
2.5. Application of the Multinomial Probit Model with SVD
2.5.1. Simulated Multinomial Ordinal Data
The validity of the proposed method was evaluated using simulated data sets. The procedure of data generation was composed of three steps: generating genotype data with certain genetic model, generating the latent variable, and defining the disease status variable by applying the predefined bin boundaries. The brief scheme of each step is as follows: we first generated 10 sets of the simulated genotype data under an additive genetic model, each set consists of 100 samples and 1000 SNPs. From (2.1), we can notice that the latent variable () consists of two parts: the expected value () and the random error (). In order to generate the expected value, we assumed that, for each sample, 9 out of the 1000 SNPs (every 101 SNP, except the last one) contribute to disease status , where and are set as −1 and 1, respectively. Hence, the latent variable can be obtained from the sum of the expected values () and the random error generated from standard normal distribution. We then generated disease status variable () assuming 3 disease development stages. Therefore, when applying the proposed method to the simulated data sets, we would expect 9 strong signals corresponding to each of the 9 disease-associated SNPs. We also compared results obtained from the proposed method with that from single SNP analysis with multinomial ordinal probit model.
2.6. Mexican-American Coronary Artery Disease (MACAD) Study
We also applied the proposed method to study sample recruited through the Mexican-American Coronary Artery Disease (MACAD) study [8, 9]. The study population consists of probands who are Mexican American aged between 45 and 75 with coronary artery disease: spouses of probands, adult offspring (≥18), and their spouses. For the offspring generation, we performed oral glucose tolerance test and genotyped 132 SNPs in 32 genes selected based on a prior relationship to insulin physiology. The goal of the study herein was to identify genes involved in the development of IGT and/or IFG, where IGT was defined as a 2 hr glucose level between 140 and 199 mg/dL and IFG defined as a fasting glucose level between 100 and 125 mg/dL. In order to identify and compare genes affecting the development of IGT and/or IFG, we generated two study samples, for which each sample has 3 disease stages (D1) both 2 hr and fasting glucoses normal , IGT only () and IGT and IFG (IGT/IFG) (D2) both 2 hr and fasting glucoses normal , IFG only () and IGT and IFG (IGT/IFG) .
3. Results and Discussion
3.1. Simulated Multinomial Data
Figure 1 summarizes the results of association analyses when applying the multinomial ordinal probit model with SVD to the simulated data sets. All numbers shown in the figures are the average of the estimates obtained from the 10 simulated data sets. As mentioned previously, we expected 9 strong signals corresponding to the 9 SNPs designed to be associated with disease development when generating the simulated data sets, and 9 were observed in our analysis. Similar results from the single SNP analysis were shown in Figure 2.
(a) Estimates of SNP effects by the proposed method
(b) values of testing SNP effects ((P value)) |
(a) Estimates of SNP effects by single SNP analysis
(b) values of testing SNP effects ((P value)) |
Figure 1(a) summarizes MLEs of SNP effects calculated with the multinomial ordinal probit model with SVD. The figure shows that almost all MLEs except 9 were between −0.1 and 0.1, while there were 9 large MLEs (4 around 0.3, 5 around −0.3) corresponding to the 9 SNPs contributed to disease status. Figure 1(b) gives values in scale for testing SNP effects. The line in Figure 1(b) corresponds to significance level . 9 SNPs were clearly separated from the rest and had .
Figure 2(a) summarizes MLEs of SNP effects obtained by the single SNP analysis and shows that no signal was strong enough to be distinguished from all other signals. The values are given in Figure 2(b) in scale. SNP501 in the middle of the figure had a relatively strong signal compared to all others. However, the ( value) was much less than 1.3, which corresponds to significance level . Thus, no SNPs were identified as statistically significant from the single SNP analysis method. In contrast to the fact that no SNP was identified as statistically significant by the single SNP analysis, the multinomial ordinal probit model with SVD method was able to identify all 9 SNPs contributing to disease status as statistically significant at significance level . These results indicated that the proposed method should be reliable for the analysis of large-scale genome-wide association data that have polytomous ordinal responses when .
3.2. Mexican-American Coronary Artery Disease (MACAD) Study
We analyzed the data sets D1 and D2 (see methods) generated from a subsample of subjects recruited through a coronary artery disease proband in the Mexican-American Coronary Artery Disease Project as described in the method section, using both the multinomial ordinal probit model with SVD method and the single SNP analysis method.
Figure 3 summarizes the analysis results with the data D1 (N/N-IGT/N-IGT/IFG) using the single SNP analysis. Figure 3(a) gives MLEs of SNP effects. Figure 3(b) plots values of association analysis in scale. With Sidak correction, which is often used to correct multiple testing problem, the adjusted significance level should be , where is significance level, and represents the number of tests. Thus, the corrected ( value) threshold for significance level is 3.4, which corresponds to the line in Figure 3(b). We applied the adjusted significance level to the values in Figure 3(b) since the values are before correcting multiple testing problem. No SNP was identified as statistically significant (Figure 3(b)).
(a) Estimates of parameters by single-SNP analysis
(b) values of testing SNP effects ((P value)) |
The data set D2 (N/N-N/IFG-IGT/IFG) was analyzed with the same method, and analysis results are given in Figure 4. Figure 4(a) plots MLEs of the SNP effects. Since values in Figure 4(b) are before the the multiple testing correction, we used 3.4 as the ( value) threshold corresponding to 0.05 significance level after the multiple testing correction. Two SNPs corresponding to SORC1 and SORT1 were found significant.
(a) Estimates of parameters by single SNP-analysis
(b) values of testing SNP effects ((P value)) |
We then also analyzed D1 and D2 with the multinomial ordinal probit model with SVD method. Figure 5 summarizes the analysis results for data D1. Figure 5(a) plots MLEs of SNP effects. Figure 5(b) plots values in scale for testing SNP effects. The multiple testing correction does not need to be applied now since the method tests all SNPs simultaneously. With the 1.3 value threshold, which corresponds to 0.05 significance level, we identified that 8 out of the 32 candidate genes (SORCS1, AMPD1, PPAR, AMPD2, PRKAA2, C5, TCF7L2, and ITR) were associated with the disease path defined in D1.
(a) Estimates of parameters by the proposed method
(b) values of testing SNP effects ((P value)) |
The multinomial ordinal probit model with SVD method was applied to data set D2 as well. The results are shown in Figure 6. In Figure 6(a), MLEs of the SNP effects were summarized. Figure 6(b) plots the values in scale for testing the SNP effects. It showed that 11 SNPs corresponding to 9 out of 32 candidate genes (SORCS1, AMPD1, PPAR, CAPN10, IL4, NOS3, CD14, GCG, and SORT1) have ( value) greater than the 1.3 value threshold. From the analyses of D1 and D2, we found that SNPs in 3 genes (SORCS1, AMPD, and PPAR) were associated with both IGT and IFG; SNPs in 5 genes (AMPD2, PRKAA2, C5, TCF7L2, and ITR) were associated with IGT only; SNPs in 6 genes (CAPN, IL4, NOS3, CD14, GCG, and SORT1) were associated with IFG only. These results suggest that IGT and IFG may indicate different pathways to diabetes, with different genetic determinants.
(a) Estimates of parameters by the proposed method
(b) values of testing SNP effects ((P value)) |
Thus, using both simulated data and a real study sample, we demonstrated that multinomial ordinal probit model with SVD method can be utilized to identify associated markers involved in disease development when multidisease stages are considered. For relatively small size of data set used in the paper, which is 100 samples and 1000 SNPs for the simulation study, the computation took about less than 10 minutes to complete. However, the computation time might be a concern when applying this method to large data set, such as GAWS with millions of SNPs and thousands of samples.
Acknowledgments
This research was partly supported by Mexican American Coronary Artery Disease (MACAD) study Grant NIH-NHLBI HL 088457, Diabetes Endocrinology Research Center (DERC) Grant NIH-NIDDK DK063491, and UCLA Clinical and Translational Science Institute (CTSI) Grant NIH-NCATS UL1TR000124.