Abstract

We developed a multinomial ordinal probit model with singular value decomposition for testing a large number of single nucleotide polymorphisms (SNPs) simultaneously for association with multidisease status when sample size is much smaller than the number of SNPs. The validity and performance of the method was evaluated via simulation. We applied the method to our real study sample recruited through the Mexican-American Coronary Artery Disease study. We found 3 genes (SORCS1, AMPD1, and PPARα) to be associated with the development of both IGT and IFG, while 5 genes (AMPD2, PRKAA2, C5, TCF7L2, and ITR) with the IGT mechanism only and 6 genes (CAPN10, IL4, NOS3, CD14, GCG, and SORT1) with the IFG mechanism only. These data suggest that IGT and IFG may indicate different physiological mechanism to prediabetes, via different genetic determinants.

1. Introduction

Genome-wide association studies (GWASs) examine genetic variants across the entire genome to improve the understanding of genetic components underlying complex human disease. With whole-genome genotyping techniques that allow GWAS to involve hundreds of thousands of single nucleotide polymorphisms (SNPs), many studies have successfully identified novel genetic components for many diseases or related quantitative traits. However, the sample size is often limited due to the difficulty of recruiting patients and/or the cost of research. This leads to the situation that the number of SNPs (𝑚) is much larger than the samples (𝑛) available, that is, 𝑚𝑛, and makes the traditional statistical methods unsuitable for analyzing multiple SNPs simultaneously. Most of current GWASs deal with this shortcoming by performing single SNP association test, which analyzes one SNP at a time and results in a huge multiple testing problem. These motivated us to develop methods that can avoid the multiple testing problem, in other words, methods that can evaluate multiple SNPs simultaneously when 𝑚𝑛. We first introduced the iterative Bayesian variable selection (IBVS) method [1], which analyzes all SNPs simultaneously when 𝑚𝑛 and uses the Bayesian variable selection [2] iteratively to find SNPs that are associated with disease. The method was successfully applied to the simulated rheumatoid arthritis data provided by the Genetic Analysis Workshop 15 (GAW15). We later introduced the Bayesian classification with singular value decomposition (BCSVD) method [3]. The method applies the singular value decomposition (SVD) to the covariate matrix, which is usually the genotype data in GWAS and reduces the dimension of parameters to be estimated to the number of samples. This makes the method feasible to handle multiple SNPs simultaneously when 𝑚𝑛. The validation of the method was demonstrated by applying to the simulated data provided by GAW16. We now extend our method from binary disease to the analysis of polytomous ordinal response variables. We propose here a multinomial ordinal probit model with singular value decomposition method. We show the validity of the newly developed method by applying it to simulated data sets as well as to a real study sample to identify genes contributing to two different mechanisms for prediabetes, namely, impaired glucose tolerance (IGT) and impaired fasting glucose (IFG). With the simulated data sets, we demonstrate that this new method is superior to single SNP analysis method and, with the real data, identify different genes for each mechanism.

2. Method and Materials

2.1. Multinomial Probit Model with Singular Value Decomposition

Logit and probit models are statistical models that are widely used for the analysis of categorical (ordinal/nominal) data. The difference between these two models is the choice of the link function relating the linear predictor to the expected value; the probit model uses the inverse normal cumulative distribution, and the logit model uses the logit transformation. As discussed by Greene [4], in most cases, the choice of the link function is largely a matter of taste. We utilized the probit model here to analyze data with polytomous ordinal response variables. In general, the multinomial ordinal probit model can be expressed by latent (unobserved) continuous variables associated with categorical responses. Let us assume that responses 𝑦1,𝑦2,,𝑦𝑛 are observed, where 𝑦𝑖 takes one of the 𝐽-ordered categories and 𝜃1,,𝜃𝐽 are real numbers of bin boundaries, which satisfy that =𝜃1𝜃𝐽=. As discussed by Albert and Chib [5], we denote that 𝑧1,𝑧2,,𝑧𝑛 are latent continuous random variables. We assume that the latent variable (𝑧𝑖) associated with a categorical outcome (𝑦𝑖) can be explained in terms of an underlying linear model, and that the observed response 𝑦𝑖 has category 𝑗 if and only if 𝑧𝑖 falls between 𝜃𝑗1 and 𝜃𝑗. The multinomial ordinal probit model is equivalent to the following model: 𝑧𝑖=𝑥𝑖𝛽+𝜖𝑖,𝜖𝑖𝑦𝑁(0,1),𝑖=1,,𝑛,𝑖=𝑗𝜃𝑗1<𝑧𝑖𝜃𝑗,𝑗=1,,𝐽,(2.1) where 𝑥𝑖 is a 1×𝑚 vector of the explanatory variables for the 𝑖th sample, and 𝛽 is a 𝑚×1 vector of parameters to be estimated. In vector-matrix notation, we can have the multinomial ordinal probit model 𝑧=𝑋𝛽+𝜖,(2.2) where 𝑧 is the 𝑛×1 vector of latent variables, 𝑋 is the 𝑛×𝑚 matrix of the explanatory variables, 𝛽 is the 𝑚×1 vector of unknown regression coefficients, and 𝜖 follows an independent standard multivariate normal distribution,𝜖𝑁(0,𝐼𝑛). By applying SVD to the matrix𝑋 in (2.2), when rank(𝑋)=𝑛, the matrix can be expressed as 𝑋=𝐴𝐷𝐹, where 𝐴 is the 𝑚×𝑛 singular value factor loading matrix with orthonormal columns so that 𝐴𝐴=𝐼𝑛,𝐹 is the 𝑛×𝑛 SVD orthogonal factor matrix with 𝐹𝐹=𝐹𝐹=𝐼𝑛, and𝐷=diag(𝑑1,,𝑑𝑛), the diagonal matrix of positive singular values, ordered as 𝑑1𝑑𝑛>0. When rank(𝑋)=𝑟<𝑛, the smallest 𝑛𝑟 singular values in 𝐷 are replaced with 0, that is, 𝑑1𝑑𝑟>𝑑𝑟+1==𝑑𝑛=0. Therefore, in the product 𝑋=𝐴𝐷𝐹, the last 𝑛𝑟 columns of both 𝐴 and 𝐹 for which 𝑑𝑟+1==𝑑𝑛=0 are ignored since they interact with the block of zeros in 𝐷. Hence, this leads to another form of SVD, 𝑋=𝐴𝑟𝐷𝑟𝐹𝑟, that is, the product of the first 𝑟 columns of 𝐴, the upper 𝑟×𝑟 block of 𝐷, and the first 𝑟 columns of 𝐹. Since the difference between the both scenario is only in dimension of matrices in SVD, we assume that rank(𝑋)=𝑛 in the rest part of the paper for convenience. Thus, the model in (2.2) with the SVD of 𝑋 can be written as follows: 𝑧=𝑋𝛽+𝜖=𝐴𝐷𝐹𝛽+𝜖=𝐹𝐷𝐴𝛽+𝜖=𝐿𝛾+𝜖,(2.3) where 𝐿=𝐹𝐷 and 𝛾𝑛×1=𝐴𝑛×𝑚𝛽𝑚×1. Therefore, 𝑧, the 𝑛×1 vector of latent variables in (2.3), has a multivariate normal distribution, that is, 𝑧𝑁(𝐿𝛾,𝐼𝑛). As shown in (2.3), 𝛾 is expressed by a linear combination of the original parameters (𝛽). Hence, we call 𝛾 as the vector of superfactors. The model in (2.3) represents a massive dimension reduction from 𝑚 to 𝑛 parameters. The regression model with 𝑚 parameters reduced to that with 𝑛 parameters derived from the SVD of the covariate matrix 𝑋. Therefore, the statistical inference on the original parameter turns into the superfactors. Let 𝑝𝑖=(𝑝𝑖1,,𝑝𝑖𝐽) denote the vector of probabilities associated with the assignment of the 𝑖th sample into categories 1,,𝐽, where 𝑝𝑖𝑗 denote the probability that a sample falls into category 𝑗. From (2.1) and (2.3), it follows that 𝑝𝑖𝑗=𝜃𝑗𝜃𝑗1𝜙𝑧𝑙𝑖𝛾𝜃𝑑𝑧=Pr𝑗1<𝑍𝑖<𝜃𝑗𝜃=Φ𝑗𝑙𝑖𝛾𝜃Φ𝑗1𝑙𝑖𝛾,(2.4) where 𝜙() and Φ() denote the probability density function and the cumulative density function of the standard normal distribution, respectively, and 𝑙𝑖 is the 𝑖th row of matrix 𝐿=𝐹𝐷. Let 𝑦=(𝑦1,,𝑦𝑛) denote the vector of responses observed for all samples. Then, the probability of observing data 𝑦 is given as follow: Pr𝑦𝑝𝑖=𝑛𝐽𝑖=1𝑗=1𝑝𝐼{𝑦𝑖=𝑗}𝑖𝑗.(2.5) From (2.4) and (2.5), the log likelihood function for (𝛾,𝜃) can be written as ln𝐿(𝛾,𝜃)=𝑛𝐽𝑖=1𝑗=1𝐼𝑦𝑖Φ𝜃=𝑗ln𝑗𝑙𝑖𝛾𝜃Φ𝑗1𝑙𝑖𝛾.(2.6)

2.2. Model Fitting with Maximum Likelihood Estimation

The maximum likelihood estimates (MLEs) of the superfactors, 𝛾, in (2.3) can be obtained by the iteratively reweighted least squares (IRLSs) procedure [6] using the log likelihood function for (𝛾,𝜃) in (2.6). The procedure can be briefly described as follows. Let 𝜂 denote the vector of all model parameters, that is, 𝜂=(𝜃2,,𝜃𝐽1,𝛾1,,𝛾𝑛𝐽+2). Note that 𝜃1 and 𝜃𝐽 are not included in this vector because their values are assumed to be 0 and , respectively, for the purpose of model identifiability. Also note that the (𝐽2) smallest singular values together with their corresponding factors are dropped from the parameters since the number of parameters must not exceed the number of samples. Assuming that𝐽=4, define that 𝒞𝑖=100110011001,𝑖=100𝑙𝑖010𝑙𝑖001𝑙𝑖,(2.7) and 𝑖=diag(𝑓𝑖1,,𝑓𝑖𝐽1), where f𝑖𝑗 denotes the derivative of the standard normal cumulative distribution function at 𝜃𝑗𝑙𝑖𝛾. Take 𝑊𝑖=diag(𝑝𝑖), where 𝑝𝑖 is the 𝐽×1 vector of probabilities that the 𝑖th individual falls in each category, that is, 𝑝𝑖=(𝑝𝑖1,,𝑝𝑖𝐽), and let 𝒩𝑖 be a 𝐽×1 vector of observation, that is, 𝒩𝑖=(𝐼{𝑦𝑖=1},,𝐼{𝑦𝑖=𝐽}). After initialization of all elements, the iteration 𝑠+1(𝑠=1,2,) can be written as 𝜂𝑠+1=𝜂𝑠+𝑛𝑖=1𝑖𝑖(𝑠)𝒞𝑖𝑊𝑖1(𝑠)𝒞𝑖𝑖(𝑠)𝑖𝑛1𝑖=1𝑖𝑖(𝑠)𝒞𝑖𝑊𝑖1(𝑠)𝒩𝑖𝑝𝑖(𝑠).(2.8) The MLE of 𝜂 can be found by performing the process recursively until the change between 𝜂𝑠+1 and 𝜂𝑠 is negligible.

2.3. General Solution for the Original Parameters

We have discussed how to estimate the superfactor (𝛾) in (2.3) thus far. Since the primary interest is to find SNPs that are significantly associated with a disease, it is necessary to transform the superfactor (𝛾) to the original parameters (𝛽) in (2.1). The equation 𝛾=𝐴𝛽 in (2.3) can be utilized for the transformation even though 𝐴 is 𝑚×𝑛 nonsquare matrix. As discussed by Graybill [7], the unique solution for 𝛽 can be achieved by taking the generalized inverse matrix of 𝐴 as 𝐴 since 𝐴𝐴=𝐼𝑛. Therefore, the unique solution for SNP effect (𝛽) can be calculated by 𝛽=𝐴𝛾.

2.4. Selection of Significant SNPs

Finding significant SNPs is the same as testing if each SNP effect (𝛽𝑖,𝑖=1,,𝑚) is statistically significant, that is, testing the hypothesis: 𝐻0: 𝛽𝑖=0 versus 𝐻1: 𝛽𝑖0, 𝑖=1,,𝑚. The simple method is to use Wald’s test statistic, which forms (̂̂𝛽𝛽)/se(𝛽) and assumes a normal distribution. However, when 𝑚𝑛, it is hard to calculate ̂se(𝛽) directly from the data. We therefore utilized permutation test to select significant SNPs. The rationale behind the test is that, under the null hypothesis, the estimate of 𝛽 obtained from the raw (unpermuted) data is similar to the estimate of 𝛽 obtained from the permuted data. That is, the difference between two estimates is closed to zero under 𝐻0. With this idea, we can construct Wald’s test statistic as follows. Let ̂𝛽𝑖 (𝑖=1,,𝑚) be the estimate of the 𝑖th SNP effect from the raw data and ̂𝛽𝑘𝑖 (𝑘=1,,𝐾) be the estimate of the 𝑖th SNP effect from the 𝑘th-permuted data. Let us define 𝛽𝑑𝑘𝑖 as the difference between ̂𝛽𝑖 and ̂𝛽𝑘𝑖, that is, 𝛽𝑑𝑘𝑖=̂𝛽𝑖̂𝛽𝑘𝑖. Then, Wald’s test statistic can be as follows: Λ𝑖=𝛽𝑑𝑖se𝛽𝑑𝑖,𝑖=1,,𝑚,(2.9) where 𝛽𝑑𝑖 is the sample mean of 𝛽𝑑𝑘𝑖’s, which is 𝛽𝑑𝑖=(1/𝐾)𝐾𝑘=1𝛽𝑑𝑘𝑖, and se(𝛽𝑑𝑖) is the standard error of 𝛽𝑑𝑖. Under the null hypothesis, the statistic Λ𝑖 defined in (2.9) follows approximately standard normal distribution when 𝑘 is large. 𝑃 value for rejecting the null hypothesis at a significance level 𝛼=0.05 can be utilized to identify significant SNPs.

2.5. Application of the Multinomial Probit Model with SVD
2.5.1. Simulated Multinomial Ordinal Data

The validity of the proposed method was evaluated using simulated data sets. The procedure of data generation was composed of three steps: generating genotype data with certain genetic model, generating the latent variable, and defining the disease status variable by applying the predefined bin boundaries. The brief scheme of each step is as follows: we first generated 10 sets of the simulated genotype data under an additive genetic model, each set consists of 100 samples and 1000 SNPs. From (2.1), we can notice that the latent variable (𝑧𝑖) consists of two parts: the expected value (𝑥𝑖𝛽) and the random error (𝜖𝑖). In order to generate the expected value, we assumed that, for each sample, 9 out of the 1000 SNPs (every 101th SNP, except the last one) contribute to disease status 𝑥𝑖𝛽=𝛽1SNP101+𝛽2SNP201++𝛽1SNP801+𝛽2SNP901, where 𝛽1 and 𝛽2 are set as −1 and 1, respectively. Hence, the latent variable can be obtained from the sum of the expected values (𝑥𝑖𝛽) and the random error generated from standard normal distribution. We then generated disease status variable (𝑦𝑖) assuming 3 disease development stages. Therefore, when applying the proposed method to the simulated data sets, we would expect 9 strong signals corresponding to each of the 9 disease-associated SNPs. We also compared results obtained from the proposed method with that from single SNP analysis with multinomial ordinal probit model.

2.6. Mexican-American Coronary Artery Disease (MACAD) Study

We also applied the proposed method to study sample recruited through the Mexican-American Coronary Artery Disease (MACAD) study [8, 9]. The study population consists of probands who are Mexican American aged between 45 and 75 with coronary artery disease: spouses of probands, adult offspring (18), and their spouses. For the offspring generation, we performed oral glucose tolerance test and genotyped 132 SNPs in 32 genes selected based on a prior relationship to insulin physiology. The goal of the study herein was to identify genes involved in the development of IGT and/or IFG, where IGT was defined as a 2 hr glucose level between 140 and 199 mg/dL and IFG defined as a fasting glucose level between 100 and 125 mg/dL. In order to identify and compare genes affecting the development of IGT and/or IFG, we generated two study samples, for which each sample has 3 disease stages (D1) both 2 hr and fasting glucoses normal (𝑁/𝑁)(𝑛1=60), IGT only (IGT/𝑁) (𝑛2=31) and IGT and IFG (IGT/IFG) (𝑛3=15) (D2) both 2 hr and fasting glucoses normal (𝑁/𝑁)(𝑛1=60), IFG only (𝑁/IFG) (𝑛2=34) and IGT and IFG (IGT/IFG) (𝑛3=15).

3. Results and Discussion

3.1. Simulated Multinomial Data

Figure 1 summarizes the results of association analyses when applying the multinomial ordinal probit model with SVD to the simulated data sets. All numbers shown in the figures are the average of the estimates obtained from the 10 simulated data sets. As mentioned previously, we expected 9 strong signals corresponding to the 9 SNPs designed to be associated with disease development when generating the simulated data sets, and 9 were observed in our analysis. Similar results from the single SNP analysis were shown in Figure 2.

Figure 1(a) summarizes MLEs of SNP effects calculated with the multinomial ordinal probit model with SVD. The figure shows that almost all MLEs except 9 were between −0.1 and 0.1, while there were 9 large MLEs (4 around 0.3, 5 around −0.3) corresponding to the 9 SNPs contributed to disease status. Figure 1(b) gives 𝑃 values in log10 scale for testing SNP effects. The line in Figure 1(b) corresponds to significance level 𝛼=0.05. 9 SNPs were clearly separated from the rest and had log10(𝑃value)>1.3.

Figure 2(a) summarizes MLEs of SNP effects obtained by the single SNP analysis and shows that no signal was strong enough to be distinguished from all other signals. The 𝑃 values are given in Figure 2(b) in log10 scale. SNP501 in the middle of the figure had a relatively strong signal compared to all others. However, the log10(𝑃 value) was much less than 1.3, which corresponds to significance level 𝛼=0.05. Thus, no SNPs were identified as statistically significant from the single SNP analysis method. In contrast to the fact that no SNP was identified as statistically significant by the single SNP analysis, the multinomial ordinal probit model with SVD method was able to identify all 9 SNPs contributing to disease status as statistically significant at significance level 𝛼=0.05. These results indicated that the proposed method should be reliable for the analysis of large-scale genome-wide association data that have polytomous ordinal responses when 𝑚𝑛.

3.2. Mexican-American Coronary Artery Disease (MACAD) Study

We analyzed the data sets D1 and D2 (see methods) generated from a subsample of subjects recruited through a coronary artery disease proband in the Mexican-American Coronary Artery Disease Project as described in the method section, using both the multinomial ordinal probit model with SVD method and the single SNP analysis method.

Figure 3 summarizes the analysis results with the data D1 (N/N-IGT/N-IGT/IFG) using the single SNP analysis. Figure 3(a) gives MLEs of SNP effects. Figure 3(b) plots 𝑃 values of association analysis in log10 scale. With Sidak correction, which is often used to correct multiple testing problem, the adjusted significance level should be 1(1𝛼)1/𝑚, where 𝛼 is significance level, and 𝑚 represents the number of tests. Thus, the corrected log10(𝑃 value) threshold for significance level 𝛼=0.05 is 3.4, which corresponds to the line in Figure 3(b). We applied the adjusted significance level to the 𝑃 values in Figure 3(b) since the 𝑃 values are before correcting multiple testing problem. No SNP was identified as statistically significant (Figure 3(b)).

The data set D2 (N/N-N/IFG-IGT/IFG) was analyzed with the same method, and analysis results are given in Figure 4. Figure 4(a) plots MLEs of the SNP effects. Since 𝑃 values in Figure 4(b) are before the the multiple testing correction, we used 3.4 as the log10(𝑃 value) threshold corresponding to 0.05 significance level after the multiple testing correction. Two SNPs corresponding to SORC1 and SORT1 were found significant.

We then also analyzed D1 and D2 with the multinomial ordinal probit model with SVD method. Figure 5 summarizes the analysis results for data D1. Figure 5(a) plots MLEs of SNP effects. Figure 5(b) plots 𝑃 values in log10 scale for testing SNP effects. The multiple testing correction does not need to be applied now since the method tests all SNPs simultaneously. With the 1.3 𝑃 value threshold, which corresponds to 0.05 significance level, we identified that 8 out of the 32 candidate genes (SORCS1, AMPD1, PPAR𝛼, AMPD2, PRKAA2, C5, TCF7L2, and ITR) were associated with the disease path defined in D1.

The multinomial ordinal probit model with SVD method was applied to data set D2 as well. The results are shown in Figure 6. In Figure 6(a), MLEs of the SNP effects were summarized. Figure 6(b) plots the 𝑃 values in log10 scale for testing the SNP effects. It showed that 11 SNPs corresponding to 9 out of 32 candidate genes (SORCS1, AMPD1, PPAR𝛼, CAPN10, IL4, NOS3, CD14, GCG, and SORT1) have log10(𝑃 value) greater than the 1.3 𝑃 value threshold. From the analyses of D1 and D2, we found that SNPs in 3 genes (SORCS1, AMPD, and PPAR𝛼) were associated with both IGT and IFG; SNPs in 5 genes (AMPD2, PRKAA2, C5, TCF7L2, and ITR) were associated with IGT only; SNPs in 6 genes (CAPN, IL4, NOS3, CD14, GCG, and SORT1) were associated with IFG only. These results suggest that IGT and IFG may indicate different pathways to diabetes, with different genetic determinants.

Thus, using both simulated data and a real study sample, we demonstrated that multinomial ordinal probit model with SVD method can be utilized to identify associated markers involved in disease development when multidisease stages are considered. For relatively small size of data set used in the paper, which is 100 samples and 1000 SNPs for the simulation study, the computation took about less than 10 minutes to complete. However, the computation time might be a concern when applying this method to large data set, such as GAWS with millions of SNPs and thousands of samples.

Acknowledgments

This research was partly supported by Mexican American Coronary Artery Disease (MACAD) study Grant NIH-NHLBI HL 088457, Diabetes Endocrinology Research Center (DERC) Grant NIH-NIDDK DK063491, and UCLA Clinical and Translational Science Institute (CTSI) Grant NIH-NCATS UL1TR000124.