Abstract

A key component to understanding the etiology of complex diseases, such as cancer, diabetes, and alcohol dependence, is investigating gene-environment interactions. This work is motivated by the following two concerns in the analysis of gene-environment interactions. First, multiple genetic markers in moderate linkage disequilibrium may be involved in susceptibility to a complex disease. Second, environmental factors may be subject to misclassification. We develop a genotype-based Bayesian pseudolikelihood approach that accommodates linkage disequilibrium in genetic markers and misclassification in environmental factors. Since our approach is genotype based, it allows the observed genetic information to enter the model directly, thus eliminating the need to infer haplotype phase and simplifying computations. The Bayesian approach allows shrinking parameter estimates towards the prior distribution to improve estimation and inference when environmental factors are subject to misclassification. Simulation experiments demonstrated that our method produces parameter estimates that are nearly unbiased even for small sample sizes. An application of our method is illustrated using a case-control study of the interaction between early onset of drinking and genes involved in the dopamine pathway.

1. Introduction

A key component to the prevention and control of complex diseases, such as cancer, hypertension, diabetes, and alcoholism, is to study the independent, cumulative, and interactive effects of genetic and environmental factors. This analysis has the potential to improve understanding of the role of genetic influences under various environmental exposures, thus providing valuable information to (1) better understand the biological pathways involved in the disease and its progression, thus providing major clues to its underlying causes; (2) design personalized interventions targeted to individuals with enhanced vulnerability to the disease (the risk genes may help identify patients at higher risk long before any symptoms occur); (3) gain critical understanding for drug discovery.

This work is motivated by the following two concerns in the analysis of gene-environment interactions. First, complex diseases are caused by multiple variants with small-to-moderate effect sizes working in concert [1]. Most published genome-wide association studies are based on single nucleotide polymorphism (SNP) analysis [2]. This approach may suffer from low power due to the large number of tests and the small effect sizes of individual SNPs. Furthermore, the true causal genetic marker is often not genotyped, but rather is captured through linkage disequilibrium (LD) with the typed markers. Since each typed SNP is only in partial linkage disequilibrium with the causal SNP, the observed effect size of the typed SNP is lower than the effect size of the causal SNP. In light of this concern, we propose to use a risk function that allows the genetic markers in linkage disequilibrium to enter the model directly [3]. This model eliminates the need to estimate haplotype phase and hence protects against bias due to haplotype phase ambiguity [4–8]. In addition, the computational burden is significantly reduced since the proposed approach uses genotype data directly. Second, many variables that are of interest to biomedical researchers are subject to misclassification, for example, due to uncertainty associated with recall or with a measurement at the individual level. Misclassification may result in bias and loss of power to detect gene-environment interactions [9]. Oftentimes uncertainty associated with these variables cannot be avoided in practice. The loss of power prevents the discovery of gene-environment interactions in small studies or studies involving analysis of subtypes of complex diseases.

An example of a biomedical problem involving gene-environment interactions is the analysis of the role of the age when a person first got drunk in the etiology of alcohol dependence. The age at which a person gets drunk for the first time may influence genes linked to alcoholism, making the youngest drinkers most susceptible to severe problems [10]. A twin study found that when twins started drinking early (age < 13 years), genetic factors contributed greatly to the risk for alcohol dependence, at rates as high as 90 percent in the youngest drinkers [10]. Some early-onset drinkers do not develop alcohol problems and some late-onset drinkers do; hence it is important to investigate genetic and environmental influences that predispose for or protect against alcohol dependence in these two groups. However, the definition of an early age of getting drunk is subject to misclassification due to uncertainty associated with the recall.

In light of these concerns, we develop a Bayesian methodology for the analysis of gene-environment interactions in case-control studies. Estimation and inference are based on a pseudolikelihood function [3, 11, 12]. This pseudolikelihood function offers the following advantages. One is that environmental variables measured exactly are modeled completely nonparametrically. Furthermore, a priori information about the probability of disease can be incorporated directly. The pseudolikelihood function exploits the gene-environment independence assumption, which is reasonable in many practical applications. If gene-environment independence does not hold in the population, the distribution of genotype can instead be specified within strata defined by an environmental covariate. The proposed analysis is based on a pseudolikelihood function, hence conventional Bayesian techniques may not be applied directly: the validity of Bayesian techniques needs to be examined when the likelihood function is not a proper likelihood [13]. We followed Monahan and Boos [13] and Lobach et al. [3] to validate our Bayesian approach under this pseudolikelihood function. Our Bayesian approach has the ability to shrink the parameter estimates towards the prior and hence reduce variability in parameter estimates. This property is essential when the environmental exposure is subject to misclassification, especially in studies with smaller sample sizes, for example, of subtypes of a complex disease. On the other hand, if the sample size is large enough, estimation and inference can be based on the asymptotic posterior distribution that we derive, which eases the computational burden.

An outline of this paper is as follows. In Section 2 we introduce notation, formally state the problem, and present the Bayesian model under various scenarios. Section 3 describes the asymptotic posterior distribution. Section 4 describes simulation experiments. Section 5 describes an application of the Bayesian model to the analysis of an alcoholism study. Section 6 gives concluding remarks.

2. Bayesian Model Based on Pseudolikelihood

2.1. Notation and Risk Function

Consider a sample consisting of 𝑛0 controls and 𝑛𝑑 cases at disease stage or type 𝑑 = 1, …, 𝐾. Define 𝐷 as the disease status. Following Lobach et al. [11], we pretend that this case-control sample is collected using a simple Bernoulli scheme, where the selection probability of a subject given disease status is proportional to 𝑛𝑑/pr(𝐷 = 𝑑), 𝑑 = 0, 1, …, 𝐾. Let 𝑅 = 1 denote the indicator of whether or not a subject is selected into the case-control sample. All participants of the study have selection status 𝑅 = 1. The observed genetic data consist of unphased genotypes 𝐆 = (𝐺1, …, 𝐺𝐼) at 𝐼 loci. Let 𝑄(𝐆; 𝜃) be a model describing Hardy-Weinberg equilibrium (HWE).

Let (𝑇, 𝑍) denote all nongenetic variables of interest. Suppose 𝑇 is the set of factors subject to misclassification, and 𝑍 is the set of variables observed exactly. We assume that the observed genetic data do not contain any additional information on disease status and the true environmental covariate given the genetic variable of interest. Let 𝑋 denote the error-prone version of 𝑇. Suppose the misclassification process is defined by the parametric structure 𝑝miss(𝑥 ∣ 𝑇, 𝐺, 𝑍, 𝐷; 𝜉). This model is general enough to capture differential misclassification. The joint distribution of the environmental factors in the underlying population is specified in the form 𝑝𝑇∣𝑍(𝑡 ∣ 𝑧; 𝜂)𝑓𝑍(𝑧). While 𝑇 may be a vector of factors, for simplicity of presentation in what follows we suppose that 𝑇 is a single factor.

Given the environmental covariates 𝑇 and 𝑍 and the genotype data 𝐆, the risk of disease in the underlying population is given by the following polytomous logistic model:
\[
\operatorname{pr}(D = k \mid \mathbf{G}, T, Z) = \frac{\exp\{\beta_{k0} + m_k(\mathbf{G}, T, Z; \beta)\}}{1 + \sum_{j=1}^{K} \exp\{\beta_{j0} + m_j(\mathbf{G}, T, Z; \beta)\}}, \quad k = 1, \ldots, K, \tag{2.1}
\]
where 𝑚𝑘(⋅) is a function of known form parameterizing the risk of disease in terms of parameters 𝛽. For the 𝑖th marker, denote the two alleles by 𝑀𝑖 and 𝑚𝑖, with frequencies 𝑃𝑀𝑖 and 𝑃𝑚𝑖, respectively. Following Lobach et al. [3], we define the following dummy variables and two risk models: a genotype effect model and an additive effect model.

Define the following dummy variables:
\[
A_i = \begin{cases} 1, & \text{if } G_i = M_i M_i, \\ 0, & \text{if } G_i = M_i m_i, \\ -1, & \text{if } G_i = m_i m_i, \end{cases}
\qquad
B_i = \begin{cases} -P_{m_i}^2, & \text{if } G_i = M_i M_i, \\ P_{M_i} P_{m_i}, & \text{if } G_i = M_i m_i, \\ -P_{M_i}^2, & \text{if } G_i = m_i m_i. \end{cases} \tag{2.2}
\]

Notice that 𝐴𝑖 + 1 is the number of copies of allele 𝑀𝑖 at the 𝑖th marker, and hence 𝐴𝑖 can be used to model the allele or additive effect of 𝑀𝑖. Let pr(𝐠; Θ) be a parametric form of the joint distribution of the observed genetic markers. In the following, we provide two examples of the function 𝑚𝑘(⋅) using the genotype information 𝐆 = (𝐺1, 𝐺2, …, 𝐺𝐼).
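As a concrete illustration, the coding in (2.2) can be sketched as follows, assuming genotypes are stored as counts of allele 𝑀𝑖 (0 = 𝑚𝑖𝑚𝑖, 1 = 𝑀𝑖𝑚𝑖, 2 = 𝑀𝑖𝑀𝑖); the function name and data layout are ours, not from the paper.

```python
# Additive (A) and dominance (B) codes of (2.2) for one biallelic marker;
# genotype g is the count of allele M (0, 1, or 2) and p_M its frequency.
def dummy_codes(g, p_M):
    p_m = 1.0 - p_M
    A = g - 1                                      # MM -> 1, Mm -> 0, mm -> -1
    B = {2: -p_m**2, 1: p_M * p_m, 0: -p_M**2}[g]  # mean-zero dominance code
    return A, B

# Under HWE the codes satisfy E(A) = P_M - P_m and E(B) = 0:
p_M = 0.25
hwe = {2: p_M**2, 1: 2 * p_M * (1 - p_M), 0: (1 - p_M)**2}
E_A = sum(w * dummy_codes(g, p_M)[0] for g, w in hwe.items())
E_B = sum(w * dummy_codes(g, p_M)[1] for g, w in hwe.items())
```

The check at the bottom reproduces the moment identities that are used again in Section 2.4.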

2.1.1. Genotype Effect Model (GEM)

The following specification of the risk function incorporates both additive and dominance effects of genotype, as well as the multiplicative gene-environment interactions:
\[
\begin{aligned}
m_k(\mathbf{G}, T, Z; \beta) = m_k(\mathcal{A}, \mathcal{B}, T, Z; \beta) ={}& T\beta_{kT} + Z\beta_{kZ} + \sum_{i=1}^{I} A_i \beta_{kA_i} + \sum_{i=1}^{I} T A_i \beta_{kAT_i} + \sum_{i=1}^{I} Z A_i \beta_{kAZ_i} \\
&+ \sum_{i=1}^{I} B_i \beta_{kD_i} + \sum_{i=1}^{I} T B_i \beta_{kDT_i} + \sum_{i=1}^{I} Z B_i \beta_{kDZ_i}.
\end{aligned} \tag{2.3}
\]
In this formulation, the regression coefficients 𝛽𝑘𝐴𝑖 and 𝛽𝑘𝐷𝑖 model the risk due to the additive and dominance effects, respectively [14, 15]. The remaining terms capture the multiplicative gene-environment interaction.
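A minimal sketch of evaluating the linear predictor (2.3) for one disease category, assuming the coded vectors 𝒜 and ℬ are already computed and 𝑇, 𝑍 are scalars; all argument names are illustrative.

```python
# Linear predictor m_k of (2.3): additive, dominance, and multiplicative
# gene-environment interaction terms over I markers for scalar T and Z.
def m_gem(A, B, T, Z, bT, bZ, bA, bAT, bAZ, bD, bDT, bDZ):
    add = sum(a * (ba + T * bat + Z * baz)
              for a, ba, bat, baz in zip(A, bA, bAT, bAZ))
    dom = sum(b * (bd + T * bdt + Z * bdz)
              for b, bd, bdt, bdz in zip(B, bD, bDT, bDZ))
    return T * bT + Z * bZ + add + dom

# Example with I = 2 markers; setting the dominance blocks (bD, bDT, bDZ)
# to zero recovers the additive effect model (2.4) below.
eta = m_gem(A=[1, 0], B=[-0.5625, 0.1875], T=1, Z=0.0,
            bT=0.4, bZ=0.0, bA=[0.2, 0.3], bAT=[0.1, 0.0], bAZ=[0.0, 0.0],
            bD=[0.0, 0.0], bDT=[0.0, 0.0], bDZ=[0.0, 0.0])
```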

2.1.2. Additive Effect Model (AEM)

Suppose that the dominance effect is not significantly present in the model (2.3). In this situation, the risk function takes the following form:
\[
m_k(\mathbf{G}, T, Z; \beta) = m_k(\mathcal{A}, T, Z; \beta) = T\beta_{kT} + Z\beta_{kZ} + \sum_{i=1}^{I} A_i \beta_{kA_i} + \sum_{i=1}^{I} T A_i \beta_{kAT_i} + \sum_{i=1}^{I} Z A_i \beta_{kAZ_i}. \tag{2.4}
\]

2.2. Pseudolikelihood

Let us denote 𝜅𝑘 = 𝛽𝑘0 + log(𝑛𝑘/𝑛0) − log(𝜋𝑘/𝜋0), 𝑘 = 1, 2, …, 𝐾, and 𝜅 = (𝜅1, …, 𝜅𝐾)ᵀ. In addition, let 𝛽̃0 = (𝛽10, …, 𝛽𝐾0)ᵀ, Ω = (𝛽̃0ᵀ, 𝛽ᵀ, Θᵀ, 𝜅ᵀ)ᵀ, and 𝜐 = (𝜂ᵀ, 𝜉ᵀ)ᵀ. Define
\[
S(k, \mathbf{g}, t, z; \Omega) = \frac{\exp[\mathbb{1}(k \geq 1)\{\kappa_k + m_k(\mathbf{g}, t, z; \beta)\}]}{1 + \sum_{j=1}^{K} \exp\{\beta_{j0} + m_j(\mathbf{g}, t, z; \beta)\}}\, \operatorname{pr}(\mathbf{g}; \Theta). \tag{2.5}
\]

We assume that 𝐺 and (𝑇, 𝑍) are independently distributed in the underlying population. Only changes in notation are needed to model genotype and environment within strata, thus relaxing the gene-environment independence assumption. An example of gene-environment dependence is polymorphisms in the nicotine metabolism pathway that may regulate the degree of addiction to nicotine, thus creating gene-environment interaction; these polymorphisms may also interact with smoking status while being involved in lung cancer [16]. We suppose that the type of genetic covariate measured does not depend on the individual's true genetic covariate, given disease status, environmental covariates, and the measured genetic information. Furthermore, we suppose that the observed genetic variable does not contain any additional information on disease status and the true environmental covariate given the genetic variable of interest.

Similarly to Lobach et al. [11], we propose to use the following pseudolikelihood function in place of the likelihood function to estimate the parameters:
\[
L_{\text{Pseudo}}(k, \mathbf{g}, x, z; \Omega, \eta, \xi) = \operatorname{pr}(D = k, \mathbf{G} = \mathbf{g}, X = x \mid Z = z, R = 1)
= \frac{\sum_{t} S(k, \mathbf{g}, t, z; \Omega)\, p_{\text{miss}}(x \mid k, \mathbf{g}, t, z; \xi)\, f_T(t \mid z; \eta)}{\sum_{k'=0}^{K} \sum_{t} \sum_{\mathbf{g}' \in \mathcal{G}} S(k', \mathbf{g}', t, z; \Omega)\, f_T(t \mid z; \eta)}, \tag{2.6}
\]
where 𝒢 is the set of all possible genotypes in the population. Lobach et al. [12] proved that maximization of 𝐿Pseudo, although it is not the actual retrospective likelihood for case-control data, leads to consistent and asymptotically normal parameter estimates. Observe that conditioning on 𝑍 in 𝐿Pseudo allows it to be free of the nonparametric density function 𝑓𝑍(𝑧), thus avoiding the difficulty of estimating a potentially high-dimensional nuisance parameter.

2.3. Bayesian Analysis Based on Pseudolikelihood

Since in our setting the retrospectively collected data are analyzed as if they came from a random sample, the function (2.6) is not a real likelihood function, and hence traditional Bayesian analysis is not technically correct: conventional validity of posterior probability statements follows from the definition of the likelihood as the joint density of the observations.

For simplicity of presentation we introduce new notation for this section only.

Monahan and Boos [13] introduced a definition based on coverage of posterior sets that are constructed to contain the correct probability of including a parameter 𝜏, if the underlying distribution of 𝜏 is the prior 𝑝(𝜏) and the model 𝑓(𝑌 ∣ 𝜏) of the data 𝑌 is correct. This approach has been used in the gene-environment interaction setting [3]. For example, in the one-dimensional case, the natural posterior coverage set functions are the one-sided intervals 𝐼𝛼 = 𝑅𝛼(𝑌) = (−∞, 𝜏𝛼), where 𝜏𝛼 is the 𝛼-percentile of the posterior 𝑓(𝜏 ∣ 𝑌). Validity for such a posterior means that all these intervals 𝐼𝛼 have the correct coverage 𝛼. In practice, it is often challenging to verify the required probability analytically. Monahan and Boos [13] proposed a convenient numerical method. Briefly, define 𝜏𝑘, 𝑘 = 1, …, 𝑚, to be a sample generated independently from a continuous prior 𝑝(𝜏). For each 𝜏𝑘, let 𝑌𝑘 denote a value generated from 𝑓(𝑌 ∣ 𝜏𝑘). In addition, for each 𝑘, define 𝐻𝑘 to be the variable
\[
H_k = \int_{-\infty}^{\tau_k} f(\tau \mid Y_k)\, d\tau. \tag{2.7}
\]
This corresponds to posterior coverage set functions of the form (−∞, 𝜏𝑘𝛼), where 𝜏𝑘𝛼 is the 𝛼th percentile point of the posterior density 𝑓(𝜏 ∣ 𝑌𝑘). Monahan and Boos [13] argued that if the distribution of 𝐻𝑘 fails to follow the uniform distribution for any prior, then the likelihood function cannot be a coverage proper Bayesian likelihood.
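As a toy illustration (ours, not the paper's model), the numerical check can be run on a case where validity is known: a normal mean with a conjugate normal prior, where 𝐻𝑘 in (2.7) is the posterior CDF evaluated at the true 𝜏𝑘 and should be Uniform(0, 1).

```python
import math, random

random.seed(7)
m = 2000
H = []
for _ in range(m):
    tau = random.gauss(0.0, 1.0)                  # tau_k drawn from the N(0,1) prior
    y = random.gauss(tau, 1.0)                    # Y_k | tau_k ~ N(tau_k, 1)
    post_mean, post_sd = y / 2.0, math.sqrt(0.5)  # conjugate posterior N(y/2, 1/2)
    z = (tau - post_mean) / post_sd
    H.append(0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))  # H_k of (2.7)

# Kolmogorov-Smirnov distance of the H_k sample from Uniform(0,1);
# a small value fails to invalidate the likelihood, as expected here.
H.sort()
ks = max(max((i + 1) / m - h, h - i / m) for i, h in enumerate(H))
```

For a pseudolikelihood, the same recipe is applied with the posterior computed from (2.6) in place of the conjugate posterior.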

We propose to use the methodology described above to validate the likelihood function and apply conventional MCMC techniques to estimate parameters. We note that the method developed by Monahan and Boos is devised to invalidate a pseudolikelihood. Therefore to validate a pseudolikelihood, we propose to consider a comprehensive set of scenarios to examine coverage probabilities of posterior sets, and if these scenarios fail to invalidate a pseudolikelihood, we suppose that it is valid.

2.4. Fully Bayesian Model

We consider the case when the environmental covariates (𝑇, 𝑋), genetic variant 𝐺, and disease status 𝐷 are binary. Let pr(𝐺 = 1) = 𝜃 and pr(𝑇 = 1) = 𝜂. For simplicity of presentation, consider a single genetic marker. Define the vector of risk parameters 𝛽 = (𝛽𝑡, 𝛽𝐴, 𝛽𝐵, 𝛽𝑡𝐴, 𝛽𝑡𝐵)ᵀ. Consider a multiplicative interaction and let 𝑚(𝑡, 𝑔; 𝛽) = 𝛽𝑡𝑡 + 𝛽𝐴𝐴 + 𝛽𝑡𝐴𝑡𝐴 + 𝛽𝐵𝐵 + 𝛽𝑡𝐵𝑡𝐵. Make the following definition:
\[
S(d, g, t; \beta, \theta) = \frac{\exp[\mathbb{1}(d \geq 1)\{\kappa_d + m(t, g; \beta)\}]}{1 + \exp\{\beta_0 + m(t, g; \beta)\}}\, \theta^{g}(1 - \theta)^{1-g}. \tag{2.8}
\]
If 𝑋 is an observed environmental covariate prone to misclassification, denote the misclassification probabilities as pr(𝑋 = 1 ∣ 𝑇 = 0) = 𝜉1 and pr(𝑋 = 0 ∣ 𝑇 = 1) = 𝜉0. Hence, the distribution of the misclassification process is 𝑓mem(𝑥 ∣ 𝑡; 𝜉0, 𝜉1) = {𝑥𝜉1 + (1 − 𝑥)(1 − 𝜉1)}(1 − 𝑡) + {𝑥(1 − 𝜉0) + (1 − 𝑥)𝜉0}𝑡.
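The misclassification density above can be sketched and checked directly; the name f_mem follows the text, the rest of the code is illustrative.

```python
# pr(X = x | T = t) with xi1 = pr(X=1|T=0) and xi0 = pr(X=0|T=1),
# written exactly as the closed form given in the text.
def f_mem(x, t, xi0, xi1):
    return (x * xi1 + (1 - x) * (1 - xi1)) * (1 - t) \
         + (x * (1 - xi0) + (1 - x) * xi0) * t

xi0, xi1 = 0.20, 0.25
# Each row of the 2x2 misclassification table sums to one:
rows_ok = all(abs(f_mem(0, t, xi0, xi1) + f_mem(1, t, xi0, xi1) - 1.0) < 1e-12
              for t in (0, 1))
```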

On the risk parameters, we impose a normal prior with mean 𝜇 and covariance matrix Σ.

Similarly to the appendix in Fan and Xiong [14] and Lobach et al. [3], the following expectations, variances, and covariances can be derived: 𝐸(𝐴𝑖) = 𝑃𝑀𝑖 − 𝑃𝑚𝑖, 𝐸(𝐵𝑖) = 0, Var(𝐴𝑖) = 2𝑃𝑀𝑖𝑃𝑚𝑖, Var(𝐵𝑖) = 𝑃²𝑀𝑖𝑃²𝑚𝑖, Cov(𝐴𝑖, 𝐴𝑗) = 2Δ𝑀𝑖𝑀𝑗, Cov(𝐵𝑖, 𝐵𝑗) = Δ²𝑀𝑖𝑀𝑗 for 𝑖 ≠ 𝑗, and Cov(𝐴𝑖, 𝐵𝑗) = 0 for all 𝑖 and 𝑗;
\[
\mathbf{V_A} = \begin{pmatrix}
2P_{M_1}P_{m_1} & 2\Delta_{M_1 M_2} & \cdots & 2\Delta_{M_1 M_I} \\
2\Delta_{M_1 M_2} & 2P_{M_2}P_{m_2} & \cdots & 2\Delta_{M_2 M_I} \\
\vdots & \vdots & \ddots & \vdots \\
2\Delta_{M_1 M_I} & 2\Delta_{M_2 M_I} & \cdots & 2P_{M_I}P_{m_I}
\end{pmatrix},
\qquad
\mathbf{V_D} = \begin{pmatrix}
P_{M_1}^2 P_{m_1}^2 & \Delta_{M_1 M_2}^2 & \cdots & \Delta_{M_1 M_I}^2 \\
\Delta_{M_1 M_2}^2 & P_{M_2}^2 P_{m_2}^2 & \cdots & \Delta_{M_2 M_I}^2 \\
\vdots & \vdots & \ddots & \vdots \\
\Delta_{M_1 M_I}^2 & \Delta_{M_2 M_I}^2 & \cdots & P_{M_I}^2 P_{m_I}^2
\end{pmatrix}. \tag{2.9}
\]
Define 𝒜 = (𝐴1, …, 𝐴𝐼) and ℬ = (𝐵1, …, 𝐵𝐼), and let 𝐎𝐈 be an 𝐼 × 𝐼 matrix of zeros. Based on the expectations and covariances described above, we have
\[
\operatorname{Cov}(\mathcal{A}, \mathcal{B}) = \begin{pmatrix} \mathbf{V_A} & \mathbf{O_I} \\ \mathbf{O_I} & \mathbf{V_D} \end{pmatrix}.
\]
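A sketch of assembling the block covariance matrix in (2.9), assuming p_M holds the allele frequencies and Delta the symmetric matrix of pairwise LD coefficients; the function name and layout are ours.

```python
import numpy as np

def cov_blocks(p_M, Delta):
    """Return V_A, V_D, and the block-diagonal covariance of (A, B)."""
    p_M = np.asarray(p_M, dtype=float)
    p_m = 1.0 - p_M
    V_A = 2.0 * np.asarray(Delta, dtype=float)   # Cov(A_i, A_j) = 2 * Delta_ij
    np.fill_diagonal(V_A, 2.0 * p_M * p_m)       # Var(A_i) = 2 P_M P_m
    V_D = np.asarray(Delta, dtype=float) ** 2    # Cov(B_i, B_j) = Delta_ij^2
    np.fill_diagonal(V_D, (p_M * p_m) ** 2)      # Var(B_i) = P_M^2 P_m^2
    O = np.zeros_like(V_A)
    return V_A, V_D, np.block([[V_A, O], [O, V_D]])

# Two markers with P_M = 0.25 and LD coefficient Delta = 0.03:
V_A, V_D, C = cov_blocks([0.25, 0.25], [[0.0, 0.03], [0.03, 0.0]])
```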

In the case when misclassification is large, the sampling distribution of the risk parameter estimates is likely to be skewed [11, 17]. However, because the normal distribution is symmetric, this prior is likely to bring the sampling distribution of the risk parameter estimates closer to normal. For the frequency parameters 𝜂 and 𝜃, we use noninformative uniform (0, 1) priors. If a priori information is available about the genotype frequencies, it can instead be specified using a corresponding distribution or HWE.

Then the joint posterior distribution of the model unknowns is proportional to
\[
\prod_{i=1}^{n} \frac{\sum_{t=0}^{1} S(d_i, g_i, t; \beta, \theta)\, f_{\text{mem}}(x_i \mid t; \xi_0, \xi_1)\, \eta^{t}(1-\eta)^{1-t}}{\sum_{t=0}^{1} \sum_{d=0}^{1} \sum_{g=0}^{1} S(d, g, t; \beta, \theta)\, \eta^{t}(1-\eta)^{1-t}} \times |\Sigma|^{-1/2} \exp\left\{-\frac{1}{2}(\beta - \mu)^{T} \Sigma^{-1} (\beta - \mu)\right\} \times I_{(0,1)}(\eta) \times I_{(0,1)}(\theta). \tag{2.10}
\]
Note that in this formulation, we specify a known misclassification process. We recommend performing a sensitivity analysis to see whether the parameter estimates change when the misclassification probabilities are specified slightly differently. Furthermore, we recommend the conservative setting in which the LD is set to zero a priori.
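A compressed Metropolis-Hastings sketch for a posterior of the form (2.10), entirely illustrative: binary 𝐷, 𝐺, and 𝑇, with 𝑇 latent and 𝑋 its misclassified surrogate. We use a prospective Bernoulli likelihood as a stand-in for 𝑆, treat 𝜂, 𝜃, 𝜉0, 𝜉1 as known, and put independent 𝒩(0, 3²) priors on the risk parameters; the tuning constants and data sizes are ours.

```python
import math, random

random.seed(3)
xi0, xi1, eta, theta = 0.20, 0.25, 0.5, 0.3
b_true = (-0.5, 0.8, 0.5, 0.7)                 # (b0, b_t, b_g, b_tg), illustrative

def expit(u):
    return 1.0 / (1.0 + math.exp(-u))

def f_mem(x, t):                               # pr(X = x | T = t)
    return (x * xi1 + (1 - x) * (1 - xi1)) * (1 - t) \
         + (x * (1 - xi0) + (1 - x) * xi0) * t

data = []                                      # simulate (D, G, X) triples
for _ in range(400):
    g = 1 if random.random() < theta else 0
    t = 1 if random.random() < eta else 0
    p = expit(b_true[0] + b_true[1] * t + b_true[2] * g + b_true[3] * t * g)
    d = 1 if random.random() < p else 0
    x = 1 if random.random() < f_mem(1, t) else 0
    data.append((d, g, x))

def log_post(b):
    ll = 0.0
    for d, g, x in data:
        s = 0.0
        for t in (0, 1):                       # marginalise the latent T
            p = expit(b[0] + b[1] * t + b[2] * g + b[3] * t * g)
            s += (p if d else 1.0 - p) * f_mem(x, t) * (eta if t else 1.0 - eta)
        ll += math.log(s)
    return ll - sum(v * v for v in b) / (2.0 * 9.0)   # N(0, 3^2) priors

b, accept = [0.0] * 4, 0
cur = log_post(b)
for _ in range(500):
    prop = [v + random.gauss(0.0, 0.15) for v in b]
    new = log_post(prop)
    if math.log(random.random()) < new - cur:  # Metropolis accept step
        b, cur, accept = prop, new, accept + 1
rate = accept / 500.0
```

In practice one would also update 𝜂 and 𝜃 and monitor convergence; this sketch only shows the latent-𝑇 marginalisation and the Gaussian shrinkage prior at work.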

3. Asymptotic Posterior Distribution

We now consider properties of an asymptotic posterior distribution based on the pseudolikelihood (2.6). MCMC techniques can be computationally challenging. Knowing the form of an asymptotic posterior distribution would ease the computational burden.

For simplicity, we suppose that the parameter 𝜉 that controls the misclassification error distribution is known, although this is not required. Denote by Θ0 and Θ̂𝑛 the values that maximize the prior and the pseudolikelihood, respectively. Let Ψ(𝑑, 𝑔, 𝑥, 𝑧, Θ, 𝜉) be the derivative of log{𝐿Pseudo(𝑑, 𝑔, 𝑥, 𝑧, Θ, 𝜉)} with respect to Θ and
\[
\Lambda = \sum_{d} \frac{n_d}{n}\, E\{\Psi(D, G, X, Z, \Omega, \eta, \xi) \mid D = d\} \times E\{\Psi(D, G, X, Z, \Omega, \eta, \xi) \mid D = d\}^{T}. \tag{3.1}
\]
Furthermore, if 𝑝(Θ) is the prior distribution of the vector of parameters, define 𝑙(Θ) to be the derivative of log{𝑝(Θ)} with respect to Θ. Then define
\[
\mathcal{L}_n(\Theta, \xi) = \sum_{i=1}^{n} \Psi\left(D_i, G_i, X_i, Z_i, \Theta, \xi\right) \tag{3.2}
\]
and the matrices
\[
\mathcal{I}(\Theta) = -E\left[\frac{\partial \mathcal{L}_n(\Theta, \xi)}{\partial \Theta}\right]; \qquad \mathcal{J}(\Theta) = -E\left[\frac{\partial l(\Theta)}{\partial \Theta}\right]. \tag{3.3}
\]
Bernardo and Smith [18] showed that under suitable regularity conditions the posterior distribution of the vector of parameters converges to a normal 𝒩(ℳ, Σ) distribution. The mean vector and covariance matrix can be consistently estimated as follows:
\[
\widehat{\mathcal{M}}_n = \widehat{\Sigma}_n\left\{\mathcal{I}(\widehat{\Theta}_n)\widehat{\Theta}_n + \mathcal{J}(\Theta_0)\Theta_0\right\}, \qquad \widehat{\Sigma}_n = \left\{\mathcal{I}(\widehat{\Theta}_n) + \mathcal{J}(\Theta_0)\right\}^{-1}. \tag{3.4}
\]
It can be easily seen that −𝑛⁻¹𝜕ℒ𝑛(Θ, 𝜉)/𝜕Θᵀ is a consistent estimate of 𝑛⁻¹ℐ(Θ). Alternatively, if Σ̂ is the sample covariance matrix of the terms Ψ(𝐷𝑖, 𝐺𝑖, 𝑋𝑖, 𝑍𝑖, Θ̂𝑛, 𝜉), then Σ̂ + Λ̂ consistently estimates 𝑛⁻¹ℐ(Θ).

Note that the posterior distribution has precision equal to the sum of the precision provided by the observed data and the prior precision matrix. This formulation suggests an approximation: for large 𝑛, the prior precision is small compared to the precision provided by the observed data. Hence, with a large sample size, one can reduce the computational burden by using the asymptotic distribution with only the precision provided by the observed data when specifying the posterior distribution.
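A toy numeric illustration of this precision decomposition (ours, not from the paper): with a 𝒩(𝜇0, 𝑠0²) prior and 𝑛 i.i.d. 𝒩(𝜃, 𝑠²) observations, the posterior precision is the data precision 𝑛/𝑠² plus the prior precision 1/𝑠0², so for large 𝑛 dropping the prior term changes the posterior variance negligibly.

```python
n, s2, s0_2 = 10_000, 1.0, 4.0
data_prec = n / s2            # precision provided by the observed data
prior_prec = 1.0 / s0_2       # precision provided by the prior
post_var = 1.0 / (data_prec + prior_prec)
approx_var = 1.0 / data_prec  # approximation that ignores the prior precision
rel_err = (approx_var - post_var) / post_var   # equals prior_prec / data_prec
```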

4. Simulation Experiments

We investigated the cases of small (𝑛0 = 𝑛1 = 350) and large (𝑛0 = 𝑛1 = 1,500) sample sizes.

We validated the pseudolikelihood function using the methodology described by Monahan and Boos [13] in a number of scenarios obtained by varying the sample size, effect size, and misclassification probabilities. In 96% of the cases that we considered, the Kolmogorov-Smirnov test failed to reject the null hypothesis that the sample of 𝐻𝑘 values (2.7) comes from the uniform (0, 1) distribution at the 0.05 significance level. Hence, we concluded that the pseudolikelihood is valid for subsequent analysis and proceeded to estimating parameters.

We implemented the Metropolis-Hastings algorithm in the following setting. On the risk parameters 𝛽, we imposed a normal 𝒩(𝜇, Σ) prior, where 𝜇 is equal to the pseudo-MLE estimates. To examine the sensitivity of the estimates to this specification, we also considered the case when 𝜇 is a vector of zeros. The covariance matrix was specified as a diagonal matrix with diagonal elements equal to 3². Alternatively, we specified the corresponding matrix according to the known structure that is a function of the LD. In all of these scenarios, the results we obtained were comparable. Table 1 presents results based on 𝜇 = (0, 0, 0)ᵀ and a covariance matrix with diagonal elements equal to 3².

To examine performance of our approach, we performed two simulation experiments. In the first experiment, we investigated performance of Bayesian method compared to pseudo-MLE. The goal of this experiment was to examine the ability of Bayesian approach to shrink the parameter estimates towards prior when misclassification causes the estimates to have skewed distribution. In the second experiment, we examined performance of the asymptotic posterior distribution.

Experiment 1. We generated the true environmental variable 𝑇 from a binomial distribution with pr(𝑇 = 1) = 0.5. The misclassification probabilities are pr(𝑋 = 0 ∣ 𝑇 = 1) = 0.20 and pr(𝑋 = 1 ∣ 𝑇 = 0) = 0.25. We simulated three genetic markers in LD corresponding to Δ = 0.03 and 𝑃𝑀𝑖 = 0.25. In the study with 1,500 cases and 1,500 controls, we generated a binary disease status according to the following logistic model:
\[
\operatorname{logit}\{\operatorname{pr}(D = 1 \mid \mathbf{G}, T)\} = \beta_0 + \beta_t t + \sum_{j=1}^{3} \beta_{A_j} A_j + \sum_{j=1}^{3} \beta_{B_j} B_j + \sum_{j=1}^{3} \beta_{A_j T} A_j T + \sum_{j=1}^{3} \beta_{B_j T} B_j T. \tag{4.1}
\]
To examine the case when genetic data are missing, we simulated a similar set of 1,500 cases and 1,500 controls with 50% of the genetic information missing completely at random. To investigate a smaller study, we simulated 350 cases and 350 controls with the disease status defined by the risk model with all 𝛽𝐵𝑗 and 𝛽𝐵𝑗𝑇 set to 0. Results presented in Tables 1 and 2 illustrate that the proposed Bayesian approach produced parameter estimates that are less variable and less biased. We examined the empirical distribution of the parameter estimates based on the small sample and found that it is skewed, which may be due to the small sample size and the presence of misclassification. We observed this phenomenon in our previous work [3, 11]. The Bayesian solution brings the advantage that a symmetric prior can shrink the distribution of the parameter estimates towards the normal distribution. Furthermore, we present the performance of the naive approach that ignores the existence of misclassification.
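Markers in LD can be generated through two-locus haplotype frequencies; a sketch (our construction, using the experiment's Δ = 0.03 and 𝑃𝑀 = 0.25) for two loci is:

```python
import random

# Haplotype frequencies for two biallelic loci with allele frequencies
# p1, p2 and LD coefficient delta: p(M1 M2) = p1*p2 + delta, etc.
random.seed(11)
p1 = p2 = 0.25
delta = 0.03
hap_freq = [((1, 1), p1 * p2 + delta), ((1, 0), p1 * (1 - p2) - delta),
            ((0, 1), (1 - p1) * p2 - delta), ((0, 0), (1 - p1) * (1 - p2) + delta)]

def draw_haplotype():
    u, acc = random.random(), 0.0
    for hap, freq in hap_freq:
        acc += freq
        if u < acc:
            return hap
    return hap_freq[-1][0]

# A genotype pairs two independent haplotypes; store allele counts per locus.
genotypes = []
for _ in range(5000):
    h1, h2 = draw_haplotype(), draw_haplotype()
    genotypes.append((h1[0] + h2[0], h1[1] + h2[1]))
allele_freq_1 = sum(g[0] for g in genotypes) / (2 * 5000)  # should be near p1
```

Extending to the three markers of the experiment requires eight haplotype frequencies but follows the same pattern.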

Experiment 2. We examined the performance of estimation based on the derived asymptotic posterior in the simulation setup described in Experiment 1 corresponding to 𝑛0 = 𝑛1 = 1,500. Results presented in Table 3 illustrate that the parameter estimates are nearly unbiased. Moreover, the estimated variances of the parameter estimates are very close to the observed variability, with one exception, namely, 𝛽𝑡. The variability of 𝛽𝑡 may be inflated due to the misclassification in the environmental exposure.

5. Analysis of Alcohol Dependence

The Collaborative Study on the Genetics of Alcoholism (COGA) is a nine-center nationwide study that was initiated in 1989 and has had as its primary aim the identification of genes that contribute to alcoholism susceptibility and related characteristics [19–21]. COGA is funded through the National Institute on Alcohol Abuse and Alcoholism (NIAAA). The focus of this study is a case-control design of unrelated individuals for a genetic association analysis of addiction. Analyses that incorporate important demographic and environmental factors, such as age when first got drunk, sex, income, and education, into association studies are pursued. Our project involves the analysis of 40 SNPs residing in genes involved in dopamine pathways. Specifically, we consider the D2 dopamine receptor gene (DRD2), encoding a protein that plays a central role in reward-mediating mesocorticolimbic pathways; NCAM1, a member of the immunoglobulin gene superfamily, encoding a protein involved in various neural functions; the tetratricopeptide repeat domain 12 gene (TTC12); and the CHRNA3 gene, shown to be involved in higher craving after quitting and increased withdrawal symptoms over time. Cases are defined as individuals with DSM-IV alcohol dependence (lifetime). Controls are defined as individuals who have been exposed to alcohol but have never met a lifetime diagnosis for alcohol dependence or dependence on other illicit substances. The sample consists of 50.7% male and 49.3% female participants; 60% report their race as Caucasian and 40% as non-Caucasian. We categorized the age when first got drunk as "Early" if it is less than or equal to 13 (EAD = 1, 45.2% of all participants), and low income as less than $30K per year (LI = 1, 45% of all participants).

Define 𝑇 to be the true unobserved indicator of early drinking; that is, 𝑇 = 1 corresponds to the early onset of drinking and 𝑇 = 0 to the late onset. Let 𝑋 be the observed value of the early onset of drinking. Because we do not have external data or internal replicates to estimate the misclassification probabilities, we performed a sensitivity analysis for various values of misclassification.

We used the following risk model:
\[
\operatorname{logit}\{\operatorname{pr}(D = 1 \mid G = (A, B), T)\} = \beta_0 + \beta_t T + \beta_A A + \beta_B B + \beta_{AT} AT + \beta_{BT} BT. \tag{5.1}
\]

The results of the sensitivity analysis (not shown) suggest that when pr(𝑋 = 0 ∣ 𝑇 = 1) is ignored or underestimated, the interaction effect is not significant. This setting corresponds to the case when exposed subjects are recorded as nonexposed, thus reducing the association signal. However, the estimation procedure appears to be robust to underestimation of pr(𝑋 = 1 ∣ 𝑇 = 0). This scenario corresponds to the case when a nonexposed subject is recorded as exposed.

Parameter estimates obtained using our method corresponding to pr(𝑋 = 0 ∣ 𝑇 = 1) = 0.25 and pr(𝑋 = 1 ∣ 𝑇 = 0) = 0.25 are presented in Table 4, demonstrating significant interactions between various genetic markers and early onset of drinking.

6. Discussion

Motivated by concerns in the analysis of gene-environment interactions, we proposed a genotype-based Bayesian approach for the analysis of case-control studies when the environmental exposure cannot be observed directly and is subject to misclassification. The formulation of the risk functions and the estimation procedure are along the lines of our previous work: the genotype and additive effect models [14, 15] and the pseudolikelihood approach [3, 11, 12]. The risk function of the genotype effect model involves both the additive and dominance effects, while taking into account possible interactions between genes expressed in terms of interactions between their additive and dominance components; the additive effect model involves only the additive effect and possible interactions. The additive effect model contains fewer parameters than the genotype effect model. In applications, the additive effect model should be used as the first step in analyzing data. If the dominance effect is strong enough to compensate for the increase in the number of parameters, one may use the genotype effect model.

The proposed method has several unique advantages. First, the observed genetic information enters the model directly, and the LD structure is captured in the regression coefficients. This aspect offers advantages from the practical point of view: the computational burden is less demanding because the haplotype phase need not be estimated. When LD is moderate, which is the focus of our work, the computational demands of haplotype phase estimation can be substantial even with the current state of technology. Furthermore, the risk due to the uncertainty associated with haplotype phase estimation is avoided. Second, the estimation procedure is based on a pseudolikelihood model, similar to the method investigated previously, that allows efficient estimation of parameters, models environmental covariates completely nonparametrically, and incorporates information about the probability of disease [3, 11, 12]. In epidemiologic studies, the vector of environmental covariates measured exactly is often high dimensional, and a good estimate of the probability of disease in the population is known. Additionally, the Bayesian formulation of the proposed method allows shrinking parameter estimates towards the prior, which offers an advantage in cases when misclassification is present.

Because of the Bayesian formulation and the need to validate posterior sets obtained using a pseudolikelihood, the proposed method can be highly computationally intensive. Moreover, the validation of the pseudolikelihood requires evaluation of the ratio of two likelihood functions. For example, in our simulation experiments and data analysis, this part required us to obtain precise values of ratios similar to exp(3000)/exp(2908). Hence, we employed the GNU Multiple Precision Arithmetic Library (http://gmplib.org/).

The form of our pseudolikelihood function is complex, and it does not seem feasible to validate the pseudolikelihood function algebraically. Instead, we propose to apply the Monahan and Boos method to examine coverage probabilities of posterior sets. If a comprehensive set of scenarios fails to invalidate the pseudolikelihood function, we suppose that the pseudolikelihood is valid. This reasoning is similar to conventional hypothesis testing, where the null hypothesis is assumed to be true (the pseudolikelihood is valid) and the observed data are used to quantify evidence in favor of the alternative hypothesis (the pseudolikelihood is not valid). Of course, a strong basis for the validity of a pseudolikelihood is needed. We employ the following arguments. Our previous research [3, 11, 12] demonstrated the validity of this pseudolikelihood in the frequentist sense; that is, we have shown that estimation and inference are correct when this pseudolikelihood is used in place of a real likelihood function. Hence, the posterior distribution based on the pseudolikelihood may be invalid only for certain prior distributions. Therefore, to invalidate the pseudolikelihood, one should find a prior distribution for which the posterior is not valid. However, in our setting, the number of possible prior specifications is small, because what we advocate is the use of the symmetry of the prior distribution as a way to improve the precision of estimation and inference. We restrict the prior of the regression coefficients to be Gaussian and advocate mean zero and large variance. While one can try other priors for the other parameters, the number of possible prior settings is still reasonable, and it is practically feasible to examine their performance in terms of coverage probabilities of posterior sets.

While the major motivation of the proposed work is dictated by the need for a symmetric prior on the risk coefficients, other types of a priori information can enter our model. For example, if a priori information about the LD structure is available, it can be modeled in the prior distribution. Furthermore, if the misclassification probabilities are not known precisely, one can specify the uncertainty associated with the values of misclassification.

A major practical advantage of the proposed work is that it allows the model to exploit recent advances in genotyping technology. Specifically, genetic markers are becoming more and more densely typed, and multiple markers are likely to be observed in a functional unit of interest. These units of interest may be defined in terms of LD blocks using information available in linkage maps. While in situations when linkage disequilibrium is strong the haplotype-based analysis is advantageous, in the more common scenarios when linkage disequilibrium is moderate, our approach provides advantages.

However, in the context when the number of genetic markers in a functional unit of interest is large, our methodology may require a model averaging and model selection component. Hence, the behavior of this pseudolikelihood needs to be examined in this setting. A practical strategy is to start with a screening analysis to identify interesting genetic variants and SNPs using traditional methods, which are computationally less demanding. Then, one may apply the proposed approaches to investigate possible gene-environment interactions by focusing on these important and interesting genetic variants and SNPs.

Acknowledgments

R. Fan was supported by the Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Maryland, USA. Funding support for the Study of Addiction: Genetics and Environment (SAGE) was provided through the NIH Genes, Environment and Health Initiative (GEI) (U01 HG004422). SAGE is one of the genome-wide association studies funded as part of the Gene Environment Association Studies (GENEVA) under GEI. Assistance with phenotype harmonization and genotype cleaning, as well as with general study coordination, was provided by the GENEVA Coordinating Center (U01 HG004446). Assistance with data cleaning was provided by the National Center for Biotechnology Information. Support for the collection of datasets and samples was provided by the Collaborative Study on the Genetics of Alcoholism (COGA; U10 AA008401), the Collaborative Genetic Study of Nicotine Dependence (COGEND; P01 CA089392), and the Family Study of Cocaine Dependence (FSCD; R01 DA013423). Funding support for genotyping, which was performed at the Johns Hopkins University Center for Inherited Disease Research, was provided by the NIH GEI (U01 HG004438), the National Institute on Alcohol Abuse and Alcoholism, the National Institute on Drug Abuse, and the NIH contract "High throughput genotyping for studying the genetic contributions to human disease" (HHSN268200782096C). The datasets used for the analyses described in this paper were obtained from dbGaP at http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000092.v1.p1. This work has utilized computing resources at the High Performance Computing Facility of the Center for Health Informatics and Bioinformatics at New York University Langone Medical Center.