Abstract

Huntington's disease (HD) is a progressive neurodegenerative disorder caused by an expansion of CAG repeats in the IT15 gene. The age-at-onset (AAO) of HD is inversely related to the CAG repeat length and the minimum length thought to cause HD is 36. Accurate estimation of the AAO distribution based on CAG repeat length is important for genetic counseling and the design of clinical trials. In the Cooperative Huntington's Observational Research Trial (COHORT) study, the CAG repeat length is known for the proband participants. However, whether a family member shares the huntingtin gene status (CAG expanded or not) with the proband is unknown. In this work, we use the expectation-maximization (EM) algorithm to handle the missing huntingtin gene information in first-degree family members in COHORT, assuming that a family member has the same CAG length as the proband if the family member carries a huntingtin gene mutation. We perform simulation studies to examine performance of the proposed method and apply the methods to analyze COHORT proband and family combined data. Our analyses reveal that the estimated cumulative risk of HD symptom onset obtained from the combined data is slightly lower than the risk estimated from the proband data alone.

1. Introduction

Huntington’s disease (HD) is a severe, autosomal dominantly inherited neurodegenerative disorder that affects motor, cognitive, and psychiatric function and is uniformly fatal. HD is caused by the expansion of CAG trinucleotide repeats at the huntingtin gene (IT15) [1, 2]. Affected individuals typically begin to show motor signs around 30–50 years of age and typically die 15–20 years after the disease onset [3]. Despite identification of the causative gene, there is currently no treatment that modifies disease progression.

One large genetic epidemiological study of HD, the Cooperative Huntington’s Observational Research Trial (COHORT), including 42 Huntington study group research centers in North America and Australia, was initiated in 2005 and concluded in 2011 [46]. Participants in COHORT (probands) underwent a clinical evaluation and DNA from whole blood was genotyped for the length of the CAG-repeat huntingtin mutation. Since 2005, COHORT probands from sites with IRB approval have participated in family history interviews and have provided information on HD affection status in their family members. While CAG repeat length is ascertained in probands, the high cost of conducting in-person interviews of family members prevents the collection of all family members’ blood samples. However, family members’ age-at-onset (AAO) of HD and vital status are obtained through systematic interviews of the probands or the family members themselves. Although a relative’s HD genotype is unavailable, the corresponding distribution of the HD gene can be estimated based on the relative’s relationship with the proband, the proband’s mutation status, and assumptions regarding within-family similarity of CAG length [7, 8].

In a genetic counseling setting, subjects with CAG repeats of 36 or greater are defined as carrying the HD mutation (carrier; [9]), and CAG less than 36 is defined as screened negative, or noncarrier [9]. It is known that there is an inverse association between the CAG repeat length and AAO of HD, that is, the longer the repeat length, the earlier the motor onset [10]. Modeling such a relationship as well as the conditional distribution of HD onset given CAG repeat length accurately and precisely is important for genetic counseling and the design of clinical trials for HD. The AAO of HD onset is subject to right censoring by constraints of the observation periods. Carriers who have not been diagnosed with HD are right-censored for AAO. Several formulae were proposed in the literature to estimate the survival function of age at HD diagnosis given CAG repeat length (e.g., [911]). Langbehn et al. [10] have shown that the standard semiparametric survival models, such as the Cox proportional hazards model, do not fit the HD data and proposed a new logistic-exponential parametric model. Specifically, the conditional distribution of HD onset given the CAG repeat length is modeled as a logistic function, with a location and a scale parameter both depending on CAG through nonlinear relationships. Using a large clinical data set, they observed that separate exponential relationships with CAG length gave excellent empirical goodness of fit to both the mean AAO and its variance. Other parametric models, such as Gamma distribution, have also been proposed in the literature [12, 13]. Langbehn et al. [14] examine several AAO models in the literature and show the superior performance of Langbehn et al. [10] in terms of predicting the two-year probability of new HD diagnosis with independent prospective data.

None of the aforementioned existing methods can be directly used to analyze COHORT family data because family members are not always genotyped and their HD mutation status is unknown. The inclusion of family data contributes additional information; however, the unobserved HD mutation sharing status in family members (CAG-elongated or not) complicates the analysis. To see this, note that the affected parent carrying huntingtin mutation has a 50% chance of transmitting the mutation to an offspring. An added complexity is that the likelihood of the offspring having a higher CAG repeat than the parent is higher if the parent is the father. Since the offspring is not genotyped, whether he or she carries expanded CAG repeats is unknown. In this work, we treat the unknown huntingtin gene sharing status in first-degree family members (CAG-elongated or not) as missing data and use the EM algorithm to carry out the maximum likelihood estimation of the proband and family data jointly. Conditionally on the transmission status in family members, we use the logistic-exponential model in Langbehn et al. [14] to model the AAO as a function of CAG repeat length. We perform simulation studies to examine finite sample performances of the proposed methods. Finally, we apply these methods to analyze the COHORT proband and family combined data. Our results show a slightly lower estimated cumulative risk of HD symptom onset using the combined data compared to using proband data alone.

2. Methods

We start by introducing some notations. For the 𝑖th subject, let 𝑇𝑖 denote the age-at-onset of HD, let 𝛿𝑖 be the event indicator, let 𝐶𝑖 denote the censoring time, and let 𝑋𝑖=min(𝑇𝑖,𝐶𝑖). Let 𝐴𝑖 denote the CAG repeat length. Langbehn et al. [10] model distribution of 𝑇𝑖 given 𝐴𝑖 by a logistic function. The cumulative distribution function (CDF) given 𝐴𝑖 is 𝐹𝑡𝐴𝑖𝑇=Pr𝑖𝑡𝐴𝑖=11+𝑒[𝑡𝜇(𝐴𝑖)]/𝑠(𝐴𝑖),(2.1) and the density function is 𝑓𝑡𝐴𝑖=𝑒[𝑡𝜇(𝐴𝑖)]/𝑠(𝐴𝑖)𝑠𝐴𝑖1+𝑒[𝑡𝜇(𝐴𝑖)]/𝑠(𝐴𝑖)2.(2.2) Here 𝜇(𝐴𝑖) is a location parameter depending on the covariate 𝐴𝑖 and 𝑠(𝐴𝑖) is a scale parameter depending on 𝐴𝑖. Let 𝑆(𝑡𝐴𝑖)=1𝐹(𝑡𝐴𝑖) denote the survival function of HD onset. The location and scale parameters have the following relationship with the mean and variance of 𝑇𝑖 given 𝐴𝑖: 𝐸𝑇𝑖𝐴𝑖𝐴=𝜇𝑖𝑇,var𝑖𝐴𝑖=𝜋23𝑠2𝐴𝑖.(2.3) Various parametric functions for the location and scale parameters were compared in Langbehn et al. [10, 14], and the exponential function provides the best fit. Therefore, we use the same model where 𝜇𝐴𝑖=𝜇1𝜇+exp2𝜇3𝐴𝑖,𝐴var𝑖=𝜎1𝜎+exp2𝜎3𝐴𝑖.(2.4) Substitute these into 𝐹(𝑡𝐴𝑖) and 𝑓(𝑡𝐴𝑖) to obtain a parametric model for the distribution of AAO of HD with six parameters, 𝛽=(𝜇1,𝜇2,𝜇3,𝜎1,𝜎2,𝜎3)𝑇. Langbehn et al. [10] fitted estimates of 𝛽=(21.54,9.56,0.146,35.55,17.72,0.327)𝑇.

2.1. Proband-Only Analysis

First, consider probands’ data where all 𝐴𝑖’s are observed. Since a subject’s AAO of HD is subject to the right censoring, the likelihood function is 𝐿(𝛽)=𝑛𝑖=1𝑓𝛿𝑖𝑋𝑖𝐴𝑖𝑆;𝛽1𝛿𝑖𝑋𝑖𝐴𝑖;𝛽,(2.5) and the log-likelihood is 𝑙(𝛽)=𝑛𝑖=1𝛿𝑖𝑠𝐴log𝑖𝑋𝑖𝐴𝜇𝑖𝑠𝐴𝑖1+𝛿𝑖log1+𝑒(𝑋𝑖𝜇(𝐴𝑖))/𝑠(𝐴𝑖).(2.6) The maximum likelihood estimate (MLE) of the parameters, ̂𝛽, can be obtained via a general-purpose optimization algorithm such as Newton-Raphson or Nelder-Mead implemented in the R program version 2.13.1. The variance-covariance matrix of ̂𝛽 is estimated by the inverse of the estimated Hessian matrix ̂𝛽=𝐻̂𝛽cov1.(2.7) The standard error of the estimated survival function, 𝑆(𝑡𝐴𝑖), is then estimated by the Delta method, that is, 𝑆var𝑡𝐴𝑖=𝐺𝑇̂𝛽̂𝛽𝐺̂𝛽var,(2.8) where the gradient vector 𝐺̂𝛽=𝜕𝑆𝑡𝐴𝑖||||𝜕𝛽𝛽=̂𝛽.(2.9) Since the parameters are estimated by maximum likelihood, it is straightforward to carry out likelihood ratio tests (LRTs) to compare the model fit from the COHORT data with the one obtained by applying parameters from other studies such as Langbehn et al. [10] to the COHORT data. Here, twice the difference in the log-likelihood follows an asymptotic chi-square distribution with 6 degrees of freedom.

2.2. Incorporating Family Members

Next, we consider incorporating family members’ AAO data. We do not directly observe whether a family member shares the huntingtin mutation with the proband, but we do have data regarding family members’ age-at-onset of the first symptoms, as well as the family members’ current ages. When we incorporate the additional family data, the likelihood for the survival takes a mixture form. Let 𝑝𝑖 denote the probability of the 𝑖th subject sharing a deleterious allele with a proband and therefore becoming a carrier. Such probabilities are calculated based on Mendelian transmission and a family member’s relationship to the proband [8]. For example, offspring and siblings of a carrier proband have a probability of 50% of receiving the huntingtin allele that contains the CAG expansion (Homozygotes for HD are extremely rare since prevalence of HD in general population is rare). We assume that, conditioning on a family member receiving the expanded huntingtin allele, the CAG repeat length is the same as observed in the proband, although this is a simplification [7]. For subjects who receive a wild-type allele (CAG < 36), their probability of developing HD is zero, thus 𝑓(𝑡𝐴𝑖<36)=0, and 𝑆(𝑡𝐴𝑖<36)=1,forall𝑡. For the family members, the likelihood is 𝐿(𝛽)=𝑛𝑖=1𝑝𝑖𝑓𝛿𝑖𝑋𝑖𝐴𝑖𝑆;𝛽1𝛿𝑖𝑋𝑖𝐴𝑖+;𝛽1𝑝𝑖1𝛿𝑖,(2.10) where the above second term follows from the assumption that noncarriers do not develop HD. Note that for all carrier probands we observe 𝑝𝑖=1, thus the likelihood reduces to (2.5).

The above likelihood can be maximized by a combination of EM and Newton-Raphson algorithms. Let 𝐺𝑖 denote the unobserved carrier status indicator for the 𝑖th family member (i.e., 𝐺𝑖=1 indicates a family member receives a mutation and 𝐺𝑖=0 indicates otherwise). Then the complete data log-likelihood is 𝑛𝑖=1𝐼𝐺𝑖𝛿=1𝑖𝑓𝑋log𝑖𝐴𝑖+;𝛽1𝛿𝑖𝑆𝑋log𝑖𝐴𝑖;𝛽.(2.11) At the (𝑘+1)th iteration of the E-step, we compute the conditional expectation of the complete data log-likelihood, given the observed data. Essentially, we compute 𝑤𝑖(𝑘+1)𝐼𝐺=𝐸𝑖=1𝑋𝑖,𝛿𝑖,𝛽(𝑘)=𝑝𝑖𝑓𝛿𝑖𝑋𝑖𝐴𝑖;𝛽(𝑘)𝑆1𝛿𝑖𝑋𝑖𝐴𝑖;𝛽(𝑘)𝑝𝑖𝑓𝛿𝑖𝑋𝑖𝐴𝑖;𝛽(𝑘)𝑆1𝛿𝑖𝑋𝑖𝐴𝑖;𝛽(𝑘)+1𝑝𝑖1𝛿𝑖.(2.12) In the M-step, we update 𝛽(𝑘+1) by maximizing the weighted log-likelihood 𝑛𝑖=1𝑤𝑖(𝑘+1)𝛿𝑖𝑓𝑋log𝑖𝐴𝑖+;𝛽1𝛿𝑖𝑆𝑋log𝑖𝐴𝑖;𝛽(2.13) using the Newton-Raphson algorithm developed for the proband data.

Since for the combined analysis, the parameters are estimated by maximizing the likelihood through an EM algorithm, the standard asymptotic theory applies and the standard errors of parameters can be estimated by inverting the expected or observed information matrix based on the log-likelihood of the observed data. When there is missing data and an EM algorithm is used to obtain the MLE, the information matrix based on the observed data likelihood can be difficult to compute analytically or computationally. In such situations, Louis [15] proposed to compute the observed information matrix in terms of the conditional moments of the first and second derivatives of the complete data log likelihood which can be obtained easily under the EM algorithm framework. In some cases, these moments are easier to compute than the corresponding derivatives of the incomplete, observed data log-likelihood.

However, in our application, the derivatives of the observed data log likelihood are easy to compute. Thus, we computed the gradient and Hessian matrix of the observed data log-likelihood directly and estimated the standard errors of ̂𝛽 by the inverse of the Hessian matrix and estimated the standard errors of 𝐹(𝑡) by the Delta method similar to the proband-only analysis. Simulation studies in the next section show satisfactory performance of this direct and relatively simpler approach.

3. Simulation Studies

We conducted two simulation studies closely related to the observed COHORT data to illustrate the performance of the Newton-Raphson optimization and the EM algorithm [16]. In all our optimization procedures, we centered both 𝐴𝑖 and 𝑋𝑖. Since the direct optimization and EM algorithm need reasonable initial values, we fitted two nonlinear least square (NLS) to the observed sample mean and variance of the AAO on subjects with 𝛿𝑖=1. To be specific, we fit 𝑚1𝑎𝑖=𝜇1𝜇+exp2𝜇3𝑎𝑖,𝑠21𝑎𝑖=𝜎1𝜎+exp2𝜎3𝑎𝑖,(3.1) where 𝑚1(𝑎𝑖) and 𝑠21(𝑎𝑖) are the sample mean and variance for all subjects with 𝐴𝑖=𝑎𝑖, respectively. The six NLS estimators were used as the initial values for further optimization. We denoted the estimated 𝛽 from the centered data as ̂𝛽𝑐. For each simulation, the uncentered ̂𝛽 were then calculated based on ̂𝛽𝑐 and the sample mean of 𝐴𝑖 and 𝑋𝑖.

We restricted simulations to CAG repeat lengths between 41 and 56 to guard against sensitivity to the extremely high or low CAG repeats to be consistent with Langbehn et al. [10]. For the analysis of proband data, we generated a sample of 2000 subjects, each with a CAG length ranging from 41 to 56 that follows a multinomial distribution in which the probability pr(𝐴𝑖=𝑎) equals to the observed proportion of 𝐴𝑖=𝑎 in the COHORT proband data set. The failure times 𝑇𝑖 were simulated from the distribution (2.1), where the parameters 𝛽 were fixed at the values fitted from the COHORT proband data (see next section for their values). The censoring times, 𝐶𝑖, were generated from a rescaled Beta distribution with a scale and shape parameter of four. The parameters for the Beta distribution were chosen so that the proportion of censored subjects is the same in the simulated data and the observed COHORT proband data.

For the analysis of the combined proband and family data, we generated a sample of 4000 subjects. We assume the same proportion of the probands and relatives as observed in the combined COHORT data. For the family members, the probabilities 𝑝𝑖 were generated by resampling the observed 𝑝𝑖’s in the COHORT data. With a given 𝑝𝑖 for each subject, we simulated his or her huntingtin carrier status from a Bernoulli distribution with success probability 𝑝𝑖. For family members simulated to receive an expanded CAG repeat (carriers), their CAG repeats 𝐴𝑖 were set to be the same as the probands and their failure times were simulated from (2.5) with 𝛽 fixed at estimates from the COHORT combined data. For noncarrier family members, their failure times were set to be infinity and their 𝑋𝑖=𝐶𝑖. We used the same censoring distribution for generating 𝐶𝑖 as in the first simulation study.

We provide simulation results of the proband only and combined analyses in Tables 1 and 2. We present mean 𝐹(𝑡𝐴𝑖), empirical standard deviation of 𝐹(𝑡𝐴𝑖), and the mean estimated standard error of 𝐹(𝑡𝐴𝑖) at various ages in. We see from these tables that mean 𝐹(𝑡𝐴𝑖) is very close to true 𝐹(𝑡𝐴𝑖) in both studies. The mean estimated standard errors of 𝐹(𝑡𝐴𝑖) are close to the empirical standard deviations, indicating that the estimation of variability is appropriate. Figures 1 and 2 present three curves of 𝐹(𝑡𝐴𝑖) at 𝐴𝑖 = 41, 46, 50 and their 95% empirical confidence intervals for the proband data and combined data, respectively. We see that 𝐹(𝑡𝐴𝑖) coincide with the circles representing true 𝐹(𝑡𝐴𝑖) at various ages.

4. COHORT Data Analysis Results

COHORT is a multicenter observational study of individuals in the HD community. COHORT recruitment is open to subjects who have HD symptoms and signs (manifest HD), subjects who have an expanded CAG repeat but have not yet developed symptoms of HD (presymptomatic), subjects who have an HD affected parent but have not been tested and do not have symptoms (at risk), subjects who have an affected grandparent (secondary risk), and control subjects who are not at risk for HD. Information available on participating probands include genetic status (whether or not they carry HD mutation, and the number of CAG repeats), clinical diagnosis of HD, and the timing of symptom onset and timing of diagnosis. In our analyses, only probands with expanded CAG (CAG36) and their family members were included. Details of the cohort are cited in a publication in press [6].

We first describe the proband and family data in the COHORT study. Information on CAG repeat length and age was available for 1357 probands with CAG repeats varying from 36 to 100 (Table 3). There were 3409 first-degree relatives available from 675 probands. We do not have information on whether some of the probands are from the same family. We show the descriptive statistics for the relatives stratified by relationship type in Table 4. Each proband potentially has three versions of age-at-the-first-symptom (rater’s report, subject’s self-report, and a family member’s report). We gave the rater reported AAO of symptom the highest priority. If the rater reported version is not available, we then used subject report. If neither rater nor subject’s self-report is available, we then used the family member’s report. Twenty-one subjects whose self-reported and rater-reported AAO of symptom differed by greater than 15 years were removed. Our proband data set has 1151 subjects with CAG length between 41 and 56 and was used for the proband-only analysis. Similar to Langbehn et al. [10], we restricted the analysis to CAG repeat lengths between 41 and 56 to guard against sensitivity to the extremely high or low CAG repeats and against bias due to likely under ascertainment (relative to the population) of subjects with CAG length between 36 and 40.

Information on CAG repeat length, age at time of evaluation and the probability of being a carrier (receiving huntingtin mutation from the proband) was available for 2851 family members of 1151 probands. In the proband data set, both individuals with manifest HD and presymptomatic carriers (24%) are included. Their age-at-diagnosis and age-at-first- motor sign were recorded. Among 1151 probands, 876 (76%) subjects had experienced HD onset and the average AAO of the HD diagnosis was 44 years of age (standard deviation: 10.7). There were 54% females and 94% Caucasians. Our combined proband and family data set has 4002 subjects. In this combined data set, 51% were females and 35% subjects had experienced HD onset. Among the 4002 subjects, 467 are singletons (probands with no family member included). The other 3535 subjects belong to 623 pedigrees with an average size of 5.674 (sd = 2.609) members. In the combined data, there are two different probabilities of being a carrier: 𝑝𝑖=1 (1199 subjects with known CAG expansions or known HD onset) or 𝑝𝑖=0.5 (2803 subjects). Among the 2851 family members, 966 are parents of the probands, 1095 are siblings of the probands, and 790 are children of the probands.

When using the age-at-diagnosis in our proband data as 𝑇𝑖, the estimated cumulative risk of HD is 𝐹𝑡𝐴𝑖=𝜋1+exp3𝑡16.284exp8.3250.111𝐴𝑖22.379+exp15.6570.284𝐴𝑖1.(4.1) The estimated parameters for the CDF from the proband-only analysis are slightly different from the ones obtained from Langbehn et al. [10]. Our estimated mean and standard deviation of the AAO of HD is about 1 to 3 years later than the ones obtained in Langbehn et al. [10], and the standard deviation (SD) is slightly smaller (Table 5). In addition, the estimated CDF is smaller for most 𝐴𝑖 values using COHORT data. We ran a joint likelihood ratio test on the goodness-of-fit of parameters obtained in Langbehn et al. [10] and the 𝑃 value was less than 0.001 (test statistic = 66.0). When analyzing the age-at-first-symptom in our proband data, the estimated cumulative risk of HD is 𝐹𝑡𝐴𝑖=𝜋1+exp3𝑡14.266exp7.9870.104𝐴𝑖28.933+exp17.1300.312𝐴𝑖1.(4.2) We present 𝐹(𝑡𝐴𝑖) curves for age-at-diagnosis and age-at-symptom at various CAG lengths and their 95% confidence intervals for the proband data in Figure 3. It can be seen that with a given 𝐴𝑖, the estimated probability of having the first symptoms of HD is higher than the probability of a diagnosis of HD at the same age. This is consistent with the intuition that symptoms of HD will be observed before a diagnosis. The mean AAO of first symptom is estimated to be about 2 years earlier than AAO of diagnosis (Table 5) and the standard deviation of the former is slightly larger, indicating that reported age-at-first-symptom is more variable. It is unclear to what extent this difference represents true physical variability in illness development versus possibly lower reliability in the retrospective reporting of symptom onset [17].

As a sensitivity analysis, we compared the estimated CDF based on the parametric model with a nonparametric Kaplan-Meier estimator for subjects with a given 𝐴𝑖. Figure 4 presents this comparison using probands’ age-at-diagnosis data. We show in the figure that the parametric model fit is consistent with the Kaplan-Meier fit. However, as expected, the confidence interval for the parametric model estimate at a given age is narrower than the Kaplan-Meier estimate (results not shown). The figure comparing age-at-symptom models is similar and therefore omitted.

We reanalyzed only the AAO of the first symptom using the combined proband and family data, since the age-at-diagnosis was not available for family members who are not seen in person. The estimated cumulative risk of HD at age 𝑡 is 𝐹𝑡𝐴𝑖=𝜋1+exp3𝑡18.832exp8.4610.118𝐴𝑖32.365+exp14.8230.248𝐴𝑖1.(4.3) The corresponding 𝐹(𝑡𝐴𝑖) curves at various CAG lengths and their 95% confidence intervals are shown in Figure 5. In Table 5, we compare the estimated mean and SD of the AAO from the proband and combined data. We can see that the estimated mean AAOs for several CAGs are similar regardless of whether family members are included. The SD estimated from the model is larger for the combined data. This is a reflection of the observed data in that there is a wider range of AAO in the combined data than in the proband data. For example, the SD for CAG = 41 of the former is 11 years, whereas it is 10 years in the probands, and the SD for CAG = 42 is 10 in the combined and 8 in the probands.

One of the utilities of the estimated curves is to estimate the conditional probability of having an HD onset (or staying HD free) in the next five or ten years, given a subject has not had an onset by a given age. Similar to Langbehn et al. [10], in Table 6, we present such conditional probabilities in five-year intervals for a subject without HD at age 40 and with given CAG repeats. For example, a 40-year presymptomatic subject with a CAG of 42 has a probability of 34% (CI: 32%, 36%) for developing HD in the next 10 years (by age 50), while for a subject with a CAG of 50 this probability increases to 0.93 (CI: 0.91, 0.95).

5. Discussion

We propose methods to predict disease risk from a known mutation (or to estimate the penetrance function). For most complex diseases, predicting the AAO of a disease from genetic markers such as single-nucleotide polymorphisms (SNPs) continue to be a challenging issue [18]. Even with diseases like HD where the gene is identified, the predictive model can be complicated: a special feature of HD is that the mutation severity is quantifiable and varies significantly among the affected population. This contrasts with the typical categorical approach needed, for example, in genome-wide association studies. The proposed methods are also applicable to other expanded trinucleotide repeat diseases similar to HD.

One of the contributions of this work is to use the family data as well as the proband data to maximize available information in building a model. Our results reveal that the estimated risk obtained from the combined proband and family data is slightly lower than the risk estimated from the proband data alone. It is possible that the proband data consists of a biased clinical sample of gene positive or HD-affected subjects (e.g., subjects with more severe disease or with earlier onset may be more likely to participate; presymptomatic subjects might be undersampled) and is therefore not a fair representative sample of the entire HD population, especially underrepresenting subjects at risk. The plausibility of such underascertainment is so strong for CAG lengths of 40 or less [7] that we did exclude observations within that range from analysis. The family data may be a better representative of the population since the family members are included in the analysis only through the inclusion of the probands. Although proband may participate the study because they had HD or they had more severe symptoms of HD, the relatives were not included based on their CAG repeat lengths or affection status. Of course, some of the family members will not share an expanded CAG repeat huntingtin with the probands and therefore are noncarriers who will never develop HD.

Note that our estimated cumulative risk of onset of a positive HD diagnosis in the proband data is also slightly lower than Langbehn et al. [10] which also examined age-at-HD diagnosis. We estimated later mean AAO for each CAG repeat length shorter than 54 than did Langbehn et al. [10]. For example, the mean AAO of HD diagnosis for probands with a CAG of 42 in the former data was 3 years later and, for a CAG of 43, it was 4 years later (Table 3). On average, for all subjects with a CAG between 41 and 50, the mean AAO in Langbehn data was 2 years earlier than in the COHORT data. More detailed comparisons are presented in Table 5. There are several possible reasons for these differences. The model end point, AAO, should probably be considered to be slightly different in the two models. The outcome in Langbehn et al. [10] was earliest age at which a clinician documented an irreversible objective sign of the illness. This may occur earlier than the point at which an actual diagnosis of manifest HD is given. (Many clinicians wait until there are several such signs.) This may also occur, however, at a point that is later than the proband’s or family’s first report of subjective symptoms or their first perception of disease signs. In the CAG range of 41–49, the Langbehn et al. means are very close to the symptom onset means in the current data. For longer CAG lengths, the Langbehn et al. estimates more closely resemble the current models for disease diagnosis. Possible systematic variability between the clinicians in the two studies may also account for the differences in the estimates.

Other potential differences between the data sources include potential research-center-specific heterogeneity in diagnostic and rating conventions and slight variations in the methods used to determine CAG repeat length. In the Langbehn study, these were measured by a variety of laboratories while in the COHORT they were all measured in the same laboratory.

We do note that the differences between the fitted models here and those in Langbehn et al. are substantially smaller than differences among other formulae in the literature [14]. AAO probabilities, conditioned on current age, are especially similar. In HD research and genetic counseling, these conditional probabilities are perhaps the most commonly used statistic deriving from these formulae. Finally, the logistic-exponential form of the parametric model proposed in Langbehn et al. [10] does indeed fit the empirical AAO distributions quite well in the COHORT data. This validates use of this relatively complicated survival model for HD AAO research and may encourage considerations of quantitative biological mechanisms that would generate exponential relationships between CAG and both AAO and its variance.

There has often been ambiguity in the modeling literature concerning the exact meaning of HD “onset.” The first onset of observable signs or reportable symptoms of HD generally occurs before the actual diagnosis of clinically manifest HD is given. Much of the earlier modeling literature, reviewed in Langbehn et al. [14], does not clearly address this distinction, although the resultant formulas have often been used for subsequent prediction of HD diagnosis [14]. The event modeled in Langbehn et al. [10] was “the first time that neurological signs representing a permanent change from the normal state was identified in a patient.” This might be considered to the concept of “subject’s first noted symptom" rather than age of diagnosis. Nonetheless, this model has been used frequently as a predictor of future diagnosis in HD [14]. In the current study, we do distinguish between first symptom onset and diagnosis.

Here, we assumed Mendelian transmission of huntingtin without interference so that the CAG length does not change from parents to offspring. There are several possible violations of these assumptions. CAG lengths do, in reality, vary somewhat among family members, and those inheriting the gene from their father have, on average, a slightly longer CAG repeat length than their father. The probability of this occurring is much lower if inheritance is from the mother [19]. An explanation is that there are many more biological opportunities for the CAG length to change in the father’s process of sperm formation than in the mother’s process of egg formation. These processes and their dynamics have been studied extensively in vitro [7, 20], but we know of no well-verified in vivo dynamic population genetics models. Assuming the CAG length does not change from father to offspring may lead to a slightly lower estimated risk for affected fathers of probands.

Consistent with Langbehn et al. [10] and other studies [20, 21], we estimated reduced penetrance for lower CAG repeat lengths (≤40). We point out that the parameter estimates from the current model do not include subjects with CAG less than 41; therefore, the risk estimates for these subjects are extrapolations. However, it is conceivable that as long as the inverse relationship between AAO and CAG still holds for the lower CAGs, the life time disease risk for these subjects will be less than 100%, since the life time risk for a CAG of 41 is about 100%.

In the literature, no proportional odds model has been fitted to model the age-at-onset of HD. Proportional odds model, or along a similar line, transformation model, belongs to the semiparametric model framework and is beyond the scope of this paper. We are currently investigating semiparametric models other than the Cox proportional hazards model.

Finally, we stress that our current model does not include other observed covariates, such as additional genetic polymorphisms. In addition, we assumed conditional independence of family members’ age-at-onset (AAO) of HD given their CAG repeats. This assumption implies that we do not account for residual correlation among family members’ AAO caused by factors other than the CAG repeats, such as life style factors. When there exists such residual correlation, point estimates from our current approach are still consistent hence still valid, although the standard error estimates are no longer correct. A practical limitation of using family members’ AAO data is that they may be less reliable than the data directly collected from the probands. This limitation applies to all other diseases, especially those with late onset. This limitation can be more pronounce when there is incomplete penetrance and variability of phenotype. Future work would consider incorporating such measurement error in the analysis. Lastly, the proposed methods do not include possible unobserved effects that may be site or clinician-specific and perhaps related to the interpretation of the point of “onset.” Future research will focus on incorporating observed covariates and adding family-specific random effects to account for residual familial aggregation.

Acknowledgments

Y. Wang’s research is supported by NIH Grants R03AG031113-01A2 and R01NS073671-01. Samples and/or data from the COHORT study, which receives support from HP Therapeutics, Inc., were used in this study. The authors thank the Huntington Study Group COHORT investigators and coordinators who collected data and/or samples used in this study, as well as participants and their families, who made this work possible.