Abstract
Under cohort sampling designs, additional covariate data are collected on cases of a specific type and a randomly selected subset of noncases, primarily for the purpose of studying associations with a time-to-event response of interest. With such data available, an interest may arise to reuse them for studying associations between the additional covariate data and a secondary non-time-to-event response variable, usually collected for the whole study cohort at the outset of the study. Following earlier literature, we refer to such a situation as secondary analysis. We outline a general conditional likelihood approach for secondary analysis under cohort sampling designs and discuss the specific situations of case-cohort and nested case-control designs. We also review alternative methods based on full likelihood and inverse probability weighting. We compare the alternative methods for secondary analysis in two simulated settings and apply them in a real-data example.
1. Introduction
Cohort sampling designs are two-phase epidemiological study designs where information on time-to-event outcomes of interest over a followup period and some basic covariate data are collected on the whole first-phase study group, referred to as a cohort, and in the second phase, more expensive or difficult-to-obtain additional covariate data are collected only on a subset of the study cohort. This usually comprises the cases, that is, individuals with a disease event of interest during the followup, and a randomly selected subset of noncases. Examples are the case-cohort [1โ3] and nested case-control [4, 5] designs. Primarily, such designs are applied for the purpose of studying associations between the time-to-event outcomes and the covariates collected in the second phase. However, with such data having been collected, an interest frequently arises to reuse it for studying associations between the second-phase covariates and the other available covariate data. For instance, the covariates collected in the second phase could be genotypes, while the other covariates may be various phenotype measurements carried out at the outset of the followup period for the whole cohort. The interest would then be to explain a phenotypic response with the genetic covariates. Following Jiang et al. [6] and Lin and Zeng [7], we refer to such a situation as secondary analysis. Here, we concentrate specifically on non-time-to-event secondary outcomes. Analysis of secondary time-to-event outcomes under the nested case-control design has been considered previously by Saarela et al. [8] and Salim et al. [9].
As our motivating example, we consider here a single cohort which was used in a larger meta-analysis of association between the European lactase persistence genotype and body mass index (BMI) [10], the latter being a secondary outcome in the cohort study in question. The cohort consists of 5073 men aged 55โ77 years from southern and western Finland, who originally formed the placebo group of the ATBC cancer prevention study [11]. Whole blood samples of the participants were taken between 1992 and 1993, which is here considered as the baseline of the cohort, with followup for cardiovascular disease events and all-cause mortality available until the end of year 1999. There is no loss to followup, so the only censoring present is of type I due to end of the followup period. This cohort is a part of MORGAM project, an international pooling of cardiovascular cohorts [12]. Genotype data (including the lactase persistence SNP rs4988235) under this project have been collected under a case-cohort design described in detail by [13] and herein in Section 4.3.1. Given such data, our aim is to estimate the association between the lactase persistence genotype and BMI making use of genotype data collected on both the random subcohort and cases of all-cause mortality.
Secondary analysis of case-control data has been studied previously, using profile likelihood [14], inverse selection probability weighting methods [15โ17], or retrospective likelihood [6, 7]. However, to the best of our knowledge, a systematic discussion on secondary analysis under cohort sampling designs has been lacking, which we will aim to rectify here by discussing alternative approaches for such an analysis under a generic two-phase study design. We will briefly review the full likelihood approach which utilizes all observed data (Section 2), as well as pseudolikelihoods based on inverse selection probability weighting (Section 3). For these approaches, we propose a conditional likelihood-based alternative (Section 4), restricted to the fully observed second-phase study group. Conditional likelihood inference under cohort sampling designs has been studied previously for the analysis of the primary time-to-event outcome by Langholz and Goldstein [18] and Saarela and Kulathinal [19]; here, we extend these methods to the secondary analysis setting. The main interest is in continuous secondary outcomes, though the approach would also be valid for categorical responses. As special cases of the general setting, we consider case-cohort and nested case-control designs. As extensions to the basic setting, we consider treatment of missing second-phase covariate data and adjustment for left truncation in the case of incident time-to-event outcomes (Section 5). In Section 6, we present two simulation studies, first comparing the efficiencies of the alternative approaches and then demonstrating the potential adverse effects of small sampling fraction in full likelihood inference. We also carry out the analysis in the real-data example using all three alternative methods. As the model for the continuous secondary response variable, in addition to the customary normal distribution, we consider more flexible model specifications, thus aiming to incorporate residual-based model fit diagnostics into the model itself. While other generalizations for the normal distribution have been proposed, we adopt here the four-parameter normal-polynomial quantile mixture [20], which includes the normal distribution as a special case.
2. Notation, Assumptions, and Full Likelihood
To cover our motivating example and also the general case, we introduce first some notation. Let the set represent the individuals in the cohort. Primary outcome in the study is a time-to-event outcome characterized by random variables , , where corresponds to event time and to the event type of individual , with indicating a censoring event and a death due to any cause. Extension to incident nonfatal outcomes and multiple outcome types is considered separately in Section 5.2. A secondary outcome of interest , here BMI, is observed on all study participants at the outset of the study. In addition, there may be other covariates , available on all , relevant to be included in the analysis. In the present example, these comprise only the age at the start of the followup. Additional covariate data (here the lactase persistence genotype) are collected only on the second-phase study group , specified by the inclusion indicators , analogously to the survey response/nonresponse setting of Rubin [21]. We will henceforth use vector notations of the type to represent data obtained on different subsets of individuals. We will not make a distinction between random variables and their realized values in the notation when this is clear from the context. The observed data as a whole are then represented as . We are interested in the model for the secondary outcome, more precisely, the parameter describing the association between and , which in our case correspond to BMI and the genotypic covariate.
We assume that the first-phase sampling mechanism has been unconfounded in the sense of Rubin [21, page 36], so that we may ignore the first-phase sampling mechanism in all subsequent analyses. This means that the cohort recruitment (possibly through survey sampling) and possible nonresponse depend only on the -covariates; then if all subsequent analyses are conditional on , the selection mechanism may be ignored. In contrast, the second phase sampling may be outcome dependent; the second-phase sampling mechanism is specified by the joint probability distribution for the set of indicator variables . This is assumed to be unconfounded with , that is, . What follows could be further generalized by assuming the sampling mechanism to be ignorable so that , but since most common cohort sampling mechanisms go under the former assumption, and it will simplify the exposition, we will proceed with that. In addition, we assume the random vectors either to be (infinitely) exchangeable over unit indices (cf. [21, page 40]), or equivalently, conditionally independent given the collection of all relevant parameters . It should be noted that the exchangeability assumption needs not to be extended to the inclusion indicators . Now, following, for example, Saarela et al. [8], we can write a full (observed data) likelihood expression In the secondary analysis situation, only the parameters are of interest, while and remain present as nuisance parameters. There are potential drawbacks related to the above likelihood expression; it requires integration over the unobserved covariate data in the set (which may be large compared to ), as well as modeling of the population distribution of . If possible, we would like to avoid this due to the required estimation of the nuisance parameters and the risk of model misspecification. Furthermore, observed data likelihoods may become sensitive to misspecification of the model for the response variable; the missing data can act similarly to extra parameters, and the actual model parameters may lose their intended interpretation. This is a real problem especially in cohort sampling designs with a rare event of interest, since the proportion of uncollected covariate data in the study cohort may then be very high. We demonstrate such a situation with simplified simulation example in Section 6.2.
3. Methods Based on Inverse Probability Weighting
Valid estimates for the parameters of the secondary outcome model could alternatively be obtained by using inverses of the first-order inclusion probabilities (assumed here to be known) as weights (e.g., [22, 23]). The weighted log-likelihood function, approximating the corresponding complete data log-likelihood, to be used for the secondary analysis is In a typical cohort sampling design, all cases would receive unit weights, while noncases selected to the set would receive weights greater than one, inverse to their probability to be included in the set of controls/subcohort. While this kind of weighting results in unbiased estimation of the parameters of interest, it will potentially result in reduced efficiency compared to the full and conditional likelihood approaches, since in (3.1) cases receive smaller weights compared to noncases irrespective of whether the case status is actually associated to the secondary outcome or the covariate of interest, and thus the information in these observations may not be fully utilized. We will demonstrate this in a simulation study in Section 6.1. Theoretical justification for estimating function (3.1), as well as variance estimation is discussed in Appendix A. A variation of the above Horvitz-Thompson [24] type of weighting would be poststratification-based estimation (e.g., [15, 25]), with the stratification carried out over the relevant first-phase variables.
4. Conditional Likelihood Inference under Cohort Sampling Designs
4.1. Definition
A very general definition for conditional likelihood is given by Cox and Hinkley [26, pages 16-17] and Cox [27, page 269] as follows. Let be a random vector corresponding to all observed data, and let this be partitioned into relevant subsets , the transformation not depending on unknown parameters . The conditional likelihood given the realization is then the conditional probability distribution with respect to . Few generally accepted rules for choosing the partitioning can be found from the literature, one of these being conditioning on an ancillary statistic or something close to that in order to eliminate nuisance parameters (e.g., [28, 29]). In contrast, although working under the same general definition, here we condition on a sampling mechanism, in order to restrict the analysis into a subgroup for which a useful likelihood expression can be written. Generally such a conditioning may lose information on , but will nevertheless produce valid estimates. The rules we set out for choosing the partitioning to construct a conditional likelihood under the two-phase study setting are as follows. (1) Condition on the sampling mechanism, that is, the set of inclusion indicators , which produced the second-phase study group onto which the analysis is to be restricted. Do not condition on any further information on the sampling mechanism, such as inclusion in the subcohort or the set of controls, since this can easily lead to overconditioning which will lose a lot of information. In our notation, such additional information on the sampling mechanism is implicitly included in and, given the assumptions stated in Section 2, will cancel out of the resulting likelihood expression. (2)Other observed variables may be placed into or at will, depending on the parameters of interest. For instance, if the parameters are not of interest, we may condition on by placing it in . If the parameters need to be estimated, may be placed in .(3)We must have , that is, all the relevant observed variables must either be modeled or conditioned upon.
Applying these conditioning rules will reproduce the conditional likelihood expressions obtained previously in special cases of the current framework by Langholz and Goldstein [18] and Saarela and Kulathinal [19]. The proposed conditional likelihood framework also includes the familiar retrospective likelihood, often suggested for analysis under case-control designs (e.g., [30, 31]), as a special case (see Appendix B). Whereas Langholz and Goldstein [18] considered only the special case of logistic likelihoods, here we aim to first derive the likelihood expressions in the general case before substituting in any specific models.
4.2. Conditional Likelihood Expression: The General Case
Following the stated rules, and making the same general assumptions as in Section 2, we proceed by partitioning the observed data into relevant subsets and and, using a shorthand notation of for all outcome variables, work with a conditional likelihood where the proportionality follows from the assumption on unconfounded second-phase sampling mechanism. In Appendix C, we show that such a conditional likelihood indeed has the properties of a likelihood, that is, a score function with zero mean and variance equal to the Fisher information. Asymptotic normality of the maximum likelihood estimators based on the conditional likelihood follows obviously from these results in the special case of Bernoulli sampling (Section 4.3.1), although in the general case, this may require further assumptions on the sampling mechanism, such as asymptotic independence of the inclusion indicators .
The ratio of the numerator and the second term in the denominator can be further written as Here, the product forms follow from the exchangeability assumption for the random vectors . Thus, we have obtained where the numerator factors into the parametric models . The denominator is the conditional likelihood correction term (sometimes called ascertainment correction, e.g., Ma et al. [32]). Its specific form depends on the second-phase sampling mechanism, and in the general case, it does not reduce into a product form. It depends on the model parameters, since it is not conditioned on all of . Generally, the challenge in conditional likelihood correction terms is in representing them in terms of parameters estimable from the data. Here, the term can be written as where is given by (4.2), and we have Here, the first term inside the integral specifies the sampling mechanism and is assumed to be known. The obtained likelihood expression utilizes all observed data on covariates , and it can be easily seen that if there is no association between and and between and , we lose no information relevant to learning about the association between and , when the conditional likelihood (4.3) is used instead of the corresponding full likelihood (2.1). This was also demonstrated in the simulation study of Saarela and Kulathinal [19]. On the other hand, if the other observed data do give significant information on the unobserved covariate values , efficiency could potentially be improved by using the full likelihood which utilizes all observed data, with the cost of having to specify a model for . In Section 6.1, the efficiencies of expressions (2.1) and (4.3) are compared in a simulated setting.
4.3. Special Cases: Case-Cohort and Nested Case-Control Designs
4.3.1. Case-Cohort/Bernoulli Sampling
Here, we are mainly interested in a variation of the โefficient case-cohort designโ suggested by Kim and De Gruttola [33, pages 155-156] as an alternative to sampling within strata. To improve efficiency in sampling of the subcohort, here the distribution of some key covariates (in the present notation either or both ) in the subcohort is approximately matched to that of the cases by first fitting a logistic regression model, say, , and then selecting the subcohort using Bernoulli sampling with the probabilities , independently of the case status. We then have , where in practice, we make the approximation and will subsequently suppress the sampling mechanism parameters from the notation. The selection probabilities may also be rescaled to give a desired expected subcohort size as . More generally, the subcohort selection probability may be any known function of , that is, , with the sampling design specified by . The special case of Bernoulli sampling is of interest, because here the product of the first-order inclusion probabilities specifies the joint distribution of the inclusion indicators (see, e.g., [34, pages 62-63]), which considerably simplifies the analysis using conditional likelihood. In the case of stratified without replacement sampling, the following could only be interpreted as a first-order approximation when the sampling fractions are small. In the Bernoulli sampling case, the conditional likelihood correction term reduces into the product form Let denote the set of individuals who died during the followup, and let . In the case-cohort design discussed in Kulathinal and Arjas [35], Kulathinal et al. [13], and Saarela and Kulathinal [19], all cases are selected to the case-cohort set, so that , , while to increase the efficiency of the design, the subcohort is selected with probabilities which depend on the age at the start of the followup (included in the covariates) through a logistic model as discussed above, giving the inclusion probability for non-cases as , . Under Bernoulli sampling, the realized sample size is random, with the expected sample size in the present example given by . In the case of a mortality outcome and type censoring at predetermined times , with the observed time given by , where is the latent time of death, (4.6) further simplifies into (see also [19, pages 12-13]) Hence, (4.6) could be represented in terms of the survival function. Suppose now that in addition to the type I censoring due to the end of the followup period, there may be random censoring during the followup; for instance, this may be the case if the outcome in the case-cohort design () is not all-cause mortality, but, say, mortality due to cardiovascular diseases. Deaths due to other causes, denoted as , then appear as censoring. As before, let denote the latent time of death, with indicating the type of death or end of followup period. Similarly as above, we get Here, the probabilities , , are given by the cumulative incidence functions and by the joint survival function of a competing risks survival model for the two types of death (e.g., [36, pages 251-252]). These are in principle identifiable from the observed data for the set , since the subcohort has been selected independently of the case status and thus will include a number of other deaths. On the other hand, if this number is very small, the middle term in the above sum contributes little to the correction term, and thus unstable estimation of the corresponding parameters does not necessarily hinder the estimation of the parameters of interest.
4.3.2. Risk Set Sampling
Consider now a nested case-control sampling mechanism in which all cases are selected to the case-control set with probability one, and controls per case are selected using without replacement sampling from the risk set (of size ), where is the at-risk indicator for subject at event time . This is carried out independently for all cases . Let the sampled set of time-matched controls for case be denoted as , some of which may also be future cases since the sampling is carried out without regard to the future case status of the individuals in the risk set. Let denote the collection of the sampled risk sets and the pooled set of sampled controls. Noting that knowing all of specifies the risk sets as well as the order of the events, the risk set sampling mechanism is specified by the joint probability distribution If the sampling has been carried out within strata specified by the covariates and/or through some function , the risk sets would be defined within strata as , and the above reasoning would apply similarly with the redefined risk sets. The conditional likelihood correction term now becomes which does not reduce into a product form. Exact numerical evaluation of this term would require enumeration of all possible combinations of sampled risk sets of size which would result in the observed case-control set . This is unlikely to be feasible in practice with realistic sample sizes, which is why consideration of approximations may be necessary. Samuelsen [23] showed the inclusion indicators under risk set sampling to be asymptotically pairwise uncorrelated, with the first-order inclusion probabilities for the noncases given by . Thus, it might be tempting to replace (4.9) with . However, the properties of such an approximation will require further research.
5. Extensions
5.1. Missing Second-Phase Covariate Data
Further issue to be considered in practical applications is possible missing covariate data within the set . For instance, if the covariates to be collected under the cohort sampling design are genotypes, after selection of subjects to be genotyped, it may turn out that the DNA amount or concentration is too low or the genotyping is otherwise unsuccessful. Let be a set of indicator variables indicating whether the measurement of turned out to be unsuccessful after selection of subject into the set , and let . Now is observed only on the set . We consider the implications of this separately for all of the three approaches discussed above.
5.1.1. Full Likelihood
If the missingness can be assumed missing at random, that is, , the missingness indicators can again be included in the proportionality constant, and the full likelihood expression becomes
5.1.2. Inverse Probability Weighting
If the missingness mechanism would be known (and missing at random), a weighted pseudolikelihood expression for the set could be obtained by multiplying the original selection probabilities by the probabilities of observing the value of after selection In practice, however, the missingness mechanism would have to be modeled to obtain estimates for these probabilities.
5.1.3. Conditional Likelihood
By partitioning the observed data into and , we obtain a conditional likelihood where is obtained from (4.5) by replacing the set with . Unlike the second-phase sampling mechanism, the missingness mechanism is generally unknown. From (5.3), it can be seen that if we are willing to assume that are either missing completely at random or that depends only on , the terms involving the missingness indicators cancel out of the likelihood. On the other hand, if the missingness may depend on the response variables , the missingness mechanism would have to be modeled. However, in such a case, it may be easier to work with the partitioning and and model the population distribution of rather than trying to estimate parameters describing the missingness mechanism. This in fact corresponds to what was done by Saarela and Kulathinal [19] and was required there because haplotypes are only partially identifiable from unphased genotype data.
5.2. Incident Outcomes and Left Truncation
Previous discussion was specific to a primary mortality outcome using time on study as the main time scale. In this section, we discuss separately how the different methods can accommodate cohort sampling for incident nonfatal primary outcomes. In the analysis of secondary, non-time-to-event outcomes, the presence of left truncation due to exclusion of cases of prevalent disease presents an additional complication. If the parameters of the secondary outcome model correspond to the background population alive at the cohort baseline (rather than to the disease-free population), this additional selection factor requires further adjustment. If the primary outcome is a mortality endpoint, this is not an issue, since then there is no further selection due to prevalent conditions. In likelihood-based adjustment for left truncation, the main time scale of the analysis has to be chosen as age instead of time on study. In survival modeling, it is well known that conditioning on event-free survival until the age at study baseline corresponds to exclusion of the followup time before that (e.g., [37, page 580], and the references therein). However, this is no longer true in the case of missing covariate data (e.g., [8, pages 5997-5998]), or indeed in the analysis of secondary outcomes, as will be demonstrated below. The set notations and are here taken to refer to the disease free cohort and second-phase study group.
It should also be noted that under case-cohort designs it is common to collect second-phase covariate data for more than a single outcome, since the case-cohort design naturally enables the analysis of multiple outcomes using a single subcohort selection. This is also the case in our example cohort discussed in Sections 1 and 6.3, where, in addition to cases of all-cause mortality, genotype data has been collected also on cases of nonfatal incident cardiovascular disease events. To keep the example simple, in the data analysis, we consider only the case-cohort set for all-cause mortality. However, it is in principle straightforward to accommodate multiple outcome types in the likelihood expressions discussed below by using competing risks type notation where denotes the observed time of the first incident disease event, death, or censoring, with indicating censoring and the different types of outcome events for which the second-phase covariate data is collected. Since utilizing the likelihood expressions does not necessitate estimation of different hazard functions, the endpoint definitions may be pooled as seen suitable.
5.2.1. Full Likelihood
The full likelihood expression (2.1) is now conditioned on the selection rule for all , that is, event-free survival until the age at the cohort baseline. The likelihood expression becomes
5.2.2. Inverse Probability Weighting
The weighted pseudolikelihood approach does not readily take into account additional selection which occurs in studies of incident outcomes. This was also pointed out by Reilly et al. [15], who in their discussion note that their reweighting approach for case-control studies is not valid for incident or nested case-control studies, since the incident cases and healthy controls available cannot be reweighted to represent the general population. Although this is true for the basic estimating function (3.1), similarly as in the missing data case, a double weighting mechanism can be devised to account for the additional selection step as Since this weighting requires estimation of the time-to-event model parameters , resampling-based variance estimation would be required to obtain valid standard errors for the estimates of to account for the uncertainty in estimation of the weights.
5.2.3. Conditional Likelihood
As in Section 4.2, we get where the correction term becomes In the case-cohort design discussed in Section 4.3.1, with type I censoring, this further simplifies into
6. Illustrations
6.1. Simulation Study
In order to compare the efficiency of the alternative estimation methods discussed above, namely, full likelihood, conditional likelihood, and weighted pseudolikelihood, we supplemented the cohort data described in Section 1 with a simulated covariate, following the real-data-based simulation approach described by Saarela et al. in [8, pages 5998โ6000], in order to have the simulation setting resemble as closely as possible the real data analysis setting of Section 6.3. This Bayesian procedure corresponds to multiple imputation [21], where given the observed data and predetermined associations with the primary and secondary outcomes, an additional โmissingโ covariate is simulated from the posterior predictive distribution . Case-cohort sampling was then simulated in the complete datasets thus obtained. The objective of this approach, as compared to using completely simulated data, was to obtain simulation results directly relevant to the real study of interest. We included a set of individuals with a BMI measurement available and eligible for the case-cohort study. The primary (time-to-event) outcome was taken to be all-cause mortality, with . Given the primary and secondary outcome data and age at baseline observed for the cohort, and the current parameter values, additional binary covariate values were drawn from the conditional distributions where the model for the survival outcome, , was specified as a proportional hazards model , with time on study as the time scale and age at baseline as a covariate. Here, the the baseline hazard function was specified in terms of a piecewise constant function using 15 time bins of equal length over the followup period of seven years. Flat improper priors were used for all parameters (those not fixed to selected values), on log scale for the nonnegative parameters.
Normal model was used for the secondary outcome, while the population distribution of the additional covariate was specified as . Here, parameters , , and were fixed at selected values while the other parameters were allowed to be determined by the data and integrated out by drawing from their respective full conditional posterior distributions using Markov chain Monte Carlo sampling. Variables and were centered but otherwise untransformed. 1000 values for each , , were obtained by running the MCMC sampler for 25000 rounds after a 10000-round burn-in with each set of fixed values of given in Table 1 and saving every 25th state of the chain. In each of the complete datasets obtained by combining the observed data and the simulated covariate values , case-cohort selection was carried out by matching the age distribution of the subcohort to that of the cases by first fitting a logistic regression model with age as a covariate to the observed survival status, as detailed in Section 4.3.1. The predictive probabilities from the logistic model were scaled to give an expected subcohort size of 1000, and the subcohort was selected using the obtained probabilities in Bernoulli sampling. This procedure gave an expected case-cohort set size of .
In fitting models to the datasets so obtained, we used the same model specifications as above, with the exception of the proportional hazards model, where we fitted a โmisspecifiedโ Weibull model . Even though this is a different model than the one specified for the simulation step, we did not simulate the time-to-event outcome data, which was taken from the example cohort; a comparison between fitted piecewise constant and Weibull hazards (results not shown) indicated that the Weibull model is adequate for modeling these data, accounting for the increasing baseline mortality rate due to ageing of the cohort. Full likelihood function (2.1), conditional likelihood function (4.3), and weighted pseudolikelihood function (3.1) were maximized with respect to the parameters , , and , respectively, substituting in the above parametric models. The numerical optimization was carried out using the optim function of the statistical software, applying the BFGS optimization algorithm [38]. The integrals over in (4.7) were evaluated using numerical (quadrature) integration, using the gsl_integration_qagi function of GNU scientific library [39]. The standard errors of the maximum likelihood estimators were evaluated by inverting the numerically differentiated Hessian matrix at the maximum likelihood point. The standard errors of the inverse-probability-weighted estimates were evaluated using the robust variance formula of Appendix A with the observed information (A.4) and observed score (A.5). The results from the 1000 replications are presented in Table 1. Here, the parameters , corresponding to the simulated covariate, are of main interest; the other parameters reflect the observed data, and thus the Monte Carlo variances for these are not relevant. The standard errors for the corresponding parameters are given in the real-data example of Section 6.3. The results indicate that conditional likelihood estimation gives similar efficiency compared to the corresponding full likelihood estimates irrespective of whether the covariate of interest was associated with the survival outcome. In contrast, both likelihood-based approaches gave better efficiency compared to inverse probability weighting. This is expected, as the weighted pseudolikelihood gives smaller weights to cases, whereas the conditional likelihood weights cases and noncases differentially only to the extent they actually differ. Both the inverse observed information and robust variance estimates agreed well with the Monte Carlo variances.
6.2. Multimodality under Full Likelihood When the Sampling Fraction Is Small
Let now the observed data be only , and let the sampling mechanism be simple random sampling without replacement; , where the sample size is fixed. The specific aim of the following example is to demonstrate the vulnerability of observed data likelihoods, integrated over the missing covariate data on the set , to misspecification of the response model. The full likelihood expression now becomes . Due to the simple random sampling, the corresponding conditional likelihoods simplify into and .
With , we simulated covariate values from and two different sets of response values (both independently of ) from and (same mean and variance, but the latter distribution is skewed to the right). We fit two alternative models to these data, in both models , with the models for the response specified as and , where are the collections of all model parameters. The latter model is the normal-polynomial quantile mixture distribution proposed by Karvanen ([20, pages 950โ953]; see also package Lmoments), which we apply here for regression modeling. The parametrization here can be expressed in terms of the first four L-moments as , (L-scale), (L-skewness), and (L-kurtosis). This distribution is suitable for regression modeling as the first L-moment is the mean of the distribution, while the two additional shape parameters allow for more flexible modeling of the residual distribution. The NPQM distribution includes normal distribution as a special case when and . The full and conditional log-likelihood expressions under these model specifications were maximized with respect to and as described in the previous section. Initial values for the algorithm were set at the sample moments calculated from the marginal distribution of the response variable, with the regression coefficient set to and .
Table 2 shows the mean maximum likelihood estimates over 1000 replications for all 8 combinations of data generating model, fitted model, and estimation method. With the response simulated from normal distribution, both models estimated using either full or conditional likelihood give the expected results (Figures 1 and 2). However, with the response simulated from gamma distribution and the sampling fraction small, the misspecified normal model estimated using full likelihood indicates a spurious association between the response and the covariate. This is because the missing data act as extra parameters, allowing the optimization procedure to obtain a better fit to the skewed data. The corresponding sampling distributions for shown in Figure 3 have become bimodal. It should be noted that the initial values given to the optimization algorithm corresponded always to the โcorrect solutionโ of no covariate effect, thus enabling the algorithm to find the correct mode. However, bimodality in the sampling distribution does not necessarily indicate that the likelihood functions given a single realized dataset would be bimodal.
Using the NPQM model which allows a skewed residual distribution does not correct the situation either when combined with full likelihood estimation (Figure 4). This is not unexpected since adding more parameters to an already overparameterized situation is not necessarily helpful in solving identification problems. In contrast, conditional likelihood continues to give reasonable results in all cases, with the skewness of the data correctly reflected by the parameter of the NPQM model. Some degree of misspecification of the response model is required for the multimodality to appear since when we simulated the response variable from NPQM distribution, the full likelihood fit of the NPQM model indicated no problems, while the full likelihood fit of the normal model showed the same problems as with gamma data (Figures 5 and 6). We also repeated all the simulation alternatives with ( and ), with the conclusions essentially unchanged (Table 3 and Figures 7, 8, 9, and 10).
6.3. An Example with Real Data
The case-cohort set for all-cause mortality in the example cohort () is of size , the union of a subcohort of size 1068 and 996 deaths due to any cause (since the subcohort has been selected independently of the case status, it includes a number of cases). The case-cohort design applied in this cohort was described in Section 4.3.1. Due to sample unavailability or unsuccessful genotyping, the genotype for the lactase persistence SNP rs4988235 was unavailable for 156 individuals, giving . The model for the data described in Section 1 was specified as follows: Hardy-Weinberg model was used for the number of lactase persistence alleles. The model for BMI, which is the response variable of interest, given age at baseline and the genotype , was taken to be . The model for the survival outcome was taken to be , where , and using again the Weibull form for the hazard function . In both regression models, the lactase persistence allele noncarriers are compared to the hetero- and homozygote carriers of the allele. Under the case-cohort design where the subcohort sampling probabilities depend on , the conditional likelihood expression takes the form or depending on whether we condition upon the observed genotype data , or whether this is modeled as part of the likelihood (see Section 5.1.3). The difference between these two approaches is that the former requires more assumptions on the missingness mechanism. Therefore, we compare here the two approaches to see whether there is any observable difference between the results. In addition, expression (6.3) enables estimation of the population allele frequency . However, it should be noted that while the summation over the single SNP genotype variable is not a computational problem in the present example, this is not necessarily the case generally, with multiple continuous covariates in the model. Therefore, the potential advantage of expression such as (6.2) is that it avoids specification of the covariate distribution and the integration over the missing covariate values.
A remaining issue to be considered is the numerical evaluation of the integrals over in the correction terms of (6.2) and (6.3). It should be noted that these have to be evaluated at each parameter value tried at the hill-climbing numerical optimization. Although numerical (quadrature) integration would be feasible in one dimension, here we propose evaluation of such integrals using Monte Carlo integration by repeated sampling from the distribution . This has an added advantage that while the numerical evaluation of the density of the NPQM model is computationally expensive, random variates can be easily simulated from this distribution since it is defined through the quantile function. This proceeds by drawing random variates from -uniform distribution and applying formula (6.2) of Karvanen [20]. With random variates , , drawn for each given the current parameter values, the integrals in (6.2) are then approximated by the means .
Maximum likelihood estimates obtained by numerical maximization of the expressions (6.2) and (6.3) with respect to , , and in the latter case also , are presented in Table 4 ( and ) and Table 5 (). For comparison, we also maximised a full likelihood of the form (5.1) with respect to and a weighted pseudolikelihood of the form (5.2) with respect to . In the latter case, we accounted for the missingness within the case-cohort set by fitting a logistic regression model in the set and calculated the adjusted weights as inverses of . Due to the extra estimation step, we used bootstrap with 5000 replications to obtain standard errors for the weighted pseudolikelihood estimates. Both conditional likelihood expressions gave very similar results, suggesting that the missing data within the case-cohort set is not a major issue in the present case. The comparison between different number of Monte Carlo replicates in numerical evaluation of the conditional likelihood correction term suggests that the estimates do no longer appreciably change when is increased from 1000. However, the change in the estimates when increasing from 100 to 1000 is already well within the standard errors.
Full likelihood and weighted pseudolikelihood estimates agreed well with the conditional likelihood ones, although the latter had higher standard errors, as was the case also in the simulations. As noted by Kettunen et al. [10], the absence of the lactase persistence allele shows association with lower body mass index. The residuals from the model for BMI are significantly skewed to the right, as indicated by the estimates of .
7. Discussion
Although conditional logistic likelihood is well known in the context of risk set sampling designs (e.g., [5, 18, 40]), its connection to the general concept of conditional likelihood has rarely been emphasized by authors. Furthermore, use of the conditional likelihood approach is not limited to the binary event outcome response situation. In this paper we have attempted to highlight these issues in the context of modeling a continuous secondary response variable under cohort sampling designs. Naturally, the conditional likelihood approach is also valid for the primary time-to-event outcome. Although here we considered only parametric specifications for the time-to-event outcome, semiparametric maximum likelihood techniques could also be utilized here. A potential disadvantage of the likelihood-based methods for secondary analysis is that they require specification of a model for the primary time-to-event outcome even though it is not of main interest in the secondary analysis setting, whereas specification of this model is avoided in the weighted pseudolikelihood approach.
Compared to the full likelihood approach which would be applicable under any cohort sampling design, the advantage of the conditional likelihood is that modeling of the population distribution of the covariates collected in the second phase can be avoided, if this is not of primary interest. In addition, as we demonstrated in a simulation example, full likelihood expressions with most of the covariate data unobserved may no longer be well behaved, a problem which the conditional likelihood approach does not have. The disadvantage of the conditional likelihood approach is that the second-phase sampling mechanism, that is, the joint distribution of the inclusion indicators, needs to be specified. In the case-cohort/Bernoulli sampling situation, this is straightforward, but in mechanisms such as risk set sampling, where the sampling probabilities are specified only implicitly, resolving the joint sampling probability can be computationally nontrivial, and approximations may be needed in practice. This is avoided in the full likelihood approach, since it does not require specification of the sampling mechanism. The inverse-probability-weighted estimating function only requires specification of the first-order selection probabilities, but the possible dependencies induced by the sampling mechanism need to be addressed in the variance estimation step. The full and conditional likelihood approaches gave equivalent efficiencies in our simulated setting, although the result might be different if the first-phase data involves covariates which are highly predictive of the second-phase covariate of interest. In any case, both likelihood-based methods gave a clear improvement in efficiency in the secondary analysis setting compared to the inverse-probability-weighted method.
Appendices
A. On Inverse-Probability-Weighted Pseudolikelihood Estimators
The use of the expression (3.1) in approximating the corresponding complete data log-likelihood , where we denote , is justified by considering the expectation of the pseudoscore function with respect to the joint distribution , which is assumed to be the correct data generating mechanism. This becomes the last form being the expectation of the complete data score function for parameters , which becomes zero through the inner expectation if the parametric model is correct, that is, corresponds to the data generating mechanism. Thus, the expression (3.1) gives a valid estimating equation for parameters . However, since the variance of the pseudoscore function does not equal the Fisher information, a robust/sandwich type variance estimator must be used instead of the inverse of the observed information. As an example, below we derive this in the special case of a Bernoulli sampling mechanism , where this is straightforward. For consideration of asymptotic variances of inverse-probability-weighted estimators under stratified without replacement sampling or risk set sampling, we refer to Breslow and Wellner [41] and Cai and Zheng [42].
Denoting and , the latter is given as a solution to the estimating equation , or approximately, by the first-order Taylor expansion at the true value , as a solution to , which in turn, by substituting expected to observed pseudoinformation, gives the approximate relationship . Taking covariance of both sides then gives , where the pseudo-Fisher information is given by Since the pseudoscore function has zero mean and now , , its covariance is given by In practice, these quantities are evaluated at the maximum likelihood point, replacing the expected quantities by their observed counterparts as
Alternatively, bootstrap variance estimation might be utilized, but it should be noted that standard bootstrap would be valid only under Bernoulli type sampling designs, which can be interpreted as sampling from infinite population, whereas dependencies induced by sampling without replacement would require application of finite population bootstrap methods (e.g., [43, 44]).
B. Relationship to Retrospective Likelihood
Consider a special case where the observed data are only , where , , are binary indicators for the case status. This corresponds to the case-control situation considered by Jiang et al. [6] and Lin and Zeng [7]. The set now represents a study base from which the cases and controls have been selected. Now choosing a partitioning and , and supposing only the case status information has been used in the selection of cases and controls, we obtain a conditional likelihood The last form is the retrospective form suggested by Lin and Zeng [7, page 257] for secondary analysis under case-control designs. Thus, retrospective likelihood is a special case of conditional likelihood. The advantage of the above expression is that the terms specifying the sampling mechanism cancel out. The drawbacks are that the baseline risk level (part of ) is not identifiable from this expression (see, e.g., [30, page 1076]) and that the parameters have to be estimated or otherwise made to disappear (Lin and Zeng [7] applied profile likelihood). However, nothing prevents us from choosing a more useful partitioning of the observed data. For instance, with and , we obtain or, with and , analogously to Section 4.2, The baseline risk level may be identifiable from this last expression, depending on the sampling mechanism used (cf. [18]). The misconception that case-control design necessitates the use of retrospective likelihood has been recently criticized by Langholz [45] and results from equating likelihood function to a data generating mechanism. However, the order of the data collection does not need to determine how the likelihood is factorized, as long as the sampling mechanism is properly taken into account.
C. Mean and Variance of the Conditional Likelihood Score Function
In the following, we suppress the covariates from the notation since these are always conditioned upon and do not affect the derivations. Also, for notational simplicity, the parameter is taken to be a scalar, and the notation is compressed so that, for example, . The probability measures indexed with respect to are assumed to possess the regularity properties allowing interchanging the order of integration with respect to the random variables and differentiation with respect to . The conditional likelihood function of interest is now and the corresponding score function becomes The expectation of the score function is considered with respect to the full model . The expectation of the first term in (C.2) becomes In the above expressions, it should be noted that the index set is fixed by . Due to the assumption on unconfounded sampling mechanism, the term does not depend on , and we get Similarly, the expectation of the second term in (C.2) becomes Thus, we have and the conditional likelihood score function has zero expectation. The Fisher information is given by By writing open the expectations as above, it is easy to see that Thus, the Fisher information becomes that is, equal to the variance of the conditional likelihood score function.
Acknowledgments
The majority of this work was carried out when the first author was working at the Department of Chronic Disease Prevention of the National Institute for Health and Welfare, Helsinki, Finland. The work was partly supported by the European Commission through the Seventh Framework Programme CHANCES Project [HEALTH-F3-2010-242244]. The authors would like to thank Professor Jarmo Virtamo and Professor Markus Perola of the National Institute for Health and Welfare for permission to use the ATBC data in our illustration.