About this Journal Submit a Manuscript Table of Contents
Journal of Probability and Statistics
Volume 2012 (2012), Article ID 931416, 37 pages
Research Article

Secondary Analysis under Cohort Sampling Designs Using Conditional Likelihood

1Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC, Canada H3A 1A2
2Indic Society for Education and Development (INSEED), Nashik, Maharashtra 422 011, India
3Department of Vaccines, National Institute for Health and Welfare, 00271 Helsinki, Finland
4Department of Mathematics and Statistics, University of Tampere, 33014 Tampere, Finland
5Department of Mathematics and Statistics, University of Helsinki, 00014 Helsinki, Finland

Received 28 July 2011; Revised 29 December 2011; Accepted 24 January 2012

Academic Editor: Kari Auranen

Copyright © 2012 Olli Saarela et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Under cohort sampling designs, additional covariate data are collected on cases of a specific type and a randomly selected subset of noncases, primarily for the purpose of studying associations with a time-to-event response of interest. With such data available, an interest may arise to reuse them for studying associations between the additional covariate data and a secondary non-time-to-event response variable, usually collected for the whole study cohort at the outset of the study. Following earlier literature, we refer to such a situation as secondary analysis. We outline a general conditional likelihood approach for secondary analysis under cohort sampling designs and discuss the specific situations of case-cohort and nested case-control designs. We also review alternative methods based on full likelihood and inverse probability weighting. We compare the alternative methods for secondary analysis in two simulated settings and apply them in a real-data example.

1. Introduction

Cohort sampling designs are two-phase epidemiological study designs where information on time-to-event outcomes of interest over a followup period and some basic covariate data are collected on the whole first-phase study group, referred to as a cohort, and in the second phase, more expensive or difficult-to-obtain additional covariate data are collected only on a subset of the study cohort. This usually comprises the cases, that is, individuals with a disease event of interest during the followup, and a randomly selected subset of noncases. Examples are the case-cohort [13] and nested case-control [4, 5] designs. Primarily, such designs are applied for the purpose of studying associations between the time-to-event outcomes and the covariates collected in the second phase. However, with such data having been collected, an interest frequently arises to reuse it for studying associations between the second-phase covariates and the other available covariate data. For instance, the covariates collected in the second phase could be genotypes, while the other covariates may be various phenotype measurements carried out at the outset of the followup period for the whole cohort. The interest would then be to explain a phenotypic response with the genetic covariates. Following Jiang et al. [6] and Lin and Zeng [7], we refer to such a situation as secondary analysis. Here, we concentrate specifically on non-time-to-event secondary outcomes. Analysis of secondary time-to-event outcomes under the nested case-control design has been considered previously by Saarela et al. [8] and Salim et al. [9].

As our motivating example, we consider here a single cohort which was used in a larger meta-analysis of association between the European lactase persistence genotype and body mass index (BMI) [10], the latter being a secondary outcome in the cohort study in question. The cohort consists of 5073 men aged 55–77 years from southern and western Finland, who originally formed the placebo group of the ATBC cancer prevention study [11]. Whole blood samples of the participants were taken between 1992 and 1993, which is here considered as the baseline of the cohort, with followup for cardiovascular disease events and all-cause mortality available until the end of year 1999. There is no loss to followup, so the only censoring present is of type I due to end of the followup period. This cohort is a part of MORGAM project, an international pooling of cardiovascular cohorts [12]. Genotype data (including the lactase persistence SNP rs4988235) under this project have been collected under a case-cohort design described in detail by [13] and herein in Section 4.3.1. Given such data, our aim is to estimate the association between the lactase persistence genotype and BMI making use of genotype data collected on both the random subcohort and cases of all-cause mortality.

Secondary analysis of case-control data has been studied previously, using profile likelihood [14], inverse selection probability weighting methods [1517], or retrospective likelihood [6, 7]. However, to the best of our knowledge, a systematic discussion on secondary analysis under cohort sampling designs has been lacking, which we will aim to rectify here by discussing alternative approaches for such an analysis under a generic two-phase study design. We will briefly review the full likelihood approach which utilizes all observed data (Section 2), as well as pseudolikelihoods based on inverse selection probability weighting (Section 3). For these approaches, we propose a conditional likelihood-based alternative (Section 4), restricted to the fully observed second-phase study group. Conditional likelihood inference under cohort sampling designs has been studied previously for the analysis of the primary time-to-event outcome by Langholz and Goldstein [18] and Saarela and Kulathinal [19]; here, we extend these methods to the secondary analysis setting. The main interest is in continuous secondary outcomes, though the approach would also be valid for categorical responses. As special cases of the general setting, we consider case-cohort and nested case-control designs. As extensions to the basic setting, we consider treatment of missing second-phase covariate data and adjustment for left truncation in the case of incident time-to-event outcomes (Section 5). In Section 6, we present two simulation studies, first comparing the efficiencies of the alternative approaches and then demonstrating the potential adverse effects of small sampling fraction in full likelihood inference. We also carry out the analysis in the real-data example using all three alternative methods. As the model for the continuous secondary response variable, in addition to the customary normal distribution, we consider more flexible model specifications, thus aiming to incorporate residual-based model fit diagnostics into the model itself. While other generalizations for the normal distribution have been proposed, we adopt here the four-parameter normal-polynomial quantile mixture [20], which includes the normal distribution as a special case.

2. Notation, Assumptions, and Full Likelihood

To cover our motivating example and also the general case, we introduce first some notation. Let the set 𝒞{1,,𝑁} represent the individuals in the cohort. Primary outcome in the study is a time-to-event outcome characterized by random variables (𝑇𝑖,𝐸𝑖), 𝑖𝒞, where 𝑇𝑖 corresponds to event time and 𝐸𝑖 to the event type of individual 𝑖, with 𝐸𝑖=0 indicating a censoring event and 𝐸𝑖=1 a death due to any cause. Extension to incident nonfatal outcomes and multiple outcome types is considered separately in Section 5.2. A secondary outcome of interest 𝑌𝑖, here BMI, is observed on all study participants at the outset of the study. In addition, there may be other covariates 𝑋𝑖, available on all 𝑖𝒞, relevant to be included in the analysis. In the present example, these comprise only the age at the start of the followup. Additional covariate data (here the lactase persistence genotype) 𝑍𝑖 are collected only on the second-phase study group 𝒪{𝑖𝑅𝑖=1}𝒞, specified by the inclusion indicators 𝑅𝑖{0,1}, analogously to the survey response/nonresponse setting of Rubin [21]. We will henceforth use vector notations of the type 𝑍𝒪{𝑍𝑖𝑖𝒪} to represent data obtained on different subsets of individuals. We will not make a distinction between random variables and their realized values in the notation when this is clear from the context. The observed data as a whole are then represented as (𝑅𝒞,𝑇𝒞,𝐸𝒞,𝑋𝒞,𝑌𝒞,𝑍𝒪). We are interested in the model 𝑃(𝑌𝑖𝑋𝑖,𝑍𝑖,𝛽) for the secondary outcome, more precisely, the parameter describing the association between 𝑌𝑖 and 𝑍𝑖, which in our case correspond to BMI and the genotypic covariate.

We assume that the first-phase sampling mechanism has been unconfounded in the sense of Rubin [21, page 36], so that we may ignore the first-phase sampling mechanism in all subsequent analyses. This means that the cohort recruitment (possibly through survey sampling) and possible nonresponse depend only on the 𝑋𝑖-covariates; then if all subsequent analyses are conditional on 𝑋𝑖, the selection mechanism may be ignored. In contrast, the second phase sampling may be outcome dependent; the second-phase sampling mechanism is specified by the joint probability distribution for the set of indicator variables 𝑅𝒞. This is assumed to be unconfounded with 𝑍𝒞, that is, 𝑅𝒞(𝑍𝒞,𝜃)𝑇𝒞,𝐸𝒞,𝑋𝒞,𝑌𝒞. What follows could be further generalized by assuming the sampling mechanism to be ignorable so that 𝑅𝒞(𝑍𝒞𝒪,𝜃)𝑇𝒞,𝐸𝒞,𝑋𝒞,𝑌𝒞,𝑍𝒪, but since most common cohort sampling mechanisms go under the former assumption, and it will simplify the exposition, we will proceed with that. In addition, we assume the random vectors (𝑇𝑖,𝐸𝑖,𝑌𝑖,𝑋𝑖,𝑍𝑖) either to be (infinitely) exchangeable over unit indices 𝑖 (cf. [21, page 40]), or equivalently, conditionally independent given the collection of all relevant parameters 𝜃. It should be noted that the exchangeability assumption needs not to be extended to the inclusion indicators 𝑅𝑖. Now, following, for example, Saarela et al. [8], we can write a full (observed data) likelihood expression𝑃𝑅𝒞,𝑇𝒞,𝐸𝒞,𝑌𝒞,𝑍𝒪𝑋𝒞,𝜃(𝛼,𝛽,𝛾)𝑖𝒪𝑃𝑇𝑖,𝐸𝑖𝑋𝑖,𝑌𝑖,𝑍𝑖𝑃𝑌,𝛼𝑖𝑋𝑖,𝑍𝑖𝑃𝑍,𝛽𝑖𝑋𝑖×,𝛾𝑖𝒞𝒪𝑧𝑖𝑃𝑇𝑖,𝐸𝑖𝑋𝑖,𝑌𝑖,𝑧𝑖𝑃𝑌,𝛼𝑖𝑋𝑖,𝑧𝑖𝑃,𝛽𝑑𝑧𝑖𝑋𝑖.,𝛾(2.1) In the secondary analysis situation, only the parameters 𝛽 are of interest, while 𝛼 and 𝛾 remain present as nuisance parameters. There are potential drawbacks related to the above likelihood expression; it requires integration over the unobserved covariate data in the set 𝒞𝒪 (which may be large compared to 𝒪), as well as modeling of the population distribution of 𝑍𝑖. If possible, we would like to avoid this due to the required estimation of the nuisance parameters 𝛾 and the risk of model misspecification. Furthermore, observed data likelihoods may become sensitive to misspecification of the model for the response variable; the missing data can act similarly to extra parameters, and the actual model parameters may lose their intended interpretation. This is a real problem especially in cohort sampling designs with a rare event of interest, since the proportion of uncollected covariate data in the study cohort may then be very high. We demonstrate such a situation with simplified simulation example in Section 6.2.

3. Methods Based on Inverse Probability Weighting

Valid estimates for the parameters of the secondary outcome model could alternatively be obtained by using inverses of the first-order inclusion probabilities 𝑃(𝑅𝑖=1𝑇𝒞,𝐸𝒞,𝑋𝒞,𝑌𝒞) (assumed here to be known) as weights (e.g., [22, 23]). The weighted log-likelihood function, approximating the corresponding complete data log-likelihood, to be used for the secondary analysis is𝑖𝒪𝑌log𝑃𝑖𝑋𝑖,𝑍𝑖,𝛽𝑃𝑅𝑖=1𝑇𝒞,𝐸𝒞,𝑋𝒞,𝑌𝒞=𝑖𝒞𝟏{𝑅𝑖=1}𝑌log𝑃𝑖𝑋𝑖,𝑍𝑖,𝛽𝑃𝑅𝑖=1𝑇𝒞,𝐸𝒞,𝑋𝒞,𝑌𝒞.(3.1) In a typical cohort sampling design, all cases would receive unit weights, while noncases selected to the set 𝒪 would receive weights greater than one, inverse to their probability to be included in the set of controls/subcohort. While this kind of weighting results in unbiased estimation of the parameters of interest, it will potentially result in reduced efficiency compared to the full and conditional likelihood approaches, since in (3.1) cases receive smaller weights compared to noncases irrespective of whether the case status is actually associated to the secondary outcome or the covariate of interest, and thus the information in these observations may not be fully utilized. We will demonstrate this in a simulation study in Section 6.1. Theoretical justification for estimating function (3.1), as well as variance estimation is discussed in Appendix A. A variation of the above Horvitz-Thompson [24] type of weighting would be poststratification-based estimation (e.g., [15, 25]), with the stratification carried out over the relevant first-phase variables.

4. Conditional Likelihood Inference under Cohort Sampling Designs

4.1. Definition

A very general definition for conditional likelihood is given by Cox and Hinkley [26, pages 16-17] and Cox [27, page 269] as follows. Let 𝑈 be a random vector corresponding to all observed data, and let this be partitioned into relevant subsets 𝑈=(𝑉,𝑊), the transformation not depending on unknown parameters 𝜃. The conditional likelihood given the realization 𝑉=𝑣 is then the conditional probability distribution 𝑃(𝑊𝑑𝑤𝑉=𝑣,𝜃) with respect to 𝜃. Few generally accepted rules for choosing the partitioning can be found from the literature, one of these being conditioning on an ancillary statistic or something close to that in order to eliminate nuisance parameters (e.g., [28, 29]). In contrast, although working under the same general definition, here we condition on a sampling mechanism, in order to restrict the analysis into a subgroup for which a useful likelihood expression can be written. Generally such a conditioning may lose information on 𝜃, but will nevertheless produce valid estimates. The rules we set out for choosing the partitioning to construct a conditional likelihood under the two-phase study setting are as follows. (1) Condition on the sampling mechanism, that is, the set of inclusion indicators 𝑅𝒞, which produced the second-phase study group 𝒪 onto which the analysis is to be restricted. Do not condition on any further information on the sampling mechanism, such as inclusion in the subcohort or the set of controls, since this can easily lead to overconditioning which will lose a lot of information. In our notation, such additional information on the sampling mechanism is implicitly included in 𝑊 and, given the assumptions stated in Section 2, will cancel out of the resulting likelihood expression. (2)Other observed variables may be placed into 𝑉 or 𝑊 at will, depending on the parameters of interest. For instance, if the parameters 𝛾 are not of interest, we may condition on 𝑍𝒪 by placing it in 𝑉. If the parameters 𝛾 need to be estimated, 𝑍𝒪 may be placed in 𝑊.(3)We must have 𝑃(𝑊𝑑𝑤𝑉=𝑣,𝜃)𝑃(𝑉𝑑𝑣𝜃)=𝑃(𝑈𝑑𝑢𝜃), that is, all the relevant observed variables must either be modeled or conditioned upon.

Applying these conditioning rules will reproduce the conditional likelihood expressions obtained previously in special cases of the current framework by Langholz and Goldstein [18] and Saarela and Kulathinal [19]. The proposed conditional likelihood framework also includes the familiar retrospective likelihood, often suggested for analysis under case-control designs (e.g., [30, 31]), as a special case (see Appendix B). Whereas Langholz and Goldstein [18] considered only the special case of logistic likelihoods, here we aim to first derive the likelihood expressions in the general case before substituting in any specific models.

4.2. Conditional Likelihood Expression: The General Case

Following the stated rules, and making the same general assumptions as in Section 2, we proceed by partitioning the observed data into relevant subsets 𝑊(𝑇𝒪,𝐸𝒪,𝑌𝒪) and 𝑉(𝑅𝒞,𝑇𝒞𝒪,𝐸𝒞𝒪,𝑌𝒞𝒪,𝑋𝒞,𝑍𝒪) and, using a shorthand notation of 𝑄𝑖(𝑇𝑖,𝐸𝑖,𝑌𝑖) for all outcome variables, work with a conditional likelihood 𝑃𝑄𝒪𝑅𝒞,𝑄𝒞𝒪,𝑋𝒞,𝑍𝒪=𝑃𝑅,𝜃𝒞𝑄𝒞,𝑋𝒞,𝑍𝒪𝑃𝑄,𝜃𝒞,𝑍𝒪𝑋𝒞,𝜃𝑃𝑅𝒞𝑄𝒞𝒪,𝑋𝒞,𝑍𝒪𝑃𝑄,𝜃𝒞𝒪,𝑍𝒪𝑋𝒞,𝜃𝜃𝑃𝑄𝒞,𝑍𝒪𝑋𝒞,𝜃𝑃𝑅𝒞𝑄𝒞𝒪,𝑋𝒞,𝑍𝒪𝑃𝑄,𝜃𝒞𝒪,𝑍𝒪𝑋𝒞,,𝜃(4.1) where the proportionality follows from the assumption on unconfounded second-phase sampling mechanism. In Appendix C, we show that such a conditional likelihood indeed has the properties of a likelihood, that is, a score function with zero mean and variance equal to the Fisher information. Asymptotic normality of the maximum likelihood estimators based on the conditional likelihood follows obviously from these results in the special case of Bernoulli sampling (Section 4.3.1), although in the general case, this may require further assumptions on the sampling mechanism, such as asymptotic independence of the inclusion indicators 𝑅𝑖.

The ratio of the numerator and the second term in the denominator can be further written as 𝑃𝑄𝒞,𝑍𝒪𝑋𝒞,𝜃𝑃𝑄𝒞𝒪,𝑍𝒪𝑋𝒞=,𝜃𝑖𝒪𝑃𝑄𝑖,𝑍𝑖𝑋𝑖,𝜃𝑖𝒞𝒪𝑧𝑖𝑃𝑄𝑖,𝑍𝑖𝑑𝑧𝑖𝑋𝑖,𝜃𝑖𝒪𝑞𝑖𝑃𝑄𝑖𝑑𝑞𝑖,𝑍𝑖𝑋𝑖,𝜃𝑖𝒞𝒪𝑧𝑖𝑃𝑄𝑖,𝑍𝑖𝑑𝑧𝑖𝑋𝑖=,𝜃𝑖𝒪𝑃𝑄𝑖𝑋𝑖,𝑍𝑖.,𝜃(4.2) Here, the product forms follow from the exchangeability assumption for the random vectors (𝑄𝑖,𝑋𝑖,𝑍𝑖). Thus, we have obtained 𝑃𝑄𝒪𝑅𝒞,𝑄𝒞𝒪,𝑋𝒞,𝑍𝒪=,𝜃𝑖𝒪𝑃𝑄𝑖𝑋𝑖,𝑍𝑖,𝜃𝑃𝑅𝒞𝑄𝒞𝒪,𝑋𝒞,𝑍𝒪,𝜃,(4.3) where the numerator factors into the parametric models 𝑃(𝑄𝑖𝑋𝑖,𝑍𝑖,𝜃)=𝑃(𝑇𝑖,𝐸𝑖𝑋𝑖,𝑌𝑖,𝑍𝑖,𝛼)𝑃(𝑌𝑖𝑋𝑖,𝑍𝑖,𝛽). The denominator is the conditional likelihood correction term (sometimes called ascertainment correction, e.g., Ma et al. [32]). Its specific form depends on the second-phase sampling mechanism, and in the general case, it does not reduce into a product form. It depends on the model parameters, since it is not conditioned on all of (𝑇𝒞,𝐸𝒞,𝑌𝒞,𝑋𝒞). Generally, the challenge in conditional likelihood correction terms is in representing them in terms of parameters estimable from the data. Here, the term can be written as 𝑃𝑅𝒞𝑄𝒞𝒪,𝑋𝒞,𝑍𝒪=,𝜃𝑞𝒪𝑃𝑅𝒞𝑄𝒞,𝑋𝒞𝑃𝑄𝒪𝑑𝑞𝒪𝑄𝒞𝒪,𝑋𝒞,𝑍𝒪,𝜃,(4.4) where 𝑃(𝑄𝒪𝑄𝒞𝒪,𝑋𝒞,𝑍𝒪,𝜃) is given by (4.2), and we have 𝑃𝑅𝒞𝑄𝒞𝒪,𝑋𝒞,𝑍𝒪=,𝜃𝑦𝑖𝑖𝒪𝑡𝑖𝑖𝒪𝑒𝑖𝑖𝒪𝑃𝑅𝒞𝑇𝒞,𝐸𝒞,𝑌𝒞,𝑋𝒞×𝑖𝒪𝑃𝑇𝑖𝑑𝑡𝑖,𝐸𝑖=𝑒𝑖𝑋𝑖,𝑦𝑖,𝑍𝑖𝑃𝑌,𝛼𝑖𝑑𝑦𝑖𝑋𝑖,𝑍𝑖.,𝛽(4.5) Here, the first term inside the integral specifies the sampling mechanism and is assumed to be known. The obtained likelihood expression utilizes all observed data on covariates 𝑍𝑖, and it can be easily seen that if there is no association between (𝑇𝑖,𝐸𝑖) and 𝑍𝑖 and between 𝑍𝑖 and 𝑋𝑖, we lose no information relevant to learning about the association between 𝑌𝑖 and 𝑍𝑖, when the conditional likelihood (4.3) is used instead of the corresponding full likelihood (2.1). This was also demonstrated in the simulation study of Saarela and Kulathinal [19]. On the other hand, if the other observed data do give significant information on the unobserved covariate values 𝑍𝒞𝒪, efficiency could potentially be improved by using the full likelihood which utilizes all observed data, with the cost of having to specify a model for 𝑍𝑖. In Section 6.1, the efficiencies of expressions (2.1) and (4.3) are compared in a simulated setting.

4.3. Special Cases: Case-Cohort and Nested Case-Control Designs
4.3.1. Case-Cohort/Bernoulli Sampling

Here, we are mainly interested in a variation of the “efficient case-cohort design” suggested by Kim and De Gruttola [33, pages 155-156] as an alternative to sampling within strata. To improve efficiency in sampling of the subcohort, here the distribution of some key covariates (in the present notation either 𝑋𝒞 or both (𝑋𝒞,𝑌𝒞)) in the subcohort is approximately matched to that of the cases by first fitting a logistic regression model, say, logit{𝑃(𝐸𝑖=1𝑋𝑖,𝜇)}=𝜇𝑋𝑖, and then selecting the subcohort using Bernoulli sampling with the probabilities 𝜋𝑖1/(1+exp{𝜇𝑋𝑖}), independently of the case status. We then have 𝑃(𝑅𝑖𝑇𝒞,𝐸𝒞,𝑌𝒞,𝑋𝒞,𝜇)=𝑃(𝑅𝑖𝑇𝑖,𝐸𝑖,𝑌𝑖,𝑋𝑖,𝜇), where in practice, we make the approximation 𝑃(𝑅𝑖𝑇𝑖,𝐸𝑖,𝑌𝑖,𝑋𝑖,𝜇)𝑃(𝑅𝑖𝑇𝑖,𝐸𝑖,𝑌𝑖,𝑋𝑖,𝜇) and will subsequently suppress the sampling mechanism parameters from the notation. The selection probabilities may also be rescaled to give a desired expected subcohort size 𝑚 as 𝜋𝑖𝑚𝜋𝑖/𝑖𝒞𝜋𝑖𝑚𝜋𝑖/(𝑁𝑃(𝐸𝑖=1)). More generally, the subcohort selection probability 𝜋𝑖 may be any known function of (𝑇𝑖,𝐸𝑖,𝑌𝑖,𝑋𝑖), that is, 𝜋𝑖𝜋(𝑇𝑖,𝐸𝑖,𝑌𝑖,𝑋𝑖), with the sampling design specified by 𝑃(𝑅𝒞𝑇𝒞,𝐸𝒞,𝑌𝒞,𝑋𝒞)=𝑖𝒞𝑃(𝑅𝑖𝑇𝑖,𝐸𝑖,𝑌𝑖,𝑋𝑖)=𝑖𝒞𝜋(𝑇𝑖,𝐸𝑖,𝑌𝑖,𝑋𝑖)𝑅𝑖(1𝜋(𝑇𝑖,𝐸𝑖,𝑌𝑖,𝑋𝑖))1𝑅𝑖. The special case of Bernoulli sampling is of interest, because here the product of the first-order inclusion probabilities specifies the joint distribution of the inclusion indicators (see, e.g., [34, pages 62-63]), which considerably simplifies the analysis using conditional likelihood. In the case of stratified without replacement sampling, the following could only be interpreted as a first-order approximation when the sampling fractions are small. In the Bernoulli sampling case, the conditional likelihood correction term reduces into the product form 𝑃𝑅𝒞𝑇𝒞𝒪,𝐸𝒞𝒪,𝑌𝒞𝒪,𝑋𝒞,𝑍𝒪=,𝜃𝑖𝒞𝒪𝑃𝑅𝑖=0𝑇𝑖,𝐸𝑖,𝑌𝑖,𝑋𝑖×𝑦𝑖𝑖𝒪𝑡𝑖𝑖𝒪𝑒𝑖𝑖𝒪𝑖𝒪𝑃𝑅𝑖=1𝑡𝑖,𝑒𝑖,𝑦𝑖,𝑋𝑖𝑇×𝑃𝑖𝑑𝑡𝑖,𝐸𝑖=𝑒𝑖𝑋𝑖,𝑦𝑖,𝑍𝑖𝑃𝑌,𝛼𝑖𝑑𝑦𝑖𝑋𝑖,𝑍𝑖,𝛽(𝛼,𝛽)𝑖𝒪𝑦𝑖𝑡𝑖𝑒𝑖𝑃𝑅𝑖=1𝑡𝑖,𝑒𝑖,𝑦𝑖,𝑋𝑖𝑇×𝑃𝑖𝑑𝑡𝑖,𝐸𝑖=𝑒𝑖𝑋𝑖,𝑦𝑖,𝑍𝑖𝑃𝑌,𝛼𝑖𝑑𝑦𝑖𝑋𝑖,𝑍𝑖.,𝛽(4.6) Let {𝑖𝐸𝑖=1} denote the set of individuals who died during the followup, and let 𝑑||. In the case-cohort design discussed in Kulathinal and Arjas [35], Kulathinal et al. [13], and Saarela and Kulathinal [19], all cases are selected to the case-cohort set, so that 𝑃(𝑅𝑖=1𝑇𝑖,𝐸𝑖=1,𝑌𝑖,𝑋𝑖)=1, 𝑖, while to increase the efficiency of the design, the subcohort is selected with probabilities which depend on the age 𝑏𝑖 at the start of the followup (included in the 𝑋𝑖 covariates) through a logistic model as discussed above, giving the inclusion probability for non-cases as 𝑃(𝑅𝑖=1𝑇𝑖,𝐸𝑖=0,𝑌𝑖,𝑋𝑖)=𝜋(𝑏𝑖), 𝑖𝒞. Under Bernoulli sampling, the realized sample size 𝑛|𝒪| is random, with the expected sample size in the present example given by 𝐸(𝑛𝐸𝒞,𝑋𝒞)=𝑑+𝑖𝒞𝜋(𝑏𝑖). In the case of a mortality outcome and type I censoring at predetermined times 𝑐𝑖, with the observed time given by 𝑇𝑖𝑇min(𝑖,𝑐𝑖), where 𝑇𝑖 is the latent time of death, (4.6) further simplifies into (see also [19, pages 12-13])𝑃𝑅𝒞𝑇𝒞𝒪,𝐸𝒞𝒪,𝑌𝒞𝒪,𝑋𝒞,𝑍𝒪,𝜃(𝛼,𝛽)𝑖𝒪𝑦𝑖𝑡𝑖[0,𝑐𝑖]𝑃𝑅𝑖=1𝑡𝑖,𝐸𝑖=1,𝑦𝑖,𝑋𝑖𝑃𝑑𝑡𝑖,𝐸𝑖=1𝑋𝑖,𝑦𝑖,𝑍𝑖𝑅,𝛼+𝑃𝑖=1𝑡𝑖,𝐸𝑖=0,𝑦𝑖,𝑋𝑖𝑃𝑑𝑡𝑖,𝐸𝑖=0𝑋𝑖,𝑦𝑖,𝑍𝑖,𝛼×𝑃𝑑𝑦𝑖𝑋𝑖,𝑍𝑖=,𝛽𝑖𝒪𝑦𝑖𝑃0𝑇𝑖<𝑐𝑖,𝐸𝑖=1𝑋𝑖,𝑦𝑖,𝑍𝑖𝑏,𝛼+𝜋𝑖𝑃𝑇𝑖𝑑𝑐𝑖,𝐸𝑖=0𝑋𝑖,𝑦𝑖,𝑍𝑖𝑃,𝛼𝑑𝑦𝑖𝑋𝑖,𝑍𝑖=,𝛽𝑖𝒪𝑦𝑖𝑃𝑇0𝑖<𝑐𝑖𝑋𝑖,𝑦𝑖,𝑍𝑖𝑏,𝛼+𝜋𝑖𝑃𝑇𝑖𝑐𝑖𝑋𝑖,𝑦𝑖,𝑍𝑖𝑃,𝛼𝑑𝑦𝑖𝑋𝑖,𝑍𝑖=,𝛽𝑖𝒪𝑦𝑖𝑏11𝜋𝑖𝑃𝑇𝑖𝑐𝑖𝑋𝑖,𝑦𝑖,𝑍𝑖𝑃,𝛼𝑑𝑦𝑖𝑋𝑖,𝑍𝑖.,𝛽(4.7) Hence, (4.6) could be represented in terms of the survival function. Suppose now that in addition to the type I censoring due to the end of the followup period, there may be random censoring during the followup; for instance, this may be the case if the outcome in the case-cohort design (𝐸𝑖=1) is not all-cause mortality, but, say, mortality due to cardiovascular diseases. Deaths due to other causes, denoted as 𝐸𝑖=2, then appear as censoring. As before, let 𝑇𝑖 denote the latent time of death, with 𝐸𝑖 indicating the type of death or end of followup period. Similarly as above, we get 𝑃𝑅𝒞𝑇𝒞𝒪,𝐸𝒞𝒪,𝑌𝒞𝒪,𝑋𝒞,𝑍𝒪=,𝜃𝑖𝒪𝑦𝑖𝑃0𝑇𝑖<𝑐𝑖,𝐸𝑖=1𝑋𝑖,𝑦𝑖,𝑍𝑖𝑏,𝛼+𝜋𝑖𝑃0𝑇𝑖<𝑐𝑖,𝐸𝑖=2𝑋𝑖,𝑦𝑖,𝑍𝑖𝑏,𝛼+𝜋𝑖𝑃𝑇𝑖𝑑𝑐𝑖,𝐸𝑖=0𝑋𝑖,𝑦𝑖,𝑍𝑖𝑃,𝛼𝑑𝑦𝑖𝑋𝑖,𝑍𝑖=,𝛽𝑖𝒪𝑦𝑖𝑃𝑇0𝑖<𝑐𝑖,𝐸𝑖=1𝑋𝑖,𝑦𝑖,𝑍𝑖𝑏,𝛼+𝜋𝑖𝑃𝑇0𝑖<𝑐𝑖,𝐸𝑖=2𝑋𝑖,𝑦𝑖,𝑍𝑖𝑏,𝛼+𝜋𝑖𝑃𝑇𝑖𝑐𝑖𝑋𝑖,𝑦𝑖,𝑍𝑖𝑃,𝛼𝑑𝑦𝑖𝑋𝑖,𝑍𝑖.,𝛽(4.8) Here, the probabilities 𝑇𝑃(0𝑖<𝑐𝑖,𝐸𝑖=𝑘𝑋𝑖,𝑌𝑖,𝑍𝑖,𝛼), 𝑘{1,2}, are given by the cumulative incidence functions and 𝑇𝑃(𝑖𝑐𝑖𝑋𝑖,𝑌𝑖,𝑍𝑖,𝛼) by the joint survival function of a competing risks survival model for the two types of death (e.g., [36, pages 251-252]). These are in principle identifiable from the observed data for the set 𝒪, since the subcohort has been selected independently of the case status and thus will include a number of other deaths. On the other hand, if this number is very small, the middle term in the above sum contributes little to the correction term, and thus unstable estimation of the corresponding parameters does not necessarily hinder the estimation of the parameters of interest.

4.3.2. Risk Set Sampling

Consider now a nested case-control sampling mechanism in which all cases are selected to the case-control set with probability one, and 𝑚 controls per case 𝑗 are selected using without replacement sampling from the risk set 𝑗{𝑖𝐴𝑖(𝑇𝑗)=1}{𝑗} (of size 𝑟𝑗|𝑗|=𝑖𝒞𝐴𝑖(𝑇𝑗)1), where 𝐴𝑖(𝑇𝑗) is the at-risk indicator for subject 𝑖 at event time 𝑇𝑗. This is carried out independently for all cases 𝑗. Let the sampled set of time-matched controls for case 𝑗 be denoted as 𝒮𝑗𝑗, some of which may also be future cases since the sampling is carried out without regard to the future case status of the individuals in the risk set. Let 𝒮{𝒮𝑗𝑗} denote the collection of the sampled risk sets and 𝒮𝑗𝒮𝑗 the pooled set of sampled controls. Noting that knowing all of (𝑇𝒞,𝐸𝒞) specifies the risk sets {𝑗𝑗} as well as the order of the events, the risk set sampling mechanism is specified by the joint probability distribution 𝑃𝑅𝒞𝑇𝒞,𝐸𝒞,𝑌𝒞,𝑋𝒞𝑅=𝑃𝒞=,𝒮||𝒮𝑗||=𝑚,𝑗𝑃𝑅𝒞,𝒮𝑃𝒮,=,𝒮||𝒮𝑗||=𝑚,𝑗𝟏{𝒮=𝒪}𝑗𝑃𝒮𝑗𝑗=𝑗1𝑟𝑗𝑚𝒮||𝒮𝑗||=𝑚,𝑗𝟏{𝒮=𝒪}.(4.9) If the sampling has been carried out within strata specified by the covariates 𝑌𝒞 and/or 𝑋𝒞 through some function 𝑔(𝑋𝑖,𝑌𝑖), the risk sets would be defined within strata as 𝑗={𝑖𝐴𝑖(𝑇𝑗)=1,𝑔(𝑋𝑖,𝑌𝑖)=𝑔(𝑋𝑗,𝑌𝑗)}{𝑗}, and the above reasoning would apply similarly with the redefined risk sets. The conditional likelihood correction term now becomes 𝑃𝑅𝒞𝑇𝒞𝒪,𝐸𝒞𝒪,𝑌𝒞𝒪,𝑋𝒞,𝑍𝒪=,𝜃𝑦𝑖𝑖𝒪𝑡𝑖𝑖𝒪𝑒𝑖𝑖𝒪𝑗1𝑟𝑗𝑚𝒮||𝒮𝑗||=𝑚,𝑗𝟏{𝒮=𝒪}×𝑖𝒪𝑃𝑇𝑖𝑑𝑡𝑖,𝐸𝑖=𝑒𝑖𝑋𝑖,𝑦𝑖,𝑍𝑖𝑃𝑌,𝛼𝑖𝑑𝑦𝑖𝑋𝑖,𝑍𝑖,,𝛽(4.10) which does not reduce into a product form. Exact numerical evaluation of this term would require enumeration of all possible combinations of sampled risk sets of size 𝑚 which would result in the observed case-control set 𝒪. This is unlikely to be feasible in practice with realistic sample sizes, which is why consideration of approximations may be necessary. Samuelsen [23] showed the inclusion indicators under risk set sampling to be asymptotically pairwise uncorrelated, with the first-order inclusion probabilities for the noncases 𝑖𝒞 given by 𝑃(𝑅𝑖=1𝑇𝒞,𝐸𝒞{𝑖},𝐸𝑖=0)=1𝑗[1𝑚𝐴𝑖(𝑇𝑗)/𝑟𝑗]. Thus, it might be tempting to replace (4.9) with 𝑖𝒞𝑃(𝑅𝑖𝑇𝒞,𝐸𝒞,𝑌𝒞,𝑋𝒞). However, the properties of such an approximation will require further research.

5. Extensions

5.1. Missing Second-Phase Covariate Data

Further issue to be considered in practical applications is possible missing covariate data within the set 𝒪. For instance, if the covariates to be collected under the cohort sampling design are genotypes, after selection of subjects to be genotyped, it may turn out that the DNA amount or concentration is too low or the genotyping is otherwise unsuccessful. Let 𝑀𝒪{𝑀𝑖𝑖𝒪} be a set of indicator variables indicating whether the measurement of 𝑍𝑖 turned out to be unsuccessful after selection of subject 𝑖 into the set 𝒪, and let {𝑖𝑀𝑖=1}𝒪. Now 𝑍𝑖 is observed only on the set 𝒪. We consider the implications of this separately for all of the three approaches discussed above.

5.1.1. Full Likelihood

If the missingness can be assumed missing at random, that is, 𝑀𝒪(𝑍,𝜃)𝑅𝒞,𝑇𝒞,𝐸𝒞,𝑋𝒞,𝑌𝒞,𝑍𝒪, the missingness indicators can again be included in the proportionality constant, and the full likelihood expression becomes 𝑃𝑅𝒞,𝑀𝒪,𝑇𝒞,𝐸𝒞,𝑌𝒞,𝑍𝒪𝑋𝒞,𝜃(𝛼,𝛽,𝛾)𝑖𝒪𝑃𝑇𝑖,𝐸𝑖𝑋𝑖,𝑌𝑖,𝑍𝑖𝑃𝑌,𝛼𝑖𝑋𝑖,𝑍𝑖𝑃𝑍,𝛽𝑖𝑋𝑖×,𝛾𝑖𝒞𝒪𝑧𝑖𝑃𝑇𝑖,𝐸𝑖𝑋𝑖,𝑌𝑖,𝑧𝑖𝑃𝑌,𝛼𝑖𝑋𝑖,𝑧𝑖𝑃,𝛽𝑑𝑧𝑖𝑋𝑖.,𝛾(5.1)

5.1.2. Inverse Probability Weighting

If the missingness mechanism would be known (and missing at random), a weighted pseudolikelihood expression for the set 𝒪 could be obtained by multiplying the original selection probabilities by the probabilities of observing the value of 𝑍𝑖 after selection 𝑖𝒪𝑌log𝑃𝑖𝑋𝑖,𝑍𝑖,𝛽𝑃𝑀𝑖=0𝑅𝒞,𝑇𝒞,𝐸𝒞,𝑋𝒞,𝑌𝒞,𝑍𝒪𝑃𝑅𝑖=1𝑇𝒞,𝐸𝒞,𝑋𝒞,𝑌𝒞.(5.2) In practice, however, the missingness mechanism would have to be modeled to obtain estimates for these probabilities.

5.1.3. Conditional Likelihood

By partitioning the observed data into 𝑊=(𝑇𝒪,𝐸𝒪,𝑌𝒪) and 𝑉=(𝑅𝒞,𝑀𝒪,𝑇𝒞𝒪,𝐸𝒞𝒪,𝑌𝒞𝒪,𝑋𝒞,𝑍𝒪), we obtain a conditional likelihood 𝑃𝑄𝒪𝑅𝒞,𝑀𝒪,𝑄𝒞𝒪,𝑋𝒞,𝑍𝒪,𝜃𝜃𝑃𝑀𝒪𝑅𝒞,𝑄𝒞,𝑋𝒞,𝑍𝒪,𝜃𝑃𝑀𝒪𝑅𝒞,𝑄𝒞𝒪,𝑋𝒞,𝑍𝒪,𝜃𝑖𝒪𝑃𝑄𝑖𝑋𝑖,𝑍𝑖,𝜃𝑃𝑅𝒞𝑄𝒞𝒪,𝑋𝒞,𝑍𝒪,,𝜃(5.3) where 𝑃(𝑅𝒞𝑄𝒞𝒪,𝑋𝒞,𝑍𝒪,𝜃) is obtained from (4.5) by replacing the set 𝒪 with 𝒪. Unlike the second-phase sampling mechanism, the missingness mechanism is generally unknown. From (5.3), it can be seen that if we are willing to assume that 𝑍 are either missing completely at random or that 𝑀𝒪 depends only on 𝑋𝒞, the terms involving the missingness indicators cancel out of the likelihood. On the other hand, if the missingness may depend on the response variables 𝑄𝒞, the missingness mechanism would have to be modeled. However, in such a case, it may be easier to work with the partitioning 𝑊=(𝑀𝒪,𝑇𝒪,𝐸𝒪,𝑌𝒪,𝑍𝒪) and 𝑉=(𝑅𝒞,𝑇𝒞𝒪,𝐸𝒞𝒪,𝑌𝒞𝒪,𝑋𝒞) and model the population distribution of 𝑍𝑖 rather than trying to estimate parameters describing the missingness mechanism. This in fact corresponds to what was done by Saarela and Kulathinal [19] and was required there because haplotypes are only partially identifiable from unphased genotype data.

5.2. Incident Outcomes and Left Truncation

Previous discussion was specific to a primary mortality outcome using time on study as the main time scale. In this section, we discuss separately how the different methods can accommodate cohort sampling for incident nonfatal primary outcomes. In the analysis of secondary, non-time-to-event outcomes, the presence of left truncation due to exclusion of cases of prevalent disease presents an additional complication. If the parameters of the secondary outcome model correspond to the background population alive at the cohort baseline (rather than to the disease-free population), this additional selection factor requires further adjustment. If the primary outcome is a mortality endpoint, this is not an issue, since then there is no further selection due to prevalent conditions. In likelihood-based adjustment for left truncation, the main time scale of the analysis has to be chosen as age instead of time on study. In survival modeling, it is well known that conditioning on event-free survival until the age at study baseline corresponds to exclusion of the followup time before that (e.g., [37, page 580], and the references therein). However, this is no longer true in the case of missing covariate data (e.g., [8, pages 5997-5998]), or indeed in the analysis of secondary outcomes, as will be demonstrated below. The set notations 𝒞 and 𝒪 are here taken to refer to the disease free cohort and second-phase study group.

It should also be noted that under case-cohort designs it is common to collect second-phase covariate data for more than a single outcome, since the case-cohort design naturally enables the analysis of multiple outcomes using a single subcohort selection. This is also the case in our example cohort discussed in Sections 1 and 6.3, where, in addition to cases of all-cause mortality, genotype data has been collected also on cases of nonfatal incident cardiovascular disease events. To keep the example simple, in the data analysis, we consider only the case-cohort set for all-cause mortality. However, it is in principle straightforward to accommodate multiple outcome types in the likelihood expressions discussed below by using competing risks type notation where 𝑇𝑖 denotes the observed time of the first incident disease event, death, or censoring, with 𝐸𝑖=0 indicating censoring and 𝐸𝑖{1,2,,𝐾} the different types of outcome events for which the second-phase covariate data is collected. Since utilizing the likelihood expressions does not necessitate estimation of 𝐾 different hazard functions, the endpoint definitions may be pooled as seen suitable.

5.2.1. Full Likelihood

The full likelihood expression (2.1) is now conditioned on the selection rule 𝑇𝒞𝑏𝒞𝑇𝑖𝑏𝑖 for all 𝑖𝒞, that is, event-free survival until the age at the cohort baseline. The likelihood expression becomes 𝑃𝑅𝒞,𝑇𝒞,𝐸𝒞,𝑌𝒞,𝑍𝒪𝑇𝒞𝑏𝒞,𝑋𝒞,𝜃(𝛼,𝛽,𝛾)𝑖𝒪𝑃𝑇𝑖,𝐸𝑖𝑋𝑖,𝑌𝑖,𝑍𝑖𝑃𝑌,𝛼𝑖𝑋𝑖,𝑍𝑖𝑃𝑍,𝛽𝑖𝑋𝑖,𝛾𝑦𝑖𝑧𝑖𝑃𝑇𝑖𝑏𝑖𝑋𝑖,𝑦𝑖,𝑧𝑖𝑃,𝛼d𝑦𝑖𝑋𝑖,𝑧𝑖𝑃,𝛽𝑑𝑧𝑖𝑋𝑖×,𝛾𝑖𝒞𝒪𝑧𝑖𝑃𝑇𝑖,𝐸𝑖𝑋𝑖,𝑌𝑖,𝑧𝑖𝑃𝑌,𝛼𝑖𝑋𝑖,𝑧𝑖𝑃,𝛽𝑑𝑧𝑖𝑋𝑖,𝛾𝑦𝑖𝑧𝑖𝑃𝑇𝑖𝑏𝑖𝑋𝑖,𝑦𝑖,𝑧𝑖𝑃,𝛼𝑑𝑦𝑖𝑋𝑖,𝑧𝑖𝑃,𝛽𝑑𝑧𝑖𝑋𝑖.,𝛾(5.4)

5.2.2. Inverse Probability Weighting

The weighted pseudolikelihood approach does not readily take into account additional selection which occurs in studies of incident outcomes. This was also pointed out by Reilly et al. [15], who in their discussion note that their reweighting approach for case-control studies is not valid for incident or nested case-control studies, since the incident cases and healthy controls available cannot be reweighted to represent the general population. Although this is true for the basic estimating function (3.1), similarly as in the missing data case, a double weighting mechanism can be devised to account for the additional selection step as 𝑖𝒪𝑌log𝑃𝑖𝑋𝑖,𝑍𝑖,𝛽𝑃𝑇𝑖𝑏𝑖𝑋𝑖,𝑌𝑖,𝑍𝑖𝑃𝑅,𝛼𝑖=1𝑇𝒞,𝐸𝒞,𝑋𝒞,𝑌𝒞.(5.5) Since this weighting requires estimation of the time-to-event model parameters 𝛼, resampling-based variance estimation would be required to obtain valid standard errors for the estimates of 𝛽 to account for the uncertainty in estimation of the weights.

5.2.3. Conditional Likelihood

As in Section 4.2, we get 𝑃𝑄𝒪𝑅𝒞,𝑇𝒞𝑏𝒞,𝑄𝒞𝒪,𝑋𝒞,𝑍𝒪=,𝜃𝑖𝒪𝑃𝑄𝑖𝑋𝑖,𝑍𝑖,𝜃𝑃𝑅𝒞,𝑇𝒞𝑏𝒞𝑄𝒞𝒪,𝑋𝒞,𝑍𝒪,𝜃,(5.6) where the correction term becomes 𝑃𝑅𝒞,𝑇𝒞𝑏𝒞𝑄𝒞𝒪,𝑋𝒞,𝑍𝒪=,𝜃𝑦𝑖𝑖𝒪𝑡𝑖[𝑏𝑖,)𝑖𝒪𝑒𝑖𝑖𝒪𝑃𝑅𝒞𝑇𝒞,𝐸𝒞,𝑌𝒞,𝑋𝒞×𝑖𝒪𝑇𝑖𝑑𝑡𝑖,𝐸𝑖=𝑒𝑖𝑋𝑖,𝑦𝑖,𝑍𝑖𝑃𝑌,𝛼𝑖𝑑𝑦𝑖𝑋𝑖,𝑍𝑖.,𝛽(5.7) In the case-cohort design discussed in Section 4.3.1, with type I censoring, this further simplifies into 𝑃𝑅𝒞,𝑇𝒞𝑏𝒞𝑇𝒞𝒪,𝐸𝒞𝒪,𝑌𝒞𝒪,𝑋𝒞,𝑍𝒪,𝜃(𝛼,𝛽)𝑖𝒪𝑦𝑖𝑡𝑖[𝑏𝑖,𝑐𝑖]𝑒𝑖𝑃𝑅𝑖=1𝑡𝑖,𝑒𝑖,𝑦𝑖,𝑋𝑖𝑇×𝑃𝑖𝑑𝑡𝑖,𝐸𝑖=𝑒𝑖𝑋𝑖,𝑦𝑖,𝑍𝑖𝑃𝑌,𝛼𝑖𝑑𝑦𝑖𝑋𝑖,𝑍𝑖=,𝛽𝑖𝒪𝑦𝑖𝑃𝑏𝑖𝑇𝑖<𝑐𝑖𝑋𝑖,𝑦𝑖,𝑍𝑖𝑏,𝛼+𝜋𝑖𝑃𝑇𝑖𝑐𝑖𝑋𝑖,𝑦𝑖,𝑍𝑖𝑌,𝛼×𝑃𝑖𝑑𝑦𝑖𝑋𝑖,𝑍𝑖=,𝛽𝑖𝒪𝑦𝑖𝑃𝑇𝑖𝑏𝑖𝑋𝑖,𝑦𝑖,𝑍𝑖𝑏,𝛼1𝜋𝑖𝑃𝑇𝑖𝑐𝑖𝑋𝑖,𝑦𝑖,𝑍𝑖𝑌,𝛼×𝑃𝑖𝑑𝑦𝑖𝑋𝑖,𝑍𝑖.,𝛽(5.8)

6. Illustrations

6.1. Simulation Study

In order to compare the efficiency of the alternative estimation methods discussed above, namely, full likelihood, conditional likelihood, and weighted pseudolikelihood, we supplemented the cohort data described in Section 1 with a simulated covariate, following the real-data-based simulation approach described by Saarela et al. in [8, pages 5998–6000], in order to have the simulation setting resemble as closely as possible the real data analysis setting of Section 6.3. This Bayesian procedure corresponds to multiple imputation [21], where given the observed data (𝑇𝒞,𝐸𝒞,𝑋𝒞,𝑌𝒞) and predetermined associations with the primary and secondary outcomes, an additional “missing” covariate 𝑍𝒞 is simulated from the posterior predictive distribution 𝑃(𝑍𝒞𝑇𝒞,𝐸𝒞,𝑋𝒞,𝑌𝒞). Case-cohort sampling was then simulated in the complete datasets (𝑇𝒞,𝐸𝒞,𝑋𝒞,𝑌𝒞,𝑍𝒞) thus obtained. The objective of this approach, as compared to using completely simulated data, was to obtain simulation results directly relevant to the real study of interest. We included a set of 𝑁=5039 individuals with a BMI measurement (𝑌𝑖) available and eligible for the case-cohort study. The primary (time-to-event) outcome (𝑇𝑖,𝐸𝑖) was taken to be all-cause mortality, with 𝑑=996. Given the primary and secondary outcome data and age at baseline (𝑋𝑖𝑏𝑖) observed for the cohort, and the current parameter values, additional binary covariate values were drawn from the conditional distributions 𝑃𝑍𝑖=𝑧𝑖𝑇𝑖,𝐸𝑖,𝑌𝑖,𝑋𝑖=𝑃𝑇,𝛼,𝛽,𝛾𝑖,𝐸𝑖𝑋𝑖,𝑌𝑖,𝑧𝑖𝑃𝑌,𝛼𝑖𝑋𝑖,𝑧𝑖𝑃𝑍,𝛽𝑖=𝑧𝑖𝑋𝑖,𝛾𝑧𝑖{0,1}𝑃𝑇𝑖,𝐸𝑖𝑋𝑖,𝑌𝑖,𝑧𝑖𝑃𝑌,𝛼𝑖𝑋𝑖,𝑧𝑖𝑃𝑍,𝛽𝑖=𝑧𝑖𝑋𝑖,,𝛾(6.1) where the model for the survival outcome, 𝑃(𝑇𝑖𝑑𝑡𝑖,𝐸𝑖=𝑒𝑖𝑋𝑖,𝑌𝑖,𝑍𝑖,𝛼)𝛼[𝜆𝑖(𝑡𝑖)]𝟏𝑖{𝑒=1}𝑆𝑖(𝑡𝑖), was specified as a proportional hazards model 𝜆𝑖(𝑢)𝜆0(𝑢)exp{𝛼1𝑏𝑖+𝛼2𝑌𝑖+𝛼3𝑍𝑖}, with time on study as the time scale and age at baseline as a covariate. Here, the the baseline hazard function 𝜆0 was specified in terms of a piecewise constant function using 15 time bins of equal length over the followup period of seven years. Flat improper priors were used for all parameters (those not fixed to selected values), on log scale for the nonnegative parameters.

Normal model 𝑌𝑖𝑋𝑖,𝑍𝑖,𝛽𝑁(𝛽0+𝛽1𝑏𝑖+𝛽2𝑍𝑖,𝛽23) was used for the secondary outcome, while the population distribution of the additional covariate was specified as 𝑃(𝑍𝑖=𝑧𝑖𝑋𝑖,𝛾)=𝛾𝑧𝑖(1𝛾)1𝑧𝑖. Here, parameters 𝛼3, 𝛽2, and 𝛾 were fixed at selected values while the other parameters were allowed to be determined by the data and integrated out by drawing from their respective full conditional posterior distributions using Markov chain Monte Carlo sampling. Variables 𝑌𝑖 and 𝑏𝑖 were centered but otherwise untransformed. 1000 values for each 𝑍𝑖, 𝑖𝒞, were obtained by running the MCMC sampler for 25000 rounds after a 10000-round burn-in with each set of fixed values of (𝛼3,𝛽2,𝛾) given in Table 1 and saving every 25th state of the chain. In each of the complete datasets obtained by combining the observed data and the simulated covariate values 𝑍𝑖, case-cohort selection was carried out by matching the age distribution of the subcohort to that of the cases by first fitting a logistic regression model with age as a covariate to the observed survival status, as detailed in Section 4.3.1. The predictive probabilities from the logistic model were scaled to give an expected subcohort size of 1000, and the subcohort was selected using the obtained probabilities in Bernoulli sampling. This procedure gave an expected case-cohort set size of 𝐸(𝑛𝐸𝒞,𝑋𝒞)=1761.

Table 1: Maximum likelihood estimates of the parameters 𝛼 of the all-cause mortality model 𝜆𝑖(𝑡)=(𝛼4/𝛼0)(𝑡/𝛼0)𝛼41exp{𝛼1𝑏𝑖+𝛼2𝑌𝑖+𝛼3𝑍𝑖}, 𝛽 of the BMI model 𝑌𝑖𝑋𝑖,𝑍𝑖,𝛽𝑁(𝛽0+𝛽1𝑏𝑖+𝛽2𝑍𝑖,𝛽23), and 𝛾, the population frequency of the simulated covariate (true value 0.25). Parameters of main interest are 𝛼2 and 𝛽2, which characterize the association between the simulated covariate and primary (all-cause mortality) and secondary (BMI) outcomes, respectively. The values are means and standard deviations over 1000 replications. SE stands for the standard errors of the maximum likelihood estimates.

In fitting models to the datasets so obtained, we used the same model specifications as above, with the exception of the proportional hazards model, where we fitted a “misspecified” Weibull model 𝜆𝑖(𝑢)(𝛼4/𝛼0)(𝑢/𝛼0)𝛼41exp{𝛼1𝑏𝑖+𝛼2𝑌𝑖+𝛼3𝑍𝑖}. Even though this is a different model than the one specified for the simulation step, we did not simulate the time-to-event outcome data, which was taken from the example cohort; a comparison between fitted piecewise constant and Weibull hazards (results not shown) indicated that the Weibull model is adequate for modeling these data, accounting for the increasing baseline mortality rate due to ageing of the cohort. Full likelihood function (2.1), conditional likelihood function (4.3), and weighted pseudolikelihood function (3.1) were maximized with respect to the parameters (𝛼,𝛽,𝛾), (𝛼,𝛽), and 𝛽, respectively, substituting in the above parametric models. The numerical optimization was carried out using the optim function of the 𝑅 statistical software, applying the BFGS optimization algorithm [38]. The integrals over 𝑦𝑖 in (4.7) were evaluated using numerical (quadrature) integration, using the gsl_integration_qagi function of GNU scientific library [39]. The standard errors of the maximum likelihood estimators were evaluated by inverting the numerically differentiated Hessian matrix at the maximum likelihood point. The standard errors of the inverse-probability-weighted estimates were evaluated using the robust variance formula of Appendix A with the observed information (A.4) and observed score (A.5). The results from the 1000 replications are presented in Table 1. Here, the parameters (𝛼3,𝛽2,𝛾), corresponding to the simulated covariate, are of main interest; the other parameters reflect the observed data, and thus the Monte Carlo variances for these are not relevant. The standard errors for the corresponding parameters are given in the real-data example of Section 6.3. The results indicate that conditional likelihood estimation gives similar efficiency compared to the corresponding full likelihood estimates irrespective of whether the covariate of interest was associated with the survival outcome. In contrast, both likelihood-based approaches gave better efficiency compared to inverse probability weighting. This is expected, as the weighted pseudolikelihood gives smaller weights to cases, whereas the conditional likelihood weights cases and noncases differentially only to the extent they actually differ. Both the inverse observed information and robust variance estimates agreed well with the Monte Carlo variances.

6.2. Multimodality under Full Likelihood When the Sampling Fraction Is Small

Let now the observed data be only (𝑅𝒞,𝑌𝒞,𝑍𝒪), and let the sampling mechanism be simple random sampling without replacement; 𝑃(𝑅𝒞)=1/𝑁𝑛, where the sample size 𝑛 is fixed. The specific aim of the following example is to demonstrate the vulnerability of observed data likelihoods, integrated over the missing covariate data on the set 𝒞𝒪, to misspecification of the response model. The full likelihood expression now becomes 𝑃(𝑅𝒞,𝑌𝒞,𝑍𝒪𝜃)(𝛽,𝛾)𝑖𝒪[𝑃(𝑌𝑖𝑍𝑖,𝛽)𝑃(𝑍𝑖𝛾)]𝑖𝒞𝒪𝑧𝑖𝑃(𝑌𝑖𝑧𝑖,𝛽)𝑃(𝑍𝑖𝑑𝑧𝑖𝛾). Due to the simple random sampling, the corresponding conditional likelihoods simplify into 𝑃(𝑌𝒪𝑅𝒞,𝑌𝒞𝒪,𝑍𝒪,𝜃)=𝑖𝒪𝑃(𝑌𝑖𝑍𝑖,𝛽) and 𝑃(𝑌𝒪,𝑍𝒪𝑅𝒞,𝑌𝒞𝒪,𝜃)=𝑖𝒪[𝑃(𝑌𝑖𝑍𝑖,𝛽)𝑃(𝑍𝑖𝛾)].

With 𝑁=1000, we simulated covariate values from 𝑍𝑖Bernoulli(0.2) and two different sets of response values (both independently of 𝑍𝑖) from 𝑌𝑖𝑁(4,4) and 𝑌𝑖Gamma(4,1) (same mean and variance, but the latter distribution is skewed to the right). We fit two alternative models to these data, in both models 𝑍𝑖𝛾Bernoulli(𝛾), with the models for the response specified as 𝑌𝑖𝑍𝑖,𝛽𝑁(𝛽0+𝛽1𝑍𝑖,𝛽22) and 𝑌𝑖𝑍𝑖,𝛽NPQM(𝛽0+𝛽1𝑍𝑖,𝛽2,𝛽3,𝛽4), where 𝛽 are the collections of all model parameters. The latter model is the normal-polynomial quantile mixture distribution proposed by Karvanen ([20, pages 950–953]; see also 𝑅 package Lmoments), which we apply here for regression modeling. The parametrization here can be expressed in terms of the first four L-moments as 𝜆1=𝐸(𝑌𝑖𝑍𝑖,𝛽)=𝛽0+𝛽1𝑍𝑖, 𝜆2=𝛽2/𝜋 (L-scale), 𝜆3=𝛽3 (L-skewness), and 𝜆4=𝛽4 (L-kurtosis). This distribution is suitable for regression modeling as the first L-moment is the mean of the distribution, while the two additional shape parameters allow for more flexible modeling of the residual distribution. The NPQM distribution includes normal distribution as a special case when 𝜆3=0 and 𝜆4/𝜆2=0.1226. The full and conditional log-likelihood expressions under these model specifications were maximized with respect to 𝛽 and 𝛾 as described in the previous section. Initial values for the algorithm were set at the sample moments calculated from the marginal distribution of the response variable, with the regression coefficient set to 𝛽1=0 and 𝛾=0.5.

Table 2 shows the mean maximum likelihood estimates over 1000 replications for all 8 combinations of data generating model, fitted model, and estimation method. With the response simulated from normal distribution, both models estimated using either full or conditional likelihood give the expected results (Figures 1 and 2). However, with the response simulated from gamma distribution and the sampling fraction 𝑛/𝑁 small, the misspecified normal model estimated using full likelihood indicates a spurious association between the response and the covariate. This is because the missing data act as extra parameters, allowing the optimization procedure to obtain a better fit to the skewed data. The corresponding sampling distributions for 𝛽1 shown in Figure 3 have become bimodal. It should be noted that the initial values given to the optimization algorithm corresponded always to the “correct solution” of no covariate effect, thus enabling the algorithm to find the correct mode. However, bimodality in the sampling distribution does not necessarily indicate that the likelihood functions given a single realized dataset would be bimodal.

Table 2: Maximum likelihood estimates of parameters 𝛽 of the outcome models 𝑌𝑖𝑍𝑖,𝛽𝑁(𝛽0+𝛽1𝑍𝑖,𝛽22) and 𝑌𝑖𝑍𝑖,𝛽NPQM(𝛽0+𝛽1𝑍𝑖,𝛽2,𝛽3,𝛽4) and the population frequency 𝛾 of the binary covariate 𝑍𝑖 (means over 1000 replications). Outcome variable was simulated from 𝑌𝑖𝑁(4,4) and 𝑌𝑖Gamma(4,1).
Figure 1: Sampling distributions for maximum likelihood estimates of regression coefficient 𝛽1 from 1000 replications; normal model fitted to normal data (𝑌𝑖𝑁(4,4)).
Figure 2: Sampling distributions for maximum likelihood estimates of regression coefficient 𝛽1 from 1000 replications; NPQM model fitted to normal data (𝑌𝑖𝑁(4,4)).
Figure 3: Sampling distributions for maximum likelihood estimates of regression coefficient 𝛽1 from 1000 replications; normal model fitted to gamma data (𝑌𝑖Gamma(4,1)).

Using the NPQM model which allows a skewed residual distribution does not correct the situation either when combined with full likelihood estimation (Figure 4). This is not unexpected since adding more parameters to an already overparameterized situation is not necessarily helpful in solving identification problems. In contrast, conditional likelihood continues to give reasonable results in all cases, with the skewness of the data correctly reflected by the parameter 𝛽3 of the NPQM model. Some degree of misspecification of the response model is required for the multimodality to appear since when we simulated the response variable from NPQM distribution, the full likelihood fit of the NPQM model indicated no problems, while the full likelihood fit of the normal model showed the same problems as with gamma data (Figures 5 and 6). We also repeated all the simulation alternatives with 𝛽1=1 (𝑌𝑖𝑍𝑖+𝑁(4,4) and 𝑌𝑖𝑍𝑖+Gamma(4,1)), with the conclusions essentially unchanged (Table 3 and Figures 7, 8, 9, and 10).

Table 3: Maximum likelihood estimates of parameters 𝛽 of the outcome models 𝑌𝑖𝑍𝑖,𝛽𝑁(𝛽0+𝛽1𝑍𝑖,𝛽22) and 𝑌𝑖𝑍𝑖,𝛽NPQM(𝛽0+𝛽1𝑍𝑖,𝛽2,𝛽3,𝛽4) and the population frequency 𝛾 of the binary covariate 𝑍𝑖 (means over 1000 replications). Outcome variable was simulated from 𝑌𝑖𝑍𝑖𝑍𝑖+𝑁(4,4) and 𝑌𝑖𝑍𝑖𝑍𝑖+Gamma(4,1).
Figure 4: Sampling distributions for maximum likelihood estimates of regression coefficient 𝛽1 from 1000 replications; NPQM model fitted to gamma data (𝑌𝑖Gamma(4,1)).
Figure 5: Sampling distributions for maximum likelihood estimates of regression coefficient 𝛽1 from 1000 replications; normal model fitted to NPQM data (𝑌𝑖NPQM(4,2,0.15,0.14)).
Figure 6: Sampling distributions for maximum likelihood estimates of regression coefficient 𝛽1 from 1000 replications; NPQM model fitted to NPQM data (𝑌𝑖NPQM(4,2,0.15,0.14)).
Figure 7: Sampling distributions for maximum likelihood estimates of regression coefficient 𝛽1 from 1000 replications; normal model fitted to normal data (𝑌𝑖𝑍𝑖+𝑁(4,4)).
Figure 8: Sampling distributions for maximum likelihood estimates of regression coefficient 𝛽1 from 1000 replications; NPQM model fitted to normal data (𝑌𝑖𝑍𝑖+𝑁(4,4)).
Figure 9: Sampling distributions for maximum likelihood estimates of regression coefficient 𝛽1 from 1000 replications; normal model fitted to gamma data (𝑌𝑖𝑍𝑖+Gamma(4,1)).
Figure 10: Sampling distributions for maximum likelihood estimates of regression coefficient 𝛽1 from 1000 replications; NPQM model fitted to gamma data (𝑌𝑖𝑍𝑖+Gamma(4,1)).
6.3. An Example with Real Data

The case-cohort set for all-cause mortality in the example cohort (𝑁=5039) is of size 𝑛=1816, the union of a subcohort of size 1068 and 996 deaths due to any cause (since the subcohort has been selected independently of the case status, it includes a number of cases). The case-cohort design applied in this cohort was described in Section 4.3.1. Due to sample unavailability or unsuccessful genotyping, the genotype for the lactase persistence SNP rs4988235 was unavailable for 156 individuals, giving |𝒪|=1660. The model for the data described in Section 1 was specified as follows: Hardy-Weinberg model 𝑍𝑖𝛾Binom(2,𝛾) was used for the number of lactase persistence alleles. The model for BMI, which is the response variable of interest, given age at baseline 𝑋𝑖𝑏𝑖 and the genotype 𝑍𝑖, was taken to be 𝑌𝑖𝑏𝑖,𝑍𝑖,𝛽NPQM(𝛽0+𝛽1𝑏𝑖+𝛽2𝟏{𝑍𝑖=0},𝛽3,𝛽4,𝛽5). The model for the survival outcome was taken to be 𝑃(𝑇𝑖𝑑𝑡𝑖,𝐸𝑖=𝑒𝑖𝑌𝑖,𝑍𝑖,𝛼)𝛼[𝜆𝑖(𝑡𝑖)]𝟏𝑖{𝑒=1}𝑆𝑖(𝑡𝑖), where 𝑆𝑖(𝑡𝑖)exp{𝑡𝑖0𝜆𝑖(𝑢)𝑑𝑢}, and using again the Weibull form for the hazard function 𝜆𝑖(𝑢)(𝛼4/𝛼0)(𝑢/𝛼0)𝛼41exp{𝛼1𝑏𝑖+𝛼2𝑌𝑖+𝛼3𝟏{𝑍𝑖=0}}. In both regression models, the lactase persistence allele noncarriers are compared to the hetero- and homozygote carriers of the allele. Under the case-cohort design where the subcohort sampling probabilities depend on 𝑏𝑖, the conditional likelihood expression takes the form 𝑃𝑇𝒪,𝐸𝒪,𝑌𝒪𝑅𝒞,𝑀𝒪,𝑇𝒞𝒪,𝐸𝒞𝒪,𝑌𝒞𝒪,𝑋𝒞,𝑍𝒪,𝜃(𝛼,𝛽)𝑖𝒪𝑃𝑇𝑖,𝐸𝑖𝑌𝑖,𝑍𝑖𝑃𝑌,𝛼𝑖𝑏𝑖,𝑍𝑖,𝛽𝑦𝑖𝑏11𝜋𝑖𝑆𝑖𝑐𝑖𝑃𝑑𝑦𝑖𝑏𝑖,𝑍𝑖,,𝛽(6.2) or 𝑃𝑀𝒪,𝑇𝒪,𝐸𝒪,𝑌𝒪,𝑍𝒪𝑅𝒞,𝑇𝒞𝒪,𝐸𝒞𝒪,𝑌𝒞𝒪,𝑋𝒞,𝜃(𝛼,𝛽,𝛾)𝑖𝒪𝑃𝑇𝑖,𝐸𝑖𝑌𝑖,𝑍𝑖𝑃𝑌,𝛼𝑖𝑏𝑖,𝑍𝑖𝑃𝑍,𝛽𝑖=𝑧𝑖𝛾𝑧𝑖𝑦𝑖𝑏11𝜋𝑖𝑆𝑖𝑐𝑖𝑃𝑑𝑦𝑖𝑏𝑖,𝑧𝑖𝑃𝑍,𝛽𝑖=𝑧𝑖×𝛾𝑖𝑧𝑖𝑃𝑇𝑖,𝐸𝑖𝑌𝑖,𝑧𝑖𝑃𝑌,𝛼𝑖𝑏𝑖,𝑧𝑖𝑃𝑍,𝛽𝑖=𝑧𝑖𝛾𝑧𝑖𝑦𝑖𝑏11𝜋𝑖𝑆𝑖𝑐𝑖𝑃𝑑𝑦𝑖𝑏𝑖,𝑧𝑖𝑃𝑍,𝛽𝑖=𝑧𝑖,𝛾(6.3) depending on whether we condition upon the observed genotype data 𝑍𝒪, or whether this is modeled as part of the likelihood (see Section 5.1.3). The difference between these two approaches is that the former requires more assumptions on the missingness mechanism. Therefore, we compare here the two approaches to see whether there is any observable difference between the results. In addition, expression (6.3) enables estimation of the population allele frequency 𝛾. However, it should be noted that while the summation over the single SNP genotype variable 𝑍𝑖 is not a computational problem in the present example, this is not necessarily the case generally, with multiple continuous covariates in the model. Therefore, the potential advantage of expression such as (6.2) is that it avoids specification of the covariate distribution and the integration over the missing covariate values.

A remaining issue to be considered is the numerical evaluation of the integrals over 𝑦𝑖 in the correction terms of (6.2) and (6.3). It should be noted that these have to be evaluated at each parameter value tried at the hill-climbing numerical optimization. Although numerical (quadrature) integration would be feasible in one dimension, here we propose evaluation of such integrals using Monte Carlo integration by repeated sampling from the distribution 𝑃(𝑌𝑖𝑑𝑦𝑖𝑏𝑖,𝑍𝑖,𝛽). This has an added advantage that while the numerical evaluation of the density of the NPQM model is computationally expensive, random variates can be easily simulated from this distribution since it is defined through the quantile function. This proceeds by drawing random variates from [0,1]-uniform distribution and applying formula (6.2) of Karvanen [20]. With random variates 𝑦𝑖(𝑗)NPQM(𝛽0+𝛽1𝑏𝑖+𝛽2𝟏{𝑍𝑖=0},𝛽3,𝛽4,𝛽5), 𝑗=1,,𝑘, drawn for each 𝑖𝒪 given the current parameter values, the integrals in (6.2) are then approximated by the means (1/𝑘)𝑘𝑗=1[1(1𝜋(𝑏𝑖))𝑃(𝑇𝑖>𝑐𝑖𝑦𝑖(𝑗),𝑍𝑖,𝛼)].

Maximum likelihood estimates obtained by numerical maximization of the expressions (6.2) and (6.3) with respect to 𝛼, 𝛽, and in the latter case also 𝛾, are presented in Table 4 (𝛼 and ̂𝛾) and Table 5 (̂𝛽). For comparison, we also maximised a full likelihood of the form (5.1) with respect to (𝛼,𝛽,𝛾) and a weighted pseudolikelihood of the form (5.2) with respect to 𝛽. In the latter case, we accounted for the missingness within the case-cohort set by fitting a logistic regression model logit{𝑃(𝑀𝑖=0𝐸𝑖,𝑏𝑖,𝑌𝑖,𝜂)}=𝜂0+𝜂1𝐸𝑖+𝜂2𝑏𝑖+𝜂3𝑌𝑖 in the set 𝒪 and calculated the adjusted weights as inverses of 𝑃(𝑀𝑖=0𝐸𝑖,𝑏𝑖,𝑌𝑖,̂𝜂)𝑃(𝑅𝑖=1𝐸𝑖,𝑏𝑖). Due to the extra estimation step, we used bootstrap with 5000 replications to obtain standard errors for the weighted pseudolikelihood estimates. Both conditional likelihood expressions gave very similar results, suggesting that the missing data within the case-cohort set is not a major issue in the present case. The comparison between different number of Monte Carlo replicates in numerical evaluation of the conditional likelihood correction term suggests that the estimates do no longer appreciably change when 𝑘 is increased from 1000. However, the change in the estimates when increasing 𝑘 from 100 to 1000 is already well within the standard errors.

Table 4: Maximum likelihood estimates (standard errors) of the parameters 𝛼 of the all-cause mortality model 𝜆𝑖(𝑡)=(𝛼4/𝛼0)(𝑡/𝛼0)𝛼41exp{𝛼1𝑏𝑖+𝛼2𝑌𝑖+𝛼31{𝑍𝑖=0}}, and 𝛾, the population frequency of the lactase persistence allele.
Table 5: Maximum likelihood estimates (standard errors) of the parameters 𝛽 of the BMI model 𝑌𝑖𝑏𝑖,𝑍𝑖,𝛽NPQM(𝛽0+𝛽1𝑏𝑖+𝛽21{𝑍𝑖=0},𝛽3,𝛽4,𝛽5).

Full likelihood and weighted pseudolikelihood estimates agreed well with the conditional likelihood ones, although the latter had higher standard errors, as was the case also in the simulations. As noted by Kettunen et al. [10], the absence of the lactase persistence allele shows association with lower body mass index. The residuals from the model for BMI are significantly skewed to the right, as indicated by the estimates of 𝛽4.

7. Discussion

Although conditional logistic likelihood is well known in the context of risk set sampling designs (e.g., [5, 18, 40]), its connection to the general concept of conditional likelihood has rarely been emphasized by authors. Furthermore, use of the conditional likelihood approach is not limited to the binary event outcome response situation. In this paper we have attempted to highlight these issues in the context of modeling a continuous secondary response variable under cohort sampling designs. Naturally, the conditional likelihood approach is also valid for the primary time-to-event outcome. Although here we considered only parametric specifications for the time-to-event outcome, semiparametric maximum likelihood techniques could also be utilized here. A potential disadvantage of the likelihood-based methods for secondary analysis is that they require specification of a model for the primary time-to-event outcome even though it is not of main interest in the secondary analysis setting, whereas specification of this model is avoided in the weighted pseudolikelihood approach.

Compared to the full likelihood approach which would be applicable under any cohort sampling design, the advantage of the conditional likelihood is that modeling of the population distribution of the covariates collected in the second phase can be avoided, if this is not of primary interest. In addition, as we demonstrated in a simulation example, full likelihood expressions with most of the covariate data unobserved may no longer be well behaved, a problem which the conditional likelihood approach does not have. The disadvantage of the conditional likelihood approach is that the second-phase sampling mechanism, that is, the joint distribution of the inclusion indicators, needs to be specified. In the case-cohort/Bernoulli sampling situation, this is straightforward, but in mechanisms such as risk set sampling, where the sampling probabilities are specified only implicitly, resolving the joint sampling probability can be computationally nontrivial, and approximations may be needed in practice. This is avoided in the full likelihood approach, since it does not require specification of the sampling mechanism. The inverse-probability-weighted estimating function only requires specification of the first-order selection probabilities, but the possible dependencies induced by the sampling mechanism need to be addressed in the variance estimation step. The full and conditional likelihood approaches gave equivalent efficiencies in our simulated setting, although the result might be different if the first-phase data involves covariates which are highly predictive of the second-phase covariate of interest. In any case, both likelihood-based methods gave a clear improvement in efficiency in the secondary analysis setting compared to the inverse-probability-weighted method.


A. On Inverse-Probability-Weighted Pseudolikelihood Estimators

The use of the expression (3.1) in approximating the corresponding complete data log-likelihood 𝑖𝒞log𝑃(𝑌𝑖𝑋𝑖,𝑍𝑖,𝛽), where we denote log𝑃(𝑌𝑖𝑋𝑖,𝑍𝑖,𝛽)𝑙𝑖(𝛽), is justified by considering the expectation of the pseudoscore function with respect to the joint distribution 𝑃(𝑅𝒞,𝑇𝒞,𝐸𝒞,𝑌𝒞,𝑍𝒞,𝑋𝒞)=𝑃(𝑅𝒞𝑇𝒞,𝐸𝒞,𝑋𝒞,𝑌𝒞)𝑃(𝑇𝒞,𝐸𝒞,𝑌𝒞,𝑍𝒞,𝑋𝒞), which is assumed to be the correct data generating mechanism. This becomes 𝐸𝑅𝒞,𝑇𝒞,𝐸𝒞,𝑋𝒞,𝑌𝒞,𝑍𝒞𝑖𝒞𝟏{𝑅𝑖=1}𝑌𝜕log𝑃𝑖𝑋𝑖,𝑍𝑖,𝛽/𝜕𝛽𝑃𝑅𝑖=1𝑇𝒞,𝐸𝒞,𝑋𝒞,𝑌𝒞=𝑡𝒞𝑥𝒞𝑦𝒞𝑧𝒞𝑒𝒞𝑖𝒞𝑟𝑖𝟏{𝑟𝑖=1}𝑙𝑖(𝛽)𝑟𝒞𝑖𝑃𝑅𝒞=𝑟𝒞𝑡𝒞,𝑒𝒞,𝑥𝒞,𝑦𝒞𝑃𝑅𝑖=1𝑡𝒞,𝑒𝒞,𝑥𝒞,𝑦𝒞𝑇×𝑃𝒞𝑑𝑡𝒞,𝐸𝒞=𝑒𝒞,𝑋𝒞𝑑𝑥𝒞,𝑌𝒞𝑑𝑦𝒞,𝑍𝒞𝑑𝑧𝒞=𝑖𝒞𝑥𝑖𝑦𝑖𝑧𝑖𝑙𝑖(𝛽)𝑡𝑖𝑒𝑖𝑃𝑇𝑖𝑑𝑡𝑖,𝐸𝑖=𝑒𝑖𝑥𝑖,𝑦𝑖,𝑧𝑖𝑋×𝑃𝑖𝑑𝑥𝑖,𝑌𝑖𝑑𝑦𝑖,𝑍𝑖𝑑𝑧𝑖=𝑖𝒞𝐸𝑋𝑖,𝑌𝑖,𝑍𝑖𝑙𝑖(=𝛽)𝑖𝒞𝐸𝑋𝑖,𝑍𝑖𝐸𝑌𝑖𝑋𝑖,𝑍𝑖𝑌𝜕log𝑃𝑖𝑋𝑖,𝑍𝑖,𝛽,𝜕𝛽(A.1) the last form being the expectation of the complete data score function for parameters 𝛽, which becomes zero through the inner expectation if the parametric model 𝑃(𝑌𝑖𝑋𝑖,𝑍𝑖,𝛽) is correct, that is, corresponds to the data generating mechanism. Thus, the expression (3.1) gives a valid estimating equation for parameters 𝛽. However, since the variance of the pseudoscore function does not equal the Fisher information, a robust/sandwich type variance estimator must be used instead of the inverse of the observed information. As an example, below we derive this in the special case of a Bernoulli sampling mechanism 𝑃(𝑅𝒞𝑇𝒞,𝐸𝒞,𝑌𝒞,𝑋𝒞)=𝑖𝒞𝑃(𝑅𝑖𝑇𝑖,𝐸𝑖,𝑌𝑖,𝑋𝑖), where this is straightforward. For consideration of asymptotic variances of inverse-probability-weighted estimators under stratified without replacement sampling or risk set sampling, we refer to Breslow and Wellner [41] and Cai and Zheng [42].

Denoting 𝑞(𝛽)𝑖𝒪log𝑃(𝑌𝑖𝑋𝑖,𝑍𝑖,𝛽)/𝑃(𝑅𝑖𝑇𝑖,𝐸𝑖,𝑌𝑖,𝑋𝑖) and ̂𝛽argmax𝛽𝑞(𝛽), the latter is given as a solution to the estimating equation 𝑞(𝛽)=0, or approximately, by the first-order Taylor expansion at the true value 𝛽0, as a solution to 𝑞(𝛽0)+𝑞(𝛽0)(𝛽𝛽0)=0, which in turn, by substituting expected to observed pseudoinformation, gives the approximate relationship ̂𝛽𝛽0𝐸[𝑞(𝛽0)]1𝑞(𝛽0). Taking covariance of both sides then gives ̂cov𝛽𝐸[𝑞(𝛽0)]1cov[𝑞(𝛽0)]𝐸[𝑞(𝛽0)]1, where the pseudo-Fisher information is given by 𝐸𝑞𝛽0=𝐸𝑖𝒞𝟏{𝑅𝑖=1}𝑙𝑖𝛽0𝑃𝑅𝑖=1𝑇𝑖,𝐸𝑖,𝑋𝑖,𝑌𝑖=𝑖𝒞𝐸𝑋𝑖,𝑍𝑖𝐸𝑌𝑖𝑋𝑖,𝑍𝑖𝑙𝑖𝛽0.(A.2) Since the pseudoscore function has zero mean and now 𝑅𝑖𝑅𝑗, 𝑖𝑗, its covariance is given by𝑞cov𝛽0=𝐸𝑖𝒞𝑗𝒞𝟏{𝑅𝑖=1}𝟏{𝑅𝑗=1}𝑙𝑖𝛽0𝑙𝑗(𝛽0)𝑇𝑃𝑅𝑖=1𝑇𝑖,𝐸𝑖,𝑋𝑖,𝑌𝑖𝑃𝑅𝑗=1𝑇𝑗,𝐸𝑗,𝑋𝑗,𝑌𝑗=𝑖𝒞𝐸𝟏{𝑅𝑖=1}𝑙𝑖𝛽0𝑙𝑖(𝛽0)𝑇𝑃(𝑅𝑖=1𝑇𝑖,𝐸𝑖,𝑋𝑖,𝑌𝑖)2=𝑖𝒞𝐸𝑇𝑖,𝐸𝑖,𝑋𝑖,𝑌𝑖,𝑍𝑖𝑙𝑖𝛽0𝑙𝑖(𝛽0)𝑇𝑃𝑅𝑖=1𝑇𝑖,𝐸𝑖,𝑋𝑖,𝑌𝑖.(A.3) In practice, these quantities are evaluated at the maximum likelihood point, replacing the expected quantities by their observed counterparts as 𝐸𝑞𝛽0𝑖𝒪𝑙𝑖̂𝛽𝑃𝑅𝑖=1𝑇𝑖,𝐸𝑖,𝑋𝑖,𝑌𝑖,(A.4)𝑞cov𝛽0𝑖𝒪𝑙𝑖̂𝛽𝑙𝑖̂𝛽𝑇𝑃(𝑅𝑖=1𝑇𝑖,𝐸𝑖,𝑋𝑖,𝑌𝑖)2.(A.5)

Alternatively, bootstrap variance estimation might be utilized, but it should be noted that standard bootstrap would be valid only under Bernoulli type sampling designs, which can be interpreted as sampling from infinite population, whereas dependencies induced by sampling without replacement would require application of finite population bootstrap methods (e.g., [43, 44]).

B. Relationship to Retrospective Likelihood

Consider a special case where the observed data are only (𝑅𝒞,𝐸𝒞,𝑌𝒪,𝑍𝒪), where 𝐸𝑖, 𝑖𝒞, are binary indicators for the case status. This corresponds to the case-control situation considered by Jiang et al. [6] and Lin and Zeng [7]. The set 𝒞 now represents a study base from which the cases and controls have been selected. Now choosing a partitioning 𝑊=(𝑌𝒪,𝑍𝒪) and 𝑉=(𝑅𝒞,𝐸𝒞), and supposing only the case status information has been used in the selection of cases and controls, we obtain a conditional likelihood 𝑃𝑌𝒪,𝑍𝒪𝑅𝒞,𝐸𝒞=𝑃𝑅,𝜃𝒞𝐸𝒞𝑃𝐸𝒞,𝑌𝒪,𝑍𝒪𝜃𝑃𝑅𝒞𝐸𝒞𝑃𝐸𝒞=𝜃𝑖𝒪𝑃𝐸𝑖𝑌𝑖,𝑍𝑖𝑃𝑌,𝛼𝑖𝑍𝑖𝑃𝑍,𝛽𝑖𝛾𝑦𝑖𝑧𝑖𝑃𝐸𝑖𝑦𝑖,𝑧𝑖𝑃𝑌,𝛼𝑖𝑑𝑦𝑖𝑧𝑖𝑃𝑍,𝛽𝑖𝑑𝑧𝑖.𝛾(B.1) The last form is the retrospective form suggested by Lin and Zeng [7, page 257] for secondary analysis under case-control designs. Thus, retrospective likelihood is a special case of conditional likelihood. The advantage of the above expression is that the terms specifying the sampling mechanism cancel out. The drawbacks are that the baseline risk level (part of 𝛼) is not identifiable from this expression (see, e.g., [30, page 1076]) and that the parameters 𝛾 have to be estimated or otherwise made to disappear (Lin and Zeng [7] applied profile likelihood). However, nothing prevents us from choosing a more useful partitioning of the observed data. For instance, with 𝑊=𝑌𝒪 and 𝑉=(𝑅𝒞,𝐸𝒞,𝑍𝒪), we obtain 𝑃𝑌𝒪𝑅𝒞,𝐸𝒞,𝑍𝒪=𝑃𝑅,𝜃𝒞𝐸𝒞𝑃𝐸𝒞,𝑌𝒪,𝑍𝒪𝜃𝑃𝑅𝒞𝐸𝒞𝑃𝐸𝒞,𝑍𝒪=𝜃𝑖𝒪𝑃𝐸𝑖𝑌𝑖,𝑍𝑖𝑃𝑌,𝛼𝑖𝑍𝑖,𝛽𝑦𝑖𝑃𝐸𝑖𝑦𝑖,𝑍𝑖𝑃𝑌,𝛼𝑖𝑑𝑦𝑖𝑍𝑖,,𝛽(B.2) or, with 𝑊=(𝐸𝒪,𝑌𝒪) and 𝑉=(𝑅𝒞,𝐸𝒞𝒪,𝑍𝒪), analogously to Section 4.2, 𝑃𝐸𝒪,𝑌𝒪𝑅𝒞,𝐸𝒞𝒪,𝑍𝒪,𝜃𝜃𝑃𝐸𝒞,𝑌𝒪,𝑍𝒪𝜃𝑃𝑅𝒞𝐸𝒞𝒪𝑃𝐸,𝜃𝒞𝒪,𝑍𝒪=𝜃𝑖𝒪𝑃𝐸𝑖𝑌𝑖,𝑍𝑖𝑃𝑌,𝛼𝑖𝑍𝑖,𝛽𝑦𝑖𝑖𝒪𝑒𝑖𝑖𝒪𝑃𝑅𝒞𝐸𝒞𝑖𝒪𝑃𝐸𝑖=𝑒𝑖𝑦𝑖,𝑍𝑖𝑃𝑌,𝛼𝑖𝑑𝑦𝑖𝑍𝑖.,𝛽(B.3) The baseline risk level may be identifiable from this last expression, depending on the sampling mechanism used (cf. [18]). The misconception that case-control design necessitates the use of retrospective likelihood has been recently criticized by Langholz [45] and results from equating likelihood function to a data generating mechanism. However, the order of the data collection does not need to determine how the likelihood is factorized, as long as the sampling mechanism is properly taken into account.

C. Mean and Variance of the Conditional Likelihood Score Function

In the following, we suppress the covariates 𝑋𝒞 from the notation since these are always conditioned upon and do not affect the derivations. Also, for notational simplicity, the parameter 𝜃 is taken to be a scalar, and the notation is compressed so that, for example, 𝑃(𝑑𝑧𝒪𝜃)𝑃(𝑍𝒪𝑑𝑧𝒪𝜃)𝑃(𝑍𝑖𝑑𝑧𝑖forall𝑖𝒪𝜃). The probability measures indexed with respect to 𝜃 are assumed to possess the regularity properties allowing interchanging the order of integration with respect to the random variables and differentiation with respect to 𝜃. The conditional likelihood function of interest is now 𝑃𝑑𝑞𝒪𝑟𝒞,𝑞𝒞𝒪,𝑧𝒪=𝑃𝑟,𝜃𝒞𝑞𝒞,𝑧𝒪𝑃,𝜃𝑑𝑞𝒞,𝑑𝑧𝒪𝜃𝑃𝑟𝒞𝑞𝒞𝒪,𝑧𝒪𝑃,𝜃𝑑𝑞𝒞𝒪,𝑑𝑧𝒪𝜃𝜃𝑃𝑑𝑞𝒪,𝑑𝑞𝒞𝒪𝑧𝒪𝑃,𝜃𝑑𝑧𝒪𝜃𝑃𝑟𝒞𝑞𝒞𝒪,𝑧𝒪𝑃,𝜃𝑑𝑞𝒞𝒪𝑧𝒪𝑃,𝜃𝑑𝑧𝒪=𝑃𝜃𝑑𝑞𝒪𝑞𝒞𝒪,𝑧𝒪,𝜃𝑃𝑟𝒞𝑞𝒞𝒪,𝑧𝒪,,𝜃(C.1) and the corresponding score function becomes 𝜕log𝑃𝑑𝑞𝒪𝑟𝒞,𝑞𝒞𝒪,𝑧𝒪,𝜃=𝜕𝜃𝜕log𝑃𝑑𝑞𝒪𝑞𝒞𝒪,𝑧𝒪,𝜃𝑟𝜕𝜃𝜕log𝑃𝒞𝑞𝒞𝒪,𝑧𝒪,𝜃=𝜕𝜃𝜕𝑃𝑑𝑞𝒪𝑞𝒞𝒪,𝑧𝒪,𝜃/𝜕𝜃𝑃𝑑𝑞𝒪𝑞𝒞𝒪,𝑧𝒪𝑟,𝜃𝜕𝑃𝒞𝑞𝒞𝒪,𝑧𝒪,𝜃/𝜕𝜃𝑃𝑟𝒞𝑞𝒞𝒪,𝑧𝒪.,𝜃(C.2) The expectation of the score function is considered with respect to the full model 𝑃(𝑟𝒞𝑞𝒞,𝑧𝒞,𝜃)𝑃(𝑑𝑞𝒞,𝑑𝑧𝒞𝜃). The expectation of the first term in (C.2) becomes 𝐸𝜕𝑃𝑑𝑞𝒪𝑞𝒞𝒪,𝑧𝒪,𝜃/𝜕𝜃𝑃𝑑𝑞𝒪𝑞𝒞𝒪,𝑧𝒪=,𝜃𝑟𝒞𝑞𝒞𝑧𝒞𝜕𝑃𝑑𝑞𝒪𝑞𝒞𝒪,𝑧𝒪,𝜃/𝜕𝜃𝑃𝑑𝑞𝒪𝑞𝒞𝒪,𝑧𝒪𝑃𝑟,𝜃𝒞𝑞𝒞,𝑧𝒞𝑃,𝜃𝑑𝑞𝒞,𝑑𝑧𝒞=𝜃𝑟𝒞𝑞𝒞𝑧𝒪𝜕𝑃𝑑𝑞𝒪𝑞𝒞𝒪,𝑧𝒪,𝜃/𝜕𝜃𝑃𝑑𝑞𝒪𝑞𝒞𝒪,𝑧𝒪×,𝜃𝑧𝒞𝒪𝑃𝑟𝒞𝑞𝒞,𝑧𝒞𝒪,𝑧𝒪𝑃,𝜃𝑑𝑞𝒞,𝑑𝑧𝒞𝒪,𝑑𝑧𝒪=𝜃𝑟𝒞𝑞𝒞𝑧𝒪𝜕𝑃𝑑𝑞𝒪𝑞𝒞𝒪,𝑧𝒪,𝜃/𝜕𝜃𝑃𝑑𝑞𝒪𝑞𝒞𝒪,𝑧𝒪𝑟,𝜃×𝑃𝒞𝑞𝒞,𝑧𝒪𝑃,𝜃𝑑𝑞𝒪𝑞𝒞𝒪,𝑧𝒪𝑃,𝜃𝑑𝑞𝒞𝒪,𝑑𝑧𝒪.𝜃(C.3) In the above expressions, it should be noted that the index set 𝒪 is fixed by 𝑟𝒞. Due to the assumption on unconfounded sampling mechanism, the term 𝑃(𝑟𝒞𝑞𝒞,𝑧𝒪,𝜃) does not depend on 𝜃, and we get 𝐸𝜕𝑃𝑑𝑞𝒪𝑞𝒞𝒪,𝑧𝒪,𝜃/𝜕𝜃𝑃𝑑𝑞𝒪𝑞𝒞𝒪,𝑧𝒪=,𝜃𝑟𝒞𝑞𝒞𝑧𝒪𝜕𝑃𝑟𝒞𝑞𝒞,𝑧𝒪𝑃,𝜃𝑑𝑞𝒪𝑞𝒞𝒪,𝑧𝒪,𝜃𝑃𝜕𝜃𝑑𝑞𝒞𝒪,𝑑𝑧𝒪=𝜃𝑟𝒞𝑞𝒞𝒪𝑧𝒪𝜕𝑞𝒪𝑃𝑟𝒞𝑞𝒪,𝑞𝒞𝒪,𝑧𝒪𝑃,𝜃𝑑𝑞𝒪𝑞𝒞𝒪,𝑧𝒪,𝜃𝑃𝜕𝜃𝑑𝑞𝒞𝒪,𝑑𝑧𝒪=𝜃𝑟𝒞𝑞𝒞𝒪𝑧𝒪𝑟𝜕𝑃𝒞𝑞𝒞𝒪,𝑧𝒪,𝜃𝑃𝜕𝜃𝑑𝑞𝒞𝒪,𝑑𝑧𝒪.𝜃(C.4) Similarly, the expectation of the second term in (C.2) becomes𝐸𝑟𝜕𝑃𝒞𝑞𝒞𝒪,𝑧𝒪,𝜃/𝜕𝜃𝑃𝑟𝒞𝑞𝒞𝒪,𝑧𝒪=,𝜃𝑟𝒞𝑞𝒞𝑧𝒞𝑟𝜕𝑃𝒞𝑞𝒞𝒪,𝑧𝒪,𝜃/𝜕𝜃𝑃𝑟𝒞𝑞𝒞𝒪,𝑧𝒪𝑃𝑟,𝜃𝒞𝑞𝒞,𝑧𝒞𝑃,𝜃𝑑𝑞𝒞,𝑑𝑧𝒞=𝜃𝑟𝒞𝑞𝒞𝑧𝒪𝑟𝜕𝑃𝒞𝑞𝒞𝒪,𝑧𝒪,𝜃/𝜕𝜃𝑃𝑟𝒞𝑞𝒞𝒪,𝑧𝒪,𝜃×𝑃𝑑𝑞𝒪𝑟𝒞,𝑞𝒞𝒪,𝑧𝒪𝑃𝑟,𝜃𝒞𝑞𝒞𝒪,𝑧𝒪𝑃,𝜃𝑑𝑞𝒞𝒪,𝑑𝑧𝒪=𝜃𝑟𝒞𝑞𝒞𝒪𝑧𝒪𝑟𝜕𝑃𝒞𝑞𝒞𝒪,𝑧𝒪,𝜃𝑃𝜕𝜃𝑑𝑞𝒞𝒪,𝑑𝑧𝒪𝜃𝑞𝒪𝑃𝑑𝑞𝒪𝑟𝒞,𝑞𝒞𝒪,𝑧𝒪=,𝜃𝑟𝒞𝑞𝒞𝒪𝑧𝒪𝑟𝜕𝑃𝒞𝑞𝒞𝒪,𝑧𝒪,𝜃𝑃𝜕𝜃𝑑𝑞𝒞𝒪,𝑑𝑧𝒪.𝜃(C.5) Thus, we have 𝐸𝜕𝑃𝑑𝑞𝒪𝑞𝒞𝒪,𝑧𝒪,𝜃/𝜕𝜃𝑃𝑑𝑞𝒪𝑞𝒞𝒪,𝑧𝒪𝑟,𝜃=𝐸𝜕𝑃𝒞𝑞𝒞𝒪,𝑧𝒪,𝜃/𝜕𝜃𝑃𝑟𝒞𝑞𝒞𝒪,𝑧𝒪,𝜃,(C.6) and the conditional likelihood score function has zero expectation. The Fisher information is given by 𝐸𝜕2log𝑃𝑑𝑞𝒪𝑟𝒞,𝑞𝒞𝒪,𝑧𝒪,𝜃𝜕𝜃2𝜕=𝐸2𝑟log𝑃𝒞𝑞𝒞𝒪,𝑧𝒪,𝜃𝜕𝜃2𝜕𝐸2log𝑃𝑑𝑞𝒪𝑞𝒞𝒪,𝑧𝒪,𝜃𝜕𝜃2