Joint Models and Their Applications
A Semiparametric Marginalized Model for Longitudinal Data with Informative Dropout
We propose a marginalized joint-modeling approach for marginal inference on the association between longitudinal responses and covariates when longitudinal measurements are subject to informative dropouts. The proposed model is motivated by the idea of linking longitudinal responses and dropout times by latent variables while focusing on marginal inferences. We develop a simple inference procedure based on a series of estimating equations, and the resulting estimators are consistent and asymptotically normal with a sandwich-type covariance matrix ready to be estimated by the usual plug-in rule. The performance of our approach is evaluated through simulations and illustrated with a renal disease data application.
1. Introduction
Longitudinal studies often encounter data attrition because subjects drop out before the designated study end. Both statistical analysis and practical interpretation of longitudinal data can be complicated by dropouts. For example, in the Modification of Diet in Renal Disease (MDRD) study [1, 2], one main interest was to investigate the efficacy of interventions of blood pressure control and diet modification on patients with impaired renal functions. The primary outcome was glomerular filtration rate (GFR), which measures the filtering capacity of the kidneys and was measured repeatedly over the study period. However, some patients could leave the study prematurely for kidney transplant or dialysis, which precluded further GFR measurements. This resulted in a dropout mechanism that could relate to patients' kidney function and correlate with their GFR values. Other patients were followed to the end of the study or dropped out for independent reasons. Thus, statistical analysis of longitudinal GFR needs to take into consideration the presence of mixed types of informative and independent dropouts.
Many statistical models and inference approaches have been proposed to accommodate nonignorable missingness in the modeling of longitudinal data (see reviews [3–8]). According to the target of inference and the interpretation of model parameters, existing methods can be classified into three categories: subject-specific inference, event-conditioning inference, and marginal inference. First, a widely used modeling strategy for longitudinal data with informative dropouts is to specify their joint distribution via shared or correlated latent variables. Under such model assumptions, the longitudinal parameters have a conditional, subject-specific interpretation (e.g., [9–11]). However, the interpretation of the longitudinal parameters usually changes with the number and characteristics of the latent variables assumed, for example, a single random intercept versus a random intercept plus a random slope.
Second, event-conditioning approaches have also been widely used when the target of inference is subgroups of patients with particular dropout patterns or when the dropout can potentially change the material characteristics of the longitudinal process (e.g., death). The inference is usually conducted conditioning on the dropout pattern or on the occurrence of the dropout event. Thus, model parameters have an event-conditioning, subpopulation-averaged interpretation: for example, pattern-mixture models for the group expectation of each dropout pattern [3, 12]; treatment effects among survivors; and gender and age effects in a mortal cohort. Because the interpretation of such models is made by conditioning on a future event, event-conditioning approaches may be natural in a retrospective setting but may not be directly useful for the prospective evaluation of treatment efficacy.
Lastly, when the research objective is to study covariate effects at the population level in a dropout-free situation, marginal models address this concern directly. When data are complete or missing completely at random (in Rubin's terminology), the estimation of model parameters can be carried out by the generalized estimating equation (GEE) approach assuming a “working” correlation matrix. When dropouts are missing at random, inverse probability-weighted GEE methods are commonly used [17, 18]. In the presence of informative dropouts, the class of selection models originally proposed to adjust for selection bias in econometrics has been widely used for the marginal analysis of longitudinal data [20–22]. Recently, the marginalized transition model and the marginalized pattern-mixture model were proposed for binary longitudinal data with finite nonignorable nonresponse patterns. These marginalized approaches provide a powerful tool for studying the marginal association between longitudinal outcomes and covariates while incorporating nonignorable nonresponses.
In this paper, we shall adopt the idea of shared latent variables to account for the dependence between longitudinal responses and informative dropouts while focusing on marginal inference for the longitudinal responses. Here dropouts can occur on a continuous time scale. We develop an effective estimation procedure built on a series of asymptotically unbiased estimating equations with light computational burden. The resulting estimators for longitudinal parameters are shown to be consistent and asymptotically normal, with a sandwich-type variance-covariance matrix that can be estimated by the usual plug-in rule.
The remainder of the paper is organized as follows. In Section 2, we introduce the notation and the proposed semiparametric marginalized model. In Section 3, a simple estimating equation-based procedure is first proposed for the situation with pure informative dropouts and is extended to a more general situation where there is a mixture of random dropouts and informative dropouts. Asymptotic properties of resulting estimators are also studied. Simulation studies and an application to a renal disease data set are given in Section 4. Some remarks are discussed in Section 5. All the technical details are provided in the appendix.
2. Notation and Model Specification
2.1. Data Notation
Consider a longitudinal study that follows subjects over time period . For the ith subject , the complete data consist of , where is the value of the response at the jth observation time and is a vector of covariates associated with the response . Note that includes baseline covariates, separately denoted by , and potential time-dependent covariates. Let denote the informative dropout time and denote the random censoring time, which is independent of given the covariates. In practice, we observe , where and takes the value 1 if the informative dropout time is observed and 0 otherwise. Throughout the paper, let denote the indicator function. Due to dropout, longitudinal responses and covariates can be observed only at . Hence, the observed data are , where .
2.2. Semiparametric Marginalized Latent Variable Model
We first introduce the composition of our proposed model and then discuss the model motivation and interpretation. The first component is a marginal generalized linear model for longitudinal responses ’s: where is a known link function and denotes the marginal expectation. The second component is a linear transformation model for the informative dropout time : where is an unspecified monotone transformation function and is assumed to follow a known continuous distribution that is independent of . The last component is a conditional mean model characterizing the dependence between longitudinal responses and informative dropouts: where the latent random effects are investigator-specified functions of and covariates, and is an implicit parameter whose value is determined by the integral equation matching the conditional mean model (2.3) with the corresponding marginal model (2.1), that is,
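The displayed equations of the three components did not survive extraction. In assumed notation (our symbols, not necessarily the authors': $Y_{ij}$ the response, $X_{ij}$ its covariates, $Z_i$ baseline covariates, $\varepsilon_i$ the transformation-model error), the structure can be sketched as:

```latex
% (2.1) marginal generalized linear model
g\{E(Y_{ij}\mid X_{ij})\} = X_{ij}^{\top}\beta,
% (2.2) linear transformation model for the informative dropout time
H(T_i) = -Z_i^{\top}\gamma + \varepsilon_i, \qquad \varepsilon_i \sim F \text{ known},
% (2.3) conditional mean model linking the two processes
E(Y_{ij}\mid X_{ij},\varepsilon_i)
  = g^{-1}\bigl\{\Delta_{ij} + \theta^{\top} q(\varepsilon_i, X_{ij})\bigr\},
% (2.4) marginalization constraint determining the implicit parameter \Delta_{ij}
\int g^{-1}\bigl\{\Delta_{ij} + \theta^{\top} q(\varepsilon, X_{ij})\bigr\}\,dF(\varepsilon)
  = g^{-1}(X_{ij}^{\top}\beta).
```

Here $q(\cdot,\cdot)$ stands for the investigator-specified latent random effects, and $\Delta_{ij}$ is the implicit parameter solved from (2.4).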
The marginal mean model (2.1) directly specifies the marginal relationship between the responses and covariates, and is the vector of marginal regression parameters of main interest. Next, the semiparametric linear transformation model (2.2) provides a flexible survival model for the informative dropout time while still being easy to incorporate into model (2.3) for the dependence between the longitudinal responses and informative dropouts. Model (2.2) includes the proportional hazards model , the proportional odds model , and the Box-Cox transformation model as special cases and has been studied intensively in the survival analysis literature [27–29]. In addition, as we show in Section 3, the explicit assumption on the error distribution in (2.2) facilitates the “marginalization” procedure for parameter estimation.
The conditional mean model (2.3) is motivated by the construction of the marginalized random-effects model [30, 31]. As a motivating example, we consider a continuous Gaussian process following a simple random-effects model, , where are the random intercept and slope, and the error terms are assumed to follow but are independent of or . Note that the 's can still exhibit temporal dependence beyond what is accounted for by the random effects, that is, with an AR(1) covariance structure. Furthermore, as in joint modeling approaches via latent variables, the joint distribution of is assumed to be . It is easy to see that the conditional mean has the same form as model (2.3). We use model (2.3) primarily as a parsimonious model for the dependence structure between the longitudinal responses and informative dropout times. Note, however, that although model (2.3) takes a similar form to the marginalized random-effects model, it does not intend to fully specify the joint distribution of the repeated measurements, since model (2.3) only specifies the conditional mean function and no conditional independence is assumed.
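The display for this Gaussian conditional mean was lost in extraction; in assumed notation (our symbols), multivariate normal conditioning gives a sketch of the result:

```latex
% Assume (b_{0i}, b_{1i}, \varepsilon_i) jointly normal with E(b_{0i})=E(b_{1i})=0,
% Var(\varepsilon_i)=1, Cov(b_{0i},\varepsilon_i)=\sigma_{0\varepsilon},
% Cov(b_{1i},\varepsilon_i)=\sigma_{1\varepsilon}.  Then
E\{Y_{ij} \mid X_{ij}, \varepsilon_i\}
  = X_{ij}^{\top}\beta
    + (\sigma_{0\varepsilon} + \sigma_{1\varepsilon} t_{ij})\,\varepsilon_i ,
```

that is, the conditional mean equals the marginal mean shifted by a linear-in-time multiple of the latent variable, matching the form of model (2.3) with latent effects $(\varepsilon_i, \varepsilon_i t_{ij})$.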
Note that is the solution of (2.4), and thus its value implicitly depends on , , the formulation of , and the distribution of . The specification of reflects the investigator's assumptions on the dependence structure among the longitudinal responses and their association with the dropout times. It is well known that the dependence assumptions between longitudinal measurements and informative dropouts are usually unverifiable from the observed data, yet they intrinsically affect the inference about . Thus, a sensitivity analysis under various assumptions is always warranted, and such an analysis can be easily conducted within the framework of model (2.3). For example, the sensitivity analysis may start with a large model in the specification of , for example, , and then examine the statistical significance of the estimates of the ’s to further simplify the model. As shown in the next subsection, a complex structure can be imposed on without introducing much extra computation. Lastly, we note that the marginal interpretation of the longitudinal parameters in model (2.1) is invariant under different specifications of the conditional mean model (2.3).
3. Estimation and Asymptotic Properties
3.1. Conditional Generalized Estimating Equation
First, assume that is known. We construct a “conditional” generalized estimating equation for . More specifically, the estimating function is specified as where denotes the vector of observed responses of subject ; ; ; ; ; is a weight matrix.
It is easy to see that has mean zero at the true parameter values under model (2.3). Note that the vector of marginal parameters is implicitly present in with through the constraint equation (2.4). Thus, the Jacobian matrix needs to be derived using both the constraint (2.4) and models (2.1) and (2.3), which differs from the ordinary GEE. More specifically, the entries of the Jacobian matrix are given by where , , and we use to denote the derivative of a function throughout this paper. In particular, we have and under the logit link for binary data, and and under the log link for count data. Thus, under these canonical link functions, and are the marginal variance and conditional variance of the responses, respectively. In addition, these formulations also facilitate the selection of the weight matrix . For example, for binary longitudinal data with the logit link, we can choose the weight matrix .
It is clear that the implementation of the estimating function (3.1) requires the knowledge of , which is an unknown quantity and has to be estimated first. The estimation of the semiparametric linear transformation model (2.2) has been studied by many authors [27–29]. In particular, Chen et al.  proposed a class of martingale-based estimating equations, where . Then an iterative algorithm can be carried out to solve and simultaneously. We estimate using the approach of Chen et al.  and shall denote the estimates as .
3.2. Estimation Procedure for Pure Informative Dropouts
We first consider the situation of pure informative dropouts, that is, . Define and replace ’s in (3.1) with their estimated counterparts ’s. Denote the resulting estimating function by and define the estimator of as the solution to . The estimation of entails an iteration between solving nonlinear equations for and updating a Newton-Raphson step for . More specifically, given the current estimate of at the Jth step, we first estimate from and then update the parameters by where , , and are evaluated at the current parameter values and . The algorithm is iterated until convergence. Because is assumed to follow an explicit parametric distribution , the marginalization procedure (3.4) is greatly simplified. We propose to use the Gaussian-quadrature approach to numerically evaluate (3.4) and . Since the integrand of (3.4) is monotone in , and so is the whole integral, it is easy to compute a large number of in all iterative steps. Moreover, the numerical integration is only over the one-dimensional space of and requires little computation even with a complex structure assumed on . The proposed iterative algorithm has been implemented in R, and the code is available from the authors upon request.
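To make the marginalization step concrete, the following sketch (our own illustrative helper, not the authors' R code; it assumes a logit link, a standard normal latent variable, and investigator-specified latent effects given as functions of the latent variable) solves the constraint (2.4) for the implicit parameter by Gauss-Hermite quadrature plus bracketing root-finding:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import expit

def solve_delta(mu, theta, q_funcs, n_nodes=50):
    """Solve  mu = E_eps[ expit(delta + theta' q(eps)) ]  for delta,
    with eps ~ N(0, 1), by Gauss-Hermite quadrature and root-finding.
    q_funcs: list of functions of eps defining the latent effects."""
    # probabilists' Hermite rule: integrates against exp(-x^2 / 2)
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_nodes)
    weights = weights / np.sqrt(2.0 * np.pi)  # normalize to the N(0,1) density
    shift = theta @ np.array([q(nodes) for q in q_funcs])  # theta' q(eps_k)

    def constraint(delta):
        return np.sum(weights * expit(delta + shift)) - mu

    # the quadrature sum is strictly increasing in delta, so the root is unique
    return brentq(constraint, -30.0, 30.0)
```

As the text notes, monotonicity of the integrand in the implicit parameter makes each solve a cheap one-dimensional root-finding problem, so many such parameters can be recomputed at every iteration.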
3.3. Estimation Procedure for Mixed Types of Dropouts
We generalize the proposed estimation function (3.1) to accommodate the situation where there are mixed informative dropouts and random censoring. More specifically, the modified estimating equation is given by where ; the jth component of is and the Jacobian matrix . When , the ith component of is the same as the one in (3.1). For , the entries of are given by In addition, the entries of the weight matrix can be changed to accordingly. Conditional expectations of various functions given are computed using the Gaussian-quadrature method. Let and replace 's in (3.6) with their estimated counterparts ’s. Denote the resulting estimating function by . Then the estimator of can be obtained from the equation using the same iterative algorithm described in the previous subsection.
3.4. Asymptotic Properties of
In this subsection, we establish the asymptotic properties of . Toward this end, we need the following assumptions.
(C1) The covariates ’s are bounded with probability 1.
(C2) The true parameter values and belong to the interior of a known compact set, and the true transformation function has a continuous and positive derivative.
(C3) Let denote the cumulative hazard function of . Define and . Then is positive, is continuous, and .
(C4) is finite and satisfies and .
(C5) The matrix is positive definite, and the number of repeated measurements .
The regularity conditions (C1)–(C4) are also used by Chen et al.  to derive the consistency and asymptotic normality of the estimators . Condition (C5) is needed to establish the consistency and asymptotic normality of , which are given in the following theorem.
Theorem 3.1. Under conditions (C1)–(C5), with probability 1, . In addition, one has, as ,
The definition of and a sketch of the proof for Theorem 3.1 are given in the appendix. The asymptotic variance-covariance matrix can be consistently estimated by its empirical counterpart , which can be easily obtained using the usual plug-in rule.
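To illustrate the plug-in rule, here is a minimal sketch (a hypothetical helper of our own, not from the paper): the sandwich covariance is formed by averaging the per-subject Jacobians and the outer products of the per-subject estimating functions, then combining them.

```python
import numpy as np

def sandwich_cov(jacobians, scores):
    """Plug-in sandwich covariance  A^{-1} B A^{-T} / n,  where
    A averages the per-subject Jacobians dU_i/dtheta and
    B averages the outer products U_i U_i' of the estimating functions."""
    n = len(scores)
    A = np.mean(jacobians, axis=0)                       # "bread"
    B = np.mean([np.outer(u, u) for u in scores], axis=0)  # "meat"
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv.T / n
```

In practice the Jacobians and estimating functions would be evaluated at the fitted parameter values, exactly as the empirical counterpart in Theorem 3.1 prescribes.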
4. Numerical Studies
4.1. Simulation Studies
We conducted a series of simulation studies to evaluate the finite-sample performance of the proposed approach. Consider a binary longitudinal process with the marginal probability of success , where is the logit link; observations occurred at ; was generated from a Bernoulli distribution with success probability 0.5, and . The informative dropout time was generated from a linear transformation model , where . We considered three distributions for : the standard normal distribution (N.), the extreme value distribution (E.), and the logistic distribution (L.), corresponding to the normal transformation model, the Cox proportional hazards model, and the proportional odds model, respectively, for the informative dropout time. We then generated the binary response independently from a Bernoulli distribution with success probability , where was calculated to match the marginal mean value as in (2.4) and indicated the level of dependence. We considered several combinations to specify the dependence between longitudinal outcomes and informative dropout times. More specifically, when , there were no informative dropouts; when (or 0.25) and , the dependence existed and was linear in the latent variable ; when (or 0.5) and , the dependence was present through an interaction between the latent variable and the observation time.
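The dropout-time generation described above can be sketched as follows (our own helper; the transformation and coefficient values are illustrative stand-ins, since the paper's displayed values were lost in extraction). Each error law corresponds to one of the three survival models named in the text:

```python
import numpy as np

def simulate_dropout_times(Z, gamma, H_inv, error="normal", rng=None):
    """Draw dropout times from the linear transformation model
    H(T_i) = -gamma' Z_i + eps_i,  i.e.  T_i = H^{-1}(-gamma' Z_i + eps_i)."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = Z.shape[0]
    if error == "normal":        # normal transformation model
        eps = rng.standard_normal(n)
    elif error == "extreme":     # extreme value error -> Cox PH model
        eps = np.log(rng.exponential(size=n))
    elif error == "logistic":    # logistic error -> proportional odds model
        eps = rng.logistic(size=n)
    else:
        raise ValueError(error)
    return H_inv(-Z @ gamma + eps)
```

For instance, with H(t) = log t (so H_inv = np.exp) and standard normal error, the dropout times are log-normal; subject i's responses are then retained only at visit times before T_i.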
For each scenario, we considered samples of size 100 and 200 and conducted 500 runs of simulations. The Gaussian-quadrature approximation was calculated using 50 grid points. We first considered the situation of pure informative dropouts and generated the dropout time from the transformation model with . Under the assumptions of following the normal, the extreme value, and the logistic distributions, the average numbers of repeated measurements were 3.91, 3.37, and 3.94, respectively. The estimation results on and are summarized in Table 1. The proposed estimators are unbiased under all simulated scenarios, and the Wald-type 95% confidence intervals all have reasonable empirical coverage probabilities. The performance of the proposed method is consistent with different distributional assumptions of and different specifications of the dependence structure, and the results improve as the sample size increases.
Next, we considered the situation with mixed informative dropouts and random censoring. For simplicity, we let be an administrative censoring time at the end of the study, that is, . The informative dropout time was generated from the transformation model with , and followed the standard normal distribution. This yielded a proportion of informative dropouts of 69.6% and an average of about 4 repeated measurements. Other settings were kept the same as in the previous simulations. The simulation results are presented in Table 2. Again, the proposed approach gives unbiased parameter estimates and reasonable coverage probabilities under all scenarios. For comparison, we also implemented the ordinary GEE method . When informative dropout is absent, that is, , the GEE method yields consistent parameter estimates of , as expected. But when informative dropout is present (), the performance of the GEE method deteriorates quickly as the magnitude of the dependence between the longitudinal data and the informative dropout increases.
Last, we conducted a sensitivity analysis for the proposed approach; these simulations consisted of two parts. First, as discussed in Section 2, to better characterize the dependence structure between longitudinal responses and informative dropouts, we suggest starting with a large model in the specification of and then examining the statistical significance of the estimates of the ’s to further simplify the model. We simulated data from a simple model with either or , and then applied the proposed approach assuming the bigger model (2.3) with . The simulation results are summarized in the top panel of Table 3. The proposed method estimates all the parameters reasonably well and, in particular, correctly indicates the unnecessary zero term. Second, we simulated data from but fitted misspecified models that omitted some terms. The results are summarized in the lower panel of Table 3. It is evident that misspecified models lead to biased estimates of both the longitudinal regression coefficients and the dependence parameters.
4.2. Application to Renal Disease Data from MDRD Study
Here we considered a subgroup of 129 patients on the low-protein diet in MDRD Study B, among whom 62 were randomized to normal blood pressure control and 67 to low blood pressure control. Besides the randomized intervention, other covariates included time in study (time), baseline disease progression status (Prog), baseline blood pressure (Bp), and log-transformed baseline urine protein level (log.Pt). Fifty-two patients (40.3%) left the study prematurely for kidney transplant or dialysis and were treated as informative dropouts.
We applied the proposed approach to estimate the marginal effects of covariates on GFR values. To account for the possible informative dropouts, we assumed that the dependence term had a form of , analogous to the joint modeling approach with latent random intercept and random slope used in Schluchter et al. . We considered the situations of following the standard normal, the extreme value, or the standard logistic distributions. Because the outcome was a continuous variable, we used the identity link function.
Our results are presented in Table 4 and compared with those from the ordinary GEE  with an independence working correlation matrix. More specifically, the slope estimates from the proposed approach indicate a much faster rate of GFR decline (e.g., time Est = −0.27, SE = 0.03, under the normality assumption for ) than the ordinary GEE method (time Est = −0.14, SE = 0.03). A possible explanation is that patients remaining under observation usually have better kidney function and thus higher GFR values. The ordinary GEE approach, which treats the observed patients as random representatives of the population, tends to underestimate the declining trend of GFR.
The estimates for the blood pressure intervention show a positive effect of low blood pressure control on longitudinal GFR development. Although the results are not statistically significant, the estimates from the proposed method (e.g., Intervention Est = 0.82, SE = 1.07, under the normality assumption for ) are about twice as large as those from the ordinary GEE method (Intervention Est = 0.35, SE = 0.90). Moreover, for the proposed approach, the results under different distributional assumptions for are quite similar. The estimates of the dependence parameters are positive and statistically significant, indicating that higher GFR values are associated with longer dropout times in the study. In addition, the proposed approach shows that the baseline urine protein level is significantly associated with longitudinal GFR development, whereas the ordinary GEE method does not show such significance. The results obtained using the proposed method are also consistent with those reported in Schluchter et al. .
5. Remarks
In this paper, we propose a semiparametric marginalized model for marginal inference on the relationship between longitudinal responses and covariates in the presence of informative dropouts. The regression parameters represent covariate effects at the population level. The proposed estimators are expected to be insensitive to misspecification of the latent variable distribution , which is desirable for sensitivity analyses of the unverifiable assumptions on the informative dropouts. In practice, the choice between marginal models and other types of joint modeling approaches should be determined by the study objective.
To estimate the regression parameters in the proposed marginalized model, we proposed a class of simple conditional generalized estimating equations and demonstrated their computational convenience. In general, a likelihood-based approach can achieve more efficient inference and is also of great interest. For example, a marginalized random-effects model [30, 31] can be used for the longitudinal process and a frailty model  for the dropout time. Furthermore, the latent variables can be modeled by a copula distribution or non-Gaussian distributions . Likelihood-based methods enjoy high efficiency and facilitate classical model selection procedures, such as AIC or BIC; however, intensive computation is often involved when the dimension of the random effects is high.
A. Proof of Theorem 3.1
Under conditions (C1)–(C4), Chen et al.  established the consistency of and the uniform consistency of a transformed , that is, . We first derive the consistency of the proposed estimator , which is the solution of the equation . Note that is a positive definite matrix, and by the law of large numbers and the consistency of and , it converges uniformly to a deterministic positive definite matrix over a compact set of . In addition, converges uniformly to a deterministic function satisfying and . Thus, the estimating equation has a unique solution . Since is the unique solution of , the consistency of follows easily.
To prove the asymptotic normality, by the Taylor expansion, we have where lies between and . Since is consistent and converges uniformly to , we have that converges to . Furthermore, the Taylor expansion of around gives where and its asymptotic representation can be found in the appendix of Chen et al. . The asymptotic representations of and have also been derived by Chen et al. , where and are the counting and at-risk processes, respectively, is the mean-zero martingale process, and the definitions of and the functions , , , and are given in the appendix of Chen et al. .
Hence, has been written as a standardized summation of independent terms with mean zero. By the central limit theorem, it is asymptotically equivalent to a multivariate Gaussian variable with zero mean and covariance matrix , which is the limit of From (A.1), it is easy to see that the estimator is asymptotically normal with mean zero and the variance-covariance matrix , which can be consistently estimated by its empirical counterpart using the usual plug-in rule.
The authors would like to thank the editor and the referees for their instructive comments. This research was supported partially by NIH grants RO1 CA-140632 and RO3 CA-153083.
G. J. Beck, R. L. Berg, C. H. Coggins et al., “Design and statistical issues of the modification of diet in renal disease trial. The modification of diet in renal disease study group,” Controlled Clinical Trials, vol. 12, pp. 566–586, 1991.
J. W. Hogan and N. M. Laird, “Mixture models for the joint distribution of repeated measures and event times,” Statistics in Medicine, vol. 16, no. 1–3, pp. 239–257, 1997.
J. W. Hogan and N. M. Laird, “Model-based approaches to analysing incomplete longitudinal and failure time data,” Statistics in Medicine, vol. 16, no. 1–3, pp. 259–272, 1997.
R. J. A. Little and D. B. Rubin, Statistical Analysis with Missing Data, John Wiley & Sons, New York, NY, USA, 1987.
P. J. Heagerty and S. L. Zeger, “Marginalized multilevel models and likelihood inference,” Statistical Science, vol. 15, no. 1, pp. 1–26, 2000.
W. H. Press, Numerical Recipes in C : The Art of Scientific Computing, Cambridge University Press, New York, NY, USA, 2nd edition, 1992.
P. K. Andersen, O. Borgan, R. D. Gill, and N. Keiding, Statistical Models Based on Counting Processes, Springer Series in Statistics, Springer, New York, NY, USA, 1993.