Abstract

Mixed effects models are widely used for modelling clustered data when there are large variations between clusters, since mixed effects models allow for cluster-specific inference. In some longitudinal studies such as HIV/AIDS studies, it is common that some time-varying covariates may be left or right censored due to detection limits, may be missing at times of interest, or may be measured with errors. To address these “incomplete data“ problems, a common approach is to model the time-varying covariates based on observed covariate data and then use the fitted model to “predict” the censored or missing or mismeasured covariates. In this article, we provide a review of the common approaches for censored covariates in longitudinal and survival response models and advocate nonlinear mechanistic covariate models if such models are available.

1. Introduction

Mixed effects models are widely used in the analysis of clustered data, especially analysis of longitudinal data or survival data. In a longitudinal study, some variables are measured repeatedly over time, and these variables may be used either as responses or covariates, depending on study objectives. A common problem is that data on some of these variables may be left or right censored due to detection limits, may be missing at times of interest, or may be measured with errors. For example, in HIV/AIDS studies, viral load values may be left censored due to lower detection limits and may be missing or measured with substantial errors. In statistical analysis, these “incomplete data" issues must be addressed for correct statistical inference. In this article, we consider the case when these incompletely observed and time-varying variables are used as important covariates in mixed effects models for longitudinal response data or for time-to-event response data. To simplify the discussion, we focus on time-dependent covariates with left censoring, since similar methods/models may be used for right censoring or missing data or measurement errors in the covariates.

Longitudinal data with left censoring have received increasing attention in the literature in recent years (e.g., [18]). A common approach is to assume an empirical model for the covariate of interest based on the observed data, such as a linear mixed effects model. Then, the empirical model is used to “predict" the true covariate values when these values are censored, assuming the fitted model continues to hold for the unobserved censored values. A potential problem with this approach is that the assumed empirical covariate model based on the observed data may not hold for the censored covariate values, due to possibly different data-generation mechanisms for these “too small to observe" values. For example, in AIDS studies, censored viral loads below the detection limit may behave very differently from those above detection limit (observed values), due to a possibly different disease status for suppressed viral loads [6]. Moreover, the assumed model and distribution for the censored values cannot be verified based on observed data.

Recently, Kong and Nan [4] proposed an interesting approach based on ideas similar to that for right censored survival data, i.e., they used ideas similar to Cox models for right censored survival data for longitudinal data with left censoring. Yu et al. [6] proposed an approach which treats censored values as point mass. While these two approaches make no distributional assumptions for the censored values, the methods may not be efficient if censored values indeed follow a parametric distribution similar to that for the observed values.

In some applications such as HIV viral dynamics and pharmacokinetic modelling, mechanistic or scientific models can be derived based on the underlying data-generation mechanisms. These models are often nonlinear and are derived based on a set of differential equations which approximately describe the true data-generation mechanisms, so these models are justified biologically or scientifically (e.g., [9, 10]). Moreover, these mechanistic models have been shown to fit observed data quite well based on many data analyses [11]. Since these mechanistic models are based on underlying true data-generation mechanisms, they should hold for censored values, even though these values are not observed. Therefore, these models can be used to better “predict" the unobserved censored values than empirical models. In this article, we will provide a review of such approaches. The approaches are illustrated by an HIV/AIDS dataset.

2. Mixed Effects Models with Censored Covariates

In this section, we focus on generalized linear mixed effects models for longitudinal responses and survival models for time-to-event responses, with left censored and time-dependent covariates. The methods can be extended to other types of regression models in a conceptually straightforward way.

2.1. Generalized Linear Mixed Models with Censored Covariates

We first consider generalized linear mixed models (GLMMs) with a left censored and time-dependent covariate in a longitudinal study, following Zhang et al. [7]. Let be the response of interest measured for individual at time , . Let be an important time-dependent covariate which is subject to left censoring, measurement errors, and missing data (assuming missing at random). We denote the unobserved true value of by in the presence of censoring or missing data or measurement errors. Let be a known detection limit for such that -values cannot be observed (detected) if (i.e., left censoring), and let be the censoring indicator such that if and otherwise. Let be a vector of other covariates.

Consider the following GLMM:where is a known link function, ’s are unknown parameters, is a subset of , contains random effects, and is a unknown covariance matrix. We assume that the response follows a distribution in the exponential family such as a normal or Poisson or Binomial distribution. When the covariate is left censored or missing or measured with error, we may assume an empirical model for based on the observed -data, such as a linear mixed effects (LME) model. Then we assume that the LME model continues to hold for censored or unobserved values and proceed for likelihood inference. However, as noted in Section 1, such an approach may be problematic since censored values may not follow the same model obtained based on the observed data.

When a mechanistic or scientific model is available for covariate , such as in HIV viral dynamics, the scientific model should hold not only for observed data but also for unobserved data (e.g., censored or mismeasured or missing data), so that the model can be used to provide better “predictions" for the unobserved true covariate values. Such a scientific model is often nonlinear. For longitudinal data with large between-individual variations, by introducing random effects in the nonlinear model to account for between-individual variations and within-individual correlations among repeated measurements, we obtain a nonlinear mixed effects (NLME) model. Thus, we assume that the covariate follows the following NLME model:where is a known nonlinear function, vector contains random effects, vector contains fixed parameters, is the true covariate value at time , is an unknown covariance matrix, and ’s are random errors (measurement errors).

Note that when is a linear function (so model (2) is an LME model), the covariate model (2) is an empirical model which is chosen based on the observed covariate data. In a more general sense, the empirical models also include semiparametric or nonparametric mixed effects models. Such an empirical model is commonly used to address censoring, missing data, and measurement errors in the literature (e.g., [1, 2, 11]). When covariate is not normal, such as binary or count, generalized linear mixed models may be considered to fit observed covariate data, which are still empirical models. These empirical models may provide poor “predictions" to the unobserved data such as censored data.

2.2. Survival Models with Censored Time-Dependent Covariates

For survival models with time-dependent covariates, the covariates may also be left censored. Moreover, parameter estimation and inference for Cox models require that covariate values are available at event times [11]. However, this is usually not the case, since covariate values are unlikely to be available at all event times. Thus, this leads to missing covariate problems. The covariates may also be measured with errors, i.e., the observed covariate values may not be the true values but values with errors. In all cases, a common approach is to model the covariate process based on the observed covariate data and then use the fitted covariate model to “predict" the censored or missing covariate values. As noted in the previous section, a mechanistic or scientific covariate model may make better “predictions" than empirical covariate models, as shown in Zhang and Wu [8].

Here we consider a Cox model for the survival data with possible right censoring of the event times. For individual , we define to be the minimum of the observed event time and the right censoring time and define to be the censoring indicator such that if the event time is right censored and otherwise, . Let be the hazard function for individual at time . The Cox model with time-dependent covariates can be written aswhere is a vector of regression coefficients and is the (unspecified) baseline hazard function.

When the time-dependent covariate is left censored or missing or measured with errors, inference for the Cox model can be challenging. Similar to the GLMM in the previous section, a common approach is to model the time-dependent covariate based on observed covariate data, assuming the fitted covariate model holds for the censored covariate values. Again, such an empirical approach can be problematic if censored covariate values behave quite differently than observed values. The problem can be fixed if a mechanistic covariate model is available. We may again consider the mechanistic NLME model (2) to address censoring in the covariates.

3. Statistical Inference

For parameter estimation and inference, two methods are commonly used: the two-step method and the joint likelihood method. We briefly review the two methods below.

3.1. Two-Step Methods

To estimate the parameters in the models, a simple approach would be the so-called two-step method: in the first step we fit the covariate model based on the observed covariate data, and then in the second step we fit the response model separately, with the censored or missing covariate values substituted by their predicted values from the first step.

Specifically, consider the GLMM response model (1) and the covariate model (2). In the first step, we fit the NLME covariate model (2) to the observed covariate data and obtain estimates of the parameters and the empirical Bayes estimates of the random effects . The predicted value of the covariate at time is given by Then, in the second step, we fit the following GLMM to the response data using the standard complete-data method for fitting GLMMIf the covariate value is censored or missing or mismeasured at time , its value is imputed by the predicted value .

An obvious issue with the above simple two-step method is that the estimation uncertainty in the first step is ignored in the second step. The standard error of the parameter estimate may be underestimated, leading to misleading inference for the parameter . To fix this problem, we may use the bootstrap method to obtain more reliable standard errors of the parameters in the response model [11]. A parametric bootstrap method, which generates samples from the above fitted models, may be used to produce more reliable standard errors of the estimates. Still, the two-step method may not be efficient because covariate data and response data are not used simultaneously.

If the response data are survival data, the issues mentioned above for the two-step method remain. Moreover, in this case, the longitudinal covariate data may be truncated by the events such as death or dropouts. In this case, the two-step method may lead to biased estimation.

3.2. Joint Likelihood Method

A more desirable and formal method than the two-step method is to use the likelihood method based on the “joint likelihood" for both the response and covariates. Maximum likelihood estimates (MLEs) of all the unknown parameters in the two models may then be obtained simultaneously based on the joint likelihood for all observed data. If all assumed models and distributions hold, the MLEs are the most efficient estimates. Let be the collection of all unknown parameters in the response and covariate models, and let denote a generic density function. The joint log-likelihood for the observed data is given by where is a density function from the exponential family, , and is the censoring indicator for the covariates.

Evaluation of the intractable integration in the log-likelihood can be computationally challenging, especially when the dimension of the random effects is higher. By treating the random effects as “missing data," we may use the EM algorithm to find the MLEs. Let be the censoring components of the covariate vector . By treating as “missing data", Zhang et al. [7] proposed a Monte Carlo EM algorithm in which the E-step is implemented with a Gibbs sampler combined with rejection sampling methods. The Monte Carlo EM algorithm is still computationally intensive but is feasible. Alternatively, we may use computationally more efficient Laplace approximations or linearization methods to for approximate inference [11].

For the survival response models, the joint log-likelihood is given bywherewith the survival function defined as . Statistical inference can again be based on a Monte Carlo EM algorithm, although the computation can be more tedious due to the nonparametric baseline hazard in the Cox model.

4. Examples

In the following, we show two examples from an HIV/AIDS study. In the first example, we consider a Poisson generalized linear mixed model with censored covariates. In the second example, we consider a Cox survival model with censored covariates. In both examples, the time-dependent covariate is subject to left censoring and is modelled by a NLME model to address the censoring as well as missing data and measurement errors. The methods were implemented by Monte Carlo EM algorithms in R. R code is available upon request.

4.1. Generalized Linear Mixed Models with Censored Covariates

We consider an AIDS longitudinal dataset and study how viral load (VL) may relate to CD4 counts over time during an anti-HIV treatment. Viral loads usually have a lower detection limit so that viral load values below the limit cannot be observed, i.e., viral load may be left censored. Moreover, viral loads may be missing or measured with errors. As an illustration, we view CD4 count () as the response and VL as a time-dependent covariate (), and we model the longitudinal CD4 counts as a Poisson GLMM: where , ’s are random effects, and TR denotes a treatment indicator. Since VL may be left censored and may be measured with errors, we consider the following mechanistic NLME model which is justified biologically [9, 10]: where , ’s are random effects, and viral load values are -transformed. The random effects are assumed to follow multivariate normal distributions with mean 0 and unstructured covariance matrices. As a comparison, we also fit observed VL data based on an empirical LME model (ELM): The unknown parameters are estimated using a Monte Carlo EM algorithm as described in Zhang et al. [7].

Figure 1 shows the NLME and ELM models fit to the observed viral loads of two randomly selected subjects, where the times are rescaled to be in . It suggests different fitted curves from the two covariate models. In particular, the predicted lines based on the NLME model fit the uncensored viral loads quite well; and for the censored portion, the lines follow the mechanistic model and preserve an overall nonlinear trend. On the other hand, the empirical LME model renders noticeable deviation of the fitted lines from the uncensored viral loads and imposes a linear or quadratic curve for the censored viral loads. Such discrepancies between the covariate model fitting, particularly in the censored portion, induce different parameter estimates in the response model. Table 1 summarizes the parameter estimates of the response CD4 model, with the covariate VL being fitted based on the NLME and ELM models, respectively. As we can see, the results of the parameter estimates are different. For example, the estimate of , which measures the association between CD4 and VL, is significant at 5% level based on the NLME covariate model but is not significant based on the ELM covariate model. The results based on the NLME model should be more reliable since it provides more reliable predictions for the censored viral loads, since the NLME model may make better predictions for the unobserved censored values than the ELM method as the NLME model is based on the underlying data-generation mechanism which is the same for both observed and unobserved covariate values. The higher the percentage of censored/missing values is, the better the NLME model performs. This is confirmed by the simulation study in Zhang et al. [7].

4.2. Survival Models with Censored Covariates

As another example, we consider the foregoing dataset again, but now we focus on the occurrences of the first CD4:CD8 decline. The objective here is to determine if and how the time to the first CD4:CD8 decline may be related to treatment and viral load. We consider the following Cox survival model for the time to first CD4:CD8 decline:For this dataset, the Weibull distribution seems to provide a reasonable fit to the observed event times, so we consider the parametric Weibull distribution for the event times. For viral load, we use the same NLME and ELM models described in the first example.

Figure 2 shows, for two randomly selected subjects, the fitted lines to the observed viral loads based on the joint Cox survival model with the mechanistic NLME covariate model and empirical LME model (ELM), respectively, together with the corresponding estimated hazard functions and survival probability functions. We see that the mechanistic NLME model and the empirical LME model lead to different hazard and survival estimates. The NLME based joint model predicts monotonically increasing hazards, indicating the ever increasing risk of the event. On the other hand, the LME based model predicts more curved risk functions. Table 2 shows the results of the parameter estimates for the survival model. Here the differences seem relatively small, but as discussed, the predicted hazards and survival probabilities can be substantial. Since the NLME covariate model is derived based on reasonably biological justifications, they provide better “predictions" for censored (unobserved) viral loads and more reliable prediction for each individual’s hazard and survival probability than the ELM covariate model, based on similar reasons as that for Table 1, which is also confirmed by simulations in Zhang and Wu [8].

5. Discussion

The nonlinear mechanistic covariate models are very appealing to address censoring and missing data in covariates, since the “predicted values" based on such models are more reliable than the commonly used empirical covariate models. These nonlinear mechanistic models are widely used in modelling HIV viral dynamics, pharmacokinetics, growth or decay, and some other areas [12, 13]. However, in many cases, such mechanistic models may not be available. In this case, an alternative approach is to treat the censored values as “point mass" to avoid unverifiable distributional assumptions for the censored values. The advantages of the nonlinear mechanistic covariate models are more obvious when the percentage of the censored values is higher, as confirmed in Zhang et al. [7]. The limitations of the nonlinear mechanistic covariate models are as follows: (i) in many applications such mechanistic models may not be available and (ii) computation can be challenging, as discussed below.

Since the mechanistic covariate models are often nonlinear, computation is a main challenge in likelihood inference. Although Monte Carlo EM algorithms can almost always be used, they may offer potential problems such as very slow convergence or even nonconvergence. Moreover, the Monte Carlo EM algorithms usually need to be combined with Markov Chain Monte Carlo (MCMC) methods which are used to generate Monte Carlo samples in the E-step of the EM algorithms, making the computation even more challenging. When the dimensions of the random effects are high, we recommend approximate methods such as Laplace approximations and linearization methods as reviewed in Wu [11]. These approximate methods can be computationally much more efficient and provide reasonable approximations.

Data Availability

The dataset is available upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research is partially supported by Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant no. 22R80742.