Abstract

In any longitudinal study, dropout before the final timepoint can rarely be avoided. The dropout model is commonly assumed to be one of these types: Missing Completely at Random (MCAR), Missing at Random (MAR), Missing Not at Random (MNAR), and Shared Parameter (SP). In this paper we estimate the parameters of the longitudinal model for simulated data and real data using the Linear Mixed Effect (LME) method. We investigate the consequences of misspecifying the missingness mechanism by deriving the so-called least false values. These are the values to which the parameter estimates converge when the assumptions may be wrong. Knowledge of the least false values allows us to conduct a sensitivity analysis, which is illustrated. This method provides an alternative to a local misspecification sensitivity procedure, which has been developed for likelihood-based analysis. We compare the results obtained by the proposed method with those found using the local misspecification method. We apply the local misspecification and least false methods to estimate the bias and sensitivity of parameter estimates for a clinical trial example.

1. Introduction

Missing data are common in various settings, including surveys, clinical trials, and longitudinal studies. Methods for handling missing data strongly depend on the mechanism that generated the missing values as well as the distributional and modeling assumptions at various stages. This study focuses only on Missing at Random and Missing Not at Random dropout models, under a Linear Mixed Effect (LME) model.

Much of the literature on missing data problems assumes that the dropout model is MAR rather than MNAR, but this assumption is clearly limited [1]. The consequences of misspecifying the missingness mechanism are investigated by deriving the so-called least false values, which are the values to which the parameter estimates converge when the assumptions may be wrong. Theoretical least false values for the LME method are derived and illustrated under Missing at Random (MAR) and Missing Not at Random (MNAR) dropout. Throughout this study, MAR is the assumed (misspecified) dropout model.

Copas and Eguchi [2] gave a formula to estimate the bias under such misspecification using a likelihood approach. As the LME is a likelihood-based method, the estimates obtained through the Copas and Eguchi method can be compared with the LME least false estimates. The procedure will be applied by adding a tilt to the MAR dropout model to provide what Copas and Eguchi call local misspecification.

The local model uncertainty is elaborated as proposed by Copas and Eguchi [2] and illustrated both when model misspecification is present and when the data are incomplete. Furthermore, we find that the Copas and Eguchi method gives very similar results to the least false method. Misspecification is handled by assuming MAR when the truth is actually MNAR. Besides Copas and Eguchi [2], many other authors have developed methods to assess the sensitivity of inference under the MAR assumption [3, 4]. Moreover, Lin et al. [5] extended the Copas and Eguchi method and assumed a doubly misspecified model while having only single misspecification. There has also been interest in the Copas and Eguchi method from a Bayesian perspective [6–10]. Recently, a simulation-based sensitivity analysis was performed in [11].

In Section 2, the LME method is presented and we show how to calculate the least false values. A description of the Copas and Eguchi method is provided in Section 3.1, followed by an example in Section 3.2. A simulation study is described in Section 4. The Copas and Eguchi bias estimates are compared with the least false values derived from the LME method, and we then show the coverage of nominal confidence intervals. A sensitivity analysis is conducted to assess how inference can depend on missing data. In Section 5, the methods are applied to data from a clinical trial with two treatments and two measurement times as introduced and analysed by Matthews et al. [12]. We compare the results obtained by the proposed method with the results found by using the Copas and Eguchi method.

2. Linear Mixed Effect (LME) Method

A statistical model containing fixed effects and random effects is called a mixed effect model. These models have been shown to be effective in many disciplines in the biological, physical, and social sciences. Usually a linear form is assumed.

Reference [13] gave a definition of the response in the LME model which is of the form:

For example, a simplified version of the Laird and Ware [14] mixed model approach for longitudinal data would include a random effect in the intercept term in a model for responses. If $Y_{ij}$ is the response at time $j$ on subject $i$, the model is $Y_{ij} = \mu_{ij} + U_i + \epsilon_{ij}$, where $\mu_{ij}$ is the marginal mean, which will usually be a linear function of covariates, $\epsilon_{ij}$ is independent Gaussian noise, and $U_i$ is a realisation of a zero mean scalar Gaussian random variable. Since $U_i$ has zero mean, the marginal mean of $Y_{ij}$ remains $\mu_{ij}$ after integrating out $U_i$. However, since $U_i$ is common to all $j$, we get dependence between observations on the same subject. For example, if $U_i$ is positive, then all $Y_{ij}$ values would tend to be above the marginal mean and so on. In the context of longitudinal data, some reviews of linear mixed models can be found in [15, 16].
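The dependence induced by a shared random intercept can be checked numerically. The following is a minimal sketch, assuming two timepoints, zero marginal means, and hypothetical standard deviations `sigma_u` and `sigma_e` for the random intercept and the noise; these values are illustrative and are not taken from the paper. Under these assumptions the correlation between the two responses on a subject should be close to sigma_u^2 / (sigma_u^2 + sigma_e^2).

```python
import math
import random


def simulate_random_intercept(n_subjects, mu=(0.0, 0.0), sigma_u=1.0, sigma_e=1.0, seed=0):
    """Draw two responses per subject from y_j = mu_j + u + e_j (illustrative values)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_subjects):
        u = rng.gauss(0.0, sigma_u)  # subject-level random intercept, shared by both times
        y1 = mu[0] + u + rng.gauss(0.0, sigma_e)
        y2 = mu[1] + u + rng.gauss(0.0, sigma_e)
        data.append((y1, y2))
    return data


def correlation(pairs):
    """Sample Pearson correlation between the two columns of `pairs`."""
    n = len(pairs)
    m1 = sum(p[0] for p in pairs) / n
    m2 = sum(p[1] for p in pairs) / n
    s11 = sum((p[0] - m1) ** 2 for p in pairs)
    s22 = sum((p[1] - m2) ** 2 for p in pairs)
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in pairs)
    return s12 / math.sqrt(s11 * s22)


pairs = simulate_random_intercept(20000, sigma_u=1.0, sigma_e=1.0)
empirical = correlation(pairs)
theoretical = 1.0 ** 2 / (1.0 ** 2 + 1.0 ** 2)  # sigma_u^2 / (sigma_u^2 + sigma_e^2)
mean_y1 = sum(p[0] for p in pairs) / len(pairs)
print(round(empirical, 3), theoretical, round(mean_y1, 3))
```

The marginal mean of the first response stays near its assumed value of zero, while the two responses are clearly correlated, which is the behaviour described above.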

2.1. Assumptions

Suppose there are $n$ individuals in a study and each provides longitudinal responses $Y_i$ and dropout information $R_i$. Generally, we will assume a linear model for $Y_i$ (in the absence of dropout) and logistic models for the probability of continuing to the next timepoint given that a subject is still under observation at time $j$. At times, we refer to a true or generating model as the way in which data are obtained and to an assumed or fitting model as that chosen by the analyst for estimation.

For simplicity in this work, the study assumes that there are just two observations or treatment periods. The methods are of course more general.

At time 1, there is a measurement provided for all subjects, denoted by $Y_{i1}$ for subject $i$. Then at time 2, some subjects drop out before measurement. Let $R_i = 1$ indicate that there is a measurement at time 2 and $R_i = 0$ otherwise. Let $Y_i = (Y_{i1}, Y_{i2})^T$ and assume $E[Y_i] = X_i \beta$ where $\beta$ is a parameter vector of dimension $p$ and $X_i$ is the design matrix associated with subject $i$, which is of dimension $2 \times p$. The standard model assumes just one covariate and is

where , , , , , and , , and Let , and .

Returning to the general case, the influence of missing data depends on the missingness mechanism, that is, the probability model for missingness. Knowing the reason for the missingness is obviously helpful to handle missing data. There are four general missingness mechanisms as introduced by Little and Rubin [17] and Wu and Carroll [18]. They are Missing Completely at Random (MCAR), Missing at Random (MAR), Missing Not at Random (MNAR), and Shared Parameter (SP).

For simplicity in this investigation, the parameters are assumed to be common between timepoints. Let the dropout parameters be . The MAR dropout logistic model is then

Missingness is called Missing Not at Random if it depends on unrecorded information that predicts the missing values. For example, a patient who is dissatisfied with a particular treatment may be more likely to quit the study. If missingness is not at random, then some bias is expected in inferences.

Let the dropout parameters now be . The MNAR version for the two-timepoint example is the logistic model:
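The distinction between the two logistic dropout models can be sketched in code. The parameterisation below is an assumption made for illustration: the probability of remaining in the study is logistic in the time-1 response, and the MNAR version adds a hypothetical coefficient `psi` on the possibly unobserved time-2 response. The names `theta0`, `theta1`, and `psi` are not the paper's notation.

```python
import math
import random


def expit(v):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-v))


def prob_observed(y1, y2, theta0, theta1, psi=0.0):
    """P(R = 1 | y1, y2) under an assumed logistic model.

    psi = 0 gives a MAR model (depends on the observed y1 only);
    psi != 0 gives an MNAR model (depends on the possibly missing y2).
    """
    return expit(theta0 + theta1 * y1 + psi * y2)


def draw_indicator(y1, y2, theta0, theta1, psi, rng):
    """Draw R = 1 (observed at time 2) or R = 0 (dropped out)."""
    return 1 if rng.random() < prob_observed(y1, y2, theta0, theta1, psi) else 0


rng = random.Random(0)
# Under MAR the probability does not change with the time-2 response.
p_mar_low = prob_observed(0.3, -5.0, theta0=-0.5, theta1=0.2)
p_mar_high = prob_observed(0.3, 5.0, theta0=-0.5, theta1=0.2)
# Under MNAR with negative psi, a large y2 lowers the chance of staying.
p_mnar_low = prob_observed(0.3, -5.0, theta0=-0.5, theta1=0.2, psi=-0.1)
p_mnar_high = prob_observed(0.3, 5.0, theta0=-0.5, theta1=0.2, psi=-0.1)
r = draw_indicator(0.3, 5.0, -0.5, 0.2, -0.1, rng)
```

With `psi = 0` the two probabilities coincide, so an analysis that conditions on the observed time-1 response is not distorted by the unobserved value; with a non-zero `psi` it is.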

2.2. LME Least False

In this section, the Linear Mixed Effect (LME) method is investigated, which is based on a maximum likelihood estimating approach. The performance of the LME method under MAR and MNAR dropout is examined. Derivation and illustration of theoretical least false values are made. Assuming a Gaussian random intercept model, the score equation of current interest is [19]

where , is a design matrix associated with subject which is , and we will use as notation for the first row of ; thus , , and . We can rearrange the terms in (6) to be

These components are in detail

where , and

Also

Similarly for the right hand side of (8)

Finally

We assume independent and identically distributed responses, with finite variance for the covariate and error distributions, and dropout probabilities bounded away from both zero and one. On dividing all sums by n, the weak law of large numbers applies and we can replace the sums with expectations as follows:

In the left hand side of (14), there will be two parts. First

and second

Similarly, the right hand side is

Expressions for , , , , , , and have been obtained under different dropout models. For illustration, we show calculation of under MAR in the Supplementary Materials available at the journal website.

Finally, to find the least false values, we invert the matrix on the left hand side of (14) and multiply this inverse by the right hand side, which yields the array of least false values. In the following section, we present simulations showing how the LME method performs under the MAR and MNAR dropout models.
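The "invert the left hand side and multiply by the right hand side" step can be illustrated with a Monte Carlo stand-in for the expectations. This is a sketch under stated assumptions, not the paper's closed-form derivation: the mean is taken to be `beta0 + beta1 * x` at both timepoints, the variance components are treated as known, the covariate is standard normal, and the dropout model is an assumed logistic form with hypothetical coefficients `theta`. For a subject with both responses the random intercept covariance gives weight 2 / (sigma_e^2 + 2 sigma_u^2) on the mean of the two responses; for a dropout it gives weight 1 / (sigma_e^2 + sigma_u^2) on the first response.

```python
import math
import random


def expit(v):
    return 1.0 / (1.0 + math.exp(-v))


def least_false_mc(beta, sigma_u, sigma_e, theta, n=100000, seed=1):
    """Monte Carlo approximation to the limit of the ML estimator of beta.

    Assumed generating model (illustrative only):
      y_j = beta0 + beta1 * x + u + e_j,  j = 1, 2,
      P(stay) = expit(theta0 + theta1 * y1 + theta2 * y2).
    """
    rng = random.Random(seed)
    w_both = 2.0 / (sigma_e ** 2 + 2.0 * sigma_u ** 2)  # completer: both responses used
    w_one = 1.0 / (sigma_e ** 2 + sigma_u ** 2)         # dropout: only y1 is used
    a00 = a01 = a11 = b0 = b1 = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        u = rng.gauss(0.0, sigma_u)
        mu = beta[0] + beta[1] * x
        y1 = mu + u + rng.gauss(0.0, sigma_e)
        y2 = mu + u + rng.gauss(0.0, sigma_e)
        stays = rng.random() < expit(theta[0] + theta[1] * y1 + theta[2] * y2)
        if stays:
            w, resp = w_both, 0.5 * (y1 + y2)
        else:
            w, resp = w_one, y1
        # Accumulate the 2x2 "left hand side" matrix and the "right hand side" vector.
        a00 += w
        a01 += w * x
        a11 += w * x * x
        b0 += w * resp
        b1 += w * x * resp
    det = a00 * a11 - a01 * a01
    # Solve the 2x2 linear system: inverse of the matrix times the right hand side.
    return ((a11 * b0 - a01 * b1) / det, (a00 * b1 - a01 * b0) / det)


true_beta = (-1.0, -1.0)
mar_limit = least_false_mc(true_beta, 1.0, 1.0, theta=(0.5, 1.0, 0.0))   # depends on y1 only
mnar_limit = least_false_mc(true_beta, 1.0, 1.0, theta=(0.5, 0.0, 1.0))  # depends on y2
print(mar_limit, mnar_limit)
```

In this sketch the MAR limit stays close to the generating values, while the MNAR limit for the intercept moves away from them, which is the behaviour the least false values are meant to quantify.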

2.3. Numerical Investigation

A scalar variable is generated, and then the longitudinal means are generated =, =. This was followed by from a bivariate normal distribution with mean . Missingness was generated from (4) and (5) for the MAR and MNAR models, respectively. In all of the following simulations, unless it is stated otherwise, the parameters =, , were followed. In the following, we show the effect of dropout on the limiting values and .

As LME provides consistent estimates under MAR, the least false values and are not affected by changing the dropout probabilities under MAR. Therefore, only MNAR configurations were considered. From a contour plot of under MNAR (Figure 1), in order to minimise the bias in , should be chosen to be around zero. For negative , the dropout is associated with large , so and both tend to be low if dropout does not occur. Hence is lower than it should be. The opposite happens for a positive .

Figure 2 shows a contour plot of under MNAR. Here, negative bias is obtained as moves away from zero in either direction. Such an attenuation of regression effect is common when there are errors in variables [20]. It seems that a similar effect is obtained here.

Having obtained least false values, we propose their use in sensitivity analyses. Before doing so, a sensitivity procedure is investigated for local misspecification as proposed by Copas and Eguchi [2].

3. The Effect of Local Misspecification of the Dropout Model When Using Likelihood-Based Methods under the MAR Assumption

In the previous section, we investigated the consequences of misspecifying the missingness mechanism by deriving the so-called least false values, which are the values the parameter estimates converge to when the assumptions may be wrong.

As an alternative, Copas and Eguchi [2] give a formula to estimate the bias under such misspecification using a likelihood approach. As the LME is a likelihood-based method, we can compare the Copas and Eguchi method with the LME least false estimates. The procedure will be applied by adding a tilt to the MAR dropout model to provide what Copas and Eguchi [2] call local misspecification.

3.1. Description of Copas and Eguchi Method

We use the notation of Copas and Eguchi [2], denoting by complete data and by incomplete data. There are two types of model: the true model and the assumed model. The true model is also called the generating model and it means how the data are actually generated or simulated. On the other hand, the assumed model or what is also known as the fitting model is what we fit to data. The true model for complete data is denoted by and the corresponding true model for incomplete data is which can be derived from . Here is a generic (vector) parameter. The assumed or working model is a parametric model which gives the distribution of , and its marginal density is .

Thus

where the notation means integration over all missing values in that are consistent with the observed .

A method is provided to approximate the bias in the estimation of the parameters of the misspecified model following Copas and Eguchi [2]. We consider MAR as the working model and MNAR as the true model. Thus, the misspecification is caused by assuming MAR but the truth is MNAR.

Suppose there is a random sample of observations, and the true model is given by , which is defined by in Copas and Eguchi [2] as a tilt model:

Thus, the misspecification is determined by the quantity . In this, , which is assumed to be small, measures the size of misspecification while determines its direction. We assume has zero mean and unit variance under the working model . The misspecification is local because is small. Hence, is close to and can be written as

Now if the model actually used to fit the data is , then the limiting value of the MLE as $n \to \infty$ is given by equation in Copas and Eguchi [2]

where and are the score and information matrix for the model , respectively.

However, will be considered as the working model for the marginal data. Copas and Eguchi [2] show that if (20) is true and is small, then a similar approximation holds for the marginal data , i.e.,

where again has zero mean and unit variance. In this case according to in Copas and Eguchi [2] the limiting value is

where and are the score and information matrix for the model , respectively. To calculate the bias, we need the score, the information matrix, and the tilt. The next section shows how to calculate these quantities under MAR and MNAR in our two-timepoint setting.
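The first-order character of the local misspecification approximation can be checked in a toy setting. The sketch below is not the two-timepoint model of this paper: it assumes a working model N(theta, 1), a true density proportional to the working density times exp(eps * u), and a hypothetical direction `u` built from the first two Hermite polynomials so that it has zero mean and unit variance under the working model. The first-order limit is taken to be theta plus eps times the inverse information times the expectation of score times tilt; in this toy model the information is 1 and that expectation equals the coefficient `a`.

```python
import math


def normal_pdf(z, theta):
    return math.exp(-0.5 * (z - theta) ** 2) / math.sqrt(2.0 * math.pi)


def tilt_direction(z, theta, a, b):
    """u(z) with zero mean and unit variance under N(theta, 1) when a^2 + b^2 = 1."""
    r = z - theta
    return a * r + b * (r * r - 1.0) / math.sqrt(2.0)


def exact_limit(theta, eps, a, b, lo=-12.0, hi=12.0, steps=24000):
    """Mean of the tilted density, i.e. the limit of the sample mean (the MLE of theta)."""
    h = (hi - lo) / steps
    num = den = 0.0
    for k in range(steps + 1):
        z = theta + lo + k * h
        g = normal_pdf(z, theta) * math.exp(eps * tilt_direction(z, theta, a, b))
        num += z * g
        den += g
    return num / den


def first_order_limit(theta, eps, a):
    """theta + eps * I^{-1} * E_f[s u]; here I = 1, s(z) = z - theta, and E_f[s u] = a."""
    return theta + eps * a


theta, eps = 0.3, 0.05
a = b = 1.0 / math.sqrt(2.0)
exact = exact_limit(theta, eps, a, b)
approx = first_order_limit(theta, eps, a)
print(exact, approx)
```

For a small `eps` the two values agree up to a term of order `eps` squared, which is what "local" misspecification means here; the gap grows as `eps` increases.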

3.2. Copas and Eguchi Method for Two-Timepoint Example

As shown in (24), the bias consists of the score, the information matrix, and the tilt. To calculate these components, we first define the likelihood model in use. Under MAR, either of the following equivalent formulations can be selected:

The conditional distribution of given is needed repeatedly in this section. Hence, for simplicity, we use to denote this quantity. Since is bivariate normal in this assumed model, where and . Also, the complete data is and incomplete data is where

Therefore, at , =, but will differ from at .

In addition, the models are defined as , , , and . MAR is assumed as the working model or misspecified model. Under MAR, there is ; then from (25) the working model for complete data by assuming is

Similarly, from (26) the working model for incomplete data by assuming is

Under MNAR, if there is complete data, then we will always set . Thus, from (25), the true model for complete data isThe true model for incomplete data on the other hand is the marginal density: Note that the integral is over the missing values . Referring to (27), the missing values are undefined in case that .

This means that in order to use Copas and Eguchi’s ideas, we should convert the specific in (31) into the general form of (23). To do this, we will redefine the MNAR model in tilt form:

Here and . For small this is a good approximation to the logistic MNAR model.

Calculation of the terms needed for the bias expression (24) is now possible and follows directly. Details are in the Supplementary Materials available at the journal website.

4. Simulation Study

We use the same simulation setup as before. The limiting values and are compared using different methods under MAR and MNAR dropout models. Next, the local model uncertainty is elaborated as proposed by Copas and Eguchi [2], and we illustrate how to apply it both when model misspecification is present and when the data are incomplete. We find that the Copas and Eguchi [2] method gives very similar results to the least false method. Misspecification is handled by assuming MAR when the truth is actually MNAR.

4.1. Comparing the Copas and Eguchi Method with LME Least False Results

In this section, we examine how the parameter estimates are affected when a MAR model is fitted to data that are MNAR, and compare them with the values that the Copas and Eguchi method predicts.

The sample size is 10000, and 10 simulations are used. We used large samples here, as our first task is to check the accuracy of the large-sample approximations underpinning the least false values. The aim is to show the variation in treatment effect estimates as varies. A grid of from -0.2 to 0.2 is selected. Figure 3 is produced when =(-2,-2,-1,-1) and =(-0.5,0), which gives dropout rate around 40%. Here the blue lines (dotted lines) are simulation estimates using maximum likelihood, the red lines (solid lines) are Copas and Eguchi estimates, and the light blue lines are the LME least false estimates. These show that the least false, simulations, and Copas and Eguchi [2] results all match well. Therefore, we can use the least false results for bias correction as an alternative to Copas and Eguchi.

4.2. CI Coverage for the Estimated and

The Copas and Eguchi and LME least false values show how estimates are biased by assuming MAR when the data are MNAR. The misspecification parameter is , with =0 meaning no misspecification. If the value of were known, the parameter estimates could be adjusted to take the misspecification into account. This idea is illustrated in this section.

For a range of true (generating) , 1000 samples are simulated, each of size 1000. This is a realistic number for applications. In each case, and are estimated using maximum likelihood under a MAR assumption. Afterwards, the estimates are adjusted using either the estimated Copas and Eguchi bias or the bias arising through least false calculations, in both cases taking an assumed . Coverage of the resulting nominal 95% confidence intervals is then recorded. The estimated confidence interval width is not adjusted, just its location.
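The effect of shifting the location of an interval while leaving its width unchanged can be sketched as follows. This is a simplified stand-in for the simulation described above: instead of fitting a model, each replicate draws an estimate from a normal distribution with a known bias and standard error. All numerical values (`true_value`, `bias`, `se`) are hypothetical.

```python
import random


def coverage(true_value, bias, se, assumed_bias, n_rep=20000, z=1.96, seed=2):
    """Share of nominal 95% intervals that contain true_value.

    Each interval is centred at (estimate - assumed_bias); its half-width z * se
    is left unchanged, so only the location is adjusted.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rep):
        estimate = rng.gauss(true_value + bias, se)  # biased estimator
        centre = estimate - assumed_bias             # location adjustment only
        if centre - z * se <= true_value <= centre + z * se:
            hits += 1
    return hits / n_rep


correct_adjustment = coverage(-1.0, bias=0.10, se=0.05, assumed_bias=0.10)
no_adjustment = coverage(-1.0, bias=0.10, se=0.05, assumed_bias=0.0)
print(correct_adjustment, no_adjustment)
```

When the assumed bias matches the actual bias, coverage returns to roughly the nominal level; with no adjustment and a bias of two standard errors, coverage falls to about one half.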

Tables 1 and 2 give the results. Here we use for the true , and denotes the assumed value used in adjusting the estimates. Also, () are used for the Copas and Eguchi adjustment method and () for the least false adjustment method.

In Table 1, the assumed is zero, meaning no correction. Results at the correct value of =0 are good. Otherwise, the CI for goes badly wrong. Note that there is no correction here, so the Copas and Eguchi and least false results should be the same. Small differences are just because of the different calculations that are involved. For example, the least false calculation needs an estimate of but the Copas and Eguchi one does not. The CI coverage is noted for which is not too much affected at any true in the range (-0.1,+0.1). For example, at =-0.1, the CI coverage for is about 95%, whereas there is undercoverage for when deviates from zero. For example at =-0.1, the CI coverage for is about 85%. This indicates that is less sensitive to the misspecification than in this scenario.

In Table 2, the assumed value is taken to be =-0.1, which means that dropout is associated with high . Note that, in contrast to Table 1, there is correction here, so the Copas and Eguchi and least false results will not be the same; for example, at =+0.1, the CI coverage for is about 62.7%, but the CI coverage for is about 57.8%. However, both estimates and have undercoverage as goes further from the assumed value -0.1.

4.3. Sensitivity Analysis

Of course, in practice is not known. For any given data set, a sensible sensitivity procedure would be to plot bias-corrected estimates and confidence intervals for a range of assumed values. Here, a grid of assumed from -0.2 to 0.2 is used. We will show that, for each limiting value calculated by the Copas and Eguchi method, the simulated values are within noise of the theoretical values for large sample sizes (n=10000). The noise is estimated from the simulations; that is, a confidence interval is obtained from the simulations, which gives reassurance that it contains the population values. We first consider the case in which MAR is correct, and then the case in which MAR is assumed although the truth is MNAR.
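A sensitivity procedure of this kind amounts to a loop over the assumed misspecification values. The sketch below assumes a hypothetical bias function `toy_bias` that is linear in the assumed value; in an actual analysis this function would be replaced by the least false or Copas and Eguchi bias calculation. The estimate and standard error are illustrative numbers, not results from the paper.

```python
def sensitivity_curve(estimate, se, bias_fn, grid, z=1.96):
    """Bias-corrected estimate and CI limits for each assumed misspecification value.

    The interval width is kept at its unadjusted value; only the location moves.
    """
    rows = []
    for assumed in grid:
        adjusted = estimate - bias_fn(assumed)
        rows.append((assumed, adjusted, adjusted - z * se, adjusted + z * se))
    return rows


def toy_bias(assumed):
    """Hypothetical bias as a function of the assumed value (illustration only)."""
    return 0.8 * assumed


grid = [round(-0.2 + 0.05 * k, 2) for k in range(9)]  # -0.2, -0.15, ..., 0.2
rows = sensitivity_curve(estimate=-1.05, se=0.04, bias_fn=toy_bias, grid=grid)
for assumed, adjusted, lower, upper in rows:
    print(f"{assumed:+.2f}  {adjusted:.3f}  ({lower:.3f}, {upper:.3f})")
```

Plotting the second to fourth columns against the first gives the type of display used for the sensitivity figures: at an assumed value of zero the adjusted and unadjusted estimates coincide.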

Figure 4 illustrates the case when MAR is the correct model () and the unadjusted confidence intervals (red lines) include the true parameter values (=-1 and =-1), as in this case so do the adjusted ones (blue lines). The horizontal lines are at the true values. We note that decreases as increases whereas increases as increases. Note that has a wider CI than .

Figure 5 has the true =0.1, so the study has fitted MAR to data that are really MNAR. The lines cross at =0 because the same MAR model is fitted. The important point is that better estimates of the true ’s are obtained at the correct . Also, as mentioned in Figure 4, has wider CI than .

Note that, both under MAR and MNAR, and have opposite trends; decreases as increases whereas increases as increases.

5. Application: Sensitivity Analysis for Clinical Trial

In this section, the method is illustrated using a real data example. The data come from a clinical trial with two treatments and two measurement times as introduced and analysed by Matthews et al. [12]. The covariates are only treatment type and time. The parameter vector is , ignoring any time interaction. There are 422 subjects, assigned to either treatment A or B. Treatment A is associated with treatment effect =1 and treatment B is when =0. Then, at time 2, the mean of the group receiving treatment B is and the mean of the group receiving treatment A is +. At time 1, all subjects provided a response, but 24.4% dropped out by time 2. There are 212 subjects receiving treatment A, but only 126 provided a response at time 2 and the other 86 dropped out. Hence the missingness percentage for treatment A is about 40%. The dropout reason is not known. For treatment B, there are 210 subjects, of which 193 continued to time 2 and 17 did not, giving around 8% missingness.
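The dropout percentages quoted above follow directly from the subject counts. A short check, using only the counts stated in the text (the dictionary layout and function name are illustrative):

```python
counts = {
    "A": {"randomised": 212, "observed_t2": 126},
    "B": {"randomised": 210, "observed_t2": 193},
}


def dropout_summary(counts):
    """Number and percentage of dropouts per arm and overall."""
    summary = {}
    total_n = total_drop = 0
    for arm, c in counts.items():
        dropped = c["randomised"] - c["observed_t2"]
        summary[arm] = (dropped, round(100.0 * dropped / c["randomised"], 1))
        total_n += c["randomised"]
        total_drop += dropped
    summary["overall"] = (total_drop, round(100.0 * total_drop / total_n, 1))
    return summary


summary = dropout_summary(counts)
print(summary)
```

The strong imbalance between the two arms is what makes the choice of dropout model matter for this data set.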

A sensitivity analysis approach (over a grid of ) using the Copas and Eguchi and LME methods is shown in Figure 6. The blue lines use the Copas and Eguchi method and the red lines use the least false method. The idea is to adjust the estimate to compensate for bias from a misspecified MAR fit. For example, if the least false value shows that the MAR fit underestimates a parameter, the difference is added back to the estimate. Dashes are the CIs, based on the MAR standard errors. The first plot shows confidence intervals for the treatment B mean as the assumed value of changes. The horizontal line is the estimate under MAR. The second plot shows the confidence intervals for the mean of treatment A. The third plot is the difference in means between treatments A and B, which yields the treatment effect. In the first plot, the horizontal line is at -0.74, which is the LME estimate for . Again, the LME estimate for is about -0.40 in Figure 2. Also, note that + equals -1.15. This supports the finding here and shows better results.

The first thing to note is how close the least false and Copas and Eguchi estimates are. There is almost no difference over this range of . We take from -1.5 to +1.5. The value of under MAR is -1.66, meaning the range of allows to have the same order of effect as . Clearly at large values of , there is concern that the misspecification is not local, which is the assumption of Copas and Eguchi. However, the least false results apply to any misspecification, not necessarily local, and the fact that Copas and Eguchi estimate is so close to the least false one suggests that it can work well even under quite large misspecification.

When is negative, the estimates get adjusted upwards, and the opposite is true for positive . This makes sense: at negative , large values have low probability of staying in the trial. Hence the observed means are lower than they would be in the hypothetical no-dropout situation, so we adjust upwards.

The estimates seem to be affected more at the positive than at the negative one. At the very largest shown, there would be a significant change in the value of the estimated true mean. However, there is very little effect of misspecification on the difference between means (third subplot), as the adjustments essentially cancel.

6. Conclusion

We considered the Linear Mixed Effect models (maximum likelihood method) for handling missing data. Then, by deriving the so-called least false values, we investigated the consequences of misspecifying the missingness mechanism. The closed form expressions were given to calculate the least false values and . The knowledge of these least false values allowed us to conduct sensitivity analysis, which was illustrated for the LME method.

Copas and Eguchi [2] gave a formula to estimate the bias under the misspecification. We derived and explored the Copas and Eguchi approximation for the bias arising from misspecification of the working model. The results found using the Copas and Eguchi method were compared with those obtained by the proposed method. We also applied the Copas and Eguchi method to estimate the bias for the real data example.

Moreover, we explained how to use a sensitivity analysis to see how the methods work under a range of . We found that the Copas and Eguchi method and LME least false match very well. Both gave very close results over the grid of considered. This suggests that the least false method can provide a credible alternative to Copas and Eguchi in sensitivity analysis. In fact, it might be preferred since there is no assumption of local misspecification. Finally, we illustrated the results using example data from a clinical trial with two measurement times.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This project was supported by King Saud University, Deanship of Scientific Research, College of Science Research Center.

Supplementary Materials

Supplementary material is available online at the journal website. There are two sections. One is for the least false estimate, which shows the calculations for E[R] in (18). The other section illustrates the Copas and Eguchi bias. It shows the calculation of the terms needed for the bias expression (24).