Abstract

Background. In many studies with longitudinal data, time-dependent covariates can only be measured intermittently (not at all observation times), and this presents difficulties for standard statistical analyses. This situation is common in medical studies, and methods that deal with this challenge would be useful. Methods. In this study, we performed the seemingly unrelated regression (SUR) based models, with respect to each observation time in longitudinal data with intermittently observed time-dependent covariates and further compared these models with mixed-effect regression models (MRMs) under three classic imputation procedures. Simulation studies were performed to compare the sample size properties of the estimated coefficients for different modeling choices. Results. In general, the proposed models in the presence of intermittently observed time-dependent covariates showed a good performance. However, when we considered only the observed values of the covariate without any imputations, the resulted biases were greater. The performances of the proposed SUR-based models in comparison with MRM using classic imputation methods were nearly similar with approximately equal amounts of bias and MSE. Conclusion. The simulation study suggests that the SUR-based models work as efficiently as MRM in the case of intermittently observed time-dependent covariates. Thus, it can be used as an alternative to MRM.

1. Introduction

Time-dependent covariates are common in longitudinal studies, and they may be missing thoroughly for some observations times, along with the outcome variable. This is a special kind of missing covariate which is called sporadic or intermittent missing (i.e., subjects with missing observations between observed time points) [15]. Sometimes, these intermittent missing observations may occur due to the design of the study, since some of the time-dependent covariates are costly or may not be necessary to be measured at all time points, especially in medical studies [69]. For example, in determining the prognosis of hospitalized patients, it is necessary to measure some blood profiles at different time points because the costs of measurements are high or frequent measurements may cause harm to the patients. Moreover, in the case of hemodialysis patients, during the dialysis session, some factors such as blood pressure or levels of magnesium may be measured at frequent intervals for each patient, but others such as blood urea nitrogen (BUN) may be measured only at the beginning and the end of the study.

Most of the existing standard statistical models typically require covariates to be completely observed. However, this assumption is not realistic in the presence of intermittently observed time-dependent covariates [10]. Lack of simultaneous measurement of time-dependent covariates leads to unobserved covariates and unbalances the design matrix. In such situations, the covariate is considered to be missing, but in actuality, it is simply not measured.

For analyzing incomplete and unbalanced longitudinal data with numerical response variable, there are several reasonable methods including mixed-effect regression model (MRMs). MRM has several desirable features that make it useful in longitudinal research such as including the subjects with missing covariate across time in the analysis. While MRM does allow incomplete data across time, data must be complete within the given time point (in terms of both the dependent variable and covariates) for that time point to be included in the analysis [11]. To handle this problem of MRM, several naive simple approaches have been developed including (1) complete case analysis (CC) or list-wise deletion, (2) using baseline observation carried forward (BOCF), and (3) using last observation carried forward (LOCF) [1216]. There are some studies that show the application of these naive methods such as that conducted by Roy and Lin [10] about their resulting asymptotic bias while we choose them for dealing with missing covariates. However, little is known about their performance in spite of the fact that they are a commonly used ways of imputation for data with missing time-dependent covariates.

In addition, most of classical analytic techniques for intermittent observation of time-dependent covariates have centred on models using an individual formulation of multivariate linear models. Examples are the random-effect model proposed by Jennrich and Schluchter [17] or the proposed models of Roy and Lin [10, 12]. However, alternate methods can be expressed with respect to each observation time. This formulation introduces several correlated models with some similar and different parameters simultaneously, as in seemingly unrelated regression (SUR) models of Zellner [18]. There are some proposals based on this method for unbalanced and incomplete longitudinal data like the proposed models of Park and Woolson [19]. Although the application of their model included time-dependent covariates, they simply assumed that if response variable for th subject at time is observed, then the covariate vector of th subject at time is always observed. In their study, missing observations are assumed to be missing at random. However, this condition is not possible in presence of intermittent observation of time-dependent covariates.

In the present study, we intend to view the SUR model in the context of longitudinal data and to examine its performance in presence of intermittently observed time-dependent covariates which is a common situation in medical studies. For this purpose, we considered the situation where the outcome variable was observed completely, but the time-dependent covariates could be only measured at some of the observation times for all cases. Moreover, we considered continuous outcome and time-dependent covariate. We first performed the SUR-based model for longitudinal data with time-dependent covariates measured at the same time points, and, further, the performance of this model as compared with MRM was evaluated. We next proposed some naive approaches based on SUR model for longitudinal data with intermittently observed time-dependent covariates. Furthermore, the capabilities of the proposed models were compared to MRM under classic approaches for missing covariate including complete case analysis, BOCF, and LOCF.

2. Methods

2.1. Mixed-Effect Regression Model

Suppose that there are subjects and time points for which data are collected. Let be a response variable and be a covariate vector of the th subject () observed at time (), where represents the th covariate () for given and . When the outcome is continuous, the use of MRM for this kind of data is common. A basic characteristic of these models is the inclusion of random subject effects into regression models in order to take into account the effect of subject on their repeated observations. These random subject effects, thus, describe each subject’s trend across time and explain the correlational structure of longitudinal data. The model is written as where some coefficients including and may be random [20].

As we mentioned earlier, while MRM can handle the dropout, data must be complete within the given time point (in terms of both the dependent variable and covariates) for that time point to be included in the analysis. However, there are three naive approaches for handling missing in covariate which enables one to fit MRM models that assume that covariates are completely observed. These naive approaches are baseline measure throughout, carrying forward the last observation, and list-wise deletion (complete case analysis).

In list-wise deletion method, cases with missing data were simply deleted from the data sets. This is the easiest approach in handling the missing covariates. In the other simple naive approach, baseline approach or baseline observation carried forward (BOCF) unobserved time-dependent covariates were replaced by baseline measure of those covariates. In this method, we assume that the time-dependent covariates have not changed since baseline. And finally, LOCF approach which simply fills the missing covariates data by carrying forward the last observed value and assuming that the values at the time of dropout are the same as those of the previous values. The question of interest is how much bias would arise by doing so.

2.2. Seemingly Unrelated Regression Model

Model (1) can be presented through a system of related equations by considering a set of linear equations for each time points. For simplicity, we assumed complete observation of outcome variables at all time points.

For expression of the model at each time point, let be an observed response vector and be an known vector at time . If we assume that we observe the same number of variables for each time point, then the model in the population is defined as Or, generally

This model (2) or (3) is based on Zellner's seemingly unrelated regression model. We devised a specific model for each time point, and this formulation introduced a system of correlated equations with different covariates at each observation time, some of which simultaneously could have time-invariant predictor variables such as sex. Nevertheless, correlation across the errors in different equations can provide links that can be exploited in estimation.

As presented in models (2) or (3), some restrictions were placed on the coefficients for each covariate to be the same for all time points. By considering these restrictions, we can have a unique effect for each covariate as in MRM. The assumptions that we make about is that it is uncorrelated with the explanatory variables in all equations. In other words, we assumed that and are orthogonal in the conditional mean sense [21].

In model (2), we assumed that the same set of variables is observed for each time point which is usually an unreal assumption especially in medical studies because of intermittent observation of time-dependent covariates. Thus, some alternative approaches to handle sporadic or intermittent covariates are introduced.

Consider that there are subjects and time points for which data on 3 variables including continuous outcome variable , a continuous time-dependent covariate , and a time-invariant covariate were collected. For simplicity, consider that an outcome variable is measured completely at all time points but the time-dependent covariate is not measured, for example, at time . We also assumed complete data on variable . Our proposed methods for this situation based on model (2) are presented in three models as follows. The easiest form is where represents the subject and represents the time index. This is similar to complete case analysis of MRM, in which only the observed data without any imputations are used. However, there is a substantial difference; SUR model (4) uses all the available data, but MRM in complete case analysis deletes the observed when the value of the covariate at that time point is missing.

Secondly, the application of BOCF to SUR can be exemplarily described as Its difference from model (4) is in placing instead of unobserved covariate .

And finally, the third approach is which is interpreted as the application of LOCF to SUR model. The difference between models (5) and (6) is the difference between and as the input in the right-hand side of .

Similar to model (2), we assumed in all models (4), (5), and (6) that is uncorrelated with the explanatory variables in all equations (i.e., , for ) [21]. We considered the same coefficients for effect of and at all equations as and , respectively. Furthermore, the parameter simply denotes a different aggregate time effect in each time point.

Since the properties of SUR model in the presence of intermittently observed time-dependent covariates had not been evaluated in the previous studies, we investigated the behaviour of this model under various conditions by means of simulation studies and also compared its performance with MRM under three classic imputation methods, including complete case analysis, BOCF, and LOCF.

2.3. Simulation Study

We studied the performances of the previously mentioned models, in terms of the amount of bias and MSE of the estimators, by simulation in SAS statistical package V 9.1 (SAS Inst., Cary, USA). The bias is the deviation in an estimate from the true quantity, which can indicate the performance of the methods being assessed, and the MSE provides a useful measure of the overall accuracy, as it incorporates both measures of bias and variability [22].

For simulation design, we used data from a psychiatric study by Hedeker et al. [23] which is extensively used in many methodological studies to examine the validity of the proposed models [24]. The objective of the study was to describe the longitudinal relationship between imipramine (IMI) and desipramine (DMI) plasma levels and clinical response in 66 depressed inpatients. Following a baseline placebo period of 1 week, patients received 225 mg/day doses of IMI for 4 weeks of the study. Additionally, the patients were rated with the Hamilton Depression Rating Scale (HDRS) [23] twice during the baseline placebo week (at the start, and end of this week) as well as at the end of each of the four treatment weeks of the study. These HDRS scores represent the dependent variables measured across time. Higher scores on HDRS represent higher levels of depression. Blood samples, drawn weekly for 4 weeks, 15 hours after the last drug intake of each patient, were assayed for IMI and DMI concentrations; these will be treated as time-dependent covariates.

The sex and baseline depression were recorded for each patients. Diagnosis of the baseline depression classified as two types. The first type, nonendogenous or reactive depression, is associated with some tragic life events such as bereavement, whereas the second type, endogenous depression, is not a result of any specific event and appears to occur spontaneously.

A log transformation is used for IMI and DMI, since an inspection of the data indicated that the magnitude of these measurements varied greatly between individuals (from 4 to 312 mg/L for IMI and from 1 to 740 mg/L for DMI).

Although the total number of subjects in this study was 66, we only considered 52 subjects that had all the measures at the end of four weeks, and it was coded as 0, 1, 2, and 3 for these respective time points. Because the purpose of the present study is to examine the performance of the proposed models for intermittently observed time-dependent covariates, other kinds of missing data such as attrition or missing in response variable were omitted. We applied different mixed-effect regression model for these data, but the best model (it has the smallest Akaike information criterion (AIC)) was random coefficient model.

We generated complete data with different sample sizes (, 60, 120, and 240) from the random coefficient model Here and are subject-specific random effects for intercept and slope, respectively, which are from a normal distribution with a mean of 0 and variance of and , respectively. is the random error from a normal distribution with a mean of 0 and a variance of .

The regression parameters , , , , , and are the fixed effects for intercept, week, sex, Diag, IMI, and DMI, respectively. The estimated parameters of model (7) based on restricted maximum likelihood estimation (REML) method for 52 subjects are obtained as respectively. For this simulation study, we assumed both a continuous response and continuous time-dependent covariates. In order to simulate the trajectories of , it was necessary to simulate the trajectories of longitudinal covariates and . We generated two time-dependent covariates of interest ( and ) from linear regression against week with a random intercept, where , , , and are the fixed intercepts and slopes, respectively; , , , and are from normal distribution with a mean of 0 and a variance of , , and , respectively. We set for generating trajectories of IMI and DMI (based on REML estimated parameters of models (9)). Using these estimates and estimates from model (7), 500 simulations are carried out to generate complete data with different sample sizes.

The intermittent observations of time-dependent covariates are determined as follow. For simplicity, we assumed that all the subjects had complete data for their response variables and also for their time-dependent covariates at baseline and at the end. However, our interested time-dependent covariates, and , might not be measured at weeks 1 or 2 or both time points for each covariate. By considering these kinds of missing data points and , we had 15 different scenarios for intermittent observation of time-dependent covariates. Table 1 presents these scenarios.

We based all the simulations reported on 500 replications as before. Each simulated data set with intermittently observed covariates was analyzed using different methods including MRM with complete case analysis, baseline value throughout, and last observation carried forward and also SUR models (4), (5), and (6). We indicated these methods by , , , , , and , respectively. Additionally, the performances of these methods were compared for all scenarios of intermittent observation of time-dependent covariates IMI and DMI. Moreover, we compared the performance of the proposed model with MRM for complete simulated data (simulated data sets with completely observation of time-dependent covariates and response variable).

The relative bias and mean square error of both the proposed models and MRM were computed in each step by and, respectively. The estimation methods for the mixed model and the proposed SUR model were restricted maximum likelihood estimation (REML) and iterated seemingly unrelated regression (ITSUR), respectively. The ITSUR method uses information about contemporaneous correlation among error terms across equations in an attempt to improve the efficiency of parameter estimates [25]. It is drawn from the iteration of SUR method of estimation by recomputing the estimate of the cross-equation covariance matrix from the SUR residuals and then computing new SUR estimates based on this updated covariance matrix estimate. Continuing this iteration until convergence produces ITSUR estimates.

3. Results

Tables 2, 3, 4, and 5 show the results from the simulation study. Table 2 presents the findings based on complete observation of time-dependent covariate, and Tables 3 to 5 show the findings based on three situations of intermittent observation of IMI and DMI from Table 1. The mean of estimated coefficients of interest, the amount of bias, and the mean square errors of these estimated coefficients based on different sample sizes of 20, 60, 120, and 240 quantify these findings. In addition, we calculate and report the naive approaches of handling intermittently observed time-dependent covariates IMI and DMI in Tables 3 to 5. The results from other situations of Table 1 were similar (not shown).

In general, when the data are complete, the estimated parameters through the proposed model are similar to those of MRM for small, medium, and large sample sizes; also, it has little bias (less than 0.13) and MSE for all estimators of interest. Moreover, the differences between the two models become negligible as the sample size increases. The amount of bias of the proposed model for large sample sizes (120 or 240) is less than 0.04.

Table 3 presents the findings of MRM and the SUR model for the situation 1 for which we have complete observed values for HDRS and IMI, but DMI was measured only at baseline (0), 1, and 3 weeks.

Table 4 presents the coefficients, bias, and MSE of the estimators from both the proposed model and MRM for the situation 4. All values for HDRS and DMI were available for this situation, but IMI was measured only at baseline (0), 2 and 3 weeks.

Table 5 presents the performance of the proposed model compared to naive approaches of handling intermittent observed time-dependent covariates IMI and DMI for situation 9 for which we have measured HDRS at all weeks; however, DMI and IMI were measured only at baseline (0) and 3 weeks.

Generally, the results for the parameters of interest are almost the same when we fit MRM and SUR models with BOCF and LOCF imputation methods, respectively. However, the best performance of these models in the simulated data for all sample sizes was fitted models based on LOCF imputation method.

The proposed BOCF and LOCF based on SUR model yielded nearly unbiased estimates of parameters (the amount of bias less than 0.15 in all cases). The LOCF approach in SUR model outperforms CC in every instance. However, the performance of LOCF and BOCF based methods for the SUR model was similar because, in our simulated data, there is a linear relationship between IMI and DMI through the weeks of the study.

As shown in Tables 3, 4, and 5, the proposed SUR model by last observed measure of the covariate carried forward in the model yielded parameters with low bias even in situation 9 for which we observed IMI and DMI only at two time points. The amount of bias for SUR model based on LOCF approach was 0.07 or less for effects of IMI and DMI. Not surprisingly, parameters from the baseline approach are nearly as unbiased as the LOCF approaches; the amount of bias is less than 0.13.

In MRM model, three naive approaches were used, and their performances were compared to each other and also to the proposed SUR model. MRM based on the complete case analysis led to bias of less than 0.2 for estimated parameters in each scenario. Complete case analysis based on SUR model in some of the situations performed unstable for parameters, with bias ranging from 0.01 to 0.93. This can occur because when we ignored some independent variables in the regression models, it decreased the determination coefficient of the model. However, the amount of bias that is considered troublesome has varied from [26] to [27] in different studies. The amounts of bias for most of the estimated coefficients in the present study were even smaller than .

The MRM based on LOCF approach performs as well as SUR model based on LOCF approach, with amount of bias less than 0.16. The baseline value carried forward in MRM performed similar to LOCF approach. The amount of bias was not substantially decreased if we increased the sample size from 20 to 240.

In general, when the bias is nearly the same, the MSEs of the estimated coefficients in SUR model are slightly larger than those in MRM. However, this difference becomes smaller and smaller as the sample size increases.

4. Worked Example

This example is motivated by our research on evaluating the changes of biochemical levels of magnesium (Mg), blood urea nitrogen (BUN), and full history and clinical assessment of serum creatinine (Cr), albumin, hemoglobin, calcium (Ca), sodium (Na), potassium (K), and phosphorous (P) and their relation to intradialytic hypotension (IDH) in 21 chronic hemodialysis patients in a midweek single dialysis session. According to national kidney foundation guideline (NFK-DOQI guideline), IDH was defined as a decrease of systolic blood pressure of more than 20 mmHg of the basal value [27, 28]. Systolic blood pressure and serial assessment of magnesium, potassium, phosphorus, calcium, and sodium were measured at the start of the hemodialysis session, 2 hours later, and at the end of the session. Among biochemical determining factors, the levels of BUN and weight were measured at the start and the end of the session and clinical examination of the other factors such as serum creatinine, albumin, and haemoglobin were measured only at the start of the session. This fact motivated us to examine the performance of the mentioned methods of the present study in this data set.

In the first step of the analysis, covariates that have value under 0.25 in the univariate analysis (the results were not shown) were considered in the last model which are levels of Cr, K, P, BUN, and weight. The intermittently observed time-dependent covariate of interest in this example is blood urea nitrogen and weight. Based on AIC index, the best fitted model is mixed-effect regression model with random effect of intercept.

Table 6 presents the estimated coefficients and standard error (SE) of each covariate from different proposed models. As seen in the table, there is similarity between both estimated coefficients and their standard errors from the proposed model as compared with the mixed-effect regression model in spite of small sample size (). These findings are consistent with the results from our simulation study. Consider the point that SUR model based on LOCF and BOCF leads to the same result due to the fact that for unobserved value of BUN after 2 hours the last observation and baseline observation being the same. This situation also happens for MRM. Thus, we only present the findings for BOCF method in Table 6.

Additionally, among considered variables in the last model, the effect of blood urea nitrogen is significant through all models.

5. Discussion

Intermittently observed time-dependent covariates are relatively common situation in practice. Sometimes, these covariates are measured intermittently for all of the subjects. This situation happens in medical applications commonly. This specific research question about analyzing longitudinal data with a special kind of intermittently observed time-dependent covariate was our motivation for the present study. We proposed a general class of multivariate linear model with respect to each observation time. The proposed model is based on SUR model, and it introduces several correlated multivariate linear models with some similar and different covariates at each observation time, like the model proposed by Park and Woolson [19]. However, they considered that if is observed, then is always observed, and missing observations are assumed to be missing at random. We considered some restrictions on the parameters to be the same for all similar covariates at each observation time, and it led to fewer parameters and obtaining a unique estimate for each covariate as mixed model. Furthermore, the advantages of this model are its simple applicability, faster speed of convergence, and the ease of estimation. The other desirable feature of this method is that it does not require any knowledge about the covariance structure and the distribution of random effects unlike mixed-effect regression model. The presented models are quite standard and are not novel, but we attempt to expose these models to medical applications. Up to our knowledge, there is no comprehensive approach for such intermittent observation of time-dependent covariates in medical studies based on the proposed SUR models.

Besides the main objectives of the study to handle intermittent observation of time-dependent covariates in longitudinal data, we examined the performance of SUR based model for longitudinal data with covariates measured at the same time points. For complete data, our simulation study showed that the SUR model performs as efficient as the MRM even in a small sample size. Moreover, SUR models converge faster than MRM. These simulation results are consistent with the models of Patel [29] and Verbyla and Venables for complete and balanced longitudinal data [30]. However, they considered other estimation methods, and also they did not compare models to MRM.

Park and Woolson proposed a SUR model for handling incomplete and unbalanced longitudinal data [19]. They considered two methods of estimation: one method is the generalized least square (GLS) and the other iterative maximum likelihood estimation (MLE) using the EM algorithm. In this study, we used ITSUR method as our estimation method, and other kinds of classic imputation methods such as BOCF or LOCF. In SUR model when we have both different variables and some similar ones simultaneously for different response variables, ITSUR is one of the best methods of estimation. Moreover, the two main motivations for the use of ITSUR as our estimation method are (1) the use of information about contemporaneous correlation among error terms across equations in an attempt to improve the efficiency of parameter estimates, and (2) imposition of restrictions in the involved parameters in different equations.

In the case of longitudinal data with intermittent observation of time-dependent covariates, at each time points, we considered a univariate model for each measured response and also the observed measure of each time-dependent covariate as an independent variable at that time point. For unobserved value of the time-dependent covariate at that specific time point, we considered 3 easy methods. Similar to SUR models for complete observed time-dependent covariate, we imposed restriction on the parameters to be the same at all time points for that covariate. We put this restriction on all the time-invariant covariates such as sex as well. Our findings demonstrate that sometimes some bias can arise from fitting the SUR model. However, based on the study of Sinharay et al., this amount of bias could be negligible because in all situations it is smaller than .

Among the introduced model for intermittent covariates in longitudinal data, SUR model with last observation carried forward has the best desirable performance. Moreover, through this method, in all situations of Table 1, the amount of bias was less than . These findings are consistent with MRM with LOCF method.

There many situations in medical and behavioral research in which variables of interest are measured intermittently. Oftentimes, these kinds of data are analyzed through MRM. However, it is common place for educational researchers to conduct SUR model for analyzing these kinds of datasets. SUR models are underutilized and should be given more consideration as an analytic technique due to their elegant features.

There are some further points which are worth noting to mention here. Firstly, in generating time-dependent covariates, we considered the linear relationship between variable week and time-dependent covariates of interest (IMI and DMI) based on Reisby’s dataset. Even so, sometimes the time-dependent covariates change through the time by curve linear relation. These kinds of relationships should be a consideration of future investigations. Secondly, we restricted our study to only continuous response and time-dependent covariates. The application of the proposed method for categorical responses or covariates could be investigated in future studies.

Thirdly, we considered three classic imputation approaches, and performances of other kinds of imputation methods such as multiple imputations could be studied in the forthcoming investigations. Finally, we simply assumed complete observation of the response variable, and we did not address other kinds of missing covariates such as attrition. Thus, future studies could examine the performance of the proposed models in the presence of other kinds of missing covariate in addition to intermittent observation of the covariates in more depth.

6. Conclusion

The SUR model performs useful analysis of longitudinal data with completely observed time-dependent covariates as MRM. This model also shows desirable performance in presence of intermittently observed time-dependent covariates even for small to medium sample sizes. Moreover, the desirable features of these models including simplicity, faster speed of convergence, and no need for knowledge about the covariance structure and the distribution of random effects unlike mixed-effect regression model make these models applicable in experimental and medical studies.

Abbreviations

BUN:Blood urea nitrogen
BOCF: Baseline observation carried forward
CC: Complete case
DMI:Desipramine
HDRS: Hamilton Depression Rating Scale
IMI: Imipramine
ITSUR: Iterated seemingly unrelated regression
LOCF: Last observation carried forward
MRM:Mixed-effect regression model
MSE: Mean square error
REML: Restricted Maximum Likelihood
SE: Standard error
SUR: Seemingly unrelated regression.

Conflict of Interests

The authors declare that they have no conflict of interests.

Authors' Contribution

S. Keshavarzi and N. Zare were responsible for the design, simulation, analysis, and interpretation. S. M. T. Ayatollahi supervised the study and helped to draft the paper. M. Pakfetrat provided the data of motivating example. All of the authors read and approved the final version of the paper.

Acknowledgments

This paper is extracted from Ph.D. dissertation by S. Keshavarzi in Biostatistics, Shiraz University of Medical Sciences, Shiraz, Iran. The authors are grateful to Reisby et al. for using their data in simulation section. We are also thankful to Dr. Nick Bellissimo and Dr. Shokrpour for their invaluable comments.