#### Abstract

This paper develops a new empirical likelihood method for semiparametric linear regression with a completely unknown error distribution and right censored survival data. The method is based on the Buckley-James (1979) estimating equation. It inherits some appealing properties of the complete data empirical likelihood method; for example, it does not require variance estimation, which is problematic for the Buckley-James estimator. We also extend our method to incorporate auxiliary information. We compare our method with the synthetic data empirical likelihood of Li and Wang (2003) using simulations, and illustrate our method using the Stanford heart transplant data.

#### 1. Introduction

Suppose that one observes right censored regression data consisting of i.i.d. triples $(Z_i, \delta_i, X_i)$, $i = 1, \ldots, n$, where $Z_i = \min(Y_i, C_i)$, $\delta_i = I(Y_i \le C_i)$, and, for subject $i$, $Y_i$ is a known monotone transformation of the survival time of interest, $C_i$ is the corresponding censoring time, and $X_i$ is a $p$-vector of covariates. We consider the problem of making inferences for the slope parameter of the semiparametric linear regression model
$$
Y_i = \beta^{\top} X_i + \varepsilon_i, \quad i = 1, \ldots, n, \tag{1}
$$
where $\beta$ is a $p$-vector of unknown regression coefficients and the $\varepsilon_i$'s are independent and identically distributed random errors with an unknown distribution $F$. Assume that $\varepsilon_i$ is independent of $X_i$ and that, conditional on $X_i$, $Y_i$ and $C_i$ are independent, $i = 1, \ldots, n$. Because the error distribution $F$ is completely unknown, we assume no intercept term in model (1) in order for $\beta$ to be identifiable. Assume further that the covariance matrix of $X_i$ is positive definite. Model (1) provides a useful alternative to the popular Cox [1] model when the proportional hazards assumption does not hold. Furthermore, it makes no parametric assumption on the error distribution and is thus more flexible than parametric accelerated failure time models (cf. Andersen et al. [2]).

Buckley and James [3] extended the least squares method to estimate the regression coefficient in model (1) with right censored data. The Buckley-James estimator can be calculated using an iterative algorithm. Its consistency and asymptotic normality have been established by Lai and Ying [4], Ritov [5], Tsiatis [6], Ying [7], and others. However, its asymptotic variance involves the unknown hazard function of the error distribution and its derivatives, whose nonparametric estimation is problematic (cf. Ritov [5]). To overcome this problem, several approaches have been studied in the literature. Jin et al. [8], Wei et al. [9], and Lin and Geyer [10] considered rank-based inferences. Koul et al. [11] introduced a synthetic data approach which was further investigated by Zhou [12], Lai et al. [13], and others. Let $Y_{iG} = \delta_i Z_i / G(Z_i)$, where $G$ is the survival function of the censoring time. It can be shown that $E(Y_{iG} \mid X_i) = E(Y_i \mid X_i)$ if $C_i$ is independent of $Y_i$ and $X_i$. Koul et al. [11] proposed to estimate the regression parameters by regressing $\hat{Y}_{iG} = \delta_i Z_i / \hat{G}(Z_i)$ on $X_i$, where $\hat{G}$ is the Kaplan-Meier [14] estimate of $G$ and $\hat{Y}_{iG}$ is referred to as a synthetic variable. Koul et al. [11] showed that the synthetic data estimator is asymptotically normal and that its variance is simple to estimate. However, it requires the strong assumption that the censoring time is independent of both the survival time and the covariates. It can also have poor small sample performance in the presence of heavy censoring (cf. Li and Wang [15]). Jing and Qin [16] and Li and Wang [15] independently developed empirical likelihood-based inference for $\beta$ using synthetic data. Their methods provide substantial improvement over the normal theory method of Koul et al. [11], but are still not very satisfactory for small samples (cf. Li and Wang [15]). The synthetic data approach assumes independence between $C_i$ and $(Y_i, X_i)$. Li and Lu [17] showed that it can yield seriously biased parameter estimates if $C_i$ depends on $X_i$.
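A minimal numerical sketch of the synthetic data idea follows, writing `z` for the observed times, `delta` for the event indicators (1 = uncensored), and `G_hat` for the Kaplan-Meier estimate of the censoring survival function. This is an illustration, not the authors' code; the O(n²) Kaplan-Meier loop and the handling of ties and left limits are simplifications.

```python
import numpy as np

def km_censoring_survival(z, delta):
    """Kaplan-Meier estimate G_hat of the censoring survival function,
    evaluated at each observed time z[i]: a product over the censoring
    times z[j] <= z[i] of (1 - 1 / #{k : z[k] >= z[j]})."""
    n = len(z)
    G = np.ones(n)
    for i in range(n):
        g = 1.0
        for j in range(n):
            if delta[j] == 0 and z[j] <= z[i]:
                g *= 1.0 - 1.0 / np.sum(z >= z[j])
        G[i] = g
    return G

def synthetic_data_estimate(z, delta, x):
    """Least squares fit of the synthetic responses delta * z / G_hat(z)
    on the covariates (no intercept, matching the model in the text)."""
    G = km_censoring_survival(z, delta)
    y_syn = delta * z / np.maximum(G, 1e-10)
    beta, *_ = np.linalg.lstsq(x, y_syn, rcond=None)
    return beta
```

With no censoring, `G_hat` is identically 1 and the fit reduces to ordinary least squares; in general the estimator is consistent only when the censoring time is independent of both the survival time and the covariates, as discussed above.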
Zhou and Li [18] developed a censored data EL method under the weaker censorship assumption that $Y_i$ and $C_i$ are conditionally independent given $X_i$. Their method also demonstrated better small sample performance than the synthetic data EL method when the errors $\varepsilon_i$'s are independent and identically distributed (homogeneous). On the other hand, it is not easy to extend the method of Zhou and Li [18] to incorporate auxiliary information or to construct confidence regions for linear combinations of the regression coefficients, both of which are relatively easy using the synthetic data EL method (cf. Li and Wang [15]).

In this paper, we develop a new empirical likelihood method for linear regression with right censored data. We construct an estimated empirical likelihood based on the estimating equation for the Buckley-James [3] estimator. The approach inherits many appealing features of empirical likelihood; for example, the shape and orientation of a confidence region are determined entirely by the data. Most importantly, it does not involve variance estimation. Our simulations show that, under the assumptions of model (1), our method is far superior to the synthetic data empirical likelihood method of Li and Wang [15], with substantially shorter confidence intervals and higher coverage probabilities. Compared to the method of Zhou and Li [18], our method may not be as efficient, but it is easier to compute. It can also be extended easily to incorporate auxiliary information and to construct confidence regions for linear combinations of the regression coefficients. More discussion of the pros and cons of our method in relation to existing methods is given later in Remark 2.

The use of empirical likelihood dates back at least to Thomas and Grunkemeier [19] who constructed confidence intervals for survival probabilities. The idea was later popularized by Owen [20, 21] who derived confidence regions for the mean of a random vector. Since the work of Owen [20, 21], there have been extensive developments of empirical likelihood methods for a wide range of applications including, among others, linear models [22–24], generalized linear models [25], quantile estimation [26], biased sample models [27], generalized estimating equations [28], dependent process models [29], partial linear models [30], mixture proportions [31], confidence bands for survival functions [32], confidence bands for quantile functions [33], censored cost regression [34], and confidence tubes for multiple quantile plots [35]. Nice discussions of the properties of empirical likelihood can be found in DiCiccio, Hall, and Romano [36], Hall [37], Hall and La Scala [38], and others. A comprehensive survey of empirical likelihood and further references can be found in Owen [39] and Li et al. [40].

In Section 2, we derive an estimated empirical likelihood for $\beta$ by combining the ideas of Owen [21, 22] and Buckley and James [3]. An adjustment factor is adopted so that the adjusted empirical likelihood has an asymptotic standard Chi-square distribution. We also discuss how to incorporate auxiliary information using empirical likelihood. Section 3 presents results from a simulation study illustrating the performance of our method compared with the synthetic data empirical likelihood method; a real data example is also provided. Section 4 gives some concluding remarks. The proofs are collected in the appendix.

#### 2. Empirical Likelihood Inference

##### 2.1. Empirical Likelihood Based on the Buckley-James Estimating Equation

We motivate our procedure by first considering the case where and are known.

We first give a brief review of the Buckley-James estimating equation. It can be shown that, under model (1), or, equivalently,

Because $Y_i$ is not always observable, we impute it by its conditional expectation given the observed data as follows:

Noting that , we can replace by in (3). This leads to where

Now, the problem of testing is equivalent to testing based on i.i.d. observations , . This problem can be readily solved using the results of Owen [21]. Specifically, define Owen [21] showed that where is the solution of the equation Moreover, under , has an asymptotic central Chi-square distribution with degrees of freedom. Thus, one would reject if , where is the upper quantile of the standard Chi-square distribution with degrees of freedom.
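For completeness, the computation behind Owen's result can be sketched numerically. The sketch below is an illustration, not the paper's implementation: writing `w` for an `(n, p)` array of i.i.d. vectors, it solves the dual equation for the Lagrange multiplier by an unsafeguarded Newton iteration and returns the $-2\log$ empirical likelihood ratio for testing that the mean is zero, to be compared with a Chi-square quantile with `p` degrees of freedom.

```python
import numpy as np

def el_log_ratio(w, n_iter=50):
    """-2 log empirical likelihood ratio for H0: E[W] = 0.
    Solves sum_i w_i / (1 + lam' w_i) = 0 for lam by Newton's method
    (no step-halving or feasibility safeguards; a bare-bones sketch),
    then returns 2 * sum_i log(1 + lam' w_i)."""
    n, p = w.shape
    lam = np.zeros(p)
    for _ in range(n_iter):
        denom = 1.0 + w @ lam
        score = np.sum(w / denom[:, None], axis=0)     # dual score
        jac = -(w / denom[:, None] ** 2).T @ w         # dual Jacobian
        lam = lam - np.linalg.solve(jac, score)
    return 2.0 * np.sum(np.log(1.0 + w @ lam))
```

The statistic is 0 when the sample mean is exactly zero and grows as the data are shifted away from the hypothesized mean.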

Now, we consider the case where both and are unknown. Define , . We estimate by the Kaplan-Meier [14] estimator of based on , where are the order statistics of the -sample, and is the associated with . In addition, we estimate by the sample mean .
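The Buckley-James iteration underlying this construction can be sketched as follows. The notation is assumed (`z` observed responses, `delta` event indicators, `x` covariate matrix), placing the leftover Kaplan-Meier mass on the largest uncensored residual is one common convention rather than necessarily the paper's, and ties are glossed over.

```python
import numpy as np

def km_jumps(e, delta):
    """Kaplan-Meier jumps of the residual distribution estimate:
    returns the sorted uncensored residuals and their masses."""
    order = np.argsort(e)
    e_s, d_s = e[order], delta[order]
    n = len(e)
    surv = 1.0
    pts, mass = [], []
    for i in range(n):
        if d_s[i] == 1:
            jump = surv / (n - i)     # S(t-) / (number at risk)
            pts.append(e_s[i])
            mass.append(jump)
            surv -= jump
    if mass and surv > 0:             # leftover mass -> largest residual
        mass[-1] += surv
    return np.array(pts), np.array(mass)

def buckley_james_update(beta, z, delta, x):
    """One Buckley-James step: impute each censored response by its
    estimated conditional expectation given that its residual exceeds
    the observed one, then refit least squares; iterate to a fixed point."""
    e = z - x @ beta
    pts, mass = km_jumps(e, delta)
    y_imp = z.astype(float).copy()
    for i in np.where(delta == 0)[0]:
        tail = pts > e[i]
        w = mass[tail]
        cond = np.sum(pts[tail] * w) / w.sum() if w.sum() > 0 else e[i]
        y_imp[i] = x[i] @ beta + cond
    beta_new, *_ = np.linalg.lstsq(x, y_imp, rcond=None)
    return beta_new
```

With no censoring, a single update returns the ordinary least squares estimate regardless of the starting value.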

We propose to use as a likelihood ratio statistic for testing . However, we can no longer use Owen's [21] result for the null limiting distribution because the 's are not i.i.d. The following theorem states that an adjustment factor is needed so that the limiting null distribution is standard Chi-square.

**Theorem 1.** *Assume that the conditions listed in the appendix hold. Define

where

Then, under ,

where is a standard Chi-square random variable with degrees of freedom.*

It follows immediately that an approximate -level test rejects if Moreover, an approximate confidence region for is given by

**Remark 2.** Although both the method derived above and that of Zhou and Li [18] use the Buckley-James estimating equation, a sample version of (7), they differ in that we use the complete data likelihood, whereas Zhou and Li [18] use the exact censored data likelihood to construct the EL. Similar to Li and Wang [15], the above method can be extended to incorporate auxiliary information and to obtain EL procedures for a subset, contrast, or linear combination of the regression coefficients, which does not seem easy using the method of Zhou and Li [18]. As an illustration, we show below how to extend our method to incorporate auxiliary information.

##### 2.2. Empirical Likelihood with Auxiliary Information

Auxiliary population characteristics of the covariates are sometimes available in practice. Effective use of such auxiliary information can lead to more efficient inference (cf. Chen and Qin [41], Qin and Lawless [28], and Zhang [42, 43]). Here, we show how to use empirical likelihood to incorporate auxiliary information about the covariates.

Assume that the available auxiliary information on $X_1$ is given in the form $E\{g(X_1)\} = 0$, where $g(\cdot) = (g_1(\cdot), \ldots, g_q(\cdot))^{\top}$ is a vector of known functions. To make use of the auxiliary information, we maximize subject to , and .

Let . By the method of Lagrange multipliers, it can be shown that (17) is maximized at where satisfies the following equation Hence, the empirical log-likelihood ratio function for is given by
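The Lagrange multiplier computation can be made concrete for the auxiliary constraint on its own. The sketch below is an illustration under the assumption that the auxiliary information has the standard form $E\,g(X) = 0$; further estimating-equation constraints enter the same way when stacked into `g`. It returns the constrained probability weights $p_i = 1/\{n(1 + \lambda^{\top} g_i)\}$.

```python
import numpy as np

def el_weights(g, n_iter=50):
    """Maximize prod_i p_i subject to p_i >= 0, sum_i p_i = 1, and
    sum_i p_i g_i = 0, where g is an (n, q) array of constraint
    functions evaluated at the data.  Solves the dual equation
    sum_i g_i / (1 + lam' g_i) = 0 by Newton's method (unsafeguarded)."""
    n, q = g.shape
    lam = np.zeros(q)
    for _ in range(n_iter):
        denom = 1.0 + g @ lam
        score = np.sum(g / denom[:, None], axis=0)
        jac = -(g / denom[:, None] ** 2).T @ g
        lam = lam - np.linalg.solve(jac, score)
    return 1.0 / (n * (1.0 + g @ lam))
```

When the sample already satisfies the constraint, the weights are uniform ($1/n$); otherwise the data are tilted, as little as the likelihood allows, to meet it. The unit-sum constraint holds automatically once the dual equation is solved.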

Similar to the previous section, an adjustment factor is needed for to have a standard Chi-square asymptotic distribution, as stated in the following theorem.

**Theorem 3.** *Assume that is positive definite and that exists. Define

where and are defined in Theorem 1. Then, under the conditions of Theorem 1,

as , where

and .*

#### 3. Numerical Results

We carried out Monte Carlo simulations to examine the performance of our proposed empirical likelihood method based on the Buckley-James estimating equation (ELEE) in comparison to the empirical likelihood method based on synthetic data (ELSD) [15, 16]. We considered five models. In model A, the data were generated from , where and are independent normal random variables with mean 0 and variance 0.25, and the censoring time is a normal random variable with mean and standard deviation 4. Model B is the same as model A except that ~ Bernoulli. Model C assumes that , where , ~ Weibull (shape = 1.843, scale = 1), and . Model D is the same as model A except that , allowing the censoring time to depend on . Model E is the same as model A except that , allowing for heterogeneous errors. We adjusted to produce different censoring rates (CR). We also varied the sample size . The achieved confidence levels and average lengths of the ELEE and ELSD confidence intervals for the slope parameter are summarized in Table 1. Each entry in the table was computed using 3,000 Monte Carlo samples.
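For readers who want to reproduce a setup of this kind, here is a data-generating sketch in the style of model A. The slope and the censoring-time mean are illustrative placeholders (the paper's exact values are not shown above); the censoring mean is the knob that controls the censoring rate.

```python
import numpy as np

def generate_model_a_style(n, beta=1.0, cens_mean=2.0, rng=None):
    """Right censored regression data in the style of model A:
    covariate and error are independent N(0, 0.25); the censoring
    time is normal with standard deviation 4.  The values of beta
    and cens_mean are illustrative stand-ins."""
    if rng is None:
        rng = np.random.default_rng()
    x = rng.normal(0.0, 0.5, size=n)      # sd 0.5 => variance 0.25
    eps = rng.normal(0.0, 0.5, size=n)
    y = beta * x + eps
    c = rng.normal(cens_mean, 4.0, size=n)
    z = np.minimum(y, c)                  # observed time
    delta = (y <= c).astype(int)          # event indicator
    return z, delta, x

z, delta, x = generate_model_a_style(1000, rng=np.random.default_rng(7))
print("censoring rate:", 1.0 - delta.mean())
```

Lowering `cens_mean` increases the censoring rate, which is how different CR levels can be produced in a simulation like this.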

We see from Table 1 that, under models A–D, the coverage probabilities of the ELEE method are consistently close to or slightly above the nominal level, whereas the ELSD method can have severe under-coverage for small samples () and a large censoring rate (75%). Furthermore, the ELEE confidence intervals are much narrower than the ELSD confidence intervals. In particular, the ELSD method failed completely, with unreasonably low coverage, under model D, where the censoring time depends on . On the other hand, under model E with heterogeneous errors and an independent censoring time, ELEE showed larger coverage probability errors than ELSD, as one would expect. Thus, the ELEE method seems to dominate the ELSD method when the errors are homogeneous, but can be outperformed by ELSD in the presence of heterogeneous errors.

We now illustrate our method using the Stanford heart transplant data (Miller [44], Table 1). The data include the lengths of survival (in days) after transplantation, ages at the time of transplant, and T5 mismatch scores for 69 patients who received heart transplants at Stanford and were followed to April 1, 1974. Twenty-four patients were still alive on April 1, 1974, and thus their survival times were censored. For illustration purposes, we considered two models, labeled (I) and (II), where the dependent variable is the logarithm to base 10 of the length of survival from transplantation. Specifically, model (I) regresses on the mismatch score T5, and model (II) regresses on age. As in Koul et al. [11], regression of survival on the mismatch score T5 was performed with nonrejection-related death treated as censoring, since the mismatch score is directed at the rejection phenomenon [44]. Table 2 reports the parameter estimates and 95% confidence intervals for the slope parameters using our empirical likelihood method based on the Buckley-James estimating equation (ELEE) and the empirical likelihood method based on synthetic data (ELSD) [15, 16].

It is seen from Table 2 that both the point estimates and the confidence intervals are quite different between the ELEE and ELSD methods for these data. For example, the synthetic data (KSV) slope estimates for T5 and age are positive, which seems to contradict the common belief that survival time tends to be negatively correlated with T5 (the mismatch score) and with age at diagnosis. For model (II), the ELSD confidence interval does not include zero and thus indicates a significant age effect at the 5% significance level, whereas the ELEE method does not produce a significant result. The ELSD results could be misleading in this example since the censoring time appears to depend on age and T5, as pointed out by Leurgans [45] and Li and Lu [17]. We have also examined the Cox-Snell residual plots for a number of parametric accelerated failure time (AFT) models. We found that the log-normal AFT model fits the data fairly well, which indicates that the semiparametric AFT model should also fit the data well.

#### 4. Discussion

This research adds a new tool to the toolbox of empirical likelihood (EL) methods for linear regression with right censored data. The three EL methods, namely, the method of Li and Wang [15], the method of Zhou and Li [18], and the method of this paper, should be regarded as complementary, rather than competing, methods for linear regression with right censored survival data. Each of the three approaches has its own merits and shortcomings, and none dominates the others in every aspect, so it is important to understand their pros and cons. First of all, if the errors in the regression model are i.i.d. (homogeneous) and the censoring time is conditionally independent of the survival time given the covariates, then the method of Zhou and Li [18] and the method of this paper are expected to be superior to the Li and Wang [15] method, as demonstrated by the simulations in Table 1 and in Zhou and Li [18]. This is because the former two methods use the Buckley-James estimating equations, which take advantage of the homogeneous error assumption to implicitly impute a censored observation from the model using all residuals, whereas the latter uses synthetic data that do not utilize the homogeneous error assumption. Furthermore, based on our limited experience, the Zhou and Li [18] method seems to be more efficient than the method of this paper. This is not a surprise, since the Zhou and Li [18] method uses the exact censored data likelihood, whereas the ELEE method of this paper uses an approximate likelihood. On the other hand, Zhou and Li [18] only developed an EL procedure for the whole vector of regression coefficients and did not discuss how to extend their method to incorporate auxiliary information, which can be done easily using the approach of this paper, as described in Section 2.2.
Secondly, if the errors are heterogeneous, then the Zhou and Li [18] method and the method of this paper are expected to fail, as indicated by Table 1 (model E), but the synthetic data method of Li and Wang [15] would still be asymptotically valid under the stronger censorship assumption that the censoring time is independent of both the survival time and the covariates. Finally, the method of Li and Wang [15] can fail completely when the censoring time depends on the covariates, as shown in Table 1 (model D).

Some further extensions are warranted in future research. For example, the method of this paper can be extended to draw inference for linear combinations of the regression coefficients along the lines of Li and Wang ([15, Section 3]). It can also be extended to construct EL confidence regions based on estimating equations for a class of M-estimators described by Ritov [5]. Finally, this paper focuses only on EL methods for right censored data. It would be interesting to investigate how these EL methods compare to other methods such as the rank-based regression methods in future studies.

#### Appendix

*Assumption A.1.* The covariate has compact support.

*Assumption A.2.* There exists a function such that converges uniformly to in probability as .

The following lemmas are needed to prove Theorem 1.

**Lemma A.3.** *Let denote the hazard function of . Then,

where and

Moreover, can be consistently estimated by .*

*Proof.* By Proposition 4.1 of Ritov [5]
where , , are orthogonal locally square integrable martingales with respect to the filtration . This, together with Rebolledo's martingale central limit theorem (cf. Andersen et al. [2]), proves (A.1).

**Lemma A.4.** *Under the conditions of Theorem 1,*
(a) ,
(b) .

*Proof.* (a) By Assumption A.1, for some constant . Thus,
where we have used Lemma 3 of Owen [21] in the second step and Lemma 4.1 of Ritov [5] in the third step.

(b) Define that . Then, applying Lemma 4.1 of Ritov [5], it can be shown that

Let , where and . Then, by Owen [21],
where is the smallest eigenvalue of . Let be the unit vector in the th coordinate direction. By Lemma A.3,
The conclusion of (b) then follows from (10), (A.5)–(A.7), and the arguments used in the proof of (2.14) of Owen [21].

*Proof of Theorem 1.* For simplicity, we denote by throughout the proof.

Applying Taylor's expansion to (9), we have
where

Note that

This, together with (a), (b), and the fact that
implies that

Note that
By (10), (A.5), (A.13), and Lemma A.4, we get

Again by (10), we have
Moreover, by Lemma A.4 and (A.11), we have
It then follows from (A.15) and (A.16) that

Combining (A.8), (A.17) and (A.14) yields the following identity
This, together with Lemma A.3, proves Theorem 1.

The following lemma is needed to prove Theorem 3.

**Lemma A.5.** *Assume that the conditions of Theorems 1 and 3 hold. Then,

where .*

*Proof.* This result is a direct consequence of Lemma A.3 and the following facts:

*Proof of Theorem 3.* The theorem can be proved using Lemma A.5, along the lines of the proof of Theorem 1. We omit the details.

#### Acknowledgments

The authors thank the editor and two referees for their insightful and constructive comments, which led to significant improvements of the paper. The authors also thank Dr. Jun Dong for his programming assistance. The research of Gang Li was supported in part by National Institutes of Health Grants P30 CA-16042, UL1TR000124-02, and P01AT003960.