Abstract
The nonparametric estimation for the density and hazard rate functions for right-censored data using the kernel smoothing techniques is considered. The “classical” fixed symmetric kernel type estimator of these functions performs well in the interior region, but it suffers from the problem of bias in the boundary region. Here, we propose new estimators based on the gamma kernels for the density and the hazard rate functions. The estimators are free of bias and achieve the optimal rate of convergence in terms of integrated mean squared error. The mean integrated squared error, the asymptotic normality, and the law of iterated logarithm are studied. A comparison of gamma estimators with the local linear estimator for the density function and with hazard rate estimator proposed by Müller and Wang (1994), which are free from boundary bias, is investigated by simulations.
1. Introduction
Censored data arise in many contexts, for example, in medical follow-up studies in which the occurrence of the event times (called survival) of individuals may be prevented by the previous occurrence of another competing event (called censoring). In such studies, interest focuses on estimating the underlying density and/or hazard rate of the survival time. Nonparametric estimation using kernel smoothing method has received considerable attention in the statistical literature. A popular approach for estimating the density function and the hazard rate function is done using a fixed symmetric kernel density with bounded support and a bandwidth parameter. The kernel determines the shape of the local neighborhood while the bandwidth controls the degree of smoothness. In order to get a reasonable estimator, these two parameters, the kernel and the bandwidth parameter, have to be chosen carefully. For a review about kernel smoothing approaches, we refer the reader to Silverman [1] and Izenman [2] for uncensored data and Singpurwalla and Wong [3], Tanner and Wong [4], Padgett and McNichols [5], Mielniczuk [6], and Lo et al. [7] in the case of right censoring.
It is well known from the literature that the kernel has less impact than the bandwidth on the resulting estimate. However, when the density function of the data have a bounded support, using the classical kernel leads to an estimator with a large bias near the endpoints. The problem of bias, called also the boundary effect, becomes a serious drawback when a large portion of the sampled data are present in the boundary region. In fact, when we estimate the underlying function near the endpoints, as the support of the kernel exceeds the available range of the data, the bias of the resulting estimator becomes larger. This is especially the case in survival analysis, since the survival time is assumed to be nonnegative variable. So, near zero, the symmetric kernel estimator of the density and the hazard functions underestimates the true ones.
For uncensored data, several methods are available in the literature to overcome this problem, for example, the reflection method of Schuster [8], the smooth optimum kernel of Müller [9], the local linear estimator of Lejeune and Sarda [10], the transformation approach of Marron and Ruppert [11], and the boundary kernel of Jones [12] and Jones and Foster [13]. The local linear estimator is a special case of boundary kernel method. The main idea behind the boundary kernel method is to use an adaptive kernels in the boundary region and to use a fixed symmetric kernel in the interior region. For nonnegative data and in order to overcome the boundary bias problem, Chen [14] considers the gamma kernel estimator. Simulation results of Jones [12] and Chen [14] show that the local linear estimator outperforms the boundary kernel estimator of Müller [9]. For right-censored data and to resolve this problem, Müller and Wang [15] propose a new class of polynomial boundary kernel estimator for hazard rate function, where the kernel and the bandwidth parameter depend on the point where the estimate is to be evaluated. Hess et al. [16] show numerically its performance via an extensive simulation study.
In this paper, we adapt the gamma kernel smoothing procedure to estimate the marginal density and the hazard function of positive independent survival data that are subject to right censoring. We show that both estimators are robust against boundary problems. Also, we establish the mean integrated square error, the asymptotic normality, and the law of iterated logarithm of the two estimators. Furthermore, via a Monte Carlo study, the finite sample performance of the estimators is investigated under various scenarios.
The paper is organized as follows. Section 2 introduces the gamma kernel estimators for the density and the hazard rate functions for right-censored data. In Section 3, we establish the asymptotic properties of the gamma kernel estimators. In Section 4, we investigate the finite sample properties of the gamma kernel estimators. Section 5 provides an application to the classical bone marrow transplant data. The last section is an appendix that gathers the proofs.
2. Methodology
Let (survival times) and (censoring times) be two i.i.d. nonnegative independent random sequences with distribution functions and , respectively. Under the censoring model, instead of observing , we observe the pair , where and with being the indicator function. We denote by the density function of and by the corresponding hazard function. The nonparametric maximum likelihood approach proposed by Kaplan and Meier [17] leads to the estimator of given by where and are the corresponding 's. This estimator was studied by many authors. For reference, we cite Breslow and Crowley [18], Wang [19], and Stute and Wang [20] among many authors. Lo and Singh [21] expressed the KM estimator as an i.i.d. mean process with a remainder of negligible order. This result was improved by Lo et al. [7] and will be useful in this paper to make the connection between the uncensored and censored case.
From now on, we will denote the right endpoint of a given (sub)distribution by , that is, . We will also use the notation for the corresponding survival function, that is, . Let be the distribution function of , that is, . We suppose that or equivalently . In the remainder of this paper, except if mentioned otherwise, all the integrations are taken over the interval , where . Based on the smooth kernel technique, we propose to estimate the density by the gamma kernel estimator defined as follows: where the kernel is given by the weights 's are the jumps of at for , and is the bandwidth parameter.
Naturally, the gamma kernel estimator that we propose for the hazard rate is As we will see later, those estimators are free of boundary bias; this is due to the fact that the gamma kernel is defined on the positive real, and so no weight is assigned outside the support of the underlying density and/or hazard rate functions. The shape of the gamma kernel and the amount of smoothing are not only controlled by the bandwidth parameter, but also vary according to the position where the function is estimated. For uncensored data, Chen [14] shows that the gamma kernel density estimator achieves the optimal rate of convergence for the mean integrated squared error within the class of nonnegative kernel density estimators. Bouezmarni and Scaillet [22] state the uniform weak consistency for the gamma kernel estimator on any compact set and also the weak convergence in terms of mean integrated absolute error. In the next section, we will prove that even when the data are censored, the gamma kernel estimators perform in the interior and the boundary regions.
3. Asymptotic Properties
In this section, we state the asymptotic properties of the gamma kernel estimator of the density and the hazard rate functions. We start with the following theorem which will play an important role for the remainder of this section. To be concise, in the following, we will denote by either the density or the hazard rate function and either the gamma kernel estimator for the density or the hazard function.
Theorem 3.1. Assume that is twice continuously differentiable. If (i) , then the integrated mean squared error of is where for the density function and are given by and for the hazard rate function and are given by The optimal bandwidth parameter which minimizes the asymptotic is given by and the corresponding optimal asymptotic integrated mean squared error is
Theorem 3.1 states that the gamma kernel estimators of the density and hazard rate functions are free of boundary bias and provides the theoretical formula of the optimal bandwidth. However, in real data analysis, to choose an appropriate bandwidth, one needs to use a data driven procedure, for example, the cross-validation, the bootstrap, or the method proposed by Bouezmarni and Scaillet [22]. Of course, all those methods need to be carefully adapted to the censoring case. Also, from (A.25) in the appendix, the asymptotic variances of the gamma kernel estimator are of a larger order near the boundaries than those in the interior. However, Theorem 3.1 shows that the impact of the increased variance near the boundary on the mean integrated squared error is negligible and the optimal rate of convergence in term of integrated mean squared error is achieved by the gamma kernel estimators.
The following proposition deals with the asymptotic normality of the gamma kernel estimators.
Proposition 3.2. Under the same conditions in Theorem 3.1. if (ii) , then for any such that and , one has where
We establish in the next proposition the law of iterated logarithm of the gamma kernel estimators. Let , where is defined in Proposition 3.2.
Proposition 3.3. Under the same conditions in Theorem 3.1. if, (iii), and .(iv), then
4. Simulations Results
The finite sample performance of the proposed methodology is studied in this section. Two models are considered. (i)Model A: the survival time follows an exponential distribution, and the censoring times were generated from an uniform distribution . was chosen to be the solution of the following equation , where is the desired percentage of censoring. (ii)Model B: the survival times follow a Weibull distribution with scale parameter and shape parameter , and the censoring times are also generated from a Weibull distribution with shape parameter and a scale parameter given by . This ensures that the degree of censoring is equal to .
First, we show the results for the density estimator. To evaluate the performance of the gamma kernel estimator for the density function, we compare this later with the local linear estimator of Jones [12] adapted for the right censoring case. The local linear method is known to be a robust technique against boundary bias problems. We consider different sample sizes, . As a measure of errors of the estimators, we analyze the mean and the standard deviation of the -norm.
Tables 1 and 2 provide the results obtained with 1000 replications for the mean and the standard deviation, respectively. Firstly and as expected, when the sample size increases, the mean integrated square error for the two estimators decreases; see Table 1. This is true for both models and all degrees of censoring. For example, in model A, we can see that with 10% of censoring and using the gamma kernel estimator, the mean error decreases from 0.0117 to 0.0101 when the sample size goes from 125 to 250. Note that, for the of censoring, the rate at which the mean error decreases is much smaller. Secondly, except the censoring case with model A, the gamma kernel estimator outperforms strongly the local linear kernel estimator. Thirdly, for model B, when the degree of censoring increases the mean integrated square error increases as expected. In fact, for model B, the cumulative distribution of the censoring times increases when the degree of censoring increases. This implies that the mean integrated squared error of the gamma kernel estimators increases. From Table 2, one can see that the variance of both models and both estimators decreases with the sample size, for all the degree of censoring. For model B, the variance of gamma kernel estimator increases with the degree of censoring. Another point to remark is that, for model B, the local linear estimator is dominated in terms of variance by gamma kernel estimator in all situations. For model A, the variance of the gamma kernel estimator is smaller than the variance of the local linear estimator for and 0.25, but not for .
For the hazard function, we compare our gamma kernel estimator with the boundary-corrected hazard estimator of Müller and Wang [15]. Table 3 shows the mean and the variance of errors based on 1000 replications for both and . The same data-generating procedure, as for density function, was used. This results clearly demonstrate that our method is far more efficient than the method proposed by Müller and Wang [15].
5. Application
To illustrate our approach with real data, we consider the classical bone marrow transplant study. The variable of interest is the disease-free survival time, that is, time to relapse or death. Among a total of 137 observations, there are 83 censored times all of them are caused by the end of the follow-up period. The data together with a detailed description of the study can be found in [23, Section 1.3]. Figure 1 shows the estimated hazard function using both our method (G) and that proposed by Müller and Wang [15] (MW) with boundary correction. For the Muller estimator, we use the data-driven global optimal bandwidth as proposed by the authors. To calculate the MW estimator and its bandwidth parameter, we make use of the R package muhaz; see Hess and Gentleman [24]. The estimator of Müller and Wang [15] shown in Figure 1 is based on the bandwidth while for the gamma kernel estimator . This is a data-driven bandwidth based on the least squared cross-validation method of Marron and Padgett [25] adapted to our case. Note that this is just a practical choice and may be far from the optimal one. From Figure 1, one cannot decide which estimator is better but, given the results of our simulation study, we suspect that the MW method has a global tendency of underestimating the real hazard function and so the real risks after a bone marrow transplant.
Appendix
We start this section with some notations. Let where denote the subdistribution of the uncensored observations. For positive real numbers and , and or 1, let We also set . First note that The following two lemmas will play an important role in the demonstrations. The first lemma is a due to Lo et al. [7] which expressed the KM estimator as an i.i.d. mean process with a remainder of negligible order.
Lemma A.1 (see [7]). For , where and, for any ,
The next lemma gives a strong approximation of the gamma kernel estimator of the density. This lemma permits us to derive the asymptotic properties of the gamma kernel estimator for density and hazard rate function.
Lemma A.2. The gamma kernel density estimator admits the strong approximation on the interval : where Also, for any ,
Proof of Lemma A.2. Let us first show that can be made arbitrary small for any . From it follows that for , Observe that, for any , . So, for any . Now, using this result, the integration by part, and Lemma A.1, we obtain We deduce the result of Lemma A.2 by using the following inequality:
Proof of Theorem 3.1. We start with the gamma kernel density estimator. From the asymptotic bias of the gamma kernel estimator for uncensored data (see Chen [14]) and the fact that for the interior region
and for the boundary region
it can be easily verified that, in our case, the bias is given by
where .
To calculate the asymptotic variance of the gamma kernel estimator, we only need to evaluate the variance of since the other terms are negligible. We start with the fact that
where . Using integration by parts, the first integral becomes
So that,
Again, using the integration by parts we obtain,
Therefore, from , we get
where is a gamma() random variable and
From Chen [14], we have that for a small value of ,
So after some easy development, we get
Finally, we derive the integrated mean squared error from (A.25) and (A.17). Now, for the gamma kernel estimator of the hazard function, we start by the following decomposition:
where
Observe that , because the term III is deterministic. Using Schwartz inequity and the boundedness of the gamma kernel density estimator we found that and . Now, from the bias formula of the gamma kernel estimator and the conditions on the bandwidth parameter, we get
Again, using the Schwartz's inequality, we obtain
Combining these formulas, one can see that
Finally, using the expression of the bias and the variance of the gamma kernel density estimator, we derive the desired result of the gamma kernel estimator for the failure rate function.
We will only give the proof of the asymptotic normality and the iterated logarithm for the gamma kernel estimator. Thereafter, the result is straightforward for the gamma kernel hazard estimator. The proofs are based on Lemma A.2. Indeed, we only need to prove the asymptotic normality and the iterated logarithm for defined in (A.8) since the other terms are negligible.
Proof of Proposition 3.2. By (i), see Theorem 3.1, we have and under the conditions on the bandwidth parameter, we have .
Therefore, we need to state that
But since
it suffices to prove that
In fact, using inequality (A.14),
Therefore, from the variance of ,
Proof of Proposition 3.3. Condition (iv) ensures that .
On the other hand, we apply [26, Theorem 1] to , where is defined as in the proof of Proposition 3.2; we get under condition (iii) on the bandwidth parameter
which concludes the proof of the theorem.
Acknowledgment
M. Mesfioui acknowledges the financial support of the Natural Sciences and Engineering Research Council of Canada.