#### Abstract

Interval-censored (IC) failure time data are often observed in medical follow-up studies and clinical trials in which subjects can only be examined periodically, so the failure time is known only to lie in an interval. In this paper, we propose a weighted Wilcoxon-type rank test for comparing two IC samples. Under a very general sampling scheme developed by Fay (1999), the mean and variance of the test statistic under the null hypothesis can be derived. Simulation studies show that the proposed test performs better than the two existing Wilcoxon-type rank tests proposed by Mantel (1967) and R. Peto and J. Peto (1972). The proposed test is illustrated with an example involving patients in AIDS cohort studies.

#### 1. Introduction

Interval-censored (IC) failure time data often arise in medical studies such as AIDS cohort studies and leukemic blood cancer follow-up studies. In these studies, patients are divided into two groups according to treatment. For example, in leukemic cancer studies, one group of patients was treated with radiotherapy alone, and the other group was treated with initial radiotherapy along with adjuvant chemotherapy. The two groups were examined every month, and the failure time of interest is the time until the appearance of leukemia retraction; the objective is to test for a difference in failure times between the two treatments. Some patients missed several successive scheduled examinations and came back later with a changed clinical status, and these patients contributed IC observations. For convenience, we assume that in such a medical study the underlying survival function can be either discrete or continuous and that there are only finitely many scheduled examination times.

IC data provide only partial information about the lifetime of a subject and are therefore one kind of incomplete data. To deal with such data, Turnbull [1] introduced a self-consistent algorithm to compute the maximum likelihood estimate of the survival function for arbitrarily censored and truncated data. There have been related studies of IC data in the literature as well. For example, Mantel [2] extended Gehan’s [3, 4] generalized Wilcoxon [5] test to interval-censored data, and R. Peto and J. Peto [6] developed a different version. Sun [7] applied Turnbull’s algorithm to estimate the number of failures and risks of IC data and then proposed a log-rank type test.

Fay [8], Sun [7], Zhao and Sun [9], Sun et al. [10], and Huang et al. [11] extend the log-rank test to interval censored data. Petroni and Wolfe [12] and Lim and Sun [13] generalize Pepe and Fleming’s [14] weighted Kaplan-Meier (WKM) [15] test to interval censored data.

To compare the power of the test statistics, Fay [8] proposed a model for generating interval-censored observations. A similar selection scheme can also be seen in the urn model of Lee [16] and the mixed case model of Schick and Yu [17]. In this paper, we propose a Wilcoxon-type weighted rank test and compare it with the two existing Wilcoxon-type rank tests proposed by Mantel [2] and R. Peto and J. Peto [6]. We restrict ourselves to Wilcoxon-type rank tests because these tests are simple to use and are robust in the sense that their powers are fairly stable under different lifetime distributions.

This paper is organized as follows. In Section 2, we review Turnbull’s [1] algorithm and introduce Fay’s [8] selection model for generating interval-censored data. This selection model can be extended to a more general one, whose consistency property can be found in Schick and Yu [17]. In Section 3, we introduce Mantel’s [2] and R. Peto and J. Peto’s [6] generalized Wilcoxon-type rank tests and propose our weighted rank test. In Section 4, a simulation study is conducted to compare the performance of the three tests under different configurations. Finally, an application to an AIDS cohort study is presented in Section 5.

#### 2. Data Treatment

Assume that is the lifetime random variable of a survival study, measured in discrete units and taking values . Let be the collection of all admissible intervals, and define , where , so that , and . Note that the observed failure time data in a clinical trial can be discretized if the underlying variable is continuous.

##### 2.1. Turnbull’s Algorithm

Suppose that there is a sample of i.i.d. observations of , . Here, is the IC observation of the th individual in the sample, where , and . The case is to denote that the failure time of the th subject occurs after the last examination time . Turnbull [1] proposed an algorithm to estimate the unknown probabilities . The algorithm can be described by the following four steps.

*Step 1.* Start with initial values .

*Step 2.* Obtain improved estimates by setting

*Step 3.* Return to Step 1 with replacing .

*Step 4.* Stop when the required accuracy has been achieved.

The algorithm is simple and converges fairly rapidly. The estimate yielded from the iteration is in fact the unique maximum likelihood estimate of and is a self-consistent estimate.
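As a minimal sketch of the iteration in Steps 1 to 4, the Python function below applies the standard self-consistency update for a lifetime supported on finitely many points. The function name `turnbull_estimate`, the uniform starting values, the stopping tolerance, and the encoding of each observation as a half-open interval `(left, right]` (with `math.inf` marking a failure after the last examination) are assumptions made for illustration and are not notation taken from this paper.

```python
import math

def turnbull_estimate(intervals, support, tol=1e-8, max_iter=10000):
    """Self-consistency iteration for the probability mass on `support`.

    intervals: list of (left, right] observations; right may be math.inf.
    support:   candidate failure times (may include math.inf).
    Assumes every interval contains at least one support point.
    """
    n, m = len(intervals), len(support)
    # alpha[i][j] = 1 if support point j lies inside observation i.
    alpha = [[1.0 if left < s <= right else 0.0 for s in support]
             for (left, right) in intervals]
    p = [1.0 / m] * m  # Step 1: uniform initial values (an assumption).
    for _ in range(max_iter):
        new_p = [0.0] * m
        for row in alpha:
            # Probability of the observed interval under the current p.
            denom = sum(a * pj for a, pj in zip(row, p))
            for j in range(m):
                new_p[j] += row[j] * p[j] / denom
        new_p = [x / n for x in new_p]  # Step 2: improved estimates.
        # Step 4: stop when successive estimates agree to within tol.
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p  # Step 3: iterate again with the improved estimates.
    return p
```

For example, `turnbull_estimate([(0, 2), (0, 1), (1, 2)], [1, 2])` treats the first observation as spanning both support points and splits its mass according to the current estimates.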

##### 2.2. Return Probability Model

To reflect periodic clinical inspection, Fay [8] proposed a simulation model for generating IC data. He assumed that the indicators of whether a patient returns to the clinic for inspection at time points are i.i.d. Bernoulli random variables ; that is, , , , . means that the patient returned to the clinic at the inspection time , and means that the patient missed the inspection. In our model, we always assume that . The failure time is independent of , and the observable random interval is

###### 2.2.1. Model Consistency

Under Fay’s [8] selection model, the consistency property has been proved. This selection model can be generalized to the case that the return probability at each examination time point may be different; say that , . To demonstrate the generalized return model, we set and , , and . The selection probabilities for all admissible intervals are shown in Tables 1 and 2.

It is not difficult to see that the selection probability of the interval is where , , and . For instance, the interval can be selected in two ways. First, the true value of is , and the patient misses the inspection at and then attends the inspection at ; in this case, the interval is selected with probability . Second, the true value of is , and the patient misses the inspection at and then attends the inspection at ; in this case, the interval is selected with probability . Therefore, .

The generalized return probability model can be viewed as a special case of the mixed case model in Schick and Yu [17]; under very mild conditions, the estimate of computed by Turnbull’s algorithm is still consistent.

#### 3. Wilcoxon-Type Rank Tests for Interval Censored Data

The two-sample Wilcoxon rank test is a well-known method for testing whether two samples of exact data come from the same population. The method ranks the pooled samples and assigns an appropriate rank to each observation. However, this ranking technique is in general not applicable to intervals. In this section, we discuss how to generalize the ranking technique and then propose a Wilcoxon-type rank test for IC data, which we compare with the two existing rank tests proposed by Mantel [2] and R. Peto and J. Peto [6]. Suppose that two samples of IC data for and are, respectively, , and , . Testing whether these two samples come from the same population is equivalent to testing the equality of the survival functions and ; that is,

##### 3.1. Mantel’s Test

Mantel [2] extended Gehan’s [3, 4] generalized Wilcoxon test to interval censored data by defining the score of the th observation as the number of observations that are definitely greater than the th observation minus the number of observations that are definitely less than the th observation. He proposed the test statistic

Under , the test statistic is approximately normally distributed with mean 0 and variance
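The scoring rule described in this subsection can be sketched in Python as follows. This is an illustration under stated assumptions rather than the paper's exact formulas: observations are encoded as half-open intervals `(left, right]`, observation j is treated as "definitely greater" than observation i when `left_j >= right_i`, the statistic is taken to be the sum of the scores in the first sample, and the variance is the standard permutation variance of a linear rank statistic whose scores sum to zero. The names `mantel_scores` and `mantel_z` are hypothetical.

```python
import math

def mantel_scores(intervals):
    """Score of i = #(definitely greater than i) - #(definitely less than i)."""
    scores = []
    for (li, ri) in intervals:
        greater = sum(1 for (lj, rj) in intervals if lj >= ri)
        less = sum(1 for (lj, rj) in intervals if rj <= li)
        scores.append(greater - less)
    return scores

def mantel_z(sample1, sample2):
    """Standardized sum of first-sample scores (assumed form of the test)."""
    pooled = list(sample1) + list(sample2)
    n1, n2 = len(sample1), len(sample2)
    n = n1 + n2
    v = mantel_scores(pooled)  # scores are computed on the pooled sample
    w = sum(v[:n1])            # sum of scores belonging to sample 1
    # Permutation variance for scores that sum to zero (assumption).
    var = n1 * n2 * sum(s * s for s in v) / (n * (n - 1))
    return w / math.sqrt(var)
```

Because the pairwise comparison is antisymmetric, the pooled scores always sum to zero, which is what makes the mean of the statistic zero under the null hypothesis.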

##### 3.2. R. Peto and J. Peto’s Test

Unlike Mantel’s generalized version, R. Peto and J. Peto [6] defined the score of the th observation as where is the estimated survival function, ; hence, . They proposed the test statistic Under , the test statistic is approximately distributed as .

##### 3.3. Our Proposed Wilcoxon-Type Weighted Rank Test

To transform IC data into exact data, we first assign each inspection time a primary rank ; for instance, . Rewrite any observation, say , as , where , and . Then, we associate the observation with the weighted rank

Let , be, respectively, the average weighted ranks of the and samples, so that

To test whether two IC samples come from the same population, we propose the test statistic

Under , the central limit theorem implies that W.R.T is approximately distributed as a standard normal random variable. However, the mean and variance of and may depend on the probability space on which they are defined; that is, different selection probabilities for IC intervals in (4) lead to different means and variances of and . We therefore consider only the selection model of Fay defined in Section 2.2. In this model, the selection probability of an IC interval falls into one of the following categories:

Consider the probability space (), where the probability measure is defined in Section 2. To compute the variance of and , we define a random variable on this space by assigning value to the interval in , where

The value can be viewed as the weighted rank of . If , are chosen as in the Wilcoxon test for exact data, then our proposed test statistic W.R.T is a Wilcoxon-type weighted rank test. Under this probability space, the expectation can be simplified as in the following theorem.

Theorem 1. *Suppose that is the random variable defined on the probability space according to (17). Then, the expectation of , , can be simplified as*

*which is independent of the choice of .*

*Proof.* It is obvious that can be written as , where the coefficients , are to be determined. The theorem is, hence, proved if we can show that all the coefficients are ones.

Consider first. An interval contributes in if and only if it contains the point . Therefore, it must be of the form , . For intervals , , the probabilities are defined in (13). For interval , the probability is defined in (14). Therefore, the coefficient is

Next, consider the coefficient for . An interval contributes if and only if it contains the point . Therefore, it must be of the form , where . It is necessary to study the contribution of the interval to in four different categories.

(i) By (13), this category contributes .

(ii) By (14), the interval contributes .

(iii) By (15), this category contributes .

(iv) By (16), this category contributes .

Consequently, the coefficient of is

Finally, the proof for the case is

The variance of , , is where and are the selection probability and the weighted rank of the th admissible interval of , respectively, .
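Since the random variable takes one weighted rank per admissible interval, its mean and variance are those of a discrete distribution. The following minimal sketch computes them with the usual identity Var(X) = E(X^2) - (E X)^2, which we assume corresponds to the formula described above; the function name and the two parallel lists `probs` and `ranks` are hypothetical.

```python
def discrete_mean_and_variance(probs, ranks):
    """Mean and variance of a variable taking ranks[i] with probability probs[i]."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("selection probabilities must sum to 1")
    mean = sum(p * r for p, r in zip(probs, ranks))
    second_moment = sum(p * r * r for p, r in zip(probs, ranks))
    return mean, second_moment - mean ** 2
```

In practice the selection probabilities would be replaced by their estimates, so the resulting variance is itself an estimate.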

From formulas (13)–(16), the selection probability depends on and ; therefore, the likelihood function can be written as where , and are positive integers determined by the sample. Since the probability can be estimated by Turnbull’s [1] algorithm discussed in Section 2.1, can also be estimated trivially by .

For demonstration, we set , inspection times , , and the true lifetime is exponentially distributed with . For sample sizes , 100, and 150, return probabilities of inspection , , and 0.3, and a simulation with 100 replications, Table 3 presents the estimates of and the sample variance and sample deviation of . To show the normality of W.R.T, we assume that the two populations (sample size ) come from the same exponential (1/5) distribution. Based on a simulation with 10000 replications and return probabilities of inspection , , and 0.3, Table 4 presents the quantiles of W.R.T and . Figure 1 shows the CDF plots of and W.R.T with .

#### 4. Simulation Study

In this section, we carry out simulation studies to compare the performance of the W.R.T test with Mantel’s [2] and R. Peto and J. Peto’s [6] tests. In the study, we assume that the failure time random variable is exponentially distributed, total sample sizes are and 200, and each sample has subjects. The interval-censored data are generated by the following four steps.

*Step 1.* Generate a failure time from some distribution.

*Step 2.* Create a 0, 1 sequence with probabilities , , and .

*Step 3.* The observation is , if , and .

*Step 4.* Repeat Step 1 to Step 3 for times.
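The four steps above can be sketched in Python as follows. Several details are assumptions made for illustration: failure times are exponential with hazard `rate`, time 0 is always available as a left endpoint, each scheduled inspection is attended independently with probability `return_prob`, the observation is the half-open interval `(left, right]` bracketed by the nearest attended inspections, and `right` is infinite when no inspection is attended at or after the failure time. The function name `generate_ic_sample` is hypothetical.

```python
import math
import random

def generate_ic_sample(n, rate, inspection_times, return_prob, rng):
    """Generate n interval-censored observations (left, right]."""
    sample = []
    for _ in range(n):
        # Step 1: exponential failure time with the given hazard rate.
        t = rng.expovariate(rate)
        # Step 2: independent 0/1 attendance at each scheduled inspection.
        attended = [s for s in inspection_times if rng.random() < return_prob]
        # Step 3: bracket t by the nearest attended inspection times.
        left = max((s for s in attended if s < t), default=0.0)
        right = min((s for s in attended if s >= t), default=math.inf)
        sample.append((left, right))
    # Step 4 is the loop over the n subjects.
    return sample
```

A call such as `generate_ic_sample(50, 1/3, [1, 2, 3, 4, 5, 6], 0.5, random.Random(0))` uses illustrative inspection times; passing an explicit `random.Random` instance keeps replications reproducible.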

We consider three return probabilities, , , and 0.3, two sets of inspection time points, , and 1000 replications at significance level 0.05.

In the case of , 6 return points, we set the hazard to 1/3 for population 1 and to for population 2. Figure 2 shows the density plot of the exponential distribution with , −0.2, 0, 0.2, 0.4. In the case of , 10 return points, we set the hazard to 1/4 for population 1 and to for population 2. Figure 3 shows the density plot of the exponential distribution with , −0.3, 0, 0.3, 0.6. Tables 5 and 6 present the powers of the three tests with sample size and 200. The simulation results show that when the failure times come from the exponential distribution, our proposed test W.R.T is the most powerful of the three tests.
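The empirical power reported in such a study is the fraction of replications in which the test rejects the null hypothesis. The sketch below shows one way to organize that loop; it assumes a two-sided test based on a statistic that is approximately standard normal under the null hypothesis, and the names `empirical_power`, `draw_samples`, and `z_statistic` are hypothetical placeholders for a data generator and a test of the reader's choice.

```python
import random
from statistics import NormalDist

def empirical_power(draw_samples, z_statistic, replications=1000,
                    alpha=0.05, seed=0):
    """Fraction of replications in which |z| exceeds the critical value.

    draw_samples(rng) must return a pair (sample1, sample2).
    z_statistic(sample1, sample2) must return a standardized statistic.
    """
    rng = random.Random(seed)
    # Two-sided critical value of the standard normal (assumption).
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(replications):
        sample1, sample2 = draw_samples(rng)
        if abs(z_statistic(sample1, sample2)) > critical:
            rejections += 1
    return rejections / replications
```

When both populations share the same distribution, the same function estimates the empirical size of the test, which should be close to the nominal level 0.05.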

#### 5. An Application to AIDS Cohort Study

Consider the data on 262 hemophilia patients in De Gruttola and Lagakos [18]. Among them, 105 patients received at least 1,000 μg/kg of blood factor for at least one year between 1982 and 1985, and the other 157 patients received less than 1,000 μg/kg in each year. In this medical study, patients were treated between 1978 and 1988, and the observations (] for the 262 patients are based on a discretization of the time axis into 6-month intervals. The failure time of interest is the time of HIV seroconversion. The objective is to test for a difference in failure times between the two treatments. Applying our proposed test (W.R.T), Mantel’s [2] test, and R. Peto and J. Peto’s [6] test to this data set gives test statistics of −7.815, −7.352, and 56.476, respectively. All three p-values are less than 0.001, and the tests reach the same conclusion: the times to HIV seroconversion in the two groups of patients differ significantly.