Abstract

Many statisticians rely on the asymptotic normal approximation to carry out inference for statistical tests, especially nonparametric ones. In this article, the saddlepoint approximation method is proposed as an alternative to the asymptotic normal approximation for a number of nonparametric tests for an important type of data that appears frequently in clinical studies such as cancer and tumorigenicity studies. In clinical trials, there are many strategies through which treatments are assigned to patients. Equal allocation to the two treatments is a widely used approach for eliminating experimental bias and increasing power. Accordingly, the statistical analysis is carried out under the truncated binomial design, one of the designs that achieve perfect balance between the two treatments. To assess the accuracy of the proposed approximation method, two sets of real data are analyzed and a comprehensive simulation study is carried out.

1. Introduction

Medical and epidemiological studies are performed mainly to measure the occurrence of an event. This event of interest is often the onset of a disease, the disappearance of the disease’s symptoms, the attainment of a biochemical marker, or death. In many cases, the event of interest may happen repeatedly, for instance, occurrences of the same infection, such as recurrent pyogenic infections, multiple upper respiratory infections, recurrent bacterial infections, and repeated occurrences of certain tumors. Data collected from studies that include such recurrent events can be distinguished according to the way the items under study are monitored. If the items are monitored continuously and the times of all occurrences of the recurrent events are available, the data are commonly referred to as recurrent event data [1–3]. If, instead, the items cannot be monitored continuously and can only be examined at prescheduled time points, then only the numbers of occurrences of the events between consecutive prescheduled times are available. In this case, the data are called panel count data [4–6]. We focus here on the second type, panel count data, for two main reasons. First, continuous observation of the items under study may be too expensive. Second, in some cases it is impossible to monitor the items continuously. When each item under study is examined only once, a special case of panel count data arises, commonly known in the literature as current status data [7]. In this case, the data consist of the total number of occurrences of the recurrent event up to the observation time. Panel count data frequently appear in reliability and medical follow-up studies [5, 6, 8–10]. A number of authors have proposed nonparametric tests for panel count data, among them Thall and Lachin [6]; Sun and Fang [11]; Zhang [12]; Park et al. [13]; and Balakrishnan and Zhao [14]. Current status data appear in tumorigenicity studies that involve the incidence rate of particular tumors; in these studies, only the number of tumors that occurred prior to the animal’s death or sacrifice is known. Demographic studies are another field that commonly generates current status data [15, 16]. Comparison of the rate of development of the recurrent event under different treatments has received considerable attention (see Dinse and Lagakos [17]; Finkelstein [18]; Huang [19]; and Sun and Kalbfleisch [20]). Almost all of the approaches presented in the literature require that all subjects’ observation times follow the same distribution. Sun [21] proposed a permutation test that can be applied when the observation time distributions differ between groups.

Clinical trial design involves choosing a strategy for assigning treatments to patients. The assignment should be randomized to avoid selection bias, which may be intentional or unintended. There are many randomization designs, and they form the basis for statistical inference. The simplest is the complete randomization design, in which a fair coin is tossed each time a patient has to be assigned to a treatment. Complete randomization provides the highest level of randomness but may lead to treatment imbalances that can accidentally bias trial results. Thus, equal allocation to the two treatments has become a dominant approach in clinical trials to eliminate experimental bias and increase power [22]. The random allocation rule (RAR) and the truncated binomial design (TBD) are the most commonly used designs for ensuring balance between treatments. For sample size N, the RAR selects at random from all possible sequences of treatment assignments that place N/2 patients on each treatment. Under the TBD, a fair coin is tossed until one treatment has been assigned to N/2 patients; all subsequent patients receive the other treatment. Rukhin [23] pointed out that the RAR and the TBD behave quite differently: the target fulfillment moments for the TBD happen much earlier than for the RAR. In this article, we propose a method to approximate the exact p-values of a class of nonparametric tests for current status and panel count data under the TBD. It is worth noting that Abd-Elfattah [24] approximated the p-values of a set of nonparametric tests for current status and panel count data under the RAR.
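The two balanced designs described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the seed, the sample size N = 20, and the number of replicates are all hypothetical choices and are not taken from the article. The sketch generates RAR and TBD sequences and records when one arm first reaches N/2 patients, which is the quantity behind the remark attributed to Rukhin [23].

```python
import random

def random_allocation_rule(n, rng):
    """RAR: choose uniformly among all sequences with exactly n/2 ones."""
    seq = [1] * (n // 2) + [0] * (n // 2)
    rng.shuffle(seq)
    return seq

def truncated_binomial_design(n, rng):
    """TBD: toss a fair coin until one arm has n/2 patients, then fill the other arm."""
    half = n // 2
    seq, ones = [], 0
    while ones < half and len(seq) - ones < half:
        z = rng.randint(0, 1)          # fair coin: 1 = treatment, 0 = control
        seq.append(z)
        ones += z
    fill = 0 if ones == half else 1    # remaining patients get the other arm
    seq.extend([fill] * (n - len(seq)))
    return seq

def first_full_arm(seq):
    """Number of patients enrolled when one arm first reaches n/2."""
    half, ones = len(seq) // 2, 0
    for i, z in enumerate(seq, start=1):
        ones += z
        if ones == half or i - ones == half:
            return i

rng = random.Random(2023)              # hypothetical seed
n, reps = 20, 2000                     # hypothetical sizes
tbd = truncated_binomial_design(n, rng)
rar = random_allocation_rule(n, rng)
mean_tbd = sum(first_full_arm(truncated_binomial_design(n, rng)) for _ in range(reps)) / reps
mean_rar = sum(first_full_arm(random_allocation_rule(n, rng)) for _ in range(reps)) / reps
print(mean_tbd, mean_rar)
```

Both designs always end with N/2 patients per arm; in this sketch the average enrollment count at which an arm fills up is smaller for the TBD than for the RAR.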

Approximations are routinely used in statistics, often to evaluate intractable integrals. They are typically developed under the assumption that the sample size becomes infinitely large. In general, asymptotic approaches lead to approximations that are accurate to the first order. Saddlepoint methods provide higher-order approximations whose precision holds even for very small sample sizes. Furthermore, saddlepoint approximations are particularly precise in the tails of a distribution and are frequently employed as tail-probability approximations. Daniels [25] introduced the most fundamental saddlepoint approximation to the statistical community; it is a method for approximating the density or mass function of a statistic from its cumulant generating function (CGF). Daniels explained that the saddlepoint approximation is a substantial improvement over the approximations provided by both the central limit theorem and Edgeworth expansions. Daniels’ highly influential article on developing high-order approximations in statistics led several authors to provide approximations for a number of statistical functions and probability quantities. For example, Lugannani and Rice [26] derived a saddlepoint approximation for tail probabilities. Skovgaard [27] proposed an approximation for conditional probabilities; the saddlepoint approximation for conditional probabilities is known as a double saddlepoint approximation. Wang [28] developed a saddlepoint approximation for bivariate probabilities. For saddlepoint approximations of multivariate probabilities, see Kolassa [29]. A number of statisticians have applied these saddlepoint approximations to statistical problems, including Daniels [30]; Robinson [31]; Daniels [32]; Daniels [33]; Davison and Hinkley [34]; Daniels [35]; and Booth and Butler [36]. Since then, saddlepoint applications have spread to many statistical fields, such as survival analysis [37, 38], regression analysis [39, 40], reliability analysis [41, 42], bootstrapping [43, 44], resampling analysis [28, 34], and nonparametric statistics (Abd-Elfattah and Butler [45, 46]; Abd El-Raheem and Abd-Elfattah [47, 48]; Kamal et al. [49, 50]). Only a few applications and references are mentioned here; for rigorous proofs and more applications, see Butler [51].
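To make the idea of a CGF-based tail approximation concrete, the following Python sketch applies the standard continuous Lugannani and Rice formula to a toy case that is not part of this article: the upper tail of a Gamma(n, 1) variable, whose CGF is K(s) = -n log(1 - s) and whose exact tail is available in closed form for integer n. The function names and the values n = 5 and x = 10 are illustrative assumptions.

```python
import math

def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gamma_tail_exact(n, x):
    """Pr(X >= x) for X ~ Gamma(n, 1) with integer shape n."""
    return math.exp(-x) * sum(x ** k / math.factorial(k) for k in range(n))

def gamma_tail_saddlepoint(n, x):
    """Lugannani-Rice approximation; not defined at the mean x = n."""
    s_hat = 1.0 - n / x                       # solves K'(s) = n / (1 - s) = x
    k_hat = -n * math.log(1.0 - s_hat)        # K(s_hat)
    k2_hat = n / (1.0 - s_hat) ** 2           # K''(s_hat)
    w = math.copysign(math.sqrt(2.0 * (s_hat * x - k_hat)), s_hat)
    u = s_hat * math.sqrt(k2_hat)
    return 1.0 - normal_cdf(w) + normal_pdf(w) * (1.0 / u - 1.0 / w)

def gamma_tail_normal(n, x):
    """Central-limit (normal) approximation with mean n and variance n."""
    return 1.0 - normal_cdf((x - n) / math.sqrt(n))

n, x = 5, 10.0                                # hypothetical illustration values
exact = gamma_tail_exact(n, x)
sp = gamma_tail_saddlepoint(n, x)
na = gamma_tail_normal(n, x)
print(exact, sp, na)
```

Even with a shape parameter as small as 5, the saddlepoint value in this toy case is much closer to the exact tail probability than the normal approximation, which illustrates the tail accuracy described above.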

Abd-Elfattah [52] applied the double saddlepoint approximation technique to approximate the tail probability of the linear rank class for ordinary survival data under the TBD. Here, we extend the results of Abd-Elfattah [52] to an important type of data, current status and panel count data, which appear frequently in demographic studies, acquired immunodeficiency syndrome studies, tumorigenicity experiments, and human immunodeficiency virus studies.

Section 2 presents a number of nonparametric tests for panel count and current status data, which are then rewritten as a linear combination of the treatment indicator and a score function. The proposed method for approximating the tail probability of these test statistics is presented in Section 3. Section 4 assesses the accuracy of the proposed method relative to the normal approximation method by analyzing a number of real data sets and by conducting a comprehensive simulation study. Finally, conclusions are presented in Section 5.

2. Nonparametric Tests for Current Status and Panel Count Data

This section contains the statistics of a number of nonparametric tests for the panel count and current status data, which are finally reformatted as a linear combination of the treatment indicator and score function.

2.1. Nonparametric Tests for Current Status Data
2.1.1. Nonparametric Test Procedure I

Suppose a tumorigenicity study with N independent patients divided into two groups: control and treatment. Such a tumorigenicity study aims to test the null hypothesis that there is no difference in the tumor propagation rate in the two groups. The observed data for the jth patient is (, , ), where is the treatment indicator, is the observed time, and is the tumor indicator at . Let denote the cumulative distribution function of the common propagation of the tumor, be the maximum likelihood estimate, MLE, of , and represent the distinct ordered times of . Under the premise that the tumor is not fatal, i.e., that the time of observation and time to event are independent, a log-rank type test statistic for testing null hypothesis, , is given by the following equation [20]:where is a prespecified weight, at , and is the weighted isotonic regression of with weights . Thus,and [53]. In the previous expression, is the number of subjects who died or were sacrificed at time , is the number of tumors discovered at the time , is the number of the blocks of the isotonic estimate, and , a partition of , the blocks of the isotonic estimate of . The statistic is asymptotically normally distributed with mean 0 and standard deviation as follows:where . It should be stated that the statistic with reduces to the statistics of Hoel and Walburg [54] and to the statistic of Finkelstein [18] when .

2.1.2. Nonparametric Test Procedure II

Suppose an experiment in which independent individuals are exposed to recurrent events over time. Let be the number of events that happened for the jth subject at time x and be the conditional expectation function of given the indicator variable of individual j, . The observed data for the jth individual is (, , ), where denote the observed time for the jth individual and is the number of events that occurred at time . Supposing that the observed times are independent of given , and follow the same distribution for both treatment groups, a statistic for testing the null hypothesis , that is independent of , is given by the following equation [7]:where is the isotonic regression estimate (IRE) of , which can be evaluated using the pool-adjacent-violators algorithm [53].
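The pool-adjacent-violators algorithm mentioned above can be sketched as follows. This is a generic, minimal implementation of weighted nondecreasing isotonic regression, not the authors' code; the function name and the toy input are assumptions, and the observations are assumed to be already sorted by observation time.

```python
def pava(y, w=None):
    """Weighted nondecreasing isotonic regression by pool-adjacent-violators."""
    if w is None:
        w = [1.0] * len(y)
    blocks = []  # each block: [weighted mean, total weight, number of points]
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # pool backwards while adjacent block means violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / wt, wt, c1 + c2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)   # every point in a block gets the block mean
    return fit

# Hypothetical event counts observed at increasing observation times
counts = [1, 3, 2, 4, 3, 5]
fitted = pava(counts)
print(fitted)
```

The fitted values are constant within each pooled block, which is why estimates of this kind are described in terms of the blocks of the isotonic estimate.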

It should be noted that if are independent Poisson processes with mean function (MF), thenwhere is the baseline MF and η is regression coefficient, then the statistic in equation (4) seems as a score test for from the conditional likelihood about η. The statistic is asymptotically normally distributed with mean 0 and standard deviation as follows:where and .

2.2. Nonparametric Comparison for Panel Count Data
2.2.1. Nonparametric Comparison Procedure I

Suppose a recurrent event study with N independent individuals produce only panel count data. Let be the observation times for the jth individual. As in Section 2.1.2, let be the treatment indicator of jth individual, and , , be the observed value of at time . Assume that all individuals are drawn from two populations or receive one of two different treatments, and the purpose is to determine whether there is a treatment difference based on the observed panel count data. Let and represent the MF of ’s corresponding to the individuals receive treatments A and B, respectively. Therefore, the null hypothesis can be formulated as . Then, to test , a log-rank type test statistic is given by the following equation [11]:where is IRE of under the null hypothesis . Statistic is a generalization of the statistic for current status data. Statistic , asymptotically, has a normal distribution with mean 0 and variance that can be estimated by the following equation:

2.2.2. Nonparametric Comparison Procedure II

Balakrishnan and Zhao [14] developed test procedures for the null hypothesis described in Section 2.2.1 by replacing the IRE with the nonparametric maximum likelihood estimator, NPMLE. The statistic of Balakrishnan and Zhao [14] is given by the following equation:where , , and is the NPMLE of the common MF under the null hypothesis . Statistic in equation (9) has an asymptotic normal distribution with mean 0 and estimated variance as follows:

The statistics mentioned above can all be included in a class of tests that takes the linear form

$$S = \sum_{j=1}^{N} Z_j a_j, \tag{11}$$

where $Z_j$ is the treatment indicator and $a_j$ is a score function that varies according to the test.

3. Saddlepoint P Value Procedures

In this section, the saddlepoint approximation method is applied to approximate the exact p value of the class of nonparametric tests for panel count and current status data (11) under TBD.

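Let $Z_1, \ldots, Z_N$ be the random treatment indicators of the TBD, where $Z_j = 0$ or $1$ if control or treatment is assigned, respectively, $j = 1, \ldots, N$. Blackwell and Hodges [55] defined the stopping rule of the TBD as follows:

$$\tau = \min\left\{ n : \max\left( \sum_{j=1}^{n} Z_j,\; n - \sum_{j=1}^{n} Z_j \right) = \frac{N}{2} \right\}, \tag{12}$$

where $Z_1, Z_2, \ldots$ are independent Bernoulli trials until one treatment is assigned to $N/2$ patients.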
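As a small worked illustration of this stopping rule, write τ (an assumed symbol) for the number of patients enrolled when one treatment first reaches N/2 assignments under fair-coin tosses. The event τ = k requires the k-th toss to complete one arm, with N/2 - 1 of the first k - 1 tosses already on that arm, so Pr(τ = k) = 2 C(k - 1, N/2 - 1) (1/2)^k for k = N/2, ..., N - 1. The Python sketch below evaluates this distribution; the function name and N = 20 are hypothetical.

```python
from math import comb

def tbd_stopping_pmf(n):
    """Pr(tau = k): first time either arm reaches n/2 under fair-coin tosses."""
    m = n // 2
    # factor 2: either arm may be the one that fills up first
    return {k: 2 * comb(k - 1, m - 1) * 0.5 ** k for k in range(m, 2 * m)}

pmf = tbd_stopping_pmf(20)                      # hypothetical N = 20
total = sum(pmf.values())
mean_tau = sum(k * p for k, p in pmf.items())
print(total, mean_tau)
```

After τ, the remaining N - τ assignments are deterministic, which is why inference under the TBD has to account for the distribution of the stopping time.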

Let be the observed value of statistic S in equation (11) and is the p value of the S, then the saddlepoint approximation of under TBD is given by the following equation [52]:

The conditional probabilities in equation (13) can be approximated using Skovgaard’s saddlepoint approximation [27] as follows:where and are the standard normal distribution function and density function, respectively,and is the cumulant generating function of . The saddlepoints and solve the system of equations and , respectively.
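For orientation, the general continuous form of Skovgaard's double saddlepoint approximation, as it appears in textbook treatments such as Butler [51], is sketched below in generic notation. The symbols here are assumptions made for illustration and need not match the notation of equations (13) and (14): $K(s, t)$ denotes the joint CGF of a pair $(X, Y)$, $(\hat{s}, \hat{t})$ solves $K'_s(\hat{s}, \hat{t}) = x$ and $K'_t(\hat{s}, \hat{t}) = y$, $\hat{s}_0$ solves $K'_s(\hat{s}_0, 0) = x$, and $|K''|$ is the determinant of the Hessian of $K$.

```latex
\begin{align*}
\Pr(Y \ge y \mid X = x) &\approx 1 - \Phi(\hat{w}) - \phi(\hat{w})\left(\frac{1}{\hat{w}} - \frac{1}{\hat{u}}\right), \\
\hat{w} &= \operatorname{sgn}(\hat{t})\sqrt{2\left[\left\{K(\hat{s}_0, 0) - \hat{s}_0 x\right\} - \left\{K(\hat{s}, \hat{t}) - \hat{s} x - \hat{t} y\right\}\right]}, \\
\hat{u} &= \hat{t}\sqrt{\frac{\left|K''(\hat{s}, \hat{t})\right|}{K''_{ss}(\hat{s}_0, 0)}}.
\end{align*}
```

For lattice-valued statistics, a continuity-corrected variant of $\hat{u}$ is normally used in place of the continuous form shown here.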

4. Illustrative Examples and Some Numerical Results

This section is dedicated to clarifying the proposed procedures for approximating the exact p value of the class of nonparametric tests for panel count and current status data under TBD that were discussed in the previous sections. Firstly, two sets of real current status and panel count data are analyzed to illustrate the proposed procedures. After that, the accuracy of the proposed method, saddlepoint approximation, is compared with the traditional method, normal approximation, by carrying out a simulation study. Two nonparametric procedures were presented in Section 2 for current status data and another two for panel count data. In this section, one statistic is selected for each data type and applied to it, and the application to other statistics will be the same. These statistics are the statistic in equation (4) for the current status data and the statistic in equation (7) for the panel count data.

4.1. Illustrative Examples

We illustrate the proposed procedures for approximating the exact p value using two sets of real current status and panel count data. The first data set, arising from a tumorigenicity study, is taken from Ii et al. [56]. The data are the numbers of tumors detected at the time of death for two groups of male and female mice. The goal of the study was to compare tumor development rates in the male and female groups. Two random samples of size and 100 are selected from the data of Ii et al. [56] as examples of small and large samples. In Table 1, the results for these data sets are referred to as data set 1 and data set 2, respectively. In the table, the approximated p values are calculated using the normal approximation method (NAM) and the saddlepoint approximation method (SPAM). The table also contains the simulated exact mid-p-value (Exact). The simulated exact mid-p-values are computed by simulating million truncated binomial sequences for the control and treatment indicators and then calculating the proportion of times that S is greater than its observed value plus half the proportion of times that S equals its observed value.
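The simulated mid-p-value described above can be sketched in Python as follows. The scores, the observed assignment, the seed, and the number of simulated sequences are hypothetical values chosen only to keep the example fast; they are not the article's data or settings, and the statistic is taken to be the linear form of equation (11).

```python
import random

def tbd_sequence(n, rng):
    """One truncated binomial sequence of 0/1 treatment indicators."""
    half, ones, seq = n // 2, 0, []
    while ones < half and len(seq) - ones < half:
        z = rng.randint(0, 1)
        ones += z
        seq.append(z)
    fill = 0 if ones == half else 1
    return seq + [fill] * (n - len(seq))

def simulated_mid_p(scores, z_obs, n_sim=20000, seed=1):
    """Estimate Pr(S > s_obs) + 0.5 * Pr(S = s_obs) over simulated TBD sequences."""
    rng = random.Random(seed)
    s_obs = sum(z * a for z, a in zip(z_obs, scores))
    greater = equal = 0
    for _ in range(n_sim):
        z = tbd_sequence(len(scores), rng)
        s = sum(zj * aj for zj, aj in zip(z, scores))
        if abs(s - s_obs) < 1e-12:     # tolerance guards against float round-off
            equal += 1
        elif s > s_obs:
            greater += 1
    return (greater + 0.5 * equal) / n_sim

scores = [3, -1, 2, 0, -2, 1, -3, 0]   # hypothetical scores
z_obs = [1, 0, 1, 0, 0, 1, 0, 1]       # hypothetical observed assignment
p_mid = simulated_mid_p(scores, z_obs)
print(p_mid)
```

Because TBD sequences are not equally likely, the simulation draws sequences from the design itself rather than permuting the observed labels uniformly.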

The second illustration is for panel count data collected from a study of superficial bladder tumors [8, 57]. All tumors detected at the start of the study were removed, and the patients were randomly assigned to a treatment or control group. At follow-up visits, patients were examined, new tumors were removed, and treatment continued. Two portions of this data set, of small and intermediate size, are used to illustrate the proposed procedures. In Table 1, the results for these data sets are referred to as data set 3 and data set 4, respectively. The results in Table 1 show a clear superiority of the saddlepoint approximation method over the normal approximation method in approximating the exact p value of the tests.

4.2. Some Numerical Results

A simulation study is carried out under various situations to compare the accuracy of the proposed method and the commonly used approximation method. For the current status data, we generate from Poisson processes, where , σ is constant and , . For each sample size , one thousand data sets are generated from this Poisson process with . For each data set, the treatment labels, , are chosen by simulating a TBD sequence.

In the same way, one thousand panel count data sets of sizes, , are generated by considering a follow-up study in which patients are followed-up quarterly. The number of visits for each patient is distributed uniformly over the 3 visits. For the number of observations, , we generate ’s from a discrete uniform distribution , which are integers selected to give specific percentages of missing observations. Let the number of tumors discovered at each follow-up visit follow a Poisson process with MF , where and , , . For the purpose of comparing the accuracy of the saddlepoint approximation and normal approximation methods, the simulated exact mid-p-value, normal approximation p value, and saddlepoint approximation p value are calculated for each simulated data set. After that, the percentage of closeness of the saddlepoint approximation method to the simulated method is calculated as one of the comparison criteria between the two methods. This criterion is included in the Tables 2 and 3 under the heading “Sad.Perc.” In addition, two other criteria are used to compare the two approximation methods, namely the mean square error, Mse, and mean relative absolute error, Rae. We can define the Mse of the saddlepoint approximation p values and normal approximation p values as and , respectively, where and is the simulated exact mid-p-value. Also, the Rae of the saddlepoint approximation p values and normal approximation p values can be defined as , and , respectively. The results of the simulation study are presented in Tables 2 and 3.
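The three comparison criteria can be computed as in the Python sketch below. Since the defining formulas are not legible in this version of the text, the sketch assumes the usual definitions: Mse as the mean squared difference from the simulated exact mid-p-value, Rae as the mean absolute difference divided by the simulated exact mid-p-value, and "Sad.Perc" as the percentage of data sets in which the saddlepoint p value is strictly closer to the simulated exact value than the normal p value. The function name and the toy p values are hypothetical.

```python
def compare_approximations(p_exact, p_sad, p_norm):
    """Mse, Rae, and percentage of data sets where the saddlepoint is closer."""
    n = len(p_exact)
    err_sad = [abs(s - e) for s, e in zip(p_sad, p_exact)]
    err_norm = [abs(m - e) for m, e in zip(p_norm, p_exact)]
    return {
        "mse_sad": sum(d * d for d in err_sad) / n,
        "mse_norm": sum(d * d for d in err_norm) / n,
        # relative errors assume every simulated exact mid-p-value is positive
        "rae_sad": sum(d / e for d, e in zip(err_sad, p_exact)) / n,
        "rae_norm": sum(d / e for d, e in zip(err_norm, p_exact)) / n,
        "sad_perc": 100.0 * sum(a < b for a, b in zip(err_sad, err_norm)) / n,
    }

# Hypothetical p values for two simulated data sets
result = compare_approximations(
    p_exact=[0.10, 0.20], p_sad=[0.11, 0.19], p_norm=[0.15, 0.195]
)
print(result)
```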

The results in Tables 2 and 3 show the superiority of the saddlepoint approximation method over the normal approximation method in approximating the exact p-value of the class of nonparametric tests for panel count and current status data under the TBD in terms of Mse and Rae. For example, in the case of sample size with of Table 3, the saddlepoint p-values were closer to the simulated exact mid-p-values of the time, and the mean relative absolute error for the saddlepoint approximation method was versus for the normal approximation method. When averaged over all simulations, the saddlepoint approximation p-values were closer to the simulated exact mid-p-values and of the time for current status and panel count data, respectively.

5. Conclusion

In this paper, the saddlepoint approximation method is proposed as a more accurate alternative to the normal approximation method for approximating the exact p value of the class of nonparametric tests for panel count and current status data under the TBD. Such data appear frequently in clinical and tumorigenicity studies. The TBD was applied to ensure perfect balance between the two treatment groups, which is a requirement of many clinical studies. The accuracy of the proposed method was verified by analyzing a number of real data sets and carrying out a comprehensive simulation study.

Data Availability

The data sets generated and analyzed in this study are simulated data created using the R program.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The second author extends her appreciation to the Deanship of Scientific Research at King Khalid University for funding this work through a research groups program under grant R. G. P. 2/144/43.