Research Article | Open Access
Samane Nematolahi, Sahar Nazari, Zahra Shayan, Seyyed Mohammad Taghi Ayatollahi, Ali Amanati, "Improved Kaplan-Meier Estimator in Survival Analysis Based on Partially Rank-Ordered Set Samples", Computational and Mathematical Methods in Medicine, vol. 2020, Article ID 7827434, 11 pages, 2020. https://doi.org/10.1155/2020/7827434
Improved Kaplan-Meier Estimator in Survival Analysis Based on Partially Rank-Ordered Set Samples
This study presents a novel methodology for the nonparametric estimation of a survival probability under random censoring time, using the ranked observations from a Partially Rank-Ordered Set (PROS) sampling design, and employs it in a hematological disorder study. The PROS sampling design has numerous applications in medicine, social sciences, and ecology, where the exact measurement of the sampling units is costly but the units can be ordered by judgment ranking or available concomitant information. The general estimation methods are not directly applicable when samples come from rank-based sampling designs, because the sampling units do not meet the identically distributed assumption. We derive the asymptotic distribution of a Kaplan-Meier (KM) estimator under the PROS sampling design. Finally, we compare the performance of the suggested estimators via several simulation studies and apply the proposed methods to a real data set. The results show that the proposed estimator under rank-based sampling designs outperforms its counterpart under simple random sampling (SRS).
Ranked set sampling (RSS) was first introduced by McIntyre. It provides a more structured method for collecting the sample units. A generalization of RSS is the PROS sampling design. The two methods are similar, with one clear difference: in the PROS sampling design used in this paper, the ranker divides the sampling units into ranked subsets of prespecified sizes based on their partial ranks. These sampling designs are techniques for obtaining more representative samples from the underlying population when measurement of the units is costly and/or time-consuming. In such designs, sampling units are ordered fairly accurately by using available auxiliary information, which may itself be costly to some extent.
After the PROS sampling design was introduced by Ozturk, many statisticians became interested in this rank-based sampling method. For example, Ozturk and Frey have relaxed the assumption concerning the prespecification of the number of subsets in each set. Nazari et al. have developed nonparametric kernel density estimators using PROS data. Hatefi et al. have applied PROS sampling in mixture modeling to estimate the age structures of short-lived fish species. Ozturk has used the properties of PROS samples under multiple auxiliary information in the estimation of the population mean and total in finite population settings. Nazari et al. have estimated the distribution function using PROS samples. Hatefi et al. have studied the information and uncertainty structures of PROS data.
Currently, survival analysis is one of the important statistical tools for analyzing data from medical studies and the social sciences. The presence of censored observations distinguishes survival analysis from other statistical analyses. However, survival studies are expensive because they need a large sample size and a potentially long follow-up duration. For the sake of economy, we may consider cost-effective sampling methods in which only a small proportion of the available units is measured, yet the measured units carry a portion of the information contributed by all of the units.
In this study, we develop the KM nonparametric estimator using the PROS sampling design. The KM estimator estimates the probability that a person survives longer than a specific time, which is fundamental in survival analysis. We study the asymptotic properties of this new estimator and compare it with its SRS and RSS counterparts. What distinguishes the present research from previous work is that we employ the PROS sampling design for incomplete data containing censored observations, whereas all earlier research on the PROS sampling design has been concerned with inference for complete data. Only a few results are available for incomplete data, and in those the sampling design is based on RSS rather than PROS samples. For example, Yu and Tam have considered maximum likelihood estimation of the parameters of the log-normal distribution and have introduced a KM estimator for RSS. Zhang et al. have used RSS to construct the KM estimator of a reliability function with random right-censored data where the population distribution is unknown. Strzalkowska-Kominiak and Mahdizadeh have proposed a KM estimator based on RSS when censored data are under a random detection limit assumption. Mahdizadeh and Strzalkowska-Kominiak have proposed a confidence interval for a distribution function when data are right-censored with random censoring time by applying the RSS design.
In Section 2, we present some preliminary notes. In Section 3, we introduce the nonparametric KM estimator. In Section 4, we show the asymptotic normality of the KM estimator based on the imperfect PROS sampling design. In Section 5, we compare the performance of the PROS KM estimator with its SRS and RSS counterparts using simulation studies. In Section 6, we illustrate the proposed method with a real example, in which a dataset collected at the Amir Medical Oncology Center serves as our population.
2. Necessary Background
2.1. Ranked Set Sampling
To obtain an RSS from the underlying population for a given set size and number of cycles, a set of units, equal in number to the set size, is randomly selected from the population. The units are ranked via some mechanism. Then, the unit ranked as the smallest is selected for the final measurement. Another set of units is drawn and ranked, and the unit ranked as the second smallest is selected for measurement. This process is continued until the unit ranked as the maximum is selected and measured. This constitutes one cycle of the RSS procedure; the cycle is repeated for the chosen number of cycles to generate the full RSS, whose size is the set size multiplied by the number of cycles.
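The RSS procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `draw_rss`, the exponential toy population, and the choice to rank on the measured variable itself (perfect ranking) are all assumptions made for the example.

```python
import numpy as np

def draw_rss(population, set_size, cycles, rng):
    """Draw a ranked set sample; returns an array of shape (cycles, set_size).

    Assumption: ranking uses the variable of interest itself (perfect ranking).
    """
    population = np.asarray(population, dtype=float)
    sample = np.empty((cycles, set_size))
    for c in range(cycles):
        for rank in range(set_size):
            # Draw a fresh set of `set_size` units and rank them.
            candidates = rng.choice(population, size=set_size, replace=False)
            ranked = np.sort(candidates)
            # Only the unit holding the target rank is fully measured.
            sample[c, rank] = ranked[rank]
    return sample

rng = np.random.default_rng(0)
population = rng.exponential(scale=1.0, size=1000)  # hypothetical population
rss = draw_rss(population, set_size=5, cycles=3, rng=rng)
print(rss.shape)  # one row per cycle, one column per rank
```

Column `r` of the result collects the units that were judged to hold rank `r + 1` in their sets, so each column is a small sample of independent observations of one judgment order statistic.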
2.2. Partially Rank-Ordered Set Sampling
In this section, we introduce the PROS sampling design and present the necessary notation. The design follows the form introduced by Ozturk. To extract a PROS sample, we choose a set size and a design parameter that partitions each set into mutually exclusive subsets. The units of a set are assigned to the subsets based on visual inspection, judgment ranking, or a concomitant variable, such that all units in a lower-ranked subset are judged to have smaller ranks than all units in any higher-ranked subset. A unit is then randomly selected from the first subset for full measurement. Next, we randomly select another set of units and assign them to subsets in the same way; we then randomly draw a member from the second subset. These steps are continued until a unit has been randomly extracted from the last subset. These observations constitute one cycle of the PROS sampling design; repeating the process for the chosen number of cycles yields the full PROS sample.
Table 1 presents a simple example of the construction of a PROS sample when , and , the cycle size is , and the design parameter is . Each set contains nine units assigned to three partially rank-ordered subsets. In this process, units in each subset have equal chance to take any place in the subset. One unit, in each set from the bold-faced subset, is randomly drawn and measured. The resulting PROS sample is denoted by .
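As a rough illustration of the construction just described, the following Python sketch draws a PROS sample by partitioning each set on a concomitant variable. The function name `draw_pros`, its parameter names, and the assumption of equally sized subsets are choices made for this example, not notation from the paper.

```python
import numpy as np

def draw_pros(values, concomitant, n_subsets, subset_size, cycles, rng):
    """Draw a PROS sample; returns an array of shape (cycles, n_subsets).

    Assumption: every set has n_subsets * subset_size units and is split
    into equally sized subsets using the concomitant variable only.
    """
    values = np.asarray(values, dtype=float)
    concomitant = np.asarray(concomitant, dtype=float)
    set_size = n_subsets * subset_size
    sample = np.empty((cycles, n_subsets))
    for c in range(cycles):
        for r in range(n_subsets):
            # A fresh set of units is drawn for every measured observation.
            idx = rng.choice(values.size, size=set_size, replace=False)
            # Sorting is only a convenient way to form the partition;
            # the order of units inside a subset is never used.
            ordered = idx[np.argsort(concomitant[idx])]
            subsets = ordered.reshape(n_subsets, subset_size)
            # One unit is drawn at random from the r-th subset and measured.
            chosen = rng.choice(subsets[r])
            sample[c, r] = values[chosen]
    return sample

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1000)      # hypothetical variable of interest
w = x + rng.normal(scale=0.5, size=x.size)     # hypothetical noisy concomitant
pros = draw_pros(x, w, n_subsets=3, subset_size=3, cycles=4, rng=rng)
print(pros.shape)
```

With nine units per set and three subsets, this mirrors the layout of the example above; a noisy concomitant such as `w` produces an imperfect design, whereas ranking on `x` itself would give a perfect one.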
It should be noted that, if all members in the subset have exactly smaller ranks than all members in , the PROS sampling design is perfect. Otherwise, we have an imperfect PROS sampling design. Suppose that α is a doubly stochastic matrix; we model the subsetting error probabilities in the imperfect PROS as follows (see  and ): where is the probability of assigning a unit into the subset when it belongs to the subset with .
Throughout this paper, we use as a symbol of an imperfect PROS sampling design with the design , where α represents a subsetting error probability matrix, shows the number of subsets, and and exhibit the number of cycles and the set size, respectively. It should be pointed out that .
SRS and RSS designs are special cases of the PROS sampling design when and , respectively. For a perfect PROS design, since for and for , the subsetting error matrix is an identity matrix and the notation can be used.
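To make the role of the subsetting error matrix concrete, the sketch below checks that a matrix is doubly stochastic and uses one of its rows to simulate which true subset a unit from a given judgment subset actually belongs to. The helper names and the convention that rows index the judgment subset are assumptions made for this illustration; the example matrix is hypothetical.

```python
import numpy as np

def is_doubly_stochastic(alpha, tol=1e-9):
    """True if alpha is square, nonnegative, with all rows and columns summing to 1."""
    alpha = np.asarray(alpha, dtype=float)
    if alpha.ndim != 2 or alpha.shape[0] != alpha.shape[1]:
        return False
    return bool((alpha >= 0).all()
                and np.allclose(alpha.sum(axis=0), 1.0, atol=tol)
                and np.allclose(alpha.sum(axis=1), 1.0, atol=tol))

def draw_true_subset(alpha, judged_subset, rng):
    """Sample the true subset index of a unit placed in `judged_subset`.

    Assumption: row `judged_subset` of alpha holds the mixing probabilities.
    """
    alpha = np.asarray(alpha, dtype=float)
    return int(rng.choice(alpha.shape[0], p=alpha[judged_subset]))

perfect = np.eye(3)                       # identity matrix: no subsetting errors
imperfect = np.array([[0.8, 0.2, 0.0],    # hypothetical error probabilities
                      [0.2, 0.6, 0.2],
                      [0.0, 0.2, 0.8]])

rng = np.random.default_rng(0)
print(is_doubly_stochastic(perfect), is_doubly_stochastic(imperfect))
print(draw_true_subset(imperfect, judged_subset=0, rng=rng))
```

Under the identity matrix every unit stays in its judged subset, which corresponds to the perfect design; off-diagonal mass models misplacement between neighbouring subsets.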
In this paper, the cumulative distribution function (CDF) of the studied variable in the population, CDF of for , and CDF of the th-order statistic among a simple random sample of size are denoted by , , and , respectively. In addition, the corresponding probability density functions are represented by , , and .
3. Kaplan-Meier Estimator Based on PROS Sampling Design
Definition 1. Let the event time and the censoring time be two independent random variables, where we observe only their minimum together with the indicator variable that specifies the event/censored status. The KM estimator is the product-limit estimator computed from the ordered observed values of the simple random sample (SRS) and their associated status indicators.
The KM estimator based on the PROS sample is built from subset-wise KM estimators, each of which is the KM estimator based on the independent and identically distributed (SRS) observations obtained from one judgment subset, computed from their ordered values and the associated status indicators.
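A minimal Python sketch of these two estimators follows. The function `km_survival` is the standard product-limit estimator of the survival function. The function `km_survival_pros` assumes that the PROS-based estimator is the plain average of the subset-wise KM estimates, which mirrors the usual construction for RSS; this averaging rule and all names are assumptions of the example and should be checked against the paper's definition.

```python
import numpy as np

def km_survival(times, events, t):
    """Product-limit estimate of S(t) = P(X > t) from right-censored data.

    times  : observed times, i.e. min(event time, censoring time)
    events : 1/True if the event was observed, 0/False if censored
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    surv = 1.0
    # Multiply over the distinct event times that do not exceed t.
    for u in np.unique(times[events & (times <= t)]):
        at_risk = np.sum(times >= u)
        deaths = np.sum((times == u) & events)
        surv *= 1.0 - deaths / at_risk
    return float(surv)

def km_survival_pros(times, events, t):
    """Assumed PROS estimator: mean of the subset-wise KM estimates.

    times, events : arrays of shape (cycles, n_subsets), one column per subset.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events)
    per_subset = [km_survival(times[:, r], events[:, r], t)
                  for r in range(times.shape[1])]
    return float(np.mean(per_subset))

# Toy data: four subjects, the second one is censored at time 2.
print(km_survival([1, 2, 3, 4], [1, 0, 1, 1], t=3.0))  # 0.375
```

In the toy call, the estimate drops to 3/4 at time 1 and is then multiplied by 1/2 at time 3, because only two subjects remain at risk there after the censoring at time 2.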
4. Asymptotic Properties
In this section, we study the large-sample behavior of the nonparametric KM estimator based on the imperfect PROS sampling design. The asymptotic properties of the KM estimator under SRS are widely available in the literature [19–21].
We demonstrate that no stronger assumptions are needed when using the imperfect PROS-based KM estimator. First, we introduce the following lemma, which is a direct consequence of Lemma 2.1 in Stute and Wang.
Lemma 1. Suppose and are two independent random variables. In addition, let be the PROS sample from subset in the th cycle and be the corresponding censored time.
Set , then we have
Theorem 1. Assume and are continuous and where Also, set As and , we have where
Proof. In view of the equivalent theorem in SRS sampling design , it suffices to show that, for every ,
As to equation (15), under continuity of and and , we also have
Under the continuity of , there exists a density . We have ; hence, .
By using the above relationship, By equation (9), this phrase is finite, so we prove equation (15).
To prove that (16) holds, we have to determine a lower bound for.
We know that for , we have
Therefore, we have
Based on Lemma 1 and the above equations
We define constant as
Because , we have so (26) is smaller than
In view of equation (10), this equation is finite, and this completes the proof.
5. Simulation Study
In this section, we compare the performance of the KM estimator of survival function under the PROS sampling design relative to its SRS and RSS counterparts.
To do so, we considered two situations in which the original random variables were generated from an exponential distribution with mean 1 (model A) and standard log-normal distribution with mean 1.649 (model B). The censored variables in the two cases are supposed to have an exponential distribution; a common rate of exponential distribution was determined when the desired censoring level was prespecified. In all simulation scenarios and the set size for the RSS sampling design is . The algorithm of the simulation study is explained in Appendix B.
By using distribution theory, if and are independent and distributed exponentially with means and , respectively, then . On the other hand, . Setting the values of the censoring level and in these equations, we can find the appropriate value of the exponential rate in model A. Given the fact that there is no such expression for model B, we found the exponential common rate for the censoring variable by trial and error, although one can easily solve this problem numerically by using software like R. The values of the exponential rate were equal to 0.013 and 0.190 and led to censoring levels of 0.1 and 0.6, respectively.
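The calculation of the censoring rate can be sketched as follows. For independent exponential event and censoring times with rates λ and μ, the censoring probability is P(C < X) = μ/(λ + μ), which can be solved for μ in closed form; for other event-time distributions, P(C < X) = E[1 − exp(−μX)] can be solved numerically. The code assumes a rate (not mean) parametrization and uses hypothetical function names; the numbers it produces depend on this parametrization and are not meant to reproduce the rates reported above.

```python
import numpy as np

def exponential_censoring_rate(censoring_level, event_rate=1.0):
    """Rate mu of C ~ Exp(mu) giving P(C < X) = censoring_level for X ~ Exp(event_rate)."""
    return event_rate * censoring_level / (1.0 - censoring_level)

def numeric_censoring_rate(event_draws, censoring_level, iterations=60):
    """Bisection on mu for mean(1 - exp(-mu * X)) = censoring_level.

    event_draws : Monte Carlo draws of the event time X (any distribution).
    """
    event_draws = np.asarray(event_draws, dtype=float)

    def level(mu):
        return np.mean(1.0 - np.exp(-mu * event_draws))

    lo, hi = 0.0, 1.0
    while level(hi) < censoring_level:   # expand until the target is bracketed
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if level(mid) < censoring_level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=200_000)        # event times with mean 1
mu_closed = exponential_censoring_rate(0.1)          # closed form
mu_numeric = numeric_censoring_rate(x, 0.1)          # Monte Carlo + bisection
c = rng.exponential(scale=1.0 / mu_closed, size=x.size)
observed_level = np.mean(c < x)                      # empirical censoring proportion
print(mu_closed, mu_numeric, observed_level)
```

The same `numeric_censoring_rate` call works for a log-normal event time by passing draws from `rng.lognormal(...)`, which replaces the trial-and-error search with a reproducible root-finding step.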
For each combination of sample sizes , 120, and 240 and the mentioned censoring levels 0.1 and 0.6, 5000 samples were generated under the SRS, RSS, and PROS sampling designs. For different values of , , and and the misplacement probabilities and for , the values of the mean squared error (MSE) were computed for the three estimators from each sample when, 0.5, 0.7, and 1.
5.1. Comparing the Kaplan-Meier Estimators
We compare the performance of the KM estimators of the survival function between the studied sampling designs. The efficiency of the PROS estimation with respect to its SRS and RSS counterparts, at the point , is defined as where , , and are the KM estimators of the survival function at point based on PROS, RSS, and SRS sampling designs, respectively.
Note that . and are similarly defined. Also, for a fixed percentile , and is the inverse of the underlying distribution function. The values of RP and SP are calculated for and and 5 in both models when we consider , 0.25, 0.50, 0.75, and 0.90. Because of the large volume of output and similar results in both models, we only report the results for model A in this article.
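The efficiency comparison can be illustrated with a short Python sketch. It assumes that the relative efficiency is the ratio of the competitor's MSE to the PROS MSE at a fixed time point, so that values above 1 favour PROS; this direction of the ratio, the function names, and the toy replicate values are assumptions made for the example.

```python
import numpy as np

def mse(estimates, true_value):
    """Mean squared error of replicated estimates around the true value."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean((estimates - true_value) ** 2))

def relative_efficiency(reference_estimates, pros_estimates, true_value):
    """Assumed definition: MSE(reference design) / MSE(PROS design)."""
    return mse(reference_estimates, true_value) / mse(pros_estimates, true_value)

# Toy replicate estimates of S(t) at the median, where the true value is 0.5.
true_s = 0.5
srs_hat = np.array([0.40, 0.60, 0.45, 0.55])    # made-up SRS-based estimates
pros_hat = np.array([0.45, 0.55, 0.48, 0.52])   # made-up PROS-based estimates

sp = relative_efficiency(srs_hat, pros_hat, true_s)
print(sp > 1.0)  # True: the toy PROS estimates are closer to 0.5
```

In a simulation, `srs_hat` and `pros_hat` would hold one KM estimate per generated sample, evaluated at the chosen percentile of the underlying distribution.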
In the literature, the sample sizes in the PROS and RSS designs have been similar, but a much smaller set size has been used for the RSS sampling design than for PROS. However, simulation studies that are not presented here show that the RSS-based estimator may perform better than the one using the PROS sampling design under the same sample size and the same set size.
As shown in Figures 1 and 2, in model A, the KM estimator based on the PROS sampling design in most cases is more efficient than the KM estimator based on the RSS and SRS sampling designs with similar sample sizes. The best performance of the PROS design over the SRS and RSS designs happens when the ranking errors are small or zero, i.e., when and 1. The efficiency of the KM estimator based on PROS relative to SRS is as good as or higher than the efficiency of the KM estimator based on the PROS relative to the RSS procedure, regardless of the censoring level and ranking error. Assuming a fixed sample size and censoring level, by increasing the for large values of , the efficiency of the KM estimator based on the PROS sampling design is enhanced. It should be noted that in an imperfect PROS sampling design , the efficiency reduced as increased.
We can conclude that increasing the level of censoring in a smaller sample leads to a reduction in efficiency in both models, but for a larger sample size this rarely happens; in other words, the level of censoring has a greater impact on the performance of the PROS sampling design in smaller samples than in larger ones.
We conclude that, regardless of the censoring level and ranking error, increasing the sample size leads to increased efficiency. The perfect PROS KM estimator performs three times more efficiently than the SRS KM estimator in several simulation scenarios. It is worth noting that RP might decrease when one considers the same set size in the PROS and RSS designs with similar sample sizes. In all figures, we consider for the PROS design.
In addition, we compared these three sampling methods using a mean integrated squared error (MISE) indicator, defined as
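A numerical version of this criterion can be sketched as follows, assuming that the MISE is the expected integral of the squared difference between the estimated and true survival functions over a chosen time range. The trapezoidal rule, the integration range, and the function names are assumptions of this illustration.

```python
import numpy as np

def integrated_squared_error(estimate, truth, grid):
    """Trapezoidal approximation of the integral of (estimate - truth)^2 over grid."""
    sq = (np.asarray(estimate, dtype=float) - np.asarray(truth, dtype=float)) ** 2
    return float(np.sum(0.5 * (sq[1:] + sq[:-1]) * np.diff(grid)))

def mise(replicate_estimates, truth, grid):
    """Average the integrated squared error over simulation replicates."""
    return float(np.mean([integrated_squared_error(e, truth, grid)
                          for e in replicate_estimates]))

grid = np.linspace(0.0, 1.0, 101)          # hypothetical integration range [0, 1]
truth = np.exp(-grid)                       # survival function of Exp(1)
replicates = [truth + 0.1, truth - 0.1]     # toy estimates with a constant error
value = mise(replicates, truth, grid)
print(round(value, 6))  # 0.01
```

Because each toy replicate is off by exactly 0.1 everywhere, the integrated squared error over an interval of length 1 is 0.1² = 0.01, which gives a quick sanity check of the implementation.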
From Table 2, we can conclude that, most of the time, PROS has a smaller MISE than the RSS and SRS sampling methods with similar sample sizes, especially for a large . In addition, we observe that as the level of censoring increases, the MISE increases as well in both models. It should be mentioned that at the low level of censoring, the log-normal model (model B) has a lower MISE than the exponential model (model A), but at the high level of censoring, model B has a larger MISE than model A, for the same values of and and the subsetting probabilities . As expected, increasing the sample size reduces the MISE.
The results show that when , 0.7, and 1 in a smaller sample size with a low percentage of censored data, the larger leads to the smaller MISE of the estimators, but with a high percentage of censored data, the MISE value increases as increases. However, in larger sample sizes, the MISE of the estimator decreases as the goes up in all censoring levels.
In Table 2, as the misplacement probabilities decrease, the superiority of the PROS estimator compared to the RSS and SRS estimators becomes more obvious. The MISE values of the KM estimator derived from perfect PROS and perfect RSS sampling designs are smaller than those in imperfect methods. Note that the KM estimator based on the SRS sampling design has a smaller MISE value than the one based on the imperfect rank-based sampling designs for some cases in small sample sizes and high censorship percentage.
Note that the RSS KM estimator can have a lower MISE than the PROS one, when we consider a similar set size and fixed sample size.
6. Real Data Application
In this section, we use the information of children under 18 years of age with nonmalignant hematological disorders such as Beta-Thalassemia and Idiopathic Thrombocytopenic Purpura (ITP) and children with hematological malignancies including various types of lymphoma and Acute Lymphocytic Leukemia (ALL), registered in the Amir Medical Oncology Center from May 2014 to August 2017. The dataset contains the survival information of 61 patients. We provide KM estimates for the survival time (in months) as the variable of interest, using the white blood cell (WBC) count as the concomitant variable for ranking purposes. The correlation coefficient between the two variables is 0.455 and is significant (p value = 0.0001); in addition, 50.8% of the observations are censored. We considered the perfect PROS and RSS sampling designs. To compute the KM estimator of survival time, we regarded this data set as the target population and extracted PROS, RSS, and SRS samples (with replacement) of size 15 from the population. For the PROS design, we first randomly selected 15 patients from the target population and partitioned them into subsets based on their WBC values. We then randomly selected a unit from the first subset and observed its survival time. Next, we again randomly selected 15 patients, assigned them to the subsets, and randomly drew a member from the second subset; these steps were repeated until a unit had been selected from the last subset. These observations constitute one cycle of PROS; for these data, we considered 3 cycles, giving 15 survival time observations in total.
In RSS, we randomly selected 5 patients from the target population and ranked them based on their WBC values; we then selected the patient with the smallest WBC and observed that patient's survival time. This procedure continued until the survival time of the 5th ranked unit in the 5th set of units was measured. These 5 observations constitute one cycle of RSS; in this example, we considered 3 cycles, so that we observed the survival time of 15 patients.
For each sampling design, the KM estimator was calculated at different time points. This process was then repeated 50 times. The resulting 50 KM curves under the three sampling designs are shown in Figure 3. Figure 3 shows that the variation of the KM estimators at each fixed time under the PROS sampling design is smaller than that of the RSS and SRS counterparts. We conclude that, for these real data, the PROS-based estimator performs better than its RSS and SRS counterparts. The raw data are provided as supplementary material.
7. Summary and Concluding Remarks
In numerous medical fields, the exact measurement of the desired variable is expensive or time-consuming. Rank-based sampling designs such as PROS can help overcome this difficulty by ranking a small number of sampling units based on a concomitant variable. These sampling designs can be used to obtain samples that are more informative and also result in more accurate inference about the parameters of interest.
In this paper, we considered the KM estimator, a well-established and commonly used technique in survival analysis, under an imperfect PROS sampling design. PROS is a recent sampling design that avoids ranking all units in a given set. Furthermore, we developed asymptotic distributional properties of the new KM estimator based on the proposed sampling method. We showed how well this estimator performs in comparison with its RSS and SRS counterparts. The simulation results suggest that, under both perfect and imperfect subsetting assumptions, the estimator based on the PROS sampling design is more efficient than the estimators based on the two other sampling methods with the same sample sizes. It is noteworthy that, by increasing the set size in RSS while keeping the sample size fixed in both designs, the RSS KM estimator can have smaller values of MSE than the PROS one. Finally, we applied all the introduced sampling designs to a real data set. We believe that it would be appealing to apply the proposed methodology to other useful statistical models, for example, a Cox regression model for analyzing time-to-event data, which is applicable to the majority of medical fields.
Finally, we recommend extending this study to recently proposed sampling designs, for example, even order ranked set sampling (EORSS) and quartile pair ranked set sampling (QPRSS), which have recently received attention from some researchers.
A. The Proof of Lemma 1
B. Algorithm of Simulation Scenarios
The steps of the simulation study algorithm are as follows:
Step 1: Perform data generation in the following ways:
(i) Generate 1000 random event time observations from the desired distribution (X)
(ii) Generate 1000 random censored time observations from the desired distribution (C)
(iii) Observe the status variables
(iv) Calculate the survival time variable
Step 2: Perform sampling in the different studied designs:
(i) Generate PROS, RSS, and SRS samples from the target population. For PROS and RSS, we generate the samples based on different values for subsetting error matrices, set sizes, and cycle sizes
Step 3: Estimate the desired estimators:
(i) Estimate the KM estimator using code that implements the corresponding formula
Step 4: Calculate comparison criteria:
(i) Compute the MSE of the KM estimator at different percentile points
(ii) Compute the MISE values for KM estimators under the three different sampling designs
Step 5: Repeat all the above steps 5000 times.
Step 6: Compute the mean of 5000 calculated MSE and MISE and report them.
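The steps above can be sketched end to end in Python for a reduced scenario: exponential event times with mean 1, a perfect PROS design ranked on the event time, an SRS of the same size, and the MSE of the KM estimate at the median. The design values (3 subsets of 3 units, 20 cycles, 100 replicates), the averaging of subset-wise KM estimates, and all function names are assumptions chosen to keep the example small; they are not the settings of the study.

```python
import numpy as np

def km_survival(times, events, t):
    """Product-limit estimate of S(t) from right-censored data."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    surv = 1.0
    for u in np.unique(times[events & (times <= t)]):
        surv *= 1.0 - np.sum((times == u) & events) / np.sum(times >= u)
    return float(surv)

def make_population(size, censor_rate, rng):
    """Step 1: event times X, censoring times C, observed time Y, status delta."""
    x = rng.exponential(scale=1.0, size=size)
    c = rng.exponential(scale=1.0 / censor_rate, size=size)
    return x, np.minimum(x, c), x <= c

def srs_estimate(y, delta, n, t, rng):
    """Steps 2-3 for SRS: draw n units and compute the KM estimate at t."""
    idx = rng.choice(y.size, size=n, replace=False)
    return km_survival(y[idx], delta[idx], t)

def pros_estimate(x, y, delta, n_subsets, subset_size, cycles, t, rng):
    """Steps 2-3 for perfect PROS (assumed: ranking on X, mean of subset-wise KM)."""
    set_size = n_subsets * subset_size
    chosen = np.empty((cycles, n_subsets), dtype=int)
    for c in range(cycles):
        for r in range(n_subsets):
            idx = rng.choice(y.size, size=set_size, replace=False)
            ordered = idx[np.argsort(x[idx])]
            subset = ordered[r * subset_size:(r + 1) * subset_size]
            chosen[c, r] = rng.choice(subset)
    per_subset = [km_survival(y[chosen[:, r]], delta[chosen[:, r]], t)
                  for r in range(n_subsets)]
    return float(np.mean(per_subset))

def simulate(reps=100, seed=1):
    """Steps 4-6: average squared errors at the median over the replicates."""
    rng = np.random.default_rng(seed)
    t, true_s = np.log(2.0), 0.5          # median of Exp(1), where S(t) = 0.5
    err_srs, err_pros = [], []
    for _ in range(reps):
        x, y, delta = make_population(1000, censor_rate=1.0 / 9.0, rng=rng)
        err_srs.append((srs_estimate(y, delta, 60, t, rng) - true_s) ** 2)
        err_pros.append((pros_estimate(x, y, delta, 3, 3, 20, t, rng) - true_s) ** 2)
    return float(np.mean(err_srs)), float(np.mean(err_pros))

mse_srs, mse_pros = simulate()
print(mse_srs, mse_pros, mse_srs / mse_pros)
```

The censoring rate of 1/9 corresponds to a censoring level of about 0.1 for this exponential model. Extending the sketch to RSS, to imperfect subsetting, or to the MISE criterion only requires swapping the sampling function or the error measure inside `simulate`.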
Data Availability
In the present study, we used the information about children under 18 years of age with nonmalignant hematological disorders such as Beta-Thalassemia and Idiopathic Thrombocytopenic Purpura (ITP) and also children with hematological malignancies including various types of lymphoma and Acute Lymphocytic Leukemia (ALL), registered in the Amir Medical Oncology Center from May 2014 to August 2017, as the population of interest.
Conflicts of Interest
The authors report no conflicts of interest. The authors alone are responsible for the content and writing of this article.
Acknowledgments
The present paper was extracted from the Ph.D. dissertation of Mrs. Samane Nematolahi and was supported by Shiraz University of Medical Sciences, Shiraz, Iran (Grant no. 97-01-01-16858).
Supplementary Materials
This supplementary file includes the data for the Section 6 (real data) example in the paper. The file contains the information of children under 18 years of age with nonmalignant hematological disorders such as Beta-Thalassemia and Idiopathic Thrombocytopenic Purpura (ITP) and children with hematological malignancies including various types of lymphoma and Acute Lymphocytic Leukemia (ALL), registered in the Amir Medical Oncology Center from May 2014 to August 2017. The dataset contains the survival information of 61 patients.
References
- G. McIntyre, “A method for unbiased selective sampling, using ranked sets,” Australian Journal of Agricultural Research, vol. 3, no. 4, pp. 385–390, 1952.
- J. Frey and T. G. Feeman, “Efficiency comparisons for partially rank-ordered set sampling,” Statistical Papers, vol. 58, no. 4, pp. 1149–1163, 2017.
- A. Hatefi, M. J. Jozani, and O. Ozturk, “Mixture model analysis of partially rank-ordered set samples: age groups of fish from length-frequency data,” Scandinavian Journal of Statistics, vol. 42, no. 3, pp. 848–871, 2015.
- O. Ozturk, “Sampling from partially rank-ordered sets,” Environmental and Ecological Statistics, vol. 18, no. 4, pp. 757–779, 2011.
- O. Ozturk, “Combining multi-observer information in partially rank-ordered judgment post-stratified and ranked set samples,” Canadian Journal of Statistics, vol. 41, no. 2, pp. 304–324, 2013.
- J. Frey, “Nonparametric mean estimation using partially ordered sets,” Environmental and Ecological Statistics, vol. 19, no. 3, pp. 309–326, 2012.
- S. Nazari, M. Jafari Jozani, and M. Kharrati-Kopaei, “Nonparametric density estimation using partially rank-ordered set samples with application in estimating the distribution of wheat yield,” Electronic Journal of Statistics, vol. 8, no. 1, pp. 738–761, 2014.
- O. Ozturk, “Estimation of a finite population mean and total using population ranks of sample units,” Journal of Agricultural, Biological, and Environmental Statistics, vol. 21, no. 1, pp. 181–202, 2016.
- S. Nazari, M. Jafari Jozani, and M. Kharrati-Kopaei, “On distribution function estimation with partially rank-ordered set samples: estimating mercury level in fish using length frequency data,” Statistics, vol. 50, no. 6, pp. 1387–1410, 2016.
- A. Hatefi and M. J. Jozani, “Information content of partially rank-ordered set samples,” AStA Advances in Statistical Analysis, vol. 101, no. 2, pp. 117–149, 2017.
- Q. Zaman and K. P. Pfeiffer, “Survival analysis in medical research,” Interstat, vol. 17, no. 4, pp. 1–36, 2011.
- J. W. Song and K. C. Chung, “Observational studies: cohort and case-control studies,” Plastic and Reconstructive Surgery, vol. 126, no. 6, pp. 2234–2242, 2010.
- E. Strzalkowska-Kominiak and M. Mahdizadeh, “On the Kaplan–Meier estimator based on ranked set samples,” Journal of Statistical Computation and Simulation, vol. 84, no. 12, pp. 2577–2591, 2013.
- P. L. H. Yu and C. Y. C. Tam, “Ranked set sampling in the presence of censored data,” Environmetrics, vol. 13, no. 4, pp. 379–396, 2002.
- L. Zhang, X. Dong, and X. Xu, “Nonparametric estimation for random censored data based on ranking set sampling,” Communications in Statistics - Simulation and Computation, vol. 43, no. 8, pp. 2004–2015, 2014.
- M. Mahdizadeh and E. Strzalkowska-Kominiak, “Resampling based inference for a distribution function using censored ranked set samples,” Computational Statistics, vol. 32, no. 4, pp. 1285–1308, 2017.
- D. A. Wolfe, “Ranked set sampling: an approach to more efficient data collection,” Statistical Science, vol. 19, no. 4, pp. 636–643, 2004.
- W. Stute and J. L. Wang, “The strong law under random censorship,” The Annals of Statistics, vol. 21, no. 3, pp. 1591–1607, 1993.
- E. L. Kaplan and P. Meier, “Nonparametric estimation from incomplete observations,” Journal of the American Statistical Association, vol. 53, no. 282, pp. 457–481, 1958.
- P. Major and L. Rejto, “Strong embedding of the estimator of the distribution function under random censorship,” The Annals of Statistics, vol. 16, no. 3, pp. 1113–1132, 1988.
- W. Stute, “The Central Limit Theorem under random censorship,” The Annals of Statistics, vol. 23, no. 2, pp. 422–439, 1995.
- M. Noor-ul-Amin, M. Tayyab, and M. Hanif, “Mean estimation using even order ranked set sampling,” Punjab University Journal of Mathematics, vol. 51, no. 1, pp. 91–99, 2019.
- M. Tayyab, M. Noor-ul-Amin, and M. Hanif, “Quartile pair ranked set sampling: development and estimation,” Proceedings of the National Academy of Sciences, India Section A: Physical Sciences, vol. 18, pp. 1–6, 2019.
Copyright © 2020 Samane Nematolahi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.