Special Issue: Mathematical and Statistical Aspects in Health Sciences
On Suitability of Mixture of Generalized Exponential Models in Modeling Right-Censored Medical Datasets Using Conditional Expectations
The exploration of suitable models for censored medical datasets is of great importance. Numerous studies have dealt with modeling such datasets; however, the majority of the earlier contributions have used conventional models. Unfortunately, conventional models cannot capture the behavior of heterogeneous datasets involving a mixture of two or more subpopulations. In addition, the earlier contributions have considered conventional censoring schemes that replace all the censored items with the largest failed item. This paper proposes an analysis of right-censored mixture medical datasets. A mixture of generalized exponential distributions is proposed to model right-censored heterogeneous medical datasets. In contrast to conventional censoring schemes, we propose censoring schemes that replace the censored items with the conditional expectation (CE) of the random variable. In addition, Bayesian methods are proposed to estimate the model parameters. The performance and sensitivity of the proposed estimators have been evaluated using a detailed simulation study, which suggests that censoring schemes based on CE provide improved estimation compared to conventional censoring schemes. The suitability of the model for heterogeneous datasets has been verified by modeling two real right-censored medical datasets. A comparison of the proposed model with an existing mixture model under Bayesian methods indicated the improved performance of the proposed model.
1. Introduction
The exploration of suitable models for censored medical datasets is of great importance. Numerous studies have dealt with modeling such datasets; however, the majority of the earlier contributions have used conventional models. Unfortunately, conventional models cannot capture the behavior of heterogeneous datasets involving a mixture of two or more subpopulations. Some researchers have considered mixture models for the analysis of medical datasets. Hanson proposed a mixture of gamma distributions to model the survival times of lung cancer patients. Noor et al. considered a mixture of exponential models to analyze data on mortality due to different types of cancer. Bussy et al. introduced a supervised learning mixture model for censored mixture data. The model parameters were estimated using the Expectation Maximization (EM) algorithm, and the applicability of the model was illustrated using three real datasets relating to genetic cancer. Mahmud et al. proposed a mixture of log-skew-normal distributions for the analysis of data from a Diary of Asthma and Viral Infectious Study conducted during 2004. Geissen et al. presented a multiexperiment mixture model that enables researchers to model censored and uncensored data simultaneously. Cheung et al. proposed a family of mixture models for undiagnosed prevalent disease with interval-censored incidents. Xiang et al. proposed a mixture cure model for the analysis of survival data with a cure fraction, with estimation carried out using the EM algorithm. However, these earlier contributions on censored mixture medical data have relied on classical methods of estimation, such as maximum likelihood estimation and the EM algorithm. Recently, Noor et al. proposed Bayesian methods for the analysis of heterogeneous medical datasets.
However, the model proposed by Noor et al. was less flexible as it has no shape parameters. As a result, the corresponding analysis was quite straightforward. In addition, all of the above contributions have considered a conventional censoring environment, which replaces the censored items either by the test termination time (for type-I censoring) or by the largest observed value (for type-II censoring). In other words, the censored items have simply been replaced by a common predetermined value. The statistical properties of estimates based on such a censoring environment can be unclear. Steiner and Mackay addressed this issue by proposing the use of the conditional expectation (CE) of the random variable as a replacement for the censored items. However, the said proposal was for single (non-mixture) models. A careful review of the literature suggests that the use of CE has never been considered for censored mixture models.
The generalized exponential distribution (GED) is a very important lifetime distribution. It is more flexible than the exponential and Rayleigh distributions because it has a shape parameter. The GED is also more convenient to use than the lognormal and gamma distributions because it has closed-form expressions for the cumulative distribution function (CDF) and the hazard rate function. A few studies have also concluded that the GED performs better than the Weibull distribution in modeling censored data. The mixture of GEDs (MGED) has been introduced more recently. The analysis of MGED under progressive censoring was considered by Wang et al. Teng and Zhang showed that the Gaussian mixture and the Laplacian mixture can be obtained as special cases of MGED, with the model parameters estimated using the EM algorithm. The superiority of MGED over the Weibull distribution was explored by Ateya, and industrial applications of MGED were reported by Ali et al. Mohamed et al. introduced a methodology to obtain Bayesian predictions using MGED. Kazmi and Aslam considered the Bayesian analysis of right-censored data using MGED, assuming the shape parameters to be known. The above discussion suggests that the MGED is a very relevant distribution for modeling censored datasets. However, the suitability of MGED for modeling right-censored mixture datasets from medical fields using Bayesian methods is still to be explored.
We have considered the Bayesian estimation of the two-component MGED (2CMGED) when samples are right censored. The applicability of the proposed model to right-censored heterogeneous datasets from medical sciences has been explored using real datasets. The main feature of the paper is the introduction of a censoring environment in which the censored items are replaced by the CE of the random variable. We have compared the proposed censoring environment with the existing censoring environment in which the censored items are replaced by the largest observed value. The Bayesian estimation has been carried out assuming noninformative priors (NIP) and informative priors (IP). Four loss functions, namely, the squared error loss function (SELF), precautionary loss function (PLF), entropy loss function (ELF), and LINEX loss function (LLF), have been used for the analysis. Since the Bayesian estimates (BEs) were unavailable in closed form, approximate Bayesian methods, namely, Lindley’s approximation (LA) and importance sampling (IS), were used for the numerical computations. The performance of the proposed model was compared with the two-component mixture of exponential distributions (2CMED). The comparison indicated the superiority of the 2CMGED over the 2CMED. In addition, the results based on the proposed CE censoring environment provided improved estimation compared to the conventional censoring environment.
2. Materials and Methods
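The probability density function (PDF) of the generalized exponential distribution is
$$f(x; \alpha, \lambda) = \alpha \lambda e^{-\lambda x} \left(1 - e^{-\lambda x}\right)^{\alpha - 1}, \quad x > 0, \tag{1}$$
where $x$ is the random variable and $\alpha > 0$ and $\lambda > 0$ are the shape and rate parameters of the distribution.
The CDF of the generalized exponential distribution is
$$F(x; \alpha, \lambda) = \left(1 - e^{-\lambda x}\right)^{\alpha}, \quad x > 0. \tag{2}$$
The two-component mixture of generalized exponential distributions (2CMGED) with mixing weights $(p, 1 - p)$ is
$$f(x) = p\, \alpha_1 \lambda_1 e^{-\lambda_1 x} \left(1 - e^{-\lambda_1 x}\right)^{\alpha_1 - 1} + (1 - p)\, \alpha_2 \lambda_2 e^{-\lambda_2 x} \left(1 - e^{-\lambda_2 x}\right)^{\alpha_2 - 1}, \tag{3}$$
where $x > 0$; $\alpha_i, \lambda_i > 0$, $i = 1, 2$; $0 < p < 1$.
The cumulative distribution function for the 2CMGED is
$$F(x) = p \left(1 - e^{-\lambda_1 x}\right)^{\alpha_1} + (1 - p) \left(1 - e^{-\lambda_2 x}\right)^{\alpha_2}. \tag{4}$$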
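As a minimal sketch, the density and distribution functions of the generalized exponential distribution and of its two-component mixture can be coded directly. The sketch assumes the standard parameterization with shape `alpha` and rate `lam`, so that the CDF is `(1 - exp(-lam * x)) ** alpha`; the function and argument names are illustrative and are not taken from the paper.

```python
import math


def ged_pdf(x, alpha, lam):
    """PDF of the generalized exponential distribution (shape alpha, rate lam)."""
    if x <= 0:
        return 0.0
    e = math.exp(-lam * x)
    return alpha * lam * e * (1.0 - e) ** (alpha - 1.0)


def ged_cdf(x, alpha, lam):
    """CDF of the generalized exponential distribution."""
    if x <= 0:
        return 0.0
    return (1.0 - math.exp(-lam * x)) ** alpha


def mix_pdf(x, p, alpha1, lam1, alpha2, lam2):
    """PDF of the two-component mixture with weights (p, 1 - p)."""
    return p * ged_pdf(x, alpha1, lam1) + (1.0 - p) * ged_pdf(x, alpha2, lam2)


def mix_cdf(x, p, alpha1, lam1, alpha2, lam2):
    """CDF of the two-component mixture with weights (p, 1 - p)."""
    return p * ged_cdf(x, alpha1, lam1) + (1.0 - p) * ged_cdf(x, alpha2, lam2)
```

Setting `alpha = 1` recovers the ordinary exponential distribution, which is why the 2CMED can be viewed as a special case of the 2CMGED under this parameterization.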
2.1. Bayesian Estimation under Right-Censored Samples
In this section, the right-censored samples have been used to estimate the parameters of 2CMGED. Unfortunately, the proposed BEs do not exist in the explicit form; hence the approximate Bayesian methods have been used for the estimation.
2.1.1. The Likelihood Function under Right-Censored Samples
Consider a sample of size ‘’ from 2CMGED from which and number of observation are assumed to come from component-I and component-II of the mixture. Suppose and number of failed items observed from component-I and component-II, respectively. The remaining and items have been assumed to be censored from each component. Then the likelihood function for such right-censored mixture data can be written as where , is an order statistic.
Putting the results in Equation (5), we have
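The conventional censoring schemes replace the censored items by the largest observed value. This assumption is not suitable because the censored items are surely greater than the largest observed value. An appropriate solution to this issue is to replace the censored items with the CE of the random variable. For a censoring point $x_r$ (the largest observed value), the conditional distribution for the model given in Equation (1) is
$$f(x \mid X > x_r) = \frac{f(x)}{1 - F(x_r)}, \quad x > x_r,$$
where $f$ and $F$ denote the PDF and CDF of the generalized exponential distribution.
The CE for Equation (1) is
$$E(X \mid X > x_r) = \frac{1}{1 - F(x_r)} \int_{x_r}^{\infty} x f(x)\, dx.$$
Since an analytical solution for the CE is not available, its values have been obtained by numerical integration.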
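The numerical integration step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the standard GED parameterization (shape `alpha`, rate `lam`), uses the identity E[X | X > c] = c + (1 / S(c)) ∫ S(x) dx over (c, ∞) with S = 1 − F, and truncates the integral at a hypothetical upper limit `c + tail / lam` before applying Simpson's rule.

```python
import math


def ged_survival(x, alpha, lam):
    """S(x) = 1 - (1 - exp(-lam * x)) ** alpha, computed in a numerically stable way."""
    if x <= 0:
        return 1.0
    e = math.exp(-lam * x)
    if e >= 1.0:  # x so small that exp() rounds to 1
        return 1.0
    return -math.expm1(alpha * math.log1p(-e))


def conditional_expectation(c, alpha, lam, tail=100.0, n=20000):
    """Approximate E[X | X > c] for the GED with Simpson's rule.

    tail and n are tuning constants chosen for this sketch; the integral of
    S(x) is truncated at c + tail / lam, where S is negligible.
    """
    if n % 2:
        n += 1  # Simpson's rule needs an even number of intervals
    upper = c + tail / lam
    h = (upper - c) / n
    total = ged_survival(c, alpha, lam) + ged_survival(upper, alpha, lam)
    for i in range(1, n):
        weight = 4.0 if i % 2 else 2.0
        total += weight * ged_survival(c + i * h, alpha, lam)
    integral = total * h / 3.0
    return c + integral / ged_survival(c, alpha, lam)
```

For `alpha = 1` the distribution is exponential and memoryless, so the result should equal `c + 1 / lam`, which gives a convenient sanity check. Each censored item would then be replaced by this value rather than by the largest observed value.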
So, the resulting dataset is of the form
2.1.2. Priors and Posterior Distributions
We have proposed two sets of priors for the parameters of the 2CMGED. One set contains the NIP, while the other set is a combination of IPs. Each set of priors is described below.
The combined NIP for the parametric set is where the model parameters have uniform priors over the range .
Based on NIP given in Equation (10), the posterior distribution for is
Again, let , , and , where
are the hyperparameters.
Then, the combined prior distribution for is
The posterior distribution under Equation (12) is
As the posterior distributions under both priors do not provide closed form estimators, we have proposed approximate estimation in the coming sections.
2.1.3. Lindley’s Approximation (LA)
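This section considers the LA for approximate solution of the model parameters. If the sample size is sufficiently large, then according to Lindley, any ratio of integrals of the form
$$I(\mathbf{x}) = E[u(\Theta) \mid \mathbf{x}] = \frac{\int u(\Theta) \exp\{L(\Theta) + G(\Theta)\}\, d\Theta}{\int \exp\{L(\Theta) + G(\Theta)\}\, d\Theta},$$
where $u(\Theta)$ is any function of $\Theta$, $L(\Theta)$ is the log-likelihood function, and $G(\Theta)$ is the logarithm of the joint prior for the parametric set $\Theta$, can be evaluated as
$$I(\mathbf{x}) \approx u(\hat{\Theta}) + \frac{1}{2} \sum_{i} \sum_{j} \left(u_{ij} + 2 u_i G_j\right) \sigma_{ij} + \frac{1}{2} \sum_{i} \sum_{j} \sum_{k} \sum_{l} L_{ijk}\, \sigma_{ij}\, \sigma_{kl}\, u_l,$$
where $\hat{\Theta}$ is the maximum likelihood estimator (MLE) of the parametric set $\Theta$, subscripts on $u$, $G$, and $L$ denote partial derivatives with respect to the corresponding elements of $\Theta$, and $\sigma_{ij}$ is the $(i, j)$th element of the inverse of the matrix $\{-L_{ij}\}$, all evaluated at the MLEs of the parameters.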
The log-likelihood function from Equation (6) is
The MLEs of the parameters are obtained by iterative solution of the following:
The MLEs for the parametric set are denoted by . As mentioned in the previous sections, the second-order and third-order derivatives from Equation (13) have not been presented here. Based on the second-order derivatives, the elements of the matrix are obtained and denoted by , where .
Using LA, the BEs for , under SELF and NIP, are given in the following:
Similarly considering LA, the BEs for , under SELF and NIP are
The BEs for the model parameters under ELF using NIP are
Finally, the BEs for the model parameters under LLF using NIP are
Similarly, for rest of the cases, the BEs for the parameters and RCs of the 2CMGED under right-censored samples can be obtained.
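To make the role of the loss functions concrete, the sketch below computes Bayes estimates from a set of posterior draws. It assumes the commonly used forms of the four losses, under which the estimators are the posterior mean (SELF), the square root of the posterior second moment (PLF), the reciprocal of the posterior mean of 1/θ (ELF), and −(1/c) ln E[exp(−cθ)] (LLF). The exact definitions used in the paper, and the LINEX constant `c`, are not shown in this text, so both are assumptions here.

```python
import math


def bayes_estimates(draws, c=1.0):
    """Bayes estimates of a positive parameter from posterior draws.

    draws : list of positive posterior samples of the parameter
    c     : LINEX shape constant (hypothetical default)
    """
    n = len(draws)
    mean = sum(draws) / n
    second_moment = sum(t * t for t in draws) / n
    mean_inverse = sum(1.0 / t for t in draws) / n
    mean_exp = sum(math.exp(-c * t) for t in draws) / n
    return {
        "SELF": mean,                       # posterior mean
        "PLF": math.sqrt(second_moment),    # sqrt(E[theta^2])
        "ELF": 1.0 / mean_inverse,          # 1 / E[1/theta]
        "LLF": -math.log(mean_exp) / c,     # -(1/c) ln E[exp(-c theta)]
    }
```

Under these forms, and for `c > 0`, the ELF and LLF estimates never exceed the SELF estimate, while the PLF estimate is never below it, which helps when interpreting differences between the tabulated estimates.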
2.1.4. Importance Sampling (IS)
In this section, the BEs for the parameters of the 2CMGED have been obtained using IS. For importance sampling, the first step is to identify the marginal and conditional densities from the posterior distribution. Interestingly, in Equation (11), the parameter follows a beta distribution with parameters and . On the other hand, the parameters have gamma distributions with parameters and . Similarly, the conditional distributions of are again gamma densities with parameters and , where . Let
Now, the posterior distribution given in Equation (49) can be partitioned as follows: where , ,
, and are given in Equation (49).
Based on IS, the BEs for the parametric set , under NIP using SELF, are
For remaining cases, the similar methodology can be used for the estimation using IS.
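The general idea can be sketched with a self-normalized importance sampling estimator. The sketch assumes the posterior can be written as a density that is easy to sample from multiplied by a remaining weight function; the toy target below is hypothetical and is not the 2CMGED posterior.

```python
import math
import random


def importance_sampling_estimate(u, weight, draw, n_draws=20000):
    """Self-normalized importance sampling estimate of E[u(theta) | data].

    The posterior is assumed proportional to g(theta) * weight(theta),
    where draw() returns one sample from the density g.
    """
    num = 0.0
    den = 0.0
    for _ in range(n_draws):
        theta = draw()
        w = weight(theta)
        num += u(theta) * w
        den += w
    return num / den


# Toy check (hypothetical): g is Gamma(shape=3, rate=2) and the weight is
# exp(-theta), so the target is Gamma(shape=3, rate=3) with mean 1.
rng = random.Random(2022)
estimate = importance_sampling_estimate(
    u=lambda t: t,                                   # SELF: posterior mean
    weight=lambda t: math.exp(-t),
    draw=lambda: rng.gammavariate(3.0, 1.0 / 2.0),   # second argument is the scale
)
```

Choosing `u` as the identity gives the BE under SELF; other loss functions only change the function being averaged.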
3. Results and Discussions
In this section, right-censored data have been generated from the 2CMGED for analysis. Based on these simulated data, the different estimators have been compared with respect to various factors such as sample sizes, priors, loss functions (LFs), and Bayesian approximation methods.
The steps for the numerical simulations are as follows:
Step 1. Generate a random sample of size ‘’ from the proposed model.
Step 2. Next generate uniformly distributed random number () corresponding to each value of the sample.
Step 3. The values of sample for have been considered to come from component-1 and the rest from component-2.
Step 4. Determine the censoring rate (Ri).
Step 5. The starting — values have been observed and remaining values have been assumed to be censored.
Step 6. Use the observed values for analysis.
Step 7. Repeat Step 1 to Step 6 10,000 times, and obtain the BEs using either LA or IS given in Subsections 2.1.3 and 2.1.4, respectively.
Step 8. Obtain the BEs by average of the results computed in Step 7.
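One reading of Steps 1 to 5 can be sketched as below. The sketch assumes the standard GED parameterization, draws each value by inverting the CDF, assigns the component with a uniform random number compared against the mixing weight `p`, and treats the smallest values as observed; all function and argument names are illustrative. Steps 6 to 8 (estimation and averaging over replications) are omitted.

```python
import math
import random


def draw_ged(rng, alpha, lam):
    """Inverse-CDF draw: solve (1 - exp(-lam * x)) ** alpha = u for x."""
    while True:
        v = rng.random() ** (1.0 / alpha)
        if v < 1.0:  # guard against rounding to 1, which would give log(0)
            return -math.log1p(-v) / lam


def simulate_censored_sample(n, p, alpha1, lam1, alpha2, lam2, censor_rate, seed=None):
    """Generate one right-censored sample from the two-component mixture.

    Returns (observed, n_censored, x_r): observed is a sorted list of
    (value, component) pairs, n_censored is the number of censored items,
    and x_r is the largest observed value.
    """
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        # Steps 2-3: a uniform number decides the component membership
        component = 1 if rng.random() <= p else 2
        # Step 1: draw the lifetime from the selected component
        if component == 1:
            x = draw_ged(rng, alpha1, lam1)
        else:
            x = draw_ged(rng, alpha2, lam2)
        sample.append((x, component))
    sample.sort()
    # Steps 4-5: the first r ordered values are observed, the rest are censored
    n_censored = int(round(censor_rate * n))
    r = n - n_censored
    observed = sample[:r]
    x_r = observed[-1][0]
    return observed, n_censored, x_r
```

Under the conventional scheme each censored item would be set to `x_r`; under the CE-based scheme it would be replaced by the conditional expectation of the lifetime given that it exceeds `x_r`.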
Tables 1–12 and Figures 1–10 contain the numerical and graphical results, respectively. The BEs for the parameters of the 2CMGED under right-censored samples have been presented in Tables 1–8. The performance of the posterior estimators has been investigated via the magnitudes of the associated posterior risks (PRs). From the results, it can be observed that larger sample sizes produced improved estimation for the model parameters. The IPs and the LLF yield better estimation than their counterparts. In addition, the estimates under IS are generally better than those under LA, with few exceptions. These trends are also observable from Figures 1–6.
Table 9 and Figure 9 capture the impact of change in mixing parameter () on the estimation of 2CMGED. The samples of size 100 with 20% right censoring with , , , , and IP have been used for the estimation. The larger values of the mixing parameter improve the estimation for the first component of the 2CMGED with still reasonably good estimation for the second component of the 2CMGED. The effect of different censoring rates on the performance of the estimation from 2CMGED has been observed in Table 10 and in Figure 8. For estimation, we assumed , , , , , , LLF, and IP. As per expectations, the lower censoring rates provide the better estimation for the parameters of the 2CMGED. The results for different sets of the parametric values regarding right-censored 2CMGED have been reported in Table 11 and in Figure 10. The IP and LLF with 20% right-censored samples have been considered for this purpose. The smaller choices of the true parametric values from one component of 2CMGED result in improved estimation for the other component of the 2CMGED.
The estimation of the RCs from the right-censored 2CMGED has been given in Table 12. The estimation has been considered under 20% right-censored samples using , , , , , and with , , and . These RCs have been estimated using LA under different situations. The better estimation of the RCs has been observed for larger sample sizes. The advantage of using the IP with LLF has been seen in majority of cases.
Table 13 contains the comparison of results based on conventional censoring schemes and censoring schemes based on CE. From the results, it can be assessed that results based on CE-censored samples are superior to those under conventional censoring schemes. Similarly, Table 14 reports the impact of mixing parameter under CE-censored samples. The results in Table 14 are better than those under conventional censoring schemes given in Table 9. Further, comparing results reported in Table 10 (for conventional censoring schemes) and Table 15 (for CE based censoring schemes), the results under CE based censoring schemes were better irrespective of the choice of censoring rate. The results under conventional censoring schemes and CE based censoring schemes have also been compared for various choices of the true parametric values. The corresponding results have been reported in Table 11 and Table 16, respectively. These results elucidate that the CE-based censored samples provide improved estimation for the model parameters for different choices of the true parametric values.
3.1. Real Life Examples
In this section, two datasets regarding survival times of cancer patients have been used to evaluate the applicability of the proposed model. Dataset-1 contains the survival times (in months) of 121 breast cancer patients. This dataset has been reported by Lawless. The () denotes the censored times. The observations for dataset-1 are as follows: 0.3, 0.3, 4.0, 5.0, 5.6, 6.2, 6.3, 6.6, 6.8, 7.4, 7.5, 8.4, 8.4, 10.3, 11.0, 11.8, 12.2, 12.3, 13.5, 14.4, 14.4, 14.8, 15.5, 15.7, 16.2, 16.3, 16.5, 16.8, 17.2, 17.3, 17.5, 17.9, 19.8, 20.4, 20.9, 21.0, 21.0, 21.1, 23.0, 23.4, 23.6, 24.0, 24.0, 27.9, 28.2, 29.1, 30, 31, 31, 32, 35, 35, 37, 37, 37, 38, 38, 38, 39, 39, 40, 40, 40, 41, 41, 41, 42, 43, 43, 43, 44, 45, 45, 46, 46, 47, 48, 49, 51, 51, 51, 52, 54, 55, 56, 57, 58, 59, 60, 60, 60, 61, 62, 65, 65, 67, 67, 68, 69, 78, 80, 83, 88, 89, 90, 93, 96, 103, 105, 109, 109, 111, 115, 117, 125, 126, 127, 129, 129, 139, and 154. Out of 121 survival times, 56 are right censored. Therefore, the censoring rate in dataset-1 is around 46%.
Dataset-2, also reported by Lawless, contains the survival times (in days) of two groups of cancer patients. The first group contains 51 patients with head and neck cancer (HNC) who were treated with radiotherapy (RT). The second group comprises 45 HNC patients treated with RT and chemotherapy (CT). The observations of both groups are as follows.
Group-I: 7, 34, 42, 63, 64, 74, 83, 84, 91, 108, 112, 129, 133, 133, 139, 140, 140, 146, 149, 154, 157, 160, 160, 165, 173, 176, 185, 218, 225, 241, 248, 273, 277, 279, 297, 319, 405, 417, 420, 440, 523, 523, 583, 594, 1101, 1116, 1146, 1226, 1349, 1412, and 1417.
Group-II: 37, 84, 92, 94, 110, 112, 119, 127, 130, 133, 140, 146, 155, 159, 169, 173, 179, 194, 195, 209, 249, 281, 319, 339, 432, 469, 519, 528, 547, 613, 633, 725, 759, 817, 1092, 1245, 1331, 1557, 1642, 1771, 1776, 1897, 2023, 2146, and 2297.
Out of 51 observations in group-I, 7 observations are right censored. So, in group-I, approximately 14% of the observations are right censored. On the other hand, in group-II, 15 out of 45 observations are right censored with censoring rate 33%. On the whole, for dataset-2, out of 96 observations 22 are right censored. Hence, for dataset-2, the censoring rate is approximately 23%. The descriptive statistics for dataset-1 and dataset-2 have been reported in Table 17. The results in Table 17 suggest that the real datasets used in the study are positively skewed and leptokurtic.
These data have been used to illustrate the applicability of the proposed model. The graphical display of goodness of fit for the 2CMGED and 2CMED using dataset-1 and dataset-2 has been given in Figure 11. In particular, Figure 11(a) shows the comparison of empirical and theoretical densities for the competing models using dataset-1. Similarly, Figure 11(b) presents the comparison of empirical and theoretical CDFs for the competing models using dataset-1. Figure 11(c) and Figure 11(d) show the comparison of empirical and theoretical densities and CDFs for dataset-2, respectively. On the whole, Figure 11 suggests that the 2CMGED is better able to represent the behavior of both datasets than the 2CMED.
The modeling capabilities of the proposed model have been further evaluated using different goodness-of-fit criteria, namely, the Akaike information criterion (AIC), Bayesian information criterion (BIC), Cramer-von Mises (CM) statistic, Anderson-Darling (AD) statistic, and Kolmogorov-Smirnov (KS) statistic. The results have been reported in Table 18. They indicate that all the goodness-of-fit statistics are smaller for the 2CMGED than for the 2CMED. So, the performance of the 2CMGED is better than that of the 2CMED.
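For reference, the two information criteria can be computed from a maximized log-likelihood as sketched below, using their standard definitions. The parameter counts follow from the model definitions (two shapes, two rates, and one mixing weight for the 2CMGED; two rates and one mixing weight for the 2CMED); the log-likelihood values in the usage lines are hypothetical placeholders, not results from Table 18.

```python
import math


def aic(loglik, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2.0 * k - 2.0 * loglik


def bic(loglik, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return k * math.log(n) - 2.0 * loglik


# Hypothetical maximized log-likelihoods for a sample of n = 121 observations
aic_2cmged = aic(-560.0, k=5)
aic_2cmed = aic(-566.0, k=3)
bic_2cmged = bic(-560.0, k=5, n=121)
bic_2cmed = bic(-566.0, k=3, n=121)
```

Because the 2CMGED carries two extra parameters, it is preferred by these criteria only when the gain in log-likelihood outweighs the added penalty, which is larger under BIC than under AIC for samples of this size.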
4. Conclusions
The study has been conducted to explore the suitability of the 2CMGED for modeling right-censored medical datasets having mixture behavior. In addition, the estimation of model parameters using CE-based censored samples has been introduced, and the results based on conventional and CE-based censored samples have been compared. Bayesian methods have been proposed to estimate the model parameters. In the first phase of the study, a detailed simulation study was carried out to evaluate the performance of the proposed estimators. The numerical simulations were carried out using R software. The results from the simulation study confirm the consistency property of the proposed estimators. The estimates based on IS, IP, and LLF were found superior to their counterparts. The simulation results also indicate that estimation using CE-based censored samples was superior to that under conventionally censored samples. The superiority of CE-based censored samples was observed for different choices of sample size, true parametric values, mixing weights, and censoring rates. In the second phase of the study, two real datasets relating to survival times of cancer patients were used to illustrate the applicability of the proposed model in the medical field. In addition, the performance of the proposed 2CMGED was compared with that of the 2CMED in modeling the said datasets. Based on various goodness-of-fit statistics such as AIC, BIC, the CM statistic, the AD statistic, and the KS statistic, the 2CMGED was found superior to the 2CMED. Hence, the 2CMGED was found to be a very promising candidate for modeling survival times of patients suffering from cancer.
Since the medical datasets can be left censored and doubly censored in some cases, the study can further be extended for the said censoring schemes. The study can also be extended for using lifetime models with bathtub shape hazard rates, because such models have been shown to fit the medical datasets efficiently.
Abbreviations
2CMED: Two-component mixture of exponential distribution
2CMGED: Two-component mixture of generalized exponential distribution
AIC: Akaike information criterion
BIC: Bayesian information criterion
CDF: Cumulative distribution function
ECDF: Empirical cumulative distribution function
ELF: Entropy loss function
HNC: Head and neck cancer
LLF: LINEX loss function
PLF: Precautionary loss function
SELF: Squared error loss function
Data Availability
The data used in this study are included within the paper.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
References
S. Mahmud, W. Y. Lou, and N. W. Johnston, “A probit-log-skew-normal mixture model for repeated measures data with excess zeros, with application to a cohort study of paediatric respiratory symptoms,” BMC Medical Research Methodology, vol. 10, no. 1, pp. 1–12, 2010.
S. Ali, M. Aslam, D. Kundu, and S. M. A. Kazmi, “Bayesian estimation of the mixture of generalized exponential distribution: a versatile lifetime model in industrial processes,” Journal of the Chinese Institute of Industrial Engineers, vol. 29, no. 4, pp. 246–269, 2012.
M. M. Mohamed, E. Saleh, and S. M. Helmy, “Bayesian prediction under a finite mixture of generalized exponential lifetime model,” Pakistan Journal of Statistics and Operation Research, vol. 10, no. 4, pp. 417–433, 2014.