Abstract

In this paper we test statistical probability models for breast cancer survival data by race and ethnicity. Data were collected from breast cancer patients diagnosed in the United States during the years 1973–2009. We selected a stratified random sample of Black Hispanic female patients from the Surveillance, Epidemiology, and End Results (SEER) database to derive the statistical probability models. We used three common model-building criteria, the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Deviance Information Criterion (DIC), to measure goodness of fit, and found that the survival data of Black Hispanic female patients were best fit by the exponentiated exponential probability model. A novel Bayesian method was used to derive the posterior density function for the model parameters as well as the predictive inference for a future response. We focused specifically on the Black Hispanic race. A Markov Chain Monte Carlo (MCMC) method was used to obtain summary results for the posterior parameters. Additionally, we report predictive intervals for future survival times. These findings would be of great significance in treatment planning and healthcare resource allocation.

1. Introduction

1.1. Problem Statement: Breast Cancer in Black Hispanic Women

Cancer is defined as a process that induces irreversible mutations in cellular genetic processes, resulting in uncontrolled growth and proliferation [1]. A tumor is defined as any abnormal growth of cancer cells that forms a lump or mass. The human breast is primarily composed of fat, connective tissues, lymphatic vessels, and organized lobules of milk-secreting glands. These lobules are connected exteriorly to the nipple via secretory ducts. Most breast cancers are carcinoma in situ (CIS) because they are confined by epithelial boundaries either to the duct (ductal carcinoma in situ (DCIS)) or to the lobule (lobular carcinoma in situ (LCIS)).

Breast cancer is one of the most common life-threatening cancers in women of any age group. According to the World Health Organization (WHO) report 2004, breast cancer comprises approximately 16% of all cancer types and causes 519,000 deaths annually worldwide. Surprisingly, 69% of these deaths occurred in developing countries, refuting the misconception that breast cancer is a disease of the developed world [2]. According to the American Institute of Cancer Research (AICR), over 226,000 cases of breast cancer are diagnosed every year in the USA and approximately 40,000 American women die of breast cancer every year [3].

1.2. Breast Cancer according to Race and Ethnicity

In the United States, breast cancer is one of the most frequently diagnosed cancers across different racial and ethnic groups [2]. Race/ethnicity-specific incidence rates remained fairly constant for all racial and ethnic groups during the years 2004–2008 [1]. Previously, it was believed that family history, socioeconomic status, level of education, frequency of mammograms, and access to health care resources were some of the major determinants affecting the prognosis of the disease. However, recent studies have shown that racial and ethnic factors also contribute significantly to the prognosis of the disease.

The American Cancer Society has found evidence of notable differences in breast cancer death rates between different states, across various socioeconomic strata, and between different racial/ethnic groups [1]. Although age is the strongest predictor of breast cancer risk, race/ethnicity could also be a major risk factor [1]. Since the early 1990s, breast cancer mortality rates have decreased among all ethnic groups except American Indians/Alaska Natives, thereby showing another racial disparity associated with the disease. In the United States, White women are more likely to develop breast cancer than African-American, Hispanic, Asian, or American Indian/Alaska Native women [1].

1.3. Hispanic Black Women

Although Hispanics are the fastest growing minority population in the United States, there are few breast cancer statistics on Black Hispanics specifically. Breast cancer data for Hispanics are usually tabulated under one ethnic group (Hispanic); therefore, race-specific breast cancer data for the Hispanic population are not readily available. Overall, the incidence and mortality rates of breast cancer among Hispanic (Black and White) women are lower than those among non-Hispanic White women [3]. Hispanic women (Black and White) show lower levels of awareness about the risk factors associated with the disease and have less access to health care facilities when compared to women of any other ethnicity.

Unfortunately, few studies elucidate breast cancer disparities among different races within the Hispanic ethnicity. Research findings usually describe breast cancer incidence, mortality, and death rates and other vital statistics associated with the disease among Hispanic women without any details about interracial differences within the ethnicity. Banegas and Li [4] have asserted that further study of specific breast cancer outcomes among the different races of Hispanic women could greatly enhance knowledge about the distribution and determinants of the disease in this high-risk ethnic group. Their study showed that non-Hispanic Blacks had a 1.5–2.5-fold greater risk of having stage IV breast cancer and a 10–50% greater risk of breast cancer specific mortality compared to non-Hispanic Whites [4]. This finding again shows the need for a study that tries to understand the current state of breast cancer survival in this subpopulation within the United States.

1.4. Statistical Probability Models

Healthcare personnel have collected vast amounts of phenomic and genomic data, which should be fully utilized for research. These large databases should be tested with newer statistical methods and statistical probability models. This would be very useful for predicting future patterns of disease morbidity and mortality, thereby enhancing our understanding of disease severity and outcomes.

Data may follow several statistical probability models like exponential, gamma, Weibull, normal, half-normal, log-normal, Rayleigh, inverse Gaussian, exponentiated exponential (EE), exponentiated Weibull (EW), beta generalized exponential (BGE), beta inverse Weibull (BIW), and so forth. Statistical models are immensely useful to characterize the data and derive reliable scientific inferences.

The biomedical and engineering fields often use the exponentiated exponential model (EEM) for data modeling. The EEM, a generalization of the exponential distribution, was introduced by Gupta and Kundu [5] and received rapid and widespread acceptance. The EEM has two parameters: “shape” and “scale.” Moreover, Gupta and Kundu [6] observed that the EEM is similar to the Weibull family and suggested the possibility of using the EE distribution as a substitute for the Weibull model.

A random variable $X$ is said to follow the exponentiated exponential distribution if its probability density function (pdf) is given by

$$f(x; \alpha, \lambda) = \alpha \lambda \left(1 - e^{-\lambda x}\right)^{\alpha - 1} e^{-\lambda x}, \quad x > 0,\ \alpha > 0,\ \lambda > 0, \tag{1}$$

where $\alpha$ is the shape parameter and $\lambda$ is the scale parameter. Gupta and Kundu [5] introduced the above mentioned density function for the exponentiated exponential distribution. The random variable can be expressed as $X \sim \mathrm{EE}(\alpha, \lambda)$. Some interesting applications of the EEM include designing rainfall estimation in the Coast of Chiapas [7], analysis of Los Angeles rainfall data [8], and software reliability growth models for vital quality metrics [9]. A cure rate model based on the generalized exponential distribution that incorporates the effects of risk factors or covariates on the probability of an individual being a long-time survivor was proposed by Kannan et al. [10]. Also, a Gompertz form of the exponentiated exponential model was used to predict squid axon voltage clamp conductance [11].
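As a minimal sketch, the EE density with shape $\alpha$ and scale $\lambda$, together with its distribution function $F(x) = (1 - e^{-\lambda x})^{\alpha}$ and the corresponding quantile function, could be coded as follows. The function names and the parameter values in the usage lines are illustrative assumptions and are not taken from the analysis in this paper.

```python
import math

def ee_pdf(x, alpha, lam):
    """Density of EE(alpha, lam): alpha*lam*(1 - e^{-lam x})^{alpha-1} * e^{-lam x}."""
    if x <= 0:
        return 0.0
    base = -math.expm1(-lam * x)  # 1 - exp(-lam * x), computed stably
    return alpha * lam * base ** (alpha - 1) * math.exp(-lam * x)

def ee_cdf(x, alpha, lam):
    """Distribution function: the exponential CDF raised to the power alpha."""
    if x <= 0:
        return 0.0
    return (-math.expm1(-lam * x)) ** alpha

def ee_quantile(p, alpha, lam):
    """Inverse of ee_cdf for 0 < p < 1."""
    return -math.log1p(-p ** (1.0 / alpha)) / lam

# Illustrative (hypothetical) parameter values, not estimates from the paper.
median = ee_quantile(0.5, alpha=2.0, lam=0.02)
print(round(ee_cdf(median, 2.0, 0.02), 6))  # 0.5
```

Setting `alpha = 1` recovers the ordinary exponential density, which is the sense in which the EE model generalizes the exponential distribution.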

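For a beta generalized exponential model, the probability density function is given by

$$f(x; \alpha, \lambda, a, b) = \frac{\alpha \lambda}{B(a, b)}\, e^{-\lambda x} \left(1 - e^{-\lambda x}\right)^{\alpha a - 1} \left\{1 - \left(1 - e^{-\lambda x}\right)^{\alpha}\right\}^{b - 1}, \quad x > 0,$$

where $\alpha$ is the shape parameter, $\lambda$ is the scale parameter, and $B(a, b)$ is the beta function. There are two additional parameters, $a > 0$ and $b > 0$. The role of these parameters is to describe skewness and tail weight [12]. The BGE model generalizes some well-known models, for example, the beta exponential and generalized exponential models, as special cases.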

A Swedish physicist called Weibull [13] introduced the Weibull distribution primarily for examining the breaking strength of materials. The first EW model with bathtub shaped distribution and unimodal failure rates was introduced by Mudholkar and Srivastava [14]. Since then, application of the EW model to analyze lifetime data has been recommended by Nassar and Eissa [15] and Choudhury [16].

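For the EW model, the probability density function (pdf) is given by

$$f(x; \alpha, \theta, \sigma) = \frac{\alpha \theta}{\sigma} \left(\frac{x}{\sigma}\right)^{\alpha - 1} \exp\left\{-\left(\frac{x}{\sigma}\right)^{\alpha}\right\} \left[1 - \exp\left\{-\left(\frac{x}{\sigma}\right)^{\alpha}\right\}\right]^{\theta - 1}, \quad x > 0,$$

where $\alpha$ and $\theta$ are the shape parameters and $\sigma$ is the scale parameter.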

The BIW model has several applications to problems in the engineering, health, and medical fields. It provides the best fit for several data sets, for instance, the time taken for breakdown of insulating fluids subjected to tensions [17]. For the BIW model, the probability density function (pdf) is given by

$$f(x; \beta, a, b) = \frac{\beta}{B(a, b)}\, x^{-(\beta + 1)}\, e^{-a x^{-\beta}} \left(1 - e^{-x^{-\beta}}\right)^{b - 1}, \quad x > 0,$$

where $\beta$ represents the shape parameter, $B(a, b)$ is the beta function, and the two parameters $a$ and $b$ represent skewness and tail weight.
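The BGE and BIW models are both beta-generated families: a baseline distribution function $G$ with density $g$ is transformed into the density $g(x)\,G(x)^{a-1}\{1 - G(x)\}^{b-1}/B(a, b)$. The sketch below illustrates this shared construction under that assumption, using the generalized (exponentiated) exponential and the inverse Weibull as baselines. All function names are hypothetical.

```python
import math

def beta_fn(a, b):
    """Beta function B(a, b) expressed through the gamma function."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_g_pdf(x, a, b, base_pdf, base_cdf):
    """Beta-generated density: g(x) * G(x)^(a-1) * (1 - G(x))^(b-1) / B(a, b)."""
    G = base_cdf(x)
    return base_pdf(x) * G ** (a - 1) * (1.0 - G) ** (b - 1) / beta_fn(a, b)

# Baseline 1: generalized (exponentiated) exponential, shape alpha, scale lam.
def ge_cdf(x, alpha, lam):
    return (-math.expm1(-lam * x)) ** alpha

def ge_pdf(x, alpha, lam):
    return alpha * lam * (-math.expm1(-lam * x)) ** (alpha - 1) * math.exp(-lam * x)

# Baseline 2: inverse Weibull with shape beta (unit scale).
def iw_cdf(x, beta):
    return math.exp(-x ** (-beta))

def iw_pdf(x, beta):
    return beta * x ** (-(beta + 1)) * math.exp(-x ** (-beta))

def bge_pdf(x, alpha, lam, a, b):
    """Beta generalized exponential density for x > 0."""
    return beta_g_pdf(x, a, b,
                      lambda t: ge_pdf(t, alpha, lam),
                      lambda t: ge_cdf(t, alpha, lam))

def biw_pdf(x, beta, a, b):
    """Beta inverse Weibull density for x > 0."""
    return beta_g_pdf(x, a, b,
                      lambda t: iw_pdf(t, beta),
                      lambda t: iw_cdf(t, beta))
```

With `a = b = 1` the beta-generated density reduces to its baseline, which is one way to see the generalized exponential model as a special case of the BGE model.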

A novel Bayesian method can be used to derive the posterior distribution of the parameters and to carry out posterior inference. In a Bayesian estimation technique, model parameters and data are both treated as random variables, and their joint probability distribution is specified by a probabilistic model. Data are considered “observed variables” and parameters “unobserved variables.” The “prior” contains the information available about the parameters before the data are observed. The likelihood depends on the model of the underlying process and is a conditional distribution that specifies the probability of the observed data given the parameters. Multiplying the likelihood by the prior gives the joint distribution of the data and the parameters, which combines all the information available about the parameters. By conditioning this joint distribution on the observed data, inference about the parameters of the probability model can be derived. Bayesian inference thus aims to obtain the posterior distribution of the parameters for a given set of observed data.

Readers can refer to Berger [18], Geisser [19], Bernardo and Smith [20], Ahsanullah and Ahmed [21], Gelman [22], and Baklizi [23, 24] for further information on Bayesian methods. Khan et al. [25], Thabane [26], Thabane and Haq [27], Ali-Mousa and Al-Sagheer [28], and Raqab [29, 30] have discussed several additional applications of Bayesian method for predictive inferences.

Objectives of this paper include (i) studying some demographic and socioeconomic variables; (ii) reviewing right skewed models EE, EW, BGE, and BIW; (iii) justifying that the given sample data follows a specific model by applying model selection criteria through goodness of fit tests; (iv) performing a Bayesian analysis of the posterior distribution of the parameters; and (v) deriving Bayesian predictive model for future response.

The structural organization of this paper is as follows: Section 2 includes a real example of breast cancer data discussed in detail; Section 3 includes the measure of goodness of fit tests, log-likelihood functions, and the posterior inference for the model parameters for race/ethnicity (Black Hispanic females only); Section 4 includes the Bayesian predictive model which includes the likelihood function, posterior density function, and the predictive density for a future response given a set of observations from the best model; Section 5 includes the results and discussions; and Section 6 includes the conclusion.

2. Real Life Data Example

Breast cancer data from the Surveillance, Epidemiology, and End Results (SEER, 1973–2009) website have been used as a real life data example. Data on breast cancer patients collected from twelve states are stored in the SEER database. A stratified random sampling scheme was used to pick nine states randomly from these twelve states. Data from these nine states included state-wise race/ethnicity categories for breast cancer distribution.

This data included 4,269 males and 653,443 females. Since breast cancers are rare in males, data from females only were used in our analysis. A simple random sampling (SRS) method was used to select 298 female subjects from Black Hispanic data.

Figure 1 shows the pedigree chart for the selection of Black Hispanic breast cancer patients out of the total female breast cancer patients. In the total population, there were 300 Black Hispanic female patients, but data were missing for 2 participants. Figure 2 shows the nine randomly selected states (dark blue regions), from which the Black Hispanic breast cancer patients were then randomly selected.

The descriptive statistics (frequency distribution and summary statistics) are shown in Tables 1 and 2, respectively. Table 1 contains the state-wise frequencies and corresponding percentages for the selected patients. Table 2 has the descriptive statistics (mean, standard deviation, median, quartiles, and variance) for some demographic characteristics (age at diagnosis, survival times, and marital status at diagnosis) of the selected random sample of Black Hispanic breast cancer patients.

We selected 2,000 non-Hispanic Blacks out of 53,531 non-Hispanic Blacks for comparison with the 298 Black Hispanics in this sample. The mean survival was 66.76 (standard deviation 30.20) for non-Hispanic Blacks and 71.38 (standard deviation 61.33) for Black Hispanics. Cox proportional hazards regression was used to calculate hazard ratios by ethnicity. Hazard ratios compare the rate at which an event occurs in one group versus another and take into account the time elapsed until the event occurs. In survival analysis, the event under consideration is death. A hazard ratio of 1.0 represents an equal risk of death between the groups being compared, a hazard ratio above 1.0 represents an increased risk of death, and a hazard ratio below 1.0 represents a decreased risk of death compared to the referent group. In this analysis, we used non-Hispanic Blacks as the referent group. Statistical significance is established if the 95% confidence interval does not include 1. Non-Hispanic Blacks had a significantly increased risk of death compared to Black Hispanics (hazard ratio: 1.445; 95% Wald robust confidence limits 1.210–1.724; 95% profile likelihood confidence limits 1.265–1.659). Equivalently, when non-Hispanic Blacks are taken as the referent group, Hispanic Blacks had a significantly decreased risk of death (hazard ratio: 0.692; 95% Wald robust confidence limits 0.580–0.826; 95% profile likelihood confidence limits 0.603–0.791); the two hazard ratios are reciprocals of each other. These results are consistent with the mean survival times for each group as well as the observed survival curve, confirming the longer survival among Hispanic Blacks and the shorter survival among non-Hispanic Blacks.
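The two reported hazard ratios describe the same comparison with the referent group swapped, so each estimate and its confidence limits should be reciprocals of the other (with the lower and upper limits trading places). The following short check, in which the function name is an illustrative assumption, reproduces the second set of figures from the first.

```python
def invert_hazard_ratio(hr, lower, upper):
    """Swap the referent group: the estimate and both limits become reciprocals,
    and the lower and upper limits trade places."""
    return 1.0 / hr, 1.0 / upper, 1.0 / lower

# Reported values for non-Hispanic Blacks relative to Black Hispanics.
wald = invert_hazard_ratio(1.445, 1.210, 1.724)
profile = invert_hazard_ratio(1.445, 1.265, 1.659)

print([round(v, 3) for v in wald])     # [0.692, 0.58, 0.826]
print([round(v, 3) for v in profile])  # [0.692, 0.603, 0.791]
```

The rounded reciprocals agree with the reported hazard ratio of 0.692 and its Wald (0.580–0.826) and profile likelihood (0.603–0.791) limits.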

3. Methods of Goodness of Fit

The Akaike Information Criterion (AIC), the Deviance Information Criterion (DIC), and the Bayesian Information Criterion (BIC) are the most commonly used criteria for measuring goodness of fit. DIC, a Bayesian measure of fit, is used to compare different models, for example, in the analysis of public data by Congdon [31, 32]. The values of DIC can be either positive or negative. Models with lower values are considered better than others. DIC is similar to AIC and provides the same results as AIC when models with only fixed effects are fitted. BIC is an asymptotic result which assumes that the data distribution belongs to an exponential family, and it can only be used to compare estimated models when the numerical values of the dependent variable are identical for all estimates being compared. The BIC penalizes free parameters more heavily than the AIC. As is the case with AIC, given any two estimated models, the model with the lower value of BIC is preferred.
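For reference, the standard definitions are AIC $= 2k - 2\log L$ and BIC $= k \log n - 2\log L$, where $k$ is the number of free parameters, $n$ the sample size, and $\log L$ the maximized log-likelihood. The sketch below assumes these textbook definitions; the software used in the analysis may follow a different convention. The log-likelihood shown is not a reported value: it is back-solved from the reported EE AIC of 3136.72 with $k = 2$, purely for illustration.

```python
import math

def aic(log_lik, k):
    """Akaike Information Criterion: 2k - 2 log L."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian Information Criterion: k log(n) - 2 log L."""
    return k * math.log(n) - 2 * log_lik

n = 298                                  # Black Hispanic sample size
k = 2                                    # EE model: shape and scale
log_lik_implied = (2 * k - 3136.72) / 2  # back-solved from the reported AIC (assumption)

print(round(aic(log_lik_implied, k), 2))  # 3136.72
print(math.log(n) > 2)                    # True: BIC's per-parameter penalty exceeds AIC's
```

With $n = 298$, each extra parameter costs about 5.7 units under BIC against 2 under AIC, which is why the four-parameter BGE model is penalized more heavily by BIC.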

3.1. The Log-Likelihood Function and Reparameterization

A reparameterization method from the Birnbaum-Saunders lifetime model was proposed by Ahmed et al. [33]. Later, Achcar et al. [34] considered a reparameterization from certain skewed models. A reparameterization method may be applied in terms of the log-likelihood functions considering data from the models described earlier which are given in the following.

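The log-likelihood function for a random sample $x_1, \ldots, x_n$ from the EE model with shape $\alpha$ and scale $\lambda$ is given by

$$\ell(\alpha, \lambda) = n \log \alpha + n \log \lambda + (\alpha - 1) \sum_{i=1}^{n} \log\left(1 - e^{-\lambda x_i}\right) - \lambda \sum_{i=1}^{n} x_i.$$

The shape and scale parameters are then reparameterized, and the reparameterized quantities are assumed to be independently distributed. To obtain a noninformative prior for the shape and scale parameters, a uniform prior distribution is assigned to each reparameterized quantity. The joint posterior density is then proportional to the product of the likelihood and these uniform priors.

The log-likelihood function for the beta generalized exponential model with shape $\alpha$, scale $\lambda$, and additional parameters $a$ and $b$ is given by

$$\ell(\alpha, \lambda, a, b) = n \log \alpha + n \log \lambda - n \log B(a, b) - \lambda \sum_{i=1}^{n} x_i + (\alpha a - 1) \sum_{i=1}^{n} \log\left(1 - e^{-\lambda x_i}\right) + (b - 1) \sum_{i=1}^{n} \log\left\{1 - \left(1 - e^{-\lambda x_i}\right)^{\alpha}\right\}.$$

The four parameters are reparameterized in the same way, and the reparameterized quantities are further assumed to be independently distributed. To obtain a noninformative prior for the four parameters, a uniform prior distribution is assigned to each reparameterized quantity. The joint posterior density is then proportional to the product of the likelihood and these uniform priors.

The log-likelihood function for the EW model with shape parameters $\alpha$ and $\theta$ and scale parameter $\sigma$, whose distribution function is $[1 - \exp\{-(x/\sigma)^{\alpha}\}]^{\theta}$, is given by

$$\ell(\alpha, \theta, \sigma) = n \log \alpha + n \log \theta - n \log \sigma + (\alpha - 1) \sum_{i=1}^{n} \log\left(\frac{x_i}{\sigma}\right) - \sum_{i=1}^{n} \left(\frac{x_i}{\sigma}\right)^{\alpha} + (\theta - 1) \sum_{i=1}^{n} \log\left[1 - \exp\left\{-\left(\frac{x_i}{\sigma}\right)^{\alpha}\right\}\right].$$

The three parameters are reparameterized, and the reparameterized quantities are further assumed to be independently distributed. To obtain a noninformative prior for the three parameters, a uniform prior distribution is assigned to each reparameterized quantity. The joint posterior density is then proportional to the product of the likelihood and these uniform priors.

The log-likelihood function for the BIW model with shape parameter $\beta$ and additional parameters $a$ and $b$ is given by

$$\ell(\beta, a, b) = n \log \beta - n \log B(a, b) - (\beta + 1) \sum_{i=1}^{n} \log x_i - a \sum_{i=1}^{n} x_i^{-\beta} + (b - 1) \sum_{i=1}^{n} \log\left(1 - e^{-x_i^{-\beta}}\right).$$

The three parameters are reparameterized, and the reparameterized quantities are further assumed to be independently distributed. To obtain a noninformative prior for the three parameters, a uniform prior distribution is assigned to each reparameterized quantity.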

The joint posterior density for the BIW model is then proportional to the product of the likelihood and the uniform priors. The reparameterization method yields better-behaved posterior distributions for the parameters. Table 3 gives the results of the measures of goodness of fit for Black Hispanic females. Tables 4–7 summarize the results for the posterior parameters. Figures 3–6 show the posterior kernel densities for the parameters.
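As an illustration of how such a log-likelihood can be evaluated, the sketch below codes the EE log-likelihood $n\log\alpha + n\log\lambda + (\alpha - 1)\sum\log(1 - e^{-\lambda x_i}) - \lambda\sum x_i$ and a reparameterized version. The log transform used in `ee_loglik_rho` is one common choice and is an assumption of this sketch, as are the function names and the survival times, which are hypothetical.

```python
import math

def ee_loglik(data, alpha, lam):
    """Log-likelihood of an EE(alpha, lam) sample with positive observations."""
    n = len(data)
    s_log = sum(math.log(-math.expm1(-lam * x)) for x in data)  # sum of log(1 - e^{-lam x})
    return n * math.log(alpha) + n * math.log(lam) + (alpha - 1) * s_log - lam * sum(data)

def ee_loglik_rho(data, rho1, rho2):
    """Same log-likelihood on an unconstrained scale.

    Assumed reparameterization: alpha = exp(rho1), lam = exp(rho2).
    """
    return ee_loglik(data, math.exp(rho1), math.exp(rho2))

# Hypothetical survival times in months, for illustration only.
times = [12.0, 30.0, 45.5, 80.0, 150.0]
print(ee_loglik(times, alpha=2.0, lam=0.02))
```

Working on an unconstrained scale means a sampler can propose any real value without violating the positivity of the shape and scale parameters.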

3.2. The Results of Goodness of Fit Tests and Posterior Inference for the Parameters from the Black Hispanic Survival Data

Table 3 includes the AIC, BIC, and DIC values for the EE, EW, BGE, and BIW models. A better model fit is indicated by lower values of AIC, BIC, and DIC. The data fit the EE model better than the other models: its estimated AIC is the lowest (3136.72), and its DIC value is very close to the AIC. Comparing the estimated AIC, BIC, and DIC values across the models, the EEM fits the survival days best because it produces the smallest values of all three criteria.

Table 4 summarizes the results of the posterior distribution of the parameters of the EE model for the Black Hispanic breast cancer patients’ survival data. In the Bayesian approach, knowledge of the distribution of the parameters is updated using the observed data, resulting in what is known as the posterior distribution of the parameters. In the case of breast cancer data, we are interested in estimating the posterior distribution of the parameters, assuming that the observed random sample follows an appropriate statistical probability distribution.

The initial values of the shape and scale parameters are generated from the data, and the posterior distributions of the parameters are estimated using the MCMC method. Markov Chain Monte Carlo is a class of algorithms used in statistics to generate samples from a probability distribution [35]. The EE model is used to derive the log-likelihood function, and the parameters are assigned appropriate theoretical probability distributions. The summary results (mean, SD, MC error, median, and confidence intervals) of the parameters are obtained using the WinBUGS software. In this process the first 1,000 iterations are discarded in order to remove any bias in the estimated parameter values resulting from the values used to initialize the chain. This process is known as burn-in. The samples remaining after elimination of the burn-in samples are treated as samples from the target distribution. Fifty thousand (50,000) Monte Carlo repetitions were used to produce the inference for the posterior parameters shown in Table 4. The graphical representation of the parameters’ behavior is displayed in Figure 3. After 50,000 Monte Carlo repetitions, the kernel densities for both the shape and scale parameters follow approximately symmetric distributions.
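The burn-in idea can be illustrated with a small random-walk Metropolis sampler. This is only a sketch and not the algorithm WinBUGS runs internally. It assumes flat priors on $\log\alpha$ and $\log\lambda$, simulated rather than SEER data, and a much shorter chain than the 50,000 repetitions used in the analysis; all names and tuning values are hypothetical.

```python
import math
import random

def ee_loglik(data, alpha, lam):
    """Log-likelihood of an EE(alpha, lam) sample."""
    n = len(data)
    s_log = sum(math.log(-math.expm1(-lam * x)) for x in data)
    return n * math.log(alpha * lam) + (alpha - 1) * s_log - lam * sum(data)

def simulate_ee(n, alpha, lam, seed=0):
    """Draw n EE(alpha, lam) values by inverting the distribution function."""
    rng = random.Random(seed)
    return [-math.log1p(-rng.random() ** (1.0 / alpha)) / lam for _ in range(n)]

def metropolis_ee(data, n_iter=6000, burn_in=1000, step=0.1, seed=1):
    """Random-walk Metropolis on (log alpha, log lam) with flat priors (assumed)."""
    rng = random.Random(seed)
    # Initial values taken from the data: alpha = 1, lam = 1 / sample mean.
    rho = [0.0, -math.log(sum(data) / len(data))]
    logp = ee_loglik(data, math.exp(rho[0]), math.exp(rho[1]))
    draws = []
    for i in range(n_iter):
        prop = [r + rng.gauss(0.0, step) for r in rho]
        logp_prop = ee_loglik(data, math.exp(prop[0]), math.exp(prop[1]))
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, logp_prop - logp)):
            rho, logp = prop, logp_prop
        if i >= burn_in:  # discard the early iterations as burn-in
            draws.append((math.exp(rho[0]), math.exp(rho[1])))
    return draws

data = simulate_ee(200, alpha=2.0, lam=0.02)  # synthetic survival times
draws = metropolis_ee(data)
alpha_mean = sum(a for a, _ in draws) / len(draws)
lam_mean = sum(l for _, l in draws) / len(draws)
print(len(draws), alpha_mean, lam_mean)
```

Only the draws after the first 1,000 iterations are retained, mirroring the burn-in step described above. In practice, trace plots and MC errors would be inspected before trusting the summaries.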

Table 5 shows the summary results of the posterior distribution of the parameters of the exponentiated Weibull model, obtained from the Black Hispanic female breast cancer patients’ survival data. The initial values of the three parameters have been generated from the data. Using the MCMC method, the posterior distributions of the three parameters are estimated starting from these generated values. The log-likelihood function is derived from the EW model, and the parameters are then assigned appropriate probability distributions. The summary results (mean, SD, MC error, median, and confidence intervals) of the parameters are derived using the WinBUGS software. The graphical representation of the behavior of the parameters is summarized in Figure 4. The shape parameters show approximately normal distributions, while the other model parameters show skewed distributions.

Table 6 shows the summary results of the posterior distribution of the parameters of the BGE model, obtained from the Black Hispanic female breast cancer patients’ data. We used the WinBUGS software to obtain the summary results (mean, SD, MC error, median, and confidence intervals) of the parameters. The graphical representation of the parameters for females in the case of the BGE model is displayed in Figure 5. Two of the parameters show a symmetrical pattern of distribution, while the other parameters show nonsymmetrical distributions.

Table 7 shows the summary results of the posterior distribution of the parameters of the BIW model, obtained from the Black Hispanic female breast cancer patients’ survival data. The summary results (mean, SD, MC error, median, and confidence intervals) of the parameters have been derived using the WinBUGS software. The graphical representations of the parameters for females in the case of the BIW model are displayed in Figure 6. Two of the BIW parameters show a skewed distribution pattern, while the other parameters show approximately uniform distributions.

4. The Bayesian Predictive Survival Model

Due to the current economic crisis, health care costs are increasing tremendously. It is important for health care researchers and providers to promptly identify the high risk population variables for several diseases. The goal is to identify and provide preventive interventions without significantly increasing the cost of management. Currently, predictive modeling is a popular technique used for high-risk assessment at very low costs. Health care providers and researchers will greatly benefit from predictive modeling both to improve present health care services and reduce future health care costs.

Predictive modeling is a process that can be applied to available healthcare data, for instance, identification of people who have high medical need and who are “at risk” for above-average future medical service utilization. We are deriving a novel Bayesian method which can predict the breast cancer survival days based on past data collected from patients.

The Bayesian predictive method is growing extremely popular, finding newer applications in the fields of health sciences, engineering, environmental sciences, business and economics, and social sciences, among others. The Bayesian predictive approach is used for the design and analysis of survival research studies in the health sciences. It is widely used to reduce healthcare costs and to economically allocate healthcare resources.

In this section, a predictive survival model for breast cancer patients is developed by using a novel Bayesian method. It is found that the Black Hispanic female breast cancer patients’ data follow the EE model.

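Let us assume that the data $\mathbf{x} = (x_1, \ldots, x_n)$ represent female breast cancer patients’ survival days that follow the EE model with shape $\alpha$ and scale $\lambda$, and let $z$ be a future response (or future survival days). The predictive density of $z$ for the observed data $\mathbf{x}$ is

$$p(z \mid \mathbf{x}) = \iint f(z \mid \alpha, \lambda)\, p(\alpha, \lambda \mid \mathbf{x})\, d\alpha\, d\lambda,$$

where $p(\alpha, \lambda \mid \mathbf{x})$ is the posterior density function and $f(z \mid \alpha, \lambda)$ represents the probability density function of a future response ($z$) that may be defined from model (1). The posterior density is given by

$$p(\alpha, \lambda \mid \mathbf{x}) = \Psi\, L(\alpha, \lambda \mid \mathbf{x})\, p(\alpha, \lambda),$$

where $L(\alpha, \lambda \mid \mathbf{x})$ is the likelihood function, $p(\alpha, \lambda)$ is the prior density for the parameters, and the reciprocal of the normalizing constant $\Psi$ is

$$\Psi^{-1} = \iint L(\alpha, \lambda \mid \mathbf{x})\, p(\alpha, \lambda)\, d\alpha\, d\lambda,$$

with the integrals taken over the parameter space.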

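To derive the likelihood function, let $x_1, \ldots, x_n$ be a random sample of size $n$ from model (1). Thus, $\mathbf{x} = (x_1, \ldots, x_n)$ forms an observed sample. Then given a set of data $\mathbf{x}$ from (1), the likelihood function for the shape $\alpha$ and scale $\lambda$ is given by

$$L(\alpha, \lambda \mid \mathbf{x}) = \alpha^{n} \lambda^{n} \exp\left(-\lambda \sum_{i=1}^{n} x_i\right) \prod_{i=1}^{n} \left(1 - e^{-\lambda x_i}\right)^{\alpha - 1}.$$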

An estimation theory under uncertain prior information was discussed in detail by Ahmed [36]. Further details on Bayes and empirical Bayes estimates of survival and hazard functions of a class of distributions were discussed by Ahsanullah and Ahmed [21]. The estimation of the lognormal mean by making use of uncertain prior information was also discussed by Ahmed and Tomkins [37]. The Bayesian predictive model from the Weibull life model, by means of a conjugate prior for the scale parameter and a uniform prior for the shape parameter, has been discussed at length by Khan et al. [25]. Following that approach, a prior density is assigned to the scale parameter and, with reference to Khan et al. [25], the shape parameter is given a uniform prior over a bounded interval with lower limit 0.

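Thus, the joint prior density is the product of the prior densities of the scale and shape parameters. Considering the prior density in (19), the posterior density of the shape and scale parameters is proportional to the product of the likelihood function and the joint prior density, with a normalizing constant that makes the posterior density integrate to one.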

4.1. Predictive Density for a Single Future Response

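Let $z$ be a single future response from the model specified by (1), where $z$ is independent of the observed data. Then, the predictive density for a single future response ($z$) given the observed data $\mathbf{x}$ is

$$p(z \mid \mathbf{x}) = \iint f(z \mid \alpha, \lambda)\, p(\alpha, \lambda \mid \mathbf{x})\, d\alpha\, d\lambda,$$

where $f(z \mid \alpha, \lambda)$ may be defined from model (1) with shape $\alpha$ and scale $\lambda$, and $p(\alpha, \lambda \mid \mathbf{x})$ is the posterior density. Thus, the predictive density for a single future response is obtained by integrating the product of the EE density and the posterior density over the shape and scale parameters, and it involves a normalizing constant.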
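When posterior draws of the parameters are available, for example from an MCMC run, the double integral in the predictive density can be approximated by averaging the EE density over the draws. The sketch below illustrates this Monte Carlo approximation. It is an alternative to evaluating the closed-form expression in a computer algebra system, and the three posterior draws listed are hypothetical placeholders, not results from this paper.

```python
import math

def ee_pdf(z, alpha, lam):
    """EE density with shape alpha and scale lam."""
    if z <= 0:
        return 0.0
    base = -math.expm1(-lam * z)  # 1 - exp(-lam * z)
    return alpha * lam * base ** (alpha - 1) * math.exp(-lam * z)

def predictive_density(z, draws):
    """Monte Carlo estimate of p(z | x): the mean of f(z | alpha, lam) over posterior draws."""
    return sum(ee_pdf(z, a, l) for a, l in draws) / len(draws)

# Hypothetical posterior draws (alpha, lam), for illustration only.
draws = [(1.8, 0.019), (2.0, 0.020), (2.2, 0.021)]

for z in (12.0, 60.0, 120.0, 240.0):
    print(z, predictive_density(z, draws))
```

Because each term in the average is itself a proper density, the estimate is a proper density for any set of draws.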

Figure 7 shows the graphical representation of the predictive density based on the Black Hispanic female breast cancer patients’ survival days. It is noted that the predictive density is right-skewed.

The summary results of the Black Hispanic female predictive means, standard errors, and predictive intervals for future survival days are given in Table 8. The predictive shape characteristics, raw moments, corrected moments, and measures of skewness and kurtosis are also presented in Table 8. These findings are very important for health care researchers to characterize future disease patterns and to make effective future plans for disease prevention strategies.
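The shape characteristics listed above follow from the raw moments $E[Z^r]$ of the predictive density: the corrected (central) moments are obtained from the raw moments, and skewness and kurtosis are the standardized third and fourth central moments. The sketch below computes them by numerical integration. The function names are hypothetical, and an exponential density (for which skewness is 2 and kurtosis is 9) stands in for the predictive density as a check.

```python
import math

def raw_moments(density, upper, step, orders=(1, 2, 3, 4)):
    """Trapezoidal approximation of E[Z^r] on [0, upper] for each order r."""
    n = int(round(upper / step))
    moments = []
    for r in orders:
        total = 0.0
        for i in range(n + 1):
            z = i * step
            weight = 0.5 if i in (0, n) else 1.0
            total += weight * (z ** r) * density(z)
        moments.append(total * step)
    return moments

def shape_summary(m1, m2, m3, m4):
    """Mean, SD, skewness, and kurtosis from the first four raw moments."""
    var = m2 - m1 ** 2
    mu3 = m3 - 3 * m1 * m2 + 2 * m1 ** 3                       # third central moment
    mu4 = m4 - 4 * m1 * m3 + 6 * m1 ** 2 * m2 - 3 * m1 ** 4    # fourth central moment
    return {
        "mean": m1,
        "sd": math.sqrt(var),
        "skewness": mu3 / var ** 1.5,
        "kurtosis": mu4 / var ** 2,
    }

# Stand-in density for checking: exponential with rate 0.02.
summary = shape_summary(*raw_moments(lambda z: 0.02 * math.exp(-0.02 * z), 3000.0, 0.25))
print(summary)
```

A positive skewness value corresponds to the right-skewed shape noted for the predictive density of future survival days.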

5. Results and Discussion

The mean ± SD of age at diagnosis for the Black Hispanic group is reported in Table 2. The minimum age at diagnosis for Black Hispanics was 24 years. The mean ± SD of survival time (months) for Black Hispanic females was 71.38 ± 61.33. The majority of these patients were married.

The EE model is shown to be a better fit than the other models for the Black Hispanic survival data. The lowest DIC value for Black Hispanics is 3136.732. In the case of the EE model, the posterior mean ± SD of the shape and scale parameters and the corresponding rho ($\rho$) values are reported in Table 4.

We used the Bayesian method to obtain inference for the posterior parameters given the breast cancer survival model. Tables 4–7 summarize the inferences for the posterior parameters, obtained with small Monte Carlo errors, for Black Hispanic females. Figures 3–6 report the dynamic kernel densities for each of the parameters for Black Hispanic females. This helps us to observe the shapes of the kernel densities.

The graphical representation of the predictive density of future survival times for Black Hispanic females is shown in Figure 7. It should be noted that future survival times for Hispanic Black females show a positively skewed distribution. Table 8 summarizes the predictive raw and corrected moments, predictive skewness and kurtosis, and predictive intervals for a future response for Black Hispanic female future survival times.

6. Conclusions

Four statistical probability models were fitted to the Black Hispanic females’ cancer survival data. The exponentiated exponential model was found to be the best fitting model for these data compared to the other widely used models.

The results of the predictive inference under the fitted model were obtained and it was noticed that the shape of the future survival model for Black Hispanic is positively skewed. Given the patient’s current and past history of reported conditions, these models help the healthcare providers and researchers to predict a patient’s future survival outcomes. Thus a combination of current knowledge and future predictions can be used to enhance and improve the rationales for better utilization of current facilities and planned allocation of future resources.

Descriptive statistics, including basic summary statistics for the breast cancer survival times of the Black Hispanic subset, were obtained using the SPSS software version 19.0 [39]. The geographic maps of the nine states randomly selected out of the twelve states were derived using the “Google fusion table” [38]. To show the graphical representation of the predictive density for a single future response for Black Hispanic women, we used the advanced computational software package “Mathematica version 8.0” [40]. We used the same software to derive additional predictive inferences about survival times for this ethnicity. We used the WinBUGS software to compute the goodness of fit measures, to derive the summary results of the posterior parameters, to determine the kernel densities of the parameters, and to carry out all related calculations.

Disclosure

All authors have completed the CITI course in the protection of human research subjects that was required in order to request the SEER’s data from the National Cancer Institute in the United States.

Conflict of Interests

The authors have declared that there is no conflict of interests.

Acknowledgments

The authors would like to thank the editor and the referees for their valuable comments and suggestions.