ISRN Epidemiology
Volume 2013 (2013), Article ID 750857, 6 pages
Research Article

Developing a Weibull Model Extension to Estimate Cancer Latency

Diana L. Nadler and Igor G. Zurbenko

School of Public Health, University at Albany, Rensselaer, NY 12144, USA

Received 20 October 2012; Accepted 6 November 2012

Academic Editors: J. M. Ramon and R. Zhao

Copyright © 2013 Diana L. Nadler and Igor G. Zurbenko. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


The mathematical model discussed in this paper presents a technique to estimate the length of the cancer’s silent growth period. The methodology described utilizes information obtained from observed cancer incidence to reconstruct what is cautiously believed to be the period of time from malignant cancer initiation to diagnosis. Analyses show a decreasing hazard for cancer indicating that the longer a patient survives, the more likely they are to reach the upper limit of their natural lifespan. Based on previous research, the Weibull distribution has been used to describe the mechanisms of cancer development. In contrast to the memoryless exponential distribution which assumes a constant failure rate, the shape of the Weibull distribution is dependent on past events and preserves a memory of prior survival. This provides a simple but powerful way to characterize how the unobserved experience of cancer relates to the observed as a function to estimate the time between onset and diagnosis. The results indicate a window of opportunity for early intervention when cancer is most treatable. The method presented provides useful information to identify cancers with high mortality and prolonged periods of undetected growth to distinguish types of dire public health concern.

1. Introduction

Survival analysis statistics in cancer research are often reported in terms of individual survival from the time of diagnosis. When utilizing cancer registry data, the true time in which malignant cancer cells developed in the body is unknown because there is often no indication. The telltale signs and symptoms characteristic of cancer could be months, if not years, away. Causal factors may act in sequence to initiate or promote carcinogenesis, and ten or more years often pass between exposure to external factors and detectable cancer [1]. More than one-third of all Americans will be diagnosed with cancer sometime in their lives. Though their illness may be invisible now, it presents a great, and largely unexamined, opportunity to find and treat their cancers early [2]. Early detection represents one of the most promising approaches to reduce the growing cancer burden by identifying cancer while it is localized and curable, preventing not only mortality, but also reducing morbidity and costs [3].

The two-parameter Weibull distribution is a popular lifetime model frequently used in survival analysis in the biomedical sciences to describe age-specific mortality and failure rates [4]. Because the Weibull distribution makes no assumptions about the form of the underlying hazard distribution and is flexible enough to model both increasing and decreasing hazard functions, it has been used successfully in many applications as a purely empirical model, even in cases when there is little or no theoretical justification [5]. Research performed by Kravchenko et al. [6] and Manton et al. [7] utilized a five-parameter version of the frailty model with a Weibull baseline to characterize carcinogenic mechanisms, including the lag period (i.e., the period between the occurrence of the first malignant cancer cell and the date of cancer onset) for selected cancer histotypes. Mdzinarishvili and Sherman [8] used the Armitage-Doll model and concluded that cancer incidence rate data are consistent with a Weibull model of carcinogenesis adjusted for the age of initial exposure.

In this paper, we describe a methodology that utilizes the popular two-parameter Weibull model as its framework and develop a conditional Weibull survival model that accounts for the assumption that the individual survived up to the time of diagnosis. Using simple linear regression methods, we utilize information obtained from observed incidence data to estimate the length of the cancer latency period. When the hazard rate changes over time, the probability of failure is dependent on time, and the Weibull distribution allows for a “memory” of previous survival time for an observation [9]. The Weibull model extension provides information-driven and population-level inferences about cancer latency times to help develop effective and practical screening guidelines and identify areas for improvement.

2. Methods

2.1. Introduction to Survival Analysis

The statistical analysis of lifetime data is an important topic in many areas, including the biomedical, engineering, and social sciences [10]. Survival analysis generally involves the modeling of time-to-event data where the outcome is the time until failure from some disease or condition. Subjects entering the study at different times have varying lengths of followup for the observed failure time. A distinguishing feature of survival analysis is that it successfully incorporates information from censored and truncated, or incomplete, observations, which makes it the most practical method for this type of analysis.

2.2. Estimating the Survival Function

The concepts of survival and hazard are essential to understanding survival analysis. The survival curve expresses the cumulative effect of the risks faced by an individual, and the hazard function characterizes the rate of change of the survival function over time. This indicates that where survival is quickly decreasing, hazard is high; if the survival curve is constant, the hazard is zero [11].

Let us suppose that $T$ is a nonnegative continuous random variable that represents the lifetimes of individuals in some population. Survival time can be expressed as the length of time from cancer initiation until cancer-specific death, when death occurs. It is also assumed that survival times are independent and that the censoring mechanism is uninformative. The survival function, $S(t)$, is used to estimate the probability of surviving beyond time $t$, as follows:

$$S(t) = P(T > t). \tag{1}$$

The survivor function is a monotone decreasing continuous function where $S(0) = 1$ and $S(\infty) = \lim_{t \to \infty} S(t) = 0$ [10]. That is, all subjects are assumed to be alive at the start of the study, and as time approaches $\infty$ the probability of survival is 0, since eventually all persons must succumb to death.

The hazard function, denoted by $h(t)$, provides the instantaneous risk of dying at time $t$, given that the event has not yet occurred, and can be defined as

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}. \tag{2}$$

The hazard function describes how the risk of failure varies with time and provides a useful tool for understanding the underlying distribution of the survival times [12]. For example, if $T$ denotes the time from cancer initiation until death from cancer and the corresponding hazard function, $h(t)$, decreases over time, the conditional probability of dying from cancer decreases each month the patient survives, given survival up to the time of interest.

To estimate the survival function, the Kaplan-Meier product-limit estimator method was used. This method is a nonparametric maximum likelihood estimator of the survival function used to estimate survival probabilities as a function of time. This method is favorable since it makes no assumption about the underlying distribution of the survival times and has become the most commonly used approach to survival analysis in medicine [13, 14]. This method works by estimating the survival probability at each interval using the number of patients that survived, divided by the number of patients at risk. At each interval, patients are considered to be “at risk” if they have not yet experienced the event. Patients lost to followup, or right censored, are excluded from the “at risk” pool. Finally, the probability of surviving to any point in time is estimated using the cumulative probability of surviving each of the preceding time intervals.

We assume a sample of $n$ independent observations with available survival times denoted by $t_1, t_2, \ldots, t_n$. Letting $t_{(1)} < t_{(2)} < \cdots < t_{(k)}$ be the ordered failure times, the Kaplan-Meier estimator can be defined as

$$\hat{S}(t) = \prod_{j:\, t_{(j)} \le t} \left(1 - \frac{d_j}{n_j}\right). \tag{3}$$

Here, $n_j$ represents the number of individuals at risk of dying just before time $t_{(j)}$, including those who will die at time $t_{(j)}$, and $d_j$ is the number of deaths observed at time $t_{(j)}$. At any specified time point, the observed probability of death is $d_j / n_j$ [15]. The nonparametric Kaplan-Meier estimates of the survivorship function were used to develop a log survival time model and a log-log survival time model. The parameter values for the linear regression models were used to approximate the latency period, which is further discussed in Section 2.4.
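The product-limit calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `kaplan_meier`, the variable names, and the follow-up times are hypothetical and are not taken from the SEER analysis. At each distinct failure time the running survival estimate is multiplied by one minus the ratio of deaths to the number still at risk.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S_hat(t))] at each distinct observed failure time.

    times  -- follow-up time of each subject
    events -- 1 if the failure was observed, 0 if the subject was right censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    survival = 1.0
    curve = []
    for t in sorted(deaths):
        # Number at risk just before t: everyone whose follow-up reaches t.
        at_risk = sum(1 for u in times if u >= t)
        # Multiply by the conditional probability of surviving past t.
        survival *= 1.0 - deaths[t] / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up times in months; an event flag of 0 marks censoring.
example_times = [2, 3, 3, 5, 8, 8, 12]
example_events = [1, 1, 0, 1, 1, 0, 1]
km_curve = kaplan_meier(example_times, example_events)
```

Censored subjects contribute to the at-risk count up to their censoring time but never to the death count, which is how the estimator uses incomplete observations.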

2.3. Weibull Distribution

The Weibull model is widely applied in survival analysis and has been shown to fit data involving the time to appearance of tumors or death in animals subjected to carcinogenic insults over time [15]. Pike [16] and Peto and Lee [17] gave a theoretical motivation for applying the Weibull model to such data.

As previously stated, we assume that observations are available on the independent failure times of $n$ individuals where $t_i$ represents the time until failure. Let $T$ be a Weibull random variable representing the failure time of an arbitrary individual. The probability distribution of $T$ can be described by the probability density function (pdf), $f(t)$, such that

$$f(t) = \lambda \alpha t^{\alpha - 1} \exp\left(-\lambda t^{\alpha}\right), \quad t > 0, \tag{4}$$

where $\alpha > 0$ is the shape parameter and $\lambda > 0$ is the scale parameter. The corresponding Weibull survival function is

$$S(t) = \exp\left(-\lambda t^{\alpha}\right). \tag{5}$$

Different values of the shape parameter, $\alpha$, can have a significant effect on the behavior of the Weibull distribution and even cause it to reduce to other distributions [10]. If $\alpha = 1$, the distribution reduces to the exponential distribution, which assumes constant hazard over a lifetime and is memoryless. The memoryless property indicates that a future event, measured from any instant in time, is expected to occur in time $1/\lambda$ regardless of when the last event occurred [18].

When $\alpha \neq 1$, the probability of survival at each successive $t$ is dependent on past survival, and the Weibull distribution retains a “memory” of prior survival times. Typical values vary depending upon the application; however, distributions with $\alpha$ in the range of 0.5 to 3 are appropriate [10]. In this analysis, we found $\alpha < 1$, indicating a decreasing hazard. This indicates that the shape of the distribution for the observed survival failure times is a function of the underlying distribution for the unobserved survival times, which permits us to estimate the length of the cancer latency period.
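To make the role of the shape parameter concrete, the following sketch evaluates the Weibull hazard for a shape below 1 and for a shape equal to 1. It assumes the parametrization in which the survival function is exp(-scale * t^shape), which is the form implied by the log-log linearity used in Section 2.5. The scale value 0.3 is purely hypothetical, and 0.57 is used only because it is the shape reported later for all-stage lung cancer.

```python
import math


def weibull_survival(t, shape, scale):
    """Survival function S(t) = exp(-scale * t**shape)."""
    return math.exp(-scale * t ** shape)


def weibull_hazard(t, shape, scale):
    """Hazard h(t) = f(t) / S(t) = scale * shape * t**(shape - 1)."""
    return scale * shape * t ** (shape - 1)


SCALE = 0.3  # hypothetical scale parameter, chosen only for illustration
years = [1, 5, 10, 20]

# A shape below 1 gives a hazard that falls as time passes.
decreasing = [weibull_hazard(t, 0.57, SCALE) for t in years]

# A shape of exactly 1 is the exponential case: the hazard never changes.
constant = [weibull_hazard(t, 1.0, SCALE) for t in years]
```

With the shape below 1 each successive value in `decreasing` is smaller than the last, whereas every entry of `constant` equals the scale parameter.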

2.4. Weibull Model Extension

Utilizing conditional probability theory and the popular two-parameter Weibull model as our framework, we developed a mathematical model to account for the assumption of individual survival up to the time of diagnosis. By introducing an additional lag parameter and utilizing the memory property of the Weibull distribution, we reconstruct what we cautiously believe to be the time between cancer initiation and diagnosis. Using information from observed incidence data available from cancer registries, our analysis showed that the Weibull shape parameter, $\alpha$, was strictly less than 1 for all cancers. Because the hazard function decreases over time, the distribution has a strong memory of prior survival times, which is a crucial factor in this analysis.

To illustrate the timeline of events, Figure 1 presents a diagram of the succession of cancer events. The first event to occur is the initiation of disease, the second event is the cancer diagnosis where the case is reported to the local cancer registry, and the third event is the time of death or the study endpoint. The latency period is defined as the time between cancer initiation and diagnosis, which we seek to estimate. The individual’s true lifetime can be represented by the length of time from cancer initiation until death.

Figure 1: Timeline of events which demonstrate the unobserved and observed periods of cancer beginning at the time of disease initiation.

Using the Weibull survival function in (5) as our starting point, the conditional probability of surviving beyond time $t$, given patient survival up to the time of diagnosis, can be represented by the function

$$S(t \mid L) = P(T > t + L \mid T > L) = \frac{S(t + L)}{S(L)} = \exp\left\{-\lambda\left[(t + L)^{\alpha} - L^{\alpha}\right]\right\}. \tag{6}$$

In this model, the length of the latency period is designated by the lag parameter, $L$; $\alpha$ is the shape parameter and $\lambda$ is the scale parameter. Each $t_i$ represents the time an individual was observed from the time of cancer diagnosis until cancer-specific death or the study endpoint.
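As a rough illustration of why the lag matters, the sketch below evaluates the conditional survival probability obtained by dividing the Weibull survival at the lag plus the observed time by the Weibull survival at the lag. The parametrization S(t) = exp(-scale * t^shape), the names `conditional_survival` and `lag`, and all numeric values are assumptions made for this example rather than quantities estimated in the paper.

```python
import math


def weibull_survival(t, shape, scale):
    """Unconditional survival S(t) = exp(-scale * t**shape)."""
    return math.exp(-scale * t ** shape)


def conditional_survival(t, lag, shape, scale):
    """P(T > lag + t | T > lag) for a Weibull lifetime T.

    Equals S(lag + t) / S(lag), written in closed form.
    """
    return math.exp(-scale * ((t + lag) ** shape - lag ** shape))


SCALE = 0.3  # hypothetical scale parameter
t = 5.0      # years observed after diagnosis

# With a shape below 1, a longer unobserved lag raises conditional survival.
short_lag = conditional_survival(t, 2.0, 0.57, SCALE)
long_lag = conditional_survival(t, 15.0, 0.57, SCALE)

# With a shape of 1 the lag drops out entirely (memoryless case).
memoryless = conditional_survival(t, 15.0, 1.0, SCALE)
```

Because the observed post-diagnosis survival depends on the lag whenever the shape differs from 1, the lag leaves a trace in registry data, which is the property the model extension relies on.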

2.5. Estimating Model Parameters

The Kaplan-Meier method was used to estimate the survival probabilities which were used as the outcome variable in our model. For this analysis, we use linear regression methods to estimate the Weibull model parameters because of their computational simplicity and ease of graphical interpretation [19–21]. The Weibull model has the key property that $\ln(-\ln S(t))$ is linear in $\ln t$, where the regression equation has slope $\alpha$ and intercept $\ln(\lambda)$ [22, 23].
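The log-log linearity described above can be checked with an ordinary least-squares fit. The sketch below is a minimal illustration, assuming the parametrization S(t) = exp(-scale * t^shape); the function name `fit_weibull_by_regression` is hypothetical, and the survival probabilities are generated synthetically from known parameters rather than taken from Kaplan-Meier estimates of registry data.

```python
import math


def fit_weibull_by_regression(times, survival_probs):
    """Fit ln(-ln S) = ln(scale) + shape * ln(t) by least squares.

    Returns (shape, scale). Every survival probability must lie strictly
    between 0 and 1 so that both logarithms are defined.
    """
    xs = [math.log(t) for t in times]
    ys = [math.log(-math.log(s)) for s in survival_probs]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    shape = sxy / sxx                         # slope of the fitted line
    scale = math.exp(y_bar - shape * x_bar)   # exp(intercept)
    return shape, scale


# Synthetic check: survival probabilities from a known Weibull curve.
TRUE_SHAPE, TRUE_SCALE = 0.57, 0.3  # hypothetical values
times = [0.5, 1, 2, 5, 10, 20]
probs = [math.exp(-TRUE_SCALE * t ** TRUE_SHAPE) for t in times]
shape_hat, scale_hat = fit_weibull_by_regression(times, probs)
```

On noise-free input the fit recovers the generating parameters up to rounding error. With real Kaplan-Meier estimates the points scatter around the line, and the slope is read as the shape estimate.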

Utilizing the methods outlined by Nadler and Zurbenko [24], the approximate likelihood value of the latency period, , can be estimated using the following formula: This function represents the time where the log-transformed survival estimate, , regressed on , equals the correction factor . To find the value of the shape parameter, $\alpha$, we plot the log negative log of the Kaplan-Meier estimates against the natural log of time and determine the slope of the regression equation

$$\ln\left(-\ln \hat{S}(t)\right) = \ln(\lambda) + \alpha \ln t. \tag{8}$$

To determine the model parameters, and , the log-transformed conditional Kaplan-Meier estimates were regressed on time with intercept, , and slope, , This approximation provides a simple and fast way to estimate the cancer latency period when the hazard rate changes over time.

2.6. Data

Monthly observations of newly diagnosed adult cancer cases in the United States were obtained for the period of 1973–2008 available through the Surveillance, Epidemiology and End Results (SEER) Program. SEER is a national registry for cancers that is commissioned by the National Cancer Institute which began maintaining records of patients with cancer in 1973 [25]. From this dataset, cancer site, date of diagnosis, summary stage, tumor sequence number, and vital status were used in the analysis.

The types of cancer chosen for this analysis were restricted to those with high mortality rates and limited availability of effective treatment options allowing the disease to follow its natural course, which minimizes potential biases. High mortality rates maximize the amount of information known to the researcher allowing more precise estimates. Overall, 6 in situ and invasive primary cancers were selected and analyzed with a total sample size of 556,696. These cancers include acute myeloid leukemia, brain, liver, lung and bronchus, pancreas, and stomach. Events were considered in cases where the cause of death was cancer-specific.

3. Results and Discussion

The conditional survival plot in Figure 2 suggests that the Weibull shape parameter, $\alpha$, is decreasing and then stabilizing over time for all cancer types, which verifies that the Weibull distribution allows for a memory of prior survival observations. The survival curves indicate that the risk of failure is decreasing over time, which can be attributed to weak individuals perishing quickly after diagnosis and stronger individuals surviving for long periods of time. Melanoma has the highest rate of survival with 80% of patients surviving 30 years after diagnosis. Breast cancer also has a relatively high 30-year survival rate with 58% of patients surviving 30 years after diagnosis. Lung cancer has a grim prognosis with 6.4% of patients surviving 30 years after diagnosis; pancreatic cancer is similar with 98% of patients dying within 30 years of diagnosis.

Figure 2: Observed conditional survival plots with Weibull shape parameter for melanoma, breast, lung, and pancreatic cancers.

Early diagnosis of cancers can occur from increased screening practices and can alter the natural course of disease. The collection of SEER data began in 1973, and the availability of cancer screenings and effective treatments for breast and melanoma cancers has increased dramatically in the last 20 years. In some cases, routine screenings can identify lesions in patients who otherwise may have never been diagnosed in their lifetime. These biases, known as lead-time bias and overdiagnosis bias, can interfere with our ability to generalize results from a sample to the population. In an attempt to avoid these potential biases, cancers with low death rates and known treatment courses (i.e., breast and melanoma) were excluded from this analysis.

In Figure 3, a graphical representation of the method is shown where the approximate estimate is used to determine the time between lung cancer initiation and diagnosis for all stages of the disease combined. The Weibull shape parameter for all-stage lung cancer was 0.57, and the correction factor was 0.735. By extending the linear regression equation to the point 0.735 on the $y$-axis, we estimate the latency period for lung cancer to be 13.6 years. The model parameter estimates were obtained using (7), (8), and (9). Overall, we found that the Weibull regression model fit the data remarkably well, with an average $R$-square value of 93.3%. The model residuals were randomly distributed about the regression line, suggesting no underlying trends.

Figure 3: Estimating the approximate time of lung cancer initiation using the Weibull model extension.

Applying the Weibull model extension to a subset of cancers in the SEER data, we determined the length of the latency periods and presented these estimates in Figure 4. Please note that these estimates are stratified by cancer type but include all stages combined. The model can be further stratified as necessary, as long as the sample sizes remain large. In Figure 4, acute myeloid leukemia has the longest estimated latency period of 25.75 years, stomach cancer has the second longest latency period of 22.86 years, and brain cancer has an estimated latency period of 21.87 years. Pancreatic, liver, and lung cancers have relatively short latency periods ranging from 8.59 years to 13.57 years. These cancers are often diagnosed in late stages when the prognosis is poor and the chance of long-term survival is bleak. Although these estimates may not be truly exact because they are a mathematical approximation, they provide a meaningful ranking of cancers with the longest periods of undetected growth.

Figure 4: Estimated interval between first cancer-related mutation and diagnosis obtained using the Weibull model extension.

A biological study published in Nature collected genetic materials from 7 patients who died of end-stage pancreatic cancer and determined the timing of carcinogenesis. Researchers found that it took 11.7 years, on average, for a mature pancreatic tumor to form after the appearance of the first cancer-related mutation in a pancreatic cell. Another 6.8 years passed, on average, before the primary tumor sent out a metastatic lesion to another organ. From that point, the patient died in 2.7 years, on average. In total, more than 20 years elapsed between the appearance of the first mutated pancreatic cell and death [1, 26–28]. The estimate obtained using the Weibull model extension indicates that 8.59 years passed, on average, from the time of cancer initiation to diagnosis for patients with all-stage pancreatic cancers combined.

As mentioned earlier, Manton et al. [7] utilized the five-parameter version of the frailty model with a Weibull baseline to investigate the relationship between the heterogeneity in age-related patterns of cancer incidence and the mechanisms of carcinogenesis. The estimates obtained for the “lag” period between the occurrence of the first malignant cancer cell and the date of cancer onset for selected cancer histotypes are shown in Table 1 and [7].

Table 1: Estimated lag determined by Manton et al. [7] for selected cancer histotypes.

Overall, our results are consistent with those obtained by Manton et al. [7]; however, exact comparison is not possible since the researchers provide histotype-specific estimates and the measurement periods may not be exact. Another factor making the comparison difficult is the “lag” referred to by Manton et al. [7] represents the period between the occurrence of the first malignant cancer cell and cancer onset. Our estimate reflects the period between cancer onset and diagnosis, which may or may not be equivalent.

For lung and bronchus cancers, our results fall within the estimates provided in Table 1. We estimate that 13.57 years passed from cancer initiation to diagnosis for all stages and histotypes of lung cancer combined. Results obtained by Manton et al. [7] range from a lag of years and years for lung cancer histotypes 804, 807, and 814. Manton et al. [7] estimate years passed on average between the occurrence of the first malignant cancer cell and cancer onset for pancreatic cancer. Our results for pancreatic cancer fall in line with those of Manton et al. [7] and suggest that 8.59 years passed between cancer initiation and diagnosis. The latency estimate for liver cancer indicates that 10.81 years elapsed from cancer initiation to diagnosis which falls slightly beyond the range proposed by Manton et al. [7] of years. This difference may be due to the comparison of all histotypes to a specific histotype as well as any differences in the measurement of carcinogenic periods as referenced previously.

In this paper, a new algorithm is presented that uses survival information obtained strictly after disease diagnosis to estimate what we cautiously believe to be the time between cancer onset and diagnosis. The ability to “retrace” the progression of prior survival histories depends on the hazard increasing or decreasing over time. Overall, our quantitative analysis indicates that there is a large window of opportunity for diagnosis while the disease is still in the curative stage. Although the Weibull model extension may not provide exact estimates because it is an approximate solution, it allows the medical community to rank cancer types by risk and to distinguish the “silent killers” with long undetected periods of growth and a high risk of death. By making this information available, we present a multitude of opportunities for new research on early detection and preventative screening, improving the prognosis of the disease. The main advantages of the conditional Weibull model are its simplicity, since it utilizes only simple linear regression methods, and its ability to permit further research of medical issues through mathematical modeling.


  1. American Cancer Society, Cancer Facts & Figures, American Cancer Society, Atlanta, Ga, USA, 2011.
  2. T. Goetz, “Why Early Detection Is the Best Way to Beat Cancer,” WIRED MAGAZINE: 17.01, 2012.
  3. R. Etzioni, N. Urban, S. Ramsey et al., “The case for early detection,” Nature Reviews Cancer, vol. 3, no. 4, pp. 243–252, 2003.
  4. F. Diamond, “What if?” Managed Care, vol. 16, no. 4, 2007.
  5. M. Paterno, “On Random-Number Distributions for C++0x.”
  6. J. Kravchenko, I. Akushevich, V. L. Seewaldt, A. P. Abernethy, and H. K. Lyerly, “Breast cancer as heterogeneous disease: contributing factors and carcinogenesis mechanisms,” Breast Cancer Research and Treatment, vol. 128, no. 2, pp. 483–493, 2011.
  7. K. G. Manton, I. Akushevich, and J. Kravchenko, Cancer Mortality and Morbidity Patterns in the U.S. Population, Springer Science+Business Media, New York, NY, USA, 2009.
  8. T. Mdzinarishvili and S. Sherman, “Weibull-like model of cancer development in aging,” Cancer Informatics, vol. 9, pp. 179–188, 2010.
  9. T. Bergemann, “Extended Survival Methods,” 2008.
  10. J. F. Lawless, Statistical Models and Methods for Lifetime Data, John Wiley & Sons, Hoboken, NJ, USA, 2003.
  11. K. Bull and D. J. Spiegelhalter, “Tutorial in biostatistics: survival analysis in observational studies,” Statistics in Medicine, vol. 16, pp. 1041–1074, 1997.
  12. D. W. Hosmer, S. Lemeshow, and S. May, Applied Survival Analysis: Regression Modeling of Time to Event Data, John Wiley & Sons, Hoboken, NJ, USA, 2008.
  13. D. W. Hosmer and S. Lemeshow, Applied Survival Analysis: Regression Modeling of Time to Event Data, Wiley Series in Probability and Statistics, New York, NY, USA, 1999.
  14. J. F. Jekel, Epidemiology, Biostatistics, and Preventive Medicine, Elsevier, Philadelphia, Pa, USA, 2007.
  15. E. L. Kaplan and P. Meier, “Nonparametric estimation from incomplete observations,” Journal of the American Statistical Association, vol. 53, no. 282, pp. 457–481, 1958.
  16. M. C. Pike, “A method of analysis of a certain class of experiments in carcinogenesis,” Biometrics, vol. 22, no. 1, pp. 142–161, 1966.
  17. R. Peto and P. Lee, “Weibull distributions for continuous carcinogenesis experiments,” Biometrics, vol. 29, no. 3, pp. 457–470, 1973.
  18. R. E. Giachetti, Design of Enterprise Systems: Theory, Architecture, and Methods, CRC, Boca Raton, Fla, USA, 2010.
  19. D. N. P. Murthy, M. Bulmer, and J. A. Eccleston, “Weibull model selection for reliability modelling,” Reliability Engineering and System Safety, vol. 86, no. 3, pp. 257–267, 2004.
  20. L. F. Zhang, M. Xie, and L. C. Tang, “A study of two estimation approaches for parameters of Weibull distribution based on WPP,” Reliability Engineering and System Safety, vol. 92, no. 3, pp. 360–368, 2007.
  21. L. F. Zhang, M. Xie, and L. C. Tang, “Bias correction for the least squares estimator of Weibull shape parameter with complete and censored data,” Reliability Engineering and System Safety, vol. 91, no. 8, pp. 930–939, 2006.
  22. D. G. Kleinbaum and M. Klein, Survival Analysis: A Self-Learning Text, Springer Science+Business Media, New York, NY, USA, 2nd edition, 2005.
  23. G. Rodriguez, “Lecture Notes on Generalized Linear Models,” 2007.
  24. D. L. Nadler and I. G. Zurbenko, “Model prediction of the length of Cancer prior to diagnosis with application to Cancer registry data,” in JSM Proceedings, Biometrics Section, pp. 5404–5414, American Statistical Association, Miami Beach, Fla, USA, 2010.
  25. Surveillance, Epidemiology, and End Results (SEER) Program Research Data (1973–2008), National Cancer Institute, DCCPS, Surveillance Research Program, Cancer Statistics Branch, 2011.
  26. S. Yachida, S. Jones, I. Bozic et al., “Distant metastasis occurs late during the genetic evolution of pancreatic cancer,” Nature, vol. 467, no. 7319, pp. 1114–1117, 2010.
  27. “Pancreatic Cancer Grows Over 20 Years,” from PressTV, 2011.
  28. R. Parker, “Pancreatic Cancer Develops For 20 Years Before Killing,” 2010.