Special Issue: Robust Statistical Modeling and Machine Learning with Applications in Data Science
Influence Diagnostics in Log-Normal Regression Model with Censored Data
When dealing with biological data, skewed distributions are often modeled with the log-normal regression model (LNRM). Traditional estimation techniques for the LNRM are sensitive to unusual observations, which can strongly affect the analysis and lead to imprecise conclusions. To overcome this issue, we propose diagnostic measures based on local influence to identify such observations in the LNRM under censoring. The proposed measures are derived by perturbing the case weights, the response, and the explanatory variables. Furthermore, we also consider the One-Step Newton-Raphson method and the generalized Cook's distance. A Monte Carlo simulation study and an application to real data illustrate the developed approaches.
1. Introduction
Survival analysis has seen a great surge of research interest over the last three decades [1, 2]. It is usually used to analyze the time until an event of interest occurs within a specified period. The accelerated failure time (AFT) model is very common because it models the failure time directly, rather than the hazard as in the proportional hazards model, and it is therefore an important alternative to that model [3, 4]. The AFT model makes modeling simple because it relates the logarithm of the failure time linearly to the covariates [5, 6].
The log-normal distribution (LND) has wide application in biology [7], hydrology [8], and the social sciences [9]. It has also been used to fit different kinds of cancer survival data [10], as is also evident from [11–14] and the references cited therein. Very recently, the LND has also been applied to COVID-19 [15] and to financial decisions [16]. Sweet [17] considered the hazard rate of the LND.
The detection of unusual observations is difficult, and the impact of their presence on various aspects of the linear statistical model is well known [18]. Various diagnostic methods have been suggested for data sets that contain influential observations, and they are common in practice. Among these methods, case omission has received special attention. Cook [19] introduced case omission, and a large number of articles on the subject have been published since then. Zhu et al. [20–22] provided extended versions of case omission for several statistical models. Recently, the case omission approach has been extended to a censored log-linear regression model. Different versions of the case omission technique are also available for the Student-t censored regression model [23]. Apart from that paper, no attention has been given to the impact of influential observations on the estimates of the LNRM.
Cook [24] provided an approach known as curvature diagnostics. Escobar and Meeker [25] described the application of local influence methods to detect data or model perturbations that have significant effects on maximum likelihood estimates under censoring. Weissfeld and Schneider [26, 27] discussed local influence (LI) for Weibull models and normal linear models. Leiva et al. [28] developed LI methods for regression models whose error follows the log-Birnbaum-Saunders distribution, and Ortega et al. [29] investigated the exponentiated Weibull regression model. Liu [30] investigated LI in elliptical linear models, Ortega et al. [31] adopted the LI technique for generalized log-gamma regression models, and Venezuela et al. [32] discussed LI diagnostics for estimating equations. Lachos et al. [33] used the LI approach in Grubbs's model, and Paula et al. [34] investigated the LI technique in linear models with first-order autoregressive elliptical errors. Russo et al. [35] employed the LI technique in nonlinear mixed-effects elliptical models. Vanegas et al. [36] examined diagnostic tools for generalized Weibull linear regression models. Very recently, Khaleeq et al. [37] applied influence diagnostic techniques to the log-logistic regression model under censoring.
In this paper, curvature diagnostics are developed for the LNRM based on the LI and case omission approaches when censoring is present in the data. The diagnostic performance of the developed approaches is also compared with that of the available measures.
The structure of this paper is as follows. Section 2 reviews the LND. Section 3 presents the formulation of the LNRM under censoring. Section 4 presents the diagnostic techniques and the derived measures. Section 5 uses a Monte Carlo simulation and a real-data example to assess the empirical performance of the derived techniques. The last section presents the findings and conclusions.
2. Log-Normal Distribution
The LND is also known as the Galton distribution, named after Francis Galton (1879), a statistician of the English Victorian era. Interest lies in the occurrence of a particular event within a given period of time. The response is a nonnegative random variable $T$ that gives the survival time of an object or an individual, with probability density function
$$f(t;\mu,\sigma)=\frac{1}{t\sigma\sqrt{2\pi}}\exp\left\{-\frac{(\log t-\mu)^{2}}{2\sigma^{2}}\right\},\quad t>0, \tag{1}$$
with parameters $\mu$ and $\sigma$, where $\mu$ is the scale parameter and $\sigma$ is the shape parameter. The survival function corresponding to the random variable $T$ with the LND density is given by
$$S(t)=1-\Phi\left(\frac{\log t-\mu}{\sigma}\right), \tag{2}$$
where $\Phi(\cdot)$ denotes the standard normal distribution function.
This family of distributions is suitable when the hazard rate first increases and then decreases, that is, when the failure rate is hump-shaped and nonmonotonic.
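The hump shape of the hazard can be checked numerically. The following is a minimal sketch, assuming the standard log-normal density and survival function with parameters `mu` and `sigma`; the function names and the time grid are illustrative choices, not part of the original study.

```python
import math

def lognormal_survival(t, mu, sigma):
    """S(t) = 1 - Phi((log t - mu) / sigma) for t > 0."""
    z = (math.log(t) - mu) / sigma
    # 1 - Phi(z) written with the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def lognormal_pdf(t, mu, sigma):
    """Log-normal density of the survival time t > 0."""
    z = (math.log(t) - mu) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))

def lognormal_hazard(t, mu, sigma):
    """Hazard rate h(t) = f(t) / S(t)."""
    return lognormal_pdf(t, mu, sigma) / lognormal_survival(t, mu, sigma)

# Illustrative grid (an assumption): the hazard rises and then falls.
times = [0.1, 0.5, 1.0, 2.0, 5.0, 20.0]
hazards = [lognormal_hazard(t, 0.0, 1.0) for t in times]
for t, h in zip(times, hazards):
    print(f"t = {t:5.1f}  hazard = {h:.4f}")
```

On this grid the largest hazard occurs at an interior time point, which is the nonmonotonic behavior described above.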
3. The LNRM for Censored Data
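Let $x=(x_{1},\ldots,x_{p})^{T}$ be the covariate vector associated with the response $T$ through a regression model. Considering the transformation $Y=\log(T)$, it follows that the density function of $Y$ can be written as
$$f(y;\mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left\{-\frac{(y-\mu)^{2}}{2\sigma^{2}}\right\},\quad -\infty<y<\infty, \tag{3}$$
where $-\infty<\mu<\infty$ and $\sigma>0$. Using (3), the AFT model is given as
$$Y=\mu+\sigma Z, \tag{4}$$
where the variable $Z$ follows the density
$$f(z)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{z^{2}}{2}\right),\quad -\infty<z<\infty, \tag{5}$$
with survival function given by
$$S(z)=1-\Phi(z), \tag{6}$$
where $\Phi(\cdot)$ is the standard normal distribution function.
Now consider the regression model based on the LND in (4). With responses $y_{i}$ and covariate vectors $x_{i}$, the model can be represented as
$$y_{i}=x_{i}^{T}\beta+\sigma z_{i},\quad i=1,\ldots,n, \tag{7}$$
where $\beta=(\beta_{1},\ldots,\beta_{p})^{T}$, $\sigma>0$, $x_{i}=(x_{i1},\ldots,x_{ip})^{T}$ is the vector of explanatory variables, and $z_{i}$ follows the distribution in (6).
Moreover, for a sample of $n$ observations corresponding to (3), $y_{i}$ denotes the logarithm of the survival or censoring time and $x_{i}$ denotes the covariate vector of the $i$th individual. The log-likelihood function for $\theta=(\beta^{T},\sigma)^{T}$ can be written as
$$\ell(\theta)=\sum_{i\in F}\left[-\log\sigma-\tfrac{1}{2}\log(2\pi)-\tfrac{1}{2}z_{i}^{2}\right]+\sum_{i\in C}\log\left[1-\Phi(z_{i})\right],$$
where $z_{i}=(y_{i}-x_{i}^{T}\beta)/\sigma$, and $F$ and $C$ denote the sets of uncensored and censored observations, respectively. The maximum likelihood estimate (MLE) $\hat{\theta}$ of the parameter vector can be obtained by the Newton-Raphson (NR) approach. The covariance of the MLEs can be obtained from the Hessian matrix. The asymptotic covariance matrix is given by $I(\theta)^{-1}$ with $I(\theta)=-E[\ddot{L}(\theta)]$ such that $\ddot{L}(\theta)=\partial^{2}\ell(\theta)/\partial\theta\,\partial\theta^{T}$.
Because of the censoring mechanism, the Fisher information matrix cannot be determined; in its place, the matrix of second derivatives of the log-likelihood is used, which is given as
$$\ddot{L}(\theta)=\begin{pmatrix}\ddot{L}_{\beta\beta}&\ddot{L}_{\beta\sigma}\\ \ddot{L}_{\sigma\beta}&\ddot{L}_{\sigma\sigma}\end{pmatrix},$$
with the submatrices given in the Appendix.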
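To make the censored log-likelihood concrete, the sketch below evaluates it for a small data set. It assumes the usual form for a normal error on the log scale: uncensored cases contribute the log density of $y_i$ and right-censored cases contribute $\log[1-\Phi(z_i)]$. The function name, the argument layout, and the toy data are hypothetical.

```python
import math

def std_normal_sf(z):
    """Standard normal survival function 1 - Phi(z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def censored_loglik(beta, sigma, y, X, status):
    """Log-likelihood of the log-normal regression model with right censoring.

    y      : log survival or log censoring times
    X      : list of covariate rows (include a leading 1.0 for an intercept)
    status : 1 = lifetime observed, 0 = censored
    """
    total = 0.0
    for yi, xi, di in zip(y, X, status):
        z = (yi - sum(b * x for b, x in zip(beta, xi))) / sigma
        if di == 1:
            # log density of y_i for an uncensored case
            total += -math.log(sigma) - 0.5 * math.log(2.0 * math.pi) - 0.5 * z * z
        else:
            # log survival probability for a right-censored case
            total += math.log(std_normal_sf(z))
    return total

# Toy data (assumed values, for illustration only)
y = [0.2, 1.1, 0.7, 1.9]
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 1.0], [1.0, 2.0]]
status = [1, 1, 0, 1]
print(censored_loglik([0.0, 0.8], 0.5, y, X, status))
```

In practice this function would be maximized over `beta` and `sigma`, for example by Newton-Raphson iterations as described above.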
4. Diagnostic Analysis
Influence diagnostic methods evaluate the sensitivity of the parameter estimates of a particular model to perturbations either in the data set or in the essential assumptions of the model. In this section, a few popular diagnostic approaches are discussed.
4.1. Global Influence Diagnostics
The case omission approach is also known as global influence diagnostics. It is a popular technique for assessing the effect on the estimation process of removing an observation from the data set. The common methods are given as follows.
4.1.1. Generalized Cook Distance
The estimation of the coefficients is of major importance in regression modeling. The Cook distance measures the effect of omitting a case on the estimated coefficients [24, 38]. For the LNRM, the generalized Cook distance ($GD_i$) is a standardized norm of $\hat{\theta}_{(i)}-\hat{\theta}$, the change in the MLE of the parameter vector $\theta$ when the $i$th case is omitted, and it is used to recognize global influence. Its definition also involves the number of regression coefficients.
Remark 1. A larger value of $GD_i$ indicates a significant influence of the $i$th observation on the MLEs of $\theta$. Zhu et al. gave an approximate value for the average of $GD_i$, from which a cut-off point for $GD_i$, $i=1,\ldots,n$, is obtained.
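As an illustration of the quadratic form behind such a measure, the sketch below computes a distance of the commonly used form $(\hat{\theta}_{(i)}-\hat{\theta})^{T}\{-\ddot{L}(\hat{\theta})\}(\hat{\theta}_{(i)}-\hat{\theta})$. This form is an assumption: any additional scaling, for example by the number of coefficients, is not applied here, and the function name is hypothetical.

```python
def generalized_cook_distance(theta_full, theta_deleted, neg_hessian):
    """Quadratic-form distance between the full-data and case-deleted estimates.

    theta_full    : estimates from the complete data
    theta_deleted : estimates with one case omitted
    neg_hessian   : observed information matrix (minus the Hessian), as nested lists
    """
    diff = [d - f for d, f in zip(theta_deleted, theta_full)]
    k = len(diff)
    return sum(diff[r] * neg_hessian[r][c] * diff[c]
               for r in range(k) for c in range(k))

# Assumed numbers for illustration: only the first parameter moves by 0.5.
gd = generalized_cook_distance([1.0, 2.0], [1.5, 2.0], [[4.0, 0.0], [0.0, 1.0]])
print(gd)
```

Repeating the computation for each omitted case and plotting the values against the case index gives the usual index plot for global influence.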
4.1.2. One-Step Newton-Raphson Method
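The influence of a single observation on the estimates can be ascertained by the deletion approach. In this approach, the $i$th observation is removed from the data, and the parameter estimates are recalculated [38, 39]. For the LNRM, consider a single iteration, that is, the one-step NR approximation to the estimate with the $i$th observation eliminated, which is
$$\hat{\theta}_{(i)}^{1}=\hat{\theta}+\{-\ddot{L}_{(i)}(\hat{\theta})\}^{-1}\dot{L}_{(i)}(\hat{\theta}),$$
where $\dot{L}_{(i)}$ and $\ddot{L}_{(i)}$ are the score vector and the Hessian matrix computed with the $i$th point eliminated; the elements of these quantities are presented in the Appendix.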
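The one-step idea can be illustrated in the simplest possible setting. The sketch below assumes an uncensored, intercept-only normal model with known $\sigma$, which is a deliberate simplification of the LNRM: there the log-likelihood is quadratic, so a single Newton-Raphson step from the full-data estimate reproduces the case-deleted estimate exactly. In the censored LNRM the same step is only an approximation. The function name is hypothetical.

```python
def one_step_deleted_mean(y, i, sigma=1.0):
    """One Newton-Raphson step from the full-data mean toward the estimate
    obtained when observation i is deleted (normal model, known sigma)."""
    n = len(y)
    mu_hat = sum(y) / n
    # Score and observed information of the log-likelihood without case i,
    # both evaluated at the full-data estimate mu_hat.
    score = sum(y[j] - mu_hat for j in range(n) if j != i) / sigma ** 2
    neg_hessian = (n - 1) / sigma ** 2
    return mu_hat + score / neg_hessian

y = [1.0, 2.0, 3.0, 10.0]          # the last value is an assumed outlier
print(one_step_deleted_mean(y, 3))  # equals the mean of the first three values
```

Comparing the one-step estimate with the full-data estimate for every case gives a deletion diagnostic without refitting the model n times.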
4.2. Local Influence
The LI approach is based on differential geometry rather than on case deletion. It compares the estimates before and after a small perturbation of the model or the data. Different perturbation schemes exist for assessing LI.
4.2.1. The Local Influence Diagnostics
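For LI diagnostics, let $\ell(\theta)$ be the postulated log-likelihood with parameter vector $\theta$, and let $\omega$ be a vector of perturbations restricted to some open subset $\Omega$. The log-likelihood function under perturbation is denoted by $\ell(\theta\mid\omega)$, where $\omega_{0}$ is the no-perturbation vector. Note that $\ell(\theta\mid\omega_{0})=\ell(\theta)$. Cook [24] assessed the effect of a given perturbation through the likelihood displacement, by studying the behavior of its surface. We are concerned with the detection of potentially unusual (influential) observations. The likelihood displacement, that is, $LD(\omega)=2\{\ell(\hat{\theta})-\ell(\hat{\theta}_{\omega})\}$, measures the distance between $\hat{\theta}$ and $\hat{\theta}_{\omega}$, where $\hat{\theta}_{\omega}$ is the MLE under the perturbed model. The normal curvature for $\theta$ in the direction $d$, with $\|d\|=1$, is given by $C_{d}(\theta)=2\,|d^{T}\Delta^{T}\ddot{L}(\theta)^{-1}\Delta d|$, where $-\ddot{L}(\theta)$ is the observed Fisher information matrix and $\Delta$ is a $\dim(\theta)\times n$ matrix with elements $\Delta_{ji}=\partial^{2}\ell(\theta\mid\omega)/\partial\theta_{j}\,\partial\omega_{i}$ evaluated at $\theta=\hat{\theta}$ and $\omega=\omega_{0}$, where $j$ indexes the parameters and $i=1,\ldots,n$ indexes the perturbations. The direction of largest curvature, $d_{\max}$, is the eigenvector of the matrix $B=-\Delta^{T}\ddot{L}(\theta)^{-1}\Delta$ that corresponds to its largest eigenvalue. Unusual cases are identified by comparing the curvature with a cut-off point for LI.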
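The curvature computation can be sketched as follows, assuming Cook's construction in which the influence matrix is $B=\Delta^{T}\{-\ddot{L}\}^{-1}\Delta$ and the direction of largest curvature is its dominant eigenvector. The function names, the use of power iteration, and the numerical inputs are assumptions made for illustration; the inverse of the observed information is taken as given.

```python
import math

def influence_matrix(delta, inv_neg_hessian):
    """B = Delta^T A Delta, with A the inverse of minus the Hessian.

    delta           : k x n nested list (k parameters, n observations)
    inv_neg_hessian : k x k nested list
    """
    k, n = len(delta), len(delta[0])
    a_delta = [[sum(inv_neg_hessian[r][m] * delta[m][j] for m in range(k))
                for j in range(n)] for r in range(k)]
    return [[sum(delta[r][i] * a_delta[r][j] for r in range(k))
             for j in range(n)] for i in range(n)]

def dominant_eigenvector(B, iterations=500):
    """Power iteration for the eigenvector of the largest eigenvalue of B."""
    n = len(B)
    v = [1.0 + 0.01 * idx for idx in range(n)]  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

# Assumed inputs: one parameter, three observations, the first one dominant.
B = influence_matrix([[3.0, 0.1, 0.1]], [[1.0]])
d_max = dominant_eigenvector(B)
curvatures = [2.0 * abs(B[i][i]) for i in range(len(B))]  # C_i per observation
print(d_max, curvatures)
```

An index plot of the absolute components of `d_max`, or of the individual curvatures, then points to the observations that dominate the local influence.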
4.2.2. Case-Weights Perturbation
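The log-likelihood function of the LNRM under case-weight perturbation, with $\omega=(\omega_{1},\ldots,\omega_{n})^{T}$ the vector of weights, takes the form
$$\ell(\theta\mid\omega)=\sum_{i\in F}\omega_{i}\left[-\log\sigma-\tfrac{1}{2}\log(2\pi)-\tfrac{1}{2}z_{i}^{2}\right]+\sum_{i\in C}\omega_{i}\log\left[1-\Phi(z_{i})\right],$$
where $0\le\omega_{i}\le 1$ and $\omega_{0}=(1,\ldots,1)^{T}$. Here $z_{i}=(y_{i}-x_{i}^{T}\beta)/\sigma$, and $F$ and $C$ denote the sets of uncensored and censored observations. After some algebraic manipulation, expressions are obtained for the elements of the matrix $\Delta$.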
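Under case-weight perturbation the weighted log-likelihood is linear in each weight, so the column of $\Delta$ for observation $i$ is simply that observation's score contribution evaluated at the MLE. The sketch below computes these columns, assuming the normal-error log-likelihood with right censoring and writing $h(z)=\phi(z)/[1-\Phi(z)]$; the function names and argument layout are hypothetical.

```python
import math

def hazard_std_normal(z):
    """h(z) = phi(z) / (1 - Phi(z)) for the standard normal distribution."""
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return pdf / (0.5 * math.erfc(z / math.sqrt(2.0)))

def case_weight_delta(beta, sigma, y, X, status):
    """Per-observation score contributions (one list per observation).

    Each list holds the derivatives with respect to beta_1..beta_p,
    followed by the derivative with respect to sigma.
    status: 1 = lifetime observed, 0 = censored.
    """
    columns = []
    for yi, xi, di in zip(y, X, status):
        z = (yi - sum(b * x for b, x in zip(beta, xi))) / sigma
        if di == 1:
            d_eta = z / sigma                 # d l_i / d(x_i' beta)
            d_sigma = (z * z - 1.0) / sigma   # d l_i / d sigma
        else:
            h = hazard_std_normal(z)
            d_eta = h / sigma
            d_sigma = z * h / sigma
        columns.append([x * d_eta for x in xi] + [d_sigma])
    return columns

# Assumed toy inputs: one uncensored and one censored case.
print(case_weight_delta([0.0], 2.0, [1.0, 0.5], [[1.0], [1.0]], [1, 0]))
```

Stacking these lists as columns gives a matrix that can be combined with the inverse observed information to obtain the curvature for case-weight perturbation.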
4.2.3. Response Perturbation
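Consider the regression model (7), assuming now that each $y_{i}$ is perturbed as $y_{i\omega}=y_{i}+\omega_{i}S_{y}$, $i=1,\ldots,n$, where $S_{y}$ is a scale factor. For this perturbation of the response, the log-likelihood of the LNRM takes the form
$$\ell(\theta\mid\omega)=\sum_{i\in F}\left[-\log\sigma-\tfrac{1}{2}\log(2\pi)-\tfrac{1}{2}(z_{i}^{*})^{2}\right]+\sum_{i\in C}\log\left[1-\Phi(z_{i}^{*})\right],$$
where $z_{i}^{*}=(y_{i\omega}-x_{i}^{T}\beta)/\sigma$, and $F$ and $C$ denote the sets of uncensored and censored observations. When the response is perturbed, expressions for the elements of the matrix $\Delta$ follow after some manipulation of this log-likelihood.
The direction vector $d_{\max}$ is constructed from $\Delta$ and corresponds to the largest curvature.
A large value of the $i$th component of $d_{\max}$ indicates that the $i$th observation exerts considerable LI on the fitted values. The index plot of $d_{\max}$ identifies the observations with high influence on the fitted values.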
4.2.4. Covariate Perturbation
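Consider now an additive perturbation of a particular continuous covariate, namely $x_{t}$, by setting $x_{it\omega}=x_{it}+\omega_{i}S_{x}$, where $S_{x}$ is a scale factor and $\omega_{i}\in\mathbb{R}$. This perturbation scheme leads to the following expression for the log-likelihood function:
$$\ell(\theta\mid\omega)=\sum_{i\in F}\left[-\log\sigma-\tfrac{1}{2}\log(2\pi)-\tfrac{1}{2}(z_{i}^{*})^{2}\right]+\sum_{i\in C}\log\left[1-\Phi(z_{i}^{*})\right],$$
where $z_{i}^{*}=(y_{i}-x_{i}^{*T}\beta)/\sigma$, where $x_{i}^{*T}\beta=\beta_{1}x_{i1}+\cdots+\beta_{t}(x_{it}+\omega_{i}S_{x})+\cdots+\beta_{p}x_{ip}$, and $F$ and $C$ denote the sets of uncensored and censored observations. Under covariate perturbation, the elements of the matrix $\Delta$ follow after some manipulation of this log-likelihood.
Assessing the largest curvature yields the corresponding direction vector $d_{\max}$.
To see for which observed values of $x_{t}$ the prediction is most sensitive to small changes in $x_{t}$, we can plot the components of $d_{\max}$ against $x_{t}$. The index plot of the vector $d_{\max}$ can indicate those observations for which a small perturbation in the value of $x_{t}$ leads to a substantial change in the prediction.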
5. Empirical Evaluation
To assess the performance of the derived results, we provide a Monte Carlo simulation study and an illustrative example, with the results reported in tables and figures and discussed below.
5.1. Simulation Study
We assess the performance of the developed diagnostics for the LND through a Monte Carlo simulation that follows a scheme similar to the one used by Ortega et al. The response variable is generated from the LNRM with arbitrarily chosen true parameter values, the true coefficient vector being selected subject to a fixed constraint, and the explanatory variables with no influential observations are generated from a specified distribution. The sample size $n$ is set to 50, 100, 150, and 200, and the number of explanatory variables $p$ is 1, 2, and 4. An influential observation is then created in the X's; that is, the 20th observation of the complete data set is replaced by a shifted value that depends on the standard deviation of the response.
For this study, random right censoring is used: for each generated sample, the observed time is the minimum of the survival time and the censoring time, together with an indicator of whether the survival time is observed.
The censoring level is set to 0%, 10%, 20%, and 30%. The ability of the diagnostics to identify the generated influential observation is then evaluated on the generated samples for the various censoring levels and for different values of the dispersion. The simulations are carried out in R.
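The random right-censoring step can be sketched as follows. The observed time is the minimum of the survival and censoring times, and the status indicator records which of the two was smaller. The uniform censoring distribution, the seed, and the sample size below are assumptions chosen only for illustration; they are not the settings of the study, which was run in R.

```python
import random

def apply_right_censoring(survival_times, censoring_times):
    """Return observed times and status (1 = lifetime observed, 0 = censored)."""
    observed = [min(t, c) for t, c in zip(survival_times, censoring_times)]
    status = [1 if t <= c else 0 for t, c in zip(survival_times, censoring_times)]
    return observed, status

rng = random.Random(2024)                      # fixed seed for reproducibility
survival = [rng.lognormvariate(0.0, 1.0) for _ in range(200)]
censoring = [rng.uniform(0.0, 10.0) for _ in range(200)]  # hypothetical censoring law
observed, status = apply_right_censoring(survival, censoring)
censored_fraction = 1.0 - sum(status) / len(status)
print(f"censored fraction: {censored_fraction:.3f}")
```

To reach a target censoring level such as 10%, 20%, or 30%, the parameters of the censoring distribution would be tuned until `censored_fraction` is close to the target.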
Tables 1–3 show the detection percentages of the diagnostic measures from the simulation study for different numbers of explanatory variables, that is, $p=1$, 2, and 4. The results show that the detection percentage decreases as the censoring level (0%, 10%, 20%, and 30%) and the dispersion increase, and that it increases markedly with the sample size $n$. In the simulation study for the LNRM, the developed curvature-based approaches diagnosed the unusual observation better than the generalized Cook distance and the One-Step NR method. Among the developed approaches, the best-performing curvature measure at 0% censoring is reported in Tables 1–3, as is the measure that detects the unusual observation in the highest percentage of cases as the censoring level increases (10%, 20%, and 30%). Figures 1–3 display the detection percentages for the different censoring levels and dispersion values at $p=1$, 2, and 4. These figures present graphically the performance of the diagnostic measures already discussed for Tables 1–3.
5.2. Example: Ovarian Cancer Survival Data
A sample of the survival times of 26 ovarian cancer patients from a clinical trial was taken from a cancer treatment report [40], Mayo Clinic, Rochester, USA. The trial assessed the effectiveness of various chemotherapies for women with ovarian cancer who had minimal residual disease after surgery to excise all tumors greater than 2 cm in diameter. For this study, noninformative censoring is used. The LNRM has four explanatory variables. The first two are age and residual disease present (1 = no, 2 = yes). The other two include ECOG performance status. The dependent variable is the survival or censoring time, with censoring status coded as 1 = lifetime observed and 0 = censored. An NR iteration is used for estimation.
The proposed model is as follows:
$$y_{i}=\beta_{0}+\beta_{1}x_{i1}+\beta_{2}x_{i2}+\beta_{3}x_{i3}+\beta_{4}x_{i4}+\sigma z_{i},$$
where $z_{i}$ follows the model given in (5). The MLEs of the model parameters are computed with the survreg function of the survival package (https://rdrr.io/cran/survival/man/survreg.html) in the R language. The MLEs and the absolute changes with respect to the unusual observations are given in Table 4.
Table 4 reports the regression estimates for the full data and after omitting the 1st, 13th, 14th, 21st, 22nd, 24th, and 25th observations, which were noted as influential, together with the absolute changes in the estimates. Eliminating the 1st observation from the data results in an absolute change of 598.6296% in one of the estimates, which indicates that this observation is highly influential. After omitting the 13th observation, the absolute change is 436.4282%. Similarly, after deleting the 14th, 21st, 22nd, 24th, and 25th observations from the data set, the maximum absolute changes are 381.0066%, 92.6017%, 694.1877%, 183.8186%, and 260.3238%, respectively.
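The percentage changes quoted above can be reproduced from any pair of estimates with a one-line formula. The sketch below assumes the usual definition, the absolute difference between the full-data and case-deleted estimates relative to the full-data estimate, expressed in percent; the source does not spell out its formula, and the function name and numbers are hypothetical.

```python
def absolute_percent_change(full_estimate, deleted_estimate):
    """|(full - deleted) / full| * 100, assuming a nonzero full-data estimate."""
    return abs((full_estimate - deleted_estimate) / full_estimate) * 100.0

# Assumed values for illustration, not taken from Table 4.
print(absolute_percent_change(2.0, 1.0))    # 50.0
print(absolute_percent_change(-4.0, -1.0))  # 75.0
```

Values above 100% arise whenever deleting a case moves an estimate by more than its full-data magnitude, which is how changes such as several hundred percent can occur.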
Table 5 and Figure 4 show the influence and curvature diagnostics of the LNRM for the example data, based on the generalized Cook distance, the One-Step NR diagnostics, and the curvature diagnostics. The generalized Cook distance detects the 1st and 6th observations as influential. The One-Step NR diagnostics based on one of the coefficients detect the 1st, 2nd, 3rd, 8th, 11th, 22nd, 24th, and 25th observations as influential, while based on the other three variables they detect the 9th, 13th, 14th, 20th, 21st, and 23rd observations as potentially influential. Under case-weight perturbation, the 1st, 13th, 14th, 21st, 22nd, and 24th observations stand out most clearly from the other observations.
Next, the influence of perturbations of the observed survival times is examined. For the response curvature, the 14th and 21st observations stand out from the other observations.
The perturbation of each of the four covariates is investigated here. Under covariate perturbation, the 2nd, 8th, 11th, and 20th observations show high influence.
6. Concluding Remarks
This paper developed new diagnostic approaches, based on the LI technique, for identifying influential observations in the LNRM with censored data. The curvatures were obtained as measures of local influence under perturbation of the case weights, the response, and the explanatory variables. We also implemented global influence methods based on the generalized Cook distance and the One-Step NR method. In the simulation, the best-performing measure depends on the censoring level: the results at 0% censoring and at increasing censoring levels are reported separately in Tables 1–3.
A. First-Order Partial Derivatives
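Here, we derive the formulas needed to obtain the first-order partial derivatives of the log-likelihood function. Let $z_{i}=(y_{i}-x_{i}^{T}\beta)/\sigma$, and let $F$ and $C$ denote the sets of uncensored and censored observations. After a few algebraic manipulations,
$$\frac{\partial\ell(\theta)}{\partial\beta_{j}}=\frac{1}{\sigma}\sum_{i\in F}x_{ij}z_{i}+\frac{1}{\sigma}\sum_{i\in C}x_{ij}h(z_{i}),\qquad \frac{\partial\ell(\theta)}{\partial\sigma}=\frac{1}{\sigma}\sum_{i\in F}\left(z_{i}^{2}-1\right)+\frac{1}{\sigma}\sum_{i\in C}z_{i}h(z_{i}),$$
where $h(z)=\phi(z)/[1-\Phi(z)]$, with $\phi$ and $\Phi$ the standard normal density and distribution functions.
B. Second-Order Partial Derivatives
Here, we derive the formulas needed to obtain the second-order partial derivatives of the log-likelihood function. With $z_{i}$, $F$, $C$, and $h$ defined as in the first-order derivatives, and $h'(z)=h(z)[h(z)-z]$, after some algebraic manipulations we obtain
$$\frac{\partial^{2}\ell(\theta)}{\partial\beta_{j}\,\partial\beta_{s}}=-\frac{1}{\sigma^{2}}\sum_{i\in F}x_{ij}x_{is}-\frac{1}{\sigma^{2}}\sum_{i\in C}x_{ij}x_{is}h'(z_{i}),$$
$$\frac{\partial^{2}\ell(\theta)}{\partial\beta_{j}\,\partial\sigma}=-\frac{2}{\sigma^{2}}\sum_{i\in F}x_{ij}z_{i}-\frac{1}{\sigma^{2}}\sum_{i\in C}x_{ij}\left[h(z_{i})+z_{i}h'(z_{i})\right],$$
$$\frac{\partial^{2}\ell(\theta)}{\partial\sigma^{2}}=\frac{1}{\sigma^{2}}\sum_{i\in F}\left(1-3z_{i}^{2}\right)-\frac{1}{\sigma^{2}}\sum_{i\in C}\left[2z_{i}h(z_{i})+z_{i}^{2}h'(z_{i})\right].$$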
Data Availability
The data are available from the first author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
References
[1] C. B. Williams, “An analysis of four years captures of insects in a light trap. Part II. The effect of weather conditions on insect activity; and the estimation and forecasting of changes in the insect population,” Transactions of the Royal Entomological Society of London, vol. 90, no. 8, pp. 227–306, 1940.
[2] R. Shanker, “Lognormal distribution and its applications in biological and medical sciences,” in Proceedings of the 4th International Conference and Exhibition on Biometrics and Biostatistics, San Antonio, TX, USA, November 2015.
[3] Z. Jin, D. Y. Lin, L. J. Wei, and Z. Ying, “Rank-based inference for the accelerated failure time model,” Biometrika, vol. 90, no. 2, pp. 341–353, 2003.
[4] Z. Ma and E. J. Bechinski, “Accelerated failure time (AFT) modeling for the development and survival of Russian wheat aphid, Diuraphis noxia (Mordvilko),” Population Ecology, vol. 51, no. 4, pp. 543–548, 2009.
[5] D. R. Cox and D. Oakes, Analysis of Survival Data, vol. 21, CRC Press, Boca Raton, FL, USA, 1984.
[6] J. D. Kalbfleisch and R. L. Prentice, The Statistical Analysis of Failure Time Data, vol. 360, John Wiley & Sons, Hoboken, NJ, USA, 2011.
[7] J. Aitchison and J. Brown, The Lognormal Distribution, Cambridge University Press, Cambridge, UK, 1963.
[8] E. Limpert, W. A. Stahel, and M. Abbt, “Log-normal distributions across the sciences: keys and clues,” BioScience, vol. 51, no. 5, pp. 341–352, 2001.
[9] M. Mitzenmacher, “A brief history of generative models for power law and lognormal distributions,” Internet Mathematics, vol. 1, no. 2, pp. 226–251, 2004.
[10] P. Tai, J. Tonita, E. Yu, and D. Skarsgard, “Twenty-year follow-up study of long-term survival of limited-stage small-cell lung cancer and overview of prognostic and treatment factors,” International Journal of Radiation Oncology, Biology, Physics, vol. 56, no. 3, pp. 626–633, 2003.
[11] T. Kuroishi, S. Tominaga, T. Morimoto et al., “Tumor growth rate and prognosis of breast cancer mainly detected by mass screening,” Japanese Journal of Cancer Research, vol. 81, no. 5, pp. 454–462, 1990.
[12] H. Weedon-Fekjær, B. H. Lindqvist, L. J. Vatten, O. O. Aalen, and S. Tretli, “Breast cancer tumor growth estimated through mammography screening data,” Breast Cancer Research, vol. 10, no. 3, 2008.
[13] W. D. Stein, W. D. Figg, W. Dahut et al., “Tumor growth rates derived from data for patients in a clinical trial correlate strongly with patient survival: a novel strategy for evaluation of clinical trial data,” The Oncologist, vol. 13, no. 10, pp. 1046–1054, 2008.
[14] J. Wilkerson, K. Abdallah, C. Hugh-Jones et al., “Estimation of tumour regression and growth rates during treatment in patients with advanced prostate cancer: a retrospective analysis,” The Lancet Oncology, vol. 18, no. 1, pp. 143–154, 2017.
[15] N. M. Linton, T. Kobayashi, Y. Yang et al., “Incubation period and other epidemiological characteristics of 2019 novel coronavirus infections with right truncation: a statistical analysis of publicly available case data,” Journal of Clinical Medicine, vol. 9, no. 22, p. 538, 2020.
[16] J. Odhiambo, P. Weke, and J. Wendo, “Modeling of returns of Nairobi securities exchange 20 share index using log-normal distribution,” Research Journal of Finance and Accounting, vol. 11, no. 8, 2020.
[17] A. L. Sweet, “On the hazard rate of the lognormal distribution,” IEEE Transactions on Reliability, vol. 39, no. 3, pp. 325–328, 1990.
[18] S. Chatterjee and A. S. Hadi, “Impact of simultaneous omission of a variable and an observation on a linear regression equation,” Computational Statistics & Data Analysis, vol. 6, no. 2, pp. 129–144, 1988.
[19] R. D. Cook, “Detection of influential observation in linear regression,” Technometrics, vol. 19, no. 1, pp. 15–18, 1977.
[20] H. Zhu, S. Y. Lee, B. C. Wei, and J. Zhou, “Case-deletion measures for models with incomplete data,” Biometrika, vol. 88, no. 3, pp. 727–737, 2001.
[21] H. Zhu, J. G. Ibrahim, S. Lee, and H. Zhang, “Perturbation selection and influence measures in local influence analysis,” Annals of Statistics, vol. 35, no. 6, pp. 2565–2588, 2007.
[22] H. Zhu, J. G. Ibrahim, and X. Shi, “Diagnostic measures for generalized linear models with missing covariates,” Scandinavian Journal of Statistics, vol. 36, no. 4, pp. 686–712, 2009.
[23] M. B. Massuia, C. R. B. Cabral, L. A. Matos, and V. H. Lachos, “Influence diagnostics for Student-t censored linear regression models,” Statistics, vol. 49, no. 5, pp. 1074–1094, 2015.
[24] R. D. Cook, “Assessment of local influence,” Journal of the Royal Statistical Society: Series B, vol. 48, no. 2, pp. 133–155, 1986.
[25] L. A. Escobar and W. Q. Meeker Jr, “Assessing influence in regression analysis with censored data,” Biometrics, vol. 48, no. 2, pp. 507–528, 1992.
[26] L. A. Weissfeld and H. Schneider, “Influence diagnostics for the Weibull model fit to censored data,” Statistics & Probability Letters, vol. 9, no. 1, pp. 67–73, 1990.
[27] L. A. Weissfeld and H. Schneider, “Influence diagnostics for the normal linear model with censored data,” Australian Journal of Statistics, vol. 32, no. 1, pp. 11–20, 1990.
[28] V. Leiva, M. Barros, G. A. Paula, and M. Galea, “Influence diagnostics in log-Birnbaum-Saunders regression models with censored data,” Computational Statistics & Data Analysis, vol. 51, no. 12, pp. 5694–5707, 2007.
[29] E. M. Ortega, V. G. Cancho, and H. Bolfarine, “Influence diagnostics in exponentiated-Weibull regression models with censored data,” Statistics and Operations Research Transactions, vol. 30, pp. 171–192, 2006.
[30] S. Liu, “On local influence for elliptical linear models,” Statistical Papers, vol. 41, no. 2, pp. 211–224, 2000.
[31] E. M. Ortega, H. Bolfarine, and G. A. Paula, “Influence diagnostics in generalized log-gamma regression models,” Computational Statistics and Data Analysis, vol. 42, no. 1-2, pp. 165–186, 2003.
[32] M. K. Venezuela, M. C. Sandoval, and D. A. Botter, “Local influence in estimating equations,” Computational Statistics & Data Analysis, vol. 55, no. 4, pp. 1867–1883, 2011.
[33] V. H. Lachos, F. Vilca, and M. Galea, “Influence diagnostics for the Grubbs’s model,” Statistical Papers, vol. 48, no. 3, pp. 419–436, 2007.
[34] G. A. Paula, M. Medeiros, and F. E. Vilca-Labra, “Influence diagnostics for linear models with first-order autoregressive elliptical errors,” Statistics & Probability Letters, vol. 79, no. 3, pp. 339–346, 2009.
[35] C. M. Russo, G. A. Paula, and R. Aoki, “Influence diagnostics in nonlinear mixed-effects elliptical models,” Computational Statistics & Data Analysis, vol. 53, no. 12, pp. 4143–4156, 2009.
[36] L. H. Vanegas, L. M. Rondón, and G. M. Cordeiro, “Diagnostic tools in generalized Weibull linear regression models,” Journal of Statistical Computation and Simulation, vol. 83, no. 12, pp. 2315–2338, 2013.
[37] J. Khaleeq, M. Amanullah, A. T. Abdulrahman, E. H. Hafez, and M. M. Abd El-Raouf, “Influence diagnostics in Log-Logistic regression model with censored data,” Alexandria Engineering Journal, vol. 61, no. 3, 2021.
[38] R. D. Cook and S. Weisberg, Residuals and Influence in Regression, Chapman and Hall, New York, NY, USA, 1982.
[39] D. A. Belsley, E. Kuh, and R. E. Welsch, Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, John Wiley & Sons, Hoboken, NJ, USA, 2005.
[40] H. J. Edmonson, J. Su, and J. E. Krook, “Treatment of ovarian cancer in elderly women: Mayo clinic–north central cancer treatment group studies,” Cancer, vol. 71, no. S2, pp. 615–617, 1993.