Abstract

For biological data, skewed distributions are often approximated by the log-normal regression model (LNRM). Traditional estimation techniques for the LNRM are sensitive to unusual observations, which can strongly affect the analysis and lead to imprecise conclusions. To overcome this issue, we develop diagnostic measures based on local influence to identify such observations in the LNRM under censoring. The proposed measures are derived by perturbing the case weights, the response, and the explanatory variables. We also consider the one-step Newton-Raphson method and the generalized Cook's distance. A Monte Carlo simulation and an application to real data illustrate the developed approaches.

1. Introduction

Survival analysis has seen a great surge of research interest over the last three decades [1, 2]. It is typically used to analyze the time until an event of interest occurs within a specified period. The accelerated failure time (AFT) model is widely used because it models the failure time directly, rather than the hazard as in the proportional hazards model, and it is therefore an important alternative to the proportional hazards model [3, 4]. The AFT model simplifies modeling because it relates the logarithm of the failure time linearly to the covariates [5, 6].

The log-normal distribution (LND) has wide application in biology [7], hydrology [8], and the social sciences [9]. It has also been used to fit different kinds of cancer survival data [10]; further applications can be found in [11-14] and the references cited therein. Very recently, the LND has also been applied to COVID-19 data [15] and financial decisions [16]. Sweet [17] considered the hazard rate of the LND.

The detection of unusual observations is difficult, and the impact of their presence on various aspects of the linear statistical model is well known [18]. Various diagnostic methods have been suggested for data sets that contain influential observations, and they are common in practice. Among these methods, case omission has received special attention. Cook [19] introduced case omission, and a large number of articles on the subject have been published since then. Zhu et al. [20-22] provided extended versions of case omission for several statistical models. Recently, the case omission approach has been extended to a censored log-linear regression model; for example, different versions of the case omission technique are available for the Student-t censored regression model [23]. Apart from that paper, no attention has been given to the impact of influential observations on the estimates of the LNRM.

Cook [24] provided an approach known as curvature diagnostics. Escobar and Meeker [25] described how local influence methods can be used to detect data or model perturbations that have a significant effect on maximum likelihood estimates under censoring. Weissfeld and Schneider [26, 27] discussed local influence (LI) for Weibull models and normal linear models. Leiva et al. [28] developed LI methods for the generalized linear model (GLM) with a log-Birnbaum-Saunders error distribution, and Ortega et al. [29] investigated the exponentiated Weibull model. Under censoring, Liu [30] investigated LI in elliptical linear models, Ortega et al. [31] adopted the LI technique for the GLM, and Venezuela et al. [32] discussed LI-based diagnostics for the GLM. Lachos et al. [33] used the LI approach in Grubbs' model, and Paula et al. [34] investigated the LI technique in linear models with first-order autoregressive elliptical errors. Russo et al. [35] employed the LI technique in nonlinear mixed-effects elliptical models. Vanegas et al. [36] examined the performance of LI for Weibull linear regression models. Very recently, Khaleeq et al. [37] used influence diagnostic techniques under censoring for a log-logistic regression model.

In this paper, curvature diagnostics are developed for the LNRM based on the LI and case omission approaches when censoring is present in the data. The diagnostic performance of the developed approaches is also compared with that of the available measures.

The structure of this paper is as follows. Section 2 introduces the LND. Section 3 presents the formulation of the LNRM under censoring. Section 4 describes the diagnostic techniques and the derived approaches. Section 5 uses a Monte Carlo simulation and a real-world example to assess the empirical performance of the derived techniques. The last section presents the findings and conclusions.

2. Log-Normal Distribution

The LND is also known as the Galton distribution, named after Francis Galton (1879), a statistician of the English Victorian era. The focus of interest is the occurrence of a particular event within a given period of time. The response variable is a nonnegative random variable that gives the survival time of an object or an individual. Its distribution is described by a probability density function with two parameters, a scale parameter and a shape parameter, and the corresponding survival function follows from this density.

This family of distributions is suitable when the hazard rate first increases and then decreases, that is, when the failure rate is hump-shaped and nonmonotonic.
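The hump shape of the hazard can be checked numerically. The following is a minimal sketch that assumes the common (mu, sigma) parameterization of the LND on the log scale, which may differ from the parameterization used in this section; the function names and the evaluation grid are illustrative only.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: .pdf is phi, .cdf is Phi

def lognormal_pdf(t, mu, sigma):
    """Density of a log-normal time t (assumed parameterization: log T ~ N(mu, sigma^2))."""
    z = (math.log(t) - mu) / sigma
    return STD_NORMAL.pdf(z) / (sigma * t)

def lognormal_survival(t, mu, sigma):
    """Survival function P(T > t) under the same assumed parameterization."""
    z = (math.log(t) - mu) / sigma
    return 1.0 - STD_NORMAL.cdf(z)

def lognormal_hazard(t, mu, sigma):
    """Hazard rate = density / survival."""
    return lognormal_pdf(t, mu, sigma) / lognormal_survival(t, mu, sigma)

# Evaluate the hazard on an illustrative grid of times.
grid = [0.1, 0.5, 1.0, 2.0, 5.0, 20.0]
hazards = [lognormal_hazard(t, 0.0, 1.0) for t in grid]
peak = hazards.index(max(hazards))  # position of the largest hazard on the grid
```

On this grid the hazard first rises and then falls, so `peak` is an interior index rather than the first or last grid point.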

3. The LNRM for Censored Data

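Let a covariate vector be associated with the response through a regression model. After a logarithmic transformation of the survival time and a matching reparameterization, the density function of the log survival time is given in (3). Using (3), the AFT model in (4) expresses the log survival time as a location term plus a scaled error, where the error variable follows the density in (5) with the survival function in (6).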

Now consider the regression model based on the LND in (4). For a sample of responses with their covariate vectors, the log survival time of each individual is represented as a linear predictor in the explanatory variables plus a scaled error, where the error follows the distribution in (6).

Moreover, for a sample from (3), each individual contributes the logarithm of its observed survival or censoring time together with its covariate vector. The log-likelihood function is the sum of the log-density contributions of the uncensored observations and the log-survival contributions of the censored observations. The maximum likelihood estimates (MLEs) of the parameter vector can be obtained by the Newton-Raphson (NR) approach. The covariance of the MLEs can be obtained from the Hessian matrix, whose negative inverse gives the asymptotic covariance matrix.
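As an illustration of how such a censored log-likelihood can be evaluated, the sketch below assumes a standard normal error on the log-time scale and right censoring. The argument names (`beta`, `sigma`, `y`, `X`, `status`) are assumptions made for this example and are not the paper's notation.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def censored_loglik(beta, sigma, y, X, status):
    """Log-likelihood of a log-normal AFT model with right censoring.

    y      : log observed times (event or censoring time)
    X      : covariate rows, including a leading 1.0 if an intercept is wanted
    status : 1 = event observed, 0 = right-censored
    """
    total = 0.0
    for yi, xi, di in zip(y, X, status):
        linear_predictor = sum(b * x for b, x in zip(beta, xi))
        z = (yi - linear_predictor) / sigma
        if di == 1:
            # uncensored case: log density of y_i
            total += math.log(STD_NORMAL.pdf(z)) - math.log(sigma)
        else:
            # censored case: log survival probability at y_i
            # (for very large z this probability underflows; not handled here)
            total += math.log(1.0 - STD_NORMAL.cdf(z))
    return total

# Tiny illustrative data set: intercept plus one covariate.
y = [0.2, 1.1, 0.7]
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.5]]
status = [1, 0, 1]
value = censored_loglik([0.3, 0.6], 1.0, y, X, status)
```

Maximizing this function over `beta` and `sigma`, for instance with NR iterations, would give the MLEs; the optimization step is left out of the sketch.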

Because of the censoring mechanism, the Fisher information matrix cannot be determined; in its place, the matrix of second derivatives of the log-likelihood is used, with its submatrices given in the Appendix.

4. Diagnostic Analysis

Influence diagnostic methods evaluate the sensitivity of the parameter estimates of a particular model to perturbations either in the data set or in the essential assumptions of the model. In this section, a few popular approaches to diagnostics are discussed.

4.1. Global Influence Diagnostics

Case omission, also called global influence diagnostics, is a popular technique that assesses the effect on the estimates of removing an observation from the data set. The common methods are as follows.

4.1.1. Generalized Cook Distance

Coefficient estimation is of major importance in regression modeling. Cook's distance [24] measures the effect of omitting a case on the estimated coefficients [24, 38]. For the LNRM, global influence is assessed with the generalized Cook's distance, a standardized norm of the difference between the parameter estimates obtained with and without a given observation; its definition also involves the number of regression coefficients.

Remark 1. A larger value of the generalized Cook's distance indicates a stronger influence of the corresponding observation on the MLEs. Zhu et al. [20] gave an approximate value for the average of this measure, which provides a cut-off point for flagging observations.
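A generalized Cook's distance of this kind is often computed as a quadratic form in the change of the estimates. The sketch below assumes that common form, with the observed information matrix (the negative Hessian at the full-data MLE) as the standardizing matrix; the exact scaling used in this paper may differ, and all names and numbers are illustrative.

```python
def generalized_cook(theta_full, theta_deleted, obs_info):
    """Quadratic form (theta_del - theta_full)^T * obs_info * (theta_del - theta_full).

    theta_full    : estimates from the full data
    theta_deleted : estimates with one case removed
    obs_info      : observed information matrix as a list of rows (assumed form)
    """
    diff = [a - b for a, b in zip(theta_deleted, theta_full)]
    size = len(diff)
    return sum(diff[j] * obs_info[j][k] * diff[k]
               for j in range(size) for k in range(size))

# Hypothetical numbers, only to show the call.
gd = generalized_cook([1.0, 2.0], [1.5, 1.0], [[2.0, 0.0], [0.0, 1.0]])
```

Computing this value for every case and plotting it against the case index gives the usual index plot for global influence.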

4.1.2. One-Step Newton-Raphson Method

The influence of a single observation on the estimates can be ascertained by the deletion approach, in which the observation is removed from the data and the parameters are re-estimated [38, 39]. For the LNRM, instead of iterating to convergence, a one-step approximation is used: a single NR step approximates the estimate obtained with the observation deleted. The required elements with the observation eliminated are presented in the Appendix.
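The one-step idea can be illustrated on a deliberately simplified model. The sketch below assumes an intercept-only log-normal model with a known scale (`sigma`), so that the parameter is a single location `mu`; this is a simplification for illustration, not the full LNRM, and the data are hypothetical. A single NR step is taken from the full-data estimate with case `i` left out of the score and the second derivative.

```python
from statistics import NormalDist

N01 = NormalDist()

def score_and_hessian(mu, sigma, y, status, skip=None):
    """First and second derivatives in mu of the censored log-likelihood,
    optionally leaving out the case with index `skip`."""
    u, h = 0.0, 0.0
    for i, (yi, di) in enumerate(zip(y, status)):
        if i == skip:
            continue
        z = (yi - mu) / sigma
        if di == 1:                                   # observed event
            u += z / sigma
            h += -1.0 / sigma ** 2
        else:                                         # right-censored case
            lam = N01.pdf(z) / (1.0 - N01.cdf(z))     # normal hazard at z
            u += lam / sigma
            h += -lam * (lam - z) / sigma ** 2
    return u, h

def newton_fit(sigma, y, status, start, skip=None, steps=50):
    """Fully iterated NR estimate of mu."""
    mu = start
    for _ in range(steps):
        u, h = score_and_hessian(mu, sigma, y, status, skip)
        mu -= u / h
    return mu

def one_step_deleted(mu_hat, sigma, y, status, i):
    """One NR step from the full-data estimate with case i deleted."""
    u, h = score_and_hessian(mu_hat, sigma, y, status, skip=i)
    return mu_hat - u / h

# Hypothetical log-times; the fifth value is unusually large, the third is censored.
y = [1.2, 0.8, 1.5, 0.9, 4.0, 1.1]
status = [1, 1, 0, 1, 1, 1]
mu_hat = newton_fit(1.0, y, status, start=1.5)
changes = [abs(one_step_deleted(mu_hat, 1.0, y, status, i) - mu_hat)
           for i in range(len(y))]
```

In this toy example the largest change in the estimate comes from deleting the unusually large fifth value, and the one-step values stay close to the fully iterated deleted estimates.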

4.2. Local Influence

The LI approach is based on differential geometry rather than on case deletion. It requires a differential comparison of the estimates before and after perturbing the model or the data. Different perturbation schemes exist to determine the LI.

4.2.1. The Local Influence Diagnostics

For LI diagnostics, consider the postulated log-likelihood for the parameter vector together with a vector of perturbations restricted to some open subset. Under case-weight perturbation, the log-likelihood is the weighted sum of the individual contributions, and there is a no-perturbation vector at which the perturbed log-likelihood equals the postulated one. Cook [24] assessed the effect of a given perturbation through the likelihood displacement, by studying the behavior of the influence surface. We are concerned with the identification of possibly unusual (influential) observations. The likelihood displacement measures the distance between the MLE and the perturbed MLE. The normal curvature in a given direction is computed from the observed Fisher information matrix of the postulated model and the matrix of second derivatives of the perturbed log-likelihood with respect to the parameters and the perturbations, evaluated at the MLE and the no-perturbation vector [24]. The direction of largest curvature is the eigenvector associated with the largest eigenvalue of the resulting matrix. Unusual cases are pointed out with a cut-off point for LI.
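The curvature computation described above can be sketched as follows, using the form usually attributed to Cook: with a perturbation matrix `Delta` (parameters by cases) and the inverse of the negative Hessian, the matrix B = Delta^T (-L'')^{-1} Delta is formed, the normal curvature in a unit direction d is 2|d^T B d|, and the direction of largest curvature is the leading eigenvector of B. The matrices below are hypothetical inputs, and the power iteration is one simple way to obtain the leading eigenvector.

```python
import math

def mat_vec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def curvature_matrix(delta, neg_hess_inv):
    """B = Delta^T (-L'')^{-1} Delta for a p x n matrix Delta (list of p rows)."""
    p, n = len(delta), len(delta[0])
    cols = [[delta[j][i] for j in range(p)] for i in range(n)]   # Delta_i, one per case
    half = [mat_vec(neg_hess_inv, c) for c in cols]              # (-L'')^{-1} Delta_i
    return [[sum(cols[i][j] * half[k][j] for j in range(p)) for k in range(n)]
            for i in range(n)]

def normal_curvature(B, d):
    """Normal curvature 2 |d^T B d| after scaling d to unit length."""
    norm = math.sqrt(sum(x * x for x in d))
    unit = [x / norm for x in d]
    return 2.0 * abs(sum(u * b for u, b in zip(unit, mat_vec(B, unit))))

def direction_of_max_curvature(B, iters=200):
    """Power iteration for the leading eigenvector of the symmetric matrix B."""
    n = len(B)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = mat_vec(B, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical inputs: two parameters, four cases.
delta_mat = [[0.1, -0.2, 2.0, 0.3],
             [0.0, 0.5, 1.0, -0.4]]
neg_hess_inv = [[0.25, 0.0],
                [0.0, 0.5]]
B = curvature_matrix(delta_mat, neg_hess_inv)
d_max = direction_of_max_curvature(B)
most_influential = max(range(len(d_max)), key=lambda i: abs(d_max[i]))
```

The index plot of the absolute components of `d_max` is the usual display; here `most_influential` picks the case with the largest component.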

4.2.2. Case-Weights Perturbation

The log-likelihood function of the LNRM under a vector of case weights is the weighted sum of the individual log-likelihood contributions, and the vector of ones corresponds to no perturbation. After some algebraic manipulation, we obtain the following expressions for the weighted log-likelihood function and the elements of the perturbation matrix.
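Under case weights, differentiating the weighted log-likelihood with respect to a parameter and a weight leaves the score contribution of that case, so the perturbation matrix collects the per-case scores at the MLE. The sketch below illustrates this on an assumed intercept-only model with known scale, a simplification of the LNRM with hypothetical data; the per-case curvature is taken as twice the squared score divided by the observed information.

```python
from statistics import NormalDist

N01 = NormalDist()

def case_scores(mu, sigma, y, status):
    """Per-case derivative of the log-likelihood in mu (one entry per case)."""
    scores = []
    for yi, di in zip(y, status):
        z = (yi - mu) / sigma
        if di == 1:
            scores.append(z / sigma)                                   # observed event
        else:
            scores.append(N01.pdf(z) / (1.0 - N01.cdf(z)) / sigma)     # censored case
    return scores

def observed_information(mu, sigma, y, status):
    """Negative second derivative of the log-likelihood in mu."""
    total = 0.0
    for yi, di in zip(y, status):
        z = (yi - mu) / sigma
        if di == 1:
            total += 1.0 / sigma ** 2
        else:
            lam = N01.pdf(z) / (1.0 - N01.cdf(z))
            total += lam * (lam - z) / sigma ** 2
    return total

# Hypothetical log-times and censoring indicators (1 = observed).
y = [1.2, 0.8, 1.5, 0.9, 4.0, 1.1]
status = [1, 1, 0, 1, 1, 1]
sigma = 1.0

mu = sum(y) / len(y)                      # starting value
for _ in range(50):                       # NR iterations for the MLE of mu
    mu += sum(case_scores(mu, sigma, y, status)) / observed_information(mu, sigma, y, status)

delta_row = case_scores(mu, sigma, y, status)      # perturbation matrix (1 x n here)
info = observed_information(mu, sigma, y, status)
curvatures = [2.0 * s * s / info for s in delta_row]
```

With a single parameter the direction of largest curvature is proportional to `delta_row`, so the case with the largest absolute score also has the largest curvature.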

4.2.3. Response Perturbation

Consider the regression model (7) and assume now that each response is perturbed by its perturbation component multiplied by a scale factor. Under this response perturbation, the log-likelihood of the LNRM is evaluated at the perturbed responses. The resulting log-likelihood function and the expressions for the elements of the perturbation matrix are as follows:

Vector is created on , which corresponding to the vector

A large component of the above vector indicates that the corresponding observation has considerable local influence on its fitted value. The index plot of this vector identifies the observations with high influence on the fitted values.

4.2.4. Covariate Perturbation

Consider now an additive perturbation of a particular continuous covariate, in which each value of that covariate is shifted by its perturbation component multiplied by a scale factor. This perturbation scheme leads to the following expressions for the log-likelihood function and for the elements of the perturbation matrix:

The assessment of the larger curvature at leads toand consequently,

To see for which observed values of the covariate the prediction is most sensitive to small changes in that covariate, we can plot the components of the direction of largest curvature against the covariate values. The index plot of this vector can indicate those observations for which a small perturbation in the covariate value leads to a substantial change in the prediction.

5. Empirical Evaluation

To assess the performance of the derived results, we provide a Monte Carlo simulation scheme and an illustrative example, with the results reported in tables and figures and then discussed.

5.1. Simulation Study

We assess the performance of the developed diagnostics for the LND with a Monte Carlo simulation that follows a scheme similar to that of Ortega et al. [29]. The response variable is generated from the regression model with arbitrarily chosen values of the dispersion and of the true coefficient vector, and the explanatory variables are first generated with no influential observations. The sample size is set to 50, 100, 150, and 200, and the number of explanatory variables is 1, 2, and 4. Then, an influential observation is created in the X's; that is, the 20th observation in the complete data set is replaced by a value that depends on the standard deviation of the response.

For this study, random right censoring is used: for each generated sample, the observed time is the minimum of the survival time and the censoring time.

The censoring level is set to 0%, 10%, 20%, and 30%. The performance of the diagnostics in identifying the generated influential observation is then evaluated on the generated samples for the various censoring levels and for different values of the dispersion. The simulations are carried out in R.
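The random right-censoring step can be sketched as follows. The covariate distribution, the uniform censoring distribution, and all parameter values in this example are assumptions chosen for illustration; they are not the settings of the simulation study, which was run in R.

```python
import math
import random

def simulate_censored(n, beta, sigma, censor_upper, seed=0):
    """Generate (observed time, status, covariate) triples with random right censoring.

    beta         : (intercept, slope) on the log-time scale (hypothetical values)
    censor_upper : upper limit of the assumed uniform censoring times
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                                   # assumed covariate law
        log_t = beta[0] + beta[1] * x + sigma * rng.gauss(0.0, 1.0)
        t = math.exp(log_t)                                       # log-normal survival time
        c = rng.uniform(0.0, censor_upper)                        # assumed censoring time
        observed = min(t, c)                                      # observed time
        status = 1 if t <= c else 0                               # 1 = event, 0 = censored
        rows.append((observed, status, x))
    return rows

sample = simulate_censored(200, beta=(1.0, 0.5), sigma=0.5, censor_upper=20.0, seed=1)
censoring_rate = 1.0 - sum(status for _, status, _ in sample) / len(sample)
```

Changing `censor_upper` moves the censoring rate, which is one simple way to target a given censoring level in such a scheme.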

Tables 1-3 show the percentage results of the diagnostic measures from the simulation study for 1, 2, and 4 explanatory variables. The results show that the detection percentage decreases as the censoring level (0%, 10%, 20%, and 30%) and the dispersion increase, and increases markedly with the sample size. In the simulation study for the LNRM, the developed approaches diagnosed the unusual observation better than the generalized Cook's distance and the one-step NR method. In the developed approaches, the performance of was better in all cases at 0% censored observations. By increasing the level of censoring (10%, 20%, and 30%), performs superior in terms of diagnosing the unusual observation in the highest percentage. Figures 1-3 display the detection percentages for the different censoring levels and dispersion values for 1, 2, and 4 explanatory variables. These figures show graphically the performance of the diagnostic measures already discussed for Tables 1-3.

5.2. Example: Ovarian Cancer Survival Data

A sample of the survival times of 26 ovarian cancer patients from a clinical trial was taken from a cancer treatment report [40], Mayo Clinic, Rochester, USA. The trial assessed the effectiveness of various chemotherapies for women with ovarian cancer who had minimal residual disease after surgery to excise all tumors greater than 2 cm in diameter. For this study, noninformative censoring is used. The LNRM has four explanatory variables. The first two are Age and Residual disease present, in which 1 = no and 2 = yes; the others include ECOG performance. The dependent variable is the survival or censoring time, with a censoring status in which 1 = lifetime observed and 0 = censored. An NR iteration is used for estimation [6].

In the proposed model, the error term follows the model given in (5). The MLEs of the model parameters are obtained with the survreg function of the survival package (https://rdrr.io/cran/survival/man/survreg.html) in R. The MLEs and the absolute changes with respect to the unusual observations are given in Table 4.

Table 4 reports the regression estimates for the full data and after omitting the 1st, 13th, 14th, 21st, 22nd, 24th, and 25th observations, which were noted as influential, together with the absolute changes in the estimates. Eliminating the 1st observation from the data results in an absolute change of 598.6296% in an estimate, which indicates a high influence of this observation. After omitting the 13th observation, the absolute change is 436.4282%. Similarly, after deleting the 14th, 21st, 22nd, 24th, and 25th observations from the data set, the maximum absolute changes are 381.0066%, 92.6017%, 694.1877%, 183.8186%, and 260.3238%, respectively.
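Absolute changes of this kind are commonly computed as the absolute difference between the full-data estimate and the estimate after omission, relative to the full-data estimate and expressed in percent. The short sketch below assumes that definition; the function name and the numbers are hypothetical and are not taken from Table 4.

```python
def absolute_percent_change(estimate_full, estimate_omitted):
    """|(full - omitted) / full| * 100, assuming the usual relative-change definition."""
    return abs((estimate_full - estimate_omitted) / estimate_full) * 100.0

# Hypothetical estimates: an estimate that halves after omission changes by 50%.
change = absolute_percent_change(2.0, 1.0)
```

Values far above 100% arise when omitting a case moves a coefficient by several times its original size or reverses its sign.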

Table 5 and Figure 4 show the influence and curvature diagnostics of the LNRM for the example data, based on the generalized Cook's distance, the one-step NR diagnostics, and the curvature diagnostics. The generalized Cook's distance detects the 1st and 6th observations as influential. The one-step NR diagnostics for one coefficient detect the 1st, 2nd, 3rd, 8th, 11th, 22nd, 24th, and 25th observations as influential, while for the other three variables they detect the 9th, 13th, 14th, 20th, 21st, and 23rd observations as potentially influential. Under case-weight perturbation, the 1st, 13th, 14th, 21st, 22nd, and 24th observations stand out most clearly from the others.

Next, the influence of perturbations on the observed survival times will be examined. For response curvature, the 14th and 21st observations are distinguished from the other observations.

The perturbation of each of the four covariates is investigated here. Under covariate perturbation, the 2nd, 8th, 11th, and 20th observations show high influence.

6. Concluding Remarks

This paper developed new diagnostic approaches for the LNRM with censored data to identify influential observations by using the LI technique. The curvatures were obtained as measures of local influence under perturbation of the case weights, the response, and the explanatory variables. We also applied the global influence methods based on the generalized Cook's distance and the one-step NR method. is observed superior in simulation when there is 0% censoring in the data. While increasing the level of censoring, the performance of was better than the others.

Appendix

A. First-Order Partial Derivatives

Here, we derive the necessary computation formulas to obtain the first-order partial derivatives of the log-likelihood function. After a few algebraic manipulations,where .

B. Second-Order Partial Derivatives

Here, we derive the necessary formulas to obtain the second-order partial derivatives of the log-likelihood function. After some algebraic manipulations, we obtain

Data Availability

The data are available from the first author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.