Abstract

The traditional variable selection methods for survival data depend on iteration procedures, and control of this process assumes tuning parameters that are problematic and time consuming, especially if the models are complex and have a large number of risk factors. In this paper, we propose a new method based on the global sensitivity analysis (GSA) to select the most influential risk factors. This contributes to simplification of the logistic regression model by excluding the irrelevant risk factors, thus eliminating the need to fit and evaluate a large number of models. Data from medical trials are suggested as a way to test the efficiency and capability of this method and as a way to simplify the model. This leads to construction of an appropriate model. The proposed method ranks the risk factors according to their importance.

1. Introduction

Sensitivity analysis (SA) plays a central role in a variety of statistical methodologies, including classification and discrimination, calibration, comparison, and model selection [1]. SA also can be used to determine which subset of input factors (if any) accounts for most of the output variance (and in what percentage); those factors with a small percentage can be fixed to any value within their range [2]. In such usage, the focus is on determination of the important variables to simplification of the model; the original motivation for our research lay in a search for how to best arrive at such a determination. Although SA has been widely used in normal regression models to extract important input variables from a complex model so as to arrive at a reduced model with equivalent predictive power, it has limited use for selection of risk factors despite the presence of a large number of risk factors in survival regression models. The limited use of these methods to select appropriate subsets in survival regression models illustrates the desirability of development of a new method of SA-based variable selection to avoid the drawbacks of traditional methods and also simplify survival regression models by choosing the appropriate subsets of risk factors.

A considerable number of methods of variable selection have been proposed in the literature. The fundamental developments are squarely in the context of normal regression models and particularly in the context of multivariate linear regression models [3]. A comprehensive review of many variable selection methods is represented in [4]. Methods such as forward, backward, and stepwise selection and subset selection (Akaike information criterion (AIC)) and Bayesian information criterion (BIC)) are available; however none of these methods can be recommended for use in either a logistic regression model or in other survival regression models. They give incorrect estimates of the standard errors and P-values. They also can delete variables whose inclusion is critical [3]. In addition, these methods regard all the risk factors of a situation as equal, and they seek to identify the candidate subset of variables sequentially; furthermore, most of these methods focus on the main effects and ignore higher-order effects (interactions of variables).

New methods of variable selection, such as least absolute shrinkage and selection operator (LASSO) in [5], and the smoothly clipped absolute deviation (SCAD) method in [6], are at the center of attention recently in the field of survival regression models. These methods use the penalized likelihood estimation and the shrinkage regression approaches. These two approaches differ from traditional methods in their deletion of the nonsignificant covariates in the model by estimating their effects as 0. A nice feature of these methods is that they perform estimation and variable selection simultaneously, but, nevertheless, these methods suffer from some calculation and characteristics problems that are dealt with in more detail in [7, 8].

This study aims to use SA to extend and develop an effective, efficient, and time-saving variable selection method in which the best subsets are identified according to specified criteria without resorting to fitting all the possible subset regression models in the field of survival regression models. The remainder of this study is organized as follows: Section 2 gives the background of building a logistic regression model, and Section 3 deals with the proposed method. The results of implementing this method and logistic regression model are the subject of Section 4, and Section 5 consists of the discussion and conclusions.

2. Background of Constructing a Logistic Regression Model

Often the response variable in clinical data is not a numerical value but a binary one (e.g., alive or dead, diseased or not diseased). When the latter occurs, a binary logistic regression model is an appropriate method to present the relationship between the disease’s measurements and its risk factors. It is a form of regression used when the response variable (the disease measurement) is a dichotomy and the risk factors of the disease are of any type [9]. A logistic regression model neither assumes the linearity in the relationship between the risk factors and the response variable, nor does it require normally distributed variables. It also does not assume homoscedasticity, and in general has less stringent requirements than linear regression models. However, it does require that observations are independent and that the independent risk factors are linearly related to the logit of the response variable [10]. However, models involving the association between risk factors and binary response variables are found in various disciplines such as medicine, engineering, and the natural sciences. How do we model the relationship between risk factors and binary response variable? The answer to this question is the subject of the next subsections.

2.1. Constructing a Logistic Regression Model

The first step in modeling binomial data is a transformation of the probability scale from range to instead of using the linear model for the response variable of the probability of success on risk factors. The logistic transformation or logit of the probability of success is , which is written as and defined as the log odds of success. It is easily seen that any value of in the range corresponds to the value of in . Usually, binary data results from a nonlinear relationship between and , where a fixed change in has less impact when is near (0 or 1) than when is near (0.5). This is illustrated in Figure 1 (see [11]).

Thus, the appropriate link is the log odds transformation (the logit). Then if there are n binomial observations of the form for , where the expected value of the random variable associated with th observation, , is . The logistic regression model for association of on the values of k risk factors is such that [10] and the equation of success probability is The linear logistic model is a member of a family of generalized linear models (GLM). The next subsection explains this model fitting process.

2.2. Fitting Logistic Regression Models

The mechanics of maximum likelihood (ML) estimation and model fitting for logistic regression model are a special case of GLM fitting, and then fitting the model requires estimation of the unknown parameters of the ML function of this model using the Bernoulli ML as in the following [12]: The problem now is to obtain those values that maximize or its equivalent where it is as follows: where and represent all values of . The derivative of this log-likelihood function with respect to the unknown -parameters is given by Then the likelihood equations are where is the ML estimate of . There are two methods to solve (6) and obtain the maximum likelihood estimation of . The one most often used is known as the Newton-Raphson method. This method begins with determination of the score matrix and the information matrix as in the following [13]: Here is obtained from through (2), then we use and with the following formula to obtain the next value as where , this is to obtain , and so on.

2.3. Evaluating the Fitted Model

A simple model that fits adequately has the advantage of model parsimony. If a model has relatively little bias and describes reality well, it tends to provide more accurate estimates of the quantities of interest. Agresti [9] stated that “we are mistaken if we think that we have found the true model, because any model is a simplification of reality.” In light of this assertion, what then is the logic of testing the fit of a model when we know that it does not truly hold? The answer lies in the evaluation of the specific properties of this model by using criteria such as deviance, the , the Wald Score test, the Person chi-square, and the Hosmer-Lemshow chi-square tests; for more details, see [9, 10]. Usually the first stage of construction of any model presents a large number of risk factors. Inclusion of all of them may lead to an unattractive model from a statistical viewpoint. Thus, as an important step towards an acceptable model, a decision should be made early about the proper methodology to use to select the appropriate and important risk factors. Because traditional methods of selecting variables have many limitations in their applicability to survival regression models, a new method of variables selection will be developed by using GSA to select the most influential factors in the model. This is the subject of the following section.

3. Sensitivity Analysis to Select the Most Influencing Risk Factors

There are two key problems in variable selection procedure: (i) how to select an appropriate number of risk factors from the set of risk factors, and (ii) how to improve final model performance based on the given data. So answering these questions is the objective of our proposed method by applying GSA to select the influential risk factors in the logistic regression model.

3.1. General Concept of GSA

GSA was defined in [14] as “the study of how the uncertainty in the output of a model (numerical or otherwise) can be apportioned to different sources of uncertainty in the model input.” Hence one possible way to apportion the importance of the input factors with respect to the model response is to apply GSA. In general the importance of a given risk factor can be measured via the so-called sensitivity index, which is defined as the fractional contribution to the model output variance because of the uncertainty in . For k risk factors, the sensitivity indices can be computed using the following decomposition formula for the total output variance of the output Y [15]: where where is the unconditional variance of output of the model (incidence of CHD), is the conditional variance of risk factor , and is the variance of interaction between and , and so on. A first measurement of the fraction of the unconditional output variance that is accounted for by the uncertainty in , which is the first-order sensitivity index for the factor is given as The second terms in (9) are known as the effect of interactions. It is a fact that the number and importance of the interaction terms usually grow (i) with the number of risk factors and (ii) with the range of variation of the risk factors [16]. This means that if all of the terms are computed, then most likely would still be lower than the total , because the difference is a measure of the impact of the interactions. Consequently, when , then the model is additive (i.e., without interactions among its input factors), and thus the first order of conditional variances of (10) are all we need to decompose the model output variance. For a nonadditive model, higher-order sensitivity indices account for interaction effects among sets of input factors. However, higher-order sensitivity indices are usually not estimated directly because if the model consists of k risk factors, then the total number of indices (including the 's) that should be estimated is as high as . For this reason, a more compact sensitivity measurement is used; this measurement is the total effect sensitivity index, which concentrates in one single term on all the interactions involving a given factor . For example, for a model of risk factors, the three total sensitivity indices would be [2] and analogously where the conditional variance in (12) expresses the total contribution to the variance of Y because of non-, (i.e., to the remaining factors), so that includes all terms (i.e., a first order as well as interactions in (9)) that involve risk factor . For a given risk factor , thecoefficient of importance is the difference between and that reflects an important role of interactions for that risk factor in Y, Explaining the interactions among risk factors helps us to improve our understanding about the model structure. Estimators for both are provided by a variety of methods such as Sobol, the Fourier amplitude sensitivity test (FAST), and others; for more details, see [17].

3.2. GSA in a Logistic Regression Model

In this study, partitioning the total variance of the objective function is the way to estimate and so as to perform a GSA. How can this model be extended to deal with a binary response variable? Although partitioning of variances is uncomplicated in models with a continuous response variable and a normal error distribution, the extension of this partitioning to models with binary responses is not simple [18]. Consequently, to extend the variance partitioning method to our binary response variable (incident of coronary heart disease (CHD)), suppose that the data is consisting of , the number of people who have CHD. The actual response probability of the incidence of CHD for the th observation will have a Bernoulli distribution with a mean of , where is the proportion of the patients who have a disease. This response probability is therefore a random variable where . The variance of must be equal to zero when is zero or unity, and then a relationship between the unknown probability and our risk factors can be fitted. Typically a logistic regression model represents this relationship between for a sample with n people who have a binomial distribution (i.e., , and the corresponding response probability of the incidence of the disease is for ith observation and the k risk factors as in (4) and (5). This model assumes independence between the n observations, and then all the variations conditional on the estimates of the probabilities will be binomial with equal variance: The binomial is not the only possible distribution for fitting proportion data. Other distributions exist that have greater variation (known as overdispersion) or less variation (known as underdispersion) than the binomial distribution conditional on the values of 's. The simplest function for the true probability of the ith observation uses a multiplicative scale factor to determine the variance of the response as where r is a scale factor that is equal to 1. If we have a binomial variation, it will be greater than 1 if there is overdispersion and less than 1 if there is underdispersion, and is an unobservable random variable [19]. The advantages of the multiplicative approach are that it will allow both over- and underdispersions. The random variable is associated with the observed number of incidences of the disease for the ith unit, . It will have a binomial distribution, and then the mean of , conditionalon , is and the conditional variance of is Since cannot be calculated, then the observed proportion of the disease incidence has to be an estimate of as According to a standard result from the conditional probability theory, the unconditional expected value of a random variable Y can be obtained from the conditional expectation of Y given X using the equation and the unconditional variance of Y is given by [20] Applying these two results on our response variable gives now also and so in the absence of random variation in the response probability, would have a binomial distribution, , and in this case when as required, then If, on the other hand, r is greater than 1, then a variation in the response probability occurs and the variance of will exceed , the variance under binomial sampling that leads to overdispersion. But if r is less than 1, then the variation in the response probability and the variance of will be less than , the variance under binomial sampling that leads to underdispersion. To use GSA to select the important covariates from the available set of covariates and construct an appropriate logistic regression model, it involves three steps.

(1) The first step is identification of the probability distribution of each covariate in the model. Usually sensitivity analysis starts from probability distribution functions (pdfs) given by the experts. This selection makes the use of the best information available of the statistical properties of the input factors. One of the methods used to obtain the pdfs starts with visualizing the observed data by examining its histogram to see if it is compatible with the shape of any distribution, as illustrated in Figure 2.

A visual approach is not always easy, accurate, or valid, especially if the sample size is small. Thus it would be better to have a more formal procedure for deciding which distribution is “best.” A number of significance tests are available for this such as the Kolmogorov-Smirnoff and chi-square tests. For more details, see [21].

(2) In the second step, the logistic regression model as in (1) and the information about the covariates obtained in step one are used to create a Monte Carlo simulation to generate the sample that will be used in the decomposition and to estimate the unconditional variance of response probability and the conditional variation for covariates as in (23) to (26).

(3) These results from step two will be used in performing GSA in the binary logistic regression model using (11), and in the result of decomposing as in (24) and (26), where the main effect indices are and the total effect indices are where are all X's but , and the coefficients of importance are These results and the two datasets are used to test and compare the performance of the proposed GSA method as a variable selection method to identify the important risk factors obtained from these datasets with the results obtained from other existing methods of selecting variables.

4. Numerical Comparisons

The purpose of this section is to compare the performance of the proposed method with existing ones. We also use a real data example to illustrate our SA approach as a variable selection method. In the first examples in this section, we used the dataset and the results of the penalized likelihood estimate of best subset (AIC), bust subset (BIC), SCAD, and LASSO that were computed by [7] as a way to compare the performance of the proposed method with these methods.

4.1. The First Example

In this example, Fan and Li [7] applied the proposed penalized likelihood methodology to burn data collected by the General Hospital Burn Center at the University of Southern California. The dataset consists of 981 observations. The binary response variable Y is 1 for those victims who survived their burns and 0 otherwise. Risk factors are   =  age,   =  sex,   =  log (burn area + 1), and binary variable   =  oxygen (0 normal, 1 abnormal) was considered. Quadratic terms of and , and all interaction terms were included. The intercept term was added, and the logistic regression model was fitted. The best subset variable selection with the AIC and the BIC was applied to this dataset. The unknown parameter was chosen by generalized cross-validation: it is 0.6932 and 0.0015, respectively, for the penalized likelihood estimates with the SCAD and LASSO. The constant a in the SCAD was taken as 3.7. With the selected , the penalized likelihood estimator was obtained at the sixth, 28th, and fifth step iterations for the penalized likelihood with the SCAD and LASSO. Table 1 contains the estimated coefficients and standard errors for the transformed data, based on the penalized likelihood estimators, and the calculation of the sensitivity indices obtained by using SimLab software to compare the performance of GSA as a variable selection method with other methods. The first five columns were calculated by [7].

In addition to GSA indices, Table 1 consists of the results of two traditional methods of variable selection (AIC and BIC) and two new methods (LASSO and SCAD). The traditional method, best subset procedure via minimizing the BIC scores, chooses five of 13 risk factors, whereas the SCAD chooses four risk factors. The difference between them is that the best subset keeps . Neither SCAD nor the best subset variable selection (BIC) includes and in the selected subset, but both LASSO and the best subset variable selection (AIC) included them. LASSO chooses the quadratic terms of and rather than their linear terms. It also selects an interaction term , which may not be statistically significant. LASSO shrinks noticeably large coefficients. The last column in Table 1 shows that GSA selected the variables , and , in addition to the intercept, which resembles the SCAD method, and differs from the other methods. According to the results in the last column of Table 1, the risk factors can be ranked according to sensitivity indices and . Age is the first and the most influential risk factor, with a percent of contribution of 0.487, and the second most important risk factor is the interaction between and , with a percentage of contribution of 0.362. The third influential risk factor is the log (area of burn + 1) with a percentage of contribution of 0.143 as shown in Table 1. Consequently, we find that the proposed GSA variable selection method resembles SCAD in choosing the same risk factors.

4.2. The Second Example

A new dataset emerges from the original dataset prepared in [22] as a way to compare SA and the traditional method (backward elimination) as variable selection methods. Originally this study was undertaken to determine the prevalence of CHD risk factors among a population-based sample of 403 rural African-Americans in Virginia. Community-based screening evaluations included the determination of exercise and smoking habits, blood pressure, height, weight, total and high-density lipoprotein (HDL) cholesterol, and glycosylated hemoglobin, and other factors. The results of this study were presented as percentages of prevalence for most factors such as diabetes (13.6% of men, 15.6% of women), hypertension (30.9% of men, 43.1% of women), and obesity (38.7% of men, 64.7% of women), without building any models to study the relationship between CHD and its risk factors. For more details, see [8]. A new dataset was generated based on the first one as a way to calculate SA indices to extract the important risk factors for CHD from among these new factors, and then implement the logistic regression model to test the performance of the proposed method as follows.

(1)CHD (Y) 10-year percentage risk is generated according to Framingham Point Scores. This risk is classified as 1 if the percentage of the risk is 20% and 0 otherwise [23].(2)Diabetes (debt, ): According to the criteria published by American College of Endocrinology (ACE) & American Association of Clinical Endocrinologists (AACE) [24] the participant has diabetes 1 if the Stabilized Glucose 140 mg/dL or Glycosylated Hemoglobin 7% or both of them more than these limits, and he has no diabetes 0 otherwise.(3)Total cholesterol (Chol, ): if a participant has total cholesterol of 200 mg/dL, he will be given a 1 and a 0 otherwise [25].(4)High density lipoprotein (HDL, ): a participant with HDL of 40 mg/dL will be given a 1 and a 0 otherwise [25].(5)Age : standardized values are used .(6)Gender (Gan, ): 1 is for a male and 2 for a female.(7)Body mass index (BMI, ): values for this standard are calculated from the following equation: BMI = height/(weight)2, and the participant gets 1 if BMI is 30 and a 0 otherwise [25].(8)Blood pressure (hypertension, Hyp, ): a participant has Hyp (1) if systolic blood pressure is 140 or if diastolic blood pressure is 90 or if both of them exceed these limits and 0 otherwise [25].(9)Waist/hip ratio , in addition to BMI, is a second factor in the determination of obesity. This dataset was used to perform SA through the use of SimLab software and the partitioning variance methodology discussed in Section 3. An evaluation of the efficiency of the proposed method was performed by fitting all factors into logistic regression models so as to obtain comparisons of factors chosen by the proposed method with those selected by traditional variable selection method (backward elimination). SPSS software was used to get the results that follow from fitting logistic regression models.

4.2.1. The Important Risk Factors

Implementation of the GSA method for this dataset gave the results in Table 2, which shows the ranking of the risk factors in order of importance and the contribution of each one to explaining the total variance of the CHD response variable.

According to the first order of sensitivity indices , the BMI is the first and the most influential factor, and the waist-hip ratio ranks second. Both are components of the obesity factor. Age is the third influential factor and so on through the other factors as listed in Table 2. The total sensitivity index for a given risk factor provides a measure of the overall contribution of that risk factor to the output variance, taking into account all possible interactions with the other risk factors. The difference between the total sensitivity index and the first-order sensitivity index for a given risk factor is a measure of the contribution to the output variance of the interactions involving that factor; see (12) and (13). The second column in Table 2 shows the values of , which gives the same rank as for the risk factors. These indices point to the simple interaction between these risk factors as illustrated in the third column in the same table. Figure 3 shows the compression between the first order , the total sensitivity indices, and the interactions between risk factors.

4.2.2. Implementing the Logistic Regression Model

Does the proposed method yield a reliable model? To investigate the reliability of the proposed method, we compared the results of the fitted models. Basically, when the full logistic regression model is fitted, the results are These results showed the significance of the overall fit of the model according to the values of and in spite of the low value of ; also showed that the individual effect for all risk factors is not significant, which means that cannot be rejected from the following null hypothesis: Second, application of the logistic regression model by using those risk factors that appear in Table 2 as highly ranked by the proposed method also shows that this method ranks each risk factor according to its contribution to the incidence of the CHD response variable. The question also becomes how many variables must be selected in order to apply the logistic regression model. The possibility exists that the selection procedure may tend to underfit or overfit the model by selecting too few or too many variables. In the face of such a possibility, our objective becomes to find the model that uses the least number of variables while simultaneously explaining a reasonable percentage of variance in the dependent variable relative to the percentage explained by all the variables in the full model. Thus two models may be fitted from Table 2 to compare the results. The first logistic regression model consisted of the obesity factors (BMI, and W/H ratio), age, and total cholesterol factors that explained 74% of the total variance of the CHD response variable according to the individual effect as in Table 2. The results of fitting this model in this manner and applying SPSS software were The results in (34) showed that using these criteria for the overall fit for this model demonstrated their significance collectively and individually as risk factors that influence the incidence of CHD and raise the value of to 71% in comparison with the full model in (31) The second logistic regression model is fitted by adding another risk factor, HDL, to increase the percentage of explanation to 87%. The results of fitting this model as in (36) are These results showed that adding the HDL risk factor does not improve the results of the first logistic regression model, but the parameter of this risk factor is not significant when we test the following hypothesis: Note that the difference between the deviances of the two models is minor. Furthermore, the value of does not improve. Thus, according to the principle of parsimony, the first model should be considered the best model and the risk factors used to construct this model are those that are the most influential in causing CHD. Moreover, showing the different results obtained from these two models demonstrates the differences between fitting the full model with all risk factors and fitting it with only selected risk factors.

The efficiency of the proposed method of variable selection (GSA) can be measured by comparing its results as in (34) with the results gained from fitting the logistic regression model by using the backward elimination method (BEM). These results are shown in Tables 3 and 4.

Table 3 shows the overall fitting criteria required for the last three steps of a logistic regression model fitted by the use of the BEM.

Also Table 4 shows the last three steps of iteration to choose the important risk factors. These results represent the sequential elimination of the factors, which requires eight steps to rank these risk factors according to their importance; however, the proposed method does not need these iterations.

5. Conclusions

The results in Tables 1 to 4 and (31) to (36) for the two examples confirm that the proposed method is capable of distinguishing between important and unimportant risk factors. The proposed method ranked the risk factors according to their decreasing importance as shown in Tables 1 and 2. In the example in which we compared the proposed method with those methods that are typically used, we found that its performance very much resembled the SCAD method in which the same risk factors are selected. From the first example, we found that the important risk factors are age, the area of the burns, and the interaction between them. In the second example we found that the obesity factors (BMI and W/H) are the most influential risk factor on the incidence of CHD, the second risk factor is age, and the third risk factor is the total cholesterol. These play the major roles, representing approximately 74% of the incidence of CHD. Thus, they are considered the most important risk factors according to their individual percentages of contribution in the incidence of CHD as shown in Table 1. Compression between the results of the fitting of the full logistic regression model as in (31) and the chosen models as in (34) and (36) confirm the efficiency of the proposed method in its selection of the most important risk factors. Equation (34) represents the best model, according to the model evaluation criteria, because it consists of the most influential risk factors. Therefore, a medical care plan and medical interventions should comply with this ordering of these factors. Also, to further confirm these results, one of the traditional variable selection methods was used (backward elimination method), which yields different results after eight steps, but the proposed method orders the risk factors without iteration and without the need to fit multiple regression models. Finally, these results together confirm and emphasize the importance of GSA as a variable selection method.

Acknowledgment

This work was supported by USM fellowship.