S. Razmyan, F. Hosseinzadeh Lotfi, "An Application of Monte-Carlo-Based Sensitivity Analysis on the Overlap in Discriminant Analysis", Journal of Applied Mathematics, vol. 2012, Article ID 315868, 14 pages, 2012. https://doi.org/10.1155/2012/315868
An Application of Monte-Carlo-Based Sensitivity Analysis on the Overlap in Discriminant Analysis
Abstract
Discriminant analysis (DA) estimates a discriminant function by minimizing group misclassifications and uses it to predict the group membership of newly sampled data. A major source of misclassification in DA is the overlap between groups. The uncertainty in the input variables and model parameters needs to be properly characterized in decision making. This study combines DEA-DA with a sensitivity analysis approach to assess the influence of banks’ variables on the overall variance of the overlap in a DA, in order to determine which variables are most significant. A Monte-Carlo-based sensitivity analysis is used to compute the set of first-order sensitivity indices of the variables and thereby estimate the contribution of each uncertain variable. The results show that, in decision making, the uncertainties in the loans granted and in the different deposit variables are more significant than the uncertainties in the other banks’ variables.
1. Introduction
The classification problem of assigning observations to one of several groups plays an important role in decision making. When observations are restricted to one of two groups, binary classification has wide applicability in business environments.
Discriminant analysis (DA) is a classification method that can identify the group membership of a new observation. A group of observations whose memberships have already been identified is used to estimate a discriminant function according to some criterion, such as the minimization of misclassification. A new sample is then classified into one of the groups based on the estimated function [1].
Mangasarian [2] showed that linear programming (LP) can be used to determine separating hyperplanes; that is, a linear discriminant function can be used when two sets of observations are linearly separable. For sets of observations that are not necessarily linearly separable, Freed and Glover [3] and Hand [4] proposed LP methods for generating a linear discriminant function, using objectives such as minimization of the sum of deviations (MSD) or maximization of the minimum deviation (MMD) of misclassified observations from the separating hyperplane.
Subsequent models were based on the goal programming (GP) extension of LP and used different criteria, such as minimizing the maximum deviation, maximizing the minimum deviation, minimizing the sum of interior deviations, minimizing the sum of deviations, minimizing misclassified observations, minimizing external deviations, maximizing internal deviations, maximizing the ratio of internal to external deviations, and hybrid models; each has both advantages and deficiencies [5–9].
In DA, LP and other mathematical programming (MP) based approaches are nonparametric and more flexible than statistical methods [6, 10]. Retzlaff-Roberts [11, 12] and Tofallis [13] proposed the use of a DEA-ratio model for DA. Sueyoshi [14], using a data envelopment analysis (DEA) additive model, described a goal programming formulation of DA in which the proposed model is more directly linked to minimizing the sum of deviations from the separating hyperplane; this method was named DEA-DA to distinguish it from other DA and DEA approaches. The original GP version of DEA-DA could not deal with negative data. Therefore, Sueyoshi [15] extended DEA-DA to overcome this deficiency. This approach was designed to minimize the total distance of misclassified observations and was formulated as a two-stage GP model. The number of misclassifications can, however, also be used as a measure of misclassification, with binary variables indicating whether observations are correctly or incorrectly classified. Bajgier and Hill [16] proposed a mixed integer programming (MIP) model that included the number of misclassifications in the objective function for the two-group discriminant problem. Gehrlein [17] and Wilson [18] introduced MIP approaches for minimizing the number of misclassified observations in multigroup problems. Chang and Kuo [19] proposed a procedure based on a benchmarking model of DEA to solve two-group problems. Sueyoshi [20] reformulated DEA-DA as an MIP to minimize the total number of misclassified observations. When an overlap between two groups is not a serious problem, dropping the first stage of the two-stage MIP approach simplifies the estimation process [21].
Sensitivity analysis provides an understanding of how the model outputs are affected by changes in the inputs; it can therefore help to increase confidence in the model and its predictions. Sensitivity analysis can be used to decide whether the input estimates are sufficiently precise to give reliable predictions, or to find model parameters that can be eliminated.
Two classes of sensitivity analysis have been distinguished [22].
Local Sensitivity Analysis
Studies how small variations of the inputs around a given value change the value of the output. This approach is practical when the input-output relationship can be assumed to be linear around a baseline.
Global Sensitivity Analysis
Takes into account the whole range of variation of the inputs and aims to apportion the output uncertainty to the uncertainty in the input factors, which are typically described by probability distribution functions that cover the factors’ ranges of existence. It thus quantifies the output uncertainty that is due to the uncertainty in the input parameters.
Local methods are less helpful when sensitivity analysis is used to compare the effects of various factors on the output, because in this case the relative uncertainty of each input should be weighted. A global sensitivity analysis technique thus incorporates the influence of the whole range of variation and the form of the probability density function of the input. Variance-based methods can be considered quantitative methods for global sensitivity analysis. In this study, the Sobol’ decomposition in the framework of Monte Carlo simulations (MCS) [22], which belongs to the family of quantitative methods for global sensitivity analysis, is applied to study the variability in DA due to the uncertainty in the variables. The results of the sensitivity analysis can determine which of the variables have a more dominant influence on the uncertainty in the model output.
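As a minimal sketch of why a local measure depends on the chosen baseline, the following Python fragment estimates local sensitivities by central finite differences. The function `toy_model`, the helper name `local_sensitivities`, and the step size are assumptions made for illustration only; they are not part of the study.

```python
def local_sensitivities(f, baseline, rel_step=1e-6):
    """Central finite-difference estimate of df/dx_i at the baseline point."""
    grads = []
    for i, x_i in enumerate(baseline):
        h = rel_step * max(abs(x_i), 1.0)
        up = list(baseline)
        down = list(baseline)
        up[i] = x_i + h
        down[i] = x_i - h
        grads.append((f(up) - f(down)) / (2.0 * h))
    return grads


def toy_model(x):
    # Hypothetical model: linear in x[0], quadratic in x[1].
    return 3.0 * x[0] + x[1] ** 2


# The same model evaluated at two different baselines.
grads_a = local_sensitivities(toy_model, [1.0, 0.0])
grads_b = local_sensitivities(toy_model, [1.0, 2.0])
print(grads_a, grads_b)
```

The sensitivity to the second input is close to zero at the first baseline and close to four at the second, which is the kind of baseline dependence that a global method averages over.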
This paper is organized as follows: Section 2 briefly introduces the DEA-DA model; Section 3 describes the sensitivity analysis based on a Monte-Carlo simulation; Section 4 contains an example; and the conclusion is provided in Section 5.
2. Data Envelopment Analysis-Discriminant Analysis (DEA-DA)
The two-stage MIP approach [20] is used in this study to describe DEA-DA. We consider two groups ($G_1$ and $G_2$) whose union contains $n$ observations ($j = 1, \dots, n$). Each observation has $k$ independent factors ($i = 1, \dots, k$), denoted by $z_{ij}$. It is necessary to identify the group membership of each observation before the computation. In the two-stage approach, the computation process consists of classification and overlap identification, followed by handling of the overlap. The first stage is formulated as follows [20].
Stage 1.
Here “” indicates a discriminant score for group classification and “” indicates the size of an overlap between two groups.
Let (=) and and be an optimal solution of the model (2.1). Then, the original data set () is classified into the following subsets , where
Then, we determine that observations in belong to and the observations of belong to because their location is identified from model (2.1). The two subsets and consist of the observations that have not yet been classified in the first stage.
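The outcome of the first stage can be sketched as a partition of the observations into those that are clearly classified and those that remain in the overlap. The thresholds used below (a score above `d + s` for the first group, below `d - s` for the second) and all names are assumptions made for illustration; they are not a transcription of model (2.1).

```python
def partition_by_overlap(scores, groups, d, s):
    """Split observation indices into clearly classified and unresolved sets.

    Assumptions: group 1 lies on the high-score side of the discriminant
    score d, and s >= 0 is the half-width of the overlap band.
    """
    clear_1, clear_2, open_1, open_2 = [], [], [], []
    for j, (score, group) in enumerate(zip(scores, groups)):
        if group == 1:
            (clear_1 if score > d + s else open_1).append(j)
        else:
            (clear_2 if score < d - s else open_2).append(j)
    return clear_1, clear_2, open_1, open_2


# Hypothetical discriminant scores and known group labels.
scores = [0.9, 0.55, 0.45, 0.1]
groups = [1, 1, 2, 2]
clear_1, clear_2, open_1, open_2 = partition_by_overlap(scores, groups, d=0.5, s=0.1)
print(clear_1, clear_2, open_1, open_2)
```

Only the observations in the two unresolved lists would need to be passed to a second stage.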
Stage 2. If , then the existence of an overlap is identified in the first stage. In this stage, we reclassify all of the observations belonging to the overlap because the group membership of these observations is still undetermined. The second stage is reformulated as follows [20]:
Here, the binary variable () counts the number of observations classified incorrectly. The objective function minimizes the number of such misclassifications. The weight () identifies the importance between and in terms of the number of observations. In the presented model (2.3), it is necessary to prescribe a large number () and a small number (). The equation indicates that some pairs avoid the occurrence of and .
After gaining an optimal solution on and , the second stage classifies observations in the overlap as follows: if , then the th observation belongs to , or if , then it belongs to . Thus, all of the observations in are classified into or at the end of the second stage.
3. Sensitivity Analysis Based on Monte-Carlo Simulation (MCS)
Sensitivity analysis was developed to deal with uncertainties in the input variables and model parameters [22]. The results of a sensitivity analysis can determine which of the input parameters have a more dominant influence on the uncertainty in the model output [23]. A variance-based sensitivity analysis, which addresses the inverse problem of attributing the output variance to uncertainty in the input, quantifies the contribution that each input factor makes to the variance in the output quantity of interest. A global sensitivity analysis of complex numerical models can be performed by calculating variance-based importance measures of the input variables, such as the Sobol’ indices. These indices are calculated by evaluating a multidimensional integral using a Monte-Carlo technique. This approach allows analyzing the influence of different variables and their subsets, the structure of the model function, and so forth.
It is assumed that a mathematical model having $n$ input parameters gathered in an input vector $X$ with a joint probability density function (pdf) $p(X)$ can be presented as a model function:
$$Y = f(X),$$
where $X = (x_1, \dots, x_n)$. Because the variables are affected by several kinds of heterogeneous uncertainties that reflect the imperfect knowledge of the system, it is assumed that the input variables are independent and that the probability density function $p(X)$ is known, even if the $x_i$ are not actually random variables.
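The Sobol’ sensitivity method explores the multidimensional space of the unknown input parameters with a certain number of MC samples. The sensitivity indices are generated by a decomposition of the model function $f$ in an $n$-dimensional factor space into summands of increasing dimensionality [22]:
$$f(x_1, \dots, x_n) = f_0 + \sum_{i=1}^{n} f_i(x_i) + \sum_{1 \le i < j \le n} f_{ij}(x_i, x_j) + \cdots + f_{1,2,\dots,n}(x_1, \dots, x_n),$$
where the constant $f_0$ is the mean value of the function, and the integral of each summand over any of its independent variables is zero. Due to this property, the summands are orthogonal to each other in the following form:
$$\int f_{i_1 \dots i_s}\, f_{j_1 \dots j_t}\, p(X)\, dX = 0 \quad \text{for } \{i_1, \dots, i_s\} \neq \{j_1, \dots, j_t\}.$$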
The sensitivity index, $S_i$, represents the fractional contribution of a given factor $x_i$ to the variance in a given output variable, $Y$. To calculate the sensitivity indices, the total variance, $V$, in the model output, $Y$, is apportioned to all of the input factors, $x_1, \dots, x_n$, as follows:
$$V = \int f^2(X)\, p(X)\, dX - f_0^2.$$
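By integrating the square of (2.2) and with (2.3), it is possible to decompose the total variance (3.1) as follows [24]:
$$V = \sum_{i=1}^{n} V_i + \sum_{1 \le i < j \le n} V_{ij} + \cdots + V_{1,2,\dots,n},$$
where $V_i = V(E(Y \mid x_i))$, $V_{ij} = V(E(Y \mid x_i, x_j)) - V_i - V_j$, and so on. $V(E(Y \mid x_i))$ is referred to as the variance of the conditional expectation and is the variance over all of the values of $x_i$ in the expectation of $Y$ given that $x_i$ has a fixed value. This is an intuitive measure of the sensitivity of $Y$ to a factor $x_i$, as it measures the amount by which $E(Y \mid x_i)$ varies with the value of $x_i$ whilst averaging over the $x_j$’s, $j \neq i$. Following the above definition for the partial variances, the sensitivity indices are defined as
$$S_i = \frac{V_i}{V}, \qquad S_{ij} = \frac{V_{ij}}{V}, \qquad \dots$$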
Higher-order indices can be calculated with a similar approach. With regard to (3.2), the decomposition of the sensitivity indices can be written in the following form:
$$\sum_{i=1}^{n} S_i + \sum_{1 \le i < j \le n} S_{ij} + \cdots + S_{1,2,\dots,n} = 1.$$
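As a small worked illustration (the model and symbols below are assumptions chosen for simplicity, not taken from the study), consider an additive model $Y = a_1 x_1 + a_2 x_2$ with independent inputs that share a common variance $\sigma^2$. The conditional expectation $E(Y \mid x_i)$ varies only through the term $a_i x_i$, so that

$$V_i = V(E(Y \mid x_i)) = a_i^2 \sigma^2, \qquad V = (a_1^2 + a_2^2)\,\sigma^2, \qquad S_i = \frac{a_i^2}{a_1^2 + a_2^2}.$$

For $a_1 = 1$ and $a_2 = 2$ this gives $S_1 = 0.2$ and $S_2 = 0.8$. Because the model has no interaction term, the second-order index is zero and the first-order indices alone sum to one.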
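The Sobol’ indices are usually computed with a MC simulation. The mean value and the total and partial variances can be estimated with $N$ samples as follows [22]:
$$\hat{f}_0 = \frac{1}{N} \sum_{k=1}^{N} f(x_k), \qquad \hat{V} = \frac{1}{N} \sum_{k=1}^{N} f^2(x_k) - \hat{f}_0^2,$$
$$\hat{V}_i = \frac{1}{N} \sum_{k=1}^{N} f\bigl(x^{(1)}_{(\sim i)k}, x^{(1)}_{ik}\bigr)\, f\bigl(x^{(2)}_{(\sim i)k}, x^{(1)}_{ik}\bigr) - \hat{f}_0^2.$$
In the latter equations, $x_k = (x_{1k}, \dots, x_{nk})$ is a sampled point in the $n$-dimensional input space, and
$$x_{(\sim i)k} = (x_{1k}, \dots, x_{(i-1)k}, x_{(i+1)k}, \dots, x_{nk}).$$
The superscripts (1) and (2) indicate that two different samples are generated and mixed.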
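The two-sample Monte Carlo estimation of first-order indices described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the inputs are taken to be independent and uniform on [0, 1), and the names `first_order_sobol` and `toy_model`, the sample size, and the seed are hypothetical choices rather than the settings used in the study.

```python
import random


def first_order_sobol(f, n_inputs, n_samples, rng):
    """Estimate first-order Sobol' indices from two independent samples."""
    # Two independent samples of the input vector (assumed uniform on [0, 1)).
    sample_1 = [[rng.random() for _ in range(n_inputs)] for _ in range(n_samples)]
    sample_2 = [[rng.random() for _ in range(n_inputs)] for _ in range(n_samples)]

    f_1 = [f(x) for x in sample_1]
    f0 = sum(f_1) / n_samples                                # mean value
    total_var = sum(y * y for y in f_1) / n_samples - f0 * f0

    indices = []
    for i in range(n_inputs):
        acc = 0.0
        for x1, x2, y1 in zip(sample_1, sample_2, f_1):
            mixed = list(x2)
            mixed[i] = x1[i]      # keep x_i from sample 1, the rest from sample 2
            acc += y1 * f(mixed)
        partial_var = acc / n_samples - f0 * f0
        indices.append(partial_var / total_var)
    return indices


def toy_model(x):
    # Hypothetical additive model; analytically S_1 = 0.2 and S_2 = 0.8.
    return x[0] + 2.0 * x[1]


rng = random.Random(0)
s1, s2 = first_order_sobol(toy_model, n_inputs=2, n_samples=50000, rng=rng)
print(round(s1, 2), round(s2, 2))
```

The estimates fluctuate with the sample size and seed, so they should be read as approximations of the analytical values rather than exact results.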
4. Illustrative Examples
Classification methods are widely used in economics and finance. They are useful for classifying sectors into different groups based on their performance and for predicting the group memberships of new firms. Most researchers have used classification methods to classify firms based on their performance assessment. DA is the classification method used in this study. The purpose of the first stage in DEA-DA is to determine whether there is an overlap between the two groups. The existence of an overlap is the main source of misclassification in DA. By identifying the overlap between two groups, it is possible to increase the number of observations classified correctly. If there is no overlap, any DA method may produce an almost perfect classification. However, if there is an overlap, an additional computation process is needed to deal with it [20]. So, there is a tradeoff between computational effort/time and a high level of classification capability.
Misclassification can result from an intersection between two groups. Many researchers have proposed approaches that try to reveal the advantage of identifying the minimized overlap of two groups for risk management in the classification problem [19, 20, 25, 26]. Given the importance of the banking sector for the whole economy in general and for the financial system in particular, in this section we present an application of the sensitivity analysis to overlaps, , on data from a commercial bank of Iran. This is illustrated numerically in two different examples, for bank branches that have more than 20 and more than 30 personnel.
If we wish to take into account the inherent randomness in what the criteria might experience, we have to bring stochastic characterization into play. The stochastic efficiency assessment of banking branches normally requires performing a set of analyses on decision-making units (DMUs) with a suite of variables as criteria.
First, we use the additive model to discriminate between banking branches. Most models need to examine both a DEA efficiency score and slacks, or an efficiency score measured by them, which depends upon input-based or output-based measurement. The additive model [27] aggregates input-oriented and output-oriented measures to produce the efficiency score. Consequently, the efficiency status is more easily determined by the additive model than by the radial model.
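For orientation, a commonly used form of the additive DEA model for an evaluated unit $o$ is sketched below. The symbols are assumptions introduced here: $x_{ij}$ and $y_{rj}$ are the $i$th input and $r$th output of unit $j$, $\lambda_j$ are intensity weights, and $s_i^-$ and $s_r^+$ are the input and output slacks; the notation may differ from that of [27].

$$\max \; \sum_{i=1}^{m} s_i^- + \sum_{r=1}^{q} s_r^+$$
$$\text{s.t.} \quad \sum_{j=1}^{n} \lambda_j x_{ij} + s_i^- = x_{io}, \quad i = 1, \dots, m,$$
$$\sum_{j=1}^{n} \lambda_j y_{rj} - s_r^+ = y_{ro}, \quad r = 1, \dots, q,$$
$$\sum_{j=1}^{n} \lambda_j = 1, \qquad \lambda_j,\; s_i^-,\; s_r^+ \ge 0.$$

Under this form, the evaluated unit is rated efficient exactly when the optimal objective value is zero, that is, when all slacks vanish, which is why the efficiency status can be read directly from the solution.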
In the two examples, the real data sets consist of 78 and 18 banking branches. This study selects 31 and 8 branches as inefficient branches, and 47 and 10 branches as efficient branches, in Examples 1 and 2, respectively, as documented in Tables 1 and 2. To determine the classifications based on the additive model, three variables (personnel, payable interest, and non-performing loans) are considered as inputs. Nine variables of loans granted, long-term deposit, current deposit, non-benefit deposit, short-term deposit, received interest, and received fee are assumed as outputs.
