Research Article  Open Access
Soulef Smaoui, Habib Chabchoub, Belaid Aouni, "Mathematical Programming Approaches to Classification Problems", Advances in Operations Research, vol. 2009, Article ID 252989, 34 pages, 2009. https://doi.org/10.1155/2009/252989
Mathematical Programming Approaches to Classification Problems
Abstract
Discriminant Analysis (DA) is widely applied in many fields. Recent research has pointed out that standard DA assumptions, such as a normal distribution of the data and equality of the variance-covariance matrices, are not always satisfied. The Mathematical Programming (MP) approach has been frequently used in DA and can be considered a valuable alternative to the classical DA models. The MP approach provides more flexibility in the analysis. The aim of this paper is to present a comparative study in which we analyze the performance of three statistical methods and several MP methods using linear and nonlinear discriminant functions in two-group classification problems. New classification procedures are adapted to the context of nonlinear discriminant functions. Different applications are used to compare these methods, including the Support Vector Machines (SVMs) based approach. The findings of this study will be useful in assisting decision-makers to choose the most appropriate model for their decision-making situation.
1. Introduction
Discriminant Analysis (DA) is widely applied in many fields such as social sciences, finance, and marketing. The purpose of DA is to study the difference between two or more mutually exclusive groups and to classify a new observation into the appropriate group. The most popular methods used in DA are statistical. The pioneer of these methods is Fisher [1], who proposed a parametric method introducing linear discriminant functions for two-group classification problems. Somewhat later, Smith [2] introduced a quadratic discriminant function, which, along with other discriminant analyses such as logit and probit, has received a good deal of attention over the past several decades. Recent research has pointed out that the standard assumptions of DA, such as the normality of the data distribution and the equality of the variance-covariance matrices, are not always satisfied. The MP approach has also been widely used in DA and can be considered a valuable alternative to the classical DA models. The aim of these MP models is either to minimize the violations (the distance between the misclassified observations and the cutoff value) or to minimize the number of misclassified observations. They require no assumptions about the population’s distribution and provide more flexibility for the analysis by introducing new constraints, such as normalization constraints, or by including weighted deviations in the objective function, with higher weights for the deviations of misclassified observations and lower weights for those of correctly classified observations. However, special difficulties and even anomalous results restrict the performance of these MP models [3]. These difficulties may be classified under the headings of “degeneracy” and “stability” [4, 5]. A solution is degenerate if the analysis yields an unbounded solution, in which the objective function can be improved without limit.
Similarly, the results are unstable if, for example, they depend on the position of the data relative to the origin. A solution is deemed unacceptable when all of the coefficients of the discriminant function are equal to zero, which leads to all of the observations being classified in the same group [6, 7]. To overcome these problems, different normalization constraints have been identified and variants of MP formulations for classification problems have been proposed [4, 8–11].
For any given discriminant problem, the choice of an appropriate method for analyzing the data is not always an easy task. Several comparative studies using both statistical and MP approaches have been performed on real data [12–14], and most of them use linear discriminant functions. Recently, new MP formulations have been developed based on nonlinear functions, which may produce better classification performance than a linear classifier. Nonlinear discriminant functions can be generated from MP methods by transforming the variables [15], by forming dichotomous categorical variables from the original variables [16], by using piecewise-linear functions [17] or kernel transformations that attempt to render the data linearly separable, or by using multihyperplanes formulations [18].
The aim of this paper is, thus, to conduct a comparative study in which we analyze the performance of three statistical methods, namely the Linear Discriminant Function method (LDF), the Logistic function (LG), and the Quadratic Discriminant Function (QDF), along with five MP methods based on linear discriminant functions: the MSD model, the Ragsdale and Stam [19] (RS) model, the model of Lam et al. [12] (LPM), the Lam and Moy [10] combined model (MC), and the MCA model [20]. These methods will be compared to the second-order MSD model [15], the popular SVM-based approach, the piecewise-linear models, and the multihyperplanes models. New classification procedures adapted to the latter models and based on nonlinear discriminant functions will be proposed. Different applications in the financial and medical domains are used to compare the different models. We will examine the conditions under which these various approaches give similar or different results.
In this paper, we report on the results of the different approaches cited above. The paper is organized as follows: first, we discuss the standard MP discriminant models, followed by a presentation of MP discriminant models based on nonlinear functions. Then, we develop new classification models based on piecewise-nonlinear functions and hypersurfaces. Next, we present the datasets used in the analysis. Finally, we compare the performance of the classical and the different MP models, including the SVM-based approach, and draw our conclusions.
2. The MP Methods
In general, DA is applied to two or more groups. In this paper, we discuss the case of discrimination between two groups. Consider a classification problem with $k$ attributes. Let $X_r$ be an $(n_r \times k)$ matrix representing the attribute scores of a known sample of $n_r$ objects from the group $G_r$ $(r = 1, 2)$. Hence, $x_{ij}$ is the value of the $j$th attribute for the $i$th object, $w_j$ is the weight assigned to the $j$th attribute in the linear combination which identifies the hyperplane, and $\Sigma_r$ is the variance-covariance matrix of group $r$.
2.1. The Linear MP Models for Classification Problem
In this section, we present seven MP formulations for the classification problem. These formulations assume that all group $G_1$ ($G_2$) cases are below (above) the cutoff score $c$. This score defines the hyperplane which allows the two groups to be separated as follows: $\sum_{j} w_j x_{ij} = c$ ($w_j$ and $c$ are free), where $x_{ij}$ is the value of attribute $j$ for observation $i$, $w_j$ is the weight of attribute $j$, and $c$ is the cutoff value or threshold.
The MP models provide unbounded, unacceptable solutions and are not invariant to a shift of origin. To remove these weaknesses, different normalization constraints are proposed: (N1) ; (N2) [4]; (N3) the normalization constant 1, that is, = 1, by defining binary variables and such as with and (N4) the normalization for invariance under origin shift [11]. In the normalization (N4), the free variables are represented in terms of two nonnegative variables ( and ) such as and constraining the absolute values of the to sum to a constant as follows: By using the normalization (N4), two binary variables and will be introduced in the models in order to exclude the occurrence of both and [11]. The definition of and requires the following constraints: The classification rule will assign the observation into group G_{1} if and into group otherwise.
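As a hedged illustration of the (N4) construction described above, the constraints can be written as follows. The symbols $w_j^{+}$, $w_j^{-}$ (the two nonnegative parts of a weight), $\delta_j^{+}$, $\delta_j^{-}$ (the two binary variables), and the small positive constant $\varepsilon$ are notation assumed here, and the normalizing constant is taken to be 1. The block follows the verbal description rather than a confirmed transcription of the formulation in [11].

```latex
\begin{align*}
w_j &= w_j^{+} - w_j^{-}, \qquad w_j^{+},\, w_j^{-} \ge 0, \qquad j = 1, \dots, k,\\
\sum_{j=1}^{k} \left( w_j^{+} + w_j^{-} \right) &= 1,\\
\varepsilon\, \delta_j^{+} \le w_j^{+} &\le \delta_j^{+}, \qquad
\varepsilon\, \delta_j^{-} \le w_j^{-} \le \delta_j^{-},\\
\delta_j^{+} + \delta_j^{-} &\le 1, \qquad \delta_j^{+},\, \delta_j^{-} \in \{0, 1\}.
\end{align*}
```

Because at most one of $\delta_j^{+}$ and $\delta_j^{-}$ can equal 1, at most one of $w_j^{+}$ and $w_j^{-}$ is positive, so $w_j^{+} + w_j^{-} = |w_j|$ and the second line fixes the sum of the absolute weights.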
2.1.1. MSD Model (Minimize the Sum of Deviations)
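The problem can be expressed as follows: $\min \sum_{i} d_i$ subject to $\sum_{j} w_j x_{ij} \le c + d_i$ for all $i \in G_1$ and $\sum_{j} w_j x_{ij} \ge c - d_i$ for all $i \in G_2$ ($w_j$ and $c$ are free and $d_i \ge 0$ for all $i$), where $d_i$ is the external deviation from the hyperplane for observation $i$.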
The objective takes zero value if the two groups can be separated by the hyperplane. It is necessary to introduce one of the normalization constraints cited above to avoid unacceptable solutions that assign zero weights to all discriminant coefficients.
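A minimal sketch of how the MSD objective can be evaluated for a candidate hyperplane, assuming that group 1 should score at or below the cutoff and group 2 at or above it. The function names and the toy data are illustrative assumptions. The sketch does not solve the LP; it only computes the external deviations that the LP would minimize.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def score(w: Vector, x: Vector) -> float:
    """Linear classification score: sum over attributes of w_j * x_j."""
    return sum(wj * xj for wj, xj in zip(w, x))


def msd_deviations(
    group1: Sequence[Vector], group2: Sequence[Vector], w: Vector, c: float
) -> Tuple[List[float], List[float]]:
    """External deviations of each observation from the hyperplane.

    Assumption: group 1 is on the correct side when score <= c,
    group 2 when score >= c. A correctly placed observation has deviation 0.
    """
    d1 = [max(0.0, score(w, x) - c) for x in group1]
    d2 = [max(0.0, c - score(w, x)) for x in group2]
    return d1, d2


def msd_objective(
    group1: Sequence[Vector], group2: Sequence[Vector], w: Vector, c: float
) -> float:
    """Sum of external deviations, the quantity the MSD model minimizes."""
    d1, d2 = msd_deviations(group1, group2, w, c)
    return sum(d1) + sum(d2)


# Hypothetical two-attribute data, not taken from the paper's datasets.
toy_group1 = [(1.0, 0.0), (2.0, 1.0)]
toy_group2 = [(4.0, 3.0), (5.0, 5.0)]
toy_weights = (1.0, 1.0)

separating = msd_objective(toy_group1, toy_group2, toy_weights, 5.0)
non_separating = msd_objective(toy_group1, toy_group2, toy_weights, 2.0)
```

With the cutoff at 5.0 the hyperplane separates the toy groups and the objective is zero; with the cutoff at 2.0 one group 1 observation lies on the wrong side and contributes its distance to the objective.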
2.1.2. Ragsdale and Stam Two-Stage Model (RS) [19]
subject to ( are free and for all ), where and are two predetermined constants with . The values chosen by Ragsdale and Stam are and . Two methods were proposed to determine the cutoff value. The first method is to choose the cutoff value equal to . The second method requires the resolution of another LP problem which minimizes only the observation deviations whose classification scores lie between and . The observations that have classification scores below or above are assumed to be correctly classified. The advantage of this latter method is to exclude any observation with classification scores on the wrong side of the hyperplane. However, for simplicity, we use the first method in our empirical study. Moreover, we will solve the model by considering and decision variables by adding the constraints:
2.1.3. Lam et al. Method [9, 12]
This model abbreviated as LPC is defined by the following.
subject to(are free), where and are, respectively, the external and the internal deviations from the discriminant axis to observation in group 1.
and are, respectively, the internal and the external deviations from the discriminant axis to observation in group 2.
and are defined as decision variables and can have different significant definitions.
A particular case of this model is that of Lee and Ord [21] which is based on minimal absolute deviations with and .
A new formulation of LPC is to choose () as the mean group of the classification scores of the group , as follows (LPM): subject to
with as the mean of all through the group and is the number of observations in the group .
The objective of the LPM model is to minimize the total deviations of the classification scores from their group mean scores in order to obtain the attribute weights which are considered to be more stable than those of the other LP approaches. The weighting obtained from the resolution of LPM will be utilized to compute the classification scores of all the objects. Lam et al. [12] have proposed two formulations to determine the cutoff value . One of these formulations consists of minimizing the sum of the deviations from the cutoff value (LP2).
The linear programming model LP2 is as follows:
subject to (is free, and ).
2.1.4. Combined Method [10]
This method combines several discriminant methods to predict the classification of new observations. It is divided into two stages. The first stage consists of choosing several discriminant models. Each method is then applied independently and provides a classification score for each observation. The group having the higher group-mean classification score is denoted as and the one having the lower group-mean classification score is denoted as . The second stage consists of calculating the partial weights of the observations using the scores obtained in the first stage. For group , the partial weight of the th observation obtained from solving the method ( where is the number of methods utilized) is calculated as the difference between the observation’s classification score and the group-minimum classification score, divided by the difference between the maximum and the minimum classification scores: The largest partial weight is equal to 1 and the smallest partial weight is equal to zero.
The same calculations are used for each observation of the group , but in this case, the partial weight of each observation is equal to one minus the obtained value in the calculation. Thus, in this group, the observations with the smallest classification scores are the observations with the greatest likelihood of belonging to this group:
The same procedure is repeated for all the discrimination methods used in the combined method. The final combined weight is the sum of all the partial weights obtained. The final combined weights of all observations are used as the weighting for the objective function of the LP model in the second stage. A larger combined weight for one observation indicates that there is little chance that this observation has been misclassified. For each combined weight, the authors add a small positive constant in order to ensure that all the observations are entered in the classification model, even for those observations with the smallest partial weights obtained by all the discriminant methods.
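The weighting scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions: the minimum and maximum are taken within each group, a group whose scores are all equal receives partial weight 1.0, and the small positive constant is set to 0.01. The source specifies none of these three choices, and the function names are hypothetical.

```python
from typing import List, Sequence


def partial_weights(scores: Sequence[float], higher_mean_group: bool) -> List[float]:
    """Partial weights of one group's observations for one discriminant method.

    For the group with the higher group-mean score, the weight is the
    min-max scaled score. For the other group it is one minus that value,
    so low scores receive the largest weights.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: a degenerate group (all scores equal) gets weight 1.0.
        return [1.0 for _ in scores]
    scaled = [(s - lo) / (hi - lo) for s in scores]
    return scaled if higher_mean_group else [1.0 - v for v in scaled]


def combined_weights(
    scores_by_method: Sequence[Sequence[float]],
    higher_mean_group: bool,
    eps: float = 0.01,  # assumed value of the small positive constant
) -> List[float]:
    """Sum the partial weights over all methods and add the constant eps."""
    per_method = [partial_weights(s, higher_mean_group) for s in scores_by_method]
    return [sum(ws) + eps for ws in zip(*per_method)]


# Hypothetical scores of three observations under two methods.
example = combined_weights([[2.0, 4.0, 6.0], [10.0, 20.0, 30.0]], higher_mean_group=True)
```

Adding `eps` keeps the observation with the smallest partial weights under every method in the weighted objective, which is the purpose the text gives for the constant.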
The LP formulation which combines the results of different discriminant methods is the following weighting MSD (WMSD) model:
subject to (andare freefor all and for all ). The advantage of this model is its ability to weight the observations. Other formulations are also possible, for example, the weighting RS model (WRS).
In our empirical study, the three methods LDF, MSD, and LPM are combined in order to form the combined method MC1. Methods LDF, RS, and LPM are combined in order to form the combined method MC2. Other combined methods are also possible.
2.1.5. The MCA Model [22]
subject to (are free, ), with if the observation is classified correctly, , is very small, and is large. The model must be normalized to prevent trivial solutions.
2.1.6. The MIP EDEA-DA Model (MIP EDEA-DA) [23]
Two stages characterize this model:
First stage (classification and identification of misclassified observations) is to
subject to with with and being the optimal solution of the model (2.15). There are two cases.
(i) If , then there are no misclassified observations and all the observations are classified in either group 1 or group 2 by . We stop the procedure at this stage.
(ii) If , then there are misclassified observations and stage 2 follows, after classifying the observations in their appropriate sets (, ).
The classification rule is
then the appropriate group of observation is determined by the second stage.
Second stage (classification) is to
subject towhere, with
The classification rule is
The advantage of this model is that it minimizes the number of misclassified observations. However, the performance of the model depends on the choice of the numbers M and , which are subjectively determined by the researcher, and also on the choice of the software used to solve the model.
2.2. The Nonlinear MP Models
2.2.1. The Second-Order MSD Formulation [15]
The form of the second-order MSD model is to subject to where (are free, ), is the coefficient for the linear terms of attribute j, are the coefficients for the quadratic terms of attribute j, are the coefficients for the cross-product terms involving attributes h and m, are the external deviations of group r observations, and is the cutoff value.
Constraint (2.17c) is the normalization constraint, which prevents the trivial solution. Other normalization constraints are also possible [4, 11]. It is interesting to note that the cross-product terms can be eliminated from the model when the attributes are uncorrelated [15].
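A minimal sketch of the variable transformation behind a second-order model: each observation is expanded into its linear, quadratic, and (optionally) cross-product terms, and the discriminant score stays linear in the coefficients. The function names and the ordering of the terms are assumptions made for illustration.

```python
from itertools import combinations
from typing import List, Sequence


def second_order_terms(x: Sequence[float], include_cross: bool = True) -> List[float]:
    """Expand one observation into linear, quadratic, and cross-product terms.

    Order of the result: x_1..x_k, then x_1^2..x_k^2, then x_h * x_m for h < m.
    Set include_cross=False when the attributes are treated as uncorrelated.
    """
    linear = list(x)
    quadratic = [v * v for v in x]
    cross = (
        [x[h] * x[m] for h, m in combinations(range(len(x)), 2)]
        if include_cross
        else []
    )
    return linear + quadratic + cross


def second_order_score(
    x: Sequence[float], coefficients: Sequence[float], include_cross: bool = True
) -> float:
    """Score of an observation: a linear combination of the expanded terms."""
    terms = second_order_terms(x, include_cross)
    if len(terms) != len(coefficients):
        raise ValueError("one coefficient is needed per expanded term")
    return sum(a * t for a, t in zip(coefficients, terms))


expanded = second_order_terms((2.0, 3.0))
```

For k attributes the full expansion has 2k + k(k - 1)/2 terms, so dropping the cross-products reduces the number of coefficients the MP model has to estimate from quadratic to linear in k.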
In order to reduce the influence of the group size and give more importance to each group deviation costs, we propose the replacement of the objective function (2.17) by the following function ( ): with a constant representing the relative importance of the cost associated with misclassification of the first and the second groups.
The classification rule is
2.2.2. The Piecewise-Linear Models [17]
Recently, two piecewise-linear models were developed by Glen [17]: the MCA and MSD piecewise models. These methods assume that the discriminant function is nonlinear and approximate this nonlinearity by piecewise-linear functions. The concept is illustrated in Figure 1.
In Figure 1, the piecewise-linear functions are ACB’ and BCA’, while the component linear functions are represented by the lines AA’ and BB’. Note that for the piecewise-linear function ACB’ (respectively, BCA’), the region of correctly classified group 2 (group 1) observations is convex, whereas the region of correctly classified group 1 (group 2) observations is nonconvex. The optimal piecewise-linear discriminant function is obtained by considering these two cases separately. The MP must therefore be solved twice: once constraining all group 1 elements to a convex region and once constraining all group 2 elements to a convex region. Only the second case is considered in developing the following MP models.
(a) The Piecewise-Linear MCA Model [17]
The MCA model for generating a piecewise-linear function in s segments is:
subject towhere and with being a small interval, within which the observations are considered as misclassified, and is a positive large number,
if the observation is correctly classified,
(), if the group 1 observation is correctly classified by function on its own.
The correctly classified group 2 observations are identified by the s constraints of type (2.19b). An observation of group 1 is correctly classified only if it is correctly classified by at least one of the s segments of the piecewise-linear function (constraint (2.19c)).
The classification rule of an observation is
A similar model must also be constructed for the case in which the nonconvex region is associated with group 2 and the convex region is associated with group 1.
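A minimal sketch of the classification logic described above for the case in which group 2 occupies the convex region: an observation is assigned to group 2 only if every component function places it on the group 2 side, and to group 1 as soon as one component places it on the group 1 side. The representation of a component as a (weights, cutoff) pair, the strict inequality, and the omission of the small interval around the cutoff are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Component = Tuple[Sequence[float], float]  # (weights, cutoff) of one linear segment


def component_score(weights: Sequence[float], x: Sequence[float]) -> float:
    """Linear score of observation x under one component function."""
    return sum(w * v for w, v in zip(weights, x))


def classify_piecewise(x: Sequence[float], components: List[Component]) -> int:
    """Return 2 if all components put x above their cutoff, otherwise 1.

    Assumption: group 2 lies above the cutoff of every component, so its
    region is the intersection of s half-spaces and is therefore convex.
    """
    in_group2_region = all(
        component_score(weights, x) > cutoff for weights, cutoff in components
    )
    return 2 if in_group2_region else 1


# Two hypothetical components: x_1 > 0 and x_2 > 0 define the group 2 region.
toy_components = [((1.0, 0.0), 0.0), ((0.0, 1.0), 0.0)]
inside = classify_piecewise((1.0, 1.0), toy_components)
outside = classify_piecewise((1.0, -1.0), toy_components)
```

The mirrored model, with group 1 in the convex region, would swap the roles of the two groups in this rule.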
(b) The Piecewise-Linear MSD Model [17]:
subject towhere is free, and with being a small interval and being an upper bound on .
is the deviation of group 1 observation from component function of the piecewise-linear function, where if the observation is correctly classified by function on its own and if the observation is misclassified by function on its own.
is the deviation of group 2 observation from component function of the piecewise-linear function, where if the observation is correctly classified by function on its own and if the observation is misclassified by function on its own. A group 2 observation is correctly classified if it is classified by each of the s component functions.
is the lower bound on the deviation of group 2 observation from the segment piecewise-linear discriminant function, where if the observation is correctly classified and if the observation is misclassified.
The binary variable is introduced in the model to determine by detecting the minimal deviation .
The classification rule is the same as that of the piecewise-linear MCA model.
The piecewise-linear MCA and MSD models must both be solved twice, once with all group 1 observations in the convex region and once with all group 2 observations in the convex region, in order to obtain the best classification. Other models have been developed by Better et al. [18]. These models are more effective than the piecewise-linear models on more complex datasets and do not require that one of the groups belong to a convex region.
2.2.3. The Multihyperplanes Models [18]
The multihyperplanes models can be interpreted as models identifying several hyperplanes that are used successively. The objective is to generate a tree of conditional rules that separates the points. This approach constitutes an innovation in the area of Support Vector Machines (SVMs), in the context of the successive perfect separation decision tree. Its advantage is that it constructs a nonlinear discriminant function without the need for a kernel transformation of the data, as in SVM. The first model using multihyperplanes is the Successive Perfect Separation decision tree (SPS).
(a) The Successive Perfect Separation Decision Tree (SPS)
The specific structure is developed in the context of the SPS decision tree, which is the tree that results from applying the SPS procedure. This procedure compels, at each depth , all the observations of either group 1 or group 2 to lie on one side of the hyperplane. Thus, at each depth the tree has one leaf node that terminates the branch that correctly classifies observations in a given group. In Figure 2, the points represented as circles and triangles must be separated. The PQ, QR, and RS segments of the three hyperplanes separate all the points. We can remark that the circles are correctly classified either by or by and . However, the triangles are correctly classified by the tree if they are correctly classified by H_{1} and H_{2} or by H_{1} and H_{3}. Several tree types are possible. Specific binary variables called “slicing variables” are used to describe the specific structure of the tree. These variables define how the tree is sliced in order to classify an observation correctly.
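One three-hyperplane tree of this kind can be read as a nested rule, sketched below. In this illustration one group is accepted by the first hyperplane alone or by the second and third jointly, so the other group needs the first hyperplane together with the second or the third. The labels 1 and 2, the convention that the first group lies at or below each cutoff, and the (weights, cutoff) representation are assumptions, not details confirmed by the source.

```python
from typing import Sequence, Tuple

Hyperplane = Tuple[Sequence[float], float]  # (weights, cutoff)


def says_group1(h: Hyperplane, x: Sequence[float]) -> bool:
    """True if hyperplane h places x on the side assumed to belong to group 1."""
    weights, cutoff = h
    return sum(w * v for w, v in zip(weights, x)) <= cutoff


def classify_tree(
    x: Sequence[float], h1: Hyperplane, h2: Hyperplane, h3: Hyperplane
) -> int:
    """Classify x with one specific three-hyperplane tree structure.

    Group 1: accepted by h1 alone, or by both h2 and h3.
    Group 2: everything else, i.e. rejected by h1 and by h2 or h3.
    """
    if says_group1(h1, x):
        return 1
    if says_group1(h2, x) and says_group1(h3, x):
        return 1
    return 2


# Hypothetical hyperplanes in two dimensions.
H1 = ((1.0, 0.0), 0.0)  # group 1 side: x_1 <= 0
H2 = ((0.0, 1.0), 0.0)  # group 1 side: x_2 <= 0
H3 = ((1.0, 1.0), 1.0)  # group 1 side: x_1 + x_2 <= 1

label = classify_tree((0.5, -1.0), H1, H2, H3)
```

Other tree types correspond to other nestings of the same tests, which is what the slicing variables select inside the MP model.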
The specific structure SPS decision tree model is formulated as follows:
subject tonoting that and are free where is large, while is very small constant. Consider the following:
The (2.22c) and (2.22d) constraints represent the type of tree and are activated when . Similarly, the (2.22e) and (2.22f) constraints for tree type will only be activated when . However, for the tree types and corresponding to (2.22g)–(2.22n) constraints, a binary variable is introduced in order to activate or deactivate either of the constraints relevant to these trees. When , the (2.22g)–(2.22j) constraints for tree type will be activated, so that an observation from group 1 is correctly classified by the tree if it is correctly classified either by the first hyperplane or by both the second and the third hyperplanes. On the other hand, an observation from group 2 is correctly classified by the tree if it is correctly classified either by hyperplanes 1 and 2 or by hyperplanes 1 and 3.
This classification is established in the case where , which activates constraints (2.22g) and (2.22h). The case that corresponds to tree type is just a “mirror image” of the previous case. However, the model becomes difficult to solve when the number of possible tree types increases (D large): as D increases, the number of possible tree types increases and so does the number of constraints. For these reasons, Better et al. [18] developed the following model.
(b) The General Structure SPS Model (GSPS)
subject towhere and are free The variables and are not included in the final hyperplane (D). The variable is defined as
Constraints (2.24c) and (2.24d) force all group 1 or all group 2 observations to lie on one side of the hyperplane, according to the value of . Due to constraint (2.24c), if , all group 1 observations and possibly some group 2 observations lie on one side of hyperplane d. Only group 2 observations then lie on the other side of hyperplane d, so these observations can be correctly classified. Conversely, due to constraint (2.24d), if , the observations correctly classified by the tree will be those belonging to group 1. The variables make it possible to identify the correctly classified and misclassified observations of each group from the permanent value . In the case where , the permanent values to establish are those of group 1 observations such that , because these particular observations are separated in such a manner that we do not need to consider them again. Thus, for these observations, the fact that and forces the to equal 1. If we consider the case to force for group 1 observations, it means that these observations have not yet been permanently separated from group 2 observations and one or more hyperplanes are necessary to separate them. Thus, if or (verified by the constraints (2.24e) and (2.24g)).
For the empirical study, the SPS and GSPS models are solved using the following two normalization constraints:
The models presented above are based either on piecewise-linear separation or on multihyperplanes separation. New models based on piecewise-nonlinear separation and on multihypersurfaces are proposed in the next section.
3. The Proposed Models
In this section, different models are proposed. Some use piecewise-nonlinear separation and the others use multihypersurfaces.
3.1. The Piecewise-Nonlinear Models (Quadratic Separation)
The piecewise-linear MCA and MSD models are based on piecewise-linear functions. To improve the performance of these models, we propose two models based on piecewise-nonlinear functions. The basic concept of these models is illustrated in Figure 3.
The curves AA’ and BB’ are the component functions of the piecewise-nonlinear functions ACB’ and BCA’. The interpretations are the same as those in Figure 1. However, the use of piecewise-nonlinear functions makes it possible to reduce the number of misclassified observations. Based on this idea, we propose models based on piecewise-nonlinear functions. In these models, we replace the first constraints of the piecewise-linear MCA and MSD models by constraints that are linear in the coefficients but nonlinear in terms of the attributes, as follows: where , , are unrestricted in sign, are the linear terms of attribute for the function , are the quadratic terms of attribute for the function , are the cross-product terms of attributes and for the function .
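A hedged LaTeX sketch of the left-hand side that such a constraint would use for component function $l$. The symbols are assumptions introduced here: $a_{ljL}$, $a_{ljQ}$, and $a_{lhm}$ for the linear, quadratic, and cross-product coefficients, $x_{ij}$ for the value of attribute $j$ on observation $i$, and $k$ for the number of attributes. The cross-products are indexed over $h < m$ for compactness, which may differ from the indexing used in the original constraints.

```latex
f_l(x_i) \;=\; \sum_{j=1}^{k} a_{ljL}\, x_{ij}
        \;+\; \sum_{j=1}^{k} a_{ljQ}\, x_{ij}^{2}
        \;+\; \sum_{h=1}^{k-1} \sum_{m=h+1}^{k} a_{lhm}\, x_{ih}\, x_{im}
```

The expression is quadratic in the attributes but linear in the unknown coefficients, so replacing the linear score by $f_l(x_i)$ leaves the structure of the MP model unchanged and only adds decision variables.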
Note that if the attributes are uncorrelated, the cross-product terms can be excluded from the models. Other general nonlinear terms can also be included in the models. In addition, the normalization constraint is replaced by the following constraint: The piecewise-quadratic separation models obtained are the following.
3.1.1. The Piecewise-Quadratic Separation MCA Model (QSMCA)
subject to where ,, are unrestricted in sign( and and The classification rule of an observation is
3.1.2. The Piecewise-Quadratic Separation MSD Model (QSMSD)
subject to where is free, and The interpretation of this model is the same as that of the piecewise-linear MSD model. The classification rule is the same as that of the QSMCA model.
Piecewise QSMSD and QSMCA models can also be constructed for the case in which group 1 is in the convex region and group 2 in the nonconvex region. Despite the complexity of these models (especially when the datasets are very large), their advantage is that an optimal solution may be reached faster, using fewer arcs than the number of segments a piecewise-linear model would need. Their disadvantage remains the need to solve them twice: once for the case in which group 1 is convex and once for the case in which group 2 is convex. The following quadratic specific structure models could be a way of overcoming these problems, reaching solutions to the different models faster, and handling problems of large size.
3.2. The Quadratic Specific Structure Models (QSS)
The quadratic specific structure models are based on the use of nonlinear separation. The following figure illustrates a particular case of the QSS models.
In Figure 4, the points are separated using two curves and . The circles are correctly classified by or by and the triangles are correctly classified by and . As for SPS and GSPS, many tree-specific structures are possible for QSS models. Based on this idea, the quadratic SPS and the quadratic GSPS models are proposed.
3.2.1. The Quadratic SPS Model (QSPS)
Similar to the piecewise QSMSD and QSMCA models, the first constraints of SPS model are replaced by the linear constraints which are nonlinear in terms of the attributes. The QSPS model is the following: subject towhere and are free
3.2.2. The Quadratic GSPS (QGSPS)
Replacing the first constraints of GSPS model by the linear constraints which are nonlinear in terms of the attributes like those of the QSPS model, we obtain the following QGSPS model: subject to
where and are free
As mentioned above, the cross-product terms can be excluded from the quadratic models if the attributes are uncorrelated, and other types of nonlinear functions are possible.
4. A Comparative Study
4.1. The Datasets
In this study we choose four datasets.
(i) The first dataset (D1) is the data presented by Johnson and Wichern [24] and used by Glen [11] to apply new approaches to the problem of variable selection using an LP model. This dataset consists of 46 firms (21 bankrupt firms and 25 nonbankrupt firms). The four variables measured were the following financial ratios: cash flow to total debt, net income to total assets, current assets to current liabilities, and current assets to net sales.
(ii) The second dataset (D2) is a Tunisian dataset. The data concern 62 breast tumors. Five variables characterize these tumors: four protein expression scores (EGFR, Her2, Her3, and estrogens) and the size of the tumor in cm. The tumors are divided into two groups according to the SBR grade (grades II and III), which reflects how advanced the cancer is (source: Centre of Biotechnology of Sfax).
(iii) The third dataset (D3) is a Japanese dataset. It contains 100 Japanese banks divided into two groups of 50. Seven financial ratios (return on total assets, equity to total assets, operating costs to profits, return on domestic assets, bad loan ratio, loss ratio on bad loans, and return on equity) characterize this data [25].
(iv) The fourth dataset (D4) is the Wisconsin Breast Cancer data. It consists of 683 patients screened for breast cancer, divided into two groups: 444 benign cases and 239 malignant cases. Nine attributes characterize this data (clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitoses) (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/breast-cancer-wisconsin/).
The objective is to discriminate between the groups of each dataset using the various methods cited above.
4.2. The Results
Different studies have shown that the reliability of the LDF method depends on the verification of certain hypotheses, such as the normality of the data and the equality of the variance-covariance matrices. The results obtained from testing these hypotheses on our datasets are shown in Table 1.

The computer program AMOS 4 is used to test normality. The SPSS program is used to test the equality of the variance-covariance matrices and to determine the classification rates. According to Table 1, the normality hypothesis is not verified for all datasets, but the equality of variance-covariance matrices is verified for the D1 and D3 datasets and not verified for the second and fourth datasets. The results of the statistical approaches are obtained using the SPSS program. The SVM-based approach is solved by the WinSVM package. The experiments are conducted by an Intel (R) Celeron (R) M, processor 1500 MHz in environment. Various MPs were solved by the CPLEX 10.0 package. For the experiment, we have chosen and . Microsoft Excel is used to determine the apparent hit rates (proportion of observations classified correctly), the Leave-One-Out (LOO) hit rates, and the holdout sample hit rates, which are the performance measures compared between the models. In order to evaluate the performance of the different approaches, a Leave-One-Out (LOO) procedure is used for the first three datasets. The advantage of this procedure is that it overcomes the bias of the apparent hit rates. The LOO hit rate is calculated by omitting each observation in turn from the training sample and using the remaining observations to generate a discriminant function, which is then used to classify the omitted observation. Although the computational efficiency of this procedure can be improved in statistical discriminant analysis, it is not practical in MP analysis unless only a relatively small number of observations are included. For this reason, the LOO hit rate was not used for the fourth dataset. The performance of the different MP methods on this dataset (D4) is instead assessed using the “split-half” technique.
In fact, the important number of observations available in this dataset permits to adopt this latter approach by partitioning the complete observations (386) into training and holdout samples. The training sample of dataset (D4) consisted of a random sample of of the observations in each group, with 340 observations in group1 and 160 in group 2 (500 observations in total). The remaining of the observations (104 group1 observations and 79 group 2 observations) formed the holdout sample. To evaluate the performance of the various approaches, the training sample was used to generate classification models using the different methods and these classification models were then used to determine the holdout sample hit rates. The performance of LDF using this dataset (D4) was also evaluated in the same way.
Furthermore, the “split-half” technique is also employed for the first three datasets in order to evaluate the performance of the SVM-based approach. As with dataset D4, datasets D1, D2, and D3 are partitioned into training and holdout samples. The training sample of the first dataset contains 24 observations (11 in group 1 and 13 in group 2) and its holdout sample contains 22 observations (10 in group 1 and 12 in group 2). For the second dataset, the training sample contains 45 observations (15 in group 1 and 30 in group 2), and the remaining 17 observations (7 in group 1 and 10 in group 2) form the holdout sample. The third dataset is partitioned into a training sample of 70 observations (35 from each group) and a holdout sample of the remaining 30 observations.
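The partition described for dataset D4 can be sketched as follows. This is only an illustration under stated assumptions: the function name `stratified_split`, the seed, and the index-based representation are hypothetical, and the group sizes 444 and 239 are simply the sums of the training and holdout counts quoted above (340 + 104 and 160 + 79).

```python
import random

def stratified_split(labels, train_counts, seed=0):
    """Draw a fixed number of training observations at random from each group.

    labels       : list of group labels, one per observation
    train_counts : dict mapping group label -> number of training observations
    Returns (train_idx, holdout_idx) as sorted lists of observation indices.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    train = []
    for group, k in train_counts.items():
        members = [i for i, g in enumerate(labels) if g == group]
        train.extend(rng.sample(members, k))
    train_set = set(train)
    holdout = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), holdout

# Group sizes implied by the D4 description (assumption: 444 and 239).
labels = [1] * 444 + [2] * 239
train_idx, holdout_idx = stratified_split(labels, {1: 340, 2: 160})
print(len(train_idx), len(holdout_idx))  # 500 183
```

Sampling within each group, rather than from the pooled observations, keeps the group proportions of the training sample under control, which matters when the groups are of unequal size.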
In this study, the complete set of observations of each dataset was first used as the training sample, giving the apparent hit rate, in order to demonstrate the computational feasibility of the different approaches. The “split-half” and LOO procedures then make it possible to evaluate the performance of the classification models generated by the different methods.
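To make the difference between the apparent and the LOO hit rates concrete, the sketch below computes both for a generic classifier. It is a minimal illustration, not the procedure used in this study: `fit_midpoint_rule` is a hypothetical stand-in classifier (it projects onto the difference of the group means and cuts at the midpoint) and is none of the MP or statistical models compared here, and the six data points are invented for the example.

```python
def fit_midpoint_rule(X, y):
    """Hypothetical stand-in classifier, not one of the models in this study."""
    dim = len(X[0])

    def mean(group):
        rows = [x for x, g in zip(X, y) if g == group]
        return [sum(r[j] for r in rows) / len(rows) for j in range(dim)]

    m1, m2 = mean(1), mean(2)
    w = [a - b for a, b in zip(m1, m2)]                       # direction
    c = sum(wj * (a + b) / 2 for wj, a, b in zip(w, m1, m2))  # midpoint cutoff
    return lambda x: 1 if sum(wj * xj for wj, xj in zip(w, x)) >= c else 2

def apparent_hit_rate(fit, X, y):
    """Train and classify on the same observations (optimistically biased)."""
    predict = fit(X, y)
    return sum(predict(x) == g for x, g in zip(X, y)) / len(y)

def loo_hit_rate(fit, X, y):
    """Omit each observation in turn, refit, and classify the omitted one."""
    hits = 0
    for i in range(len(y)):
        predict = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += predict(X[i]) == y[i]
    return hits / len(y)

# Invented toy data: three observations per group, two variables.
X = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [6.0, 6.5], [6.5, 6.0], [7.0, 7.2]]
y = [1, 1, 1, 2, 2, 2]
apparent = apparent_hit_rate(fit_midpoint_rule, X, y)
loo = loo_hit_rate(fit_midpoint_rule, X, y)
```

The LOO loop refits the model once per observation, which is why the procedure becomes impractical for MP models on a dataset as large as D4.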
4.2.1. The Results of the Linear Programming Model
The results of the correct classification rates (apparent rates) using MCA and MSD methods with various normalization constraints are presented in Table 2.
 
The values between parentheses are the numbers of misclassified observations. 
According to Table 2, the MCA model performs better than the MSD model for all the normalization constraints used. For the MSD model, the best classification rates are obtained with constraints (N3) and (N4), except for D1 and D3, where the difference between the three normalization constraints (N2), (N3), and (N4) is not significant. For dataset D2, the classification rates differ across these constraints. This may be due to the nature of the data, to the difference in group sizes, or to the fact that the model with the (N2) normalization will generate a discriminant function in which the constant term is properly zero, but will also exclude solutions in which the variable coefficients sum to zero, and should rather be solved with both positive and negative normalization constants [4, 11]. The performance of the MCA model, in contrast, remains unchanged across the different normalization constraints. For each of the two methods using the normalization constraint (N4), the LOO hit rates for datasets D1, D2, and D3 and the holdout sample hit rates for dataset D4 are presented in Table 3.

From Table 3, we can conclude that the difference between the MSD and MCA models is not significant for the first and the third datasets, whereas the MCA model performs better than the MSD model for the second and the fourth datasets. Furthermore, the computational time of the MCA model is lower than that of the MSD model, especially for the fourth dataset: using the complete set of observations, the MSD model was solved in less than 7 seconds, while the MCA model required less than 4 seconds to obtain the estimated coefficients of the discriminant function. For the other datasets, the difference in solution time between the two models is not significant (less than 2 seconds).
On the other hand, to solve the RS models, two cases are considered: first, and take, respectively, the values 0 and 1 (Case 1), and second, the cutoff values and are treated as decision variables (Case 2). The RS model, for the complete set of observations of dataset D4, was solved in 3 seconds. The computational time of this model on the other datasets is less than 2 seconds. The apparent and the LOO hit rates for the discriminant function generated by the RS models are shown in Table 4.
 
The values in parentheses are the numbers of misclassified observations. 
The apparent and LOO hit rates of the RS model differ between the two cases, with a marked improvement for the second and the fourth datasets in particular. For D1 and D3, the difference between the numbers of misclassified observations in the two cases is marginal; only one or two misclassified observations are found. For D2 and D4, however, the difference is clear. So, when normality and/or equality of the variance-covariance matrices are not verified, it is more appropriate to treat the cutoff values as decision variables. The results of the three combined models are given in Table 5.

The MSD weighting model (WMSD) and the RS weighting model (WRS) are used in the second stage to solve the MC1 and MC2 models. The results show that the choice of the model used in the second stage affects the correct classification rates, which are higher when a WRS model is used in the second stage. The difference between the models is not very significant for the first and the third datasets, for which the equality of the variance-covariance matrices is verified. However, for dataset D2, the MC1 model, which combines the LDF, LPM, and RS models, performs better than the MC2 model, which combines the LDF, RS, and MSD models. Indeed, the LPM model used in MC1 has the advantage of forcing the classification scores of the observations to cluster around the mean scores of their own groups. The MC1 and MC2 models also require considerable computational effort: to determine the classification rate, the combined method requires each of its component models to be solved separately. As a result, the computational time is substantial (more than 10 seconds for dataset D4, for example). For this reason, such a method may not be beneficial when the dataset is large.
The results of the various models for the four datasets are presented in Table 6.
 
The values in parentheses are the numbers of misclassified observations. 
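The hit rates and the misclassification counts reported in parentheses are related by a simple identity, written out here for convenience. The symbols n (number of observations classified) and m (number misclassified) are notation introduced for this illustration, and the value m = 1 is a hypothetical count chosen for the arithmetic, not an entry of any table; n = 46 is the size of dataset D1 implied by its training and holdout samples (24 + 22).

```latex
\[
\text{hit rate} \;=\; \frac{n - m}{n},
\qquad
\text{e.g. } n = 46,\; m = 1 \;\Longrightarrow\; \frac{45}{46} \approx 97.8\%.
\]
```

With so few observations, a single additional misclassification changes the hit rate of D1 by more than two percentage points, which is worth keeping in mind when differences between models are described as not significant.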
Table 6 shows that the correct classification rates (apparent hit rates) obtained by the MCA, RS, and MIPEDEADA models are superior to those obtained by the other models, especially when the hypotheses of normality and equality of the variance-covariance matrices are violated. The two combined methods, MC1 and MC2, give similar results for the first dataset, while for the other datasets MC1 performs better than MC2. It must be noted that the performance of the combined method can be affected by the choice of the procedures used in it. Furthermore, the difference between these models is significant, especially for dataset D2. In terms of computational time, the statistical methods LDF and LG are solved on the complete dataset in less than one second, which is faster than the MP models.
On the other hand, it is important to note that the correct classification rate of the RS model may be changed by selecting the most appropriate cutoff value for c. This cutoff value can be obtained by solving an LP problem in the second stage using a variety of objective functions such as MIP or MSD, instead of simply using the cutoff value equal to [19]. In fact, for the third dataset D3, the apparent hit rate found by Glen [14] using the RS model is equal to which is marginally below the apparent hit rate of found in our study. Glen [14] used the cutoff values 0 and 1 in the first stage and the MSD model in the second stage of the RS model. We can therefore conclude that the RS model performs best when the cutoff values are chosen as decision variables and the cutoff value equal to is simply used in the second stage. Consequently, no binary variables are needed, unlike the case in which the MSD or MIP models are applied in the second stage of the RS model. This result is interesting because such a model is very easy to solve and does not require much computational time (in general less than 3 seconds). Glen [14] mentioned that the computational time for the RS model using the MIP model in the second stage, excluding the process for identifying the misclassified observations of G1 and G2, was lower than the computational time for the MCA model. Indeed, the MCA model involves more binary variables than the RS model. In addition, for datasets D1 and D3, the RS, MCA, and MIPEDEADA models produce the same apparent hit rates. For the second dataset, however, the MIPEDEADA model, followed by the MCA model, performs better than the other approaches. On the other hand, the results obtained by the LOO procedure show that the MIPEDEADA model performs better than the other models for the second and the third datasets, while for the first dataset the difference between the models is not significant.
In terms of the holdout sample hit rate obtained using the classification models generated from the training sample of dataset D4, the statistical method LDF performs better than the other approaches, followed by the LG, the RS, and the MIPEDEADA models.
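The remark that the classification rate of the RS model depends on the cutoff value c can be illustrated with a small sketch. Note the substitution: instead of the LP, MSD, or MIP second stage discussed above, the code uses a brute-force scan over candidate cutoffs, which is only practical because the scores are one-dimensional. The function name `best_cutoff`, the classification convention (score at or below c means group 1), the fixed cutoff 0.5, and the scores themselves are all assumptions made for the example.

```python
def best_cutoff(scores, labels):
    """Brute-force stand-in for a second-stage cutoff selection.

    Assumed convention: an observation is assigned to group 1 when its
    discriminant score is <= c, and to group 2 otherwise.
    Returns (cutoff, number of misclassified observations).
    """
    ordered = sorted(set(scores))
    # Candidates: midpoints between consecutive distinct scores, plus one
    # cutoff below all scores and one above all scores.
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    candidates = [ordered[0] - 1.0] + candidates + [ordered[-1] + 1.0]

    def errors(c):
        return sum((s <= c) != (g == 1) for s, g in zip(scores, labels))

    return min(((c, errors(c)) for c in candidates), key=lambda t: t[1])

# Invented first-stage scores for four observations per group.
scores = [0.1, 0.3, 0.45, 0.7, 0.4, 0.8, 0.9, 1.1]
labels = [1, 1, 1, 1, 2, 2, 2, 2]

fixed_errors = sum((s <= 0.5) != (g == 1) for s, g in zip(scores, labels))
cutoff, tuned_errors = best_cutoff(scores, labels)
```

On these invented scores the fixed cutoff misclassifies two observations and the scanned cutoff only one, which mirrors the point made above: the same discriminant function can yield different hit rates depending on how c is chosen.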
4.2.2. The Result of Nonlinear MP Models
(a) Comparison of SPS and GSPS Models Using the Two Normalization Constraints N’1 and N’2
To compare the two normalization constraints N’1 and N’2, the SPS and GSPS models were solved using the first three datasets (D1, D2, and D3). The results are presented in Table 7.
According to Table 7, the models using the second normalization constraint can perform better than those using the first. An important finding concerns the SPS models, which cannot give any solution for the second dataset, while the QSPS models perform very well, especially the one using the second normalization constraint. Furthermore, the GSPSN’2 model performs better than the GSPSN’1 model, especially for the first dataset. Thus, compared to the normalization (N’1) used by Better et al. [18], our proposed normalization (N’2) can produce better results. The comparison of the different models developed is discussed in the following section.
 
The values in parentheses are the numbers of misclassified observations. 
(b) Comparison of Different Models
The results of the different models are presented in Table 8. From Table 8, the nonlinear MP models outperform the classical approaches. This result may be due to the fact that the performance of the classical approaches depends on the verification of some standard hypotheses. In fact, the LDF and QDF perform best if the data distribution is normal, and this hypothesis is not verified for these datasets. Although the LG model does not require such a restriction, it has not provided higher hit rates than the other approaches, especially for the second dataset. The second-order MSD model also performs worse than the other models. Furthermore, the performance of the piecewise QSMCA and QGSPSN’2 models is better than that of the piecewise-linear models (MCA and MSD) for the first and second datasets. In fact, the optimal solution is reached more rapidly using these models than using the piecewise-linear approaches (the hit rates are equal on using and ). For the second dataset D2, the piecewise-quadratic models (QSMCA and QSMSD), the multi-hyperplanes, and the multi-hypersurfaces models perform better than the other approaches. Moreover, the difference between these models and the standard piecewise models is not significant for dataset D3, but we can remark that the piecewise QSMCA and QSMSD can reach optimality rapidly using only . Comparing the nonlinear MP models in terms of computational time, the QGSPS model yields the estimated coefficients of the discriminant function faster than the GSPS model for all datasets. Using dataset D4, for example, the solution time of the QGSPS with is equal to 11 seconds (for , the solution time is equal to 21 seconds) while solving the GSPS takes more than 960 seconds. For the other datasets, the solution time of the QGSPS model is less than 3 seconds.
On the other hand, when the piecewise models are solved only for the case where group 1 is in the convex region, the optimal solution is obtained in more than 7 seconds. Otherwise, the time needed to solve these models would have been approximately double. For example, using dataset D4, solving the piecewise QMCA in the case where G1 is in the convex region required 8 seconds using three arcs (). However, to obtain the optimal solution, the model must also be solved for the case where G2 is in the convex region, and the computational time then doubles.
 
The values in parentheses are the numbers of misclassified observations. 
To judge the performance of the piecewise-linear MSD and MCA, piecewise QSMSD, QGSPSN’2, GSPSN’2, and QSPSN’2 models, the Leave-One-Out (LOO) procedure is first applied to dataset D2, which is considered the dataset for which a nonlinear separation is the most adequate. The LOO classification rate was determined by omitting each observation in turn and solving the piecewise models with group 2 in the convex region, since the optimal solution is attained when this group is in the convex region; the associated piecewise function is then used to classify the omitted observation. The same LOO procedure is applied to the multi-hypersurfaces and multi-hyperplanes models, but without solving these models twice. The classification rates obtained by the LOO procedure with , and , are presented in Table 9.
