Abstract

We propose a robust probability classifier model to address classification problems with data uncertainty. A class-conditional probability distributional set is constructed based on the modified χ²-distance. Under a “linear combination assumption” for the posterior class-conditional probabilities, we consider a classification criterion that uses the weighted sum of the posterior probabilities. The optimal robust minimax classifier is defined as the one that minimizes the worst-case absolute error loss over all distributions belonging to the constructed distributional set. Based on the conic duality theorem, we show that the resulting optimization problem can be reformulated as a second order cone programming problem, which can be efficiently solved by interior-point algorithms. The robustness of the proposed model helps it avoid the “overlearning” phenomenon on training sets and thus maintain comparable accuracy on test sets. Numerical experiments validate the effectiveness of the proposed model and further show that it also provides promising results on multiple classification problems.

1. Introduction

Statistical classification has been extensively studied in the fields of machine learning and statistics. A typical classification problem is to design a linear or nonlinear classifier based on a known training set such that a new observation can be assigned to one of the known classes. Many classification models have been proposed, such as naive Bayes classifiers (NBC) [1, 2], artificial neural networks [3], and support vector machines (SVM) [4].

In real-world classification problems, the training data are often imprecise due to unavoidable observational noise in the process of data collection or data approximation from incomplete samples. One way to handle this data uncertainty is to design a robust classifier, in the sense that it has the minimal worst-case misclassification probability over the training set. The idea of robustness has been widely applied in many traditional machine learning and statistics techniques, such as robust Bayes classifiers [5], robust support vector machines [6], and robust quadratic regressions [7]. Robust classifiers are closely related to the recently flourishing research on robust optimization. For recent developments on robust optimization, we refer the reader to the excellent book [8] and the reviews [9, 10].

Recently, [11, 12] proposed a robust minimax approach, called the minimax probability machine, to design a binary classifier. Unlike traditional methods, they make no assumption on the class-conditional distributions; only the mean and covariance matrix of each class are assumed to be known. Under this assumption, the classifier is determined by minimizing the worst-case probability of misclassification over all possible class-conditional distributions with the given mean and covariance matrix. By reformulating the classifier design problem as a second order cone program, they show that the computational complexity of the proposed approach is similar to that of SVM. Because of its computational advantage and competitive performance with other current methods, this approach has been further extended to incorporate other features. El Ghaoui et al. [13] propose a robust classification model by minimizing the worst-case value of a given loss function over all possible choices of the data in bounded hyperrectangles. Three loss functions, from SVM, logistic regression, and minimax probability machines, are studied in [13]. Based on the same assumption of known mean and covariance matrix, [14, 15] propose the biased minimax probability machine to address the biased classification problem and further generalize it to obtain the minimum error minimax probability machine. Hoi and Lyu [16] study a quadratic classifier with positive definite covariance matrices and further consider the problem of finding a convex set that covers the known sampled data in one class while minimizing the worst-case misclassification probability. The minimax probability machines have also been extended to solve multiple classification problems; see [17, 18].

In this paper, we propose a robust probability classifier (RPC) based on the modified χ²-distance. Specifically, for a given training set, we first estimate the probability of each sample belonging to each class based on each feature, which yields a nominal class-conditional distribution. Then a confidence probability distributional set is constructed around the nominal class-conditional distributions using the modified χ²-distance, where a prespecified parameter controls the size of the constructed set. Unlike the “conditional independence assumption” in NBC, we introduce a “linear combination assumption” for the posterior class-conditional probabilities: the proposed classifier takes a linear combination of these probabilities based on different features and assigns the sample to the class with the maximal posterior probability. To obtain a robust classifier, we minimize the worst-case loss function value over all possible class-conditional distributions in the distributional set. The underlying assumption is that, due to observational noise, we cannot obtain the true probability distribution of each class, but it can be well estimated by the nominal distribution in the sense that it belongs to the constructed distributional set.

Our two major contributions are as follows. First, in our model, the proposed distributional set is based on the nominal distribution and the modified χ²-distance. As pointed out in [19], such a distributional set makes use of more of the information conveyed in the training set compared with traditional robust approaches, which only use the mean and covariance matrix. To the best of our knowledge, this is among the first studies of classification models that consider complex distributional information. Although [20] considers an ε-contaminated robust support vector machine model, its distributional set is defined by easily handled linear constraints and its analysis relies heavily on a characterization of the extreme points of this set. Here our proposed distributional set is defined by a nonlinear quadratic function and is analyzed via the conic duality theorem. Second, by taking the absolute error function as the loss function, we show how to transform our robust minimax optimization problem into a computable second order cone program. The absolute error function in the objective also distinguishes our model from other existing models, such as the soft-margin support vector machine, which uses the hinge loss function [21, 22], and logistic regression, which uses the negative log likelihood function [23]. Note that the absolute error function is essential for obtaining a tractable optimization problem in our model. Numerical experiments on a real-world application validate the effectiveness of the proposed classifier and further show that it also performs well on multiple classification problems.

The paper proceeds as follows. Section 2 introduces the proposed robust minimax probability classifier based on the modified χ²-distance and discusses how to construct the desired distributional set. Section 3 provides an equivalent reformulation by handling the robust constraints and the robust objective separately. Numerical experiments on real-world data sets are carried out in Section 4 to validate the effectiveness of the proposed classifier. Section 5 concludes the paper and gives future research directions.

2. Classifier Models

In this section, a simple probability classifier is first presented and then extended to handle data uncertainty by introducing a distributional set. We also discuss how to construct this distributional set from the training data.

Consider a multiclass, multifeature classification problem in which each sample contains a fixed number of features and there are a given number of classes and samples. Specifically, we are given a training set in which each entry records the value of a feature for a sample, together with a binary indicator that equals one if the sample belongs to a given class and zero otherwise. In the following, we also identify each sample with its feature vector.

2.1. Probability Classifier

Bayes classifiers assign an observation to the class with the maximal posterior probability, where the posterior probability is the conditional probability that the sample belongs to a given class, given that we know its feature vector.

Using Bayes’ theorem, the posterior probability can be written as the product of the prior probability of the class and the class-conditional probability, divided by the probability that a sample has the observed feature vector. Note that the latter is a constant once the values of the feature variables are known and can thus be omitted. To design an effective Bayes classifier, the key issue is estimating the class-conditional probability or the joint probability. Theoretically, the joint probability can be expanded by the chain rule; however, such an estimation method suffers from the “curse of dimensionality.”
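Since the displayed formulas are not reproduced above, the following LaTeX sketch restates Bayes’ theorem and the chain-rule expansion in a generic notation; the symbols (feature vector x = (x_1, ..., x_m), class C_k, posterior P(C_k | x)) are our own assumptions rather than the paper’s original notation:

    % Bayes' theorem for class C_k given feature vector x = (x_1, ..., x_m)
    P(C_k \mid x) = \frac{P(x \mid C_k)\, P(C_k)}{P(x)}, \qquad k = 1, \dots, K,

    % chain-rule expansion of the joint class-conditional probability,
    % whose direct estimation suffers from the curse of dimensionality:
    P(x \mid C_k) = P(x_1 \mid C_k)\, P(x_2 \mid x_1, C_k) \cdots P(x_m \mid x_1, \dots, x_{m-1}, C_k).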

To address this issue, the naive Bayes classifier makes the “conditional independence assumption,” under which the class-conditional probability factorizes over the features, each factor being the class-conditional probability associated with a single feature. Here we introduce an alternative “linear combination assumption” for the class-conditional probability, in which the posterior probability is a weighted combination of per-feature class-conditional probabilities with nonnegative coefficients. Compared with the “conditional independence assumption,” which combines the probabilistic information multiplicatively, the proposed “linear combination assumption” combines the probabilistic information as a weighted sum. We further discuss the rationale of this assumption at the end of this subsection.
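For concreteness, the two assumptions can be sketched in the same generic notation (the per-feature probability P_j(C_k | x_j) and the weight w_{jk} are our own symbols):

    % naive Bayes "conditional independence assumption"
    P(x \mid C_k) = \prod_{j=1}^{m} P(x_j \mid C_k),

    % proposed "linear combination assumption" on the posterior
    P(C_k \mid x) = \sum_{j=1}^{m} w_{jk}\, P_j(C_k \mid x_j), \qquad w_{jk} \ge 0.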

Under this assumption, the posterior probability of each class is a weighted sum over the features, where each weight denotes the probability weight of the corresponding feature for that class.

To obtain the optimal probability classifier under the “linear combination assumption,” it is natural to consider an optimization problem that minimizes a prespecified loss function over the weights. In the following, we take the absolute error function as the loss function. In view of the probabilistic interpretation of the weighted sum, it is straightforward to impose nonnegativity and normalization constraints on the weights, under which the weighted sum of posterior probabilities is itself a valid probability.

Thus the optimal probability classifier (PC) problem can be formulated as follows:
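The displayed formulation is not reproduced in the text; the following Python sketch states one plausible reading of the (non-robust) PC problem with the absolute error loss and simplex constraints on the weights, using CVXPY. All variable names (P, Y, W) and the synthetic data are our own assumptions made for illustration:

    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(0)
    n, m, K = 50, 4, 3                       # samples, features, classes
    P = rng.random((n, m, K))                # nominal per-feature class probabilities
    P /= P.sum(axis=2, keepdims=True)        # normalize over classes
    Y = np.eye(K)[rng.integers(0, K, n)]     # one-hot class labels

    W = cp.Variable((m, K), nonneg=True)     # weight of feature j for class k
    # posterior estimate of sample i for class k: sum_j W[j, k] * P[i, j, k]
    post = cp.vstack([cp.sum(cp.multiply(W, P[i]), axis=0) for i in range(n)])

    objective = cp.Minimize(cp.sum(cp.abs(post - Y)))    # absolute error loss
    constraints = [cp.sum(W, axis=0) == 1]               # weights sum to one per class
    cp.Problem(objective, constraints).solve()

    predicted_class = np.argmax(post.value, axis=1)      # assign to maximal posterior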

Admittedly, the “linear combination assumption” may not always hold. However, we justify the proposed classifier by the following facts.
(1) As an intuitive interpretation, note that each per-feature probability estimates the probability of the observation belonging to a class based only on a single feature; thus it provides partial probabilistic information about the sample. Hence we can interpret each weight as a degree of trust in that information, and in this sense the “linear combination assumption” is a way of combining evidence from different sources. Similar ideas can be found in the theory of evidence; see the Dempster-Shafer theory [24, 25].
(2) In terms of classification performance, in the worst case the proposed classifier may put all the weight on one feature; in that case it is equivalent to a Bayes classifier based on a well-selected feature. If each class has a “typical” feature that distinguishes it from the other classes, the proposed classifier can learn this property by putting different weights on different features for different classes and thus provide better classification performance. A real-life application to lithology classification also validates its classification performance in comparison with support vector machines and the naive Bayes classifier.
(3) Another advantage of the proposed classifier is its computational tractability. As shown in Section 3, the proposed classifier and its robust counterpart can be reformulated as second order cone programming problems and thus solved by interior-point algorithms in polynomial time.

2.2. Robust Probability Classifier

Due to observational noise, the true class-conditional probability distribution is often difficult to obtain. Instead, we can construct a confidence distributional set which contains the true distribution. Unlike the traditional distributional sets in minimax probability machines, which utilize only the mean and covariance matrix, we construct our class-conditional probability distributional set based on the modified χ²-distance, which uses more information from the samples.

The modified χ²-distance measures the distance between two discrete probability distribution vectors. Based on the modified χ²-distance, we present the following class-conditional probability distributional set, in which the nominal class-conditional probability is the estimated probability of a sample belonging to a given class based on a single feature, and the prespecified parameter controls the size of the set.
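A sketch of the missing definitions, assuming the standard form of the modified χ²-distance used in distributionally robust optimization (the symbols p, \hat{p}, and γ are ours):

    % modified chi-square distance between a distribution p and a nominal \hat{p}
    d_{\chi^2}(p, \hat{p}) = \sum_{i} \frac{(p_i - \hat{p}_i)^2}{\hat{p}_i},

    % class-conditional distributional set of radius \gamma around \hat{p}
    \mathcal{D} = \Bigl\{ p : p \ge 0,\ \sum_i p_i = 1,\ d_{\chi^2}(p, \hat{p}) \le \gamma \Bigr\}.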

To design a robust classifier, we need to account for the effect of data uncertainty on both the objective function and the constraints. The robust objective minimizes the worst-case loss function value over all distributions in the distributional set; the robust constraints ensure that all the original constraints are satisfied for every distribution in the set. Thus the robust probability classifier problem takes the following form:
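Schematically, with the weights w, the one-hot labels y, the per-feature probabilities p, and the distributional set \mathcal{D} as introduced above (our notation), the robust counterpart is the minimax problem

    \min_{w \ge 0,\ \sum_j w_{jk} = 1} \;\; \max_{p \in \mathcal{D}} \;\; \sum_{i,k} \Bigl| y_{ik} - \sum_{j} w_{jk}\, p_{ijk} \Bigr|
    \quad \text{s.t.} \quad 0 \le \sum_{j} w_{jk}\, p_{ijk} \le 1 \quad \forall\, p \in \mathcal{D},

which should be read as a sketch of the structure rather than the paper’s exact formulation.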

Note that the above optimization problem has an infinite number of robust constraints and its objective function contains an embedded subproblem. We show how to solve such a minimax optimization problem in Section 3.

2.3. Constructing the Distributional Set

To construct the distributional set, we need to specify the size parameter and the nominal probabilities. The selection of the size parameter is application dependent, and we discuss this issue in the numerical experiment section; next we provide a procedure to calculate the nominal probabilities.

For a given feature, the following procedure takes an integer indicating the number of data intervals as input and outputs the estimated probability of each sample belonging to each class.
(1) Sort the samples in increasing order of the feature value and divide them into intervals such that each interval contains at least a minimum number of samples.
(2) Calculate the total number of samples in each class, the total number of samples in each interval, and the number of samples of each class falling in each interval.
(3) For each sample, if it falls into a given interval, calculate its class-conditional probability from these counts.
Note that, from the definition of the distributional set, we can easily compute upper and lower bounds for the true class-conditional probability by optimizing over the set. These bounding problems can be efficiently solved by a second order cone solver such as SeDuMi [26] or SDPT3 [27].
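The following Python sketch implements one reading of this procedure for a single feature; since the formula of step (3) is not reproduced in the text, the interval frequency ratio used below (class-k samples in the interval divided by samples in the interval) is our assumption:

    import numpy as np

    def nominal_probabilities(x, y, n_intervals):
        """x: (n,) values of one feature; y: (n,) integer class labels in {0, ..., K-1}."""
        K = y.max() + 1
        order = np.argsort(x)                          # step (1): sort samples by feature value
        groups = np.array_split(order, n_intervals)    # intervals of (roughly) equal size
        p = np.zeros((len(x), K))
        for g in groups:                               # step (2): count samples per class and interval
            counts = np.bincount(y[g], minlength=K)
            p[g, :] = counts / len(g)                  # step (3): interval class frequencies (assumed)
        return p                                       # p[i, k] ~ P(class k | feature value of sample i)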

3. Solution Methods for RPC

In this section, we first reduce the infinite number of robust constraints to a finite set of linear constraints and then transform the inner robust objective into a minimization problem by the conic duality theorem. Finally, we obtain an equivalent, computable second order cone program for the RPC problem. The following analysis is based on the strong duality result in [8].

Consider a conic program (CP) of the standard form and its dual problem (DP), where the constraint cone lives in a finite-dimensional space and its dual cone is defined in the usual way. A conic program is called strictly feasible if it admits a feasible solution lying in the interior of the cone.
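For reference, the standard conic primal-dual pair that Lemma 1 refers to can be written in generic notation as

    \text{(CP)}\quad \min_{x} \; c^{\top} x \quad \text{s.t.} \quad A x - b \in K,
    \qquad
    \text{(DP)}\quad \max_{y} \; b^{\top} y \quad \text{s.t.} \quad A^{\top} y = c,\ \ y \in K^{*},

    % dual cone
    K^{*} = \{\, y : y^{\top} z \ge 0 \ \ \forall z \in K \,\}.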

Lemma 1 (see [8]). If one of the problems (CP) and (DP) is strictly feasible and bounded, then the other problem is solvable, and (CP) = (DP) in the sense that both have the same optimal objective function value.

3.1. Robust Constraints

The following lemma provides an equivalent characterization of the infinite number of robust constraints in terms of a finite set of linear constraints that can be handled efficiently.

Lemma 2. For given weights, the robust constraint is equivalent to the following finite set of constraints:

Proof. First note that the distributional set can be represented as the Cartesian product of a series of projected subsets, where each projected subset is obtained by restricting the set to a single index.
Then, since the robust constraint is associated only with the corresponding variables, we can further split the projected subset into subsets whose upper and lower bounds are computed by (15) and (16), respectively.
The first constraint is then equivalent to the following constraints, where the last equivalence comes from the strong duality between the two linear programs involved.
The same technique applies to the other constraint; thus we complete the proof.
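One plausible reading of the strong-duality step, assuming the split subsets are coordinate boxes [l, u] obtained from the bounds (15) and (16) (our notation), is the linear programming duality

    \max_{l \le p \le u} \; a^{\top} p \;=\; \min_{\lambda, \mu \ge 0,\ \lambda - \mu = a} \; u^{\top} \lambda - l^{\top} \mu ,

so the robust constraint \max_{l \le p \le u} a^{\top} p \le t holds if and only if there exist \lambda, \mu \ge 0 with \lambda - \mu = a and u^{\top}\lambda - l^{\top}\mu \le t, which is a finite set of linear constraints.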

3.2. Robust Objective Function

In the RPC problem, the robust objective function is defined by an inner maximization problem. The following proposition shows that it can be transformed into a minimization problem over second order cones. To prove this result, we utilize the conjugate function of the modified χ²-distance, which is defined piecewise. For more details about conjugate functions, see [28].

Proposition 3. The following inner maximization problem is equivalent to a second order cone program, in which the second order cone is defined in the standard way.

Proof. For given feasible weights satisfying the robust constraints, it is straightforward to show that the inner maximization problem is equivalent to the following minimization problem (MP). The above constraint can be further reduced to the following constraint.
By assigning Lagrange multipliers to the constraints of the inner optimization problem, we obtain the Lagrangian function and its dual function. Note that, for any feasible point, the primal maximization problem (31) is bounded and has a strictly feasible solution; thus there is no duality gap between (31) and its dual problem. Next we show that the constraint involving the conjugate function can be represented by second order cone constraints. By substituting these constraints back into (MP), the robust objective function is equivalent to the resulting problem, and by eliminating the auxiliary variable, we complete the proof.
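The conjugate-function step can be made concrete under the assumption that the divergence kernel is that of the modified χ²-distance, \phi(t) = (t-1)^2 for t \ge 0 (our assumption, consistent with the piecewise definition mentioned before Proposition 3):

    % conjugate of \phi(t) = (t-1)^2 on t >= 0
    \phi^{*}(s) =
    \begin{cases}
    -1, & s \le -2,\\
    s + s^{2}/4, & s \ge -2,
    \end{cases}

    % so \phi^{*}(s) \le v is representable with an auxiliary variable t:
    v \ge s + t, \qquad v \ge -1, \qquad s^{2} \le 4t
    \;\Longleftrightarrow\;
    \bigl\| (s,\ t - 1) \bigr\|_{2} \le t + 1,

the last equivalence being the standard rotated second order cone representation.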

Based on Lemma 2 and Proposition 3, we obtain our main result.

Proposition 4. The RPC problem can be solved as the following second order cone program:
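Since the final formulation is not reproduced here, the following CVXPY fragment only illustrates how the rotated-cone constraint s² ≤ 4t derived in the proof of Proposition 3 is passed to a conic solver; the variables s, t, v and the toy objective are placeholders of our own, not the paper’s RPC model:

    import cvxpy as cp

    s, t, v = cp.Variable(), cp.Variable(), cp.Variable()
    constraints = [
        cp.SOC(t + 1, cp.hstack([s, t - 1])),   # second order cone form of s**2 <= 4*t
        v >= s + t,                             # together these give v >= s + s**2/4
        v >= -1,                                # the other branch of the conjugate
        s == 1.0,                               # fix s for this toy instance
    ]
    cp.Problem(cp.Minimize(v), constraints).solve()
    print(v.value)                              # approximately 1.25 = 1 + 1/4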

4. Numerical Experiments on Real-World Applications

In this section, numerical experiments on real-world applications are carried out to verify the effectiveness of the proposed robust probability classifier model. Specifically, we consider lithology classification data sets from our practical application. We compare our model with the regularized SVM (RSVM) and the naive Bayes classifier (NBC) on both binary and multiple classification problems.

All the numerical experiments are implemented in Matlab 7.7.0 and run on an Intel(R) Core(TM) i5-4570 CPU. The SDPT3 solver [27] is called to solve the second order cone programs in our proposed method and in the regularized SVM.

4.1. Data Sets

Lithology classification is one of the basic tasks for geological investigation. To discriminate the lithology of the underground strata, various electromagnetic techniques are applied to the same strata to obtain different features, such as Gamma coefficients, acoustic wave, striation, density, and fusibility.

Here numerical experiments are carried out on a series of data sets from boreholes T1, Y4, Y5, and Y6. All boreholes are located in the Tarim Basin, China. In total, there are 12 data sets used for binary classification problems and 8 data sets used for multiple classification problems. Each data set is randomly partitioned into a training set and a test set according to a prespecified training rate, such that the training set accounts for that fraction of the total number of samples.

4.2. Experiment Design

The parameters in our models are chosen based on the size of the data set. The size parameter of the distributional set depends on the number of classes; the rationale is that, if the training data are uniformly distributed over the classes, each probability has a bounded maximal variation range. The number of data intervals is chosen such that, if the training data are uniformly distributed, each data interval contains the same number of samples from each class. These two rules are used to set both parameters in the following experiments.

We compare the performance of the proposed RPC model with the following regularized support vector machine model [6] (taking a given class versus the rest as an example), which involves a regularization parameter. As pointed out in [8], this parameter represents a trade-off between the number of training set errors and the amount of robustness with respect to spherical perturbations of the data points. To make a fair comparison, in the following experiments we test a series of values and choose the one with the best performance. Note that, for a particular choice of the regularization parameter, this model reduces to the classic support vector machine (SVM). See also [6] for more details on RSVM and its application to multiple classification problems.
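The displayed RSVM model is not reproduced above; one common form consistent with the description in [6, 8] (the symbol ρ and the notation are ours) is

    \min_{w, b} \;\; \sum_{i=1}^{n} \max\bigl(0,\ 1 - y_i (w^{\top} x_i + b)\bigr) \;+\; \rho\, \| w \|_2 ,

where ρ ≥ 0 trades off training errors against robustness to spherical perturbations of the data points x_i.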

4.3. Test on Binary Classification

In this subsection, RSVM, NBC, and RPC are implemented on the 12 data sets for binary classification problems using cross-validation. To improve the performance of RSVM, we transform the original data with the widely used polynomial kernels [6].

Tables 1 and 2 show the averaged classification performance of RSVM, NBC, and the proposed RPC (over 10 randomly generated instances) for binary classification problems on the Y5 and T1 data sets, respectively. For each data set, we randomly partition it into a training set and a test set based on the parameter tr, which varies from 0.5 to 0.9. The highest classification accuracy on a training set among the three methods is highlighted in bold, while the best classification accuracy on a test set is marked with an asterisk.

Tables 1 and 2 validate the effectiveness of the proposed RPC for binary classification problems compared with NBC and RSVM. Specifically, for most of the cases, RSVM has the highest classification accuracy on training sets but its performance on test sets is unsatisfactory. For most of the cases, the proposed RPC provides the highest classification accuracy on test sets. NBC provides better performance on test sets as the training rate increases. The experimental results also show that, for a given training rate, RPC can perform better on test sets than on training sets; thus it can avoid the “overlearning” phenomenon.

To further validate the effectiveness of the proposed RPC, we test it on 10 additional data sets, namely, T41–T45 and T61–T65. Table 3 reports the averaged performance of the three methods over 10 randomly generated instances when the training rate is set to 70%. Except for data sets T45, T63, and T64, RPC provides the highest accuracy on the test sets, and, for all the data sets, its accuracy is higher than 80%. As in Tables 1 and 2, the robustness of the proposed RPC ensures its stable performance on the test sets.

4.4. Test on Multiple Classification

In this subsection, we test the performance of RPC on multiple classification problems by comparison with RSVM and NBC. Since the performance of RSVM is determined by its regularization parameter, we run a set of RSVMs with the regularization parameter varying over a sufficiently wide range and select the one with the best performance on the test sets.

Figures 1 and 3 plot the performance of the three methods on the Y5 and T1 training sets, respectively. Unlike the case of binary classification problems, RPC provides a competitive performance even on the training sets. One explanation is that RSVM can outperform the proposed RPC on training sets by finding the optimal separating hyperplane for binary classification problems, while RPC extends more robustly to multiple classification problems since it uses the nonlinear probability information of the data sets. The accuracy of NBC on the training sets also improves as the training rate increases.

Figures 2 and 4 show the performance of the three methods on the Y5 and T1 test sets, respectively. We can see that, for most of the cases, RPC provides the highest accuracy among the three methods. The accuracy of RSVM exceeds that of NBC on the Y5 test set, while the opposite holds on the T1 test set.

To further test the performance of RPC on multiple classification problems, we carry out more experiments on data sets M1–M6. Table 4 reports the averaged performance of the three methods on these data sets when the training rate is set to 70%. Except for the M5 data set, RPC always provides the highest classification performance among the three methods, and even for the M5 data set, its accuracy (88.0%) is very close to the best one (88.1%).

From the tested real-life application, we conclude that the proposed RPC is robust enough to provide better performance on both binary and multiple classification problems compared with RSVM and NBC. The robustness of RPC enables it to avoid the “overlearning” phenomenon, especially for binary classification problems.

5. Conclusion

In this paper, we propose a robust probability classifier model to address data uncertainty in classification problems. To quantitatively describe the data uncertainty, a class-conditional distributional set is constructed based on the modified χ²-distance. We assume that the true distribution lies in the constructed distributional set centered at the nominal probability distribution. Based on the “linear combination assumption” for the posterior class-conditional probabilities, we consider a classification criterion that uses the weighted sum of the posterior probabilities. The optimal robust probability classifier is determined by minimizing the worst-case absolute error over all distributions belonging to the distributional set.

Our proposed model introduces the recently developed distributionally robust optimization methodology into classifier design. To obtain a computable model, we transform the resulting optimization problem into an equivalent second order cone program based on the conic duality theorem. Thus our model has the same computational complexity as the classic support vector machine, and numerical experiments on a real-life application validate its effectiveness. On the one hand, the proposed robust probability classifier provides higher accuracy than RSVM and NBC by avoiding overlearning on training sets for binary classification problems; on the other hand, it also performs promisingly on multiple classification problems.

There are still many important extensions of our model. Other forms of loss function, such as the mean squared error and hinge loss functions, should be studied to obtain tractable reformulations, and the resulting models may provide better performance. Probability models that consider joint probability distribution information are also an interesting research direction.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.