Advances in Artificial Intelligence

Volume 2013, Article ID 176890, 12 pages

http://dx.doi.org/10.1155/2013/176890

## Imprecise Imputation as a Tool for Solving Classification Problems with Mean Values of Unobserved Features

Department of Control, Automation and System Analysis, St. Petersburg State Forest Technical University, Institutski per. 5, St. Petersburg 194021, Russia

Received 11 October 2012; Revised 9 February 2013; Accepted 10 March 2013

Academic Editor: Wolfgang Faber

Copyright © 2013 Lev V. Utkin and Yulia A. Zhuk. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

A method for solving a classification problem when there is only partial information about some features is proposed. This partial information comprises the mean values of features for every class and the bounds of the features. In order to maximally exploit the available information, a set of probability distributions is constructed such that two distributions are selected from the set which define the minimax and minimin strategies. Random values of features are generated in accordance with the selected distributions by using the Monte Carlo technique. As a result, the classification problem is reduced to the standard model which is solved by means of the support vector machine. Numerical examples illustrate the proposed method.

#### 1. Introduction

There are several major data mining techniques including classification, clustering, and novelty detection. We consider classification as a data mining technique used to predict an unobserved output value $y$ based on an observed input vector $\mathbf{x}$. This requires us to estimate a predictor from training data, that is, from a set of example pairs $(\mathbf{x}, y)$. A particularly important problem of statistical machine learning is the binary classification problem, which can be regarded as the task of classifying objects into two classes (groups) in accordance with their properties or features. In other words, we have to classify each pattern $\mathbf{x}$ into one of the two classes by means of a discriminant function $f(\mathbf{x})$.

A common assumption in supervised learning is that training and predicted data are drawn from the same (unknown) probability distribution; that is, training and predicted data come from the same statistical model. Most machine learning algorithms and methods exploit this assumption, which, unfortunately, often does not hold in practice. This may lead to a performance deterioration in the induced classifiers [1, 2]. The problem may arise with imbalanced data [3] or with rare events or observations [4]. The assumption also fails when features are only partially known or observed, for instance, when we know only some mean values of the features but cannot obtain their actual values during training.

One approach to handling the above problem and to coping with the imbalance and possible inconsistencies of training and predicted data is the minimax strategy, in which the classification parameters are determined by minimizing the maximum possible risk of misclassification [1, 2]. This is an “extreme” strategy of decision making. As pointed out in [1], minimax classifiers may be seen as overconservative since their goal is to optimize the performance under the least favorable conditions. Therefore, it is interesting to simultaneously study the so-called minimin or optimistic strategy, in which the classification parameters are determined by minimizing the minimum possible risk of misclassification. This is another “extreme” strategy.

Taking the above into account, we propose a classification model using the minimax and minimin strategies for situations in which some features are observed, with precise values available for examples of each class, while the initial information about the remaining features is restricted to their mean values for every class. In other words, we know only the mean values (expectations) of some features and do not have any observations of them. This information is very limited, but it should still be exploited. For simplicity, the features with this information will be called unobserved. A typical example of the above situation is the production of reinforced concrete beams whose quality and strength depend on a number of parameters such as the weight of reinforcement bars and concrete materials. If some of the parameters have not been observed or measured before, it is difficult to reject new beams or to classify them into two classes, defective (rejected) or of high quality, because we do not have a learning set of beams with the measured parameters. However, if we know, for instance, how much steel has been used in manufacturing the beams, then we are able to evaluate the average weight of steel in a beam. The information can be elicited, for instance, from experts. It is often easy for experts to provide judgments about the average value of a feature for every class because this information is the simplest and most understandable.

One of the simplest ways to solve the classification problem with the partial information is to assume that the mean values are observed values. In fact, we replace in this case an unknown probability distribution of data of a feature by the deterministic variable which takes one value corresponding to the mean value of this feature. Of course, we accept here a very strong assumption which may lead to a significant performance deterioration especially if the underlying probability distribution is not symmetric. Another way is to find the mean values of every observed feature and use the simplest classification algorithm considered by many authors, for instance, by [5]. However, we lose some useful information in this case, which can be inferred from the observations.

In order to maximally exploit the available information about features, we propose another approach whose underlying ideas can be formulated with a combination of multiple imputation [6] and imprecise models of features.

As indicated in [7], imputation is a class of methods by which an estimation of the missing value or its distribution is used to generate predictions from a given model. In particular, either a missing value is replaced with an estimation of the value or alternatively the distribution of possible missing values is estimated, and corresponding model predictions are combined probabilistically. Various imputation treatments for missing values in training data are available that may be deployed at prediction time [8–13]. However, some treatments such as multiple imputation [6] are particularly suitable to induction. In particular, multiple imputation (or repeated imputation) is a Monte Carlo approach that generates multiple simulated versions of a dataset such that each is analyzed, and the results are combined to generate inference.

We do not know the probability distributions of data for unobserved features. However, the mean values of features and their boundary values produce a set of probability distributions bounded by some lower and upper cumulative distribution functions (CDFs). This way leads to constructing the so-called p-boxes [14, 15] from data. It should be noted that the considered set of distributions is not the set of parametric distributions having the same parametric form as the bounding distributions, but it is the set of all possible distributions restricted by the lower and upper bounds. This is an important feature of the proposed approach in this paper. A probability distribution is selected from the p-box in order to make a pessimistic decision, which maximizes the risk function as a measure of the classification error. In other words, the well-known minimax strategy is applied for solving the classification problem, which appears as an insurance against the worst case [16]. Another probability distribution is selected from the p-box in order to make an optimistic or minimin decision. The similar idea applied to regression models has been considered in [17–19]. So, the first idea is to consider the lower and upper probability distributions of feature data produced by the corresponding mean values and bounds for the feature values.

It should be noted that the obtained bounding probability distributions do not belong to standard types of probability distributions and their convolution for combining features and for computing parameters of the discriminant function is an extremely hard problem. Therefore, in order to cope with this problem, the second idea is proposed. We can apply the Monte Carlo technique for generating random values of features, which are governed by the probability distributions selected from the p-boxes in accordance with the minimax and minimin strategies [20, 21]. In a nutshell, for every example of the training set, we generate a (large) number of random values for unobserved features. It is a multiple imputation technique which has been applied to classification problems [7, 22, 23]. But the main distinction of the proposed approach from the available ones is that it is based on some partial information about unobserved features and uses the p-boxes for generating random values of features.

After carrying out this procedure, the classification problem can be solved by means of standard methods, for instance, by means of the support vector machine (SVM). The Monte Carlo technique has also been applied to general classification problems [24, 25]. It has been successfully applied to reliability analysis problems in the framework of classification models [26, 27]. Of course, the Monte Carlo technique requires additional computation efforts. However, its main advantage is its simplicity. Moreover, we get the standard classification problem solved by standard available software tools.

We have to stress that there is no sense in applying the proposed model when we have some missing values among the observed values of a feature. The model has to be used when we do not have observations for some features at all and only mean values of the features and their bounds are known.

The paper is organized as follows. A statement of the well-known standard classification problem is given in Section 2. This statement is extended to the case of a set of probability distributions of training data in Section 3. In this section, two strategies, minimax and minimin, are formally introduced. The classification problem with mean values for a part of unobserved features is considered in Section 4. A general method for constructing the classification model from partial information about some features, using the set of probability distributions, is described in the same section. The generation of training data for the Monte Carlo simulation is addressed in Section 5. A way of reducing the classification problem with partial information about features to the standard problem and its solution by means of the SVM method is given in Section 6. Numerical examples with synthetic data and with real datasets, including the Iris, Pima Indian Diabetes, Mammographic Masses, Parkinsons, Indian Liver Patient, Breast Cancer Wisconsin (Original), Breast Cancer Wisconsin (Diagnostic), Musk, and Lung-Cancer datasets from the UCI machine learning repository [28], are provided in Section 7.

#### 2. The Standard Classification Problem

The binary-classification problem can be formulated as follows. There are predictor-response data with a binary response representing the observation of classes and . The binary-classification problem is to estimate a region in predictor space in which class 1 is observed with the greatest possible majority. Suppose we are given empirical data

Here, is some nonempty set of the patterns or examples; are labels or outputs taking the values and ; is the number of features. It is supposed that the number of elements in the training set belonging to the class is and their indices form the set of indices ; that is, we can write and .

Classification problem is usually characterized by an unknown CDF on defined by the training set or examples and their corresponding class labels .

The main problem is to find a decision function $g(\mathbf{x})$ which accurately predicts the class label $y$ of any example $\mathbf{x}$ that may or may not belong to the training set. In other words, we seek a function $g$ that minimizes the classification error, which is given by the probability that $g(\mathbf{x}) \neq y$. One of the possible approaches for solving the problem is the discriminant function approach, which uses a real-valued function $f(\mathbf{x})$, called the discriminant function, whose sign determines the class label prediction: $g(\mathbf{x}) = \operatorname{sign}(f(\mathbf{x}))$. The discriminant function may be parametrized with some parameters $\mathbf{w}$, $b$ that are determined from the training examples by means of a learning algorithm. In particular, the function may be linear; that is, $f(\mathbf{x}) = \langle \mathbf{w}, \mathbf{x} \rangle + b$. Introduce also the notation $x_i^{(k)}$ for the $k$th element of the vector $\mathbf{x}_i$.

Given the training data, the linear discriminant training problem is to minimize the following risk measure [29]: Here, the loss function usually takes a nonzero value when the sign of the discriminant function (the class label prediction) does not coincide with the class label . The minimization of the risk measure is carried out over the parametric class of functions . In other words, the function provides the minimum of such that .

#### 3. The Classification Problem under a Set of Probability Distributions

Let us represent the joint probability as . Here, is the prior probability that an example belongs to the class . Then, we can rewrite the risk measure taking into account two values of Here,

By assuming that features are independent, we can rewrite the above risk measures as

Suppose that the distributions are unknown. However, we assume that some lower and upper bounds for a set of the CDFs are known to be accurate to and they are and , respectively. We can write

In other words, there is an unknown precise “true” CDF for every and every , but we do not know it and only know that it belongs to the set . It has been mentioned that the set is not the set of parametric distributions having the same parametric form as the bounding distributions, but it is the set of all possible distributions restricted by the lower and upper bounds.

##### 3.1. The Minimax Strategy

One of the possible strategies to derive an estimator is the minimax (pessimistic) strategy. According to the minimax strategy, we select a CDF from the set and a CDF from the set such that the risk measures and achieve their maximum for every fixed . The minimax strategy can be explained in a simple way. We do not know a precise CDF , and every CDF from can be selected. Therefore, we should take the “worst” distribution providing the largest value of the risk measure. The minimax criterion appears as an insurance against the worst case because it aims at minimizing the expected loss in the least favorable case [16].

Denote . Since the sets and are obtained independently for and , respectively, then

The minimax risk functional with respect to the minimax strategy is now of the form:

Let us consider in detail the first problem . Most loss functions applied in classification are increasing with . This implies that the upper bound for , that is, the maximum of over all distributions from , is achieved at the CDFs (see, e.g., Walley’s paper [30]). Hence, there holds

Here, where

The above condition can be rewritten in terms of the function instead of

In the same way, we can consider the second problem . Most of loss functions are decreasing with . Therefore, the upper bound for is achieved at the distribution . This implies that

Finally, we get the upper bound for the risk measure , which is of the form

Now we have two tasks. First, we have to define CDFs and from the available information for every and for every . Second, we have to define the prior probabilities of classes and .

##### 3.2. The Minimin Strategy

The minimin strategy can be regarded as a direct opposite of the minimax strategy. According to the minimin strategy, the risk measure is minimized over all probability distributions from the set as well as over all values of parameters. The strategy can be called optimistic because it selects the “best” probability distribution from the set . Of course, the minimin strategy is of little interest. Nevertheless, we study it in order to compare “extreme” cases (minimax and minimin strategies).

Similarly to the minimax strategy, we can write

Since loss functions applied in classification are increasing with , then the lower bound for , that is, the minimum of over all distributions from , is achieved at the distribution . The loss function is decreasing. Therefore, the lower bound for is achieved at the distribution . Hence, there holds where

The optimization problem for computing the optimal values of parameters for the minimin strategy can be written as

#### 4. Mean Values of Features and a Method for Constructing the Model

Suppose that an object is characterized by features. Moreover, we have the training set (1). Every observation contains the observed values of features . Without loss of generality, we assume that the features with numbers are observed. The other features are unobserved, and we know only the conditional mean values of the features for every class and their bounds and , . How can the objects be classified in this case?

One of the simplest ways is to assume that the mean values are observed values. In other words, we can write , , for all , that is, for all observations. This way can be applied when there are a lot of observations. However, when the amount of statistical data is small, the above replacement of observations by mean values may lead to incorrect classification. Moreover, we do not take into account the information about bounds of feature values here, which might be useful.

Another way is to find the mean values of every observed feature with the number , for every class as Then we can exploit the simplest classification algorithm considered by many authors, for instance, by [5]. The algorithm is based on analyzing the distances between a predicted vector and two vectors of mean values of features. The smallest distance determines the class of . It has been noted in [5] that the proposed decision is the best we can do if we have no prior information about the probabilities of the two classes. However, we lose some useful information in this case, which can be inferred from the observations.
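The distance-based rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Euclidean distance is assumed as the distance measure, and the function names `class_means` and `nearest_mean_predict` are hypothetical and do not come from [5].

```python
import math


def class_means(X, y):
    """Return {label: mean vector}, averaging each feature over the examples of a class."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for k, value in enumerate(xi):
            acc[k] += value
        counts[yi] = counts.get(yi, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def nearest_mean_predict(x, means):
    """Assign x to the class whose mean vector is closest (Euclidean distance assumed)."""
    return min(means, key=lambda label: math.dist(x, means[label]))


# Toy data: two features, labels -1 and 1.
X = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
y = [-1, -1, 1, 1]
means = class_means(X, y)
label = nearest_mean_predict([1.0, 1.0], means)
```

The sketch makes the loss of information visible: once `class_means` has been computed, the individual observations play no further role in the decision.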

Therefore, we have to develop a classification method which maximally exploits the available information about features.

The first important assumption we use below is that the values of observed features are governed by the nonparametric or empirical distribution.

By dealing with the unobserved features, we consider two cases or two important assumptions. The first one is that we have conditional expectations defined for every class. The second one is that we have unconditional expectation for every feature, which does not depend on the class. This case is less informative, but it is typical for many applications. It is reduced to the first case by accepting the equality .

Let us divide the discriminant function into two parts: Here, , , , and .

The whole discriminant function is the sum . We assume that every function and has some conditional CDFs and for every , respectively.

Let us return to the risk measure defined in (4). It can be rewritten as follows: Here, is the conditional CDF of the th feature for the class .

By assuming that the observed features are governed by the empirical distribution, we can conclude that the distribution of the function is also empirical; that is, its PDF is the weighted sum of Dirac functions with weights . Hence, we obtain

The precise CDFs , , are unknown. However, we know the mean values of every feature with numbers for every class and the bounds of their values. Therefore, we can construct a set of CDFs with some lower and upper bounds. Given the mean value of the th feature and its bounds , , the lower and upper conditional CDFs of the th feature values are

It should be noted that the expression for the upper bound can be obtained by using the natural extension [31, 32] which can be represented as the following linear programming problem: subject to , , for all .

Here, is the indicator function taking the value if . The lower bound can be obtained in the same way by solving the following programming problem: subject to ,, for all .

The same bounds have been differently obtained in the work [33].

The lower and upper CDFs are shown in Figure 1, where , , and . The resulting bounds are optimal in the sense that they could not be any tighter under the given information. However, this does not mean that any distribution whose CDF is inscribed within this bounded probability region would have the same expectations . The obtained set is richer and produces the p-box. This leads to a more conservative and cautious solution of the classification problem.
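To make the p-box concrete, the bounds can be evaluated numerically. The following Python sketch assumes the standard sharp bounds for a variable confined to an interval $[a, b]$ with known mean $m$, $a < m < b$: a lower CDF equal to $(x - m)/(x - a)$ on $(m, b)$ with a jump to 1 at $b$, and an upper CDF equal to $(b - m)/(b - x)$ on $[a, m)$ and to 1 from $m$ onwards. These closed forms, the symbols `a`, `b`, `m`, and the function names are assumptions of the sketch, not notation taken from the paper.

```python
def lower_cdf(x, a, b, m):
    """Smallest possible P(X <= x) for X in [a, b] with mean m (a < m < b assumed)."""
    if x >= b:
        return 1.0
    if x <= m:
        return 0.0
    return (x - m) / (x - a)


def upper_cdf(x, a, b, m):
    """Largest possible P(X <= x) for X in [a, b] with mean m (a < m < b assumed)."""
    if x < a:
        return 0.0
    if x >= m:
        return 1.0
    return (b - m) / (b - x)


# Illustrative values only: bounds 0 and 10, mean 4.
a, b, m = 0.0, 10.0, 4.0
grid = [a + k * (b - a) / 20 for k in range(21)]
envelope = [(x, lower_cdf(x, a, b, m), upper_cdf(x, a, b, m)) for x in grid]
```

Under these assumptions the lower bound is zero up to the mean and the upper bound already has a jump at the lower bound of the feature, which is why the two extreme distributions differ so strongly from a symmetric distribution centered at the mean.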

Now we have two problems. The first one is to determine the CDFs , . The second problem is to solve an optimization problem for computing parameters by using the above expressions for the risk measure.

Since the function is increasing, then the upper bound for can be written as Here, the upper bound depends only on the bounds for CDFs . This is a very important property which will be used later.

The function is decreasing. This implies that the upper bound for is Here, the upper bound depends also only on the bounds for CDFs .

It should be noted that it is difficult to integrate in (26)-(27) in an explicit form in order to get some functions of parameters even for the simplest loss functions . However, we can apply the standard Monte Carlo technique. By using this technique, random values of features with the indices , are generated in accordance with the CDFs for the class and with the CDFs for the class . By generating random vectors of features , , for every and every in accordance with the CDF and the CDF , we rewrite (26)-(27) as follows:

Finally, we obtain the upper risk measure as a function of parameters as where for and for .

In fact, we extend the training set by generating the “missing” values of features. We reduce the learning problem with combined types of the training information to the standard problem when there are training data in the form of real and generated observations of all features. It is important to note that we do not replace here the “missing” features by their mean values ,. The “missing” values are replaced by a set of random values of features generated in accordance with the corresponding lower and upper CDFs.
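The extension of the training set can be sketched as follows. This is a minimal illustration, assuming a single unobserved feature appended as the last coordinate; the function name `augment_training_set` and the argument `sampler` (a callable returning one generated value for a given class label) are hypothetical and introduced only for this example.

```python
def augment_training_set(X_obs, y, sampler, n_draws):
    """Replicate every example n_draws times, appending a generated value of the
    unobserved feature to each copy. sampler(label) returns one random value
    for the class `label`."""
    X_aug, y_aug = [], []
    for xi, yi in zip(X_obs, y):
        for _ in range(n_draws):
            X_aug.append(list(xi) + [sampler(yi)])
            y_aug.append(yi)
    return X_aug, y_aug


# Deterministic stand-in for a random generator, used only to show the shapes.
X_obs = [[1.0], [2.0]]
y = [-1, 1]
X_aug, y_aug = augment_training_set(X_obs, y, lambda label: 10.0 * label, n_draws=3)
```

In the minimax and minimin settings the only thing that would change is the `sampler`: it would draw from the lower or the upper CDF depending on the class, while the rest of the procedure stays the same.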

The optimization problem for computing parameters for the minimin strategy is of the same form as (29). However, the value is governed by the CDF for , and is governed by the CDF for . This is just one distinction of optimization problems by the minimax and minimin strategies.

An important question is how to determine the functions and or how to determine the type of dependence between and . We can propose two possible ways of doing that. First, the dependence can be determined by experts or by a decision maker on the basis of a preliminary analysis of features and classes. Very often, we can evaluate how possible changes of the feature values affect the output variable on the basis of the physical meaning of the analyzed classification problem. This way is simple, but, in general, it cannot always be applied to classification problems. Second, we can enumerate variants of the CDFs and by taking different lower and upper CDFs instead of and . In accordance with the minimax strategy, the optimal risk measure is the largest value of the risk measure at the optimal parameters . The same procedure can be applied to the minimin strategy. However, in this case we search for the smallest value of the risk measure at the optimal parameters .

#### 5. A Procedure for Generation of Random Feature Values

Let us consider how to generate random feature values in accordance with the above CDFs. For brevity, denote the lower and upper bounds of the feature by $a$ and $b$, its mean by $m$, and the generated value by $x$. First, we analyze the lower CDF. It can be seen from its form that the corresponding random variable is concentrated on two subsets. The first subset is the interval from $m$ to $b$. The second is the point $b$. The probability that the random variable is in the interval is equal to $(b - m)/(b - a)$. The probability of the point $b$ is $(m - a)/(b - a)$. Therefore, a random number is generated in two steps. First, a random variable $u$ uniformly distributed in the interval $[0, 1]$ is generated. If $u$ is larger than $(b - m)/(b - a)$, then $x = b$; that is, the generated number at the second step is $b$. If $u$ is smaller than $(b - m)/(b - a)$, then we use the well-known inverse transformation method. According to the method, the random number is computed through the inverse lower CDF; that is,

$$x = \frac{m - u a}{1 - u}.$$

The right side of the above equality is obtained by means of the inverse transformation of the lower CDF, that is, by solving $u = (x - m)/(x - a)$ for $x$.

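The same simulation procedure can be provided for the upper probability distribution, which places a point mass at the lower bound $a$ of the feature and spreads the remaining probability over the interval from $a$ to the mean $m$ (with $b$ denoting the upper bound). A random variable $u$ uniformly distributed in the interval $[0, 1]$ is generated. If $u$ is smaller than $(b - m)/(b - a)$, then $x = a$; that is, the generated number at the second step is $a$. If $u$ is larger than $(b - m)/(b - a)$, then, according to the inverse transformation method, the random number is computed through the inverse upper CDF; that is,

$$x = b - \frac{b - m}{u}.$$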
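The two-step generation procedure can be sketched in Python as follows. The sketch assumes that the feature lies in $[a, b]$ with mean $m$, $a < m < b$, that the lower CDF is $(x - m)/(x - a)$ on $[m, b)$ with a point mass at $b$, and that the upper CDF is $(b - m)/(b - x)$ on $[a, m]$ with a point mass at $a$. The function names and the optional `rng` argument are hypothetical and serve only this illustration.

```python
import random


def sample_lower(a, b, m, rng=random):
    """Draw one value governed by the lower CDF (continuous part on [m, b), atom at b)."""
    u = rng.random()
    p_interval = (b - m) / (b - a)      # probability of the continuous part
    if u >= p_interval:
        return b                         # atom at the upper bound
    return (m - u * a) / (1.0 - u)       # solves u = (x - m) / (x - a) for x


def sample_upper(a, b, m, rng=random):
    """Draw one value governed by the upper CDF (atom at a, continuous part on (a, m])."""
    u = rng.random()
    p_atom = (b - m) / (b - a)           # probability of the atom at the lower bound
    if u <= p_atom:
        return a
    return b - (b - m) / u               # solves u = (b - m) / (b - x) for x


rng = random.Random(0)                   # seeded generator for reproducibility
lower_draws = [sample_lower(0.0, 10.0, 4.0, rng) for _ in range(2000)]
upper_draws = [sample_upper(0.0, 10.0, 4.0, rng) for _ in range(2000)]
```

Note that values drawn from the lower CDF never fall below the mean and values drawn from the upper CDF never exceed it, so neither extreme distribution reproduces the mean itself; this is consistent with the p-box being a larger set than the set of distributions with the given expectation.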

#### 6. Hinge Loss Function and SVM

A procedure for computing optimal values of parameters depends on the loss function . We consider the so-called hinge loss function, which, for a class label $y$ and a discriminant function value $f(\mathbf{x})$, is of the form $\max(0, 1 - y f(\mathbf{x}))$. This function is chosen in order to reduce the classification problem to the SVM method, which gives the opportunity to construct nonlinear classification models in a rather simple way.
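For illustration, the hinge loss and the corresponding empirical risk of a linear discriminant function can be computed as in the following Python sketch. It assumes labels in $\{-1, 1\}$, a linear function with a weight vector `w` and an offset `b`, and equal weights for all examples (the class prior probabilities used in the paper are not modeled here); all names are hypothetical.

```python
def hinge_loss(y, score):
    """Hinge loss max(0, 1 - y * score) for a label y in {-1, 1}."""
    return max(0.0, 1.0 - y * score)


def empirical_hinge_risk(w, b, X, y):
    """Average hinge loss of the linear function <w, x> + b over the examples."""
    total = 0.0
    for xi, yi in zip(X, y):
        score = sum(wk * xk for wk, xk in zip(w, xi)) + b
        total += hinge_loss(yi, score)
    return total / len(X)


X = [[2.0, 0.0], [-0.5, 0.0]]
y = [1, -1]
risk = empirical_hinge_risk([1.0, 0.0], 0.0, X, y)
```

The first example is classified with a margin larger than one and contributes nothing, whereas the second one is on the correct side but inside the margin and is still penalized.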

After substituting the hinge loss function into the objective function (29), we get the following optimization problem: It can be rewritten in a more dense form:

Let us introduce a new optimization variable Then we get the optimization problem subject to

So, we have the linear optimization problem having optimization variables and constraints.

Let us add the standard Tikhonov regularization term (the most popular penalty or smoothness term) [34] to the objective function (35) and the constant “cost” parameter . The smoothness (Tikhonov) term can be regarded as a constraint which enforces uniqueness by penalizing functions with wild oscillation and effectively restricting the space of admissible solutions. The detailed analysis of regularization methods can be found also in the work [35]. Then we get the following quadratic programming problem: subject to (36).

Instead of minimizing the primary objective function (37), a dual objective function, the so-called Lagrangian, can be formed of which the saddle point is the optimum. The Lagrangian is Here, ,, are Lagrange multipliers. Hence, the dual variables have to satisfy positivity constraints for all ,.

Hence, we get the simplified Lagrangian

Now we can divide all terms of the above objective function into two parts corresponding to the observed and unobserved features, respectively,

Hence, we obtain the dual optimization problem subject to

Any data point for which is called a support vector. Let and denote the set of indices of the support vectors and their total number, respectively. Then one of the ways for computing the parameter is where is one of the support vectors.

If we assume that for all , the prior probabilities are defined as , then we rewrite the optimization problem as subject to

Finally, we can write the discriminant function

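The main advantage of the SVM is the use of kernels, which are functions that transform the input data to a high-dimensional space where the learning problem is solved. There are many types of kernels that may be used in an SVM. Acceptable kernels must satisfy Mercer’s condition. Commonly used forms of kernels are linear $K(\mathbf{x}, \mathbf{z}) = \langle \mathbf{x}, \mathbf{z} \rangle$, polynomial $K(\mathbf{x}, \mathbf{z}) = (\gamma \langle \mathbf{x}, \mathbf{z} \rangle + r)^{d}$, $\gamma > 0$, radial basis function (RBF) $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma \lVert \mathbf{x} - \mathbf{z} \rVert^{2})$, $\gamma > 0$, and sigmoid $K(\mathbf{x}, \mathbf{z}) = \tanh(\gamma \langle \mathbf{x}, \mathbf{z} \rangle + r)$. Here, $\gamma$, $r$, and $d$ are kernel parameters. The kernel functions allow us to significantly extend the class of discriminant functions that can be used in this approach.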
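The four kernel types listed above can be written down directly. The following Python sketch uses the common parametrization with a scale `gamma`, an offset `r`, and a degree `d`; the function names and the default parameter values are assumptions chosen only for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, r=0.0, d=2):
    return (gamma * dot(u, v) + r) ** d


def rbf_kernel(u, v, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(u, v, gamma=1.0, r=0.0):
    return math.tanh(gamma * dot(u, v) + r)


u, v = [1.0, 2.0], [3.0, 4.0]
values = {
    "linear": linear_kernel(u, v),
    "polynomial": polynomial_kernel(u, v, gamma=1.0, r=1.0, d=2),
    "rbf": rbf_kernel(u, v, gamma=0.5),
    "sigmoid": sigmoid_kernel(u, v, gamma=0.01),
}
```

In the dual problem, replacing every inner product of two examples by one of these functions is all that is needed to obtain a nonlinear discriminant function.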

#### 7. Experimental Design

We illustrate the method proposed in this paper via several examples; all computations have been performed using the statistical software R [36]. We investigate the performance of the proposed method and compare it with other methods dealing with missing data by considering the accuracy measure (ACC), which is the proportion of correctly classified cases on a sample of data; that is, ACC is an estimate of a classifier’s probability of a correct response. This measure is often used to quantify the predictive performance of classification methods, in particular of binary classification tests. It can formally be written as $\mathrm{ACC} = N_T / N$. Here, $N_T$ is the number of test data for which the predicted class of an example coincides with its true class, and $N$ is the total number of test data.
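As a small illustration, the accuracy measure can be computed as in the following Python sketch; the function name `accuracy` and the toy labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Proportion of test examples whose predicted class equals the true class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


acc = accuracy([1, -1, 1, 1], [1, 1, 1, -1])
```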

First, we consider a numerical example with synthetic data. In this example, we generate instances with two features () such that the second feature is unobserved. We generate normally distributed random values for every feature with the expectations , , , and and the standard deviations and , respectively. We take identical standard deviations for both classes in order to simplify the example. Moreover, we set the lower and upper bounds for values of the second feature and . Then we randomly select points (instances) with identical numbers of points for both classes and get three training sets. The first and the second training sets are obtained in the following way. We generate the values of the second feature times for every example. In total, we have examples. The values for the first training set are generated in accordance with the CDFs for the class and with the CDFs for the class . This training set corresponds to the minimax strategy. The values for the second training set are generated in accordance with the CDFs for the class and with the CDFs for the class . The second training set corresponds to the minimin strategy. To get the third training set, we replace all values of the second feature in the set of examples by the expectations for and . Here, we use the available mean values of the second feature as values of the feature. We call this strategy direct for short. The initially generated normally distributed random values are used for testing the resulting discriminant functions.

The ACC measures and the discriminant functions for the above three training sets will be indexed by numbers 1, 2, and 3 corresponding to the minimax, minimin, and direct strategies, respectively.

We will use the linear and RBF kernels with the parameter . By applying the above initial data, we get three discriminant functions corresponding to three strategies (minimax, minimin, and direct):

The corresponding ACCs for linear and RBF kernels are shown in Table 1. One can see from the table that the optimistic and direct strategies provide better results in comparison with the minimax strategy. This can be explained by exploiting the normal distribution (symmetric and unimodal) with rather small standard deviations for generating the random values of the second feature.

We replace the normal distribution of the second feature values by the truncated exponential distribution with the CDF if and if . This distribution is not symmetric, and its mean value cannot replace the corresponding random values. By taking the linear and RBF kernels, and , we get the following discriminant functions:

The corresponding ACCs for linear and RBF kernels are shown in Table 2. It can be seen from the table that the minimax strategy provides better results. This follows from the fact that the minimax strategy takes into account the worst cases of the probability distribution of feature values. Of course, the exponential distribution used here is not the worst case, but it is not the best case either. We can observe that making the probability distribution less favorable improves the minimax strategy in comparison with the minimin and direct strategies.

The proposed method has been evaluated on the following publicly available datasets: Iris, Pima Indian Diabetes, Mammographic Masses, Parkinsons, Indian Liver Patient, Breast Cancer Wisconsin (Original), Breast Cancer Wisconsin (Diagnostic), Musk, and Lung-Cancer. All datasets are from the UCI machine learning repository [28]. Table 3 briefly summarizes these datasets; more detailed information can be found in the respective data sources.

For all data, we use the repeated random subsampling validation procedure; that is, we randomly split the dataset into two subsets. One of them (training set having instances) is used to train the model while the other (test set having instances) is used to validate the model. The number of instances for training will be denoted as . Moreover, we take instances from every class for training. They are randomly selected from the classes. The remaining instances in the dataset are used for validation. The parameter of the RBF kernel for every dataset is chosen so as to maximize the accuracy measure. This is carried out by means of the following procedure. It is well known that letting the and grow exponentially is a practical method to identify good parameters. A uniform grid in the logarithmic coordinate space (, ) is usually used. The point in the grid represents a parameter pair . However, we fix the value of in order to reduce the number of experiments because our main aim is to compare the proposed models with known models. So, we perform experiments on a uniform grid where has a range of .
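The validation and parameter selection procedure above can be sketched as follows. This is a minimal illustration, not the code used for the reported experiments: the function names, the grid exponents, and the `evaluate` callback (which is expected to train a classifier with the given kernel parameter on the training indices and return its accuracy on the test indices) are hypothetical.

```python
import random


def stratified_split(labels, n_per_class, rng):
    """Pick n_per_class training indices from every class; the rest form the test set."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train = []
    for y in sorted(by_class):
        train.extend(rng.sample(by_class[y], n_per_class))
    chosen = set(train)
    test = [i for i in range(len(labels)) if i not in chosen]
    return sorted(train), test


def log_grid(exp_min, exp_max, base=2.0):
    """Values base**k for integer k: a uniform grid in logarithmic coordinates."""
    return [base ** k for k in range(exp_min, exp_max + 1)]


def select_parameter(labels, n_per_class, grid, evaluate, n_repeats, seed=0):
    """Return the grid value with the best mean accuracy over repeated random splits.
    The same splits are reused for every grid value so that the comparison is fair."""
    rng = random.Random(seed)
    splits = [stratified_split(labels, n_per_class, rng) for _ in range(n_repeats)]
    best_param, best_acc = None, -1.0
    for param in grid:
        acc = sum(evaluate(param, tr, te) for tr, te in splits) / n_repeats
        if acc > best_acc:
            best_param, best_acc = param, acc
    return best_param, best_acc
```

Only one parameter is searched because the other one is fixed, as in the procedure above. Ties between grid values are resolved in favor of the smaller value.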

From every dataset, we randomly select a feature to be treated as having missing values and compute its mean values for negative and positive labels, respectively. Moreover, we find the smallest and largest values of the selected feature, which will be used for determining the lower and upper cumulative distribution functions. Then we generate the random values of the selected feature times for every instance. In total, we have instances. The above procedure is repeated times such that the selected feature with missing values is chosen randomly in every iteration. In addition to the minimin (ACC1), minimax (ACC2), and direct (ACC3) strategies, we generate random values of the “missing” feature in accordance with the normal distribution and compute the corresponding accuracy measure ACC4. By using the RBF kernels and the cost parameter , we get the ACC measures for different values of , which are shown in Table 4. These measures are mean values of the corresponding ACCs computed for every iteration.
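The preparatory step for one iteration (class means and bounds of the selected feature) and the direct strategy can be sketched as follows. The function names and the toy data are hypothetical, and instances are assumed to be stored as lists of floats.

```python
def feature_summary(rows, labels, j):
    """Return ({label: mean of feature j within the class}, min of feature j, max of feature j)."""
    sums, counts = {}, {}
    for row, y in zip(rows, labels):
        sums[y] = sums.get(y, 0.0) + row[j]
        counts[y] = counts.get(y, 0) + 1
    means = {y: sums[y] / counts[y] for y in sums}
    values = [row[j] for row in rows]
    return means, min(values), max(values)


def direct_imputation(rows, labels, j, means):
    """Direct strategy: replace feature j of every instance by the mean of its class."""
    return [row[:j] + [means[y]] + row[j + 1:] for row, y in zip(rows, labels)]


# Toy usage: feature 1 plays the role of the "missing" feature.
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 10.0]]
labels = [-1, -1, 1]
means, lo, hi = feature_summary(rows, labels, 1)
imputed = direct_imputation(rows, labels, 1, means)
```

The returned means and bounds are exactly the partial information the method assumes to be available; the original values of the selected feature are not used for training after this step.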

One can see from Table 4 that the proposed minimax strategy (ACC2) outperforms the direct strategy and the normal distribution imputation procedure for some real datasets. Of course, there are datasets for which the measures ACC3 or ACC4 are larger than ACC2. The experiments with synthetic data show that the minimax strategy provides better results when the distribution of the feature values is not symmetric and its mean value cannot replace the corresponding random values. With real data, however, it is difficult to determine clear conditions for using the proposed model, because these conditions depend directly on the probability distribution of the feature values. When we do not have this information, the proposed method should be used jointly with other models dealing with missing data.

#### 8. Conclusion

A classification problem under partial information about some features in the form of conditional expectations or mean values of features for every class has been studied in the paper. Its solution is based on the pessimistic (minimax) and optimistic (minimin) decision strategies.

What are the main advantages of the proposed method? First, the classification algorithm fully exploits the available information in the form of mean values of some features and the bounds of these features. At the same time, it does not employ any additional information which may be unjustified and incorrect, nor does it use additional assumptions which may lead to incorrect prediction results. Second, the proposed method has a strong probabilistic background, and this fact allows us to use it in arbitrary applications where the initial information is scarce. Third, the method exploits the well-known minimax and minimin strategies, which have a clear interpretation. A cautious decision strategy as an intermediate case between pessimistic and optimistic strategies with a predefined caution parameter can also be studied in the same way. However, this is a direction for further research. Fourth, the method is reduced to the SVM. This fact allows us to construct nonlinear classification models simply by using suitable kernels. Fifth, the method allows us to reduce the classification problem to the standard form. This implies that standard software can be applied for its implementation. The algorithm for computing the optimal parameters of every classification model can be easily implemented with standard functions of the statistical software package R or by using the well-known software library LIBSVM (a library for support vector machines) [37].

The numerical examples have illustrated that the minimax classifiers can provide more accurate results in many cases in spite of their over-conservative decisions. At the same time, the given experiments can be viewed as a preliminary study of the proposed framework for applying the imprecise models to classification problems with missing values. An additional study has to be carried out in order to fully determine when the proposed classifiers outperform the available classification models.

One can also see from the paper that the Monte Carlo technique is a versatile tool for dealing with partial information. Various classification problems under different types of partial and unreliable information could be solved in the same way. A detailed analysis of the corresponding classification models is another direction for further research.

At the same time, it is well known that one possible limitation of Monte Carlo methods is that the computational effort grows in proportion to the number of samples. This implies that learning from large datasets may become a hard computational problem. However, first of all, the minimax strategy should be used when the number of instances in the training set is rather small, in order to provide robust classification. When the training set consists of a large number of instances, other models might give better results. Second, variance reduction techniques can be applied to the classification procedures to decrease the computational effort. This is also a topic of further research.

The proposed method can also be extended to the case of interval-valued mean values of unobserved features. In this case, the lower and upper CDFs are determined by the lower and upper mean values of features.

#### Acknowledgment

The authors would like to express their appreciation to the anonymous referees whose very valuable comments have improved the paper.

#### References

- R. Alaiz-Rodríguez, A. Guerrero-Curieses, and J. Cid-Sueiro, “Minimax regret classifier for imprecise class distributions,” *Journal of Machine Learning Research*, vol. 8, pp. 103–130, 2007.
- R. Alaiz-Rodríguez, A. Guerrero-Curieses, and J. Cid-Sueiro, “Improving classification under changes in class and within-class distributions,” in *Systems: Computational and Ambient Intelligence*, J. Cabestany, F. Sandoval, A. Prieto, and J. Corchado, Eds., vol. 5517 of *Lecture Notes in Computer Science*, pp. 122–130, Springer, Berlin, Germany, 2009.
- S. Kotsiantis, D. Kanellopoulos, and P. Pintelas, “Handling imbalanced datasets: a review,” *GESTS International Transactions on Computer Science and Engineering*, vol. 30, no. 1, pp. 25–36, 2006.
- G. M. Weiss, “Mining with rarity: a unifying framework,” *ACM SIGKDD Explorations Newsletter*, vol. 6, no. 1, pp. 7–19, 2004.
- B. Scholkopf and A. J. Smola, *Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond*, The MIT Press, Cambridge, Mass, USA, 2002.
- D. B. Rubin, “Multiple Imputation after 18+ Years,” *Journal of the American Statistical Association*, vol. 91, no. 434, pp. 473–489, 1996.
- M. Saar-Tsechansky and F. Provost, “Handling missing values when applying classification models,” *Journal of Machine Learning Research*, vol. 8, pp. 1625–1657, 2007.
- G. E. A. P. A. Batista and M. C. Monard, “An analysis of four missing data treatment methods for supervised learning,” *Applied Artificial Intelligence*, vol. 17, no. 5-6, pp. 519–533, 2003.
- A. Farhangfar, L. Kurgan, and J. Dy, “Impact of imputation of missing values on classification error for discrete data,” *Pattern Recognition*, vol. 41, no. 12, pp. 3692–3705, 2008.
- S. Garcia and F. Herrera, “An extension on “Statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons,” *Journal of Machine Learning Research*, vol. 9, pp. 2677–2694, 2008.
- J. Grzymala-Busse and M. Hu, “A comparison of several approaches to missing attribute values in data mining,” in *Rough Sets and Current Trends in Computing*, pp. 378–385, Springer, Berlin, Germany, 2001.
- J. Luengo, S. Garcia, and F. Herrera, “On the choice of the best imputation methods for missing values considering three groups of classification methods,” *Knowledge and Information Systems*, vol. 32, no. 1, pp. 77–108, 2012.
- J. Ning and P. E. Cheng, “A comparison study of nonparametric imputation methods,” *Statistics and Computing*, vol. 22, no. 1, pp. 273–285, 2012.
- S. Destercke, D. Dubois, and E. Chojnacki, “Unifying practical uncertainty representations. II: clouds,” *International Journal of Approximate Reasoning*, vol. 49, no. 3, pp. 664–677, 2008.
- S. Ferson, V. Kreinovich, L. Ginzburg, D. S. Myers, and K. Sentz, “Constructing probability boxes and Dempster-Shafer structures,” Tech. Rep. SAND2002-4015, Sandia National Laboratories, January 2003.
- C. P. Robert, *The Bayesian Choice*, Springer, New York, NY, USA, 1994.
- L. V. Utkin, “Regression analysis using the imprecise Bayesian normal model,” *International Journal of Data Analysis Techniques and Strategies*, vol. 2, no. 4, pp. 356–372, 2010.
- L. V. Utkin and F. P. A. Coolen, “On reliability growth models using Kolmogorov-Smirnov bounds,” *International Journal of Performability Engineering*, vol. 7, no. 1, pp. 5–19, 2011.
- L. V. Utkin and Y. A. Zhuk, “A machine learning algorithm for classification under extremely scarce information,” *International Journal of Data Analysis Techniques and Strategies*, vol. 4, no. 2, pp. 115–133, 2012.
- J. O. Berger and G. Salinetti, “Approximations of Bayes decision problems: the epigraphical approach,” *Annals of Operations Research*, vol. 56, no. 1, pp. 1–13, 1995.
- J. Shao, “Monte Carlo approximations in Bayesian decision theory,” *Journal of the American Statistical Association*, vol. 84, no. 407, pp. 727–732, 1989.
- A. Farhangfar, L. Kurgan, and J. Dy, “Impact of imputation of missing values on classification error for discrete data,” *Pattern Recognition*, vol. 41, no. 12, pp. 3692–3705, 2008.
- D. Williams, X. Liao, Y. Xue, L. Carin, and B. Krishnapuram, “On classification with incomplete data,” *IEEE Transactions on Pattern Analysis and Machine Intelligence*, vol. 29, no. 3, pp. 427–436, 2007.
- R. Esposito and L. Saitta, “Monte Carlo theory as an explanation of bagging and boosting,” in *Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI '03)*, pp. 499–504, 2003.
- P. Sollich, “Bayesian methods for support vector machines: evidence and predictive class probabilities,” *Machine Learning*, vol. 46, no. 1–3, pp. 21–52, 2002.
- J. E. Hurtado, “An examination of methods for approximating implicit limit state functions from the viewpoint of statistical learning theory,” *Structural Safety*, vol. 26, no. 3, pp. 271–293, 2004.
- J. E. Hurtado and D. A. Alvarez, “Classification approach for reliability analysis with stochastic finite-element modeling,” *Journal of Structural Engineering*, vol. 129, no. 8, pp. 1141–1149, 2003.
- A. Frank and A. Asuncion, *UCI Machine Learning Repository*, 2010.
- V. Vapnik, *Statistical Learning Theory*, Wiley, New York, NY, USA, 1998.
- P. Walley, “Measures of uncertainty in expert systems,” *Artificial Intelligence*, vol. 83, no. 1, pp. 1–58, 1996.
- V. P. Kuznetsov, *Interval Statistical Models*, Radio and Communication, Moscow, Russia, 1991 (in Russian).
- P. Walley, *Statistical Reasoning with Imprecise Probabilities*, Chapman and Hall, London, UK, 1991.
- S. Ferson, L. Ginzburg, and R. Akcakaya, “Whereof one cannot speak: when input distributions are unknown,” *Applied Biomathematics Report*, 2001, http://www.ramas.com/whereof.pdf.
- A. N. Tikhonov and V. Y. Arsenin, *Solution of Ill-Posed Problems*, W.H. Winston, Washington, DC, USA, 1977.
- T. Evgeniou, T. Poggio, M. Pontil, and A. Verri, “Regularization and statistical learning theory for data analysis,” *Computational Statistics and Data Analysis*, vol. 38, no. 4, pp. 421–432, 2002.
- R Development Core Team, *R: A Language and Environment for Statistical Computing*, R Foundation for Statistical Computing, Vienna, Austria, 2005.
- C.-C. Chang and C.-J. Lin, LIBSVM: a library for support vector machines, 2001, http://www.csie.ntu.edu.tw/~cjlin/libsvm/.