Computational Intelligence and Neuroscience
Volume 2015 (2015), Article ID 973696, 17 pages
http://dx.doi.org/10.1155/2015/973696
Research Article

Optimism in Active Learning

1CentraleSupélec, MaLIS Research Group, 57070 Metz, France
2GeorgiaTech-CNRS UMI 2958, 57070 Metz, France
3Université de Lille-CRIStAL UMR 9189, SequeL Team, 59650 Villeneuve d’Ascq, France
4Institut Universitaire de France (IUF), 75005 Paris, France

Received 15 April 2015; Accepted 12 August 2015

Academic Editor: Francesco Camastra

Copyright © 2015 Timothé Collet and Olivier Pietquin. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Active learning is the problem of interactively constructing the training set used in classification in order to reduce its size. Ideally, it would successively add the instance-label pair that decreases the classification error most. However, the effect of adding a pair is not known in advance. It can still be estimated from the pairs already in the training set. The online minimization of the classification error involves a tradeoff between exploration and exploitation. This is a common problem in machine learning for which the multiarmed bandit framework, using the approach of Optimism in the Face of Uncertainty, has proven very efficient in recent years. This paper introduces three algorithms for the active learning problem in classification using Optimism in the Face of Uncertainty. Experiments conducted on built-in problems and real world datasets demonstrate that they compare favorably to state-of-the-art methods.

1. Introduction

Traditional classification is a supervised learning framework in which the goal is to find the best mapping between an instance space and a label set. It is based only on the knowledge of a set of instances and their corresponding labels, called the training set. To obtain it, an expert or oracle is required to manually label each of the examples, which is expensive. Indeed, this task is time consuming and may also involve other kinds of resources. The aim of active learning [1] is to reduce the number of requests to the expert without losing performance, which is equivalent to maximizing the performance with a given number of labeled instances. This can be done by dynamically constructing the training set. Each new instance presented to the expert is thus carefully chosen to generate the best gain in performance. The selection is guided by all the previously received labels. This is a sequential decision process [2].

However, the gain in performance due to a particular instance is not known in advance. This is for two reasons: first, the label given by the expert is not known before querying, and second, the true mapping is unknown. Those values can nevertheless be estimated more and more precisely as the training set grows, because the goal of classification is precisely to get a good estimate of them. Still, low confidence must be placed in the first estimates, while later estimates may be trusted more. An instance may thus be presented to the expert either because it is believed to increase the performance of the classifier, resulting in a short term gain, or because it will improve the estimates and help to select better instances in the future, resulting in a long term gain.

This is a very common problem in the literature, known as the exploration versus exploitation dilemma. It has been successfully addressed in the multiarmed bandit setting, as introduced in [3] and surveyed in [4]. In this problem, a set of arms (choices) is considered, where each provides a stochastic reward when pulled (selected). The distribution of rewards for an arm is initially unknown. The goal is to define a strategy to successively pull arms, which maximizes the expected reward under a finite budget of pulls. Several methods have been introduced to solve this dilemma. One of them is the Upper Confidence Bound algorithm, introduced in [5]. It uses the approach of Optimism in the Face of Uncertainty, which selects the arm for which the unknown expected reward is possibly the highest. One notable advantage of those algorithms is that they come with finite-sample analysis and theoretical bounds.
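As an illustration of the optimistic principle, the following minimal sketch implements the widely used UCB1 index (empirical mean plus an exploration bonus) on Bernoulli arms. It is not necessarily the exact variant of [5]; the names `ucb1` and `make_bernoulli_arms`, the arm means, and the seed are assumptions made for this example.

```python
import math
import random


def ucb1(pull, n_arms, budget):
    """Run UCB1 for `budget` pulls; `pull(k)` returns a reward in [0, 1]."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, budget + 1):
        if t <= n_arms:
            k = t - 1  # pull every arm once before using the index
        else:
            # Optimistic index: empirical mean plus an exploration bonus.
            k = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = pull(k)
        counts[k] += 1
        sums[k] += reward
    return counts


def make_bernoulli_arms(means, seed=0):
    """Hypothetical environment: arm k pays 1 with probability means[k]."""
    rng = random.Random(seed)
    return lambda k: 1.0 if rng.random() < means[k] else 0.0
```

Calling `ucb1(make_bernoulli_arms([0.2, 0.8]), n_arms=2, budget=500)` returns the number of pulls per arm; most of the budget is expected to go to the second arm, while the first keeps being tried occasionally because its bonus grows when it is neglected.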

The idea is thus to use Optimism in the Face of Uncertainty for the active learning problem in classification. To use this approach, the problem is cast in the multiarmed bandit setting. However, this setting deals with a finite number of arms, whereas in classification the instance space may be continuous. In order to adapt it to classification, the instance space is partitioned into several clusters. The goal is thus to find the best mapping between the clusters and the label set, under a finite budget of queries to the expert.

At first, we study the case of independent clusters, where the label given to each cluster only depends on the samples taken in it. We present two algorithms capable of the online allocation of samples among clusters. In this context, we need at least one (or even two) samples in each cluster in order to start favoring one for selection. Thus, the number of clusters must not be too high. This implies using a coarse partition, which may limit the maximum performance. The choice of this partition is thus a key issue which has no obvious solution.

Allowing the prediction of each cluster to depend on the samples received in others enables us to use a more refined partition. This makes the choice of the partition less critical. We thus study the case of information sharing clusters. The adaptation of the first case to this one goes through the use of a set of coarse partitions combined by using a Committee of Experts approach. We introduce an algorithm that allocates samples in this context. Doing so, the number of clusters is not limited anymore, and increasing it allows us to apply our algorithms on a continuous instance space. Another algorithm is introduced as an extension of the first one using a kernel.

We start with an overview of the existing methods in active learning in Section 2. Then, in Sections 3–5, we describe the algorithms. We handle the cases of independent clusters and information sharing clusters. For each of these problems, we define a new loss function that has to be minimized. We also define the confidence interval used by our optimistic algorithms. In Section 6, we evaluate the performance of the algorithms on both built-in problems and real world datasets.

2. Related Work

Many algorithms already exist for active learning. A survey of those methods can be found in [6]. Among them, uncertainty sampling [7] uses a probabilistic classifier (it does not truly output a probability but a score on the label) and samples where the label to give is least certain. In binary classification with labels 0 or 1, this is where the score is closest to 0.5. Query by committee methods [8, 9] consider the version space, or hypothesis space, as the set of all consistent classifiers (noise-free classification) and try to reduce it as fast as possible by sampling the most discriminating instance. The procedure finishes when only one classifier is left in the set. Extensions exist for the noisy case, either by requiring more samples before eliminating a hypothesis [10] or by associating a metric to the version space and trying to reduce it [11, 12]. Other algorithms use a measure of confidence for the labels currently given, such as entropy [13] or variance [14]. Finally, the expected error reduction algorithms [15–18] come from the fact that the measure of performance is mostly the risk and that it makes more sense to minimize it directly rather than some other indirect criterion. Our work belongs to this last category. Using an optimistic approach enables us to directly minimize the true risk instead of the expected belief about it.

Other methods also use Optimism in the Face of Uncertainty for active learning. In [19], the method is more related to query by committee since it tries to select the best hypothesis from a set. It thus considers each hypothesis as an arm of a multiarmed bandit and plays them in an optimistic way. In [20], the authors study the problem of estimating uniformly well the mean values of several distributions under a finite budget. This is equivalent to the problem of active learning for regression with an independent discrete instance space. Although this algorithm may still be used on a classification problem, it is not designed for that purpose. Indeed, a good estimate of the mean values leads to a good prediction of the label. However, from the active learning point of view, it will spend effort to be precise on the estimation of the mean value even if this precision is of no use for the decision of the label. Efforts could have been spent to be more certain about the label to give. The importance of having an algorithm specifically adapted to classification is evaluated in Section 6.

3. Materials and Methods

The classical multiarmed bandit setting deals with a finite number of arms. This is not appropriate for the general classification problem in which the instance space may be continuous. In order to adapt this theory to active learning, we must first study the case of a discrete instance space, which may come from a discretized continuous space or from originally discrete data. At first, we study the case of independent clusters, where no knowledge is shared between neighbors. After that, we improve the selection strategy by letting neighboring clusters share information. Finally, by defining clusters that each contain only one instance from the pool, with a good generalization behavior, we are able to apply this theory to continuous data. We may even define the relations between instances externally and use a kernel.

Let us define the following notations. We consider the instance space $\mathcal{X}$ and the label set $\mathcal{Y}$. In binary classification, the label set is composed of two elements, in this work $\mathcal{Y} = \{0, 1\}$. The oracle is represented by an unknown but fixed distribution of the label given the instance. The scenario considered in this work is pool-based sampling [7]. It assumes that there is a large pool of unlabeled instances available from which the selection strategy is able to pick. At each time step $t$, an active learning algorithm selects an instance from the pool, receives a label drawn from the underlying distribution, and adds the pair to the training set. This is repeated up to time $T$. The aim is to define a selection strategy that generates the best performance of the classifier at time $T$. The performance is measured with the risk, which is the mean error that the classifier would achieve by predicting labels.
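The pool-based scenario described above can be sketched as a simple loop. This is only an illustration: the function name `pool_based_active_learning`, the callables `oracle` and `select`, and the choice to remove a queried instance from the pool are assumptions, not details taken from the paper.

```python
def pool_based_active_learning(pool, oracle, select, budget):
    """Build a training set by querying at most `budget` labels from `pool`.

    oracle(x)                      -> label of instance x (the expert)
    select(unlabeled, training)    -> index in `unlabeled` of the next query
    """
    unlabeled = list(pool)
    training_set = []
    for _ in range(min(budget, len(unlabeled))):
        # The strategy sees the remaining pool and all labels received so far.
        i = select(unlabeled, training_set)
        x = unlabeled.pop(i)  # assumption: an instance is queried at most once
        y = oracle(x)         # label drawn from the unknown distribution
        training_set.append((x, y))
    return training_set
```

The whole difficulty of active learning lies in the `select` callable; a uniformly random choice gives the passive-learning baseline.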

4. Independent Clusters

4.1. Partition of the Instance Space

In this section, we focus on the problem of defining a selection strategy with a discrete instance space. Either the space is already discrete or a continuous space is partitioned into several clusters. The following formulation assumes the latter case; otherwise, the same formulation applies for the discrete case if clusters are replaced with instances. The instance space is thus divided into clusters. The problem is now to choose in which cluster to sample.

Let us define the partition of the instance space into $K$ clusters with the following properties: (i) no cluster is empty; (ii) the clusters cover the whole instance space; (iii) no clusters overlap. It is important to note that the partition does not change during the progress of the algorithm.

Having discretized the instance space, we can now formalize the problem under a K-armed bandit setting. Each cluster $k$ is an arm characterized by a Bernoulli distribution with mean value $\mu_k$. Indeed, samples taken in a given cluster can only have a value of 0 or 1. At each round, or time step, $t$, an allocation strategy selects an arm $k_t$, which corresponds to picking an instance randomly in the cluster, and receives a sample $y_t$, independently of the past samples. Let $w_k$ denote the weight of each cluster, with $\sum_{k=1}^{K} w_k = 1$. For example, in a semisupervised context using pool-based sampling, each weight is proportional to the number of unlabeled data points in each cluster, while, in membership query synthesis, the weights are the sizes or areas of clusters.

Let us define the following notations: $n_{k,t}$ is the number of times arm $k$ has been pulled up to time $t$, and $\hat{\mu}_{k,t}$ is the empirical estimate of the mean of arm $k$ at time $t$.

Under this partition, the mapping of the instance space to the label set is limited to the mapping of clusters to the label set. We thus define the classifier that creates this mapping according to the samples received up to the current time step. In this section, the clusters are assumed to be independent. This means that the label given to a cluster can only depend on samples in this cluster. We use the naive Bayes classifier that gives the label $\lfloor \hat{\mu}_{k,t} \rceil$ to cluster $k$, where $\hat{\mu}_{k,t}$ is the empirical mean of the samples received in cluster $k$ up to time $t$ and $\lfloor \cdot \rceil$ is the round operator.
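The per-cluster labeling rule can be sketched as follows. The function name `cluster_labels`, the default label for a cluster without samples, and the convention that an empirical mean of exactly 0.5 rounds to 1 are assumptions; the text only states that the empirical mean is rounded.

```python
def cluster_labels(ones, counts, default=0):
    """Label each cluster by rounding the empirical mean of its samples.

    ones[k]   -> number of samples labeled 1 received in cluster k
    counts[k] -> total number of samples received in cluster k
    """
    labels = []
    for s, n in zip(ones, counts):
        if n == 0:
            labels.append(default)  # no sample yet: arbitrary default (assumption)
        else:
            # 2 * s >= n is the same test as s / n >= 0.5, without division.
            labels.append(1 if 2 * s >= n else 0)
    return labels
```

For example, `cluster_labels([8, 1, 0], [10, 4, 0])` labels the first cluster 1 (mean 0.8), the second 0 (mean 0.25), and falls back to the default for the third.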

4.2. Full Knowledge Criteria

The goal is to build an optimistic algorithm for the active learning problem. A common methodology in the Optimism in the Face of Uncertainty paradigm is to first characterize the optimal solution. We thus place ourselves in the Full Knowledge setting. In this setting, we let the allocation strategy depend on the true mean value of each cluster, and this defines the optimal allocation of the budget. An optimistic algorithm will then estimate those values and allocate samples as close as possible to the optimal allocation. Note that the true mean values cannot be used by the classifier directly but only by the allocation strategy.

In the following sections, we present two full knowledge criteria: data-dependent and data-independent. In the data-independent case, the optimal allocation does not depend on the samples received so far. It can be related to one-shot active learning, as defined in [18], in which the allocation of the budget is decided before sampling any instances. In the data-dependent case, the label given by the classifier at the current time step is also considered. This is related to fully sequential active learning, as defined in [18], where the allocation of the budget is updated after each sample. Note that in both cases, the optimistic algorithms built upon those criteria are fully sequential.

4.2.1. Data-Independent Criterion

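In this section, we characterize the optimal allocation of the budget depending only on the true mean value of each cluster. We want an allocation of the budget that minimizes the true risk of the classifier at the final time step. Here, the risk is based on the binary loss, which is 0 when the predicted label $\hat{y}$ equals the true label $y$ and 1 otherwise:
$$\ell(\hat{y}, y) = \begin{cases} 0 & \text{if } \hat{y} = y, \\ 1 & \text{otherwise.} \end{cases}$$
Note that this loss is usually hard to use because of its nonconvex nature.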

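Using the partition, the true risk of the classifier is the sum of the true risks in each cluster, $R = \sum_{k=1}^{K} R_k$, with $R_k = w_k \left( \mu_k \, \mathbb{1}[\hat{y}_k = 0] + (1 - \mu_k) \, \mathbb{1}[\hat{y}_k = 1] \right)$, where $w_k$ is the weight of cluster $k$, $\mu_k$ is its true mean, and $\hat{y}_k$ is the label given to it by the classifier. The risk is the mean number of misclassified instances resulting from a particular prediction of labels.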
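As a small numerical illustration of this decomposition, the sketch below computes the risk under the binary loss when every instance of a cluster receives the same label and labels in a cluster are Bernoulli with a known mean. The function names and the example values are assumptions made for illustration.

```python
def cluster_risk(weight, mu, label):
    """Risk contributed by one cluster when all its instances get `label`.

    weight -> share of the instance space covered by the cluster
    mu     -> true probability that an instance of the cluster has label 1
    """
    # Predicting 1 is wrong with probability 1 - mu; predicting 0, with mu.
    error_rate = 1.0 - mu if label == 1 else mu
    return weight * error_rate


def true_risk(weights, mus, labels):
    """Total risk: sum of the per-cluster risks."""
    return sum(cluster_risk(w, m, y) for w, m, y in zip(weights, mus, labels))
```

With two equally weighted clusters of means 0.9 and 0.2, the labels `[1, 0]` give a risk of 0.5 × 0.1 + 0.5 × 0.2 = 0.15, whereas the labels `[0, 0]` give 0.5 × 0.9 + 0.5 × 0.2 = 0.55.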

The optimal label the algorithm should assign to arm is . This incurs a regret in the true risk . In order to define an allocation of the samples according to the values regardless of their estimates, the regret is expected over all the samples. This gives us the following definition of the loss for classification per cluster, as the expected regret of the true risk in each cluster, where the expectation is taken over the samples:

The value to be minimized by our allocation of the budget is then the global loss. It is the sum of losses in each cluster:

The objective is now to define an allocation of the budget that minimizes this loss. However, in order to invert the loss to retrieve the allocation, as well as to derive the online allocation strategy, the losses in each cluster have to be strictly decreasing with the number of samples and convex. This is not the case with these losses. In order to get a more convenient shape, we bound those losses by pseudolosses. The algorithms we build aim to minimize this pseudoloss instead of the loss defined previously. The idea is thus to bound the probability that the classifier gives the wrong label to the cluster. We use the fact that the estimated mean in one subset follows a binomial distribution (labels are either 0 or 1). The bounds obtained this way are very tight and equal at a countably infinite number of points.

Let be the cumulative distribution function of a binomial distribution of parameters . Then,

Note that the probability given above is a step function of the number of samples and thus is not a strictly decreasing function of it. This is inconvenient, as we require this condition later on. That is why we bound this probability by bounding the truncated value. Then,

Figure 1 displays this probability and the corresponding bound function of for different values of . We can see that the bound is extremely tight, and its only role is to make it strictly decreasing with and convex. It still retains as much as possible the shape of the probability.

Figure 1: and its bound defined in (12).
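To see why the exact probability of giving the wrong label is awkward to work with, the following sketch computes it from the binomial distribution, together with the resulting expected regret of a cluster. It does not reproduce the bound of the paper. The function names, the tie rule (an empirical mean of exactly 0.5 is rounded to 1), and the regret expression weight × |2·mean − 1| × probability of error are assumptions derived from the definitions in the text.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); returns 0 when k < 0."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(0, k + 1))


def prob_wrong_label(n, mu):
    """Probability that rounding the mean of n >= 1 samples gives the wrong label."""
    # Label 1 is given when at least ceil(n / 2) of the n samples are 1.
    below_half = binom_cdf(math.ceil(n / 2) - 1, n, mu)
    if mu > 0.5:
        return below_half      # correct label is 1, error when label 0 is given
    return 1.0 - below_half    # correct label is 0, error when label 1 is given


def expected_regret(weight, mu, n):
    """Expected extra risk of a cluster after n samples, relative to the best label."""
    return weight * abs(2 * mu - 1) * prob_wrong_label(n, mu)
```

With a true mean of 0.7, the error probabilities for 1, 2, 3, and 4 samples are 0.3, 0.09, 0.216, and 0.0837: the sequence is not monotone in the number of samples under this tie rule, which illustrates why a strictly decreasing and convex surrogate is convenient.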

We therefore define the following pseudoloss: with being the pseudoloss in each cluster.

Due to the convex nature of , is a strictly increasing function of . Thus, it admits an inverse .

Let be the optimal number of samples to take in each subset in order to minimize under the constraint that : with such that .

This defines the theoretical optimal allocation of the budget. Since we do not know the closed form of this allocation and since an optimistic algorithm needs an online allocation criterion, we now give the online allocation criterion. It is such that an algorithm sampling at each time step the cluster that maximizes it would result in the optimal allocation of the budget.

We have seen here an optimal allocation of the budget that the optimistic algorithm defined in Section 4.3 could try to reach without the knowledge of the true mean values. The criterion we derived only depends on the values of the parameters in each cluster and not on the current labels given by the classifier. Considering them would lead to a better allocation, since the allocation in a cluster could stop when the correct label is given.

4.2.2. Data-Dependent Criterion

In this section, we show a criterion that leads to the optimal allocation of the budget depending not only on the values of in each cluster, but also on the current labels given by the classifier.

We define a new global loss that is the current regret of the true risk. The measure of performance is still the expected true risk, but the value to be minimized is preferred to be run-dependent.

In order to minimize it, the selection strategy samples the cluster for which the expected decrease of the loss would be maximum. This criterion is thus the finite difference of the loss with where is the label resulting from the sample and the expectation is taken on .

However, this is a good strategy only if this criterion is strictly increasing with . We thus study the monotonicity of this criterion. We consider sampling more instances in cluster with resulting average label . The new label given by the classifier will be .

After samples, the expected decrease of the loss is Injecting the value of , To shorten notations we use .

We know that is drawn from a binomial distribution of parameter and , thus

The criterion is not strictly increasing. In order to consider this constraint, we define another criterion which is a tight bound of the previous one. We first bound the following probabilities: Equivalently,

The criterion resulting from these bounds is strictly increasing but is not defined for every number of additional samples. Indeed, in order to change the value of the label, the estimated mean has to move to the other side of 0.5. This often requires more than one sample (e.g., if we already sampled 10 instances and 8 were labeled 1, we need at least 6 new samples to have a chance to change the label given by the classifier). In order to get a bound that is defined for every number of additional samples and strictly increasing, we make a linear interpolation between the values at the points where the bound is defined.
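The arithmetic behind the example above can be checked with a short sketch. The function name `min_samples_to_flip` is hypothetical, and the sketch assumes that the label "has a chance to change" as soon as the empirical mean can reach 0.5, which is the convention that matches the figure of 6 samples quoted in the text.

```python
def min_samples_to_flip(n, ones):
    """Smallest number of new samples after which the label could change.

    n    -> number of samples already received in the cluster
    ones -> how many of them were labeled 1
    """
    current_is_one = 2 * ones >= n
    m = 1
    while True:
        # Best case for a flip: every new sample disagrees with the current label.
        new_ones = ones if current_is_one else ones + m
        if current_is_one:
            mean_reaches_half = 2 * new_ones <= n + m
        else:
            mean_reaches_half = 2 * new_ones >= n + m
        if mean_reaches_half:
            return m
        m += 1
```

With 10 samples of which 8 are labeled 1, six new samples all labeled 0 bring the mean to 8/16 = 0.5, so `min_samples_to_flip(10, 8)` returns 6; five are not enough, since 8/15 is still above 0.5.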

We thus define the actual criterion:

The online allocation criterion is such that an algorithm sampling at each time step the cluster that maximizes it would result in the optimal allocation of the budget.

The criterion defined in this section leads to an optimal allocation of the budget that the optimistic algorithm defined in the next section could try to reach without the knowledge of the true mean values. It depends on the value of the parameters in each cluster as well as on the current estimate of this parameter by the classifier.

4.3. Included Optimism

In this section, we introduce two optimistic algorithms: OALC-DI (Optimistic Active Learning for Classification: Data Independent), which uses the data-independent criterion, and OALC-DD (Optimistic Active Learning for Classification: Data Dependent), which uses the data-dependent criterion for optimal budget allocation defined in the previous sections. Both can be described by the same core algorithm. Neither criterion can be used as currently defined for the active learning problem. Indeed, the true mean value of each cluster is not known in advance; otherwise, the correct label would be known as well. Also, one cannot directly replace those values by their estimates, as this could lead to clusters being turned down. This is a case of the exploration/exploitation tradeoff where the uncertainty about the true mean value of each cluster has to be considered. Therefore, we design an optimistic algorithm that estimates those values and samples as close as possible to the optimal allocation.

Following the Optimism in the Face of Uncertainty approach, the algorithm builds a confidence interval on the criterion to be maximized and draws the arm for which the upper bound of this interval is highest. This is equivalent to saying that it draws the arm for which the criterion is possibly the highest. As we know the shape of the distribution of the values, the confidence interval is a Bayesian Credible Interval [21], which leads to tight bounds. The Bayesian Credible Interval is relative to a probability, which allows for controlling the amount of exploration of the algorithm. The core algorithm is presented in Algorithm 1. It takes one parameter and can be derived into two algorithms depending on the criterion used.

Algorithm 1: Core algorithm.
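The listing of Algorithm 1 is not reproduced here. As a rough illustration of the loop described in the text (compute an optimistic value of the criterion for each cluster, sample the cluster where it is highest, update the counts), one possible sketch is given below. The function name, its signature, and the `upper_bound` callable are all assumptions; the actual criteria and their credible-interval bounds are those defined in the paper and are not implemented here.

```python
def optimistic_allocation(sample, upper_bound, n_clusters, budget):
    """Allocate `budget` samples, always picking the highest optimistic value.

    sample(k)                 -> 0 or 1, a label drawn in cluster k
    upper_bound(k, ones, n)   -> optimistic value of the criterion for cluster k,
                                 given `ones` labels equal to 1 among `n` samples
    """
    ones = [0] * n_clusters
    counts = [0] * n_clusters
    for _ in range(budget):
        # Optimism: choose the cluster whose criterion is possibly the highest.
        k = max(range(n_clusters), key=lambda c: upper_bound(c, ones[c], counts[c]))
        y = sample(k)
        ones[k] += y
        counts[k] += 1
    return ones, counts
```

With a placeholder bound such as `lambda k, s, n: 1.0 / (n + 1)`, which only depends on the number of samples, the loop degenerates into a round-robin allocation; plugging in a bound that also depends on the observed labels is what makes the allocation adaptive.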

Let us show how to build the Bayesian Credible Interval. As each sample is drawn from a Bernoulli distribution, the estimated means follow a binomial distribution. Beta distributions provide a family of conjugate prior probability distributions for binomial distributions. The uniform distribution is taken as the prior probability distribution, because we have no information about the true distribution. Using the Bayesian inference,

In the following means either from (16) or from (27). Obviously,

Let , then

The upper bound of the Bayesian Credible Interval is then

In this section, we have shown two optimistic algorithms that share the same core. The difference lies in the full knowledge criterion used. One depends only on the value of the parameters of the distributions. The other one depends on both the value of the parameters and the current estimates of this parameter by the classifier. Both the resulting algorithms depend only on the estimates of the parameters.
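The Bayesian ingredient used in this section can be illustrated on the cluster mean itself. With a uniform prior and `ones` labels equal to 1 among `n` Bernoulli samples, the posterior is the Beta distribution with parameters `ones + 1` and `n - ones + 1`, and a one-sided credible upper bound is one of its quantiles. The sketch below uses only the standard library; the function names, the parameter `delta`, and the use of bisection are assumptions, and the way the paper maps this interval onto its allocation criteria is not reproduced.

```python
import math


def beta_posterior_cdf(x, ones, n):
    """CDF at x of Beta(ones + 1, n - ones + 1), the posterior under a uniform prior."""
    # For integer parameters a and b, the Beta(a, b) CDF at x equals
    # P(Binomial(a + b - 1, x) >= a); here a = ones + 1 and a + b - 1 = n + 1.
    a, total = ones + 1, n + 1
    return sum(
        math.comb(total, j) * x**j * (1 - x) ** (total - j)
        for j in range(a, total + 1)
    )


def credible_upper_bound(ones, n, delta=0.05, tol=1e-9):
    """Value u such that the posterior probability of the mean exceeding u is delta."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:  # bisection on the increasing posterior CDF
        mid = (lo + hi) / 2
        if beta_posterior_cdf(mid, ones, n) < 1 - delta:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Without any sample the posterior is uniform, so the bound is simply `1 - delta`; as samples accumulate the bound tightens around the empirical mean, and a smaller `delta` yields a larger bound and therefore more exploration.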

The problem solved by these algorithms is that of finding the best label to give to several separate clusters. This separation comes from the partition of a continuous instance space. A reasonable hypothesis is that the mean values do not vary fast and that neighboring clusters have close mean values. In order to speed up learning and to increase generalization, we could estimate the mean of a cluster by also considering neighboring clusters. This is the subject of the next section.

5. Information Sharing Clusters

5.1. A Set of Partitions

The previous section introduces an active learning algorithm which is based on a partition of the instance space. Supposing this partition is given, it defines the best allocation of samples among its clusters, that is, the one that leads to the lowest true risk of the classifier also based on this partition. The best performance of the classifier still highly depends on the choice of the partition, which has no obvious solution. One way to improve the classifier's performance is to increase the number of clusters in the partition. But this slows learning, as each cluster parameter has to be estimated independently. To counter that, we allow the classifier to generalize by letting neighboring clusters share information. In order to use the same approach as before, we consider the case of a committee of partitions. Each partition estimates the parameters of its clusters independently. Then, the local prediction of the label is determined by averaging the estimates of each partition.
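The committee idea can be sketched as follows: each partition looks up the cluster containing the instance, and the per-partition empirical means are averaged before rounding. The function name, the data layout, the unweighted average, the handling of clusters without samples, and the tie rule at 0.5 are assumptions made for this illustration.

```python
def committee_predict(x, partitions, ones, counts):
    """Predict a label for x by averaging the cluster means of every partition.

    partitions[p](x)           -> index of the cluster of partition p containing x
    ones[p][c], counts[p][c]   -> samples labeled 1 and total samples in that cluster
    """
    estimates = []
    for p, assign in enumerate(partitions):
        c = assign(x)
        if counts[p][c] > 0:  # assumption: clusters without samples are skipped
            estimates.append(ones[p][c] / counts[p][c])
    if not estimates:
        return 0  # no information at all: arbitrary default (assumption)
    average = sum(estimates) / len(estimates)
    return 1 if average >= 0.5 else 0
```

For instance, with two partitions of the unit interval that split at 0.5 and at 0.25, a point at 0.3 combines the estimate of the left cluster of the first partition with that of the right cluster of the second, so it benefits from samples that fall in either of these two overlapping regions.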

Let us consider a set of partitions of the instance space, where each partition is a set of subsets with the following properties: (i) no subset is empty; (ii) the subsets cover the whole instance space; (iii) no subsets overlap. Each partition may have a different number of subsets.

These partitions may come from random forests [22] or tile coding which is a function approximation method commonly used in the field of reinforcement learning [23]. The partitions must not change during the progress of the algorithm.

We write , the average label in each cluster of each partition, and , the number of samples in subset .

Let us now define the thinnest partition, which is the partition resulting from overlapping all the partitions of the set: each of its clusters is a nonempty intersection of one subset taken from each partition.
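On a finite pool, overlapping all partitions amounts to grouping instances by the tuple of cluster indices they receive from each partition. The sketch below illustrates this; the function name `thinnest_partition` and the representation of a partition as a callable returning a cluster index are assumptions, and only cells that contain at least one pool instance are produced.

```python
from collections import defaultdict


def thinnest_partition(instances, partitions):
    """Group instances by the tuple of clusters they fall in across all partitions."""
    cells = defaultdict(list)
    for x in instances:
        # Two instances share a cell only if every partition puts them together.
        key = tuple(assign(x) for assign in partitions)
        cells[key].append(x)
    return dict(cells)
```

With partitions of the unit interval that split at 0.5 and at 0.25, the pool `[0.1, 0.3, 0.6, 0.9]` is divided into three cells: `(0, 0)` containing 0.1, `(0, 1)` containing 0.3, and `(1, 1)` containing 0.6 and 0.9.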