Computational Intelligence and Neuroscience

Volume 2015, Article ID 973696, 17 pages

http://dx.doi.org/10.1155/2015/973696

## Optimism in Active Learning

^{1}CentraleSupélec, MaLIS Research Group, 57070 Metz, France
^{2}GeorgiaTech-CNRS UMI 2958, 57070 Metz, France
^{3}Université de Lille-CRIStAL UMR 9189, SequeL Team, 59650 Villeneuve d'Ascq, France
^{4}Institut Universitaire de France (IUF), 75005 Paris, France

Received 15 April 2015; Accepted 12 August 2015

Academic Editor: Francesco Camastra

Copyright © 2015 Timothé Collet and Olivier Pietquin. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Active learning is the problem of interactively constructing the training set used in classification in order to reduce its size. Ideally, one would successively add the instance-label pair that most decreases the classification error. However, the effect of adding a pair is not known in advance; it can only be estimated from the pairs already in the training set. The online minimization of the classification error therefore involves a tradeoff between exploration and exploitation. This is a common problem in machine learning, for which the multiarmed bandit framework, using the approach of Optimism in the Face of Uncertainty, has proven very efficient in recent years. This paper introduces three algorithms for the active learning problem in classification using Optimism in the Face of Uncertainty. Experiments on synthetic problems and real-world datasets demonstrate that they compare favorably with state-of-the-art methods.

#### 1. Introduction

Traditional classification is a supervised learning framework in which the goal is to find the best mapping between an instance space and a label set. It is based only on the knowledge of a set of instances and their corresponding labels, called the training set. To obtain it, an expert or oracle is required to manually label each of the examples, which is expensive: this task is time consuming and may involve other kinds of resources as well. The aim of active learning [1] is to reduce the number of requests to the expert without losing performance, which is equivalent to maximizing the performance for a given number of labeled instances. This can be done by dynamically constructing the training set: each new instance presented to the expert is carefully chosen to generate the best gain in performance, and the selection is guided by all previously received labels. This is a sequential decision process [2].

However, the gain in performance due to a particular instance is not known in advance, for two reasons: first, the label given by the expert is not known before querying, and second, the true mapping is unknown. These values can nevertheless be estimated with increasing precision as the training set grows, since obtaining a good estimate of them is precisely the goal of classification. Still, little confidence can be placed in the first estimations, while later estimations may be trusted more. An instance may thus be presented to the expert either because it is believed to increase the performance of the classifier, resulting in a short-term gain, or because it will improve the estimations and help select better instances in the future, resulting in a long-term gain.

This is a very common problem in the literature, known as the exploration versus exploitation dilemma. It has been successfully addressed within the multiarmed bandit problem, introduced in [3] and surveyed in [4]. In this problem, a set of arms (choices) is considered, each of which provides a stochastic reward when pulled (selected). The distribution of rewards for an arm is initially unknown. The goal is to define a strategy for successively pulling arms which maximizes the expected reward under a finite budget of pulls. Several methods have been introduced to solve this dilemma. One of them is the Upper Confidence Bound algorithm, introduced in [5]. It uses the approach of Optimism in the Face of Uncertainty, which selects the arm whose unknown expected reward is possibly the highest. One notable advantage of these algorithms is that they come with finite-sample analyses and theoretical bounds.
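As an illustration, the Upper Confidence Bound principle can be sketched as follows. This is a minimal UCB1-style implementation, not the algorithms introduced later in this paper; the `pull` callback and the exploration constant `c` are illustrative assumptions.

```python
import math

def ucb1(pull, n_arms, budget, c=2.0):
    """Optimistic arm selection: repeatedly play the arm whose upper
    confidence bound on the unknown mean reward is highest."""
    counts = [0] * n_arms   # times each arm was pulled
    means = [0.0] * n_arms  # empirical mean reward per arm
    for t in range(1, budget + 1):
        if t <= n_arms:     # pull every arm once to initialize
            arm = t - 1
        else:               # optimism: empirical mean + confidence width
            arm = max(range(n_arms),
                      key=lambda a: means[a] + math.sqrt(c * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]  # incremental mean
    return counts, means
```

Under-explored arms have a wide confidence interval, so they are tried until their bound falls below the best empirical mean: exploration and exploitation are balanced by the same rule.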

The idea is thus to use Optimism in the Face of Uncertainty for the active learning problem in classification. To use this approach, the problem is cast into the multiarmed bandit setting. However, that setting deals with a finite number of arms, whereas in classification the instance space may be continuous. In order to adapt it to classification, the instance space is partitioned into several clusters. The goal is then to find the best mapping between the clusters and the label set, under a finite budget of queries to the expert.

At first, we study the case of independent clusters, where the label given to each cluster only depends on the samples taken in it. We present two algorithms for the online allocation of samples among clusters. In this context, we need at least one (or even two) samples in each cluster before we can start favoring one for selection. Thus, the number of clusters must not be too high. This implies using a coarse partition, which may limit the maximum performance. The choice of this partition is thus a key issue with no obvious solution.

Allowing the prediction of each cluster to depend on the samples received in others enables us to use a more refined partition. This makes the choice of the partition less critical. We thus study the case of information-sharing clusters. The adaptation of the first case to this one relies on a set of coarse partitions combined through a Committee of Experts approach. We introduce an algorithm that allocates samples in this context. This way, the number of clusters is no longer limited, and increasing it allows us to apply our algorithms to a continuous instance space. Another algorithm is introduced as an extension of the first one using a kernel.

We start with an overview of existing methods in active learning in Section 2. Then, in Sections 3–5, we describe the algorithms, handling the cases of independent clusters and information-sharing clusters. For each of these problems, we define a new loss function to be minimized, as well as the confidence interval used by our optimistic algorithms. In Section 6, we evaluate the performance of the algorithms on both synthetic problems and real-world datasets.

#### 2. Related Work

Many algorithms already exist for active learning; a survey can be found in [6]. Among them, uncertainty sampling [7] uses a probabilistic classifier (it does not truly output a probability but a score on the label) and samples where the label to give is least certain; in binary classification with labels 0 or 1, this is where the score is closest to 0.5. Query-by-committee methods [8, 9] consider the version space, or hypothesis space, as the set of all consistent classifiers (nonnoisy classification) and try to reduce it as fast as possible by sampling the most discriminating instance; they finish when only one classifier is left in the set. Extensions exist for the noisy case, either requiring more samples before eliminating a hypothesis [10] or associating a metric with the version space and trying to reduce it [11, 12]. Other algorithms use a measure of confidence in the labels currently given, such as entropy [13] or variance [14]. Finally, the expected error reduction algorithms [15–18] stem from the fact that the measure of performance is mostly the risk, and that it makes more sense to minimize it directly rather than some other indirect criterion. Our work belongs to this last category: using an optimistic approach enables us to minimize the true risk directly instead of the expected belief about it.
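For instance, the uncertainty sampling rule described above reduces to a one-liner. This is a sketch; `scores` is a hypothetical list of classifier scores for the unlabeled pool.

```python
def uncertainty_sampling(scores):
    """Return the index of the unlabeled instance whose predicted score
    is closest to 0.5, i.e. where the binary label is least certain."""
    return min(range(len(scores)), key=lambda i: abs(scores[i] - 0.5))
```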

Other methods also use Optimism in the Face of Uncertainty for active learning. In [19], the method is more related to query by committee, since it tries to select the best hypothesis from a set. It thus considers each hypothesis as an arm of a multiarmed bandit and plays them in an optimistic way. In [20], the authors study the problem of estimating uniformly well the mean values of several distributions under a finite budget. This is equivalent to active learning for regression with an independent discrete instance space. Although this algorithm may still be used on a classification problem, it is not designed for that purpose. A good estimate of the mean values does lead to a good prediction of the label; however, from the active learning point of view, the algorithm will spend effort on a precise estimation of the mean value even when this precision is of no use for the decision on the label, effort that could have been spent becoming more certain about the label to give. The importance of having an algorithm specifically adapted to classification is evaluated in Section 6.

#### 3. Materials and Methods

The classical multiarmed bandit setting deals with a finite number of arms. This is not appropriate for the general classification problem, in which the instance space may be continuous. In order to adapt this theory to active learning, we first study the case of a discrete instance space, which may come from a discretized continuous space or from originally discrete data. At first, we study the case of independent clusters, where no knowledge is shared between neighbors. After that, we improve the selection strategy by letting neighboring clusters share information. In the end, by defining clusters that each contain only one instance from the pool, with a good generalization behavior, we are able to apply this theory to continuous data. We may even define the relations between instances externally and use a kernel.

Let us define the following notations. We consider the instance space $\mathcal{X}$ and the label set $\mathcal{Y}$. In binary classification, the label set is composed of two elements; in this work, $\mathcal{Y} = \{0, 1\}$. The oracle is represented by an unknown but fixed distribution $P(y \mid x)$. The scenario considered in this work is pool-based sampling [7]. It assumes that there is a large pool of unlabeled instances available from which the selection strategy is able to pick. At each time step $t$, an active learning algorithm selects an instance $x_t$ from the pool, receives a label $y_t$ drawn from the underlying distribution, and adds the pair $(x_t, y_t)$ to the training set. This is repeated up to time $T$. The aim is to define a selection strategy that generates the best performance of the classifier at time $T$. The performance is measured with the risk, which is the mean error that the classifier would achieve by predicting labels.
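The pool-based loop just described can be sketched generically as follows. The names `oracle` and `select` are illustrative placeholders: `oracle` stands for the expert answering label queries and `select` for any selection strategy, including the optimistic ones developed later.

```python
def pool_based_active_learning(pool, oracle, select, budget):
    """Generic pool-based loop: repeatedly pick an instance from the
    pool, query the oracle for its label, and grow the training set."""
    training_set = []
    for _ in range(budget):
        x = select(pool, training_set)  # selection strategy chooses x_t
        y = oracle(x)                   # expert provides the label y_t
        training_set.append((x, y))     # add the pair to the training set
    return training_set
```

Any of the strategies discussed in this paper plugs in as `select`; the budget `T` caps the number of expert queries.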

#### 4. Independent Clusters

##### 4.1. Partition of the Instance Space

In this section, we focus on the problem of defining a selection strategy on a discrete instance space. Either the space is already discrete, or a continuous space is partitioned into several clusters. The following formulation assumes the latter case; the same formulation applies to the discrete case if *clusters* are replaced with *instances*. The instance space is thus divided into $K$ clusters. The problem is now to choose in which cluster to sample.

Let us define the partition $\{\mathcal{X}_k\}_{1 \leq k \leq K}$ with the following properties: (i) $\forall k, \mathcal{X}_k \neq \emptyset$, no cluster is empty; (ii) $\bigcup_{k=1}^{K} \mathcal{X}_k = \mathcal{X}$, the clusters cover the whole instance space; (iii) $\forall j \neq k, \mathcal{X}_j \cap \mathcal{X}_k = \emptyset$, no clusters overlap. It is important to note that the partition does not change during the progress of the algorithm.

Having discretized the instance space, we can now formalize the problem as a $K$-armed bandit. Each cluster $k$ is an arm characterized by a Bernoulli distribution with mean value $p_k$; indeed, samples taken in a given cluster can only have a value of 0 or 1. At each round, or time step, $t$, an allocation strategy selects an arm $k_t$, which corresponds to picking an instance randomly in the cluster $\mathcal{X}_{k_t}$, and receives a sample $y_t \sim \mathrm{Bernoulli}(p_{k_t})$, independently of the past samples. Let $w_k$ denote the weight of each cluster, with $\sum_{k=1}^{K} w_k = 1$. For example, in a semisupervised context using pool-based sampling, each weight is proportional to the number of unlabeled data points in the cluster, while in membership query synthesis, the weights are the sizes or areas of the clusters.
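In the pool-based semisupervised case, for example, the weights can be computed directly from the pool. This is a sketch; `pool_counts` is the hypothetical number of unlabeled points per cluster.

```python
def cluster_weights(pool_counts):
    """Weights proportional to the number of unlabeled points in each
    cluster, normalized so that they sum to one."""
    total = sum(pool_counts)
    return [c / total for c in pool_counts]
```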

Let us define the following notations: $T_k(t)$ is the number of times arm $k$ has been pulled up to time $t$, and $\hat{p}_{k,t}$ is the empirical estimate of the mean $p_k$ at time $t$.

Under this partition, the mapping of the instance space to the label set reduces to a mapping of clusters to the label set. We thus define the classifier that creates this mapping according to the samples received up to time $t$. In this section, the clusters are assumed to be independent: the label given to a cluster can only depend on the samples in this cluster. We use the naive Bayes classifier, which gives the label $\hat{y}_{k,t} = \mathrm{round}(\hat{p}_{k,t})$ to cluster $k$, where $\mathrm{round}(\cdot)$ is the round operator.
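This per-cluster rule amounts to rounding the empirical mean of the labels seen so far. A minimal sketch, in which ties at 0.5 are broken toward label 1 (an arbitrary convention, chosen here for illustration):

```python
def cluster_label(labels):
    """Independent-cluster prediction: round the empirical mean of the
    0/1 labels received in the cluster (ties at 0.5 broken toward 1)."""
    p_hat = sum(labels) / len(labels)  # empirical mean of the cluster
    return 1 if p_hat >= 0.5 else 0
```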

##### 4.2. Full Knowledge Criteria

The goal is to build an optimistic algorithm for the active learning problem. A common methodology in the Optimism in the Face of Uncertainty paradigm is to first characterize the optimal solution. We thus place ourselves in the full knowledge setting: we let the allocation strategy depend on the true value of $p_k$ for each cluster, and this defines the optimal allocation of the budget $T$. An optimistic algorithm will then estimate those values and allocate samples as closely as possible to the optimal allocation. Note that the true values of $p_k$ cannot be used by the classifier directly but only by the allocation strategy.

In the following sections, we present two full knowledge criteria: data-dependent and data-independent. In the data-independent case, the optimal allocation does not depend on the samples received so far. It can be related to one-shot active learning, as defined in [18], in which the allocation of the budget is decided before sampling any instance. In the data-dependent case, the label given by the classifier at time $t$ is also considered. This is related to fully sequential active learning, as defined in [18], where the allocation of the budget is updated after each sample. Note that in both cases, the optimistic algorithms built upon those criteria are fully sequential.

###### 4.2.1. Data-Independent Criterion

In this section, we characterize the optimal allocation of the budget depending only on the values of $p_k$ for each cluster. We want an allocation of the budget that minimizes the true risk of the classifier at time $T$. Here, the risk is based on the binary loss
$$\ell(\hat{y}, y) = \mathbb{1}\{\hat{y} \neq y\}.$$
Note that this loss is usually hard to use because of its nonconvex nature.

Using the partition $\{\mathcal{X}_k\}_{1 \leq k \leq K}$, the true risk of the classifier is the sum of the true risks in each cluster,
$$R_t = \sum_{k=1}^{K} R_{k,t}, \qquad R_{k,t} = w_k \left( \hat{y}_{k,t} \, (1 - p_k) + (1 - \hat{y}_{k,t}) \, p_k \right).$$
The risk is the mean number of misclassified instances resulting from a particular prediction of labels.

The optimal label the algorithm should assign to arm $k$ is $y_k^* = \mathrm{round}(p_k)$. Predicting $\hat{y}_{k,t} \neq y_k^*$ incurs a regret in the true risk of $w_k \, |1 - 2p_k|$. In order to define an allocation of the samples according to the values $p_k$ regardless of their estimates, the regret is expected over the samples. This gives us the following definition of the loss for classification per cluster, as the expected regret of the true risk in each cluster, where the expectation is taken over the samples:
$$\mathcal{L}_{k,t} = w_k \, |1 - 2p_k| \, \mathbb{P}\left( \hat{y}_{k,t} \neq y_k^* \right).$$

The value to be minimized by our allocation of the budget is then the global loss, the sum of the losses in each cluster:
$$\mathcal{L}_t = \sum_{k=1}^{K} \mathcal{L}_{k,t}.$$
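Since the empirical mean of a cluster is a Binomial count divided by the number of pulls, the per-cluster loss can be computed exactly by summing the binomial pmf. The sketch below is illustrative only; it assumes the tie-at-0.5 convention used for the classifier (ties rounded to 1) and the expected-regret form of the loss described above.

```python
from math import comb

def misprediction_prob(n, p):
    """P(predicted label differs from round(p)) when the empirical mean
    is an average of n Bernoulli(p) draws: sum the Binomial(n, p) pmf
    over the counts s whose rounded mean s/n gives the wrong label."""
    y_star = 1 if p >= 0.5 else 0
    prob = 0.0
    for s in range(n + 1):
        y_hat = 1 if s / n >= 0.5 else 0
        if y_hat != y_star:
            prob += comb(n, s) * p**s * (1 - p)**(n - s)
    return prob

def cluster_loss(w, p, n):
    """Expected regret of the true risk in one cluster: a wrong label
    costs |1 - 2p| in risk, weighted by the cluster weight w."""
    return w * abs(1 - 2 * p) * misprediction_prob(n, p)
```

As expected, the loss decreases (stepwise) as a cluster receives more samples, and it vanishes when $p$ is close to 0 or 1.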

The objective is now to define an allocation of the budget that minimizes this loss. However, in order to invert the loss to retrieve the allocation, as well as to derive the online allocation strategy, the losses in each cluster have to be strictly decreasing with $T_k(t)$ and convex. This is not the case with these losses. In order to get a more convenient shape, we bound them by pseudolosses; the algorithms we build aim to minimize this pseudoloss instead of the loss defined previously. The idea is thus to bound the probability $\mathbb{P}(\hat{y}_{k,t} \neq y_k^*)$. We use the fact that the empirical mean in one cluster follows a (scaled) binomial distribution, since labels are either 0 or 1. The bounds obtained this way are very tight and coincide with the original probabilities at a countably infinite number of points.

Let $F_{n,p}$ be the cumulative distribution function of a binomial distribution with parameters $(n, p)$. Then, for a cluster with $p_k < 1/2$ (so that $y_k^* = 0$),
$$\mathbb{P}\left( \hat{y}_{k,t} \neq y_k^* \right) = \mathbb{P}\left( \hat{p}_{k,t} \geq \tfrac{1}{2} \right) = 1 - F_{T_k(t),\, p_k}\left( \left\lceil \tfrac{T_k(t)}{2} \right\rceil - 1 \right),$$
and the symmetric expression holds for $p_k \geq 1/2$.

Note that the probability given above is a step function of $T_k(t)$ and is thus not strictly decreasing with $T_k(t)$. That is not convenient, as we require this condition later on. That is why we bound this probability by bounding the truncated value $\lceil T_k(t)/2 \rceil - 1$ from below by $T_k(t)/2 - 1$, with the binomial cumulative distribution function extended to real-valued arguments. Then,
$$\mathbb{P}\left( \hat{y}_{k,t} \neq y_k^* \right) \leq 1 - F_{T_k(t),\, p_k}\left( \tfrac{T_k(t)}{2} - 1 \right).$$
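A continuous extension of the binomial cdf to real-valued arguments can be sketched via the standard identity $F_{n,p}(k) = I_{1-p}(n-k,\, k+1)$, where $I_x(a, b)$ is the regularized incomplete beta function. The sketch below evaluates the integral by naive midpoint-rule integration purely for illustration; in practice a library routine for the incomplete beta function would be used.

```python
import math

def binom_cdf_cont(k, n, p, steps=20000):
    """Binomial(n, p) cdf extended to real-valued k (0 <= k < n) through
    F(k; n, p) = I_{1-p}(n-k, k+1), with the regularized incomplete beta
    I_x(a, b) computed by a midpoint-rule integration. For integer k this
    matches the usual cdf; for real k it interpolates smoothly, which is
    what makes the resulting pseudoloss strictly decreasing."""
    a, b, x = n - k, k + 1.0, 1 - p
    beta = math.gamma(a) * math.gamma(b) / math.gamma(a + b)  # B(a, b)
    h = x / steps
    # midpoint rule for the incomplete beta integral on (0, x)
    integral = sum(((i + 0.5) * h) ** (a - 1) * (1 - (i + 0.5) * h) ** (b - 1)
                   for i in range(steps)) * h
    return integral / beta
```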

Figure 1 displays this probability and the corresponding bound as functions of $T_k(t)$ for different values of $p_k$. We can see that the bound is extremely tight; its only role is to make the quantity strictly decreasing with $T_k(t)$ and convex, while retaining as much as possible the shape of the probability.