Abstract

We propose a novel online coregularization framework for multiview semi-supervised learning based on the notion of duality in constrained optimization. Using the weak duality theorem, we reduce online coregularization to the task of incrementally increasing the dual function. We demonstrate that the existing online coregularization algorithms in previous work can be viewed as an approximation of our dual ascending process using gradient ascent. New algorithms are derived based on the idea of ascending the dual function more aggressively. For practical purposes, we also propose two sparse approximation approaches for the kernel representation to reduce the computational complexity. Experiments show that our derived online coregularization algorithms achieve risk and accuracy comparable to offline algorithms while consuming less time and memory. In particular, our online coregularization algorithms are able to deal with concept drift while maintaining a small error rate. This paper paves the way for the design and analysis of online coregularization algorithms.

1. Introduction

Semi-supervised learning (S2L) is a relatively new subfield of machine learning which has become a popular research topic throughout the last two decades [16]. Different from standard supervised learning (SL), the S2L paradigm learns from both labeled and unlabeled examples. In this paper, we investigate online semi-supervised learning (OS2L) problems with multiple views, which have four features: data is abundant, but the resources to label it are limited; data arrives in a stream and cannot all be stored; the target functions in each view agree on the labels of most examples (compatibility assumption); and the views are independent given the labels (independence assumption).

OS2L algorithms operate in a sequence of consecutive rounds. On each round, the learner is given a training example and is required to predict its label if the example is unlabeled. To label the examples, the learner uses a prediction mechanism which builds a mapping from the set of examples to the set of labels. The quality of an OS2L algorithm is measured by the cumulative loss it incurs along its run. The challenge of OS2L is that we do not observe the true labels of unlabeled examples to evaluate the performance of the prediction mechanism. Thus, if we want to update the prediction mechanism, we have to rely on indirect forms of feedback.

Many OS2L algorithms have been proposed in recent years (see the survey in [7, 8]). A popular idea is to define an instantaneous risk function and decrease its value in an online manner to avoid optimizing the primal semi-supervised problem directly [9–11]. References [12–14] treat the OS2L problem as online semi-supervised clustering, in which there are must-link pairs (which must be in the same cluster) and cannot-link pairs (which must not be in the same cluster), but the effectiveness of these methods is often affected by “bridge points” (see the survey in [15]).

Coregularization [2, 16] is a method of improving the generalization accuracy of SVMs [17] by using unlabeled data in different views. Multiple hypotheses are trained in the coregularization framework and are required to make similar predictions on any given unlabeled example. Moreover, theoretical investigations demonstrate that the coregularization approach reduces the Rademacher complexity by an amount that depends on the “distance” between the views [18, 19]. Unfortunately, basic offline coregularization algorithms are still unable to deal with long-running, large-scale OS2L problems directly because of time and memory constraints.

In this paper, we introduce a novel online coregularization framework for the design and analysis of new OS2L algorithms. Since decreasing the primal coregularization objective function is impossible before all the training examples have been obtained, we propose a Fenchel conjugate transform that allows the dual function to be increased incrementally. The existing online coregularization algorithms in previous work can be viewed as an approximation of this dual ascending process based on gradient ascent. New online coregularization algorithms are derived based on the idea of ascending the dual function more aggressively. We also discuss the applicability of our framework to settings where the target hypothesis is not fixed but drifts with the sequence of examples.

To the best of our knowledge, the closest prior work is that of de Ruijter and Tsivtsivadze [10]. Their method defines an instantaneous regularized risk function using part of the examples to avoid optimizing the primal coregularization problem directly. The learning process is based on convex programming with stochastic gradient descent in kernel space. The update scheme of this work can also be derived from our online coregularization framework.

The rest of the paper is organized as follows. In Section 2 we begin with a primal view of the multiview semi-supervised learning problem based on coregularization. In Section 3 our new framework for designing and analyzing online coregularization algorithms is introduced. Next, in Section 4, we demonstrate that the existing online coregularization algorithms can be derived from our framework using gradient ascent. New online coregularization algorithms are derived based on aggressive dual ascending procedures in Section 5. Experiments and analyses are presented in Section 6. In Section 7, conclusions and possible extensions of our work are given.

2. Basic Problem Setting

Our notation and problem setting are formally introduced in this section. Italic lowercase letters refer to scalars (e.g., and ), and bold letters refer to vectors (e.g., and ). denotes the th training example, where is seen in views with (), is its label, and is a flag indicating whether the label can be seen. If , the example is labeled, and if , the example is unlabeled. The hinge function is denoted by . denotes the inner product between vectors and .

In most cases, the hypotheses used for prediction come from a parameterized set of hypotheses, , where is a subset of a vector space. For simplicity and concreteness, we focus on semi-supervised binary classification problems with two views in this paper. Consider an input sequence , where , , and (). The goal is to learn the function pair , where and are linear classifiers in the two views, respectively. To learn max-margin decision boundaries, the multiview S2L problem based on coregularization [16] can be written as minimizing: where and are trade-off parameters for the complexity terms in the two views, is a real-valued parameter that balances data fitting, is the coupling parameter that regularizes the pair towards compatibility using unlabeled data, and is the distance function which measures the difference between the predictions for the same example and constitutes the multiview coregularizer.
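For illustration, the following Python sketch evaluates one plausible linear instantiation of objective (1): per-view squared-norm complexity terms, hinge losses on labeled examples in each view, and a coupling term that penalizes disagreement between the views. All names (gamma1, gamma2, C, lam, dist) and the exact form of the loss are our own illustrative assumptions, not the paper's notation.

    import numpy as np

    def hinge(z):
        # hinge function: max(0, z)
        return np.maximum(0.0, z)

    def coreg_objective(w1, w2, X1, X2, y, labeled,
                        gamma1, gamma2, C, lam,
                        dist=lambda a, b: (a - b) ** 2):
        """One plausible linear instantiation of objective (1); names are ours."""
        f1, f2 = X1 @ w1, X2 @ w2                        # predictions in each view
        complexity = gamma1 * (w1 @ w1) + gamma2 * (w2 @ w2)
        fit = C * np.sum(labeled * (hinge(1 - y * f1) + hinge(1 - y * f2)))
        coreg = lam * np.sum(dist(f1, f2))               # multiview coregularizer
        return complexity + fit + coreg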

In previous approaches based on coregularization [16, 19], the distance function is often defined as a squared difference: In this paper, the distance function is instead defined as an absolute difference (using the norm), an idea also adopted by Szedmak and Shawe-Taylor [18] and Sun et al. [20]: Furthermore, (3) can be decomposed into two hinge functions (see Figure 1 for an illustration): In the next section, we show that, when the absolute distance function is used, the online coregularization problem can be discussed more easily and directly in the dual form of (1).
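The decomposition of the absolute distance into two hinge functions is easy to check numerically; the snippet below is our own illustration of the identity |a - b| = max(0, a - b) + max(0, b - a) on random predictions.

    import numpy as np

    def hinge(z):
        return np.maximum(0.0, z)

    f1, f2 = np.random.randn(5), np.random.randn(5)   # predictions of the two views
    # the absolute disagreement |f1 - f2| decomposes into two hinge terms
    assert np.allclose(np.abs(f1 - f2), hinge(f1 - f2) + hinge(f2 - f1))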

Denote the instantaneous loss on round as , where . Using (5), we thus obtain a simpler version of (1):

Minimizing (6) in an online manner is the problem we consider in the rest of this paper.

3. Online Coregularization by Ascending the Dual Function

In this section, we propose a unified online coregularization framework for multi-view semi-supervised binary classification problems. Our presentation reveals how the multiview S2L problem based on coregularization in Section 2 can be optimized in an online manner.

Before describing our framework, let us recall the definition of Fenchel conjugate that we use as a main analysis tool in this paper (see the appendix for more details). The Fenchel conjugate of a function is defined as

As shown in (7), the Fenchel conjugate is classically defined only for functions of a single variable. In this paper, we extend the definition of the Fenchel conjugate to multivariable functions in order to solve the online multiview S2L problem.

The Fenchel conjugate of a multivariable function () is defined as In particular, the Fenchel conjugate of hinge functions is the key to transferring coregularization from the offline setting to the online setting in this paper.

Proposition 1. Let , where , , and . The Fenchel conjugate of is

Proof. We first rewrite the as follows: Based on the definition of the Fenchel conjugate for multivariable functions, we obtain that
Since the third equality above follows from the strong max-min property, the max-min problem can be transformed into a min-max problem. If , is , and if , is ; otherwise, if and , we have .
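As a sanity check on the hinge-conjugate calculation in the scalar case, the Fenchel conjugate of the hinge loss g(w) = max(0, 1 - w) equals theta on the interval [-1, 0] and is infinite elsewhere. The snippet below (our own, not part of the paper) approximates the supremum on a finite grid.

    import numpy as np

    def hinge_loss(w):
        # scalar hinge loss g(w) = max(0, 1 - w)
        return np.maximum(0.0, 1.0 - w)

    def conjugate_on_grid(theta, w_grid):
        # g*(theta) = sup_w [theta * w - g(w)], approximated on a finite grid
        return np.max(theta * w_grid - hinge_loss(w_grid))

    w_grid = np.linspace(-50.0, 50.0, 200001)
    for theta in (-1.0, -0.5, 0.0):
        print(theta, conjugate_on_grid(theta, w_grid))   # conjugate is approximately theta
    print(conjugate_on_grid(0.5, w_grid))                # grows with the grid, i.e., unbounded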

Returning to the primal problem, we want to obtain a sequence of boundaries , ) that makes . Unfortunately, decreasing the objective function directly is impossible while the training examples arrive in a stream. To resolve this difficulty, we propose a Fenchel conjugate transform for multiview S2L problems based on coregularization.

An equivalent problem of (6) is

Using the Lagrange dual function, we can rewrite (12) by introducing a vector group : Consider the dual function where is the Fenchel conjugate of . The primal problem can then be described as maximizing the dual function, as follows:

Based on our definition of the Fenchel conjugate for multivariable functions, the Fenchel conjugate of can be rewritten as (using Proposition 1 and Lemma A.2 in the appendix) Since our goal is to maximize the dual function, we can restrict to the first case in (16). has three associated coefficient variables, namely, , and .

Based on the previous analysis, the dual function can be rewritten using a new coefficient vectors group , where ). Consider the following:

As shown in (17), our task has been transformed into a constrained quadratic programming (QP) problem. Every input training example brings a vector into the dual function. The are independent, so we can update the coefficient vectors group on each learning round to ascend the dual function incrementally. Obviously, unobserved examples have no influence on the value of the dual function in (17), since their associated coefficient variables are set to zero.

Denote the coefficient vector on round (). The update process of the coefficient vectors group on round should satisfy the following conditions:
(i) if , ;
(ii) .

The first condition means that unobserved examples have no influence on the value of the dual function, and the second means that the value of the dual function never decreases during the online coregularization process. Therefore, the dual function on round can also be written as

Based on Lemmas A.1 and A.3 in the appendix, we can obtain that each coefficient vectors group has an associated boundary vector group . On round , the associated boundaries of are

To make a summary, we propose a template online coregularization algorithm by dual ascending procedure in Algorithm 1.

INPUT: four positive scalars: , , and .
INITIALIZE: a coefficient vectors group and its associated
     decision boundary vectors , .
PROCESS: For
     Receive an example ,
     Choose a new coefficient vectors group that satisfies
       ,
     Return two new associated decision boundary vectors ,
     using (19).

Essentially, our online coregularization framework breaks the large QP associated with the primal objective function into a series of dual ascending procedures, one on each learning round. Therefore, the dual function can be ascended in an online manner.
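The template of Algorithm 1 can be summarized in code as a loop that keeps one dual coefficient vector per observed example and delegates the actual ascent step to a pluggable rule. Everything below is a schematic sketch with hypothetical names (update_rule, the three dual variables per example, the averaged two-view prediction), not the paper's implementation.

    import numpy as np

    def online_coregularization(stream, update_rule, d1, d2):
        """Schematic version of Algorithm 1.

        stream      : iterable of (x1, x2, y, labeled) tuples
        update_rule : any rule that returns new dual coefficients and boundaries
                      and never decreases the dual function
        """
        alphas = []                                    # one dual coefficient vector per example
        w1, w2 = np.zeros(d1), np.zeros(d2)            # boundary vectors of the two views
        for (x1, x2, y, labeled) in stream:
            pred = np.sign(0.5 * (x1 @ w1 + x2 @ w2))  # predict before updating
            alphas.append(np.zeros(3))                 # three dual variables per new example
            alphas, w1, w2 = update_rule(alphas, w1, w2, x1, x2, y, labeled)
            yield pred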

4. Analysis of Previous Work Based on Gradient Ascent in the Dual

In the previous section, a template algorithm framework for online coregularization was proposed based on the idea of ascending the dual function. From Algorithm 1, we can see that algorithms derived from our framework may vary in one of two ways. First, different algorithms may update different dual variables on each learning round. Second, algorithms may differ in how the chosen variables are updated to ascend the dual function.

Some online coregularization algorithms [9, 10] have been suggested in recent years. These approaches share the idea of “defining an instantaneous coregularized risk to avoid optimizing the primal coregularization problem directly.” In these works, there are two popular instantaneous coregularized risk functions, defined as The online coregularization process in these works is based on convex programming with gradient descent on the instantaneous coregularized risk function in kernel space. The step size is often defined to decay at a certain rate [11], for example, . The update process in these approaches can be summarized as
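For concreteness, the update summarized above can be sketched as a single stochastic (sub)gradient step on a linear instantaneous coregularized risk with a decaying step size. The particular risk used below (per-view hinge losses plus an absolute-difference coupling term) and all parameter names are our own assumptions; the exact risks used in [9, 10] are given in (20).

    import numpy as np

    def sgd_step(w1, w2, x1, x2, y, labeled, t,
                 gamma1, gamma2, C, lam, eta0=1.0):
        """One subgradient step on a hypothetical instantaneous coregularized risk."""
        eta = eta0 / np.sqrt(t + 1.0)                  # decaying step size
        f1, f2 = x1 @ w1, x2 @ w2
        g1, g2 = 2 * gamma1 * w1, 2 * gamma2 * w2      # complexity terms
        if labeled:
            if y * f1 < 1: g1 = g1 - C * y * x1        # hinge loss, view 1
            if y * f2 < 1: g2 = g2 - C * y * x2        # hinge loss, view 2
        s = np.sign(f1 - f2)                           # subgradient of |f1 - f2|
        g1 = g1 + lam * s * x1
        g2 = g2 - lam * s * x2
        return w1 - eta * g1, w2 - eta * g2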

In the following, we demonstrate that these algorithms can be derived from our online coregularization framework. Since the dual coefficient vectors are independent, the dual function can be ascended by updating only the associated coefficient vector of the newly arrived training example on round , which means The task on round can then be rewritten as ascending

Using a gradient ascent (GA) step on , the update process on round can be written as where is a step size.

In fact, the dual coefficient vectors in can also be updated in (23). Since has dual coefficient vectors, it is impractical to update each of them separately. We therefore introduce a new variable into (23),

From (25), we see that a gradient ascent update on amounts to multiplying all the dual coefficient vectors in by . Since every dual coefficient variable in is constrained, we also constrain . The initial value of is zero. Using gradient ascent on , we obtain that

Based on the previous analysis, the gradient ascent process of boundary vector group can be written as

As far as we know, all the existing online coregularization algorithms in previous work can be viewed as an approximation of our dual ascending process using gradient ascent in (27).

5. Deriving New Algorithms Based on Aggressive Dual Ascending (ADA) Procedures

In the previous section, we showed that the online coregularization algorithms in previous work can be derived from our framework. These algorithms lead to a conservative increase of the value of the dual function, since they only modify a single dual vector using gradient ascent on each learning round. In fact, more aggressive online coregularization algorithms can also be derived from our framework.

In this section we describe broader and, in practice, more powerful online coregularization algorithms which increase the dual function more aggressively on each learning round. The motivation for the new algorithms is as follows. Intuitively, update schemes that yield larger increases of the dual function are likely to reach the minimum of the primal objective function faster. Thus, in practice they are likely to suffer fewer mistakes.

5.1. Updating Single Dual Coefficient Vector

The update scheme described in Section 4 for increasing the dual function modifies the associated coefficient vector of the newly arrived training example using gradient ascent, and all the variables in the vector share the same step size. This simple algorithm can be enhanced by solving the following optimization problem on each learning round:

According to the type of the newly arrived example, (28) can be solved in different ways. If the newly arrived example is labeled, we have , and the task on round can be rewritten as Since , we obtain that Otherwise, if the newly arrived example is unlabeled, we have , and (28) can be rewritten as Since , we obtain that

Based on the previous analysis, the update process of boundary vector group can be written as

In contrast to the gradient approaches in Section 4, this approach ascends the dual function more aggressively. So far, our focus has been on an update that modifies a single dual coefficient vector. In fact, all the associated coefficient vectors of the examples seen so far can be updated during the online coregularization process. We now examine another update scheme, based on our online coregularization framework, that modifies multiple dual coefficient vectors on each learning round.

5.2. Updating Multiple Dual Coefficient Vectors

We now describe a new update scheme which modifies multiple dual coefficient vectors on each learning round. In practice, we can get the example set on round , and all the associated coefficient vectors of these examples can be updated to ascend the dual function.

Denote the buffer on round , where and . If , the associated coefficient vector of the th example can be updated on round . We optimize the dual function over the dual coefficient vectors whose indices belong to . Consider the following: Obviously, this more general update scheme can yield a larger increase on each learning round. If , (34) makes use of all the examples that have been observed and is thus likely to yield the largest increase of the dual function. If , (34) degenerates into the update scheme described in Section 5.1. The time complexity for solving (34) is (worst case).
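Since (34) is a box-constrained QP over the dual variables of the buffered examples, any off-the-shelf solver can be used on each round. The helper below is a generic projected gradient ascent routine for maximizing a concave quadratic over a box; it is offered only as one possible way to realize the per-round ascent in (28) or (34), not as the paper's closed-form solution.

    import numpy as np

    def box_qp_ascent(Q, b, lo, hi, x0, steps=200):
        """Maximize b^T x - 0.5 x^T Q x subject to lo <= x <= hi
        (Q symmetric positive semidefinite) by projected gradient ascent."""
        eta = 1.0 / (np.linalg.norm(Q, 2) + 1e-12)     # step size from the spectral norm
        x = np.asarray(x0, dtype=float).copy()
        for _ in range(steps):
            grad = b - Q @ x                           # gradient of the concave objective
            x = np.clip(x + eta * grad, lo, hi)        # ascend, then project onto the box
        return x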

Similar to the idea in Section 4, we can also introduce a new variable into (34); it acts as a forgetting factor [21] that downweights the contribution of observations whose indices do not belong to . When tracking changes in the data stream, recent observations are likely to be more indicative of its current state than more distant ones. Incorporating a forgetting factor into online learning algorithms is a good way to balance old and new observations. Note that () indicates that no forgetting occurs.

For practical purposes, we test two choices of to update multiple dual coefficient vectors in this paper (a code sketch of both buffering policies follows this list).
(i) Buffer-. Let the buffer size be . Buffer- replaces the oldest example in the buffer with the newly arrived example on each learning round, which means .
(ii) Buffer-. This buffering strategy replaces the oldest unlabeled point in the buffer with the incoming point while keeping labeled points. The oldest labeled point is evicted from the buffer only when the buffer is filled with labeled points.
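The two eviction policies described above can be sketched as operations on a list of (example_index, labeled) pairs, oldest first; the new example is assumed to have been appended before either function is called, and all identifiers are our own, since the buffer names' symbols were lost in the text.

    def evict_oldest(buffer, max_size):
        """First policy: when the buffer is full, always drop the oldest entry."""
        if len(buffer) > max_size:
            buffer.pop(0)
        return buffer

    def evict_oldest_unlabeled(buffer, max_size):
        """Second policy: prefer dropping the oldest unlabeled entry; drop the
        oldest labeled entry only when the buffer holds labeled entries only."""
        if len(buffer) > max_size:
            unlabeled = [i for i, (_, lab) in enumerate(buffer) if not lab]
            buffer.pop(unlabeled[0] if unlabeled else 0)
        return buffer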

Figure 2 shows the set of the associated coefficient vectors which are used to ascend the dual function on each learning round for different choices of . Essentially, different choices of construct different QP problems on each learning round.

5.3. Sparse Approximations for Kernel Representation

In practice, kernel functions are commonly used to learn a linear classifier, as in SVMs. Our online coregularization framework involves examples only through inner products of pairs of points, so the kernel function can be introduced into our framework easily. We define the kernel matrix such that can be replaced by in our framework. Therefore, we can rewrite (19) as Unfortunately, the online coregularization algorithms derived above with kernel functions have to store the example sequence up to the current round, and the stored matrix size is (worst case). For practical purposes, we present two approaches to sparsify the kernel representation of the boundaries on each learning round.
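With kernels, each view's boundary is represented by an expansion over the stored examples, so prediction reduces to evaluating kernel values against them. The sketch below shows only this evaluation step with a Gaussian kernel; the per-example weights coef stand for whatever scalars the kernelized form of (19) assigns and are left abstract here.

    import numpy as np

    def gaussian_kernel(X, Z, sigma=1.0):
        # Gaussian (RBF) kernel between the rows of X and the rows of Z
        d2 = (np.sum(X ** 2, axis=1)[:, None]
              + np.sum(Z ** 2, axis=1)[None, :]
              - 2.0 * X @ Z.T)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def kernel_predict(x_new, X_stored, coef, sigma=1.0):
        """Evaluate one view's boundary f(x) = sum_i coef_i * K(x_i, x)."""
        k = gaussian_kernel(X_stored, x_new[None, :], sigma).ravel()
        return float(coef @ k)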

Absolute Threshold. To construct a sparse representation of the boundaries, the absolute threshold approach discards the examples whose associated coefficients are close to zero. Let denote the absolute threshold. An arrived example is discarded, and is no longer used to update the boundaries in the further learning process, if . The examples whose indices are in cannot be discarded on round , since they are still used to ascend the dual function.

Maximal Coefficients (-MC). Another way to sparsify the kernel representation is to keep only the examples whose coefficients have the largest absolute values. Similar to the absolute threshold, -MC does not discard the examples in , which are used to ascend the dual function on round . With this sparse approximation, the stored matrix size on round reduces to .

The previous two sparse approximations are both motivated by the fact that examples with larger coefficients tend to exert more influence on the learned boundaries.
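Both sparsification schemes amount to pruning the set of stored examples by the magnitude of their coefficients while protecting the indices currently in the buffer. A minimal sketch, assuming a single nonnegative magnitude per stored example (e.g., the norm of its dual coefficient vector) and hypothetical names:

    import numpy as np

    def keep_above_threshold(magnitudes, protected, eps):
        """Absolute threshold: keep stored examples with magnitude >= eps,
        plus the protected ones (e.g., indices still in the buffer)."""
        keep = magnitudes >= eps
        keep[list(protected)] = True
        return np.flatnonzero(keep)

    def keep_top_k(magnitudes, protected, k):
        """Maximal coefficients: keep the k largest-magnitude examples,
        plus the protected ones."""
        top = np.argsort(-magnitudes)[:k]
        return np.array(sorted(set(top) | set(protected)))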

6. Experiments

This section presents a series of experimental results to report the effectiveness of our derived online coregularization algorithms. It is known that the performance of semi-supervised learning depends on the correctness of model assumptions. Thus, our focus is on comparing different online coregularization algorithms with multiple views, rather than different semi-supervised regularization methods.

We report experimental results on two synthetic binary classification problems and one real-world binary classification problem. The prediction function of the online coregularization algorithms is taken as the average of the prediction functions from the two views:

Based on the idea of “interested in the best performance and simply select the parameter values minimizing the error” [3], we select combinations of the parameter values on the finite grid in Table 1, which is sufficient for algorithm comparisons.

The training sequences are generated randomly from each data set. To avoid the influence of a particular training sequence, all results on each data set are averaged over five such trials (this idea is inspired by [11]), and the error rates are reported with 1 standard deviation. When using buffering strategies to update multiple dual coefficient vectors, the buffer size is fixed at 100 to avoid high computational complexity on each learning round. All experiments were implemented in MATLAB.

6.1. Two-Moons-Two-Lines Synthetic Data Set

This synthetic data set is generated similarly to the toy example used in [16, 19], in which examples in two classes appear as two moons in one view and two oriented lines in the other (see Figure 3 for an illustration). This data set contains 2000 examples, and only 5 examples of each class are labeled. A Gaussian kernel and a linear kernel are chosen for the two-moons and two-lines views, respectively. On this data set, the offline coregularization algorithm (CoLapSVM) [16] achieves an error rate of 0.

The best performance of all the online coregularization algorithms in Section 5 is presented in Table 2. We also provide some additional details during the online coregularization process.

We compare the cumulative runtime curves of online coregularization algorithms with different sparse approximation approaches in Figure 4. Online coregularization algorithms with a sparse representation outperform the basic online coregularization algorithms in terms of the growth rate: the cumulative runtime curves of the algorithms with sparse approximation scale only linearly, while the others scale quadratically.

We also compare the number of examples in the kernel representation of the boundary vectors in the two views on each learning round for different sparse approximation approaches. Figure 5 shows that only a fraction of the examples have to be stored (and computed on) when sparse approximation approaches are used. Online coregularization algorithms without sparse approximation are time- and memory-consuming, and it is intractable to apply them to long-running real-world tasks.

In Section 3, we demonstrated the relationship between the primal objective function and the dual function . Figure 6 compares the primal objective function with the dual function on the training sequence of the two-moons-two-lines data set as increases. The result shows that the two curves approach each other as the online coregularization algorithms run. The value of the dual function never decreases as increases; correspondingly, the curve of the primal function shows a downward trend with minor fluctuations. We also observe that the primal objective curve of the algorithm that updates multiple dual coefficient vectors on each learning round shows a smoother downward trend with fewer rapid fluctuations. This experiment supports the claim that increasing the dual function achieves risks comparable to those of the primal objective function.

We report the performance of on the whole two-moons-two-lines data set in Figure 7. This result shows that the boundary vector is adjusted toward a better one as the online coregularization algorithms run. Since our algorithms adjust the decision boundary vector according to the local agreement of the two views on each learning round, the error rate curve is not always decreasing along the online coregularization process. This is also the reason why our online coregularization algorithms can track changes in the data sequence (more detail in Section 6.3). As in Figure 6, we observe that the error rate of the algorithm that updates multiple dual coefficient vectors on each learning round shows a smoother downward trend with fewer rapid fluctuations.

6.2. Web Page Categorization

We applied our derived online coregularization algorithms to the WebKB text classification task studied in Blum and Mitchell [2], Sindhwani et al. [16], and Sun and Shawe-Taylor [19]. The task is to predict whether a web page is a course home page or not. The data set consists of 1051 web pages in two views (page and link) collected from the computer science department web sites of four U.S. universities: Cornell, University of Washington, University of Wisconsin, and University of Texas. The first view is the textual content of the web page itself, and the second view consists of all links pointing to the web page from other web pages. We preprocessed each view by removing stop words, punctuation, and numbers and then applied Porter’s stemming to the text [22]. This problem has an unbalanced class distribution, since there are 230 course home pages and 821 noncourse pages in total. In addition, words that occur in five or fewer documents were ignored. This resulted in 2332- and 87-dimensional vectors for the two views, respectively. Finally, document vectors were normalized to TF-IDF features (the product of term frequency and inverse document frequency) [23]. As in [19], we randomly label 3 course and 9 noncourse examples. In this experiment, the linear kernel is used for both views.
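The described preprocessing can be reproduced roughly with standard text tools; the snippet below is a sketch using scikit-learn (a library choice of ours, not the paper's): English stop-word removal, ignoring terms that appear in five or fewer documents, TF-IDF weighting, and L2-normalized document vectors, with a separate vectorizer per view. Porter stemming would be applied to the raw text beforehand (e.g., via nltk).

    from sklearn.feature_extraction.text import TfidfVectorizer

    # One vectorizer per view so the page and link views keep separate vocabularies.
    page_vectorizer = TfidfVectorizer(stop_words="english", min_df=6, norm="l2")
    link_vectorizer = TfidfVectorizer(stop_words="english", min_df=6, norm="l2")
    # page_texts / link_texts are hypothetical lists of (already stemmed) documents:
    # X_page = page_vectorizer.fit_transform(page_texts)
    # X_link = link_vectorizer.fit_transform(link_texts)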

In this data set, the offline coregularization algorithms (CoLapSVM) [16] achieve an error rate of 6.32%. In Table 3, we report the best performance of all the online coregularization algorithms on the web page data set.

6.3. Rotating Two-Moons-Two-Lines Synthetic Data Set

When the underlying distributions, both and , change during the course of learning, the algorithms are expected to track the changes in the data sequence. In this subsection, we test the applicability of our framework to settings where the target hypotheses in different views are not fixed but rather drift with the sequence of examples. To demonstrate that our online coregularization algorithms can handle concept drift, we perform our experiments on a rotating two-moons-two-lines data sequence. This data set contains 8000 examples, and only 1% of the examples in each class are labeled. Figure 8 shows that the two-moons-two-lines data set rotates smoothly during the sequence, and the target boundaries in the two views drift with the sequence of examples. In the rotating two-moons-two-lines data set, points change their true labels during the sequence, and every stationary decision boundary will have an error rate of approximately 50%. A Gaussian kernel and a linear kernel are chosen for the rotating two-moons and rotating two-lines views, respectively.

In Table 4, we report the best performance of all the online coregularization algorithms on the rotating two-moons-two-lines data sequence. In this experiment, we also discuss the effect of the buffer size on tracking the changes in the data sequence.

Obviously, when tracking the changes in the rotating two-moons-two-lines data sequence, it is likely that recent examples will be more indicative of the boundaries than more distant ones. Buffer- updates the boundaries using the most recent examples, which is also the reason why ADA (Buffer-) performs better than the other online coregularization algorithms. We report the error rates of ADA (Buffer-) with different buffer sizes on the rotating two-moons-two-lines synthetic data sequence in Figure 9. This experiment illustrates that a suitable size of Buffer- is able to adapt to the changing sequence and maintain a small error rate.

7. Conclusion and Further Discussion

In this paper we presented an online coregularization framework based on the notion of ascending the dual function. We demonstrated that the existing online coregularization algorithms in previous work can be viewed as an approximation of our dual ascending process using gradient ascent. New online coregularization algorithms were derived based on aggressive dual ascending procedures. For practical purposes, we proposed two sparse approximation approaches for the kernel representation to reduce the computational complexity. Experiments showed that our online coregularization algorithms can adjust the boundary vectors along the input sequence and achieve risk and error rates comparable to offline algorithms. In particular, our online coregularization algorithms can handle settings where the target boundaries are not fixed but rather drift with the sequence of examples in different views.

There are many interesting questions remaining in the online semi-supervised learning setting. For instance, we plan to study new online learning algorithms for other semi-supervised learning models. Another direction is how to choose effective combinations of parameter values more intelligently during the online coregularization process.

Appendix

Fenchel Conjugate

The Fenchel conjugate of a function is defined as Since is defined as a supremum of linear functions, it is convex. Here, we state a few lemmas on the Fenchel conjugate that we use as theoretical tools in this paper. More details can be found in [24].

Lemma A.1. Let be a closed and convex function, and let be its differential set at . Then, for all , we have .

Proof. Since and is closed and convex, we know that for all . Equivalently, The right-hand side of the previous equation equals , and thus The assumption that is closed and convex implies that is the Fenchel conjugate of . Thus, Combining the two inequalities, we have

Lemma A.2. Let , where for all , and . The Fenchel conjugate of is

Proof. We first rewrite the as follows: where for all . Based on the definition of the Fenchel conjugate, we obtain that Since the third equality above follows from the strong max-min property, the max-min problem can be transformed into a min-max problem. If , is ; otherwise, if , we have .

Lemma A.3. Let be any norm on , and let with . Then where is the dual norm of . The domain of is also . For example, if then since norm is dual to itself.
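The self-dual case in Lemma A.3 is easy to verify numerically: for f(w) = 0.5 ||w||^2 the conjugate is f*(theta) = 0.5 ||theta||^2. The check below is ours; it includes the true maximizer w = theta among the candidate points so that the finite maximum matches the closed form exactly.

    import numpy as np

    rng = np.random.default_rng(0)
    theta = rng.standard_normal(3)
    # candidates for the supremum in f*(theta) = sup_w [<theta, w> - 0.5 ||w||^2];
    # the true maximizer w = theta is included explicitly
    W = np.vstack([rng.standard_normal((10000, 3)) * 5.0, theta])
    conj = np.max(W @ theta - 0.5 * np.sum(W ** 2, axis=1))
    print(np.isclose(conj, 0.5 * theta @ theta))        # True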

Lemma A.4. Let be a function, and let be its Fenchel conjugate. For and , the Fenchel conjugate of is .