Computational Intelligence and Neuroscience

Volume 2018, Article ID 1853517, 13 pages

https://doi.org/10.1155/2018/1853517

## Generalization Bounds for Coregularized Multiple Kernel Learning

Department of Communication and Information Engineering, Shanghai Technical Institute of Electronics & Information, Shanghai 201411, China

Correspondence should be addressed to Xinxing Wu; ten.haey@uwgnixnix

Received 10 February 2018; Revised 26 August 2018; Accepted 13 September 2018; Published 1 November 2018

Academic Editor: Amparo Alonso-Betanzos

Copyright © 2018 Xinxing Wu and Guosheng Hu. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Multiple kernel learning (MKL), an approach to automated kernel selection, plays an important role in machine learning. Several learning-theoretic results have been established for the generalization of multiple kernel learning. However, little work has addressed multiple kernel learning in the framework of semisupervised learning. In this paper, we analyze the generalization of multiple kernel learning in the framework of semisupervised multiview learning. We apply Rademacher chaos complexity to control the performance of the candidate class of coregularized multiple kernels and obtain a generalization error bound for coregularized multiple kernel learning. Furthermore, we show that existing results on multiple kernel learning and coregularized kernel learning can be regarded as special cases of our main results.

#### 1. Introduction

Kernel-based learning carries out nonlinear machine learning tasks by means of linear ones. In real applications, selecting a suitable kernel for kernel-based learning is an important and difficult task. To this end, an approach named multiple kernel learning has been developed; it automatically chooses the best kernel from a predefined class of kernels. The earliest work on multiple kernel learning can be traced back to [1], where the authors proposed to automatically select a linear combination of candidate kernels for support vector machines by a semidefinite programming approach. The generalization of multiple kernel learning has been analyzed by many researchers [1–7]. In particular, Ying and Campbell [2] proposed a novel generalization bound based on Rademacher chaos complexity for the study of multiple kernel learning. However, the discussion in [2] was restricted to single-view supervised learning. In this paper, we employ the Rademacher chaos complexity proposed in [2] to study the generalization error of coregularized multiple kernel learning in the semisupervised multiview learning framework.

Semisupervised multiview learning is an area of machine learning in which models are trained with both labeled and unlabeled samples; the unlabeled samples help to reduce the number of labeled samples required. It supposes that the training samples can be represented by multiple views. The coregularized least squares algorithm, a semisupervised version of regularized least squares with two views, is a typical multiview learning model that uses the unlabeled samples to estimate the view incompatibility of models [8, 9]. Rosenberg [10] extended the coregularized least squares algorithm to the case of kernel cotraining, and Brefeld et al. and Rosenberg [11, 12, 13] discussed generalization bounds for kernel-based learning with multiple (or two) views in the semisupervised learning framework. However, the discussions in [11, 12, 13] assumed that the kernel used to construct the reproducing kernel Hilbert space is predefined. Therefore, their results cannot be applied to the analysis of multiple kernel learning.

In this paper, we discuss the generalization error of coregularized multiple kernel learning in the semisupervised multiview learning framework, and we show that the results in [2] and [11] can be regarded as special cases of our main results. The rest of the paper is organized as follows. In Section 2, we introduce some basic notations and definitions for later discussion. In Section 3, we discuss related research and pose the question studied in this paper. In Section 4, we present our main results. In Section 5, we prove the main results stated in Section 4. In Section 6, we compare our results with existing work and show that the results on multiple kernel learning in [2] and on coregularized kernel learning in [11] are special cases of our main results. Section 7 concludes the paper.

#### 2. Notations and Definitions

In this section, we introduce notations and definitions for later discussions:

(i) Let $\mathbb{N}$ be the set of natural numbers and $\mathbb{R}$ be the set of real numbers. Let .

(ii) Let be a probability space; that is, alone is called the sample space, is a *σ*-algebra on , and is a probability measure on . And has the structure , where and are the input space and output space, respectively. Denote as the marginal distribution on . To avoid measure-theoretic details, we simply denote as .

(iii) Let be the set of all measurable functions . Assume that is a subset of . That is, , the set is called the hypothesis class.

(iv) Let be a finite set of labeled training samples, and assume these samples are independent and identically distributed (i.i.d.) according to . A bold letter denotes a vector; for example, denotes a vector .

(v) For the sign $|\cdot|$: if *D* is a set, $|D|$ denotes the number of elements of *D*; if *D* is a function, $|D|$ denotes the absolute value of *D*.

(vi) If *A* is a matrix, we use to represent the transpose of the matrix *A*.

(vii) Let *L* be the loss function, , and the loss of *f* on a sample point is defined by or .

In learning theory, one goal is to select a function *f* in the hypothesis class that minimizes the following generalization error:

Generally speaking, the distribution in Equation (1) is unknown. Rather than minimizing , we usually minimize the empirical (training) error below:

where *S* denotes the finite set of labeled samples and .

In this paper, the main quantity we are interested in is the following uniform estimation of the difference between the generalization error and empirical error:
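For orientation, the three quantities discussed above usually take the following standard forms. This is a sketch under assumed notation: the symbols $\mathcal{E}$, $\mathcal{E}_S$, $\rho$, $\mathcal{F}$, and $z_i = (x_i, y_i)$ are chosen here for illustration and may differ from the notation of Equations (1) to (3), and the uniform deviation may be stated with or without the absolute value.

```latex
\begin{align*}
\mathcal{E}(f)   &= \mathbb{E}_{z \sim \rho}\bigl[ L(f, z) \bigr]
  && \text{(generalization error)} \\
\mathcal{E}_S(f) &= \frac{1}{m} \sum_{i=1}^{m} L(f, z_i)
  && \text{(empirical error on } S = \{z_1, \dots, z_m\}\text{)} \\
\Delta_S(\mathcal{F}) &= \sup_{f \in \mathcal{F}} \bigl| \mathcal{E}(f) - \mathcal{E}_S(f) \bigr|
  && \text{(uniform deviation over the hypothesis class)}
\end{align*}
```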

For the discussion in later sections, we introduce the following four definitions and one lemma (Definition 4 is proposed by us).

*Definition 1 (Empirical Rademacher Complexity) [2]*. Let be a class of functions . The samples , are independently drawn from the probability space . The empirical Rademacher complexity is defined as

where the random variables , are Rademacher variables, and denotes a vector .

*Definition 2 (Empirical Rademacher Chaos Complexity) [2]*. Let be a class of functions . The samples , are independently drawn from the probability space . The empirical Rademacher chaos complexity is defined as

where the random variables , are Rademacher variables, and denotes a vector .
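Both complexities can be approximated numerically for a finite function class by averaging over random sign vectors. The sketch below is illustrative only and rests on assumptions: the normalization $1/n$, the absolute value inside the supremum, and the sum over pairs $i < j$ in the chaos term follow the commonly cited form of these definitions and may differ from the exact statements in [2]; the function names and the Gaussian kernels are hypothetical.

```python
import numpy as np


def empirical_rademacher(values, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma sup_f |(1/n) sum_i sigma_i f(x_i)|.

    values has shape (n_functions, n) with values[k, i] = f_k(x_i).
    """
    rng = np.random.default_rng(seed)
    n = values.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))  # Rademacher draws
    # correlations[d, k] = (1/n) * sum_i sigma[d, i] * f_k(x_i)
    correlations = sigma @ values.T / n
    return float(np.abs(correlations).max(axis=1).mean())


def empirical_rademacher_chaos(grams, n_draws=2000, seed=0):
    """Monte Carlo estimate of
    E_sigma sup_K |(1/n) sum_{i<j} sigma_i sigma_j K(x_i, x_j)|.

    grams has shape (n_kernels, n, n), one Gram matrix per candidate kernel.
    """
    rng = np.random.default_rng(seed)
    n = grams.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    upper = np.triu(grams, k=1)  # keep only the entries with i < j
    # chaos[d, k] = (1/n) * sum_{i<j} sigma[d, i] * sigma[d, j] * K_k[i, j]
    chaos = np.einsum("di,kij,dj->dk", sigma, upper, sigma) / n
    return float(np.abs(chaos).max(axis=1).mean())


# Hypothetical usage: three Gaussian kernels on 20 random points.
rng = np.random.default_rng(1)
x = rng.normal(size=(20, 2))
sq_dist = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
grams = np.stack([np.exp(-sq_dist / (2.0 * s ** 2)) for s in (0.5, 1.0, 2.0)])

chaos_all = empirical_rademacher_chaos(grams)      # supremum over 3 kernels
chaos_one = empirical_rademacher_chaos(grams[:1])  # supremum over 1 kernel
rademacher = empirical_rademacher(grams[:, 0, :])  # functions x -> K_s(x_0, x)
```

Because both calls use the same seed, enlarging the kernel class can only increase the estimated chaos complexity, which mirrors how the size of the candidate class enters the bounds.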

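*Definition 3 (Reproducing Kernel Hilbert Space, RKHS) [14]*. The function $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a reproducing kernel of the Hilbert space $\mathcal{H}$ if and only if

(i) for any $x \in \mathcal{X}$, $K(\cdot, x) \in \mathcal{H}$;

(ii) for any $x \in \mathcal{X}$ and any $f \in \mathcal{H}$, $\langle f, K(\cdot, x) \rangle_{\mathcal{H}} = f(x)$.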

A Hilbert space of functions which possesses a reproducing kernel is called a reproducing kernel Hilbert space.

*Remark 1.* The second condition in Definition 3 is called “the reproducing property”: the value of the function *f* at the point *x* is reproduced by the inner product of *f* with $K(\cdot, x)$. From the above two conditions, for any $x, x' \in \mathcal{X}$, it is clear that $K(x, x') = \langle K(\cdot, x), K(\cdot, x') \rangle_{\mathcal{H}}$. In real applications, the solution to many reproducing kernel Hilbert space optimization problems is contained in a special subspace of the reproducing kernel Hilbert space, often known as the “span of the data”. The span of the data for a reproducing kernel Hilbert space is the linear subspace defined in Appendix A.3 on page 75 of [13]. For simplicity, we denote this linear subspace by .
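The reproducing property can be checked numerically for functions in the span of the data. The sketch below is a minimal illustration under assumptions not taken from the paper: a Gaussian kernel, six random points, and a function $f = \sum_i \alpha_i K(\cdot, x_i)$, whose inner products are computed from the Gram matrix. All names are hypothetical.

```python
import numpy as np


def gaussian_gram(a, b, bandwidth=1.0):
    """Gram matrix of the Gaussian kernel between the rows of a and b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * bandwidth ** 2))


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 2))      # data points x_1, ..., x_6
alpha = rng.normal(size=6)       # coefficients of f = sum_i alpha_i K(., x_i)
gram = gaussian_gram(x, x)

# Pointwise values: f(x_j) = sum_i alpha_i K(x_i, x_j).
f_at_x = gram @ alpha

# Inner products <f, K(., x_j)>_H = alpha^T G e_j, using
# <K(., x_i), K(., x_j)>_H = K(x_i, x_j).
basis = np.eye(6)
inner = np.array([alpha @ gram @ basis[j] for j in range(6)])

# RKHS norm of f, and the Cauchy-Schwarz consequence
# |f(x_j)| <= ||f||_H * sqrt(K(x_j, x_j)).
norm_f = np.sqrt(alpha @ gram @ alpha)
pointwise_bound = norm_f * np.sqrt(np.diag(gram))
```

The last two lines show why the RKHS norm controls pointwise values, which is the mechanism by which norm constraints on the two view functions restrict the hypothesis class.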

Let be a class of kernels. In this paper, we assume that is finite.

*Definition 4 (Subnormalized Functional with Degenerate Dimension n)*. If a loss function satisfies

then we call a subnormalized functional with degenerate dimension *n* on .

Lemma 1. *[13] Let be a reproducing kernel Hilbert space with kernel , and consider any point . If is a closed subspace containing , then the projection of f onto has the same value at x as f does. That is,*
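Lemma 1 can be illustrated in a finite-dimensional setting. The sketch below assumes a Gaussian kernel and a function $f$ built from eight hypothetical centers, then projects $f$ onto the span of $K(\cdot, x_i)$ for five other points. The projection coefficients solve $G_{xx}\alpha = (f(x_i))_i$, because $\langle f, K(\cdot, x_i) \rangle = f(x_i)$. All names and data are illustrative assumptions.

```python
import numpy as np


def gaussian_gram(a, b, bandwidth=1.0):
    """Gram matrix of the Gaussian kernel between the rows of a and b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * bandwidth ** 2))


rng = np.random.default_rng(0)
centers = rng.normal(size=(8, 2))  # f = sum_j beta_j K(., z_j)
beta = rng.normal(size=8)
points = rng.normal(size=(5, 2))   # subspace spanned by K(., x_i)

g_xx = gaussian_gram(points, points)
g_xz = gaussian_gram(points, centers)
g_zz = gaussian_gram(centers, centers)

# Values of f at the points x_i.
f_at_points = g_xz @ beta

# Orthogonal projection onto span{K(., x_i)}: the normal equations are
# G_xx alpha = (<f, K(., x_i)>)_i = (f(x_i))_i.
alpha = np.linalg.solve(g_xx, f_at_points)
proj_at_points = g_xx @ alpha

# The projection never increases the RKHS norm.
norm_f_sq = beta @ g_zz @ beta
norm_proj_sq = alpha @ g_xx @ alpha
```

The projection agrees with $f$ at every $x_i$ while having a norm no larger than that of $f$, which is why minimizers of norm-regularized objectives can be sought in the span of the data.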

For multiple kernel learning, the main task is to automatically choose a kernel K from a predefined class of kernels, and find a function from the reproducing kernel Hilbert space that is most suitable to model the given samples.

In this paper, our purpose is to minimize

over the class . Here, let and be the classes of kernels, and let and denote the reproducing kernel Hilbert spaces. *m* is the number of labeled points , and *u* is the number of unlabeled points . The symbols , and *λ* are the regularization parameters. The functions and , respectively, represent the two views, and the function is the labeled loss function, which measures the performance of and on the labeled points .

#### 3. Related Work

In [11], Rosenberg and Bartlett used Rademacher complexity to bound the coregularized kernel class in the semisupervised two-view learning framework, where the two views are two predefined reproducing kernel Hilbert spaces ( and , respectively). Take labeled points , and unlabeled points . The coregularized least squares algorithm discussed in [11] can be described as follows:

where *m* is the number of labeled points , and *u* is the number of unlabeled points . The symbols , and *λ* are the regularization parameters. The functions and , respectively, represent the two views, and the function is the labeled loss function, which measures the performance of and on the labeled points .

In [11], the final output is denoted as .

In [2], Ying and Campbell applied Rademacher chaos complexity to study the generalization of multiple kernel learning in the supervised learning framework. The multiple kernel learning model they considered is as follows:

where *m* is the number of labeled points , the function is the loss function, *λ* is the regularization parameter, and denotes the reproducing kernel Hilbert space. In Equation (13), is not predefined; it depends on the kernel chosen from the class of kernels.

In this paper, we are interested in coregularized multiple kernel learning; that is, the two reproducing kernel Hilbert spaces are not defined in advance. Our discussion is in the framework of semisupervised multiview learning. We formulate this learning problem as the following two-level minimization:

where and are the classes of kernels, and and denote the reproducing kernel Hilbert spaces. *m* is the number of labeled points , and *u* is the number of unlabeled points . The symbols , and *λ* are the regularization parameters. The functions and , respectively, represent the two views, and the function is the labeled loss function, which measures the performance of and on the labeled points .

*Remark 2.* We explain the following:

(1) Equation (14) given in this paper is different from Equation (12):

(a) The solution of Equation (14) minimizes over the class , while the solution of Equation (12) minimizes over the class .

(b) The solution of Equation (14) is obtained through two minimization steps: first, it finds the most suitable reproducing kernels for the given samples; second, it obtains the best function/model from the reproducing kernel Hilbert spaces found in the first step. In Equation (12), by contrast, one only needs to find the best function/model in reproducing kernel Hilbert spaces that are predefined.

(2) Equation (14) given in this paper is different from Equation (13):

(a) The solution of Equation (14) minimizes over the product space , while the solution of Equation (13) minimizes over the space .

(b) The objective in Equation (13) is much simpler because the analysis of Equation (13) is limited to single-view supervised learning and does not deal with unlabeled samples.

From the above discussion, we can see that the generalization analysis of Equation (14) is more meaningful and more challenging.
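The two-level structure described in this section can be sketched in code. The example below is a minimal illustration under explicit assumptions, none of which are confirmed by the text: squared loss on the average of the two view functions, a disagreement penalty averaged over the unlabeled points, finite candidate sets of Gaussian kernels, and functions expanded over all $m + u$ points. The names `coreg_objective`, `solve_inner`, and `coreg_mkl` are hypothetical.

```python
import itertools
import numpy as np


def gaussian_gram(a, b, bandwidth):
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * bandwidth ** 2))


def coreg_objective(theta, g1, g2, y, m, gamma1, gamma2, lam):
    """Assumed objective: labeled squared loss of the averaged predictor,
    plus RKHS norm penalties, plus disagreement on the unlabeled points."""
    n = g1.shape[0]
    a, b = theta[:n], theta[n:]
    f1, f2 = g1 @ a, g2 @ b                    # view functions at all points
    labeled = np.mean(((f1[:m] + f2[:m]) / 2.0 - y) ** 2)
    disagreement = lam * np.mean((f1[m:] - f2[m:]) ** 2)
    penalty = gamma1 * (a @ g1 @ a) + gamma2 * (b @ g2 @ b)
    return float(labeled + penalty + disagreement)


def solve_inner(g1, g2, y, m, gamma1, gamma2, lam):
    """Inner level: minimize the quadratic objective for fixed kernels."""
    n = g1.shape[0]
    u = n - m
    a_lab = np.hstack([g1[:m], g2[:m]]) / 2.0  # averaged predictor, labeled
    a_unl = np.hstack([g1[m:], -g2[m:]])       # f1 - f2, unlabeled
    reg = np.zeros((2 * n, 2 * n))
    reg[:n, :n] = gamma1 * g1
    reg[n:, n:] = gamma2 * g2
    hessian = a_lab.T @ a_lab / m + reg + lam * (a_unl.T @ a_unl) / u
    rhs = a_lab.T @ y / m
    return np.linalg.lstsq(hessian, rhs, rcond=None)[0]


def coreg_mkl(x1, x2, y, bandwidths1, bandwidths2,
              gamma1=0.1, gamma2=0.1, lam=1.0):
    """Outer level: pick the kernel pair with the smallest inner minimum."""
    m = len(y)
    best = None
    for s1, s2 in itertools.product(bandwidths1, bandwidths2):
        g1 = gaussian_gram(x1, x1, s1)
        g2 = gaussian_gram(x2, x2, s2)
        theta = solve_inner(g1, g2, y, m, gamma1, gamma2, lam)
        value = coreg_objective(theta, g1, g2, y, m, gamma1, gamma2, lam)
        if best is None or value < best[0]:
            best = (value, s1, s2, theta)
    return best


# Hypothetical data: 10 labeled and 20 unlabeled points seen through two views.
rng = np.random.default_rng(0)
m, u = 10, 20
x1 = rng.normal(size=(m + u, 2))
x2 = x1 ** 2 + 0.1 * rng.normal(size=(m + u, 2))
y = np.sin(x1[:m, 0])
best_value, best_s1, best_s2, best_theta = coreg_mkl(
    x1, x2, y, (0.5, 1.0, 2.0), (0.5, 1.0, 2.0))
```

With predefined kernels only `solve_inner` would be needed, which corresponds to the setting of Equation (12); the outer loop over kernel pairs is what distinguishes the two-level problem.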

In the next section, we will present the main results of this paper.

#### 4. Generalization Bounds

In this section, we assume that the loss function in Equation (11) is a subnormalized functional with degenerate dimension 2 on . Letting and in Equation (11), we have

Note that , and then for any samples and , the class of candidate reproducing kernel Hilbert spaces is defined as follows:

That is, the solution minimizing belongs to the class . For simplicity, we use to denote in the next sections.

As assumed in [11], the final predictor for coregularized multiple kernel learning is selected from the following class:

*Remark 3.* From Equations (16) and (17), we can see that and mainly depend on the prescribed set of candidate kernels and the unlabeled data.

For any , we define the expected loss as in Equation (2), and we can use the loss function *L* to compute the labeled empirical loss in Equation (11). For the given samples , the loss can also be computed as in Equation (3).

If *L* satisfies the Lipschitz continuity condition on , we introduce the constant defined by

and the local Lipschitz constant denoted as

At the end of this section, we give the main theorems of this paper.

Theorem 1. *Let the function L be a subnormalized functional with degenerate dimension 1 on and satisfy the Lipschitz continuity condition on . Let the labeled samples , be independent random variables drawn from the probability space , and the unlabeled samples , be independent random variables drawn from the probability space . Then, for any , with probability at least , for any , the following inequality holds:*

*where denotes a vector and , are Rademacher variables, with diagonal elements , and orthogonal matrix O, and*

Corollary 1. *Under the assumption of Theorem 1, we have the following inequality:*

Corollary 2. *Under the assumptions of Theorem 1, and assuming and , the following inequality holds:*

*Remark 4.* We prove Theorem 1 and Corollaries 1 and 2 in Section 5. In Section 6, we show that Theorem 1 and Corollary 1 extend the results in [11] and [2], respectively.

#### 5. Proofs

In this section, we prove Theorem 1 and Corollaries 1 and 2 of Section 4. In preparation for the proofs, we first give the following two lemmas (Lemmas 2 and 3).

Lemma 2. *Let the function L be a subnormalized functional with degenerate dimension 1 on , and satisfy the Lipschitz continuity condition on . Let the labeled samples , be independently drawn from the probability space , and the unlabeled samples , be independently drawn from the probability space . Then, for any , with probability at least , we have the following inequality:*

*where denotes a vector and , are Rademacher variables.*

*Proof.* For simplicity, let

Replacing the *i*th sample in the labeled samples with , we have

By McDiarmid’s inequality (Theorem D.3 on page 372 of [15]), with probability at least , we have

By the symmetrization argument (page 36 of [15]), we bound the expectation of the first term on the right-hand side of the above inequality (28) as follows:

Again, applying McDiarmid’s inequality to the right-hand side of the above inequality (29), with probability at least , we have

Combining inequalities (28), (29), and (30) yields, with probability at least :
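For reference, the bounded-differences form of McDiarmid's inequality used in arguments of this kind reads as follows. The symbols $g$, $c_i$, $c$, $\epsilon$, and $\delta$ are generic and introduced here for illustration; the constants in inequalities (28) and (30) depend on the bound on the loss and may differ.

```latex
% Bounded differences: changing the i-th argument changes g by at most c_i.
\[
\bigl| g(z_1, \dots, z_i, \dots, z_m) - g(z_1, \dots, z_i', \dots, z_m) \bigr| \le c_i
\quad \text{for all } i
\;\Longrightarrow\;
\Pr\bigl[\, g - \mathbb{E}[g] \ge \epsilon \,\bigr]
\le \exp\!\left( \frac{-2\epsilon^2}{\sum_{i=1}^{m} c_i^2} \right).
\]
% With c_i = c/m for every i, setting the right-hand side equal to \delta gives,
% with probability at least 1 - \delta,
\[
g \le \mathbb{E}[g] + c \sqrt{\frac{\ln(1/\delta)}{2m}} .
\]
```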

Lemma 3. *Under the assumptions of Lemma 2, for any , with probability at least , we have the following inequality:*

*where denotes a vector and , are Rademacher variables.*

*Proof.* As defined in Equation (19), is the local Lipschitz constant of *L*. By the contraction property of Rademacher complexity (Lemma 26.9 on page 331 of [16] and Theorem 7 of [17]), the result follows.

Lemma 4. *If Equation (11) has a solution, then, for a fixed and a fixed , it has a solution of the following form:*

for some and . That is, the solution belongs to .

*Proof. *The result follows in a similar way to Proposition 2.3.1 in [11].

Lemma 5. *Let the labeled samples , be independent random variables drawn from the probability space and the unlabeled samples , be independent random variables drawn from the probability space . The following inequality holds:*

*where denotes a vector and , are Rademacher variables, with diagonal elements , and orthogonal matrix O, and*

*Proof.* As defined in Equation (17), we can rewrite the left-hand side of inequality (35) as

where the sign is defined in Equation (16).

From the assumptions, we have

By the reproducing property and Lemma 1, for any and , and for any , we have

Combining Equations (39), (40), (41), and (42) yields that

where

By Lemma 4, for any and , we have (this is similar to Section 5.2.1, converting to Euclidean space, in [11])

where

Hence, we have

where

Note that the matrix is not full rank. Following steps similar to those in [11], we can rewrite Equation (47) as

where

In the above equations, the matrices and are, respectively, diagonal matrices containing the nonzero eigenvalues. We write the projections of and onto the column spaces of and as and .

Next, we relate Equation (47) to Rademacher chaos complexity. Note that

The first equation is easily obtained from the discussion in Section 5.2.4 of [11]. The second equation follows from

Then, we have

The second equation uses Equation (51), the fourth inequality is obtained by applying Jensen’s inequality twice, and the last inequality uses the definition of Rademacher chaos complexity and the finiteness of the class of kernels.

*Proof (Proof of Theorem 1).* For any , it is easy to show that

Combining Lemmas 2, 3, 4, and 5, the result follows.

*Proof (Proof of Corollary 1).* By the Gershgorin circle theorem (Theorem 7.2.1 on page 381 of [18]), the in Equation (21) can be estimated as follows:

Then, we can write Equation (22) as

So, the result follows.
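The Gershgorin circle theorem places every eigenvalue of a matrix inside a disc centered at a diagonal entry whose radius is the sum of the absolute off-diagonal entries in that row. For a real symmetric matrix this gives a simple upper bound on the largest eigenvalue. The sketch below checks that bound on a random symmetric matrix; the matrix is a hypothetical stand-in and is not the matrix appearing in Equation (21).

```python
import numpy as np


def gershgorin_upper_bound(a):
    """Upper bound on the largest eigenvalue of a real symmetric matrix:
    max over rows of (diagonal entry + sum of absolute off-diagonal entries)."""
    a = np.asarray(a, dtype=float)
    diag = np.diag(a)
    radii = np.abs(a).sum(axis=1) - np.abs(diag)
    return float(np.max(diag + radii))


rng = np.random.default_rng(0)
b = rng.normal(size=(6, 6))
sym = (b + b.T) / 2.0                    # random symmetric test matrix
largest = np.linalg.eigvalsh(sym)[-1]    # eigvalsh returns ascending eigenvalues
bound = gershgorin_upper_bound(sym)
```

The bound is cheap to compute from the entries alone, which is what makes it useful for replacing an eigenvalue-dependent quantity with one expressed directly through kernel values.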

*Proof (Proof of Corollary 2).* By the assumption in Corollary 2, we have

Substituting Equations (57) and (58) into Equation (24), Corollary 2 follows.

#### 6. Discussion

In the previous two sections, we stated our main results and proved them. In this section, we compare our results with the existing work in [2] and [11].

First, we can see that the term

in Theorem 1 reflects the compatibility of the two views on the unlabeled samples.

(1) For multiple kernel learning in supervised learning.

If we let and , then we have

By Corollary 1 and Equations (60) and (61), we obtain

Then, the main result in [2] is recovered from Corollary 1. Therefore, the main result in [2] is a special case of Corollary 1.

*Remark 5.* For the discussion in Section 3, if we set and , then by Equation (17), Equation (14) reduces to Equation (13). Furthermore, let and ; then we have

Substituting inequality (63) into inequality (62), we obtain the following generalization bound for single kernel learning in the framework of supervised learning:

(2) For coregularized kernel learning in semisupervised learning.

If we let , , , and , by equation we have

Note that Equation (65) is the same as the supremum evaluation in Section 5.2.2 of [11]. So, the main result in [11] is recovered from Theorem 1; that is, the main result in [11] can be regarded as a special case of Theorem 1.

*Remark 6.* For the discussion in Section 3, if we set , , , and , then Equation (14) reduces to Equation (12).

In Figure 1, we show the relations between the main results in this paper and the results in [2] and [11].