#### Abstract

The problem of learning the kernel function as a linear combination of multiple kernels has attracted considerable attention recently in machine learning. Specifically, by imposing an $\ell_p$-norm penalty on the kernel combination coefficients, multiple kernel learning (MKL) has proved useful and effective in both theoretical analysis and practical applications (Kloft et al., 2009, 2011). In this paper, we present a theoretical analysis of the approximation error and learning ability of $\ell_p$-norm MKL. Our analysis gives explicit learning rates for $\ell_p$-norm MKL and demonstrates some notable advantages over traditional kernel-based learning algorithms in which the kernel is fixed.

#### 1. Introduction

##### 1.1. Overview of Multiple Kernel Learning

Kernel methods such as Support Vector Machines (SVMs) have been extensively applied to supervised learning tasks such as classification and regression. The performance of a kernel machine largely depends on the data representation via the choice of kernel function. Hence, one central issue in kernel methods is the problem of kernel selection; a great many approaches to selecting the right kernel have been studied in the literature (see [1–4] and the references therein).

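We begin by reviewing the classical supervised learning setup. Let $X$ be a compact metric space and $Y \subset \mathbb{R}$. Given a labeled sample $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^{m}$, drawn i.i.d. according to an unknown distribution $\rho$ supported on $X \times Y$, the goal is to estimate a real-valued function $f_{\mathbf{z}}$, depending on the sample, that generalizes well on new and unseen data. A widely used approach to estimating a function from empirical data consists in minimizing a regularization functional in a Hilbert space $\mathcal{H}$ of real-valued functions $f : X \to \mathbb{R}$. Typically, a regularization scheme estimates $f$ as a minimizer of the functional

$$\min_{f \in \mathcal{H}} \big\{ \mathcal{E}_{\mathbf{z}}(f) + \lambda \Omega(f) \big\}, \tag{1.1}$$

where $\mathcal{E}_{\mathbf{z}}(f) = \frac{1}{m} \sum_{i=1}^{m} \ell(f(x_i), y_i)$ is the empirical risk of the hypothesis $f$, measured by a nonnegative loss function $\ell$. In addition, $\Omega$ is a regularizer and $\lambda > 0$ is a trade-off regularization parameter.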

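In this paper, we assume that $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ with kernel $K$; see [5]. Every kernel $K$ corresponds to a feature mapping $\Phi_K : X \to \mathcal{H}_K$ satisfying $K(x, x') = \langle \Phi_K(x), \Phi_K(x') \rangle_K$, and each element of $\mathcal{H}_K$ has the following form:

$$f_w(x) = \langle w, \Phi_K(x) \rangle_K, \quad w \in \mathcal{H}_K. \tag{1.2}$$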

By restricting the regularizer to the form $\Omega(f) = \|f\|_K^2$, the scheme (1.1) has been studied extensively from different perspectives such as statistics, optimal recovery, and machine learning (see [6–9] and the references therein). Regularization in an RKHS has a number of attractive features, including the availability of effective error bounds and stability analysis relative to perturbations of the data (see Cucker and Smale [7]; Wu et al. [10]; Bousquet and Elisseeff [6]). Moreover, the optimization problem (1.1) in an RKHS can be reduced to seeking a solution in a finite-dimensional space. Although it is simple to prove, this result shows that the variational problem (1.1) is computationally tractable.

Because of their simplicity and generality, kernels and the associated RKHSs play an increasingly important role in machine learning, pattern recognition, and artificial intelligence. When the kernel is fixed, an immediate concern is the choice of the regularization parameter $\lambda$. This is typically done by means of cross validation or generalized cross validation [11]. However, the performance of kernel methods critically relies on the choice of the kernel function. A natural question is how to choose the optimal kernel from a collection of candidate kernels.

Kernel learning can range from selecting the width parameter of Gaussian kernels [9, 12] to obtaining an optimal linear combination from a finite set of candidate kernels. The latter is often referred to as multiple kernel learning in machine learning and nonparametric group Lasso in statistics [13]. Lanckriet et al. [3] pioneered work on MKL and proposed a semidefinite programming approach to automatically learn a linear combination of candidate kernels for the case of SVMs. To improve computational efficiency, the multikernel class is further restricted to convex combinations of kernels only [2, 14, 15]. Most kernel learning algorithms are based on linear kernel mixtures $K_\theta = \sum_{j=1}^{M} \theta_j K_j$ with prescribed kernels $\{K_j\}_{j=1}^{M}$. For notational simplicity, we will frequently use $\Phi_j$ instead of the standard feature $\Phi_{K_j}$. Compared to (1.1), the primal model for learning with multiple kernels is extended to

In this paper, we mainly focus on $\ell_p$-norm MKL, which consists in minimizing the regularized empirical risk with respect to the optimal kernel mixture, with an additional $\ell_p$-regularizer on the mixing coefficients to avoid overfitting. This leads to the optimization problem (1.4). This scheme was introduced in [2], and the existence of its minimum has been discussed in [4].

The optimization problem subsumes state-of-the-art approaches to multiple kernel learning, covering sparse and nonsparse MKL by arbitrary $\ell_p$-norm regularization ($1 \le p \le \infty$) on the mixing coefficients as well as the incorporation of prior knowledge by allowing for nonisotropic regularizers. Kloft et al. [2] developed two efficient interleaved optimization strategies for $\ell_p$-norm multiple kernel learning, and this interleaved optimization is much faster than the commonly used wrapper approaches, as demonstrated on real-world problems from computational biology. An analysis of this model, based on Rademacher complexities, was first developed by Cortes et al. [1]. Later, improved rates of convergence were derived based on the theory of local Rademacher complexities [15]. However, the estimate on local Rademacher complexities with strictly depends on a no-correlation assumption on the different features, which is too strong a condition in both theory and practice. In this paper, we employ the notion of empirical covering number to present a theoretical analysis of the generalization error. Besides making the no-correlation condition unnecessary, the empirical covering number gives a tight upper bound on local Rademacher complexities [16] and is independent of the underlying distribution. We will see that satisfactory learning rates are established when the regularization parameter is appropriately chosen. The interaction between the sample error and the approximation error plays an important role in our analysis, and our new methodology mainly depends on the complexity of the hypothesis class, measured by the empirical covering number, and on the regularity of the target function.
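As an illustration of the linear kernel mixtures used in $\ell_p$-norm MKL, the following minimal Python sketch forms the Gram matrix of a nonnegative combination of Gaussian kernels and rescales the coefficients onto the unit $\ell_p$ sphere. The kernel widths, the value of $p$, the synthetic data, and the helper names `gaussian_gram` and `mixture_gram` are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Gram matrix of the Gaussian kernel exp(-||x - x'||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def mixture_gram(grams, theta):
    """Gram matrix of the mixture sum_j theta_j K_j for nonnegative weights theta."""
    theta = np.asarray(theta, dtype=float)
    if np.any(theta < 0):
        raise ValueError("mixing coefficients must be nonnegative")
    return sum(t * G for t, G in zip(theta, grams))

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(30, 2))                 # synthetic inputs
grams = [gaussian_gram(X, s) for s in (0.25, 0.5, 1.0)]  # hypothetical widths

p = 1.5                                                  # hypothetical choice of p
theta = np.array([0.2, 0.5, 0.3])
theta = theta / np.linalg.norm(theta, ord=p)             # now ||theta||_p = 1

K_theta = mixture_gram(grams, theta)
min_eig = np.linalg.eigvalsh(K_theta).min()              # nonnegative up to round-off
```

A nonnegative combination of positive semidefinite Gram matrices is again positive semidefinite, which is why the mixture can be used wherever a single kernel is expected.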

It should be pointed out that the Tikhonov regularization in (1.4) has two regularization parameters (, ), which may be hard to deal with in practice. Fortunately, an alternative approach has been studied by Rakotomamonjy et al. [14] and Kloft et al. [2]. More precisely, this approach employs the regularizer as an additional constraint in the optimization problem. By substituting for , they arrive at the following problem:

##### 1.2. Algorithm and Main Consequence

The following Lemma (see [4]) indicates that the above multikernel class can equivalently be represented as a block-norm regularized linear class in the product Hilbert space .

Lemma 1.1. * If , , and , then
**
and the equality occurs for at
*
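The variational identity behind Lemma 1.1 can be checked numerically. The sketch below assumes the form of the identity commonly used in the $\ell_p$-norm MKL literature: for $p \ge 1$, $q = 2p/(p+1)$, and positive numbers $a_j$ (standing in for the block norms), one has $\sum_j a_j^2/\theta_j \ge (\sum_j a_j^q)^{2/q}$ whenever $\theta \ge 0$ and $\|\theta\|_p \le 1$, with equality at $\theta_j = a_j^{2/(p+1)} / (\sum_k a_k^q)^{1/p}$. This reading of the lemma, the value of $p$, and all function names are assumptions of the example.

```python
import numpy as np

def block_norm_bound(a, p):
    """Closed-form lower bound (sum_j a_j^q)^(2/q) with q = 2p / (p + 1)."""
    q = 2.0 * p / (p + 1.0)
    return np.sum(a**q) ** (2.0 / q)

def optimal_theta(a, p):
    """Candidate minimiser theta_j = a_j^(2/(p+1)) / (sum_k a_k^q)^(1/p)."""
    q = 2.0 * p / (p + 1.0)
    return a ** (2.0 / (p + 1.0)) / np.sum(a**q) ** (1.0 / p)

def objective(a, theta):
    """Value of sum_j a_j^2 / theta_j."""
    return np.sum(a**2 / theta)

rng = np.random.default_rng(1)
p = 1.5
a = rng.uniform(0.1, 2.0, size=5)          # stand-ins for the block norms

theta_star = optimal_theta(a, p)
bound = block_norm_bound(a, p)
value_at_star = objective(a, theta_star)   # should match the bound

# Random feasible points on the unit l_p sphere should never beat the bound.
trials = rng.uniform(0.05, 1.0, size=(1000, 5))
trials /= np.linalg.norm(trials, ord=p, axis=1, keepdims=True)
worst_gap = min(objective(a, t) - bound for t in trials)
```

The exponent $q = 2p/(p+1)$ lies in $[1, 2)$, so $p = 1$ gives the sparse $\ell_{2,1}$ block norm, while larger $p$ moves toward the nonsparse $\ell_{2,2}$ case.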

Hence, Lemma 1.1 can be applied to define the feature mapping: associated with a kernel ; the class of functions defined above coincides with when , holds from . The -norm is defined here as . For simplicity, we write . Clearly, the complexity of the class (1.8) is greater than that of a class based on a single kernel only; it provides greater learning ability, while the computational complexity increases accordingly. The sample complexity of the above hypothesis space has been studied by Cortes et al. [1] and Kloft and Blanchard [15]. Thus the primal MKL optimization problem (1.5) is equivalent to the following regularization scheme, which is the primary object of investigation in this paper: Here we use the symbol “min” instead of “inf,” since (1.4) is equivalent to (1.9) and the solution of (1.4) exists and is unique. Note that the above algorithm is a standard regularized empirical risk minimization; this implies that the $\ell_p$-norm multiple kernel learning scheme can avoid overfitting, a phenomenon which occurs when the empirical error is zero but the expected error is far from zero.

In the following, we assume that is uniformly bounded, that is, Also suppose that each is continuous. In other words, each is a Mercer kernel with bound ; we refer to [17] for more properties and discussions on Mercer kernel.

In this paper, we focus only on the least square loss $\ell(f(x), y) = (f(x) - y)^2$. Accordingly, the target function is the regression function $f_\rho(x) = \int_Y y \, d\rho(y \mid x)$, where $\rho(\cdot \mid x)$ denotes the conditional distribution of $\rho$ at $x$. Throughout this paper we assume that $\rho(\cdot \mid x)$ is supported on , so it follows that for almost surely. Since the learner may be much larger than , it is natural to apply a projection operator on , which was introduced into learning algorithms to improve learning rates.

*Definition 1.2. *The projection operator is defined on the space of measurable functions as
where is called the projection level.
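The projection operator of Definition 1.2 can be sketched as a pointwise truncation. The code below assumes the usual form, in which function values are clipped to the interval $[-B, B]$ for a projection level $B > 0$; the name `project`, the parameter `level`, and the sample values are hypothetical.

```python
import numpy as np

def project(f_values, level):
    """Truncate function values pointwise to the interval [-level, level]."""
    if level <= 0:
        raise ValueError("projection level must be positive")
    return np.clip(f_values, -level, level)

values = np.array([-3.0, -0.5, 0.0, 0.7, 2.5])   # values f(x_i) of some estimator
other = np.array([-1.5, 0.4, 0.3, 0.9, 0.8])     # values g(x_i) of another function
projected = project(values, 1.0)

# Truncation never increases pointwise distances (it is a contraction).
gap_before = np.abs(values - other)
gap_after = np.abs(project(values, 1.0) - project(other, 1.0))
```

Because the outputs are assumed to lie in $[-B, B]$, truncating an estimator at level $B$ can only bring it closer to the outputs, which is why the projected estimator is analyzed instead of the raw one.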

The target of error analysis is to understand how approximates the regression function . More precisely, we aim to estimate the *excess generalization error*
for the $\ell_p$-norm MKL algorithm (1.4), where $\mathcal{E}(f) = \int_{X \times Y} (f(x) - y)^2 \, d\rho$ denotes the expected error of $f$.

To show some ideas of our error analysis, we first state learning rates of (1.4) in a special case when and is on .

Theorem 1.3. * Let be defined by (1.9). Assume is on and . For any and , with confidence , there holds
**
where is some constant independent of or . *

Theorem 1.3 can be viewed as a corollary of our main result, Theorem 4.1, presented in Section 4. The rate can be made arbitrarily close to by choosing to be small enough, which is the best convergence rate in the learning theory literature.

#### 2. Key Error Analysis

Our main result is about learning rates of (1.4) stated under conditions on the approximation ability of with respect to and capacity of .

The approximation ability of the hypothesis space with respect to in the space is reflected by regularization error.

*Definition 2.1. * The regularization error of the triple is defined as
We will assume that for some and ,

*Remark 2.2. *Our assumption implies that when is replaced by , tends to zero by polynomial order decay as goes to zero. Note [7] that would imply . So in (2.2) is the best we can expect. This case is equivalent to when is dense in , see [18]. Assumption (2.2) with can be characterized in terms of interpolation spaces [7].

If is the Lebesgue measure on and the target function , a Sobolev space with power , then a polynomial decay of is impossible when a Gaussian kernel with a fixed variance is used. However, Example 1 of [19] successfully obtains a polynomial decay under the multikernel setting, allowing for varying variances of Gaussian kernels. This shows that multikernel learning can improve the approximation power and learning ability. More interestingly, we will take a special example to show the impact of the multikernel class on the regularization error in Section 5 below. In particular, a proper multikernel class can be used to improve the regularization error if the regularity of is rather high.

Next we define the truncated sample error as and the sample error as

The function in the above equation can be arbitrarily chosen; however, only proper choices lead to good estimates of the regularization error. A good choice is where

A useful approach for regularization schemes with sample independent hypothesis spaces such as RKHS is an error decomposition, which decomposes the total error into the sum of the truncated sample error and the regularization error stated as follows.

Proposition 2.3. *Let be defined by (2.5); there holds
*

*Proof. *We can decompose into
To bound the second term, by the definition of , can be bounded by
since holds for any function on . The conclusion follows by combining these two inequalities.

#### 3. Estimation on Sample Error

We are in a position to estimate the sample error . The sample error can be written as

can be easily bounded by applying the following one-sided Bernstein-type probability inequality.

Lemma 3.1. * Let be a random variable on a probability space with variance satisfying for some constant . Then for any , we have
*
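A one-sided Bernstein inequality of the kind used in Lemma 3.1 can be illustrated by simulation. The sketch below assumes the standard tail form $\mathrm{Prob}\{\frac{1}{m}\sum_{i} \xi_i - \mathbb{E}\xi \ge \varepsilon\} \le \exp\{-m\varepsilon^2 / (2(\sigma^2 + M\varepsilon/3))\}$ for i.i.d. copies of a random variable with variance $\sigma^2$ and $|\xi - \mathbb{E}\xi| \le M$; the exact statement in the lemma may differ, and the uniform distribution, sample sizes, and names are assumptions of the example.

```python
import numpy as np

def bernstein_tail(eps, m, sigma2, bound):
    """One-sided Bernstein bound on P(sample mean - E[xi] >= eps)."""
    return np.exp(-m * eps**2 / (2.0 * (sigma2 + bound * eps / 3.0)))

rng = np.random.default_rng(2)
m, trials, eps = 200, 5000, 0.05

# xi ~ Uniform[0, 1]: E[xi] = 1/2, variance 1/12, and |xi - E[xi]| <= 1/2.
samples = rng.uniform(0.0, 1.0, size=(trials, m))
deviations = samples.mean(axis=1) - 0.5

empirical_tail = np.mean(deviations >= eps)   # observed frequency of large deviations
theoretical_tail = bernstein_tail(eps, m, sigma2=1.0 / 12.0, bound=0.5)
```

The bound is distribution free once the variance and the range are fixed, so the observed frequency is typically well below it; the variance term is what makes the resulting rates faster than those obtained from Hoeffding-type bounds.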

Proposition 3.2. *Define the random variable . For every , with confidence at least , there holds
*

*Proof. *From the definition of , we see that
Note that for some , by the Cauchy-Schwarz inequality, we have for any :
where . Using Assumption (1.10), it follows that

Observe that
Since almost surely, we have
Hence . Moreover, equals
which implies that . The desired result follows from Lemma 3.1.

Next we estimate the first term . It is more difficult to deal with because it involves a set of random variables varying with , which requires us to take the complexity of the function class into account. For this purpose, we introduce the notion of empirical covering numbers, which often lead to sharp error estimates [16].

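*Definition 3.3. *Let $(\mathcal{M}, d)$ be a pseudometric space and $S \subset \mathcal{M}$. For every $\epsilon > 0$, the covering number $\mathcal{N}(S, \epsilon, d)$ of $S$ with respect to $\epsilon$ and $d$ is defined as the minimal number of balls of radius $\epsilon$ whose union covers $S$, that is,

$$\mathcal{N}(S, \epsilon, d) = \min\Big\{ l \in \mathbb{N} : S \subset \bigcup_{j=1}^{l} B(s_j, \epsilon) \ \text{for some}\ \{s_j\}_{j=1}^{l} \subset \mathcal{M} \Big\},$$

where $B(s_j, \epsilon) = \{ s \in \mathcal{M} : d(s, s_j) \le \epsilon \}$ is a ball in $\mathcal{M}$.

The $\ell^2$-empirical covering number of a function set is defined by means of the normalized $\ell^2$-metric $d_2$ on the Euclidean space $\mathbb{R}^k$ given by

$$d_2(\mathbf{a}, \mathbf{b}) = \Big( \frac{1}{k} \sum_{i=1}^{k} |a_i - b_i|^2 \Big)^{1/2}, \quad \mathbf{a} = (a_i)_{i=1}^{k},\ \mathbf{b} = (b_i)_{i=1}^{k} \in \mathbb{R}^k.$$

*Definition 3.4. *Let $\mathcal{F}$ be a set of functions on $X$, $\mathbf{x} = (x_i)_{i=1}^{k} \in X^k$, and $\mathcal{F}|_{\mathbf{x}} = \{ (f(x_i))_{i=1}^{k} : f \in \mathcal{F} \} \subset \mathbb{R}^k$. Set $\mathcal{N}_{2,\mathbf{x}}(\mathcal{F}, \epsilon) = \mathcal{N}(\mathcal{F}|_{\mathbf{x}}, \epsilon, d_2)$. The $\ell^2$-empirical covering number of $\mathcal{F}$ is defined by

$$\mathcal{N}_2(\mathcal{F}, \epsilon) = \sup_{k \in \mathbb{N}} \sup_{\mathbf{x} \in X^k} \mathcal{N}_{2,\mathbf{x}}(\mathcal{F}, \epsilon), \quad \epsilon > 0.$$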
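For a fixed sample, the covering number in Definition 3.4 can be bounded from above by constructing an explicit cover. The sketch below uses a simple greedy net under the normalized $\ell^2$ metric (mean of squared coordinate differences, then square root); the greedy construction is a swapped-in technique for illustration, not the paper's method, and the finite function class, the sample, and the function names are hypothetical. It treats one sample only, whereas the definition takes a supremum over all samples and sample sizes.

```python
import numpy as np

def normalized_l2(a, b):
    """Normalised l2 metric: sqrt(mean((a_i - b_i)^2)) on R^k."""
    return np.sqrt(np.mean((a - b) ** 2))

def greedy_cover(vectors, eps):
    """Greedy eps-net of the rows of `vectors`.

    A row becomes a new centre only if it is farther than eps from every
    centre chosen so far, so every row ends up within eps of some centre.
    The number of centres is an upper bound on the covering number of this
    finite set at radius eps.
    """
    centres = []
    for v in vectors:
        if all(normalized_l2(v, c) > eps for c in centres):
            centres.append(v)
    return centres

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, size=20)                    # sample points x_1, ..., x_k

# Hypothetical finite class: f(t) = a * sin(b * t) with |a| <= 1.
params = [(a, b) for a in np.linspace(-1.0, 1.0, 21) for b in (1.0, 2.0, 3.0)]
restricted = np.array([a * np.sin(b * x) for a, b in params])  # class restricted to x

centres = greedy_cover(restricted, 0.1)
sizes = {eps: len(greedy_cover(restricted, eps)) for eps in (0.4, 0.2, 0.1, 0.05)}
```

Since the normalized $\ell^2$ distance between two functions on a sample never exceeds their uniform distance, any uniform cover also yields an empirical cover.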

Denote by the ball of radius with , . We need the following capacity assumption on .

*Assumption 3.5. *There exists an exponent , with and a constant such that
where is the unit ball of defined as above.

For any function , by the Hölder inequality, we have and it follows from (3.13) where is called the generalized unit ball of associated with , defined by

Note that for any function set , the empirical covering number is bounded by , the (uniform) covering number of under the metric , since . It was shown in [20] that the quantity holds for some if is on a subset of , hence also holds. In particular, is arbitrarily small for a kernel (such as the Gaussian kernel). Now we give a concrete example in to reveal the relationship between the regularity of a function class and its corresponding empirical covering number.

*Example 3.6. *Let be a bounded domain in and the Sobolev space of index . When , the classical Embedding Theorem tells us that is an RKHS and its unit ball is embedded in a finite ball of the function space with inclusion bounded where . From the classical bounds for covering numbers of the unit ball of , we see that
Hence the capacity assumption (3.13) holds with .

Our concentration estimate for the sample error dealing with is based on the following concentration inequality, which can be found in [12].

Lemma 3.7. * Let be a class of measurable functions on . Assume that there are constants and such that and for every . If for some and ,
**
then there exists a constant such that for any , with confidence at least , there holds
**
where
*

Denote the set of functions with , where

Proposition 3.8. * If satisfies the capacity condition (3.13) with some , then for any , with confidence , there holds
**
with constant . *

*Proof. *Consider the set . Each function can be expressed as = with some . Then and = . Note that
Since and for any , we see that for any ,
On the other hand, for any at point , we have
since the projection operator is a contractive map. It follows that
It follows from the capacity condition (3.15)

Applying Lemma 3.7 with , , and , we see that for any , with confidence , there holds
Besides, following the definition of (1.9), we have
that is, . Hence we can replace with . Thus we derive our desired result.

#### 4. Total Learning Rates

We are now in a position to obtain the learning rates of the projected algorithm (1.9). The main results of this paper are presented in Theorem 4.1.

Following the error decomposition scheme in Proposition 2.3 and combining Propositions 3.2 and 3.8, we derive the following bounds on the total error.

Theorem 4.1. * Suppose that satisfies the capacity condition (3.13) with some , and . For any , with confidence , there holds
**
where and is defined as in Proposition 3.8. *

*Proof. *Following Propositions 3.2 and 3.8, with confidence at least , can be bounded by
Firstly, we set
which implies that . On the other hand, from the assumption , we set
Hence our assertion follows by taking .

*Proof of Theorem 1.3. * When , it follows that condition (3.13) holds for arbitrarily small . Moreover, implies that ; the conclusion follows easily from Theorem 4.1.

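Our learning rates in Corollary 4.2 below are achieved under the regularity assumption that the regression function $f_\rho$ lies in the range of $L_K^r$ for some $r > 0$. Given any kernel $K$, $L_K$ is the integral operator on $L^2_{\rho_X}$ defined by

$$L_K f(x) = \int_X K(x, t) f(t) \, d\rho_X(t), \quad x \in X.$$

The operator $L_K$ is linear, compact, and positive, and can also be regarded as a self-adjoint operator on $L^2_{\rho_X}$. Hence the fractional power operator $L_K^r$ is well defined and is given by

$$L_K^r f = \sum_{i} \lambda_i^r \langle f, \phi_i \rangle_{L^2_{\rho_X}} \phi_i,$$

where $\{\lambda_i\}$ are the eigenvalues of the operator $L_K$ arranged in decreasing order and $\{\phi_i\}$ are the corresponding eigenfunctions, which form an orthonormal basis of $L^2_{\rho_X}$. In fact, the image of $L_K^r$ is contained in $\mathcal{H}_K$ if $r \ge 1/2$. So $L_K^{-r} f_\rho \in L^2_{\rho_X}$ indicates that $f_\rho$ lies in the range of $L_K^r$, measuring the regularity of the regression function.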
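The spectral definition of the fractional power operator can be sketched on a finite sample, where the integral operator is replaced by the Gram matrix divided by the sample size. This discretization, the Gaussian kernel, its width, and the function names are assumptions made for illustration; the code only mirrors the eigen-expansion and is not part of the paper's analysis.

```python
import numpy as np

def gaussian_gram(x, sigma=0.5):
    """Gaussian Gram matrix for one-dimensional inputs."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma**2))

def fractional_power(L, r):
    """Spectral calculus for a symmetric PSD matrix: sum_i lambda_i^r phi_i phi_i^T."""
    lam, phi = np.linalg.eigh(L)
    lam = np.clip(lam, 0.0, None)            # remove tiny negative round-off
    return (phi * lam**r) @ phi.T

rng = np.random.default_rng(4)
n = 40
x = np.sort(rng.uniform(0.0, 1.0, size=n))

L = gaussian_gram(x) / n                      # finite-sample surrogate of the integral operator
L_half = fractional_power(L, 0.5)

eigenvalues = np.linalg.eigvalsh(L)[::-1]     # arranged in decreasing order
reconstruction_error = np.linalg.norm(L_half @ L_half - L)
```

Applying the power $1/2$ twice recovers the operator itself, and the rapid decay of `eigenvalues` for a smooth kernel shows why membership in the range of a higher power is a strong regularity requirement.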

Corollary 4.2. * Suppose that satisfies the capacity condition (3.13) with some , and (). For any , with confidence , there holds
**
where is some constant independent of or . *

*Proof. *Recall a result from [18], if (), there holds
If , this shows that as mentioned above, then we have and
Hence for any , we have

On the other hand, observe Lemma 5.1 below, and we have
where is some constant and . The conclusion follows immediately from Theorem 4.1.

Let us compare our learning rates with the existing results.

In [10], a uniform covering number technique was used to derive the expected value of learning schemes (1.1) where . Suppose that all the kernels are the same with some specialized and for some . Then for any and any , with confidence , Clearly the learning rates derived from Corollary 4.2 are better than those in [10] since .

In [21], an operator monotonic technique was used to improve the kernel independent error bounds in comparison with the result in [17]. If for some , then for any , with confidence , there holds

When and , the learning rate given by Corollary 4.2 is better than the above result.

As for empirical risk minimization (ERM), classical results on analysis of ERM schemes give error bounds between the empirical target function and the regression function. In particular, learning rates of type with arbitrarily close to 1 can be achieved by ERM schemes (see [15]). However, the ERM setting is different from the one on Tikhonov regularization. How to choose the regularization parameter , depending on the sample size , is the essential difficulty for the regularization scheme, even when lies in . On the other hand, it is obvious that our result is more general than that of [15] since the case for () is also covered.

#### 5. Discussion on Regularization Error

By our assumptions on M different kernels , we see that is an RKHS generated by the Mercer kernel . There are several standard approximation results on regularization error in learning theory (see [17]). Next we establish a tight connection between and with .

Lemma 5.1. * Let be a separable RKHS over associated with a bounded measurable kernel, be a distribution on , and . If there exist constants and such that , then for all we have
*

*Proof. *If there exists a function satisfying
we see that . We write with some , by the inequality, we have
Then we obtain

In other words, if has a polynomial behavior in , then this behavior completely determines the behavior of all . Thus it suffices to assume that the standard -approximation error function satisfies (2.2).

From the point of view of statistical effective dimension, we now discuss the impact of the multikernel class on the approximation error . To estimate this error, note that the regularizing function of exists, is unique, and is given by [7]. For simplicity, let and take a Mercer kernel as the original one; by the classical Mercer theorem, can be expressed as . Another kernel we take is (). In this case, . By the fact that and assumption , we have

Let us compare the multikernel class regularization with Tikhonov regularization in when the Mercer kernel is employed. Denote the saturation index as the maximal so that the approximation error achieves fastest decay rate under the condition . Then (5.6) shows the saturation index for multikernel class regularization is while it is for Tikhonov regularization in , as shown in [17].

In this case, our analysis implies that we should use an alternative kernel with faster eigenvalue decay when the spectral coefficients of the target function decay faster: for example, using instead of . This has a dimension reduction effect. Essentially, we effectively project the data onto the principal components of the data. The intuition is also quite clear: if the dimension of the target function is small (the spectral coefficients decay fast), then we should project the data onto those dimensions by reducing the remaining noisy dimensions (corresponding to fast kernel eigenvalue decay). In fact, a similar idea under the framework of semisupervised learning has been shown in spectral kernel design methods [22, 23].

In general, for the sample error, there exist rates of convergence which hold independently of the underlying distribution . This is important, as it tells us that we can give convergence guarantees no matter which kernel is used, even if we do not know the underlying distribution. In fact, this is very common in the statistical analysis of various machine learning algorithms (see [24]). This decay is usually fast enough for practical use when large numbers of samples are available. For the approximation error, however, it is impossible to give rates of convergence which hold for all probability distributions . Hence what determines the learning accuracy is the approximation error. In the kernel regression setup, this is determined by the choice of the kernel, which underlines the importance of learning kernels [4] and constructing refined kernels [25].

#### 6. Further Study

In this last section, we discuss sparsity exclusively in the case of the square loss regularization functional in (1.1) with the regularizer in an RKHS. We can derive an explicit expression for this functional from [4], which in turn provides an improvement and simplification of our algorithm (1.4).

Lemma 6.1. * For any kernel and positive constant , we have that
**
where the vector and denotes the Gram matrix and denotes the standard inner product in Euclidean space. *
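A closed-form expression of the kind stated in Lemma 6.1 can be verified numerically. The sketch below assumes the normalization $\min_{f \in \mathcal{H}_K} \{ \frac{1}{m} \sum_{i=1}^{m} (f(x_i) - y_i)^2 + \lambda \|f\|_K^2 \} = \lambda \langle \mathbf{y}, (K_{\mathbf{x}} + \lambda m I)^{-1} \mathbf{y} \rangle$, which holds for this objective by a direct computation with the representer theorem; the lemma in the paper may use a different scaling. The Gaussian kernel, the synthetic data, and the function names are hypothetical.

```python
import numpy as np

def gaussian_gram(x, sigma=0.5):
    """Gaussian Gram matrix for one-dimensional inputs."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2.0 * sigma**2))

def ridge_objective(c, K, y, lam):
    """(1/m) * ||K c - y||^2 + lam * c^T K c for f = sum_i c_i K(x_i, .)."""
    m = len(y)
    r = K @ c - y
    return r @ r / m + lam * c @ K @ c

rng = np.random.default_rng(5)
m, lam = 25, 0.1
x = rng.uniform(0.0, 1.0, size=m)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(m)

K = gaussian_gram(x)
A = K + lam * m * np.eye(m)

c_star = np.linalg.solve(A, y)                  # coefficients of the minimiser
closed_form = lam * y @ np.linalg.solve(A, y)   # lam * <y, (K + lam*m*I)^{-1} y>
attained = ridge_objective(c_star, K, y, lam)   # objective value at the minimiser

# Any other coefficient vector should give a value that is at least as large.
perturbed = ridge_objective(c_star + 0.01 * rng.standard_normal(m), K, y, lam)
```

With such an identity the inner minimization over functions disappears, leaving an objective that depends on the kernel only through its Gram matrix, which is what makes a one-layer formulation over the mixing coefficients possible.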

According to Lemma 6.1, the least square algorithm of (1.4) can be rewritten as a one-layer minimization problem

We assume that for some . Define . We say that is sparse relative to if the cardinality of is far smaller than .

For $p > 1$ (with $p$ close to 1), it would be interesting to show that the solution of (6.2) is approximately sparse following the path of . In some sense, is very small with very high probability. A refined analysis of $\ell_1$-regularized methods was done by Koltchinskii [26] in the case of combinations of basis functions, mainly taking into account the soft sparsity pattern of the Bayes function and establishing several oracle inequalities in the statistical sense. Extending these ideas to the kernel learning setting would be of great significance, because it could provide theoretical support showing that $\ell_p$-norm MKL can automatically select good kernels, which coincide with the underlying right kernels.

#### Acknowledgments

The authors would like to thank the two anonymous referees for their valuable comments and suggestions which have substantively improved this paper. The research of S.-G. Lv is supported partially by 211 project for the Southwestern University of Finance and Economics, China [Project no. 211QN2011028] as well as Annual Cultivation of the Fundamental Research Funds for the Central Universities [Project no. JBK120940].