Abstract

We investigate the consistency of spectral regularization algorithms. We generalize the usual definition of the regularization function, thereby enlarging the class of learning algorithms covered by spectral regularization. Under a more general prior condition, refined error decompositions and operator norm estimates yield satisfactory error bounds and learning rates.

1. Introduction

In this paper, we study the consistency analysis of spectral regularization algorithms in regression learning.

Let $X$ be a compact metric space and $\rho$ a probability distribution on $Z = X \times Y$ with $Y \subseteq \mathbb{R}$. Regression learning aims at estimating or approximating the regression function $f_\rho$ through a set of samples $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^{m}$ drawn independently and identically according to $\rho$ from $Z$.

In learning theory, a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with a Mercer kernel $K$ is usually taken as the hypothesis space. Recall that a function $K : X \times X \to \mathbb{R}$ is called a Mercer kernel if it is continuous, symmetric, and positive semidefinite. The reproducing kernel Hilbert space $\mathcal{H}_K$ is defined to be the closure of the linear span of $\{K_x = K(x, \cdot) : x \in X\}$. The reproducing property takes the form $f(x) = \langle f, K_x \rangle_K$ for all $f \in \mathcal{H}_K$ and $x \in X$. For the Mercer kernel $K$, we denote $\kappa = \sup_{x \in X} \sqrt{K(x, x)}$.

Our first contribution is to generalize the definition of regularization in [1] such that many more learning algorithms can be included in the scope of spectral algorithms.

Definition 1.1. We say that a family of continuous functions $\{g_\lambda\}$ is a regularization if the following conditions hold.
(i) There exists a constant such that
(ii) There exists a constant such that
(iii) There exists a constant such that
(iv) The qualification of the regularization is the maximal value such that , where the constant does not depend on $\lambda$.
Our definition of regularization differs from that in [1]. In fact, the definition given in [1] is the special case obtained by taking in (1.5) and (1.7). From this viewpoint, our assumption is milder and fits more general situations; for example, coefficient regularization algorithms correspond to spectral algorithms with , and the relation between coefficient regularization algorithms and spectral algorithms has been explored in [2].
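For illustration only, the following minimal sketch evaluates three classical filter families from the spectral regularization literature (Tikhonov, spectral cutoff, and truncated Landweber) and numerically checks two qualitative properties shared by Definition 1.1: $\sigma g_\lambda(\sigma)$ stays bounded, and the residual $1 - \sigma g_\lambda(\sigma)$ decays as $\lambda \to 0$ for fixed $\sigma$. The grid, the function names, and the spectrum scaling $\kappa = 1$ are assumptions made for the sketch; the exact constants and exponents in conditions (1.5) and (1.7) are not reproduced here.

```python
import numpy as np

# Three classical filter families g_lambda(sigma); all names and scalings are illustrative.
def tikhonov(sigma, lam):
    return 1.0 / (sigma + lam)

def spectral_cutoff(sigma, lam):
    # keep 1/sigma above the threshold lam, zero below it
    return np.where(sigma >= lam, 1.0 / np.maximum(sigma, lam), 0.0)

def landweber(sigma, lam):
    # truncated Neumann series with t = ceil(1/lam) terms and unit step size
    t = int(np.ceil(1.0 / lam))
    return sum((1.0 - sigma) ** i for i in range(t))

sigmas = np.linspace(1e-4, 1.0, 2000)   # spectrum scaled into (0, kappa^2] with kappa = 1
for lam in (1e-1, 1e-2, 1e-3):
    for name, g in (("tikhonov", tikhonov), ("cutoff", spectral_cutoff), ("landweber", landweber)):
        vals = g(sigmas, lam)
        residual = 1.0 - 0.5 * g(np.array([0.5]), lam)[0]
        print(f"{name:9s} lam={lam:g}  max|sigma*g| = {np.max(sigmas * vals):.3f}  "
              f"residual at sigma=0.5: {residual:.2e}")
```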
Let $\mathbf{x} = (x_1, \ldots, x_m)$ and $\mathbf{y} = (y_1, \ldots, y_m)$. The sample operator $S_{\mathbf{x}} : \mathcal{H}_K \to \mathbb{R}^m$ is defined as $S_{\mathbf{x}} f = (f(x_1), \ldots, f(x_m))$. The adjoint of $S_{\mathbf{x}}$ with respect to $1/m$ times the Euclidean inner product is $S_{\mathbf{x}}^{*} \mathbf{c} = \frac{1}{m} \sum_{i=1}^{m} c_i K_{x_i}$. For simplicity, we write $T_{\mathbf{x}}$ for $S_{\mathbf{x}}^{*} S_{\mathbf{x}}$.
The spectral regularization algorithm considered here is given by $$f_{\mathbf{z}} = g_\lambda(T_{\mathbf{x}})\, S_{\mathbf{x}}^{*} \mathbf{y}. \qquad (1.8)$$
The regularization in (1.8) was originally proposed to solve ill-posed inverse problems. The relation between learning theory and the regularization of linear ill-posed problems has been discussed thoroughly in a series of articles; see [1, 3] and the references therein. The analysis in this previous literature provides a deep understanding of the connection between learning theory and regularization.
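Assuming the usual kernel-matrix form of estimators of type (1.8), namely $f_{\mathbf{z}}(x) = \sum_{i=1}^{m} \alpha_i K(x, x_i)$ with $\alpha = \frac{1}{m}\, g_\lambda\bigl(\tfrac{1}{m}\mathbf{K}\bigr)\mathbf{y}$ for the kernel matrix $\mathbf{K} = (K(x_i, x_j))_{i,j}$, a spectral algorithm can be sketched by applying the filter to the eigenvalues of the normalized kernel matrix. The Gaussian kernel, the synthetic data, and the Tikhonov filter below are illustrative choices, not the paper's prescription.

```python
import numpy as np

def gaussian_kernel(X1, X2, width=1.0):
    # K(x, x') = exp(-|x - x'|^2 / (2 * width^2)); an illustrative Mercer kernel
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-d2 / (2.0 * width ** 2))

def spectral_estimator(x_train, y_train, g_lambda, lam):
    """Return x -> f_z(x) = sum_i alpha_i K(x, x_i),
    with alpha = (1/m) g_lambda(K/m) y, an assumed kernel form of (1.8)."""
    m = len(x_train)
    K = gaussian_kernel(x_train, x_train)
    evals, evecs = np.linalg.eigh(K / m)          # spectral decomposition of the matrix of T_x
    filtered = evecs @ np.diag(g_lambda(np.maximum(evals, 0.0), lam)) @ evecs.T
    alpha = filtered @ y_train / m
    return lambda x: gaussian_kernel(np.atleast_1d(x), x_train) @ alpha

# Illustrative use with the Tikhonov filter g_lambda(sigma) = 1/(sigma + lam).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(200)
f_z = spectral_estimator(x, y, lambda s, lam: 1.0 / (s + lam), lam=1e-2)
print("f_z(0.3) approx", float(f_z(0.3)[0]), " target sin(0.3*pi) =", round(float(np.sin(0.3 * np.pi)), 4))
```

Swapping in a different filter $g_\lambda$ (cutoff, Landweber, etc.) changes the algorithm without touching the rest of the code, which is the practical appeal of the spectral viewpoint.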
A large class of learning algorithms can be considered as spectral regularization algorithms in accordance with different regularizations.

Example 1.2. The regularized least squares algorithm is given as $$f_{\mathbf{z},\lambda} = \arg\min_{f \in \mathcal{H}_K} \frac{1}{m}\sum_{i=1}^{m} \bigl(f(x_i) - y_i\bigr)^2 + \lambda \|f\|_K^2.$$ It has been well understood through an extensive body of literature [4–11]. It is proved in [7] that the minimizer equals $(T_{\mathbf{x}} + \lambda I)^{-1} S_{\mathbf{x}}^{*} \mathbf{y}$, which corresponds to algorithm (1.8) with the regularization $g_\lambda(\sigma) = \frac{1}{\sigma + \lambda}$. In this case, the constants in Definition 1.1 can be computed explicitly, and the qualification is $1$.
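As a sanity check on Example 1.2, the snippet below compares the closed-form regularized least squares coefficients $(\mathbf{K} + m\lambda I)^{-1}\mathbf{y}$ with the filtered form $\frac{1}{m} g_\lambda(\tfrac{1}{m}\mathbf{K})\mathbf{y}$ for $g_\lambda(\sigma) = 1/(\sigma + \lambda)$; the two coefficient vectors agree up to numerical error. The data and kernel are again illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
y = np.cos(2 * x) + 0.05 * rng.standard_normal(50)
m, lam = len(x), 1e-2
K = np.exp(-(x[:, None] - x[None, :]) ** 2)          # illustrative Gaussian kernel matrix

# Closed form: alpha = (K + m*lam*I)^{-1} y, the usual RLS solution.
alpha_closed = np.linalg.solve(K + m * lam * np.eye(m), y)

# Filtered form: alpha = (1/m) g_lambda(K/m) y with g_lambda(s) = 1/(s + lam).
evals, evecs = np.linalg.eigh(K / m)
alpha_filter = evecs @ ((evecs.T @ y) / (evals + lam)) / m

print("max coefficient difference:", np.max(np.abs(alpha_closed - alpha_filter)))
```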

Example 1.3. In regression learning, the coefficient regularization scheme with the $\ell^2$ norm becomes , where .
Coefficient regularization was first introduced by Vapnik [12] to design linear programming support vector machines. The consistency of this algorithm has been studied in [2, 13, 14]. In [2], it is proved that the sample error has decay, even for kernels that are not positive semidefinite, and Thus, it corresponds to algorithm (1.8) with the regularization In this case, we have , the qualification , and .
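For orientation on Example 1.3: a filter that is often associated with $\ell^2$ coefficient regularization is $g_\lambda(\sigma) = \sigma/(\sigma^2 + \lambda)$ (this explicit form is an assumption here, since the displayed formula above is not reproduced). The short check below illustrates two of its elementary properties, $\sup_\sigma |g_\lambda(\sigma)| = \frac{1}{2\sqrt{\lambda}}$ and $1 - \sigma g_\lambda(\sigma) = \lambda/(\sigma^2 + \lambda)$, which already hint at why its bounds scale differently in $\lambda$ than in the Tikhonov case.

```python
import numpy as np

def g_coef(sigma, lam):
    # assumed filter for l^2 coefficient regularization: sigma / (sigma^2 + lam)
    return sigma / (sigma ** 2 + lam)

sigmas = np.linspace(1e-6, 1.0, 200000)
for lam in (1e-2, 1e-4):
    sup_g = np.max(g_coef(sigmas, lam))
    residual = 1.0 - sigmas * g_coef(sigmas, lam)          # equals lam / (sigma^2 + lam)
    print(f"lam={lam:g}: sup |g| = {sup_g:.3f}  vs  1/(2*sqrt(lam)) = {0.5 / np.sqrt(lam):.3f}"
          f"   residual at sigma=0.5: {residual[np.argmin(np.abs(sigmas - 0.5))]:.2e}")
```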

Example 1.4. Landweber iteration is defined by $g_\lambda(\sigma) = \sum_{i=0}^{t-1} (1 - \sigma)^{i}$, where $\lambda = 1/t$ with $t \in \mathbb{N}$. This corresponds to the gradient descent algorithm in Yao et al. [15] with constant step size. In this case, any $\nu > 0$ can be taken as a qualification of this method, with the corresponding constant equal to $1$ if $0 < \nu \le 1$ and $\nu^{\nu}$ otherwise.
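The sketch below implements Landweber iteration in kernel coefficient form as plain gradient descent on the empirical squared error with unit step size, and checks that after $t$ steps it coincides with the filtered estimator obtained by applying $g(\sigma) = \sum_{i=0}^{t-1}(1-\sigma)^{i}$ to the normalized kernel matrix. The unit step size presumes that the spectrum lies in $[0, 1]$ (which holds for the normalized matrix below); all concrete choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=80)
y = x ** 2 + 0.05 * rng.standard_normal(80)
m, t = len(x), 50
K = np.exp(-(x[:, None] - x[None, :]) ** 2)      # illustrative kernel; eigenvalues of K/m lie in [0, 1]

# Gradient descent on the empirical squared error, unit step size, started from alpha = 0:
#   alpha_{k+1} = alpha_k + (y - K alpha_k) / m
alpha = np.zeros(m)
for _ in range(t):
    alpha = alpha + (y - K @ alpha) / m

# Filtered form: alpha = (1/m) g(K/m) y with g(sigma) = sum_{i<t} (1 - sigma)^i.
evals, evecs = np.linalg.eigh(K / m)
g_vals = np.array([sum((1.0 - s) ** i for i in range(t)) for s in evals])
alpha_filter = evecs @ (g_vals * (evecs.T @ y)) / m

print("max difference between iteration and filter form:", np.max(np.abs(alpha - alpha_filter)))
```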
Let $f_{\mathcal{H}}$ be the projection of the regression function $f_\rho$ onto $\overline{\mathcal{H}_K}$, where $\overline{\mathcal{H}_K}$ denotes the closure of $\mathcal{H}_K$ in $L^2_{\rho_X}$. The generalization error of $f$ is , where $\rho_X$ is the marginal distribution of $\rho$ on $X$ and is the variance of the random variable . So the goodness of the approximation is measured by $\|f_{\mathbf{z}} - f_{\mathcal{H}}\|_\rho$, where we take the norm defined as $\|f\|_\rho = \bigl(\int_X |f(x)|^2 \, d\rho_X\bigr)^{1/2}$.
The integral operator $L_K$ associated with the kernel $K$, from $L^2_{\rho_X}$ to $L^2_{\rho_X}$, is defined by $(L_K f)(x) = \int_X K(x, t) f(t) \, d\rho_X(t)$; it is a nonnegative self-adjoint compact operator [4]. If the domain of $L_K$ is restricted to $\mathcal{H}_K$, it is also a nonnegative self-adjoint compact operator from $\mathcal{H}_K$ to $\mathcal{H}_K$, with norm at most $\kappa^2$ [16]. In the sequel, we simply write $T$ instead of $L_K$ and assume that $|y| \le M$ almost surely.
As usual, we use the following error decomposition: where
The first term on the right-hand side of (1.19) is called the sample error, and the second one the approximation error. The sample error depends on the sampling, and its estimation follows from the law of large numbers; the approximation error is independent of the sampling, and it is estimated mainly through operator approximation methods.
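For orientation, since the displayed decomposition (1.19) is not reproduced above, a decomposition of this general shape, written here under the assumed identifications $f_{\mathbf{z}} = g_\lambda(T_{\mathbf{x}}) S_{\mathbf{x}}^{*}\mathbf{y}$ and $g_\lambda(T) T f_{\mathcal{H}}$ as its noise-free population counterpart, reads

$$ f_{\mathbf{z}} - f_{\mathcal{H}} \;=\; \underbrace{\bigl(g_\lambda(T_{\mathbf{x}})\, S_{\mathbf{x}}^{*}\mathbf{y} - g_\lambda(T)\, T f_{\mathcal{H}}\bigr)}_{\text{sample error}} \;+\; \underbrace{\bigl(g_\lambda(T)\, T f_{\mathcal{H}} - f_{\mathcal{H}}\bigr)}_{\text{approximation error}} . $$

The first bracket is controlled by the concentration of $T_{\mathbf{x}}$ around $T$ and of $S_{\mathbf{x}}^{*}\mathbf{y}$ around $T f_{\mathcal{H}}$, while the second depends only on $\lambda$, the filter, and the prior condition.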
In order to deduce error bounds and learning rates, we have to place a restriction on the class of possible probability measures; such a restriction is usually called a prior condition. In the previous literature, prior conditions are usually described through the smoothness of the regression function $f_\rho$. We suppose the following prior condition: Here, the function , called the index function, is continuous and nondecreasing on with .
In the sequel, we require the qualification , and that there exists covering , which means that there is such that It is easy to see that, for any , covers .
Furthermore, we require that is operator monotone on , that is, there is a constant such that, for any pair , of nonnegative self-adjoint operators on some Hilbert space with norms less than , it holds that and there is such that It is proved in [8] that is operator monotone for .
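As a small numerical illustration of the kind of operator monotonicity inequality used here, in its Lipschitz-type form $\|\phi(B) - \phi(A)\| \le C_\phi\, \phi(\|B - A\|)$ (the exact constants and the interval are not reproduced above), the snippet below tests the index function $\phi(\sigma) = \sqrt{\sigma}$ on random positive semidefinite matrices; for this $\phi$ the inequality is known to hold with $C_\phi = 1$.

```python
import numpy as np

def psd_sqrt(M):
    # matrix square root of a symmetric positive semidefinite matrix via eigendecomposition
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

rng = np.random.default_rng(3)
worst = 0.0
for _ in range(200):
    X, Y = rng.standard_normal((2, 6, 6))
    A, B = X @ X.T, Y @ Y.T
    A, B = A / np.linalg.norm(A, 2), B / np.linalg.norm(B, 2)   # operator norms scaled to 1
    lhs = np.linalg.norm(psd_sqrt(A) - psd_sqrt(B), 2)
    rhs = np.sqrt(np.linalg.norm(A - B, 2))
    worst = max(worst, lhs / rhs)

print("largest observed ratio ||sqrt(A)-sqrt(B)|| / ||A-B||^(1/2):", round(worst, 4))   # stays <= 1
```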
In [1], Bauer et al. consider the following prior condition: This condition is somewhat restrictive, since it requires that belong to .
Our result shows that a satisfactory error bound is available under a more general prior condition; this is our second main contribution. The main result of this paper is the following theorem.

Theorem 1.5. Suppose that the index function with covering is operator monotone on , and that the qualification satisfies for . Then, with confidence , there holds where and is a constant independent of .

This theorem shows the consistency of the spectral algorithms, gives the error bound, and can also lead to satisfactory learning rates via the explicit expression of .

This paper is organized as follows. In Section 2, we prove a basic lemma on the estimation of operator norms related to the regularization, as well as two concentration inequalities for vector-valued random variables. In Section 3, we give the proof of Theorem 1.5. In Section 4, we derive learning rates for several specific regularizations.

2. Some Lemmas

We simply write instead of in (1.7) for qualification . To estimate the error , we need the following lemma to bound the norms of some operators.

Lemma 2.1. Let be an index function and . Then, the following inequalities hold: Here, are constants depending only on .

Proof. By (1.6) and (1.7), for any , we have
Since and is covered by , by (2.1) and (1.6), we get
In order to prove the third inequality, let and ; by (2.2), we have Thus, where is a constant depending only on .
If , we have A similar computation shows that, for , Thus, the last inequality holds, and we complete the proof.

By taking in (2.1), we have

The estimates of operator norms mainly rely on the following classical argument from operator theory: let be a positive operator on a Hilbert space and ; then is self-adjoint by [17, Proposition 4.4.7], and by [17, Theorem 4.4.8], , where is the spectrum of . Consequently, .
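The operator-calculus fact quoted above, that $\|\varphi(A)\|$ equals the maximum of $|\varphi|$ over the spectrum of a positive operator $A$, can be checked directly in finite dimensions; the matrix and the two test functions below are illustrative.

```python
import numpy as np

def apply_fn(A, phi):
    # phi(A) via the spectral decomposition of the symmetric matrix A
    w, V = np.linalg.eigh(A)
    return V @ np.diag(phi(w)) @ V.T

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 5))
A = X @ X.T                                    # positive (semi)definite test matrix
w = np.linalg.eigvalsh(A)

for name, phi in (("sqrt", np.sqrt), ("1/(s+0.1)", lambda s: 1.0 / (s + 0.1))):
    lhs = np.linalg.norm(apply_fn(A, phi), 2)         # operator norm of phi(A)
    rhs = np.max(np.abs(phi(w)))                      # max of |phi| over the spectrum
    print(f"{name}: ||phi(A)|| = {lhs:.6f},  max over spectrum = {rhs:.6f}")
```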

The following probability inequality concerning random variables with values in a Hilbert space is proved in [18].

Lemma 2.2. Let be a Hilbert space and a random variable on with values in . Assume almost surely. Denote . Let be independent random draws of . For any , with confidence , there holds

Let be the class of all Hilbert–Schmidt operators on . It forms a Hilbert space with the inner product , where is an orthonormal basis of ; this definition does not depend on the choice of basis. The integral operator , as an operator on , belongs to , and (see [9]). By Lemma 2.2, we can estimate the following operator norms.
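To fix ideas about the Hilbert–Schmidt inner product, the check below evaluates $\sum_i \langle A e_i, B e_i\rangle$ for two different orthonormal bases of $\mathbb{R}^n$ and compares the result with the trace formula $\operatorname{tr}(B^{\top} A)$; the matrices and dimension are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
A, B = rng.standard_normal((2, n, n))

def hs_inner(A, B, basis):
    # sum_i <A e_i, B e_i> for the columns e_i of an orthonormal basis matrix
    return sum(float((A @ basis[:, i]) @ (B @ basis[:, i])) for i in range(basis.shape[1]))

standard = np.eye(n)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))       # another orthonormal basis

print("standard basis:", round(hs_inner(A, B, standard), 8))
print("rotated basis: ", round(hs_inner(A, B, Q), 8))
print("trace formula: ", round(float(np.trace(B.T @ A)), 8))
```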

Lemma 2.3. Let be a sample set drawn i.i.d. from . With confidence , we have

Proof. Observe that . Denote . Here is the random variable on given by .
Consider For and , the reproducing property ensures that Hence, , and thereby According to (2.15), there holds . Inequality (2.14) then follows from (2.12) and the fact that .

Lemma 2.4. Under the assumptions of Lemma 2.1, let be a sample set drawn i.i.d. from . With confidence , we have

Proof. Define , so that is a random variable from to . Combining the reproducing property with the Cauchy–Schwarz inequality, we get Since is an isometric isomorphism from onto (see [16]), we obtain where the last inequality follows from (2.4).
By almost surely, there holds By (2.3) and , we get where, in the last step, we used the result of Proposition 3.1 in Section 3. For simplicity, we write for . Applying Lemma 2.2, there holds Then, we can use the following inequality to get the desired error bound, This completes the proof of Lemma 2.4.

3. Error Analysis

Proposition 3.1. Let be an index function with covering and . Then, under the prior condition (1.21), there holds .

Proof. From the definitions of and , we have So the following error estimate holds: where the last inequality follows from (2.2).

Let us focus on the estimation of sample error.

Consider The idea is to bound each term in separately. We start by dealing with the first term of (3.3).

Consider According to (1.4) and (1.5), we derive the following bound: Now, we are in a position to bound (3.4).

Suppose that , then By Lemmas 2.3 and 2.4, with confidence , the following inequalities hold simultaneously:

Combining (1.6) and (3.5) with the operator monotonicity of and , we obtain By Lemma 2.1 and (3.5), In order to bound , we rewrite in the following form: In the same way, we have Thus, we can get the bound for by combining (3.8), (3.9), and (3.11). What is left is to estimate and ; for this, we can employ the same approach used in the estimation of .

Consider Finally, combining (3.8)–(3.13) with Proposition 3.1, we conclude that Theorem 1.5 holds.

4. Learning Rates

The significance of this paper lies in two facts: firstly, we generalize the definition of regularization and thereby enrich the content of spectral regularization algorithms; secondly, the analysis of this paper can be carried out under the very general prior condition (1.21). Thus, our results apply to many different kinds of regularization, such as regularized least squares learning, coefficient regularization learning, (accelerated) Landweber iteration, and spectral cutoff. In this section, we choose a suitable index function and apply Theorem 1.5 to some of the specific algorithms mentioned in Section 1.

4.1. Least Square Regularization

In this case, the regularization is $g_\lambda(\sigma) = \frac{1}{\sigma + \lambda}$ with . The qualification of this algorithm is $1$. Suppose with ; that means , . Thus, we have that covers .

Using the result of Theorem 1.5, we obtain the following corollary.

Corollary 4.1. Under the assumptions of Theorem 1.5, we have the following.
(i) For , with confidence , there holds By taking , we have the following learning rate:
(ii) For , with confidence , there holds By taking , we have the following learning rate:

4.2. Coefficient Regularization with the $\ell^2$ Norm

In this case, the regularization is with . The qualification is . We also consider the index function with and .

Corollary 4.2. Under the assumptions of Theorem 1.5, we have the following.
(i) For , with confidence , there holds By taking , we have the following learning rate:
(ii) For , with confidence , there holds By taking , we have the following learning rate:

For coefficient regularization, the learning rates derived from Theorem 1.5 are almost the same as those in Corollary 5.2 of [2]. For least squares regularization, the learning rates in Corollary 4.1 are weaker: the integral operator analysis in [8] gives the learning rate for , and the leave-one-out analysis in [11] gives the rate .

Our analysis is influenced by both the prior condition and the regularization. Under the weaker prior condition (1.21), some techniques for error analysis used in [1] are inapplicable; we therefore adopt a more involved error decomposition and a refined analysis to obtain error bounds and learning rates.

Acknowledgment

This work is supported by the Natural Science Foundation of China (Grant no. 11071276).