Research Article | Open Access

# On the Convergence Rate of Kernel-Based Sequential Greedy Regression

**Academic Editor:** Jean M. Combes

#### Abstract

A kernel-based greedy algorithm is presented to realize efficient sparse learning with data-dependent basis functions. An upper bound on the generalization error is obtained from a complexity measure of the hypothesis space in terms of covering numbers. A careful analysis shows that the error has a satisfactory decay rate under mild conditions.

#### 1. Introduction

Kernel methods have been extensively utilized in various learning tasks, and their generalization performance has been investigated from the viewpoint of approximation theory [1, 2]. Among these methods, a family can be regarded as a coefficient-based regularized framework in data-dependent hypothesis spaces; see, for example, [3–8]. For given samples $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m$, the solution of these kernel methods has the expression $f(x) = \sum_{i=1}^m \alpha_i K(x_i, x)$, where $\alpha_i \in \mathbb{R}$ and $K$ is a Mercer kernel. The aim of these coefficient-based algorithms is to search for a set of coefficients $\{\alpha_i\}$ with good predictive performance.

Inspired by greedy approximation methods in [9–12], we propose a sparse greedy algorithm for regression. The greedy approximation has two advantages over the regularization methods: one is that the sparsity is directly controlled by a greedy approximation algorithm, rather than by the regularization parameter; the other is that greedy approximation does not change the objective optimization function, while the regularized methods usually modify the objective function by including a sparse regularization term [13].

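Before introducing the greedy algorithm, we recall some preliminary background for regression. Let the input space $X \subset \mathbb{R}^d$ be a compact subset and $Y = [-M, M]$ for some constant $M > 0$. In the regression model, the learner gets a sample set $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m$, where $(x_i, y_i) \in Z := X \times Y$, $1 \le i \le m$, are drawn independently from an unknown distribution $\rho$ on $Z$. The goal of learning is to pick a function $f: X \to Y$ with the expected error $\mathcal{E}(f) = \int_Z (f(x) - y)^2 \, d\rho$ as small as possible. Note that the regression function $f_\rho(x) = \int_Y y \, d\rho(y \mid x)$ is the minimizer of $\mathcal{E}(f)$, where $\rho(\cdot \mid x)$ is the conditional probability measure at $x$ induced by $\rho$.

The empirical error is defined as

$$\mathcal{E}_{\mathbf{z}}(f) = \frac{1}{m} \sum_{i=1}^m \left(f(x_i) - y_i\right)^2.$$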
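As a small illustration, the sketch below computes an empirical error, assuming the usual least-squares form (the average of $(f(x_i) - y_i)^2$ over the sample). The function name `empirical_error` and the toy sample are hypothetical and serve only as an example.

```python
def empirical_error(f, sample):
    """Average squared residual of f over a sample of (x, y) pairs."""
    if not sample:
        raise ValueError("sample must be non-empty")
    return sum((f(x) - y) ** 2 for x, y in sample) / len(sample)


def identity(x):
    return x


# Hypothetical toy sample, for illustration only.
sample = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0)]

err_identity = empirical_error(identity, sample)  # residuals are 0, 0, 1

# Among constant predictors, the sample mean of the outputs does best,
# mirroring the role of the conditional mean for the expected error.
y_mean = sum(y for _, y in sample) / len(sample)
err_mean = empirical_error(lambda x: y_mean, sample)
err_shifted = empirical_error(lambda x: y_mean + 0.1, sample)
```

The comparison between `err_mean` and `err_shifted` is only a sanity check on constants; it says nothing about richer function classes.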

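We call a symmetric and positive semidefinite continuous function $K: X \times X \to \mathbb{R}$ a Mercer kernel. The reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ is defined to be the closure of the linear span of the set of functions $\{K_x := K(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ defined by $\langle K_x, K_{x'} \rangle_K = K(x, x')$. For all $x \in X$ and $f \in \mathcal{H}_K$, the reproducing property is given by $\langle K_x, f \rangle_K = f(x)$. We can see that $\kappa := \sup_{x \in X} \sqrt{K(x, x)} < \infty$ because of the continuity of $K$ and the compactness of $X$.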
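The Mercer conditions can be spot-checked numerically on a finite point set. The sketch below uses a Gaussian kernel on the real line, which is an assumption chosen for illustration (the text does not fix a particular kernel); the helper names are hypothetical. For a finite expansion $f = \sum_i c_i K(x_i, \cdot)$, the squared RKHS norm equals the quadratic form $c^\top G c$ of the Gram matrix $G$, so positive semidefiniteness means this quantity is never negative.

```python
import math
import random


def gaussian_kernel(x, t, sigma=1.0):
    """Gaussian kernel on the real line, a standard example of a Mercer kernel."""
    return math.exp(-((x - t) ** 2) / (2.0 * sigma ** 2))


def gram_matrix(points, kernel):
    """Matrix of kernel values K(x_i, x_j) on the given points."""
    return [[kernel(a, b) for b in points] for a in points]


def rkhs_norm_squared(coeffs, gram):
    """Squared RKHS norm of f = sum_i c_i K(x_i, .), that is, c^T G c."""
    n = len(coeffs)
    return sum(coeffs[i] * gram[i][j] * coeffs[j]
               for i in range(n) for j in range(n))


rng = random.Random(0)
points = [rng.uniform(-1.0, 1.0) for _ in range(6)]
gram = gram_matrix(points, gaussian_kernel)
n = len(points)

# Symmetry: K(x_i, x_j) equals K(x_j, x_i).
is_symmetric = all(gram[i][j] == gram[j][i] for i in range(n) for j in range(n))

# Positive semidefiniteness, spot-checked on random coefficient vectors.
min_norm_sq = min(
    rkhs_norm_squared([rng.gauss(0.0, 1.0) for _ in range(n)], gram)
    for _ in range(200)
)

# For this kernel K(x, x) = 1, so the supremum of sqrt(K(x, x)) is 1.
diagonal_is_one = all(gram[i][i] == 1.0 for i in range(n))
```

A random spot check cannot prove positive semidefiniteness; it only fails to find a counterexample.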

Different from the coefficient-based regularized method [3–6], we use the idea of sequential greedy approximation to realize sparse learning in this paper. Denote , where and . The hypothesis space (depending on ) is defined as For any hypothesis function space , we denote .

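The definition of $f_\rho$ tells us $|f_\rho(x)| \le M$ for all $x \in X$, so it is natural to restrict the approximating functions to $[-M, M]$. The projection operator has been used in error analysis of learning algorithms (see, e.g., [2, 14]).

*Definition 1.1. *The projection operator $\pi_M$ is defined on the space of measurable functions $f: X \to \mathbb{R}$ as

$$\pi_M(f)(x) = \begin{cases} M, & \text{if } f(x) > M, \\ f(x), & \text{if } |f(x)| \le M, \\ -M, & \text{if } f(x) < -M. \end{cases}$$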
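A minimal sketch of such a projection, assuming it is the usual truncation of function values to the interval $[-M, M]$; the name `project` is hypothetical.

```python
def project(f, M):
    """Return the truncated function x -> min(M, max(-M, f(x)))."""
    if M <= 0:
        raise ValueError("M must be positive")

    def projected(x):
        return max(-M, min(M, f(x)))

    return projected


def cubic(x):
    return x ** 3


clipped = project(cubic, M=1.0)
values = [clipped(x) for x in (-2.0, -0.5, 0.5, 2.0)]  # [-1.0, -0.125, 0.125, 1.0]

# When |y| <= M, truncation never increases the squared residual.
y = 0.8
residual_raw = (cubic(2.0) - y) ** 2
residual_clipped = (clipped(2.0) - y) ** 2
```

The last two lines illustrate why truncation is harmless for bounded outputs: moving a prediction toward $[-M, M]$ moves it toward every admissible $y$.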

The kernel-based greedy algorithm can be summarized as follows. Let be a stopping time and let be a positive constant. Set . And then for , define

Unlike the regularized algorithms in [6, 12, 14–18], the above learning algorithm realizes efficient learning by greedy approximation. Studying its generalization performance enriches the learning theory of kernel-based regression. In the remainder of this paper, we focus on establishing the convergence rate of to the regression function under a suitable choice of parameters. The theoretical result relies on weaker conditions than the previous error analysis for the kernel-based regularization framework in [4, 5].
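To make the idea of sequential greedy approximation concrete, here is a rough sketch in the spirit of [11]. It is not the exact update rule of this paper: the dictionary $\{\pm\beta K(x_i, \cdot)\}$, the convex-combination update $f_t = (1 - \alpha) f_{t-1} + \alpha g$ with $\alpha \in [0, 1]$, the Gaussian kernel, and all function and parameter names (`greedy_fit`, `beta`, `T`) are assumptions made for illustration.

```python
import math


def gaussian_kernel(x, t, sigma=0.5):
    """Gaussian kernel; an assumed choice, not fixed by the text."""
    return math.exp(-((x - t) ** 2) / (2.0 * sigma ** 2))


def greedy_fit(xs, ys, kernel, beta, T):
    """Greedy least-squares fit over the dictionary {+/- beta * K(x_i, .)}.

    Returns the coefficients c of f_T = sum_i c[i] * K(x_i, .) and the
    empirical error after each step.
    """
    m = len(xs)
    gram = [[kernel(a, b) for b in xs] for a in xs]
    coeffs = [0.0] * m   # f_0 = 0
    fitted = [0.0] * m   # values of the current iterate on the sample
    errors = []
    for _ in range(T):
        best = None
        for i in range(m):
            for sign in (1.0, -1.0):
                g = [sign * beta * gram[i][j] for j in range(m)]
                d = [g[j] - fitted[j] for j in range(m)]
                dd = sum(v * v for v in d)
                if dd == 0.0:
                    continue
                rd = sum((fitted[j] - ys[j]) * d[j] for j in range(m))
                # Exact line search for the squared loss, clipped to [0, 1].
                alpha = min(1.0, max(0.0, -rd / dd))
                err = sum((fitted[j] + alpha * d[j] - ys[j]) ** 2
                          for j in range(m)) / m
                if best is None or err < best[0]:
                    best = (err, i, sign, alpha, d)
        if best is None:
            break
        err, i, sign, alpha, d = best
        # Convex-combination update: shrink old coefficients, add the new atom.
        coeffs = [(1.0 - alpha) * c for c in coeffs]
        coeffs[i] += alpha * sign * beta
        fitted = [fitted[j] + alpha * d[j] for j in range(m)]
        errors.append(err)
    return coeffs, errors


# Hypothetical noiseless toy data, for illustration only.
xs = [i / 19.0 for i in range(20)]
ys = [math.sin(2.0 * math.pi * x) for x in xs]
beta, T = 5.0, 10
coeffs, errors = greedy_fit(xs, ys, gaussian_kernel, beta=beta, T=T)
baseline = sum(y * y for y in ys) / len(ys)  # empirical error of f_0 = 0
```

In this sketch sparsity is controlled directly by the number of steps: after `T` iterations at most `T` coefficients are nonzero, and the iterate stays in the convex hull of the dictionary, so the sum of absolute coefficients never exceeds `beta`.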

#### 2. Main Result

Define a data-free basis function set

To investigate the approximation of to , we introduce a data-independent function

Observe that Here, the three terms on the right-hand side are called the sample error, the hypothesis error, and the approximation error, respectively.

To estimate the sample error, we need a complexity measure of the hypothesis function space . For this reason, we introduce covering numbers to measure the complexity.

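*Definition 2.1. *Let $(\mathcal{U}, d)$ be a pseudometric space and denote a subset $S \subset \mathcal{U}$. For every $\varepsilon > 0$, the covering number $\mathcal{N}(S, \varepsilon, d)$ of $S$ with respect to $\varepsilon$ and $d$ is defined as the minimal number of balls of radius $\varepsilon$ whose union covers $S$, that is,

$$\mathcal{N}(S, \varepsilon, d) = \min\left\{ l \in \mathbb{N} : S \subset \bigcup_{j=1}^{l} B(s_j, \varepsilon) \text{ for some } \{s_j\}_{j=1}^{l} \subset \mathcal{U} \right\},$$

where $B(s_j, \varepsilon) = \{ s \in \mathcal{U} : d(s, s_j) \le \varepsilon \}$ is a ball in $\mathcal{U}$.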

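The empirical covering number with the $\ell^2$ metric is defined as below.

*Definition 2.2. *Let $\mathcal{F}$ be a set of functions on $X$, and $\mathbf{x} = (x_i)_{i=1}^k \in X^k$. Set $\mathcal{F}|_{\mathbf{x}} = \{ (f(x_i))_{i=1}^k : f \in \mathcal{F} \} \subset \mathbb{R}^k$. The $\ell^2$-empirical covering number of $\mathcal{F}$ is defined by

$$\mathcal{N}_2(\mathcal{F}, \varepsilon) = \sup_{k \in \mathbb{N}} \sup_{\mathbf{x} \in X^k} \mathcal{N}\left(\mathcal{F}|_{\mathbf{x}}, \varepsilon, d_2\right),$$

where the metric is

$$d_2(\mathbf{a}, \mathbf{b}) = \left( \frac{1}{k} \sum_{i=1}^{k} |a_i - b_i|^2 \right)^{1/2}, \quad \mathbf{a}, \mathbf{b} \in \mathbb{R}^k.$$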
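For intuition, a covering number of a finite set of function-value vectors can be bounded from above by a greedy $\varepsilon$-net. The sketch below assumes the normalized $\ell^2$ distance $\left(\frac{1}{k}\sum_i |a_i - b_i|^2\right)^{1/2}$ and works at one fixed set of evaluation points, whereas an empirical covering number in the usual sense takes a supremum over all such sets. The function class (lines through the origin with slope in $[-1, 1]$) and all names are hypothetical.

```python
import math


def d2(a, b):
    """Normalized l2 distance between two vectors of function values."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)) / len(a))


def greedy_cover(vectors, eps):
    """Greedily pick centers so every vector lies within eps of some center.

    The number of centers is an upper bound on the covering number of the
    finite set `vectors` at radius eps, since the centers form a valid cover.
    """
    centers = []
    for v in vectors:
        if all(d2(v, c) > eps for c in centers):
            centers.append(v)
    return centers


# Hypothetical class: f_w(x) = w * x with slope w on a grid in [-1, 1],
# restricted to four evaluation points.
eval_points = (0.25, 0.5, 0.75, 1.0)
slopes = [k / 100.0 for k in range(-100, 101)]
restricted = [[w * x for x in eval_points] for w in slopes]

coarse = greedy_cover(restricted, eps=0.5)
fine = greedy_cover(restricted, eps=0.1)
```

As expected, shrinking the radius from 0.5 to 0.1 requires more centers; the growth of this count as $\varepsilon \to 0$ is what a capacity assumption controls.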

Denote as the ball of radius with , where . We need the following capacity assumption on , which has been used in [5, 6, 18].

*Assumption 2.3. *There exist an exponent , with and a constant such that

We now formulate the generalization error bounds for . The result follows from Propositions 3.2–3.5 in the next section.

Theorem 2.4. *Under Assumption 2.3, for any , the following inequality holds with confidence *

From the result, we know there exists a constant independent of such that with confidence In particular, if for some fixed constant and , we have with decay rate . The learning rate is satisfactory as .

Here, the estimate of the hypothesis error is simple and does not need the strict condition on and in [3–5] for learning with data-dependent hypothesis spaces.

If there are some additional conditions on approximation error with the increasing of , we can obtain the explicit learning rates with suitable parameter selection.

Corollary 2.5. *Assume that the RKHS satisfies (2.7) and for some . Choose . For any and , one has
**
with confidence . Here is a constant independent of . *

Observe that the learning rate depends closely on the approximation condition between and . This means that the learning algorithm can achieve good generalization performance only if the target function can be well described by functions from the hypothesis space. In fact, similar approximation assumptions have been studied extensively for error analysis in learning theory; see, for example, [1, 2, 4, 17].

From Corollary 2.5, when the kernel , can be arbitrarily small, one can easily see that the learning rate is quite low. Future research may improve the estimate by introducing new analysis techniques.

#### 3. Proof of Theorem 2.4

In this section, we provide the proof of Theorem 2.4 based on the upper bound estimates of sample error and hypothesis error. Denote We can observe that the sample error

Here can be bounded by applying the following one-sided Bernstein-type probability inequality; see, for example, [1, 2, 14].

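Lemma 3.1. *Let $\xi$ be a random variable on a probability space $Z$ with mean $\mathbb{E}\xi = \mu$ and variance $\sigma^2(\xi) = \sigma^2$. If $|\xi(z) - \mu| \le M_\xi$ for almost all $z \in Z$, then for all $\varepsilon > 0$,*

$$\mathrm{Prob}_{\mathbf{z} \in Z^m} \left\{ \frac{1}{m} \sum_{i=1}^{m} \xi(z_i) - \mu \ge \varepsilon \right\} \le \exp\left\{ -\frac{m \varepsilon^2}{2\left(\sigma^2 + \frac{1}{3} M_\xi \varepsilon\right)} \right\}.$$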
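A quick Monte Carlo comparison can show how conservative a Bernstein-type bound is in practice. The sketch below assumes the standard one-sided form $\exp\{-m\varepsilon^2 / (2(\sigma^2 + \frac{1}{3} M \varepsilon))\}$ for the probability that a sample mean exceeds the true mean by at least $\varepsilon$, and uses Uniform$[0, 1]$ samples (mean $1/2$, variance $1/12$, deviation from the mean bounded by $M = 1/2$). The choice of distribution and all names are illustrative assumptions.

```python
import math
import random


def bernstein_bound(m, eps, variance, M):
    """One-sided Bernstein bound on P(sample mean - true mean >= eps)."""
    return math.exp(-m * eps ** 2 / (2.0 * (variance + M * eps / 3.0)))


def empirical_tail(m, eps, trials, seed=0):
    """Monte Carlo estimate of the same tail for Uniform[0, 1] samples."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(m)) / m
        if mean - 0.5 >= eps:
            hits += 1
    return hits / trials


m, eps = 50, 0.1
bound = bernstein_bound(m, eps, variance=1.0 / 12.0, M=0.5)  # equals exp(-2.5)
tail = empirical_tail(m, eps, trials=2000)
```

Setting the bound equal to a target failure probability $\delta$ and solving for $\varepsilon$ is the usual route from such an inequality to a statement that holds "with confidence $1 - \delta$".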

Proposition 3.2. *For any , with confidence , one has
*

*Proof. *Following the definition of , we have , where random variable .

From the definition of , we know and . Then
and . Moreover,

Applying Lemma 3.1 with and , we get
with confidence at least . By setting , we derive the solution
Thus, with confidence , we have
This completes the proof.

To establish the uniform upper bound of , we introduce a concentration inequality established in [18].

Lemma 3.3. *Assume that there are constants and such that and for every . If for some and ,
**
then there exists a constant depending only on such that for any , with probability at least , there holds
**
where
*

Proposition 3.4. *Under Assumption 2.3, for any , one has with confidence at least *

*Proof. *From the definition of , we have . Denote
We can see that and . Since and , we have
For , we have
Then, from Assumption 2.3,

Applying Lemma 3.3 with and , for any and for all ,
holds with confidence . This completes the proof.

Unlike previous studies of the regularized framework [3–5], we estimate the hypothesis error based on Theorem 4.2 in [11] for sequential greedy approximation.

Proposition 3.5. *For a fixed sample , one has
*

The desired result in Theorem 2.4 can be derived directly by combining Propositions 3.2–3.5.

#### Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant no. 11001092, the Humanities and Social Science Projects of the Ministry of Education of China (Program no. 11y3jc630197), and the Fundamental Research Funds for the Central Universities (Program nos. 2011PY130 and 2011QC022).

#### References

- F. Cucker and S. Smale, “On the mathematical foundations of learning,” *Bulletin of the American Mathematical Society*, vol. 39, no. 1, pp. 1–49, 2002.
- F. Cucker and D. X. Zhou, *Learning Theory: An Approximation Theory Viewpoint*, Cambridge University Press, Cambridge, UK, 2007.
- Q. Wu and D. X. Zhou, “Learning with sample dependent hypothesis spaces,” *Computers & Mathematics with Applications*, vol. 56, no. 11, pp. 2896–2907, 2008.
- Q. W. Xiao and D. X. Zhou, “Learning by nonsymmetric kernels with data dependent spaces and ${\ell}^{1}$-regularizer,” *Taiwanese Journal of Mathematics*, vol. 14, no. 5, pp. 1821–1836, 2010.
- L. Shi, Y. L. Feng, and D. X. Zhou, “Concentration estimates for learning with ${\ell}^{1}$-regularizer and data dependent hypothesis spaces,” *Applied and Computational Harmonic Analysis*, vol. 31, no. 2, pp. 286–302, 2011.
- Y. L. Feng and S. G. Lv, “Unified approach to coefficient-based regularized regression,” *Computers & Mathematics with Applications*, vol. 62, no. 1, pp. 506–515, 2011.
- S. G. Lv and J. D. Zhu, “Error bounds for ${l}^{p}$-norm multiple kernel learning with least square loss,” *Abstract and Applied Analysis*, vol. 2012, Article ID 915920, 18 pages, 2012.
- Y. K. Zhu and H. W. Sun, “Consistency analysis of spectral regularization algorithms,” *Abstract and Applied Analysis*, vol. 2012, Article ID 436510, 16 pages, 2012.
- A. R. Barron, A. Cohen, W. Dahmen, and R. A. DeVore, “Approximation and learning by greedy algorithms,” *The Annals of Statistics*, vol. 36, no. 1, pp. 64–94, 2008.
- S. Mannor, R. Meir, and T. Zhang, “Greedy algorithms for classification—consistency, convergence rates, and adaptivity,” *Journal of Machine Learning Research*, vol. 4, no. 4, pp. 713–741, 2003.
- T. Zhang, “Sequential greedy approximation for certain convex optimization problems,” *IEEE Transactions on Information Theory*, vol. 49, no. 3, pp. 682–691, 2003.
- H. Chen, L. Q. Li, and Z. B. Pan, “Learning rates of multi-kernel regression by orthogonal greedy algorithm,” *Journal of Statistical Planning and Inference*, vol. 143, no. 2, pp. 276–282, 2013.
- T. Zhang, “Approximation bounds for some sparse kernel regression algorithms,” *Neural Computation*, vol. 14, no. 12, pp. 3013–3042, 2002.
- D. R. Chen, Q. Wu, Y. Ying, and D. X. Zhou, “Support vector machine soft margin classifiers: error analysis,” *Journal of Machine Learning Research*, vol. 5, pp. 1143–1175, 2004.
- H. Chen, L. Q. Li, and J. T. Peng, “Error bounds of multi-graph regularized semi-supervised classification,” *Information Sciences*, vol. 179, no. 12, pp. 1960–1969, 2009.
- H. Chen, “On the convergence rate of a regularized ranking algorithm,” *Journal of Approximation Theory*, vol. 164, no. 12, pp. 1513–1519, 2012.
- Z. C. Guo and D. X. Zhou, “Concentration estimates for learning with unbounded sampling,” *Advances in Computational Mathematics*. In press.
- Q. Wu, Y. Ying, and D. X. Zhou, “Multi-kernel regularized classifiers,” *Journal of Complexity*, vol. 23, no. 1, pp. 108–134, 2007.

#### Copyright

Copyright © 2012 Xiaoyin Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.