Journal of Applied Mathematics
Volume 2013 (2013), Article ID 496282, 11 pages
Learning Rates for $\ell^1$-Regularized Kernel Classifiers
1School of Statistics, University of International Business and Economics, Beijing 100029, China
2Department of Mathematics and LMIB, Beijing University of Aeronautics and Astronautics, Beijing 100083, China
3School of Applied Mathematics, Central University of Finance and Economics, Beijing 100081, China
Received 25 July 2013; Accepted 6 October 2013
Academic Editor: Huijun Gao
Copyright © 2013 Hongzhi Tong et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
We consider a family of classification algorithms generated from a kernel regularization scheme associated with an $\ell^1$-regularizer and a convex loss function. Our main purpose is to provide an explicit convergence rate for the excess misclassification error of the produced classifiers. The error decomposition includes the approximation error, the hypothesis error, and the sample error. We apply some novel techniques to estimate the hypothesis error and the sample error. Learning rates are eventually derived under some assumptions on the kernel, the input space, the marginal distribution, and the approximation error.
1. Introduction
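Let $X$ be a compact subset of $\mathbb{R}^n$ and $Y = \{1, -1\}$. Classification algorithms produce binary classifiers $\mathcal{C}: X \to Y$; such a classifier labels a class $\mathcal{C}(x) \in Y$ for each point $x \in X$. The prediction power of the classifier $\mathcal{C}$ is measured by its misclassification error. If $\rho$ is a probability distribution on $Z := X \times Y$, then the misclassification error of $\mathcal{C}$ is defined by
$$\mathcal{R}(\mathcal{C}) := \operatorname{Prob}\{\mathcal{C}(x) \neq y\} = \int_X \rho(y \neq \mathcal{C}(x) \mid x)\, d\rho_X. \tag{1}$$
Here $\rho_X$ is the marginal distribution on $X$ and $\rho(\cdot \mid x)$ is the conditional probability measure at $x$ induced by $\rho$. The classifier minimizing the misclassification error is called the Bayes rule $f_c$ and is given by
$$f_c(x) = \begin{cases} 1, & \text{if } \rho(y = 1 \mid x) \geq \rho(y = -1 \mid x), \\ -1, & \text{otherwise}. \end{cases} \tag{2}$$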
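The classifiers considered in this paper have the form $\operatorname{sgn}(f)$, defined as $\operatorname{sgn}(f)(x) = 1$ if $f(x) \geq 0$ and $\operatorname{sgn}(f)(x) = -1$ if $f(x) < 0$, induced by real-valued functions $f: X \to \mathbb{R}$. Those functions are generated from a regularization scheme associated with a convex loss function (see ).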
Definition 1. A continuous function $\phi: \mathbb{R} \to \mathbb{R}_+$ is called a classifying loss (function) if it is convex, differentiable at $0$ with $\phi'(0) < 0$, and $1$ is the smallest real for which the value of $\phi$ is zero.
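Typical examples of classifying loss include
(1) the hinge loss $\phi_h(t) = (1 - t)_+ = \max\{1 - t, 0\}$ for the classical support vector machines (SVM) classifier; see [2–5];
(2) the least square loss $\phi_{ls}(t) = (1 - t)^2$; see, for example, [6, 7];
(3) the $q$-norm SVM loss $\phi_q(t) = ((1 - t)_+)^q$ with $q > 1$; see [8, 9].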
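The three losses listed above can be sketched in a few lines of Python. This is only an illustration: it assumes the standard forms $(1-t)_+$, $(1-t)^2$, and $((1-t)_+)^q$, and the function names are hypothetical, not taken from the paper.

```python
def hinge_loss(t: float) -> float:
    """Hinge loss (1 - t)_+ (assumed standard form)."""
    return max(1.0 - t, 0.0)


def least_square_loss(t: float) -> float:
    """Least square loss (1 - t)^2 (assumed standard form)."""
    return (1.0 - t) ** 2


def q_norm_svm_loss(t: float, q: float = 2.0) -> float:
    """q-norm SVM loss ((1 - t)_+)^q; q > 1 is assumed."""
    return max(1.0 - t, 0.0) ** q
```

Each function is evaluated at the margin $t = y f(x)$. All three vanish at $t = 1$ and are strictly positive for $t < 1$, as the definition of a classifying loss requires.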
The following concept describes the increment of .
Definition 2. One says that has an increment exponent , if there exists some such that where denotes the right and left derivatives of .
It is easy to see that the hinge loss $\phi_h$, the least square loss $\phi_{ls}$, and the $q$-norm SVM loss $\phi_q$ satisfy Definition 2 with increment exponents 1, 2, and $q$, respectively.
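Let $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m$ be a set of samples independently drawn according to $\rho$; we call
$$\mathcal{E}^{\phi}_{\mathbf{z}}(f) = \frac{1}{m} \sum_{i=1}^{m} \phi(y_i f(x_i)) \tag{4}$$
the empirical error with respect to $\phi$. Regularized learning schemes are implemented by minimizing a penalized version of the empirical error over a set of functions, called a hypothesis space.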
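As a minimal sketch, the empirical error is the average of the loss evaluated at the margins $y_i f(x_i)$. The helper below assumes that form; `empirical_error` and its arguments are hypothetical names introduced for illustration.

```python
from typing import Callable, Sequence, Tuple


def empirical_error(
    f: Callable[[float], float],
    samples: Sequence[Tuple[float, int]],
    loss: Callable[[float], float],
) -> float:
    """Average of loss(y * f(x)) over the sample pairs (x, y), with y in {-1, 1}."""
    return sum(loss(y * f(x)) for x, y in samples) / len(samples)


def hinge_loss(t: float) -> float:
    """Hinge loss (1 - t)_+, used here only as an example loss."""
    return max(1.0 - t, 0.0)
```

For instance, with `f = lambda x: x` and the samples `[(2.0, 1), (0.5, -1)]`, the margins are `2.0` and `-0.5`, so the hinge losses are `0.0` and `1.5`.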
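Definition 3. Given a classifying loss $\phi$ and a hypothesis space $\mathcal{H}$, $\Omega: \mathcal{H} \to \mathbb{R}_+$ is a penalty functional called a regularizer that reflects the constraints imposed on functions from $\mathcal{H}$. The regularized classifier is then defined as $\operatorname{sgn}(f_{\mathbf{z}})$, where $f_{\mathbf{z}}$ is a minimizer of the following regularization scheme:
$$f_{\mathbf{z}} = \arg\min_{f \in \mathcal{H}} \left\{ \frac{1}{m} \sum_{i=1}^{m} \phi(y_i f(x_i)) + \lambda \Omega(f) \right\}. \tag{5}$$
Here $\lambda = \lambda(m) > 0$ is a regularization parameter which may depend on the sample size $m$ with $\lim_{m \to \infty} \lambda(m) = 0$.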
Choosing different hypothesis spaces and regularizers in (5) leads to different regularization algorithms. These learning algorithms are often based on a kernel function $K: X \times X \to \mathbb{R}$ (see, e.g., ). One way appears naturally when $K$ is a Mercer kernel. Such a kernel is continuous, symmetric, and positive semidefinite on $X \times X$. The reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with the Mercer kernel $K$ is defined to be the completion of the linear span of the functions $\{K_x := K(x, \cdot) : x \in X\}$ with the inner product
$$\langle K_x, K_t \rangle_K = K(x, t), \tag{6}$$
and the reproducing property is given by
$$\langle K_x, f \rangle_K = f(x), \quad \forall x \in X,\ f \in \mathcal{H}_K. \tag{7}$$
By setting the hypothesis space to $\mathcal{H}_K$ and the regularizer to $\|f\|_K^2$, (5) becomes the classical regularized classification scheme
$$\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{m} \sum_{i=1}^{m} \phi(y_i f(x_i)) + \lambda \|f\|_K^2 \right\}. \tag{8}$$
Its mathematical analysis is well understood, with various techniques developed in an extensive literature (see, e.g., [4, 5, 8, 12–15]). In this paper we consider a different regularization scheme in RKHS for classification; in our setting, the regularizer is the $\ell^1$-norm of the coefficients in the kernel expansions over the sample points.
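Definition 4. Let
$$\mathcal{H}_{K,\mathbf{z}} = \left\{ f = \sum_{i=1}^{m} \alpha_i K_{x_i} : \alpha_i \in \mathbb{R} \right\}, \qquad \Omega_{\mathbf{z}}(f) = \inf \left\{ \sum_{i=1}^{m} |\alpha_i| : f = \sum_{i=1}^{m} \alpha_i K_{x_i} \right\}. \tag{9}$$
Then the $\ell^1$-regularized classification scheme is given as
$$f_{\mathbf{z}} = \arg\min_{f \in \mathcal{H}_{K,\mathbf{z}}} \left\{ \frac{1}{m} \sum_{i=1}^{m} \phi(y_i f(x_i)) + \lambda \Omega_{\mathbf{z}}(f) \right\}. \tag{10}$$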
Algorithm (10) can be computed efficiently because it reduces to solving a convex optimization problem in a finite-dimensional space containing the linear combinations of kernels centered on the training points.
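To make the finite-dimensional nature of the problem concrete, the following sketch evaluates an objective of this kind as a function of the coefficient vector alone. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian kernel, the hinge loss, and all function names are choices made here for the example, and the objective is taken to be the empirical error plus $\lambda$ times the $\ell^1$-norm of the coefficients.

```python
import math
from typing import Callable, Sequence

Point = Sequence[float]


def gaussian_kernel(x: Point, t: Point, sigma: float = 1.0) -> float:
    """Gaussian kernel exp(-|x - t|^2 / (2 sigma^2)); an assumed example kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, t))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def hinge_loss(t: float) -> float:
    """Hinge loss (1 - t)_+, used here as the example classifying loss."""
    return max(1.0 - t, 0.0)


def l1_objective(
    alpha: Sequence[float],
    xs: Sequence[Point],
    ys: Sequence[int],
    lam: float,
    loss: Callable[[float], float] = hinge_loss,
    kernel: Callable[[Point, Point], float] = gaussian_kernel,
) -> float:
    """Empirical error of f = sum_i alpha_i K(., x_i) plus lam * sum_i |alpha_i|."""
    m = len(xs)

    def f(x: Point) -> float:
        # Kernel expansion over the sample points themselves.
        return sum(a * kernel(x, c) for a, c in zip(alpha, xs))

    empirical = sum(loss(y * f(x)) for x, y in zip(xs, ys)) / m
    return empirical + lam * sum(abs(a) for a in alpha)
```

Because the loss is convex and the penalty is a norm, this objective is convex in `alpha`, so any standard convex solver (for the hinge loss, a linear program) could be used to minimize it.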
In the last ten years, learning with $\ell^1$-regularization has attracted much attention. The increasing interest is mainly driven by the progress of the Lasso algorithm [16–18] and compressive sensing [19, 20], in which the $\ell^1$-regularizer is able to yield sparse representations of the resulting minimizer. Kernel methods formulate learning and estimation problems in an RKHS of functions expanded in terms of kernels. A series of papers has investigated the learning ability of coefficient-based regularization kernel regression methods (see, e.g., [21–25]). However, as far as we know, there are currently only a few results on classification in this setting. For example,  studies the classification problem with hinge loss and complexity regularization in a finite-dimensional hypothesis space spanned by a set of base functions. It assumes neither a kernel setting nor that the expansion must be in terms of the sample points, so the problem of a data-dependent hypothesis space is not present there. Although  provided an error analysis for linear programming SVM classifiers by means of a stepping-stone from quadratic programming SVM to linear programming SVM, there is no evidence that this method still works for other classifying losses.
In this paper we present an elaborate error analysis for algorithm (10). We use a modified error decomposition technique, first introduced in , which deals with the approximation error, the hypothesis error, and the sample error, and we derive an explicit learning rate for classification scheme (10) under some assumptions.
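For a classifying loss $\phi$, we define the generalization error of $f$ as
$$\mathcal{E}^{\phi}(f) = \int_Z \phi(y f(x))\, d\rho.$$
Let $f^{\phi}_{\rho}$ be a measurable function minimizing the generalization error:
$$f^{\phi}_{\rho} = \arg\min_{f} \mathcal{E}^{\phi}(f),$$
where the minimum is taken over all measurable functions. According to Theorem 3(c) in , we may always choose an $f^{\phi}_{\rho}$ satisfying $|f^{\phi}_{\rho}(x)| \leq 1$ for each $x \in X$. This choice will be taken throughout the paper.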
Estimating the excess misclassification error for classification scheme (10) is our main purpose. The following comparison theorem (see [7, 8, 12]) describes the relationship between excess misclassification error and excess generalization error.
Proposition 5. If is a classifying loss, then, for any measurable function , where is some constant dependent on .
Since , we can improve the error estimates by replacing values of by projections onto . The idea of the following projection operator was first introduced for this purpose in .
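Definition 6. The projection operator $\pi$ is defined on the space of measurable functions $f: X \to \mathbb{R}$ as
$$\pi(f)(x) = \begin{cases} 1, & \text{if } f(x) > 1, \\ -1, & \text{if } f(x) < -1, \\ f(x), & \text{if } -1 \leq f(x) \leq 1. \end{cases}$$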
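A projection of this kind simply clips function values to the interval $[-1, 1]$. The sketch below assumes that standard form; `project` is a hypothetical name used only for illustration.

```python
from typing import Callable


def project(f: Callable[[float], float]) -> Callable[[float], float]:
    """Return the function x -> f(x) clipped to the interval [-1, 1]."""

    def projected(x: float) -> float:
        return max(-1.0, min(1.0, f(x)))

    return projected
```

Clipping never changes the sign of a nonzero value, so the induced classifier is unchanged. A loss that is decreasing up to $1$ and vanishes or increases beyond it (such as the hinge or least square loss) cannot grow under clipping.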
The definition of classifying loss implies that , so It is trivial that . By Proposition 5,
So it is sufficient for us to bound (13) by means of , which in turn can be estimated by an error decomposition technique. However, there are essential differences between algorithms (8) and (10). For example, the hypothesis space and the regularizer in (10) depend on the samples . As a result, the standard error analysis methods for (8) (see, e.g., [8, 12, 13, 30]) can no longer be applied to (10). This difficulty was overcome in  by introducing a modified error decomposition with an extra hypothesis error term. In this paper we apply the same underlying idea to classification scheme (10). To this end, we need to consider a Banach space containing all of the possible hypothesis spaces .
Definition 7. The Banach space is defined as the function set on containing all functions of the form with the norm
Obviously, By the continuity of and compactness of , we have It implies that is a subset of the continuous function space , and To formulate the error decomposition for scheme (10), we introduce a regularization function as
Proposition 8. Let be defined in (10), ; then Here
and are called the hypothesis error and the sample error, and they will be estimated, respectively, in the next two sections. is independent of the samples and is usually called the approximation error; it characterizes the approximation ability of the function space with respect to the target function . We will assume that, for some constants and ,
3. Estimating the Hypothesis Error
In this section we bound the hypothesis error by a technique of scattered data interpolation, which was first used in the kernel regression context in . To this end, we need some assumptions on the input space , the marginal distribution , and the kernel . Denote the Euclidean norm in .
Definition 9. A subset of is said to satisfy an interior cone condition if there exist an angle , a radius , and a unit vector for every such that the cone is contained in .
Definition 10. The marginal distribution is said to satisfy condition with if for some
Recall that, for , the space consists of functions whose partial derivative is continuous for every with , and . Throughout the paper we assume the kernel with .
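Definition 11. A set $\mathbf{x} = \{x_i\}_{i=1}^{m} \subset X$ is said to be $\Delta$-dense if for any $x \in X$ there exists some $x_i \in \mathbf{x}$ such that $|x - x_i| \leq \Delta$.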
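Density of a point set can be checked numerically by computing its fill distance, the largest distance from any point of the domain to its nearest sample point. The sketch below makes two assumptions that are not in the paper: the domain is approximated by a finite grid of test points, and the condition is read as "every domain point has a sample point within distance `delta`". The function names are hypothetical.

```python
import math
from typing import Sequence

Point = Sequence[float]


def fill_distance(points: Sequence[Point], domain_grid: Sequence[Point]) -> float:
    """Largest Euclidean distance from a grid point to its nearest sample point."""
    return max(min(math.dist(u, p) for p in points) for u in domain_grid)


def is_delta_dense(
    points: Sequence[Point], domain_grid: Sequence[Point], delta: float
) -> bool:
    """True if every grid point lies within delta of some sample point."""
    return fill_distance(points, domain_grid) <= delta
```

On the unit interval sampled only at its two endpoints, the worst-covered location is the midpoint, so the fill distance is `0.5`. A finer grid gives a better approximation of the true fill distance over a continuous domain.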
The following lemma derived from  describes a local polynomial reproduction and it is the key point to bound the hypothesis error.
Lemma 12. Suppose that is compact and satisfies an interior cone condition with some radius and angle . Fix with . Assume that the point set is -dense with for some constant depending on and ; then, for any , there exist real numbers , , satisfying that (1) is any polynomial of degree at most on , (2) , (3) for those satisfying .
Proof. We know from (26) that . So for any , can be written as with and At the same time, there exists some such that , and thus Fix and , and we can take as the Taylor polynomial of of degree at . Then by Lemma 12, there exists such that where . Moreover, It follows from (34) that The above bound holds for every and , so This together with (33) and (32) implies Denote ; we get from (16), (10), and (32) that Since is convex and satisfies (3), we have, for any , This in connection with (22), (32), and (38) yields Therefore Let , and we then derive
We can now bound by the following theorem.
Theorem 14. Let be a classifying loss satisfying (3), for some . Suppose that satisfies the conditions in Lemma 12 and satisfies condition with some , and (28) is valid; then, for any and satisfying with confidence , where are some constants independent of , , or .
Proof. Applying Lemma 3 in , we get that, with confidence , the point set is -dense in , where is a constant depending on , , and . Taking , . If , then we have . So Proposition 13 ensures that, with confidence , This proves the theorem by setting .
4. Estimating the Sample Error
In this section we focus on the sample error; its treatment is the major improvement we make in this paper for the error analysis of algorithm (10).
Definition 15. Let be a class of functions on and . The -metric is defined on by For every , the covering number of with respect to is
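For intuition, an empirical metric and a covering number can be explored numerically on a finite family of functions. The sketch below assumes the usual normalized empirical distance $\left(\frac{1}{m}\sum_{i}(f(z_i)-g(z_i))^2\right)^{1/2}$, which may differ in detail from the definition used in the paper. The greedy procedure returns the size of one valid cover whose centers come from the family itself, which is an upper bound on the covering number of that family with such centers, not its exact value. All names are hypothetical.

```python
import math
from typing import Callable, List, Sequence

Func = Callable[[float], float]


def empirical_l2_distance(f: Func, g: Func, zs: Sequence[float]) -> float:
    """Root mean squared difference of f and g over the points zs."""
    return math.sqrt(sum((f(z) - g(z)) ** 2 for z in zs) / len(zs))


def greedy_cover_size(functions: Sequence[Func], zs: Sequence[float], eps: float) -> int:
    """Size of a greedily built eps-cover of a finite family of functions."""
    centers: List[Func] = []
    for f in functions:
        # Open a new ball only if f is not within eps of an existing center.
        if all(empirical_l2_distance(f, c, zs) > eps for c in centers):
            centers.append(f)
    return len(centers)
```

Every function in the family ends up within `eps` of some center: it either became a center itself or was skipped because an existing center was already close enough.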
The function sets in our situation are balls of in the form of . We need the -empirical covering number of defined as According to a bound for -empirical covering number derived in , we know that if , then where is a constant independent of , and is a power index defined by
For a measurable function , denote . The following definition is a variance-expectation condition for the pair , which is generally used to achieve tight bounds.
Definition 16. A variance power of the pair is a number in such that for any , there exists some constant satisfying
We are in a position to bound the sample error. Write as We will first bound , and to this end we need the following one-side Bernstein inequality (see ).
Let be a random variable on a probability space with mean and variance . If almost everywhere, then for all
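A one-sided Bernstein bound can be evaluated numerically to see how sample size, variance, and confidence level interact. The sketch below assumes the commonly used form $\operatorname{Prob}\{\frac{1}{m}\sum_i \xi(z_i) - \mu \geq \varepsilon\} \leq \exp\left(-\frac{m\varepsilon^2}{2(\sigma^2 + M\varepsilon/3)}\right)$ with $|\xi - \mu| \leq M$. The constants in the paper's statement may differ, and the function names are hypothetical.

```python
import math


def bernstein_tail(eps: float, m: int, variance: float, bound: float) -> float:
    """Assumed one-sided Bernstein tail bound for the mean of m i.i.d. samples."""
    return math.exp(-m * eps ** 2 / (2.0 * (variance + bound * eps / 3.0)))


def bernstein_deviation(delta: float, m: int, variance: float, bound: float) -> float:
    """Deviation eps at which the assumed tail bound equals delta.

    Solves m * eps^2 = 2 * log(1/delta) * (variance + bound * eps / 3) for eps > 0.
    """
    log_term = math.log(1.0 / delta)
    a = bound * log_term / (3.0 * m)
    return a + math.sqrt(a * a + 2.0 * variance * log_term / m)
```

Inverting the bound in this way yields the "with confidence $1 - \delta$" statements used throughout the sample error estimates: the deviation is at most a term of order $M\log(1/\delta)/m$ plus a term of order $\sqrt{\sigma^2 \log(1/\delta)/m}$.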
Proof. Denote , . Then
By (22) and (26), we can see that
We may assume , since otherwise . Then from (3), we can derive that
Applying the one-side Bernstein inequality to we have, for any with confidence ,
On the other hand, both and are contained in , and we know from (3) and (52) that Applying the one-side Bernstein inequality again, we have, with confidence , where in the second inequality we have used the elementary inequality Since , combining the estimates above, we can get that, under assumption (28) with confidence , Then we prove the proposition by setting , and
It is more difficult to bound , because it involves the samples and thus runs over a set of functions. To get a better error estimate, an iteration technique is often used to shrink the radius of the ball containing (see, e.g., [5, 12, 30, 32]); however, this process is rather tough and complicated. In this paper we avoid the lengthy iteration by considering the following reweighted empirical process Here for a threshold and . Different from the classical weight function, contains the regularization term and thus makes it possible to control the variances and by the threshold simultaneously.
The following concentration inequality is a scaled version of Theorem 2.3 in , where the case is given.
Lemma 19. Assume that are identically distributed according to . Let be a countable set of measurable functions from to and assume that all functions in satisfy , for some positive real number . Denote Then, for all , one has
This lemma allows us to control the deviation of the supremum of an empirical process from its expectation.
Proof. Before presenting the proof, let us first introduce some additional notations: Then By (3) and (52), we can see that Here in the second inequality of (72), we have used the elementary inequality (62) again with , and . So applying Lemma 19 to , we get with confidence
So we can bound through bounding its expectation. To this end, we need some preparations.
Definition 21. Let be a probability space, and is a class of measurable functions from to . Set to be independent random variables distributed according to and to be independent Rademacher random variables. Then for the local Rademacher average of is defined by
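As a rough illustration of a Rademacher average, the following sketch computes the expected supremum of the signed empirical mean over a finite class by enumerating every sign pattern. It assumes the plain (non-local) form $\mathbb{E}_\sigma \sup_{f} \frac{1}{m}\sum_i \sigma_i f(z_i)$ on fixed points and omits the variance restriction that makes the average local. Exhaustive enumeration is only feasible for small $m$. The function name is hypothetical.

```python
from itertools import product
from typing import Callable, Sequence


def rademacher_average(
    functions: Sequence[Callable[[float], float]], points: Sequence[float]
) -> float:
    """Exact average over all sign vectors of sup_f (1/m) * sum_i sigma_i * f(z_i)."""
    m = len(points)
    # Precompute the value of each function at each point.
    values = [[f(z) for z in points] for f in functions]
    total = 0.0
    for signs in product((-1, 1), repeat=m):
        total += max(sum(s * v for s, v in zip(signs, row)) / m for row in values)
    return total / 2 ** m
```

For larger samples one would replace the enumeration by Monte Carlo sampling of the signs, which gives an estimate rather than the exact average.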
The following lemma was given in .
Lemma 22. Let be a class of measurable functions from to . If for some and then there exists a constant depending only on such that
Lemma 23. Let be a nonnegative stochastic process indexed by and a nonnegative nonrandom function defined on . Define . Let be a function such that for some , and Then, for any , we have
Proof. For , we obtain by a standard peeling approach Therefore,
Now we can give a bound of .
Proof. Using the notations in the proof of Proposition 20, the weight function Hence , and Denote . Standard symmetrization argument (see Lemma 2.3.1 of ) and Assumption (52) then yield By (3) we know , and for any , so This together with (50) implies that By Lemma 22, we obtain Setting , it is easy to see . So Lemma 23 tells us By the choice of , we can easily check that Then the conclusion follows from Proposition 20.
Proof. Taking in Proposition 24, then, for any with confidence , It follows that Then the corollary is proved by setting , and
5. Deriving Learning Rates
We may now present the main results by combining the results obtained in the previous two sections. The following theorem gives the bounds for the excess generalization error.
Theorem 26. Let be a classifying loss satisfying (3), for some , , and given by (51). Suppose that satisfies an interior cone condition, and satisfies condition with some . If (28) is valid, then for any and satisfying (44), by taking with , we have, with confidence , where is a constant independent of or .
Proof. Putting the estimates of Theorem 14, Proposition 18, and Corollary 25 into the error decomposition (24), we see that, with confidence , Therefore, with the same confidence By the choice of , we can easily check that So our theorem follows by taking .
Corollary 27. If the conditions in Theorem 26 are satisfied, then for any , with confidence , there holds where .
Remark 28. For the hinge loss , the increment exponent . If , then one can take and . This is the case for polynomial kernel (see [13, 14]) or Gaussian kernel (see [5, 15]), usually used in practice. So Corollary 27 tells us the learning rate of the 1-norm SVM is with arbitrarily close to .
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work was supported by National Natural Science Foundation of China under Grants 11171014 and 11072274, the Program for Innovative Research Team in UIBE.
References
1. P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, “Convexity, classification, and risk bounds,” Journal of the American Statistical Association, vol. 101, no. 473, pp. 138–156, 2006.
2. V. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, NY, USA, 1998.
3. N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, Cambridge, UK, 2000.
4. Q. Wu and D. Zhou, “Analysis of support vector machine classification,” Journal of Computational Analysis and Applications, vol. 8, no. 2, pp. 99–119, 2006.
5. I. Steinwart and C. Scovel, “Fast rates for support vector machines using Gaussian kernels,” Annals of Statistics, vol. 35, no. 2, pp. 575–607, 2007.