Approximation Analysis of Learning Algorithms for Support Vector Regression and Quantile Regression

Xiang, Dao-Hong; Hu, Ting; Zhou, Ding-Xuan

doi:https://doi.org/10.1155/2012/902139

Journal of Applied Mathematics

On this page

Abstract Introduction References Copyright Related Articles

Research Article | Open Access

Volume 2012 | Article ID 902139 | https://doi.org/10.1155/2012/902139

Approximation Analysis of Learning Algorithms for Support Vector Regression and Quantile Regression

Dao-Hong Xiang,¹Ting Hu,²and Ding-Xuan Zhou³

Academic Editor: Yuesheng Xu

Received14 Jul 2011

Accepted14 Nov 2011

Published08 Feb 2012

Abstract

We study learning algorithms generated by regularization schemes in reproducing kernel Hilbert spaces associated with an -insensitive pinball loss. This loss function is motivated by the -insensitive loss for support vector regression and the pinball loss for quantile regression. Approximation analysis is conducted for these algorithms by means of a variance-expectation bound when a noise condition is satisfied for the underlying probability measure. The rates are explicitly derived under a priori conditions on approximation and capacity of the reproducing kernel Hilbert space. As an application, we get approximation orders for the support vector regression and the quantile regularized regression.

1. Introduction and Motivation

In this paper, we study a family of learning algorithms serving both purposes of support vector regression and quantile regression. Approximation analysis and learning rates will be provided, which also helps better understanding of some classical learning methods.

Support vector regression is a classical kernel-based algorithm in learning theory introduced in [1]. It is a regularization scheme in a reproducing kernel Hilbert space (RKHS) associated with an -insensitive loss defined for by Here, for learning functions on a compact metric space , is a continuous, symmetric, and positive semidefinite function called a Mercer kernel. The associated RKHS is defined [2] as the completion of the linear span of the set of function with the inner product satisfying . Let and be a Borel probability measure on . With a sample independently drawn according to , the support vector regression is defined as where is a regularization parameter.

When is fixed, convergence of (1.2) was analyzed in [3]. Notice from the original motivation [1] for the insensitive parameter for balancing the approximation and sparsity of the algorithm that should change with the sample size and usually as the sample size increases. Mathematical analysis for this original algorithm is still open. We will solve this problem in a special case of our approximation analysis for general learning algorithms. In particular, we show how approximates the median function , which is one of the purposes of this paper. Here, for , the median function value is a median of the conditional distribution of at .

Quantile regression, compared with the least squares regression, provides richer information about response variables such as stretching or compressing tails [4]. It aims at estimating quantile regression functions. With a quantile parameter , a quantile regression function is defined by its value to be a -quantile of , that is, a value satisfying

Quantile regression has been studied by kernel-based regularization schemes in a learning theory literature (e.g., [5–8]). These regularization schemes take the form where is the pinball loss shown in Figure 1 defined by

Motivated by the -insensitive loss and the pinball loss , we propose the -insensitive pinball loss with an insensitive parameter shown in Figure 1 defined as This loss function has been applied to online learning for quantile regression in our previous work [8]. It is applied here to a regularization scheme in the RKHS as The main goal of this paper is to study how the output function given by (1.7) converges to the quantile regression function and how explicit learning rates can be obtained with suitable choices of the parameters based on a priori conditions on the probability measure .

2. Main Results on Approximation

Throughout the paper, we assume that the conditional distribution is supported on for every . Then, we see from (1.3) that we can take values of to be on . So to see how approximates , it is natural to project values of the output function onto the same interval by the projection operator introduced in [9].

Definition 2.1. The projection operator on the space of function on is defined by

Our approximation analysis aims at establishing bounds for the error in the space with some where is the marginal distribution of on .

2.1. Support Vector Regression and Quantile Regression

Our error bounds and learning rates are presented in terms of a noise condition and approximation condition on .

The noise condition on is defined in [5, 6] as follows.

Definition 2.2. Let and . We say that has a -quantile of -average type if for every , there exist a -quantile and constants such that for each , and that the function on taking value at lies in .

Note that condition (2.2) tells us that is uniquely defined at every .

The approximation condition on is stated in terms of the integral operator defined by . Since is positive semidefinite, is a compact positive operator and its -th power is well-defined for any . Our approximation condition is given as

Let us illustrate our approximation analysis by the following special case which will be proved in Section 5.

Theorem 2.3. Let and . Assume and has a -quantile of -average type 2 for some . Take and with . Let . Then, with , for any , with confidence , one has where is a constant independent of or .

If for , we see that the power exponent for the learning rate (2.4) is at least . This exponent can be arbitrarily close to when is small enough.

In particular, if we take , Theorem 2.3 provides rates for output function of the support vector regression (1.2) to approximate the median function .

If we take leading to , Theorem 2.3 provides rates for output function of the quantile regression algorithm (1.4) to approximate the quantile regression function .

2.2. General Approximation Analysis

To state our approximation analysis in the general case, we need the capacity of the hypothesis space measured by covering numbers.

Definition 2.4. For a subset of and , the covering number is the minimal integer such that there exist disks with radius covering .

The covering numbers of balls with radius of the RKHS have been well studied in the learning theory literature [10, 11]. In this paper, we assume for some and that

Now we can state our main result which will be proved in Section 5. For and , we denote

Theorem 2.5. Assume (2.3) with and (2.5) with . Suppose that has a -quantile of -average type for some and . Take with and . Set with . Let Then, with , for any , with confidence , one has where is a constant independent of or and the power index is given in terms of , and by

The index can be viewed as a function of variables . The restriction on and (2.7) on ensure that is positive, which verifies the valid learning rate in Theorem 2.5.

Assumption (2.5) is a measurement of regularity of the kernel when is a subset of . In particular, can be arbitrarily small when is smooth enough. In this case, the power index in (2.8) can be arbitrarily close to . Again, when , , algorithm (1.7) corresponds to algorithm (1.4) for quantile regression. In this case, Theorem 2.5 provides learning rates for quantile regression algorithm (1.4).

Error analysis has been done for quantile regression algorithm (1.4) in [5, 6]. Under the assumptions that satisfies (2.2) with some and and the -empirical covering number (see [5] for more details) of is bounded as it was proved in [5] that with confidence , where and are constants independent of or . Here, is the regularization error defined as and is the generalization error associated with the pinball loss defined by Note that is minimized by the quantile regression function . Thus, when the regularization error decays polynomially as (which is ensured by Lemma 2.6 below when (2.3) is satisfied) and , then with Since , we see that this learning rate is comparable to our result in (2.8).

2.3. Comparison with Least Squares Regression

There has been a large literature in learning theory (described in [12]) for the least squares algorithms: It aims at learning the regression function . A crucial property for its error analysis is the identity for the least squares generalization error . It yields a variance-expectation bound for the random variable on where is an arbitrary measurable function. Such a variance-expectation bound with possibly replaced by its positive power plays an essential role for analyzing regularization schemes and the power exponent depends on strong convexity of the loss. See [13] and references therein. However, the pinball loss in the quantile regression setting has no strong convexity [6] and we would not expect a variance-expectation bound for a general distribution . When has a -quantile of -average type , the following variance-expectation bound with given by (2.6) can be found in [5, 7] (derived by means of Lemma 3.1 below).

Lemma 2.6. If has a -quantile of -average type for some and , then where the power index is given by (2.6) and the constant is .

Lemma 2.6 overcomes the difficulty of quantile regression caused by lack of strong convexity of the pinball loss. It enables us to derive satisfactory learning rates, as in Theorem 2.5.

3. Insensitive Relation and Error Decomposition

An important relation for quantile regression observed in [5] assets that the error taken in a suitable space can be bounded by the excess generalization error when the noise condition is satisfied.

Lemma 3.1. Let and . Denote . If has a -quantile of -average type , then for any measurable function on , one has where .

By Lemma 3.1, to estimate , we only need to bound the excess generalization error . This will be done by conducting an error decomposition which has been developed in the literature for regularization schemes [9, 13–15]. Technical difficulty arises for our problem here because the insensitive parameter changes with . This can be overcome [16] by the following insensitive relation

Now, we can conduct an error decomposition. Define the empirical error for as

Lemma 3.2. Let , be defined by (1.7) and Then, where

Proof. The regularized excess generalization error can be expressed as The fact implies that . The insensitive relation (3.2) and the definition of tell us that Then, by subtracting and adding and and noting , we see that the desired inequality in Lemma 3.2 holds true.

In the error decomposition (3.5), the first two terms are called sample error. The last term is the regularization error defined in (2.12). It can be estimated as follows.

Proposition 3.3. Assume (2.3). Define by (3.4). Then, one has where is the constant .

Proof. Let and It can be found in [17, 18] that when (2.3) holds, we have Hence, and . Since is Lipschitz, we know by taking in (2.12) that This verifies the desired bound for . By taking in (2.12), we have Then the bound for is proved.

4. Estimating Sample Error

This section is devoted to estimating the sample error. This is conducted by using the variance-expectation bound in Lemma 2.6.

Denote . For , denote

Proposition 4.1. Assume (2.3) and (2.5). Let and . If has a -quantile of -average type for some and , then there exists a subset of with measure at most such that for any , where is given by (2.6) and are constants given by

Proof. Let us first estimate the second part of the sample error. It can be decomposed into two parts where
For bounding , we take the random variable on . It satisfies . Hence, and . Applying the one-side Bernstein inequality [12], we know that there exists a subset of with measure at least such that
For , we apply the one-side Bernstein inequality again to the random variable , bound the variance by Lemma 2.6 with , and find that there exists another subset of with measure at least such that
Next, we estimate the first part of the sample error. Consider the function set A function from this set satisfies , , and by (2.16). Also, the Lipschitz property of the pinball loss yields . Then, we apply a standard covering number argument with a ratio inequality [12, 13, 19, 20] to and find from the covering number condition (2.5) that Setting the confidence to be , we take to be the positive solution to the equation Then, there exists a third subset of with measure at least such that Thus, for , we have Here, we have used the elementary inequality and Young’s inequality. Putting this bound and (4.5), (4.6) into (3.5), we know that for , there holds which together with Proposition 3.3 implies Here, we have used the reproducing property in which yields [12] Equation (4.9) can be expressed as By Lemma 7.2 in [12], the positive solution to this equation can be bounded as Thus, for , the desired bound (4.2) holds true. Since the measure of the set is at least , our conclusion is proved.

5. Deriving Convergence Rates by Iteration

To apply Proposition 4.1 for error analysis, we need some for . One may choose according to which is seen by taking in (1.7). This choice is too rough. Recall from Proposition 3.3 that which is a bound for the noise-free limit of . It is much better than . This motivates us to try similar tight bounds for . This target will be achieved in this section by applying Proposition 4.1 iteratively. The iteration technique has been used in [13, 21] to improve learning rates.

Lemma 5.1. Assume (2.3) with and (2.5) with . Take with and with . Let . If has a -quantile of -average type for some and , then for any , with confidence , there holds where is given by

Proof. Putting with and with into Proposition 4.1, we know that for any there exists a subset of with measure at most such that where with , the constants are given by It follows that
Let us apply (5.6) iteratively to a sequence defined by and where will be determined later. Then, . By (5.1), . So we have As the measure of is at most , we know that the measure of is at most . Hence, has measure at least .
Denote . The definition of the sequence tells us that Let us bound the two terms on the right-hand side.
The first term equals which is bounded by Take to be the smallest integer greater than or equal to . The above expression can be bounded by .
The second terms equals where . It is bounded by When , the above expression is bounded by . When , it is bounded by .
Based on the above discussion, we obtain where . So with confidence , there holds Then, our conclusion follows by replacing by and noting .

Now, we can prove our main result, Theorem 2.5.

Proof of Theorem 2.5. Take to be the right side of (5.2). By Lemma 5.1, there exists a subset of with measure at most such that . Applying Proposition 4.1 to this , we know that there exists another subset of with measure at most such that for any , where Since the set has measure at most , after scaling to and setting the constant by we see that the above estimate together with Lemma 3.1 gives the error bound with confidence and the power index give by provided that Since , we know that . By the restriction on , we find and . Moreover, restriction (2.7) on tell us that . Therefore, condition (5.20) is satisfied. The restriction and the above expression for tells us that the power index for the error bound can be exactly expressed by formula (2.9). The proof of Theorem 2.5 is complete.

Finally, we prove Theorem 2.3.

Proof of Theorem 2.3. Since , we know that (2.3) holds with . The noise condition on is satisfied with and . Then, . Since and , we know from [11] that (2.5) holds true for any . With , let us choose to be a positive number satisfying the following four inequalities: The first inequality above tells us that the restrictions on are satisfied by choosing and . The second inequality shows that condition (2.7) for the parameter renamed now as is also satisfied by taking . Thus, we apply Theorem 2.5 and know that with confidence , (2.8) holds with the power index given by (2.9); but , and imply that The last two inequalities satisfied by yield . So (2.8) verifies (2.4). This completes the proof of the theorem.

Acknowledgment

The work described in this paper is supported by National Science Foundation of China under Grant 11001247 and by a grant from the Research Grants Council of Hong Kong (Project no. CityU 103709).

References

V. N. Vapnik, Statistical Learning Theory, Adaptive and Learning Systems for Signal Processing, Communications, and Control, John Wiley & Sons, New York, NY, USA, 1998.
N. Aronszajn, “Theory of reproducing kernels,” Transactions of the American Mathematical Society, vol. 68, pp. 337–404, 1950.
View at: Publisher Site | Google Scholar | Zentralblatt MATH
H. Tong, D.-R. Chen, and L. Peng, “Analysis of support vector machines regression,” Foundations of Computational Mathematics, vol. 9, no. 2, pp. 243–257, 2009.
View at: Publisher Site | Google Scholar | Zentralblatt MATH
R. Koenker, Quantile Regression, vol. 38 of Econometric Society Monographs, Cambridge University Press, Cambridge, UK, 2005.
View at: Publisher Site
I. Steinwart and A. Christman, “How SVMs can estimate quantile and the median,” in Advances in Neural Information Processing Systems, vol. 20, pp. 305–312, MIT Press, Cambridge, Mass, USA, 2008.
View at: Google Scholar
I. Steinwart and A. Christmann, “Estimating conditional quantiles with the help of the pinball loss,” Bernoulli, vol. 17, no. 1, pp. 211–225, 2011.
View at: Publisher Site | Google Scholar
D. H. Xiang, “Conditional quantiles with varying Gaussians,” submitted to Advances in Computational Mathematics.
View at: Google Scholar
T. Hu, D. H. Xiang, and D. X. Zhou, “Online learning for quantile regression and support vector regression,” Preprint.
View at: Google Scholar
D.-R. Chen, Q. Wu, Y. Ying, and D.-X. Zhou, “Support vector machine soft margin classifiers: error analysis,” Journal of Machine Learning Research, vol. 5, pp. 1143–1175, 2004.
View at: Google Scholar
D.-X. Zhou, “The covering number in learning theory,” Journal of Complexity, vol. 18, no. 3, pp. 739–767, 2002.
View at: Publisher Site | Google Scholar | Zentralblatt MATH
D.-X. Zhou, “Capacity of reproducing kernel spaces in learning theory,” IEEE Transactions on Information Theory, vol. 49, no. 7, pp. 1743–1752, 2003.
View at: Publisher Site | Google Scholar
F. Cucker and D.-X. Zhou, Learning Theory: An Approximation Theory Viewpoint, Cambridge Monographs on Applied and Computational Mathematics, Cambridge University Press, Cambridge, UK, 2007.
View at: Publisher Site
Q. Wu, Y. Ying, and D.-X. Zhou, “Learning rates of least-square regularized regression,” Foundations of Computational Mathematics, vol. 6, no. 2, pp. 171–192, 2006.
View at: Publisher Site | Google Scholar | Zentralblatt MATH
Q. Wu and D.-X. Zhou, “Analysis of support vector machine classification,” Journal of Computational Analysis and Applications, vol. 8, no. 2, pp. 99–119, 2006.
View at: Google Scholar | Zentralblatt MATH
T. Hu, “Online regression with varying Gaussians and non-identical distributions,” Analysis and Applications, vol. 9, no. 4, pp. 395–408, 2011.
View at: Publisher Site | Google Scholar
D.-H. Xiang, T. Hu, and D.-X. Zhou, “Learning with varying insensitive loss,” Applied Mathematics Letters, vol. 24, no. 12, pp. 2107–2109, 2011.
View at: Publisher Site | Google Scholar
S. Smale and D.-X. Zhou, “Learning theory estimates via integral operators and their approximations,” Constructive Approximation, vol. 26, no. 2, pp. 153–172, 2007.
View at: Publisher Site | Google Scholar | Zentralblatt MATH
S. Smale and D.-X. Zhou, “Online learning with Markov sampling,” Analysis and Applications, vol. 7, no. 1, pp. 87–113, 2009.
View at: Publisher Site | Google Scholar | Zentralblatt MATH
Y. Yao, “On complexity issues of online learning algorithms,” IEEE Transactions on Information Theory, vol. 56, no. 12, pp. 6470–6481, 2010.
View at: Publisher Site | Google Scholar
Y. Ying, “Convergence analysis of online algorithms,” Advances in Computational Mathematics, vol. 27, no. 3, pp. 273–291, 2007.
View at: Publisher Site | Google Scholar | Zentralblatt MATH
I. Steinwart and C. Scovel, “Fast rates for support vector machines using Gaussian kernels,” The Annals of Statistics, vol. 35, no. 2, pp. 575–607, 2007.
View at: Publisher Site | Google Scholar | Zentralblatt MATH

Copyright

Copyright © 2012 Dao-Hong Xiang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

PDF Download Citation

Download other formats

Order printed copies

Views

1483

Downloads

1207

Citations