Special Issue: Learning Theory
Research Article | Open Access
ERM Scheme for Quantile Regression
This paper considers the ERM scheme for quantile regression. We conduct error analysis for this learning algorithm by means of a variance-expectation bound when a noise condition is satisfied for the underlying probability measure. The learning rates are derived by applying concentration techniques involving the $\ell^2$-empirical covering numbers.
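In this paper, we study the empirical risk minimization (ERM) scheme for quantile regression. Let $X$ be a compact metric space (input space) and $Y = \mathbb{R}$. Let $\rho$ be a fixed but unknown probability distribution on $Z := X \times Y$ which describes the noise of sampling. Conditional quantile regression aims at producing functions to estimate quantile regression functions. With a prespecified quantile parameter $\tau \in (0,1)$, a quantile regression function $f_{\rho,\tau}$ is defined by its value $f_{\rho,\tau}(x)$ being a $\tau$-quantile of $\rho(\cdot \mid x)$, that is, a value $u \in Y$ satisfying
$$\rho(\{y \in Y : y \le u\} \mid x) \ge \tau \quad \text{and} \quad \rho(\{y \in Y : y \ge u\} \mid x) \ge 1 - \tau, \tag{1}$$
where $\rho(\cdot \mid x)$ is the conditional distribution of $\rho$ at $x \in X$.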
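We consider a learning algorithm generated by the ERM scheme associated with the pinball loss and a hypothesis space $\mathcal{H}$. The pinball loss $\psi_\tau : \mathbb{R} \to \mathbb{R}_+$ is defined by
$$\psi_\tau(u) = \begin{cases} (1-\tau)\,u, & u > 0, \\ -\tau u, & u \le 0. \end{cases} \tag{2}$$
The hypothesis space $\mathcal{H}$ is a compact subset of $C(X)$, so there exists some $M > 0$ such that $\|f\|_{C(X)} \le M$ for any $f \in \mathcal{H}$. We assume without loss of generality that $\|f\|_{C(X)} \le 1$ for any $f \in \mathcal{H}$.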
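As a minimal sketch of the pinball loss, the Python snippet below assumes the convention $\psi_\tau(u) = (1-\tau)u$ for $u > 0$ and $\psi_\tau(u) = -\tau u$ for $u \le 0$, applied to the residual $u = f(x) - y$. The function names, the synthetic data, and the grid search are illustrative assumptions and are not taken from the paper. The snippet checks numerically that the constant minimizing the average pinball loss sits close to the sample $\tau$-quantile.

```python
import numpy as np

def pinball_loss(u, tau):
    """Pinball loss of the residual u = f(x) - y (assumed sign convention)."""
    u = np.asarray(u, dtype=float)
    return np.where(u > 0, (1.0 - tau) * u, -tau * u)

def empirical_risk_constant(t, y, tau):
    """Average pinball loss of the constant prediction t on the targets y."""
    return float(np.mean(pinball_loss(t - y, tau)))

# Synthetic targets in [-1, 1] (hypothetical data, for illustration only).
rng = np.random.default_rng(0)
y = rng.uniform(-1.0, 1.0, size=2001)
tau = 0.3

# Brute-force search over constant predictions on a grid of step 0.005.
grid = np.linspace(-1.0, 1.0, 401)
risks = [empirical_risk_constant(t, y, tau) for t in grid]
best_t = float(grid[int(np.argmin(risks))])

# The minimizer should agree with the sample tau-quantile up to grid resolution.
sample_quantile = float(np.quantile(y, tau))
print(best_t, sample_quantile)
```

The asymmetry of the two slopes is what moves the minimizer away from the median: overestimates are charged at rate $1-\tau$ and underestimates at rate $\tau$.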
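The ERM scheme for quantile regression is defined with a sample $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m \in Z^m$ drawn independently from $\rho$ as follows:
$$f_{\mathbf{z}} = \arg\min_{f \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^m \psi_\tau\bigl(f(x_i) - y_i\bigr). \tag{3}$$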
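Scheme (3) can be sketched in Python for a finite hypothesis class, which is trivially compact in $C(X)$. Everything specific here is an assumption made for illustration: the input space $X = [0,1]$, the class of clipped affine functions on a coefficient grid, the synthetic data, and the sign convention $\psi_\tau(f(x) - y)$ for the pinball loss.

```python
import numpy as np

def pinball_loss(u, tau):
    """Pinball loss of the residual u = f(x) - y (assumed sign convention)."""
    u = np.asarray(u, dtype=float)
    return np.where(u > 0, (1.0 - tau) * u, -tau * u)

def erm_quantile(x, y, tau, coeffs):
    """Pick the pair (a, b) from `coeffs` whose clipped affine function
    f(x) = clip(a + b*x, -1, 1) has the smallest empirical pinball risk."""
    a = coeffs[:, 0:1]                                # shape (n_hyp, 1)
    b = coeffs[:, 1:2]
    preds = np.clip(a + b * x[None, :], -1.0, 1.0)    # shape (n_hyp, m)
    risks = pinball_loss(preds - y[None, :], tau).mean(axis=1)
    k = int(np.argmin(risks))
    return coeffs[k], float(risks[k])

# Synthetic data (hypothetical): y = 0.5*x - 0.25 + uniform noise on [-0.2, 0.2].
rng = np.random.default_rng(1)
m, tau = 2000, 0.75
x = rng.uniform(0.0, 1.0, size=m)
y = 0.5 * x - 0.25 + rng.uniform(-0.2, 0.2, size=m)

# Finite hypothesis class: 41 x 41 grid of intercepts and slopes in [-1, 1].
grid = np.linspace(-1.0, 1.0, 41)
coeffs = np.array([(a, b) for a in grid for b in grid])

(a_hat, b_hat), risk_hat = erm_quantile(x, y, tau, coeffs)

# For this noise model the tau-quantile function is 0.5*x - 0.25 + (0.4*tau - 0.2).
xs = np.linspace(0.0, 1.0, 101)
f_hat = np.clip(a_hat + b_hat * xs, -1.0, 1.0)
f_true = 0.5 * xs - 0.25 + (0.4 * tau - 0.2)
sup_error = float(np.max(np.abs(f_hat - f_true)))
print(a_hat, b_hat, sup_error)
```

Clipping every candidate to $[-1,1]$ mirrors the uniform bound assumed on the hypothesis space; the exhaustive search stands in for whatever optimizer one would use on a richer class.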
A family of kernel based learning algorithms for quantile regression has been widely studied in the literature ([1–4] and references therein). These algorithms take the form of a regularized scheme in a reproducing kernel Hilbert space $\mathcal{H}_K$ (RKHS, see [5] for details) associated with a Mercer kernel $K$. Given a sample $\mathbf{z}$, the kernel based regularized scheme for quantile regression is defined by
$$f_{\mathbf{z},\lambda} = \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{m} \sum_{i=1}^m \psi_\tau\bigl(f(x_i) - y_i\bigr) + \lambda \|f\|_K^2 \right\}. \tag{4}$$
In [1, 3, 4], error analysis for general $\mathcal{H}_K$ has been done. Learning with varying Gaussian kernels was studied in [2].
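ERM scheme (3) is very different from kernel based regularized scheme (4). The output function $f_{\mathbf{z}}$ produced by the ERM scheme has a uniform bound: under our assumption, $\|f_{\mathbf{z}}\|_{C(X)} \le 1$. However, we cannot expect this for $f_{\mathbf{z},\lambda}$. Comparing with $f = 0$ in (4), it is easy to see that $\lambda \|f_{\mathbf{z},\lambda}\|_K^2 \le \frac{1}{m} \sum_{i=1}^m \psi_\tau(-y_i)$, so the available bound on $\|f_{\mathbf{z},\lambda}\|_K$ is only of order $\lambda^{-1/2}$. It happens often that $\|f_{\mathbf{z},\lambda}\|_K \to \infty$ as $\lambda \to 0$. The lack of a uniform bound for $f_{\mathbf{z},\lambda}$ has a serious negative impact on the learning rates. So in the literature on kernel based regularized schemes for quantile regression, values of the output function $f_{\mathbf{z},\lambda}$ are always projected onto the interval $[-1,1]$, and error analysis is conducted for the projected function, not $f_{\mathbf{z},\lambda}$ itself.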
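The projection step mentioned above can be sketched as a simple clipping operation. The snippet assumes the interval is $[-1,1]$, that the responses also lie in $[-1,1]$, and the sign convention $\psi_\tau(f(x) - y)$ for the pinball loss; the function names and the random "unbounded predictions" are hypothetical. It checks numerically that projecting never increases the pointwise pinball loss under these assumptions.

```python
import numpy as np

def pinball_loss(u, tau):
    """Pinball loss of the residual u = f(x) - y (assumed sign convention)."""
    u = np.asarray(u, dtype=float)
    return np.where(u > 0, (1.0 - tau) * u, -tau * u)

def project(values):
    """Project function values onto [-1, 1] by clipping."""
    return np.clip(values, -1.0, 1.0)

rng = np.random.default_rng(2)
tau = 0.4
y = rng.uniform(-1.0, 1.0, size=1000)          # responses inside [-1, 1]
f_values = rng.normal(0.0, 3.0, size=1000)     # unbounded predictions (hypothetical)

loss_raw = pinball_loss(f_values - y, tau)
loss_proj = pinball_loss(project(f_values) - y, tau)

# Clipping moves a prediction toward y without crossing it, so the loss cannot grow.
never_worse = bool(np.all(loss_proj <= loss_raw + 1e-12))
print(never_worse)
```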
In this paper, we aim at establishing convergence and learning rates for the error $\|f_{\mathbf{z}} - f_{\rho,\tau}\|_{L^r_{\rho_X}}$ in the space $L^r_{\rho_X}$. Here $r$ depends on the pair $(p, q)$, which will be specified in Section 2, and $\rho_X$ is the marginal distribution of $\rho$ on $X$. In the rest of this paper, we assume $Y = [-1, 1]$, which in turn implies that the values of the target function $f_{\rho,\tau}$ lie in the same interval.
2. Noise Condition and Main Results
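There is a large literature in learning theory (see [6] and references therein) devoted to least square regression, which aims at learning the regression function $f_\rho(x) = \int_Y y \, d\rho(y \mid x)$. The identity $\mathcal{E}^{ls}(f) - \mathcal{E}^{ls}(f_\rho) = \|f - f_\rho\|_{L^2_{\rho_X}}^2$ for the generalization error $\mathcal{E}^{ls}(f) = \int_Z (y - f(x))^2 \, d\rho$ leads to a variance-expectation bound of the form $\mathbb{E}\xi^2 \le c\, \mathbb{E}\xi$ with a constant $c > 0$, where $\xi = (y - f(x))^2 - (y - f_\rho(x))^2$ on $Z$. It plays an essential role in the error analysis of kernel based regularized schemes.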
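The least square identity referred to above is, in its standard form, $\mathcal{E}^{ls}(f) - \mathcal{E}^{ls}(f_\rho) = \|f - f_\rho\|^2_{L^2_{\rho_X}}$. The following Monte Carlo sketch checks it on synthetic data; the regression function, the noise model, and the competing predictor are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 200_000
x = rng.uniform(0.0, 1.0, size=m)
f_rho = 0.5 * x                                  # regression function (hypothetical)
y = f_rho + rng.uniform(-0.3, 0.3, size=m)       # zero-mean noise around f_rho
f = np.full(m, 0.2)                              # a competing constant predictor

# Left-hand side: empirical excess least square risk of f over f_rho.
excess_risk = float(np.mean((y - f) ** 2) - np.mean((y - f_rho) ** 2))
# Right-hand side: empirical squared L2 distance between f and f_rho.
squared_distance = float(np.mean((f - f_rho) ** 2))
gap = abs(excess_risk - squared_distance)

# Variance-expectation ratio E[xi^2] / E[xi] for xi = (y - f)^2 - (y - f_rho)^2.
xi = (y - f) ** 2 - (y - f_rho) ** 2
ratio = float(np.mean(xi ** 2) / np.mean(xi))
print(gap, ratio)
```

The gap is a pure sampling fluctuation (the cross term has zero mean), which is why it shrinks as the sample grows, while the ratio stays bounded because all quantities are bounded.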
However, this identity and the variance-expectation bound fail in the setting of quantile regression. The reason is that the pinball loss lacks strong convexity. If we impose a noise condition on the distribution $\rho$, namely that it has a $\tau$-quantile of $p$-average type $q$ (see Definition 1), we obtain a similar relation, which in turn yields the variance-expectation bound stated below and proved by Steinwart and Christmann [1].
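Definition 1. Let $p \in (0, \infty]$ and $q \in (1, \infty)$. A distribution $\rho$ on $X \times [-1, 1]$ is said to have a $\tau$-quantile of $p$-average type $q$ if for $\rho_X$-almost every $x \in X$, there exist a $\tau$-quantile $t^* \in \mathbb{R}$ and constants $\alpha_{\rho(\cdot|x)} \in (0, 2]$, $b_{\rho(\cdot|x)} > 0$ such that for each $s \in [0, \alpha_{\rho(\cdot|x)}]$,
$$\rho(\{y : t^* - s < y < t^*\} \mid x) \ge b_{\rho(\cdot|x)}\, s^{q-1}, \qquad \rho(\{y : t^* < y < t^* + s\} \mid x) \ge b_{\rho(\cdot|x)}\, s^{q-1}, \tag{5}$$
and that the function $\gamma$ on $X$ defined by $\gamma(x) = b_{\rho(\cdot|x)}\, \alpha_{\rho(\cdot|x)}^{q-1}$ satisfies $\gamma^{-1} \in L^p_{\rho_X}$.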
We also need the capacity of the hypothesis space $\mathcal{H}$ for our learning rates. In this paper, we measure the capacity by empirical covering numbers.
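Definition 2. Let $(\mathcal{M}, d)$ be a pseudometric space and $S$ be a subset of $\mathcal{M}$. For every $\epsilon > 0$, the covering number $\mathcal{N}(S, \epsilon, d)$ of $S$ with respect to $\epsilon$ and $d$ is defined as the minimal number of balls of radius $\epsilon$ whose union covers $S$, that is,
$$\mathcal{N}(S, \epsilon, d) = \min\Bigl\{ l \in \mathbb{N} : S \subset \bigcup_{j=1}^{l} B(s_j, \epsilon) \ \text{for some} \ \{s_j\}_{j=1}^{l} \subset \mathcal{M} \Bigr\}, \tag{6}$$
where $B(s_j, \epsilon) = \{ s \in \mathcal{M} : d(s, s_j) \le \epsilon \}$ is a ball in $\mathcal{M}$.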
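Definition 3. Let $\mathcal{F}$ be a set of functions on $X$, $\mathbf{x} = (x_i)_{i=1}^k \in X^k$, and $k \in \mathbb{N}$. Set $\mathcal{F}|_{\mathbf{x}} = \{ (f(x_i))_{i=1}^k : f \in \mathcal{F} \} \subset \mathbb{R}^k$. The $\ell^2$-empirical covering number of $\mathcal{F}$ is defined by
$$\mathcal{N}_2(\mathcal{F}, \epsilon) = \sup_{k \in \mathbb{N}} \sup_{\mathbf{x} \in X^k} \mathcal{N}(\mathcal{F}|_{\mathbf{x}}, \epsilon, d_2), \qquad \epsilon > 0. \tag{7}$$
Here $d_2$ is the normalized $\ell^2$-metric on the Euclidean space $\mathbb{R}^k$ given by
$$d_2(\mathbf{a}, \mathbf{b}) = \Bigl( \frac{1}{k} \sum_{i=1}^{k} |a_i - b_i|^2 \Bigr)^{1/2}, \qquad \mathbf{a} = (a_i)_{i=1}^k, \ \mathbf{b} = (b_i)_{i=1}^k \in \mathbb{R}^k. \tag{8}$$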
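To make the empirical covering number concrete, the sketch below restricts a small function class to one fixed sample and builds a cover greedily under the normalized $\ell^2$ metric $d_2(\mathbf{a}, \mathbf{b}) = (\frac{1}{k}\sum_i |a_i - b_i|^2)^{1/2}$. Two caveats apply: the greedy count is only an upper bound on the covering number for that sample, and the definition additionally takes a supremum over all sample sizes and sample points, which the code does not attempt. The function class (linear functions with slopes in $[-1,1]$), the sample, and all names are hypothetical.

```python
import numpy as np

def d2(a, b):
    """Normalized l2 metric on R^k: sqrt((1/k) * sum_i |a_i - b_i|^2)."""
    return float(np.sqrt(np.mean((np.asarray(a, float) - np.asarray(b, float)) ** 2)))

def greedy_cover_size(points, eps):
    """Number of centers in a greedily built eps-cover of the rows of `points`
    under d2. This upper-bounds the covering number; it is not the exact value."""
    points = np.asarray(points, dtype=float)
    uncovered = np.ones(len(points), dtype=bool)
    n_centers = 0
    while uncovered.any():
        center = points[int(np.argmax(uncovered))]       # first uncovered row
        dist = np.sqrt(np.mean((points - center) ** 2, axis=1))
        uncovered &= dist > eps                          # rows within eps are covered
        n_centers += 1
    return n_centers

# Hypothetical class F = {x -> a*x : a in [-1, 1]}, discretized to 201 slopes,
# restricted to k = 50 sample points in [0, 1].
rng = np.random.default_rng(4)
xs = rng.uniform(0.0, 1.0, size=50)
slopes = np.linspace(-1.0, 1.0, 201)
restricted = slopes[:, None] * xs[None, :]               # shape (201, 50)

sizes = {eps: greedy_cover_size(restricted, eps) for eps in (0.05, 0.1, 0.5, 2.0)}
print(sizes)
```

Plotting the logarithm of such counts against $\log(1/\epsilon)$ gives a rough feel for the polynomial growth exponent that a capacity assumption of this kind controls.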
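Assumption. Assume that the $\ell^2$-empirical covering number of the hypothesis space $\mathcal{H}$ is bounded for some $a > 0$ and $\iota \in (0, 2)$:
$$\log \mathcal{N}_2(\mathcal{H}, \epsilon) \le a\, \epsilon^{-\iota}, \qquad \forall \epsilon > 0. \tag{9}$$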
Theorem 4. Assume that satisfies (5) with some and . Denote . One further assumes that is uniquely defined. If and satisfies (9) with , then for any , with confidence , one has where and is a constant independent of and .
Remark 5. In the ERM scheme, we can choose $f_{\rho,\tau} \in \mathcal{H}$, which in turn makes the approximation error described by (23) equal to zero. However, this is impossible for the kernel based regularized scheme because of the appearance of the penalty term $\lambda \|f\|_K^2$.
If , all conditional distributions around the quantile behave similar to the uniform distribution. In this case and for all . Hence, . Furthermore, when is large enough, the parameter tends to and the power index for the above learning rate arbitrarily approaches to which shows that the learning rate power index for is arbitrarily close to independent of . In particular, can be arbitrarily small when is smooth enough. In this case, the power index of the learning rates can be arbitrarily close to which is the optimal learning rate for the least square regression.
Let us take some examples to demonstrate the above main result.
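Example 6. Let $\mathcal{H}$ be a unit ball of the Sobolev space $H^s(X)$ with $s > n/2$. Observe that the empirical covering number is bounded above by the uniform covering number defined in Definition 2. Hence we have (see [6, 7])
$$\log \mathcal{N}_2(\mathcal{H}, \epsilon) \le C_s \Bigl( \frac{1}{\epsilon} \Bigr)^{n/s},$$
where $n$ is the dimension of the input space $X$ and $C_s > 0$ is a constant.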
Under the same assumptions as Theorem 4, we get that by replacing by , for any , with confidence , where and is a constant independent of and .
We carry out the same discussions on the case of and large enough as Remark 5. Therefore the power index of the learning rates for is arbitrarily close to independent of . Furthermore, can be arbitrarily large if the Sobolev space is smooth enough. In this special case, the learning rate power index arbitrarily approaches to .
Example 7. Let be a unit ball of the reproducing kernel Hilbert space generated by a Gaussian kernel (see ). Reference  tells us that
where depends only on and . Obviously, the right-hand side of (15) is bounded by .
So from Theorem 4, we can get different learning rates with power index If and is large enough, the power index of the learning rates for is arbitrarily close to which is very slow if is large. However, in most data sets the data are concentrated on a much lower dimensional manifold embedded in the high dimensional space. In this setting an analysis that replaces by the intrinsic dimension of the manifold would be of great interest (see  and references therein).
3. Error Analysis
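Define the noise-free error, called the generalization error, associated with the pinball loss as
$$\mathcal{E}(f) = \int_Z \psi_\tau\bigl(f(x) - y\bigr)\, d\rho, \qquad f : X \to \mathbb{R}.$$
Then the measurable function $f_{\rho,\tau}$ is a minimizer of $\mathcal{E}$. Obviously, $\mathcal{E}(f_{\mathbf{z}}) - \mathcal{E}(f_{\rho,\tau}) \ge 0$.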
We need the following results from [1] for our error analysis.
Proposition 8. Let be the pinball loss. Assume that satisfies (5) with some and . Then for all one has Furthermore, with one has
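The above result implies that we can get convergence rates of $f_{\mathbf{z}}$ in the space $L^r_{\rho_X}$ by bounding the excess generalization error $\mathcal{E}(f_{\mathbf{z}}) - \mathcal{E}(f_{\rho,\tau})$.
To bound $\mathcal{E}(f_{\mathbf{z}}) - \mathcal{E}(f_{\rho,\tau})$, we need a standard error decomposition procedure [9] and a concentration inequality.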
3.1. Error Decomposition
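Define the empirical error associated with the pinball loss as
$$\mathcal{E}_{\mathbf{z}}(f) = \frac{1}{m} \sum_{i=1}^m \psi_\tau\bigl(f(x_i) - y_i\bigr).$$
Define
$$f_{\mathcal{H}} = \arg\min_{f \in \mathcal{H}} \mathcal{E}(f).$$
Lemma 9. Let $f_{\mathbf{z}}$ be produced by (3). Then
$$\mathcal{E}(f_{\mathbf{z}}) - \mathcal{E}(f_{\rho,\tau}) \le \bigl\{ [\mathcal{E}(f_{\mathbf{z}}) - \mathcal{E}(f_{\rho,\tau})] - [\mathcal{E}_{\mathbf{z}}(f_{\mathbf{z}}) - \mathcal{E}_{\mathbf{z}}(f_{\rho,\tau})] \bigr\} + \bigl\{ [\mathcal{E}_{\mathbf{z}}(f_{\mathcal{H}}) - \mathcal{E}_{\mathbf{z}}(f_{\rho,\tau})] - [\mathcal{E}(f_{\mathcal{H}}) - \mathcal{E}(f_{\rho,\tau})] \bigr\} + \bigl\{ \mathcal{E}(f_{\mathcal{H}}) - \mathcal{E}(f_{\rho,\tau}) \bigr\}.$$
Proof. The excess generalization error can be written as
$$\mathcal{E}(f_{\mathbf{z}}) - \mathcal{E}(f_{\rho,\tau}) = [\mathcal{E}(f_{\mathbf{z}}) - \mathcal{E}_{\mathbf{z}}(f_{\mathbf{z}})] + [\mathcal{E}_{\mathbf{z}}(f_{\mathbf{z}}) - \mathcal{E}_{\mathbf{z}}(f_{\mathcal{H}})] + [\mathcal{E}_{\mathbf{z}}(f_{\mathcal{H}}) - \mathcal{E}(f_{\mathcal{H}})] + [\mathcal{E}(f_{\mathcal{H}}) - \mathcal{E}(f_{\rho,\tau})].$$
The definition of $f_{\mathbf{z}}$ implies that $\mathcal{E}_{\mathbf{z}}(f_{\mathbf{z}}) - \mathcal{E}_{\mathbf{z}}(f_{\mathcal{H}}) \le 0$. Furthermore, by subtracting and adding $\mathcal{E}(f_{\rho,\tau})$ and $\mathcal{E}_{\mathbf{z}}(f_{\rho,\tau})$ in the first term and the third term, the contributions $\mathcal{E}(f_{\rho,\tau}) - \mathcal{E}_{\mathbf{z}}(f_{\rho,\tau})$ and $\mathcal{E}_{\mathbf{z}}(f_{\rho,\tau}) - \mathcal{E}(f_{\rho,\tau})$ cancel, and we see that Lemma 9 holds true.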
3.2. Concentration Inequality and Sample Error
Let us recall the one-sided Bernstein inequality as follows.
Lemma 10. Let be a random variable on a probability space with variance satisfying for some constant . Then for any , with confidence , one has
Let us turn to estimating the sample error (3.5), which involves the function $f_{\mathbf{z}}$. This function runs over a set of functions since $\mathbf{z}$ is itself a random sample. To estimate it, we use the concentration inequality below involving empirical covering numbers [10–12].
Lemma 12. Let be a class of measurable functions on . Assume that there are constants and and and for every . If (7) holds, then there exists a constant depending only on such that for any , with probability at least , there holds where
We apply Lemma 12 to a function set , where
Proof. Take with the form where . Hence and .
The Lipschitz property of the pinball loss implies that For , we have where . It follows that Hence
Applying Lemma 12 with , and , we know that for any , with confidence , there holds Here Note that where is indicated in (32). Then our desired bound holds true.
3.3. Bounding the Total Error
Now we are in a position to present our general result on error analysis for algorithm (3).
Theorem 15. Assume that satisfies (5) with some and . Denote . Further assume that satisfies (9) with and is uniquely defined. Then for any , with confidence , one has where and are constant independent of and .
4. Further Discussions
In this paper, we studied ERM algorithm (3) for quantile regression and provided convergence and learning rates. We showed some essential differences between the ERM scheme and the kernel based regularized scheme for quantile regression. We also pointed out the main difficulty in dealing with quantile regression: the lack of strong convexity of the pinball loss. To overcome this difficulty, a noise condition on $\rho$ is imposed, which enables us to get a variance-expectation bound similar to the one for least square regression.
In our analysis we just consider and . The case for would be interesting in the future work. The approximation error involving can be estimated by the knowledge of interpolation space.
In our setting, the sample is drawn independently from the distribution $\rho$. However, in many practical problems, the i.i.d. condition is rather demanding, so it would be interesting to investigate the ERM scheme for quantile regression with nonidentical distributions [13, 14] or dependent sampling [15].
The work described in this paper is supported by the NSF of China under Grants 11001247 and 61170109.
[1] I. Steinwart and A. Christmann, How SVMs Can Estimate Quantiles and the Median, vol. 20 of Advances in Neural Information Processing Systems, MIT Press, Cambridge, Mass, USA, 2008.
[2] D.-H. Xiang, “Conditional quantiles with varying Gaussians,” Advances in Computational Mathematics, 2011.
[3] D.-H. Xiang, T. Hu, and D.-X. Zhou, “Learning with varying insensitive loss,” Applied Mathematics Letters, vol. 24, no. 12, pp. 2107–2109, 2011.
[4] D.-H. Xiang, T. Hu, and D.-X. Zhou, “Approximation analysis of learning algorithms for support vector regression and quantile regression,” Journal of Applied Mathematics, vol. 2012, Article ID 902139, 17 pages, 2012.
[5] N. Aronszajn, “Theory of reproducing kernels,” Transactions of the American Mathematical Society, vol. 68, pp. 337–404, 1950.
[6] F. Cucker and D.-X. Zhou, Learning Theory: An Approximation Theory Viewpoint, vol. 24, Cambridge University Press, Cambridge, UK, 2007.
[7] D.-X. Zhou, “The covering number in learning theory,” Journal of Complexity, vol. 18, no. 3, pp. 739–767, 2002.
[8] S. Mukherjee, Q. Wu, and D.-X. Zhou, “Learning gradients on manifolds,” Bernoulli, vol. 16, no. 1, pp. 181–207, 2010.
[9] S. Smale and D.-X. Zhou, “Estimating the approximation error in learning theory,” Analysis and Applications, vol. 1, no. 1, pp. 17–41, 2003.
[10] S. Mukherjee and Q. Wu, “Estimation of gradients and coordinate covariation in classification,” Journal of Machine Learning Research, vol. 7, pp. 2481–2514, 2006.
[11] Y. Yao, “On complexity issues of online learning algorithms,” IEEE Transactions on Information Theory, vol. 56, no. 12, pp. 6470–6481, 2010.
[12] Y. Ying, “Convergence analysis of online algorithms,” Advances in Computational Mathematics, vol. 27, no. 3, pp. 273–291, 2007.
[13] T. Hu and D.-X. Zhou, “Online learning with samples drawn from non-identical distributions,” Journal of Machine Learning Research, vol. 10, pp. 2873–2898, 2009.
[14] S. Smale and D.-X. Zhou, “Online learning with Markov sampling,” Analysis and Applications, vol. 7, no. 1, pp. 87–113, 2009.
[15] Z.-C. Guo and L. Shi, “Classification with non-i.i.d. sampling,” Mathematical and Computer Modelling, vol. 54, no. 5-6, pp. 1347–1364, 2011.
Copyright © 2013 Dao-Hong Xiang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.