Abstract
We study learning algorithms generated by regularization schemes in reproducing kernel Hilbert spaces associated with an $\epsilon$-insensitive pinball loss. This loss function is motivated by the $\epsilon$-insensitive loss for support vector regression and the pinball loss for quantile regression. Approximation analysis is conducted for these algorithms by means of a variance-expectation bound when a noise condition is satisfied for the underlying probability measure. The rates are explicitly derived under a priori conditions on approximation and capacity of the reproducing kernel Hilbert space. As an application, we get approximation orders for the support vector regression and the quantile regularized regression.
1. Introduction and Motivation
In this paper, we study a family of learning algorithms serving the purposes of both support vector regression and quantile regression. Approximation analysis and learning rates will be provided, which also lead to a better understanding of some classical learning methods.
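Support vector regression is a classical kernel-based algorithm in learning theory introduced in [1]. It is a regularization scheme in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with an $\epsilon$-insensitive loss $\psi^{\epsilon}: \mathbb{R} \to \mathbb{R}_+$ defined for $\epsilon \ge 0$ by
$$\psi^{\epsilon}(u) = \max\{|u| - \epsilon, 0\}. \qquad (1.1)$$
Here, for learning functions on a compact metric space $X$, $K: X \times X \to \mathbb{R}$ is a continuous, symmetric, and positive semidefinite function called a Mercer kernel. The associated RKHS $\mathcal{H}_K$ is defined [2] as the completion of the linear span of the set of functions $\{K_x = K(x, \cdot) : x \in X\}$ with the inner product $\langle \cdot, \cdot \rangle_K$ satisfying $\langle K_x, K_u \rangle_K = K(x, u)$. Let $Y = \mathbb{R}$ and $\rho$ be a Borel probability measure on $Z = X \times Y$. With a sample $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m \in Z^m$ independently drawn according to $\rho$, the support vector regression is defined as
$$f_{\mathbf{z}}^{SVR} = \arg\min_{f \in \mathcal{H}_K} \Big\{ \frac{1}{m} \sum_{i=1}^m \psi^{\epsilon}(f(x_i) - y_i) + \lambda \|f\|_K^2 \Big\}, \qquad (1.2)$$
where $\lambda = \lambda(m) > 0$ is a regularization parameter.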
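As a concrete illustration of the Mercer kernel conditions, the following minimal sketch builds the Gram matrix of a Gaussian kernel on a few sample points and checks symmetry and positive semidefiniteness numerically. The Gaussian kernel, the bandwidth `sigma`, and all function names are assumptions chosen for illustration; the paper does not fix a particular kernel.

```python
import numpy as np


def gaussian_kernel(x, u, sigma=0.5):
    """Gaussian kernel matrix for point sets of shape (m, d) and (n, d).

    The Gaussian kernel is an illustrative choice of Mercer kernel.
    """
    sq_dist = ((x[:, None, :] - u[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


# Eight sample points in X = [0, 1], stored as a column of 1-d inputs.
points = np.linspace(0.0, 1.0, 8).reshape(-1, 1)
gram = gaussian_kernel(points, points)

# Symmetry: K(x, u) = K(u, x).
is_symmetric = np.allclose(gram, gram.T)

# Positive semidefiniteness: the smallest eigenvalue of the Gram matrix
# is nonnegative up to rounding error.
min_eigenvalue = np.linalg.eigvalsh(gram).min()
```

Each entry `gram[i, j]` equals the inner product of the kernel sections at the two sample points, which is how the reproducing structure enters finite-sample computations.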
When $\epsilon > 0$ is fixed, convergence of (1.2) was analyzed in [3]. The insensitive parameter $\epsilon$ was originally introduced [1] to balance the approximation and sparsity of the algorithm, so $\epsilon$ should change with the sample size, and usually $\epsilon = \epsilon(m) \to 0$ as the sample size $m$ increases. Mathematical analysis for this original algorithm is still open. We will solve this problem in a special case of our approximation analysis for general learning algorithms. In particular, we show how $f_{\mathbf{z}}^{SVR}$ approximates the median function $f_{\rho,1/2}$, which is one of the purposes of this paper. Here, for $x \in X$, the median function value $f_{\rho,1/2}(x)$ is a median of the conditional distribution $\rho(\cdot|x)$ of $\rho$ at $x$.
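Quantile regression, compared with the least squares regression, provides richer information about response variables such as stretching or compressing tails [4]. It aims at estimating quantile regression functions. With a quantile parameter $0 < \tau < 1$, a quantile regression function $f_{\rho,\tau}$ is defined by its value $f_{\rho,\tau}(x)$ to be a $\tau$-quantile of $\rho(\cdot|x)$, that is, a value $u \in Y$ satisfying
$$\rho(\{y \in Y : y \le u\} \mid x) \ge \tau, \qquad \rho(\{y \in Y : y \ge u\} \mid x) \ge 1 - \tau. \qquad (1.3)$$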
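Quantile regression has been studied by kernel-based regularization schemes in the learning theory literature (e.g., [5–8]). These regularization schemes take the form
$$f_{\mathbf{z}}^{QR} = \arg\min_{f \in \mathcal{H}_K} \Big\{ \frac{1}{m} \sum_{i=1}^m \psi_{\tau}(f(x_i) - y_i) + \lambda \|f\|_K^2 \Big\}, \qquad (1.4)$$
where $\psi_{\tau}: \mathbb{R} \to \mathbb{R}_+$ is the pinball loss shown in Figure 1 defined by
$$\psi_{\tau}(u) = \begin{cases} (1 - \tau) u, & \text{if } u > 0, \\ -\tau u, & \text{if } u \le 0. \end{cases} \qquad (1.5)$$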
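The link between the pinball loss and quantiles can be checked on a toy sample. The sketch below assumes the convention that the loss is applied to the residual `t - y` (prediction minus response) with slope `1 - tau` on the positive side. It minimizes the empirical pinball risk over constant predictions and tests the two-sided quantile condition on the empirical distribution. The function names and the toy data are hypothetical.

```python
import numpy as np


def pinball_loss(u, tau):
    """Pinball loss: (1 - tau) * u for u > 0 and -tau * u for u <= 0."""
    u = np.asarray(u, dtype=float)
    return np.where(u > 0, (1.0 - tau) * u, -tau * u)


def empirical_risk(t, y, tau):
    """Average pinball loss of the constant prediction t on the sample y."""
    return pinball_loss(t - y, tau).mean()


def is_tau_quantile(t, y, tau):
    """Two-sided quantile condition, checked on the empirical distribution."""
    return bool(np.mean(y <= t) >= tau and np.mean(y >= t) >= 1.0 - tau)


# Toy sample 1, 2, ..., 9 and quantile level tau = 0.3.
y = np.arange(1.0, 10.0)
tau = 0.3

# Search for the minimizer among the sample points themselves.
risks = np.array([empirical_risk(t, y, tau) for t in y])
best = y[np.argmin(risks)]
```

On this sample the risk minimizer among the sample points is the unique empirical 0.3-quantile, which mirrors the fact that the generalization error for the pinball loss is minimized by the quantile regression function.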
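Motivated by the $\epsilon$-insensitive loss $\psi^{\epsilon}$ and the pinball loss $\psi_{\tau}$, we propose the $\epsilon$-insensitive pinball loss $\psi_{\tau}^{\epsilon}$ with an insensitive parameter $\epsilon \ge 0$ shown in Figure 1 defined as
$$\psi_{\tau}^{\epsilon}(u) = \begin{cases} (1 - \tau)(u - \epsilon), & \text{if } u > \epsilon, \\ 0, & \text{if } -\epsilon \le u \le \epsilon, \\ -\tau (u + \epsilon), & \text{if } u < -\epsilon. \end{cases} \qquad (1.6)$$
This loss function has been applied to online learning for quantile regression in our previous work [8]. It is applied here to a regularization scheme in the RKHS as
$$f_{\mathbf{z}}^{(\epsilon)} = \arg\min_{f \in \mathcal{H}_K} \Big\{ \frac{1}{m} \sum_{i=1}^m \psi_{\tau}^{\epsilon}(f(x_i) - y_i) + \lambda \|f\|_K^2 \Big\}. \qquad (1.7)$$
The main goal of this paper is to study how the output function $f_{\mathbf{z}}^{(\epsilon)}$ given by (1.7) converges to the quantile regression function $f_{\rho,\tau}$ and how explicit learning rates can be obtained with suitable choices of the parameters $\lambda = \lambda(m)$ and $\epsilon = \epsilon(m)$ based on a priori conditions on the probability measure $\rho$.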
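The way this loss interpolates between the two classical losses can be sketched in a few lines. The code below assumes the standard piecewise-linear forms: the $\epsilon$-insensitive loss `max(|u| - eps, 0)`, the pinball loss with slope `1 - tau` for positive arguments and `-tau` for negative ones, and the $\epsilon$-insensitive pinball loss obtained by shifting both branches outward by `eps`. The function names are hypothetical.

```python
import numpy as np


def eps_insensitive_loss(u, eps):
    """Epsilon-insensitive loss used in support vector regression."""
    return np.maximum(np.abs(u) - eps, 0.0)


def pinball_loss(u, tau):
    """Pinball loss used in quantile regression."""
    u = np.asarray(u, dtype=float)
    return np.where(u > 0, (1.0 - tau) * u, -tau * u)


def eps_pinball_loss(u, tau, eps):
    """Epsilon-insensitive pinball loss: zero on [-eps, eps], linear outside."""
    u = np.asarray(u, dtype=float)
    upper = (1.0 - tau) * (u - eps)   # branch for u > eps
    lower = -tau * (u + eps)          # branch for u < -eps
    return np.where(u > eps, upper, np.where(u < -eps, lower, 0.0))


u = np.linspace(-2.0, 2.0, 401)

# eps = 0 recovers the pinball loss.
matches_pinball = np.allclose(eps_pinball_loss(u, 0.3, 0.0), pinball_loss(u, 0.3))

# tau = 1/2 recovers one half of the epsilon-insensitive loss.
matches_svr = np.allclose(
    eps_pinball_loss(u, 0.5, 0.25), 0.5 * eps_insensitive_loss(u, 0.25)
)
```

Under these assumed forms, the scheme with $\tau = 1/2$ coincides with support vector regression up to a factor of two in the regularization parameter.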
2. Main Results on Approximation
Throughout the paper, we assume that the conditional distribution $\rho(\cdot|x)$ is supported on $[-1, 1]$ for every $x \in X$. Then, we see from (1.3) that we can take values of $f_{\rho,\tau}$ to be on $[-1, 1]$. So to see how $f_{\mathbf{z}}^{(\epsilon)}$ approximates $f_{\rho,\tau}$, it is natural to project values of the output function $f_{\mathbf{z}}^{(\epsilon)}$ onto the same interval by the projection operator introduced in [9].
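Definition 2.1. The projection operator $\pi$ on the space of functions on $X$ is defined by
$$\pi(f)(x) = \begin{cases} 1, & \text{if } f(x) > 1, \\ -1, & \text{if } f(x) < -1, \\ f(x), & \text{if } -1 \le f(x) \le 1. \end{cases}$$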
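In computations, a projection of this kind amounts to clipping function values. The sketch below assumes that the target interval is $[-1, 1]$; the interval endpoints are exposed as parameters, and the function names are hypothetical.

```python
import numpy as np


def project_values(values, low=-1.0, high=1.0):
    """Clip function values to [low, high]; [-1, 1] is an assumed default."""
    return np.clip(values, low, high)


def project_function(f, low=-1.0, high=1.0):
    """Return the function x -> clip(f(x)), that is, the projected function."""
    return lambda x: np.clip(f(x), low, high)


raw = np.array([-3.0, -0.5, 0.2, 7.0])
clipped = project_values(raw)

# Projecting a function whose range exceeds the interval.
projected = project_function(lambda x: 2.0 * x)
```

Clipping never increases the distance to a target that already takes values in the interval, which is why the error is measured for the projected output function.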
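Our approximation analysis aims at establishing bounds for the error $\|\pi(f_{\mathbf{z}}^{(\epsilon)}) - f_{\rho,\tau}\|_{L^{p^*}_{\rho_X}}$ in the space $L^{p^*}_{\rho_X}$ with some $p^* > 0$, where $\rho_X$ is the marginal distribution of $\rho$ on $X$.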
2.1. Support Vector Regression and Quantile Regression
Our error bounds and learning rates are presented in terms of a noise condition and an approximation condition on $\rho$.
The noise condition on $\rho$ is defined in [5, 6] as follows.
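Definition 2.2. Let $p \in (0, \infty]$ and $q \in (1, \infty)$. We say that $\rho$ has a $\tau$-quantile of $p$-average type $q$ if for every $x \in X$, there exist a $\tau$-quantile $t^* \in \mathbb{R}$ and constants $a_x \in (0, 2]$, $b_x > 0$ such that for each $s \in [0, a_x]$,
$$\rho(\{y : t^* - s < y < t^*\} \mid x) \ge b_x s^{q-1}, \qquad \rho(\{y : t^* < y < t^* + s\} \mid x) \ge b_x s^{q-1}, \qquad (2.2)$$
and that the function on $X$ taking value $(b_x a_x^{q-1})^{-1}$ at $x \in X$ lies in $L^p_{\rho_X}$.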
Note that condition (2.2) tells us that $f_{\rho,\tau}(x)$ is uniquely defined at every $x \in X$.
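The approximation condition on $\rho$ is stated in terms of the integral operator $L_K: L^2_{\rho_X} \to L^2_{\rho_X}$ defined by $L_K f(x) = \int_X K(x, u) f(u) \, d\rho_X(u)$. Since $K$ is positive semidefinite, $L_K$ is a compact positive operator and its $r$-th power $L_K^r$ is well-defined for any $r > 0$. Our approximation condition is given as
$$f_{\rho,\tau} = L_K^r(g_{\rho,\tau}) \quad \text{for some } 0 < r \le \tfrac{1}{2} \text{ and } g_{\rho,\tau} \in L^2_{\rho_X}. \qquad (2.3)$$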
Let us illustrate our approximation analysis by the following special case which will be proved in Section 5.
Theorem 2.3. Let and . Assume and has a -quantile of -average type 2 for some . Take and with . Let . Then, with , for any , with confidence , one has where is a constant independent of or .
If for , we see that the power exponent for the learning rate (2.4) is at least . This exponent can be arbitrarily close to when is small enough.
In particular, if we take $\tau = 1/2$, Theorem 2.3 provides rates for the output function of the support vector regression (1.2) to approximate the median function $f_{\rho,1/2}$.
If we take leading to , Theorem 2.3 provides rates for output function of the quantile regression algorithm (1.4) to approximate the quantile regression function .
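To make the regularization schemes (1.2), (1.4), and (1.7) concrete, the following sketch minimizes an objective of this type over functions of the form $f = \sum_j \alpha_j K_{x_j}$ by subgradient descent in the RKHS. Everything here is an assumption made for illustration: the paper does not prescribe an optimization method, and the Gaussian kernel, the step sizes, the synthetic data, and all names are hypothetical. The loss is assumed to be zero on $[-\epsilon, \epsilon]$ with slopes $1 - \tau$ and $-\tau$ outside.

```python
import numpy as np


def eps_pinball_loss(u, tau, eps):
    """Assumed form of the epsilon-insensitive pinball loss."""
    return np.where(u > eps, (1.0 - tau) * (u - eps),
                    np.where(u < -eps, -tau * (u + eps), 0.0))


def eps_pinball_subgradient(u, tau, eps):
    """One subgradient of the loss at u (zero inside the insensitive zone)."""
    return np.where(u > eps, 1.0 - tau, np.where(u < -eps, -tau, 0.0))


def objective(alpha, gram, y, tau, eps, lam):
    """Empirical error plus lam * ||f||_K^2 for f = sum_j alpha_j K_{x_j}."""
    f_values = gram @ alpha
    penalty = lam * (alpha @ gram @ alpha)
    return eps_pinball_loss(f_values - y, tau, eps).mean() + penalty


def fit(gram, y, tau, eps, lam, steps=500, step0=1.0):
    """Subgradient descent on the coefficients, keeping the best iterate."""
    m = len(y)
    alpha = np.zeros(m)
    best_alpha = alpha.copy()
    best_obj = objective(alpha, gram, y, tau, eps, lam)
    for t in range(1, steps + 1):
        g = eps_pinball_subgradient(gram @ alpha - y, tau, eps)
        # In the RKHS, a subgradient of the objective at f has the
        # coefficient vector g / m + 2 * lam * alpha.
        alpha = alpha - (step0 / np.sqrt(t)) * (g / m + 2.0 * lam * alpha)
        obj = objective(alpha, gram, y, tau, eps, lam)
        if obj < best_obj:
            best_alpha, best_obj = alpha.copy(), obj
    return best_alpha, best_obj


# Synthetic data: a noisy sine curve on [0, 1].
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 40)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(40)
gram = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * 0.2 ** 2))

start_obj = objective(np.zeros(40), gram, y, 0.5, 0.05, 1e-3)
alpha_hat, final_obj = fit(gram, y, tau=0.5, eps=0.05, lam=1e-3)
```

Setting `tau=0.5` gives a median-type (support vector regression) fit, while `eps=0.0` gives plain pinball-loss quantile regression. A subgradient method is used because the loss is convex but not differentiable at $\pm\epsilon$.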
2.2. General Approximation Analysis
To state our approximation analysis in the general case, we need the capacity of the hypothesis space measured by covering numbers.
Definition 2.4. For a subset $S$ of $C(X)$ and $u > 0$, the covering number $\mathcal{N}(S, u)$ is the minimal integer $l \in \mathbb{N}$ such that there exist $l$ disks with radius $u$ covering $S$.
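The covering numbers of balls $B_R = \{f \in \mathcal{H}_K : \|f\|_K \le R\}$ with radius $R > 0$ of the RKHS have been well studied in the learning theory literature [10, 11]. In this paper, we assume for some $s > 0$ and $C_s > 0$ that
$$\log \mathcal{N}(B_1, u) \le C_s \Big(\frac{1}{u}\Big)^s, \qquad \forall u > 0. \qquad (2.5)$$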
Now we can state our main result which will be proved in Section 5. For and , we denote
Theorem 2.5. Assume (2.3) with and (2.5) with . Suppose that has a -quantile of -average type for some and . Take with and . Set with . Let Then, with , for any , with confidence , one has where is a constant independent of or and the power index is given in terms of , and by
The index can be viewed as a function of variables . The restriction on and (2.7) on ensure that is positive, which verifies the valid learning rate in Theorem 2.5.
Assumption (2.5) is a measurement of regularity of the kernel when is a subset of . In particular, can be arbitrarily small when is smooth enough. In this case, the power index in (2.8) can be arbitrarily close to . Again, when , , algorithm (1.7) corresponds to algorithm (1.4) for quantile regression. In this case, Theorem 2.5 provides learning rates for quantile regression algorithm (1.4).
Error analysis has been done for quantile regression algorithm (1.4) in [5, 6]. Under the assumptions that satisfies (2.2) with some and and the -empirical covering number (see [5] for more details) of is bounded as it was proved in [5] that with confidence , where and are constants independent of or . Here, is the regularization error defined as and is the generalization error associated with the pinball loss defined by Note that is minimized by the quantile regression function . Thus, when the regularization error decays polynomially as (which is ensured by Lemma 2.6 below when (2.3) is satisfied) and , then with Since , we see that this learning rate is comparable to our result in (2.8).
2.3. Comparison with Least Squares Regression
There has been a large literature in learning theory (described in [12]) for the least squares algorithms: It aims at learning the regression function . A crucial property for its error analysis is the identity for the least squares generalization error . It yields a variance-expectation bound for the random variable on where is an arbitrary measurable function. Such a variance-expectation bound with possibly replaced by its positive power plays an essential role for analyzing regularization schemes and the power exponent depends on strong convexity of the loss. See [13] and references therein. However, the pinball loss in the quantile regression setting has no strong convexity [6] and we would not expect a variance-expectation bound for a general distribution . When has a -quantile of -average type , the following variance-expectation bound with given by (2.6) can be found in [5, 7] (derived by means of Lemma 3.1 below).
Lemma 2.6. If has a -quantile of -average type for some and , then where the power index is given by (2.6) and the constant is .
Lemma 2.6 overcomes the difficulty of quantile regression caused by lack of strong convexity of the pinball loss. It enables us to derive satisfactory learning rates, as in Theorem 2.5.
3. Insensitive Relation and Error Decomposition
An important relation for quantile regression observed in [5] asserts that the error taken in a suitable space can be bounded by the excess generalization error when the noise condition is satisfied.
Lemma 3.1. Let and . Denote . If has a -quantile of -average type , then for any measurable function on , one has where .
By Lemma 3.1, to estimate , we only need to bound the excess generalization error . This will be done by conducting an error decomposition which has been developed in the literature for regularization schemes [9, 13–15]. Technical difficulty arises for our problem here because the insensitive parameter changes with . This can be overcome [16] by the following insensitive relation
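The following sketch numerically checks one natural form of an insensitive relation between the two losses, namely $\psi_{\tau}(u) - \epsilon \le \psi_{\tau}^{\epsilon}(u) \le \psi_{\tau}(u)$ for all real $u$. That this is exactly the relation (3.2) is an assumption; the check only confirms that the inequality holds for the assumed piecewise-linear forms of the losses. All names are hypothetical.

```python
import numpy as np


def pinball_loss(u, tau):
    """Assumed pinball loss: (1 - tau) * u for u > 0 and -tau * u otherwise."""
    return np.where(u > 0, (1.0 - tau) * u, -tau * u)


def eps_pinball_loss(u, tau, eps):
    """Assumed epsilon-insensitive pinball loss."""
    return np.where(u > eps, (1.0 - tau) * (u - eps),
                    np.where(u < -eps, -tau * (u + eps), 0.0))


u = np.linspace(-3.0, 3.0, 1201)
tolerance = 1e-12
relation_holds = True

for tau in (0.1, 0.5, 0.9):
    for eps in (0.0, 0.1, 0.5):
        plain = pinball_loss(u, tau)
        insensitive = eps_pinball_loss(u, tau, eps)
        # The insensitive loss lies between (pinball - eps) and pinball.
        lower_ok = np.all(plain - eps <= insensitive + tolerance)
        upper_ok = np.all(insensitive <= plain + tolerance)
        relation_holds = relation_holds and bool(lower_ok and upper_ok)
```

A two-sided bound of this kind means that replacing one loss by the other changes empirical and expected errors by at most $\epsilon$, which is what allows $\epsilon$ to vary with the sample size in an error decomposition.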
Now, we can conduct an error decomposition. Define the empirical error for as
Lemma 3.2. Let , be defined by (1.7) and Then, where
Proof. The regularized excess generalization error can be expressed as The fact implies that . The insensitive relation (3.2) and the definition of tell us that Then, by subtracting and adding and and noting , we see that the desired inequality in Lemma 3.2 holds true.
In the error decomposition (3.5), the first two terms are called sample error. The last term is the regularization error defined in (2.12). It can be estimated as follows.
Proposition 3.3. Assume (2.3). Define by (3.4). Then, one has where is the constant .
Proof. Let and It can be found in [17, 18] that when (2.3) holds, we have Hence, and . Since is Lipschitz, we know by taking in (2.12) that This verifies the desired bound for . By taking in (2.12), we have Then the bound for is proved.
4. Estimating Sample Error
This section is devoted to estimating the sample error. This is conducted by using the variance-expectation bound in Lemma 2.6.
Denote . For , denote
Proposition 4.1. Assume (2.3) and (2.5). Let and . If has a -quantile of -average type for some and , then there exists a subset of with measure at most such that for any , where is given by (2.6) and are constants given by
Proof. Let us first estimate the second part of the sample error. It can be decomposed into two parts where
For bounding , we take the random variable on . It satisfies . Hence, and . Applying the one-sided Bernstein inequality [12], we know that there exists a subset of with measure at least such that
For , we apply the one-sided Bernstein inequality again to the random variable , bound the variance by Lemma 2.6 with , and find that there exists another subset of with measure at least such that
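For reference, the one-sided Bernstein inequality is commonly stated as follows. The symbols $\xi$, $\mu$, $\sigma^2$, $M$, and $t$ are generic notation introduced here for illustration, and the form used in [12] may be phrased slightly differently.

```latex
% One-sided Bernstein inequality (standard form; notation is illustrative).
% Let \xi be a random variable on Z with mean \mu = E\xi and variance \sigma^2,
% satisfying |\xi - \mu| \le M almost surely. Then for every t > 0,
\[
  \operatorname{Prob}_{\mathbf{z} \in Z^m}
  \Big\{ \frac{1}{m} \sum_{i=1}^m \xi(z_i) - \mu > t \Big\}
  \le \exp\Big\{ - \frac{m t^2}{2\big(\sigma^2 + \tfrac{1}{3} M t\big)} \Big\}.
\]
% Setting the right-hand side equal to \delta and solving the resulting
% quadratic equation for t shows that, with confidence at least 1 - \delta,
\[
  \frac{1}{m} \sum_{i=1}^m \xi(z_i) - \mu
  \le \frac{2 M \log(1/\delta)}{3 m}
    + \sqrt{\frac{2 \sigma^2 \log(1/\delta)}{m}}.
\]
```

The second display shows why a variance bound matters: the dominant term scales with $\sigma$, so bounding the variance by a power of the expectation leads to faster rates than a bound in terms of $M$ alone.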
Next, we estimate the first part of the sample error. Consider the function set
A function from this set satisfies , , and by (2.16). Also, the Lipschitz property of the pinball loss yields . Then, we apply a standard covering number argument with a ratio inequality [12, 13, 19, 20] to and find from the covering number condition (2.5) that
Setting the confidence to be , we take to be the positive solution to the equation
Then, there exists a third subset of with measure at least such that
Thus, for , we have
Here, we have used the elementary inequality and Young’s inequality. Putting this bound and (4.5), (4.6) into (3.5), we know that for , there holds
which together with Proposition 3.3 implies
Here, we have used the reproducing property in which yields [12]
Equation (4.9) can be expressed as
By Lemma 7.2 in [12], the positive solution to this equation can be bounded as
Thus, for , the desired bound (4.2) holds true. Since the measure of the set is at least , our conclusion is proved.
5. Deriving Convergence Rates by Iteration
To apply Proposition 4.1 for error analysis, we need some for . One may choose according to which is seen by taking in (1.7). This choice is too rough. Recall from Proposition 3.3 that which is a bound for the noise-free limit of . It is much better than . This motivates us to try similar tight bounds for . This target will be achieved in this section by applying Proposition 4.1 iteratively. The iteration technique has been used in [13, 21] to improve learning rates.
Lemma 5.1. Assume (2.3) with and (2.5) with . Take with and with . Let . If has a -quantile of -average type for some and , then for any , with confidence , there holds where is given by
Proof. Putting with and with into Proposition 4.1, we know that for any there exists a subset of with measure at most such that
where with , the constants are given by
It follows that
Let us apply (5.6) iteratively to a sequence defined by and where will be determined later. Then, . By (5.1), . So we have
As the measure of is at most , we know that the measure of is at most . Hence, has measure at least .
Denote . The definition of the sequence tells us that
Let us bound the two terms on the right-hand side.
The first term equals
which is bounded by
Take to be the smallest integer greater than or equal to . The above expression can be bounded by .
The second term equals
where . It is bounded by
When , the above expression is bounded by . When , it is bounded by .
Based on the above discussion, we obtain
where . So with confidence , there holds
Then, our conclusion follows by replacing by and noting .
Now, we can prove our main result, Theorem 2.5.
Proof of Theorem 2.5. Take to be the right side of (5.2). By Lemma 5.1, there exists a subset of with measure at most such that . Applying Proposition 4.1 to this , we know that there exists another subset of with measure at most such that for any , where Since the set has measure at most , after scaling to and setting the constant by we see that the above estimate together with Lemma 3.1 gives the error bound with confidence and the power index given by provided that Since , we know that . By the restriction on , we find and . Moreover, restriction (2.7) on tells us that . Therefore, condition (5.20) is satisfied. The restriction and the above expression for tell us that the power index for the error bound can be exactly expressed by formula (2.9). The proof of Theorem 2.5 is complete.
Finally, we prove Theorem 2.3.
Proof of Theorem 2.3. Since , we know that (2.3) holds with . The noise condition on is satisfied with and . Then, . Since and , we know from [11] that (2.5) holds true for any . With , let us choose to be a positive number satisfying the following four inequalities: The first inequality above tells us that the restrictions on are satisfied by choosing and . The second inequality shows that condition (2.7) for the parameter renamed now as is also satisfied by taking . Thus, we apply Theorem 2.5 and know that with confidence , (2.8) holds with the power index given by (2.9); but , and imply that The last two inequalities satisfied by yield . So (2.8) verifies (2.4). This completes the proof of the theorem.
Acknowledgment
The work described in this paper is supported by the National Science Foundation of China under Grant 11001247 and by a grant from the Research Grants Council of Hong Kong (Project no. CityU 103709).