Abstract

We study learning algorithms generated by regularization schemes in reproducing kernel Hilbert spaces associated with an $\epsilon$-insensitive pinball loss. This loss function is motivated by the $\epsilon$-insensitive loss for support vector regression and the pinball loss for quantile regression. Approximation analysis is conducted for these algorithms by means of a variance-expectation bound when a noise condition is satisfied for the underlying probability measure. The rates are explicitly derived under a priori conditions on the approximation and capacity of the reproducing kernel Hilbert space. As an application, we obtain approximation orders for support vector regression and for regularized quantile regression.

1. Introduction and Motivation

In this paper, we study a family of learning algorithms serving both purposes of support vector regression and quantile regression. Approximation analysis and learning rates will be provided, which also lead to a better understanding of some classical learning methods.

Support vector regression is a classical kernel-based algorithm in learning theory introduced in [1]. It is a regularization scheme in a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with an $\epsilon$-insensitive loss $\psi_\epsilon:\mathbb{R}\to\mathbb{R}_+$ defined for $\epsilon\ge 0$ by
$$\psi_\epsilon(u)=\max\{|u|-\epsilon,0\}=\begin{cases}|u|-\epsilon,&\text{if }|u|\ge\epsilon,\\ 0,&\text{otherwise}.\end{cases}\qquad(1.1)$$
Here, for learning functions on a compact metric space $X$, $K:X\times X\to\mathbb{R}$ is a continuous, symmetric, and positive semidefinite function called a Mercer kernel. The associated RKHS $\mathcal{H}_K$ is defined [2] as the completion of the linear span of the set of functions $\{K_x=K(x,\cdot):x\in X\}$ with the inner product $\langle\cdot,\cdot\rangle_K$ satisfying $\langle K_x,K_y\rangle_K=K(x,y)$. Let $Y=\mathbb{R}$ and $\rho$ be a Borel probability measure on $Z=X\times Y$. With a sample $\mathbf{z}=\{(x_i,y_i)\}_{i=1}^m\in Z^m$ independently drawn according to $\rho$, the support vector regression is defined as
$$f_{\mathbf{z}}^{\mathrm{SVR}}=\arg\min_{f\in\mathcal{H}_K}\left\{\frac{1}{m}\sum_{i=1}^m\psi_\epsilon\bigl(f(x_i)-y_i\bigr)+\lambda\|f\|_K^2\right\},\qquad(1.2)$$
where $\lambda=\lambda(m)>0$ is a regularization parameter.
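As a computational aside (not part of the paper's analysis), the loss (1.1) is elementary to evaluate; the short NumPy sketch below shows the elementwise definition that enters the empirical risk in (1.2). The array values are arbitrary examples.

```python
import numpy as np

def eps_insensitive(u, eps):
    """epsilon-insensitive loss psi_eps(u) = max(|u| - eps, 0) of (1.1)."""
    return np.maximum(np.abs(u) - eps, 0.0)

print(eps_insensitive(np.array([-0.3, 0.05, 0.2]), eps=0.1))   # [0.2 0.  0.1]
```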

When $\epsilon>0$ is fixed, convergence of (1.2) was analyzed in [3]. Notice from the original motivation in [1], where the insensitive parameter $\epsilon$ balances the approximation ability and the sparsity of the algorithm, that $\epsilon$ should change with the sample size, usually with $\epsilon=\epsilon(m)\to 0$ as the sample size $m$ increases. Mathematical analysis for this original algorithm is still open. We will solve this problem as a special case of our approximation analysis for general learning algorithms. In particular, we show how $f_{\mathbf{z}}^{\mathrm{SVR}}$ approximates the median function $f_{\rho,1/2}$, which is one of the purposes of this paper. Here, for $x\in X$, the median function value $f_{\rho,1/2}(x)$ is a median of the conditional distribution $\rho(\cdot\,|\,x)$ of $\rho$ at $x$.

Quantile regression, compared with the least squares regression, provides richer information about response variables such as stretching or compressing tails [4]. It aims at estimating quantile regression functions. With a quantile parameter $0<\tau<1$, a quantile regression function $f_{\rho,\tau}$ is defined by its value $f_{\rho,\tau}(x)$ to be a $\tau$-quantile of $\rho(\cdot\,|\,x)$, that is, a value $u\in Y$ satisfying
$$\rho(\{y\in Y:y\le u\}\,|\,x)\ge\tau,\qquad \rho(\{y\in Y:y\ge u\}\,|\,x)\ge 1-\tau.\qquad(1.3)$$

Quantile regression has been studied by kernel-based regularization schemes in the learning theory literature (e.g., [5-8]). These regularization schemes take the form
$$f_{\mathbf{z}}^{\mathrm{QR}}=\arg\min_{f\in\mathcal{H}_K}\left\{\frac{1}{m}\sum_{i=1}^m\psi_\tau\bigl(f(x_i)-y_i\bigr)+\lambda\|f\|_K^2\right\},\qquad(1.4)$$
where $\psi_\tau:\mathbb{R}\to\mathbb{R}_+$ is the pinball loss shown in Figure 1 defined by
$$\psi_\tau(u)=\begin{cases}(1-\tau)u,&\text{if }u>0,\\ -\tau u,&\text{if }u\le 0.\end{cases}\qquad(1.5)$$

Motivated by the $\epsilon$-insensitive loss $\psi_\epsilon$ and the pinball loss $\psi_\tau$, we propose the $\epsilon$-insensitive pinball loss $\psi_\tau^\epsilon:\mathbb{R}\to\mathbb{R}_+$ with an insensitive parameter $\epsilon\ge 0$, shown in Figure 1 and defined as
$$\psi_\tau^\epsilon(u)=\begin{cases}(1-\tau)(u-\epsilon),&\text{if }u>\epsilon,\\ -\tau(u+\epsilon),&\text{if }u<-\epsilon,\\ 0,&\text{otherwise}.\end{cases}\qquad(1.6)$$
This loss function has been applied to online learning for quantile regression in our previous work [8]. It is applied here to a regularization scheme in the RKHS as
$$f_{\mathbf{z}}^{(\epsilon)}=f_{\mathbf{z},\lambda,\tau}^{(\epsilon)}=\arg\min_{f\in\mathcal{H}_K}\left\{\frac{1}{m}\sum_{i=1}^m\psi_\tau^\epsilon\bigl(f(x_i)-y_i\bigr)+\lambda\|f\|_K^2\right\}.\qquad(1.7)$$
The main goal of this paper is to study how the output function $f_{\mathbf{z}}^{(\epsilon)}$ given by (1.7) converges to the quantile regression function $f_{\rho,\tau}$ and how explicit learning rates can be obtained with suitable choices of the parameters $\lambda=m^{-\alpha}$, $\epsilon=m^{-\beta}$ based on a priori conditions on the probability measure $\rho$.
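To make the scheme (1.7) concrete, the following sketch implements the $\epsilon$-insensitive pinball loss and minimizes the regularized empirical risk over functions of the form $f=\sum_j c_j K(\cdot,x_j)$, the usual representer-theorem reduction for such RKHS schemes. The Gaussian kernel, the plain subgradient descent, and all parameter values are illustrative assumptions, not the algorithm studied in the paper.

```python
import numpy as np

def pinball_eps(u, tau, eps):
    """epsilon-insensitive pinball loss psi_tau^eps of (1.6)."""
    return np.where(u > eps, (1 - tau) * (u - eps),
                    np.where(u < -eps, -tau * (u + eps), 0.0))

def pinball_eps_subgrad(u, tau, eps):
    """One subgradient of psi_tau^eps (zero on the flat part |u| <= eps)."""
    return np.where(u > eps, 1.0 - tau, np.where(u < -eps, -tau, 0.0))

def gaussian_gram(x, width=0.5):
    return np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * width ** 2))

def fit_quantile(x, y, tau=0.5, lam=1e-2, eps=0.05, steps=3000, lr=0.5):
    """Subgradient descent on c for (1/m) sum psi_tau^eps(Kc - y) + lam * c'Kc."""
    m, K = len(x), gaussian_gram(x)
    c = np.zeros(m)
    for t in range(1, steps + 1):
        residual = K @ c - y                      # f(x_i) - y_i for f = sum_j c_j K(., x_j)
        grad = K @ pinball_eps_subgrad(residual, tau, eps) / m + 2.0 * lam * (K @ c)
        c -= (lr / np.sqrt(t)) * grad             # decaying step size
    return c, K

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 3.0, 80)
y = np.sin(x) + 0.2 * rng.standard_normal(80)     # toy data; tau = 0.5 targets the median
c, K = fit_quantile(x, y)
obj = np.mean(pinball_eps(K @ c - y, 0.5, 0.05)) + 1e-2 * (c @ K @ c)
print("regularized empirical risk:", float(obj))
```

Since $\psi_{1/2}^\epsilon=\frac{1}{2}\psi_\epsilon$, the case $\tau=1/2$ recovers the support vector regression objective (1.2) up to a rescaling of $\lambda$, while $\epsilon=0$ gives the pinball scheme (1.4).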

2. Main Results on Approximation

Throughout the paper, we assume that the conditional distribution $\rho(\cdot\,|\,x)$ is supported on $[-1,1]$ for every $x\in X$. Then, we see from (1.3) that we can take values of $f_{\rho,\tau}$ to be on $[-1,1]$. So to see how $f_{\mathbf{z}}^{(\epsilon)}$ approximates $f_{\rho,\tau}$, it is natural to project values of the output function $f_{\mathbf{z}}^{(\epsilon)}$ onto the same interval by the projection operator introduced in [9].

Definition 2.1. The projection operator $\pi$ on the space of functions on $X$ is defined by
$$\pi(f)(x)=\begin{cases}1,&\text{if }f(x)>1,\\ -1,&\text{if }f(x)<-1,\\ f(x),&\text{if }-1\le f(x)\le 1.\end{cases}\qquad(2.1)$$
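In code, the projection of Definition 2.1 is simply a clipping of function values to $[-1,1]$; a minimal illustration:

```python
import numpy as np

def project(values):
    """Projection operator pi of Definition 2.1: clip function values to [-1, 1]."""
    return np.clip(values, -1.0, 1.0)

print(project(np.array([-1.7, 0.3, 2.4])))   # [-1.   0.3  1. ]
```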

Our approximation analysis aims at establishing bounds for the error $\|\pi(f_{\mathbf{z}}^{(\epsilon)})-f_{\rho,\tau}\|_{L^{p^*}_{\rho_X}}$ in the space $L^{p^*}_{\rho_X}$ with some $p^*>0$, where $\rho_X$ is the marginal distribution of $\rho$ on $X$.

2.1. Support Vector Regression and Quantile Regression

Our error bounds and learning rates are presented in terms of a noise condition and approximation condition on 𝜌.

The noise condition on 𝜌 is defined in [5, 6] as follows.

Definition 2.2. Let $p\in(0,\infty]$ and $q\in(1,\infty)$. We say that $\rho$ has a $\tau$-quantile of $p$-average type $q$ if for every $x\in X$, there exist a $\tau$-quantile $t\in\mathbb{R}$ and constants $a_x\in(0,2]$, $b_x>0$ such that for each $u\in[0,a_x]$,
$$\rho(\{y:t-u\le y\le t\}\,|\,x)\ge b_xu^{q-1},\qquad \rho(\{y:t\le y\le t+u\}\,|\,x)\ge b_xu^{q-1},\qquad(2.2)$$
and that the function on $X$ taking value $(b_xa_x^{q-1})^{-1}$ at $x\in X$ lies in $L^p_{\rho_X}$.

Note that condition (2.2) tells us that 𝑓𝜌,𝜏(𝑥)=𝑡 is uniquely defined at every 𝑥𝑋.

The approximation condition on $\rho$ is stated in terms of the integral operator $L_K:L^2_{\rho_X}\to L^2_{\rho_X}$ defined by $L_K(f)(x)=\int_XK(x,u)f(u)\,d\rho_X(u)$. Since $K$ is positive semidefinite, $L_K$ is a compact positive operator and its $r$-th power $L_K^r$ is well-defined for any $r>0$. Our approximation condition is given as
$$f_{\rho,\tau}=L_K^r\bigl(g_{\rho,\tau}\bigr)\quad\text{for some }0<r\le\frac{1}{2},\ g_{\rho,\tau}\in L^2_{\rho_X}.\qquad(2.3)$$
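The source condition (2.3) can be illustrated numerically by discretizing $L_K$: replacing $\rho_X$ with the empirical measure of $n$ grid points turns $L_K$ into the matrix $\frac{1}{n}(K(x_i,x_j))_{i,j}$, and its eigendecomposition yields the fractional power $L_K^r$. In the sketch below the Gaussian kernel, the uniform grid on $[0,1]$, and the chosen $r$ are all illustrative assumptions; it builds a function $f=L_K^rg$ from a rough $g$ and shows that $f$ is much smoother.

```python
import numpy as np

n, r = 200, 0.4
x = np.linspace(0.0, 1.0, n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)    # Mercer kernel on the grid
LK = K / n                                           # discretized integral operator L_K

evals, evecs = np.linalg.eigh(LK)                    # L_K is symmetric positive semidefinite
evals = np.clip(evals, 0.0, None)
LK_r = evecs @ np.diag(evals ** r) @ evecs.T         # fractional power L_K^r

g = np.sign(np.sin(6 * np.pi * x))                   # a rough square-wave g in L^2
f = LK_r @ g                                         # f = L_K^r g is markedly smoother
print("total variation of g and of f:",
      float(np.abs(np.diff(g)).sum()), float(np.abs(np.diff(f)).sum()))
```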

Let us illustrate our approximation analysis by the following special case which will be proved in Section 5.

Theorem 2.3. Let $X\subset\mathbb{R}^n$ and $K\in C^\infty(X\times X)$. Assume $f_{\rho,\tau}\in\mathcal{H}_K$ and $\rho$ has a $\tau$-quantile of $p$-average type $2$ for some $p\in(0,\infty]$. Take $\lambda=m^{-(p+1)/(p+2)}$ and $\epsilon=m^{-\beta}$ with $(p+1)/(p+2)\le\beta\le\infty$. Let $0<\eta<(p+1)/(2(p+2))$. Then, with $p^*=2p/(p+1)>0$, for any $0<\delta<1$, with confidence $1-\delta$, one has
$$\bigl\|\pi\bigl(f_{\mathbf{z}}^{(\epsilon)}\bigr)-f_{\rho,\tau}\bigr\|_{L^{p^*}_{\rho_X}}\le C\log\frac{3}{\delta}\,m^{-\bigl(\frac{p+1}{2(p+2)}-\eta\bigr)},\qquad(2.4)$$
where $C$ is a constant independent of $m$ or $\delta$.

If $p\ge\frac{1}{2\eta}-2$ for some $0<\eta<\frac{1}{4}$, then $\frac{p+1}{2(p+2)}=\frac{1}{2}-\frac{1}{2(p+2)}\ge\frac{1}{2}-\eta$, so the power exponent in the learning rate (2.4) is at least $\frac{1}{2}-2\eta$. This exponent can be arbitrarily close to $1/2$ when $\eta$ is small enough.

In particular, if we take $\tau=1/2$, Theorem 2.3 provides rates for the output function $f_{\mathbf{z}}^{\mathrm{SVR}}$ of the support vector regression (1.2) to approximate the median function $f_{\rho,1/2}$.

If we take $\beta=\infty$, leading to $\epsilon=0$, Theorem 2.3 provides rates for the output function $f_{\mathbf{z}}^{\mathrm{QR}}$ of the quantile regression algorithm (1.4) to approximate the quantile regression function $f_{\rho,\tau}$.
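As a quick numerical illustration of how Theorem 2.3 is instantiated (the values of $p$, $m$, and $\eta$ below are arbitrary examples), the prescribed parameters and the guaranteed rate exponent can be computed directly:

```python
# arbitrary example values: p from the noise condition, sample size m, small eta
p, m, eta = 3.0, 10_000, 0.05

alpha = (p + 1) / (p + 2)              # Theorem 2.3 takes lambda = m^{-(p+1)/(p+2)}
beta = alpha                           # any beta >= (p+1)/(p+2) is allowed; epsilon = m^{-beta}
lam, eps = m ** (-alpha), m ** (-beta)
rate = (p + 1) / (2 * (p + 2)) - eta   # exponent of m in the bound (2.4)
print(f"lambda = {lam:.2e}  epsilon = {eps:.2e}  rate exponent = {rate:.3f}")
```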

2.2. General Approximation Analysis

To state our approximation analysis in the general case, we need the capacity of the hypothesis space measured by covering numbers.

Definition 2.4. For a subset 𝑆 of 𝐶(𝑋) and 𝑢>0, the covering number 𝒩(𝑆,𝑢) is the minimal integer 𝑙 such that there exist 𝑙 disks with radius 𝑢 covering 𝑆.

The covering numbers of balls $B_R=\{f\in\mathcal{H}_K:\|f\|_K\le R\}$ with radius $R>0$ of the RKHS have been well studied in the learning theory literature [10, 11]. In this paper, we assume for some $s>0$ and $C_s>0$ that
$$\log\mathcal{N}(B_1,u)\le C_s\left(\frac{1}{u}\right)^s,\qquad\forall u>0.\qquad(2.5)$$

Now we can state our main result, which will be proved in Section 5. For $p\in(0,\infty]$ and $q\in(1,\infty)$, we denote
$$\theta=\min\left\{\frac{2}{q},\frac{p}{p+1}\right\}\in(0,1].\qquad(2.6)$$

Theorem 2.5. Assume (2.3) with $0<r\le 1/2$ and (2.5) with $s>0$. Suppose that $\rho$ has a $\tau$-quantile of $p$-average type $q$ for some $p\in(0,\infty]$ and $q\in(1,\infty)$. Take $\lambda=m^{-\alpha}$ with $0<\alpha\le 1$ and $\alpha<(2+s)/(s(2+s-\theta))$. Set $\epsilon=m^{-\beta}$ with $\alpha r/(1-r)\le\beta\le\infty$. Let
$$0<\eta<\frac{(1+s)\bigl[2+2s-s\alpha(2+s-\theta)\bigr]}{s(2+s-\theta)(2+s)}.\qquad(2.7)$$
Then, with $p^*=pq/(p+1)>0$, for any $0<\delta<1$, with confidence $1-\delta$, one has
$$\bigl\|\pi\bigl(f_{\mathbf{z}}^{(\epsilon)}\bigr)-f_{\rho,\tau}\bigr\|_{L^{p^*}_{\rho_X}}\le\widetilde{C}\left(\log\frac{3}{\eta}\right)^2\log\frac{3}{\delta}\,m^{-\vartheta},\qquad(2.8)$$
where $\widetilde{C}$ is a constant independent of $m$ or $\delta$ and the power index $\vartheta$ is given in terms of $r,s,p,q,\alpha$, and $\eta$ by
$$\vartheta=\frac{1}{q}\min\left\{\frac{\alpha r}{1-r},\ \frac{1}{2+s-\theta}-\frac{s\bigl[\alpha(2+s-\theta)-1\bigr]}{(2+s-\theta)(2+s)}-\frac{s\eta}{1+s},\ \frac{1}{2+s-\theta}-\frac{s\alpha(1-2r)}{(1+s)(2-2r)},\ \frac{1}{2+s-\theta}-\frac{s}{1+s}\left(\frac{\alpha}{2}-\frac{1}{2(2-\theta)}\right)\right\}.\qquad(2.9)$$

The index $\vartheta$ can be viewed as a function of the variables $r,s,p,q,\alpha,\eta$. The restriction $0<\alpha<(2+s)/(s(2+s-\theta))$ on $\alpha$ and the restriction (2.7) on $\eta$ ensure that $\vartheta$ is positive, which makes the learning rate in Theorem 2.5 valid.

Assumption (2.5) measures the regularity of the kernel $K$ when $X$ is a subset of $\mathbb{R}^n$. In particular, $s$ can be arbitrarily small when $K$ is smooth enough. In this case, the power index $\vartheta$ in (2.8) can be arbitrarily close to $(1/q)\min\{\alpha r/(1-r),1/(2-\theta)\}$. Again, when $\beta=\infty$ and $\epsilon=0$, algorithm (1.7) corresponds to algorithm (1.4) for quantile regression. In this case, Theorem 2.5 provides learning rates for the quantile regression algorithm (1.4).

Error analysis has been done for the quantile regression algorithm (1.4) in [5, 6]. Under the assumptions that $\rho$ satisfies (2.2) with some $p\in(0,\infty]$ and $q>1$ and that the $\ell^2$-empirical covering number $\mathcal{N}_{\mathbf{z}}(B_1,\eta,2)$ (see [5] for more details) of $B_1$ is bounded as
$$\sup_{\mathbf{z}\in Z^m}\log\mathcal{N}_{\mathbf{z}}(B_1,\eta,2)\le a\left(\frac{1}{\eta}\right)^s\quad\text{with }s\in(0,2),\ a\ge 1,\qquad(2.10)$$
it was proved in [5] that with confidence $1-\delta$,
$$\bigl\|\pi\bigl(f_{\mathbf{z}}^{\mathrm{QR}}\bigr)-f_{\rho,\tau}\bigr\|_{L^{p^*}_{\rho_X}}\le\left\{\mathcal{D}_\tau(\lambda)+\frac{\mathcal{D}_\tau(\lambda)}{\lambda}\,\frac{\log(3/\delta)}{m}+K_{s,C_p}\left(\frac{a}{\lambda^{s/2}m}\right)^{(p+1)/(p+2-s/2)}+K_{s,C_p}\frac{a}{\lambda^{s/2}m}+532\,C_p\left(\frac{\log(3/\delta)}{m}\right)^{(p+1)/(p+2)}+145\,\frac{\log(3/\delta)}{m}\right\}^{1/q},\qquad(2.11)$$
where $C_p$ and $K_{s,C_p}$ are constants independent of $m$ or $\lambda$. Here, $\mathcal{D}_\tau(\lambda)$ is the regularization error defined as
$$\mathcal{D}_\tau(\lambda)=\inf_{f\in\mathcal{H}_K}\left\{\mathcal{E}_\tau(f)-\mathcal{E}_\tau(f_{\rho,\tau})+\lambda\|f\|_K^2\right\},\qquad(2.12)$$
and $\mathcal{E}_\tau(f)$ is the generalization error associated with the pinball loss $\psi_\tau$ defined by
$$\mathcal{E}_\tau(f)=\int_Z\psi_\tau(f(x)-y)\,d\rho=\int_X\int_Y\psi_\tau(f(x)-y)\,d\rho(y\,|\,x)\,d\rho_X(x).\qquad(2.13)$$
Note that $\mathcal{E}_\tau(f)$ is minimized by the quantile regression function $f_{\rho,\tau}$. Thus, when the regularization error $\mathcal{D}_\tau(\lambda)$ decays polynomially as $\mathcal{D}_\tau(\lambda)=O(\lambda^{r/(1-r)})$ (which is ensured by Proposition 3.3 below when (2.3) is satisfied) and $\lambda=m^{-\alpha}$, then $\|\pi(f_{\mathbf{z}}^{\mathrm{QR}})-f_{\rho,\tau}\|_{L^{p^*}_{\rho_X}}=O(\log(3/\delta)\,m^{-\vartheta})$ with
$$\vartheta=\frac{1}{q}\min\left\{\frac{\alpha r}{1-r},\ \frac{p+1}{p+2-(s/2)}\left(1-\frac{\alpha s}{2}\right),\ \frac{p+1}{p+2}\right\}.\qquad(2.14)$$
Since $(p+1)/(p+2)=1/(2-\theta)$ when $\theta=p/(p+1)$, we see that this learning rate is comparable to our result in (2.8).

2.3. Comparison with Least Squares Regression

There has been a large literature in learning theory (described in [12]) on the least squares algorithm
$$f_{\mathbf{z}}^{\mathrm{LS}}=\arg\min_{f\in\mathcal{H}_K}\left\{\frac{1}{m}\sum_{i=1}^m\bigl(f(x_i)-y_i\bigr)^2+\lambda\|f\|_K^2\right\}.\qquad(2.15)$$
It aims at learning the regression function $f_\rho(x)=\int_Yy\,d\rho(y\,|\,x)$. A crucial property for its error analysis is the identity $\mathcal{E}^{\mathrm{ls}}(f)-\mathcal{E}^{\mathrm{ls}}(f_\rho)=\|f-f_\rho\|^2_{L^2_{\rho_X}}$ for the least squares generalization error $\mathcal{E}^{\mathrm{ls}}(f)=\int_Z(y-f(x))^2\,d\rho$. It yields a variance-expectation bound $\mathbf{E}(\xi^2)\le 4\mathbf{E}(\xi)$ for the random variable $\xi=(y-f(x))^2-(y-f_\rho(x))^2$ on $(Z,\rho)$, where $f:X\to Y$ is an arbitrary measurable function. Such a variance-expectation bound, with $\mathbf{E}(\xi)$ possibly replaced by a positive power $(\mathbf{E}(\xi))^\theta$, plays an essential role in analyzing regularization schemes, and the power exponent $\theta$ depends on the strong convexity of the loss. See [13] and references therein. However, the pinball loss in the quantile regression setting is not strongly convex [6], and we would not expect a variance-expectation bound for a general distribution $\rho$. When $\rho$ has a $\tau$-quantile of $p$-average type $q$, the following variance-expectation bound with $\theta$ given by (2.6) can be found in [5, 7] (derived by means of Lemma 3.1 below).

Lemma 2.6. If $\rho$ has a $\tau$-quantile of $p$-average type $q$ for some $p\in(0,\infty]$ and $q\in(1,\infty)$, then
$$\mathbf{E}\Bigl[\bigl(\psi_\tau(f(x)-y)-\psi_\tau(f_{\rho,\tau}(x)-y)\bigr)^2\Bigr]\le C_\theta\bigl(\mathcal{E}_\tau(f)-\mathcal{E}_\tau(f_{\rho,\tau})\bigr)^\theta,\qquad\forall f:X\to Y,\qquad(2.16)$$
where the power index $\theta$ is given by (2.6) and the constant $C_\theta$ is $C_\theta=2^{2-\theta}q^\theta\|\gamma\|^{1-\theta}_{L^p_{\rho_X}}$, with $\gamma$ denoting the function on $X$ taking value $(b_xa_x^{q-1})^{-1}$ at $x\in X$ from Definition 2.2.

Lemma 2.6 overcomes the difficulty of quantile regression caused by lack of strong convexity of the pinball loss. It enables us to derive satisfactory learning rates, as in Theorem 2.5.
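For a computational contrast with the pinball-type schemes, the least squares algorithm (2.15) admits the standard kernel ridge regression closed form: its minimizer is $f=\sum_ic_iK_{x_i}$ with coefficients solving $(K+\lambda mI)c=y$, where $K=(K(x_i,x_j))_{i,j}$. The sketch below solves this linear system on synthetic data (the Gaussian kernel and all values are illustrative assumptions); no such closed form is available for the pinball or $\epsilon$-insensitive pinball losses.

```python
import numpy as np

def krr_fit(x, y, lam, width=0.5):
    """Solve (K + lam*m*I)c = y, the minimizer of (2.15) written as f = sum_i c_i K(., x_i)."""
    m = len(x)
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * width ** 2))
    c = np.linalg.solve(K + lam * m * np.eye(m), y)
    return c, K

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 3.0, 60)
y = np.sin(x) + 0.1 * rng.standard_normal(60)
c, K = krr_fit(x, y, lam=1e-2)
print("training RMSE:", float(np.sqrt(np.mean((K @ c - y) ** 2))))
```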

3. Insensitive Relation and Error Decomposition

An important relation for quantile regression observed in [5] asserts that the error $\pi(f_{\mathbf{z}}^{(\epsilon)})-f_{\rho,\tau}$ measured in a suitable $L^{p^*}_{\rho_X}$ space can be bounded by the excess generalization error $\mathcal{E}_\tau(\pi(f_{\mathbf{z}}^{(\epsilon)}))-\mathcal{E}_\tau(f_{\rho,\tau})$ when the noise condition is satisfied.

Lemma 3.1. Let $p\in(0,\infty]$ and $q\in(1,\infty)$. Denote $p^*=pq/(p+1)>0$. If $\rho$ has a $\tau$-quantile of $p$-average type $q$, then for any measurable function $f$ on $X$, one has
$$\|f-f_{\rho,\tau}\|_{L^{p^*}_{\rho_X}}\le C_{q,\rho}\bigl(\mathcal{E}_\tau(f)-\mathcal{E}_\tau(f_{\rho,\tau})\bigr)^{1/q},\qquad(3.1)$$
where $C_{q,\rho}=2^{1-(1/q)}q^{1/q}\bigl\|\{(b_xa_x^{q-1})^{-1}\}_{x\in X}\bigr\|^{1/q}_{L^p_{\rho_X}}$.

By Lemma 3.1, to estimate $\|\pi(f_{\mathbf{z}}^{(\epsilon)})-f_{\rho,\tau}\|_{L^{p^*}_{\rho_X}}$, we only need to bound the excess generalization error $\mathcal{E}_\tau(\pi(f_{\mathbf{z}}^{(\epsilon)}))-\mathcal{E}_\tau(f_{\rho,\tau})$. This will be done by conducting an error decomposition, which has been developed in the literature for regularization schemes [9, 13-15]. A technical difficulty arises for our problem here because the insensitive parameter $\epsilon$ changes with $m$. It can be overcome [16] by the following insensitive relation:
$$\psi_\tau(u)-\epsilon\le\psi_\tau^\epsilon(u)\le\psi_\tau(u),\qquad\forall u\in\mathbb{R}.\qquad(3.2)$$
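The insensitive relation (3.2) is elementary and can also be checked numerically on a grid; the values of $\tau$ and $\epsilon$ below are arbitrary.

```python
import numpy as np

tau, eps = 0.3, 0.1
u = np.linspace(-2.0, 2.0, 2001)
pinball = np.where(u > 0, (1 - tau) * u, -tau * u)                       # psi_tau, cf. (1.5)
pinball_eps = np.where(u > eps, (1 - tau) * (u - eps),
                       np.where(u < -eps, -tau * (u + eps), 0.0))        # psi_tau^eps, cf. (1.6)
assert np.all(pinball - eps <= pinball_eps + 1e-12)
assert np.all(pinball_eps <= pinball + 1e-12)
print("relation (3.2) verified on the grid")
```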

Now, we can conduct an error decomposition. Define the empirical error $\mathcal{E}_{\mathbf{z},\tau}(f)$ for $f:X\to\mathbb{R}$ as
$$\mathcal{E}_{\mathbf{z},\tau}(f)=\frac{1}{m}\sum_{i=1}^m\psi_\tau\bigl(f(x_i)-y_i\bigr).\qquad(3.3)$$

Lemma 3.2. Let $\lambda>0$, let $f_{\mathbf{z}}^{(\epsilon)}$ be defined by (1.7), and let
$$f_\lambda^{(0)}=\arg\min_{f\in\mathcal{H}_K}\left\{\mathcal{E}_\tau(f)-\mathcal{E}_\tau(f_{\rho,\tau})+\lambda\|f\|_K^2\right\}.\qquad(3.4)$$
Then,
$$\mathcal{E}_\tau\bigl(\pi(f_{\mathbf{z}}^{(\epsilon)})\bigr)-\mathcal{E}_\tau(f_{\rho,\tau})+\lambda\bigl\|f_{\mathbf{z}}^{(\epsilon)}\bigr\|_K^2\le\mathcal{S}_1+\mathcal{S}_2+\epsilon+\mathcal{D}_\tau(\lambda),\qquad(3.5)$$
where
$$\mathcal{S}_1=\Bigl[\mathcal{E}_\tau\bigl(\pi(f_{\mathbf{z}}^{(\epsilon)})\bigr)-\mathcal{E}_\tau(f_{\rho,\tau})\Bigr]-\Bigl[\mathcal{E}_{\mathbf{z},\tau}\bigl(\pi(f_{\mathbf{z}}^{(\epsilon)})\bigr)-\mathcal{E}_{\mathbf{z},\tau}(f_{\rho,\tau})\Bigr],\qquad \mathcal{S}_2=\Bigl[\mathcal{E}_{\mathbf{z},\tau}\bigl(f_\lambda^{(0)}\bigr)-\mathcal{E}_{\mathbf{z},\tau}(f_{\rho,\tau})\Bigr]-\Bigl[\mathcal{E}_\tau\bigl(f_\lambda^{(0)}\bigr)-\mathcal{E}_\tau(f_{\rho,\tau})\Bigr].\qquad(3.6)$$

Proof. The regularized excess generalization error $\mathcal{E}_\tau(\pi(f_{\mathbf{z}}^{(\epsilon)}))-\mathcal{E}_\tau(f_{\rho,\tau})+\lambda\|f_{\mathbf{z}}^{(\epsilon)}\|_K^2$ can be expressed as
$$\Bigl[\mathcal{E}_\tau\bigl(\pi(f_{\mathbf{z}}^{(\epsilon)})\bigr)-\mathcal{E}_{\mathbf{z},\tau}\bigl(\pi(f_{\mathbf{z}}^{(\epsilon)})\bigr)\Bigr]+\Bigl[\mathcal{E}_{\mathbf{z},\tau}\bigl(\pi(f_{\mathbf{z}}^{(\epsilon)})\bigr)+\lambda\bigl\|f_{\mathbf{z}}^{(\epsilon)}\bigr\|_K^2-\mathcal{E}_{\mathbf{z},\tau}\bigl(f_\lambda^{(0)}\bigr)-\lambda\bigl\|f_\lambda^{(0)}\bigr\|_K^2\Bigr]+\Bigl[\mathcal{E}_{\mathbf{z},\tau}\bigl(f_\lambda^{(0)}\bigr)-\mathcal{E}_\tau\bigl(f_\lambda^{(0)}\bigr)\Bigr]+\Bigl[\mathcal{E}_\tau\bigl(f_\lambda^{(0)}\bigr)-\mathcal{E}_\tau(f_{\rho,\tau})+\lambda\bigl\|f_\lambda^{(0)}\bigr\|_K^2\Bigr].\qquad(3.7)$$
The fact $|y|\le 1$ implies that $\mathcal{E}_{\mathbf{z},\tau}(\pi(f_{\mathbf{z}}^{(\epsilon)}))\le\mathcal{E}_{\mathbf{z},\tau}(f_{\mathbf{z}}^{(\epsilon)})$. The insensitive relation (3.2) and the definition of $f_{\mathbf{z}}^{(\epsilon)}$ tell us that
$$\mathcal{E}_{\mathbf{z},\tau}\bigl(f_{\mathbf{z}}^{(\epsilon)}\bigr)+\lambda\bigl\|f_{\mathbf{z}}^{(\epsilon)}\bigr\|_K^2\le\frac{1}{m}\sum_{i=1}^m\psi_\tau^\epsilon\bigl(f_{\mathbf{z}}^{(\epsilon)}(x_i)-y_i\bigr)+\lambda\bigl\|f_{\mathbf{z}}^{(\epsilon)}\bigr\|_K^2+\epsilon\le\frac{1}{m}\sum_{i=1}^m\psi_\tau^\epsilon\bigl(f_\lambda^{(0)}(x_i)-y_i\bigr)+\lambda\bigl\|f_\lambda^{(0)}\bigr\|_K^2+\epsilon\le\mathcal{E}_{\mathbf{z},\tau}\bigl(f_\lambda^{(0)}\bigr)+\lambda\bigl\|f_\lambda^{(0)}\bigr\|_K^2+\epsilon.\qquad(3.8)$$
Then, by subtracting and adding $\mathcal{E}_\tau(f_{\rho,\tau})$ and $\mathcal{E}_{\mathbf{z},\tau}(f_{\rho,\tau})$ and noting $\mathcal{D}_\tau(\lambda)=\mathcal{E}_\tau(f_\lambda^{(0)})-\mathcal{E}_\tau(f_{\rho,\tau})+\lambda\|f_\lambda^{(0)}\|_K^2$, we see that the desired inequality in Lemma 3.2 holds true.

In the error decomposition (3.5), the first two terms are called sample error. The last term is the regularization error defined in (2.12). It can be estimated as follows.

Proposition 3.3. Assume (2.3). Define $f_\lambda^{(0)}$ by (3.4). Then, one has
$$\mathcal{D}_\tau(\lambda)\le C_0\lambda^{r/(1-r)},\qquad\bigl\|f_\lambda^{(0)}\bigr\|_K\le\sqrt{C_0}\,\lambda^{(2r-1)/(2-2r)},\qquad(3.9)$$
where $C_0$ is the constant $C_0=\|g_{\rho,\tau}\|_{L^2_{\rho_X}}+\|g_{\rho,\tau}\|^2_{L^2_{\rho_X}}$.

Proof. Let $\mu=\lambda^{1/(1-r)}>0$ and
$$f_\mu=\bigl(L_K+\mu I\bigr)^{-1}L_Kf_{\rho,\tau}.\qquad(3.10)$$
It can be found in [17, 18] that when (2.3) holds, we have
$$\bigl\|f_\mu-f_{\rho,\tau}\bigr\|^2_{L^2_{\rho_X}}+\mu\bigl\|f_\mu\bigr\|^2_K\le\mu^{2r}\bigl\|g_{\rho,\tau}\bigr\|^2_{L^2_{\rho_X}}.\qquad(3.11)$$
Hence, $\|f_\mu-f_{\rho,\tau}\|_{L^2_{\rho_X}}\le\mu^r\|g_{\rho,\tau}\|_{L^2_{\rho_X}}$ and $\|f_\mu\|^2_K\le\mu^{2r-1}\|g_{\rho,\tau}\|^2_{L^2_{\rho_X}}$. Since $\psi_\tau$ is Lipschitz, we know by taking $f=f_\mu$ in (2.12) that
$$\mathcal{D}_\tau(\lambda)\le\int_Z\bigl[\psi_\tau\bigl(f_\mu(x)-y\bigr)-\psi_\tau\bigl(f_{\rho,\tau}(x)-y\bigr)\bigr]d\rho+\lambda\bigl\|f_\mu\bigr\|^2_K\le\bigl\|f_\mu-f_{\rho,\tau}\bigr\|_{L^1_{\rho_X}}+\lambda\bigl\|f_\mu\bigr\|^2_K\le\bigl\|f_\mu-f_{\rho,\tau}\bigr\|_{L^2_{\rho_X}}+\lambda\bigl\|f_\mu\bigr\|^2_K\le\mu^r\bigl\|g_{\rho,\tau}\bigr\|_{L^2_{\rho_X}}+\lambda\mu^{2r-1}\bigl\|g_{\rho,\tau}\bigr\|^2_{L^2_{\rho_X}}=\Bigl(\bigl\|g_{\rho,\tau}\bigr\|_{L^2_{\rho_X}}+\bigl\|g_{\rho,\tau}\bigr\|^2_{L^2_{\rho_X}}\Bigr)\lambda^{r/(1-r)}.\qquad(3.12)$$
This verifies the desired bound for $\mathcal{D}_\tau(\lambda)$. By taking $f=0$ in (2.12), we have
$$\lambda\bigl\|f_\lambda^{(0)}\bigr\|^2_K\le\mathcal{D}_\tau(\lambda)\le C_0\lambda^{r/(1-r)}.\qquad(3.13)$$
Then the bound for $\|f_\lambda^{(0)}\|_K$ is proved.

4. Estimating Sample Error

This section is devoted to estimating the sample error. This is conducted by using the variance-expectation bound in Lemma 2.6.

Denote $\kappa=\sup_{x\in X}\sqrt{K(x,x)}$. For $R\ge 1$, denote
$$\mathcal{W}(R)=\Bigl\{\mathbf{z}\in Z^m:\bigl\|f_{\mathbf{z}}^{(\epsilon)}\bigr\|_K\le R\Bigr\}.\qquad(4.1)$$

Proposition 4.1. Assume (2.3) and (2.5). Let $R\ge 1$ and $0<\delta<1$. If $\rho$ has a $\tau$-quantile of $p$-average type $q$ for some $p\in(0,\infty]$ and $q\in(1,\infty)$, then there exists a subset $V_R$ of $Z^m$ with measure at most $\delta$ such that for any $\mathbf{z}\in\mathcal{W}(R)\setminus V_R$,
$$\mathcal{E}_\tau\bigl(\pi(f_{\mathbf{z}}^{(\epsilon)})\bigr)-\mathcal{E}_\tau(f_{\rho,\tau})+\lambda\bigl\|f_{\mathbf{z}}^{(\epsilon)}\bigr\|_K^2\le 2\epsilon+C_1\log\frac{3}{\delta}\,\lambda^{r/(1-r)}\max\left\{1,\frac{\lambda^{-1/(2-2r)}}{m}\right\}+C_2\left(\frac{\log(3/\delta)}{m}\right)^{1/(2-\theta)}+C_3m^{-1/(2+s-\theta)}R^{s/(1+s)},\qquad(4.2)$$
where $\theta$ is given by (2.6) and $C_1,C_2,C_3$ are constants given by
$$C_1=4C_0+2\kappa\sqrt{C_0},\qquad C_2=245+324\,C_\theta^{1/(2-\theta)},\qquad C_3=40\bigl(6C_s\bigr)^{1/(1+s)}+40\bigl(8C_\theta C_s\bigr)^{1/(2+s-\theta)}.\qquad(4.3)$$

Proof. Let us first estimate the second part $\mathcal{S}_2$ of the sample error. It can be decomposed into two parts $\mathcal{S}_2=\mathcal{S}_{2,1}+\mathcal{S}_{2,2}$, where
$$\mathcal{S}_{2,1}=\Bigl[\mathcal{E}_{\mathbf{z},\tau}\bigl(f_\lambda^{(0)}\bigr)-\mathcal{E}_{\mathbf{z},\tau}\bigl(\pi(f_\lambda^{(0)})\bigr)\Bigr]-\Bigl[\mathcal{E}_\tau\bigl(f_\lambda^{(0)}\bigr)-\mathcal{E}_\tau\bigl(\pi(f_\lambda^{(0)})\bigr)\Bigr],\qquad \mathcal{S}_{2,2}=\Bigl[\mathcal{E}_{\mathbf{z},\tau}\bigl(\pi(f_\lambda^{(0)})\bigr)-\mathcal{E}_{\mathbf{z},\tau}(f_{\rho,\tau})\Bigr]-\Bigl[\mathcal{E}_\tau\bigl(\pi(f_\lambda^{(0)})\bigr)-\mathcal{E}_\tau(f_{\rho,\tau})\Bigr].\qquad(4.4)$$
For bounding $\mathcal{S}_{2,1}$, we take the random variable $\xi(z)=\psi_\tau(f_\lambda^{(0)}(x)-y)-\psi_\tau(\pi(f_\lambda^{(0)})(x)-y)$ on $(Z,\rho)$. It satisfies $0\le\xi\le|\pi(f_\lambda^{(0)})(x)-f_\lambda^{(0)}(x)|\le 1+\|f_\lambda^{(0)}\|_\infty$. Hence, $|\xi-\mathbf{E}(\xi)|\le 1+\|f_\lambda^{(0)}\|_\infty$ and $\mathbf{E}(\xi-\mathbf{E}(\xi))^2\le\mathbf{E}(\xi^2)\le(1+\|f_\lambda^{(0)}\|_\infty)\mathbf{E}(\xi)$. Applying the one-sided Bernstein inequality [12], we know that there exists a subset $Z_{1,\delta}$ of $Z^m$ with measure at least $1-\delta/3$ such that
$$\mathcal{S}_{2,1}\le\frac{7\bigl(1+\|f_\lambda^{(0)}\|_\infty\bigr)\log(3/\delta)}{6m}+\Bigl[\mathcal{E}_\tau\bigl(f_\lambda^{(0)}\bigr)-\mathcal{E}_\tau\bigl(\pi(f_\lambda^{(0)})\bigr)\Bigr],\qquad\forall\mathbf{z}\in Z_{1,\delta}.\qquad(4.5)$$
For $\mathcal{S}_{2,2}$, we apply the one-sided Bernstein inequality again to the random variable $\xi(z)=\psi_\tau(\pi(f_\lambda^{(0)})(x)-y)-\psi_\tau(f_{\rho,\tau}(x)-y)$, bound the variance by Lemma 2.6 with $f=\pi(f_\lambda^{(0)})$, and find that there exists another subset $Z_{2,\delta}$ of $Z^m$ with measure at least $1-\delta/3$ such that
$$\mathcal{S}_{2,2}\le\frac{4\log(3/\delta)}{3m}+\left(\frac{\theta}{2}\right)^{\theta/(2-\theta)}\left(1-\frac{\theta}{2}\right)\left(\frac{2C_\theta\log(3/\delta)}{m}\right)^{1/(2-\theta)}+\Bigl[\mathcal{E}_\tau\bigl(\pi(f_\lambda^{(0)})\bigr)-\mathcal{E}_\tau(f_{\rho,\tau})\Bigr],\qquad\forall\mathbf{z}\in Z_{2,\delta}.\qquad(4.6)$$
Next, we estimate the first part $\mathcal{S}_1$ of the sample error. Consider the function set
$$\mathcal{G}=\bigl\{\psi_\tau(\pi(f)(x)-y)-\psi_\tau(f_{\rho,\tau}(x)-y):\|f\|_K\le R\bigr\}.\qquad(4.7)$$
A function $g(z)=\psi_\tau(\pi(f)(x)-y)-\psi_\tau(f_{\rho,\tau}(x)-y)$ from this set satisfies $\mathbf{E}(g)\ge 0$, $|g(z)|\le 2$, and $\mathbf{E}(g^2)\le C_\theta(\mathbf{E}(g))^\theta$ by (2.16). Also, the Lipschitz property of the pinball loss yields $\mathcal{N}(\mathcal{G},u)\le\mathcal{N}(B_1,u/R)$. Then, we apply a standard covering number argument with a ratio inequality [12, 13, 19, 20] to $\mathcal{G}$ and find from the covering number condition (2.5) that
$$\mathrm{Prob}_{\mathbf{z}\in Z^m}\left\{\sup_{\|f\|_K\le R}\frac{\bigl[\mathcal{E}_\tau(\pi(f))-\mathcal{E}_\tau(f_{\rho,\tau})\bigr]-\bigl[\mathcal{E}_{\mathbf{z},\tau}(\pi(f))-\mathcal{E}_{\mathbf{z},\tau}(f_{\rho,\tau})\bigr]}{\sqrt{\bigl(\mathcal{E}_\tau(\pi(f))-\mathcal{E}_\tau(f_{\rho,\tau})\bigr)^\theta+u^\theta}}>4u^{1-(\theta/2)}\right\}\le\mathcal{N}\Bigl(B_1,\frac{u}{R}\Bigr)\exp\left\{-\frac{mu^{2-\theta}}{2C_\theta+(4/3)u^{1-\theta}}\right\}\le\exp\left\{C_s\Bigl(\frac{R}{u}\Bigr)^s-\frac{mu^{2-\theta}}{2C_\theta+(4/3)u^{1-\theta}}\right\}.\qquad(4.8)$$
Setting the confidence to be $1-\delta/3$, we take $u(R,m,\delta/3)$ to be the positive solution to the equation
$$C_s\Bigl(\frac{R}{u}\Bigr)^s-\frac{mu^{2-\theta}}{2C_\theta+(4/3)u^{1-\theta}}=\log\frac{\delta}{3}.\qquad(4.9)$$
Then, there exists a third subset $Z_{3,\delta}$ of $Z^m$ with measure at least $1-\delta/3$ such that
$$\sup_{\|f\|_K\le R}\frac{\bigl[\mathcal{E}_\tau(\pi(f))-\mathcal{E}_\tau(f_{\rho,\tau})\bigr]-\bigl[\mathcal{E}_{\mathbf{z},\tau}(\pi(f))-\mathcal{E}_{\mathbf{z},\tau}(f_{\rho,\tau})\bigr]}{\sqrt{\bigl(\mathcal{E}_\tau(\pi(f))-\mathcal{E}_\tau(f_{\rho,\tau})\bigr)^\theta+\bigl(u(R,m,\delta/3)\bigr)^\theta}}\le 4\bigl(u(R,m,\delta/3)\bigr)^{1-(\theta/2)},\qquad\forall\mathbf{z}\in Z_{3,\delta}.\qquad(4.10)$$
Thus, for $\mathbf{z}\in\mathcal{W}(R)\cap Z_{3,\delta}$, we have
$$\mathcal{S}_1=\bigl[\mathcal{E}_\tau\bigl(\pi(f_{\mathbf{z}}^{(\epsilon)})\bigr)-\mathcal{E}_\tau(f_{\rho,\tau})\bigr]-\bigl[\mathcal{E}_{\mathbf{z},\tau}\bigl(\pi(f_{\mathbf{z}}^{(\epsilon)})\bigr)-\mathcal{E}_{\mathbf{z},\tau}(f_{\rho,\tau})\bigr]\le 4\Bigl(u\Bigl(R,m,\frac{\delta}{3}\Bigr)\Bigr)^{1-(\theta/2)}\sqrt{\Bigl(\mathcal{E}_\tau\bigl(\pi(f_{\mathbf{z}}^{(\epsilon)})\bigr)-\mathcal{E}_\tau(f_{\rho,\tau})\Bigr)^\theta+\Bigl(u\Bigl(R,m,\frac{\delta}{3}\Bigr)\Bigr)^\theta}\le\Bigl(1-\frac{\theta}{2}\Bigr)4^{2/(2-\theta)}u\Bigl(R,m,\frac{\delta}{3}\Bigr)+\frac{\theta}{2}\Bigl[\mathcal{E}_\tau\bigl(\pi(f_{\mathbf{z}}^{(\epsilon)})\bigr)-\mathcal{E}_\tau(f_{\rho,\tau})\Bigr]+4u\Bigl(R,m,\frac{\delta}{3}\Bigr).\qquad(4.11)$$
Here, we have used the elementary inequality $\sqrt{a+b}\le\sqrt{a}+\sqrt{b}$ and Young's inequality. Putting this bound and (4.5), (4.6) into (3.5), we know that for $\mathbf{z}\in\mathcal{W}(R)\cap Z_{3,\delta}\cap Z_{1,\delta}\cap Z_{2,\delta}$, there holds
$$\mathcal{E}_\tau\bigl(\pi(f_{\mathbf{z}}^{(\epsilon)})\bigr)-\mathcal{E}_\tau(f_{\rho,\tau})+\lambda\bigl\|f_{\mathbf{z}}^{(\epsilon)}\bigr\|_K^2\le\frac{1}{2}\Bigl[\mathcal{E}_\tau\bigl(\pi(f_{\mathbf{z}}^{(\epsilon)})\bigr)-\mathcal{E}_\tau(f_{\rho,\tau})\Bigr]+\epsilon+2\mathcal{D}_\tau(\lambda)+20u\Bigl(R,m,\frac{\delta}{3}\Bigr)+\frac{7\bigl(1+\|f_\lambda^{(0)}\|_\infty\bigr)\log(3/\delta)}{6m}+\frac{4\log(3/\delta)}{3m}+\left(\frac{2C_\theta\log(3/\delta)}{m}\right)^{1/(2-\theta)},\qquad(4.12)$$
which together with Proposition 3.3 implies
$$\mathcal{E}_\tau\bigl(\pi(f_{\mathbf{z}}^{(\epsilon)})\bigr)-\mathcal{E}_\tau(f_{\rho,\tau})+\lambda\bigl\|f_{\mathbf{z}}^{(\epsilon)}\bigr\|_K^2\le 2\epsilon+4C_0\lambda^{r/(1-r)}+\Bigl(5+2\bigl(2C_\theta\bigr)^{1/(2-\theta)}\Bigr)\left(\frac{\log(3/\delta)}{m}\right)^{1/(2-\theta)}+40u\Bigl(R,m,\frac{\delta}{3}\Bigr)+2\kappa\sqrt{C_0}\,\log\frac{3}{\delta}\,\frac{\lambda^{(2r-1)/(2-2r)}}{m}.\qquad(4.13)$$
Here, we have used the reproducing property in $\mathcal{H}_K$, which yields [12]
$$\|f\|_\infty\le\kappa\|f\|_K,\qquad\forall f\in\mathcal{H}_K.\qquad(4.14)$$
Equation (4.9) can be expressed as
$$u^{2+s-\theta}-\frac{4\log(3/\delta)}{3m}u^{1+s-\theta}-\frac{2C_\theta\log(3/\delta)}{m}u^s-\frac{4C_sR^s}{3m}u^{1-\theta}-\frac{2C_\theta C_sR^s}{m}=0.\qquad(4.15)$$
By Lemma 7.2 in [12], the positive solution $u(R,m,\delta/3)$ to this equation can be bounded as
$$u\Bigl(R,m,\frac{\delta}{3}\Bigr)\le\max\left\{\frac{6\log(3/\delta)}{m},\Bigl(\frac{8C_\theta\log(3/\delta)}{m}\Bigr)^{1/(2-\theta)},\Bigl(\frac{6C_sR^s}{m}\Bigr)^{1/(1+s)},\Bigl(\frac{8C_\theta C_sR^s}{m}\Bigr)^{1/(2+s-\theta)}\right\}\le\Bigl(6+\bigl(8C_\theta\bigr)^{1/(2-\theta)}\Bigr)\left(\frac{\log(3/\delta)}{m}\right)^{1/(2-\theta)}+\Bigl(\bigl(6C_s\bigr)^{1/(1+s)}+\bigl(8C_\theta C_s\bigr)^{1/(2+s-\theta)}\Bigr)m^{-1/(2+s-\theta)}R^{s/(1+s)}.\qquad(4.16)$$
Thus, for $\mathbf{z}\in\mathcal{W}(R)\cap Z_{3,\delta}\cap Z_{1,\delta}\cap Z_{2,\delta}$, the desired bound (4.2) holds true. Since the measure of the set $Z_{3,\delta}\cap Z_{1,\delta}\cap Z_{2,\delta}$ is at least $1-\delta$, our conclusion is proved.

5. Deriving Convergence Rates by Iteration

To apply Proposition 4.1 for the error analysis, we need some $R\ge 1$ with $\mathbf{z}\in\mathcal{W}(R)$. One may choose $R=\lambda^{-1/2}$ according to
$$\bigl\|f_{\mathbf{z}}^{(\epsilon)}\bigr\|_K\le\lambda^{-1/2},\qquad\forall\mathbf{z}\in Z^m,\qquad(5.1)$$
which is seen by taking $f=0$ in (1.7). This choice is too rough. Recall from Proposition 3.3 that $\|f_\lambda^{(0)}\|_K\le\sqrt{C_0}\,\lambda^{(2r-1)/(2-2r)}$, which is a bound for the noise-free limit $f_\lambda^{(0)}$ of $f_{\mathbf{z}}^{(\epsilon)}$. It is much better than $\lambda^{-1/2}$. This motivates us to seek similarly tight bounds for $\|f_{\mathbf{z}}^{(\epsilon)}\|_K$. This target will be achieved in this section by applying Proposition 4.1 iteratively. The iteration technique has been used in [13, 21] to improve learning rates.
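The effect of the iteration can be previewed numerically: because the exponent $s/(2+2s)$ is smaller than one, the recursion $R^{(j)}=a_m(R^{(j-1)})^{s/(2+2s)}+b_m$ used below contracts rapidly from the rough start $R^{(0)}=\lambda^{-1/2}$ to a value of the order of $b_m$. The constants in this sketch are purely illustrative; in the proof they depend on $m$, $\delta$, $r$, $s$, and $\theta$.

```python
# illustrative constants; in the proof a_m, b_m and lambda depend on m, delta, r, s, theta
a_m, b_m, s, lam = 2.0, 5.0, 1.0, 1e-4

R = lam ** (-0.5)                  # rough initial bound R^(0) = lambda^{-1/2} = 100
for j in range(1, 7):
    R = a_m * R ** (s / (2 + 2 * s)) + b_m
    print(f"R^({j}) = {R:.2f}")    # drops from 100 to about 9 within a few iterations
```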

Lemma 5.1. Assume (2.3) with $0<r\le 1/2$ and (2.5) with $s>0$. Take $\lambda=m^{-\alpha}$ with $0<\alpha\le 1$ and $\epsilon=m^{-\beta}$ with $0<\beta\le\infty$. Let $0<\eta<1$. If $\rho$ has a $\tau$-quantile of $p$-average type $q$ for some $p\in(0,\infty]$ and $q\in(1,\infty)$, then for any $0<\delta<1$, with confidence $1-\delta$, there holds
$$\bigl\|f_{\mathbf{z}}^{(\epsilon)}\bigr\|_K\le 4C_3\bigl(1+2+C_1+C_2\bigr)\left(\log\frac{3}{\eta}\right)^2\log\frac{3}{\delta}\,m^{\theta_\eta},\qquad(5.2)$$
where $\theta_\eta$ is given by
$$\theta_\eta=\max\left\{\frac{\alpha-\beta}{2},\ \frac{\alpha(1-2r)}{2-2r},\ \frac{\alpha}{2}-\frac{1}{2(2-\theta)},\ \frac{\bigl[\alpha(2+s-\theta)-1\bigr](1+s)}{(2+s-\theta)(2+s)}+\eta,\ 0\right\}.\qquad(5.3)$$

Proof. Putting $\lambda=m^{-\alpha}$ with $0<\alpha\le 1$ and $\epsilon=m^{-\beta}$ with $0<\beta\le\infty$ into Proposition 4.1, we know that for any $R\ge 1$ there exists a subset $V_R$ of $Z^m$ with measure at most $\delta$ such that
$$\bigl\|f_{\mathbf{z}}^{(\epsilon)}\bigr\|_K\le a_mR^{s/(2+2s)}+b_m,\qquad\forall\mathbf{z}\in\mathcal{W}(R)\setminus V_R,\qquad(5.4)$$
where, with $\zeta=\max\{(\alpha-\beta)/2,\ \alpha(1-2r)/(2-2r),\ \alpha/2-1/(2(2-\theta)),\ 0\}$, the constants are given by
$$a_m=\sqrt{C_3}\,m^{(\alpha/2)-1/(2(2+s-\theta))},\qquad b_m=\Bigl(2+C_1\log\frac{3}{\delta}+C_2\log\frac{3}{\delta}\Bigr)m^{\zeta}.\qquad(5.5)$$
It follows that
$$\mathcal{W}(R)\subset\mathcal{W}\bigl(a_mR^{s/(2+2s)}+b_m\bigr)\cup V_R.\qquad(5.6)$$
Let us apply (5.6) iteratively to a sequence $\{R^{(j)}\}_{j=0}^J$ defined by $R^{(0)}=\lambda^{-1/2}$ and $R^{(j)}=a_m(R^{(j-1)})^{s/(2+2s)}+b_m$, where $J$ will be determined later. Then, $\mathcal{W}(R^{(j-1)})\subset\mathcal{W}(R^{(j)})\cup V_{R^{(j-1)}}$. By (5.1), $\mathcal{W}(R^{(0)})=Z^m$. So we have
$$Z^m=\mathcal{W}\bigl(R^{(0)}\bigr)\subset\mathcal{W}\bigl(R^{(1)}\bigr)\cup V_{R^{(0)}}\subset\cdots\subset\mathcal{W}\bigl(R^{(J)}\bigr)\cup\bigcup_{j=0}^{J-1}V_{R^{(j)}}.\qquad(5.7)$$
As the measure of each $V_{R^{(j)}}$ is at most $\delta$, we know that the measure of $\bigcup_{j=0}^{J-1}V_{R^{(j)}}$ is at most $J\delta$. Hence, $\mathcal{W}(R^{(J)})$ has measure at least $1-J\delta$.
Denote $\Delta=s/(2+2s)<1/2$. The definition of the sequence $\{R^{(j)}\}_{j=0}^J$ tells us that
$$R^{(J)}\le a_m^{1+\Delta+\Delta^2+\cdots+\Delta^{J-1}}\bigl(R^{(0)}\bigr)^{\Delta^J}+\sum_{j=1}^{J-1}a_m^{1+\Delta+\Delta^2+\cdots+\Delta^{j-1}}b_m^{\Delta^j}+b_m.\qquad(5.8)$$
Let us bound the two terms on the right-hand side.
The first term equals
$$C_3^{(1-\Delta^J)/(2(1-\Delta))}\,m^{\bigl(\frac{\alpha(2+s-\theta)-1}{4+2s-2\theta}\bigr)\bigl(\frac{1-\Delta^J}{1-\Delta}\bigr)}\,m^{(\alpha/2)\Delta^J},\qquad(5.9)$$
which is bounded by
$$C_3\,m^{\frac{\alpha(2+s-\theta)-1}{(4+2s-2\theta)(1-\Delta)}}\,m^{\bigl(\frac{\alpha}{2}-\frac{\alpha(2+s-\theta)-1}{(4+2s-2\theta)(1-\Delta)}\bigr)\Delta^J}\le C_3\,m^{\frac{[\alpha(2+s-\theta)-1](1+s)}{(2+s-\theta)(2+s)}}\,m^{\frac{1}{2+s-\theta}2^{-J}}.\qquad(5.10)$$
Take $J$ to be the smallest integer greater than or equal to $\log(1/\eta)/\log 2$. The above expression can then be bounded by $C_3\,m^{\frac{[\alpha(2+s-\theta)-1](1+s)}{(2+s-\theta)(2+s)}+\eta}$.
The second term equals
$$\sum_{j=1}^{J-1}a_m^{1+\Delta+\Delta^2+\cdots+\Delta^{j-1}}b_m^{\Delta^j}+b_m\le\sum_{j=1}^{J-1}C_3\,m^{\bigl(\frac{\alpha}{2}-\frac{1}{2(2+s-\theta)}\bigr)\bigl(\frac{1-\Delta^j}{1-\Delta}\bigr)}\,b_1^{\Delta^j}m^{\zeta\Delta^j}+b_1m^{\zeta},\qquad(5.11)$$
where $b_1=2+C_1\log(3/\delta)+C_2\log(3/\delta)$. It is bounded by
$$m^{\frac{[\alpha(2+s-\theta)-1](1+s)}{(2+s-\theta)(2+s)}}\,C_3b_1\sum_{j=0}^{J-1}m^{\bigl(\zeta-\frac{[\alpha(2+s-\theta)-1](1+s)}{(2+s-\theta)(2+s)}\bigr)\bigl(\frac{s}{2+2s}\bigr)^j}.\qquad(5.12)$$
When $\zeta\le\frac{[\alpha(2+s-\theta)-1](1+s)}{(2+s-\theta)(2+s)}$, the above expression is bounded by $C_3b_1J\,m^{\frac{[\alpha(2+s-\theta)-1](1+s)}{(2+s-\theta)(2+s)}}$. When $\zeta\ge\frac{[\alpha(2+s-\theta)-1](1+s)}{(2+s-\theta)(2+s)}$, it is bounded by $C_3b_1J\,m^{\zeta}$.
Based on the above discussion, we obtain
$$R^{(J)}\le\bigl(C_3+C_3b_1J\bigr)m^{\theta_\eta},\qquad(5.13)$$
where $\theta_\eta=\max\Bigl\{\frac{[\alpha(2+s-\theta)-1](1+s)}{(2+s-\theta)(2+s)}+\eta,\ \zeta\Bigr\}$. So with confidence $1-J\delta$, there holds
$$\bigl\|f_{\mathbf{z}}^{(\epsilon)}\bigr\|_K\le R^{(J)}\le C_3\Bigl(1+\bigl(2+C_1+C_2\bigr)J\log\frac{3}{\delta}\Bigr)m^{\theta_\eta}.\qquad(5.14)$$
Then, our conclusion follows by replacing $\delta$ by $\delta/J$ and noting $J\le 2\log(3/\eta)$.

Now, we can prove our main result, Theorem 2.5.

Proof of Theorem 2.5. Take $R$ to be the right-hand side of (5.2). By Lemma 5.1, there exists a subset $V_R'$ of $Z^m$ with measure at most $\delta$ such that $Z^m\setminus V_R'\subset\mathcal{W}(R)$. Applying Proposition 4.1 to this $R$, we know that there exists another subset $V_R$ of $Z^m$ with measure at most $\delta$ such that for any $\mathbf{z}\in\mathcal{W}(R)\setminus V_R$,
$$\mathcal{E}_\tau\bigl(\pi(f_{\mathbf{z}}^{(\epsilon)})\bigr)-\mathcal{E}_\tau(f_{\rho,\tau})\le 2m^{-\beta}+C_1\log\frac{3}{\delta}\,m^{-\alpha r/(1-r)}+C_2\log\frac{3}{\delta}\,m^{-1/(2-\theta)}+C_4\left(\log\frac{3}{\eta}\right)^2\log\frac{3}{\delta}\,m^{\frac{s}{1+s}\theta_\eta-\frac{1}{2+s-\theta}},\qquad(5.15)$$
where
$$C_4=C_3\bigl(4C_3\bigr)^{s/(1+s)}\bigl(1+2+C_1+C_2\bigr).\qquad(5.16)$$
Since the set $V_R\cup V_R'$ has measure at most $2\delta$, after scaling $2\delta$ to $\delta$ and setting the constant $\widetilde{C}$ by
$$\widetilde{C}=2C_{q,\rho}\bigl(2+C_1+C_2+C_4\bigr),\qquad(5.17)$$
we see that the above estimate together with Lemma 3.1 gives the error bound
$$\bigl\|\pi\bigl(f_{\mathbf{z}}^{(\epsilon)}\bigr)-f_{\rho,\tau}\bigr\|_{L^{p^*}_{\rho_X}}\le\widetilde{C}\left(\log\frac{3}{\eta}\right)^2\log\frac{3}{\delta}\,m^{-\vartheta}\qquad(5.18)$$
with confidence $1-\delta$ and the power index $\vartheta$ given by
$$\vartheta=\frac{1}{q}\min\left\{\beta,\ \frac{\alpha r}{1-r},\ \frac{1}{2+s-\theta}-\frac{s}{1+s}\theta_\eta\right\},\qquad(5.19)$$
provided that
$$\theta_\eta<\frac{1+s}{s(2+s-\theta)}.\qquad(5.20)$$
Since $\beta\ge\alpha r/(1-r)$, we know that $(\alpha-\beta)/2\le\alpha(1-2r)/(2-2r)$. By the restriction $0<\alpha<(2+s)/(s(2+s-\theta))$ on $\alpha$, we find $\alpha(1-2r)/(2-2r)<(1+s)/(s(2+s-\theta))$ and $(\alpha/2)-1/(2(2-\theta))<(1+s)/(s(2+s-\theta))$. Moreover, the restriction (2.7) on $\eta$ tells us that $\frac{[\alpha(2+s-\theta)-1](1+s)}{(2+s-\theta)(2+s)}+\eta<\frac{1+s}{s(2+s-\theta)}$. Therefore, condition (5.20) is satisfied. The restriction $\beta\ge\alpha r/(1-r)$ and the above expression for $\vartheta$ tell us that the power index for the error bound can be exactly expressed by formula (2.9). The proof of Theorem 2.5 is complete.

Finally, we prove Theorem 2.3.

Proof of Theorem 2.3. Since $f_{\rho,\tau}\in\mathcal{H}_K$, we know that (2.3) holds with $r=1/2$. The noise condition on $\rho$ is satisfied with $q=2$ and $p\in(0,\infty]$. Then, $\theta=p/(p+1)\in(0,1]$. Since $X\subset\mathbb{R}^n$ and $K\in C^\infty(X\times X)$, we know from [11] that (2.5) holds true for any $s>0$. With $0<\eta<(p+1)/(2(p+2))$, let us choose $s$ to be a positive number satisfying the following four inequalities:
$$\frac{p+1}{p+2}<\frac{2+s}{s(2+s-\theta)},\qquad \frac{1}{3}<\frac{(1+s)\bigl[2+2s-s(2+s-\theta)(p+1)/(p+2)\bigr]}{s(2+s-\theta)(2+s)},$$
$$\frac{p+1}{p+2}-2\eta\le\frac{1}{2+s-\theta}-\frac{s\bigl[(2+s-\theta)(p+1)/(p+2)-1\bigr]}{(2+s-\theta)(2+s)}-\frac{s}{3(1+s)},\qquad \frac{p+1}{p+2}-2\eta\le\frac{1}{2+s-\theta}.\qquad(5.21)$$
The first inequality above tells us that the restrictions on $\alpha,\beta$ are satisfied by choosing $\alpha=(p+1)/(p+2)=1/(2-\theta)$ and $(p+1)/(p+2)\le\beta\le\infty$. The second inequality shows that condition (2.7) for the parameter $\eta$, renamed now as $\widetilde{\eta}$, is also satisfied by taking $\widetilde{\eta}=1/3$. Thus, we apply Theorem 2.5 and know that with confidence $1-\delta$, (2.8) holds with the power index $\vartheta$ given by (2.9); but $r=1/2$, $\alpha=1/(2-\theta)$, and $\widetilde{\eta}=1/3$ imply that
$$\vartheta=\frac{1}{2}\min\left\{\alpha,\ \frac{1}{2+s-\theta},\ \frac{1}{2+s-\theta}-\frac{s\bigl[\alpha(2+s-\theta)-1\bigr]}{(2+s-\theta)(2+s)}-\frac{s}{3(1+s)}\right\}.\qquad(5.22)$$
The last two inequalities satisfied by $s$ yield $\vartheta\ge(\alpha/2)-\eta$. So (2.8) verifies (2.4). This completes the proof of the theorem.

Acknowledgment

The work described in this paper is supported by National Science Foundation of China under Grant 11001247 and by a grant from the Research Grants Council of Hong Kong (Project no. CityU 103709).