Approximation Analysis of Gradient Descent Algorithm for Bipartite Ranking
We introduce a gradient descent algorithm for bipartite ranking with general convex losses. The algorithm is simple to implement, and we investigate its generalization performance. Explicit learning rates are presented under suitable choices of the regularization parameter and the step size. The result fills a theoretical gap in learning rates for the ranking problem with general convex losses.
In this paper we consider a gradient descent algorithm for bipartite ranking derived from a Tikhonov regularization scheme with general convex losses and reproducing kernel Hilbert spaces (RKHS).
Let be a compact metric space and . In the bipartite ranking problem, the learner is given positive samples and negative samples , which are drawn independently at random from and , respectively. Given the training set , the goal of bipartite ranking is to learn a real-valued ranking function that ranks future positive samples higher than negative ones.
The expected loss incurred by a ranking function on a pair of instances is , where is if is true and otherwise. However, due to the nonconvexity of , the empirical minimization method based on is NP-hard. Thus, we consider replacing by a convex loss function that upper-bounds it. Typical choices of include the hinge loss, the least square loss, and the logistic loss.
The expected convex risk is The corresponding empirical risk is
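As an illustration of the empirical risk for the losses named above, the following minimal sketch averages a loss over all positive-negative pairs. It assumes the pairwise form (1/(mn)) Σ_i Σ_j φ(f(x_i^+) − f(x_j^−)), counts ties as misrankings, and uses common textbook forms of the hinge, least square, and logistic losses; the function names and the toy scores are hypothetical.

```python
import numpy as np

def pairwise_margins(scores_pos, scores_neg):
    """Return the m-by-n matrix of margins f(x_i^+) - f(x_j^-)."""
    return scores_pos[:, None] - scores_neg[None, :]

# Assumed loss forms; the misranking loss treats ties as errors.
LOSSES = {
    "misranking": lambda t: (t <= 0).astype(float),
    "hinge": lambda t: np.maximum(0.0, 1.0 - t),
    "least_square": lambda t: (1.0 - t) ** 2,
    "logistic": lambda t: np.log1p(np.exp(-t)),
}

def empirical_risk(scores_pos, scores_neg, loss):
    """Average the chosen loss over all m*n positive-negative pairs."""
    margins = pairwise_margins(scores_pos, scores_neg)
    return LOSSES[loss](margins).mean()

# Toy scores of a hypothetical ranking function on 3 positives and 2 negatives.
scores_pos = np.array([2.0, 0.5, 1.2])
scores_neg = np.array([0.0, 1.0])
risks = {name: empirical_risk(scores_pos, scores_neg, name) for name in LOSSES}
for name, value in risks.items():
    print(f"{name}: {value:.4f}")
```

In this toy example the hinge and least square risks dominate the misranking risk, which is the role a convex upper bound is meant to play.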
Let be the target function set, where is the measurable function space. We can observe that the target function is not unique. In particular, for the least square loss, the regression function is one element in this set.
The ranking algorithm we investigate in this paper is based on a Tikhonov regularization scheme associated with a Mercer kernel. A symmetric, positive semidefinite, continuous function is called a Mercer kernel. The RKHS associated with the kernel is defined (see ) to be the closure of the linear span of the set of functions with the inner product given by . The reproducing property takes the form , for all . The reproducing property together with the Schwarz inequality yields that . Then, , where .
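The consequence of the reproducing property can be checked numerically. The sketch below assumes the standard bound |f(x)| ≤ κ‖f‖_K with κ = sup_x sqrt(K(x, x)), and uses a Gaussian kernel (for which κ = 1) purely as an example of a Mercer kernel; the centers, coefficients, and evaluation points are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix K(a, b) = exp(-|a - b|^2 / (2 sigma^2)) for rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
centers = rng.uniform(-1.0, 1.0, size=(6, 2))   # points x_1, ..., x_6
coef = rng.normal(size=6)                       # f = sum_i coef_i * K(x_i, .)

# RKHS norm of f from the inner product <K_x, K_y>_K = K(x, y).
G = gaussian_kernel(centers, centers)
rkhs_norm = np.sqrt(coef @ G @ coef)

# Evaluate f on random points and compare with kappa * ||f||_K.
grid = rng.uniform(-1.0, 1.0, size=(500, 2))
f_vals = gaussian_kernel(grid, centers) @ coef
kappa = 1.0  # sup of sqrt(K(x, x)) for the Gaussian kernel
sup_norm_on_grid = np.abs(f_vals).max()
print(sup_norm_on_grid <= kappa * rkhs_norm)
```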
Though the offline algorithm (1.3) has been well understood in , it might be practically challenging when the sample size or is large. The same difficulty for classification and regression algorithms is overcome by reducing the computational complexity through a stochastic gradient descent method. Such algorithms have been proposed for online regression in [3, 4], online classification in [5, 6], and gradient learning in [7, 8]. In this paper, we use the idea of gradient descent to propose an algorithm for learning a target function in .
Since is convex, we know that its left derivative is well defined and nondecreasing on . By taking functional derivatives in (1.3), we introduce the following algorithm for ranking.
Definition 1.1. The stochastic gradient descent ranking algorithm is defined for the sample by and where and is the sequence of step sizes.
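A minimal sketch of such an iteration is given below. It assumes the update takes the form f_1 = 0 and f_{t+1} = f_t − η_t [ (1/(mn)) Σ_i Σ_j φ'_−(f_t(x_i^+) − f_t(x_j^−)) (K_{x_i^+} − K_{x_j^−}) + λ f_t ], which is what taking functional derivatives of a regularized pairwise risk would give. The Gaussian kernel, the hinge loss, the polynomially decaying step size η_t = η_1 t^(−θ), and all parameter values are illustrative assumptions.

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    """Gram matrix of a Gaussian kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hinge_left_derivative(t):
    """Left derivative of phi(t) = max(0, 1 - t): -1 for t <= 1, else 0."""
    return np.where(t <= 1.0, -1.0, 0.0)

def gd_ranking(X_pos, X_neg, lam=0.01, eta1=0.5, theta=0.5, T=200, sigma=1.0):
    """Run T gradient steps; f_t is stored as coefficients over all samples."""
    m, n = len(X_pos), len(X_neg)
    Z = np.vstack([X_pos, X_neg])
    G = gaussian_gram(Z, Z, sigma)
    alpha = np.zeros(m + n)                      # f_1 = 0
    for t in range(1, T + 1):
        eta = eta1 * t ** (-theta)               # assumed step size schedule
        f = G @ alpha                            # f_t at every sample point
        D = hinge_left_derivative(f[:m, None] - f[None, m:])
        grad = np.empty(m + n)
        grad[:m] = D.sum(axis=1) / (m * n)       # coefficients on K_{x_i^+}
        grad[m:] = -D.sum(axis=0) / (m * n)      # coefficients on K_{x_j^-}
        alpha = alpha - eta * (grad + lam * alpha)
    return Z, alpha

def predict(Z, alpha, X, sigma=1.0):
    """Evaluate the learned ranking function at the rows of X."""
    return gaussian_gram(X, Z, sigma) @ alpha

# Toy one-dimensional data: positives near 2, negatives near -2.
X_pos = np.linspace(1.5, 2.5, 5)[:, None]
X_neg = np.linspace(-2.5, -1.5, 4)[:, None]
Z, alpha = gd_ranking(X_pos, X_neg)
scores_pos = predict(Z, alpha, X_pos)
scores_neg = predict(Z, alpha, X_neg)
print(scores_pos.min() > scores_neg.max())
```

Each step costs O(mn) for the pairwise derivatives plus a Gram matrix product, which is the sense in which the iteration avoids solving the regularized problem directly.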
In fact, Burges et al.  investigate gradient descent methods for learning ranking functions and introduce a neural network to model the underlying ranking function. From the idea of maximizing the generalized Wilcoxon-Mann-Whitney statistic, a ranking algorithm using gradient approximation has been proposed in . However, these approaches are different from ours, and their analysis focuses on computational complexity. Recently, for the least square loss, numerical experiments with a gradient descent algorithm have been presented in . The aim of this paper is to provide generalization bounds for the gradient descent ranking algorithm (1.5) with general convex losses. To the best of our knowledge, there is no error analysis in this case, which motivates the present study.
We mainly analyze the errors and , which differs from previous error analyses for ranking algorithms based on uniform convergence (e.g., [12–16]) and stability analysis in [2, 17, 18]. Though the convergence rates of norm for classification and regression algorithms have been elegantly investigated in [19, 20], there is no such analysis in the ranking setting. The main difference between the formulation of the ranking problem and that of classification and regression is that the performance or loss in ranking is measured on pairs of examples, rather than on individual examples. This means in particular that, unlike the empirical error in classification or regression, the empirical error in ranking cannot be expressed as a sum of independent random variables . This makes the convergence analysis of norm difficult and invalidates previous techniques. Fortunately, a similar difficulty for gradient learning has been overcome in [7, 21, 22] by introducing some novel techniques. In this paper, we develop a detailed analysis based on these techniques.
2. Main Result
In this section we present our main results on learning rates of algorithm (1.5) for learning ranking functions. We assume that satisfies for some and . Denote the constant
Theorem 2.1. Assume satisfies (2.1), and choose the step size as For , one takes with . Then, for any , with confidence at least , one has where is a constant independent of , and
Theorem 2.1 will be proved in the next section, where the constant can be obtained explicitly. The explicit parameters in Theorem 2.1 are described in Table 1 for some special loss functions . Note that the step sizes and the number of iterations depend on the sample number . When and , we have and .
From the results in Theorem 2.1, we know that the balance of samples is crucial to reach fast learning rates. For and the least square loss, the approximation order is . Moreover, when and , we have with the order .
Now we present the estimates of under some approximation conditions.
Corollary 2.2. Assume that there is such that for some . Under the condition in Theorem 2.1, for any , with confidence at least , one has where is a constant independent of , and
For and the least square loss, by setting , we can derive the learning rate . Moreover, if , we get the approximation order .
For the least square loss, the regression function is an optimal predictor in . Then, the bipartite ranking problem can be reduced to a regression problem. Based on the theoretical analysis in [19, 20], we know that the approximation condition in Corollary 2.2 can be achieved when the regression function lies in the th power of the integral operator associated with the kernel .
The highlight of our theoretical analysis is an estimate of the distance between and the target function set in RKHS. This is different from previous results on error analysis, which focus on establishing the estimate of . Compared with previous theoretical studies, the approximation analysis in -norm is new and fills the gap in learning rates for the ranking problem with general convex losses.
We also note that the techniques of previous error estimates for the ranking problem mainly include stability analysis in [2, 17], concentration estimation based on U-statistics in , and uniform convergence bounds based on covering numbers [15, 16]. Our analysis presents a novel capacity-independent procedure to investigate the generalization performance of ranking algorithms.
3. Proof of Main Result
We introduce a special property of . Since the proof is the same as that in , we will omit it here.
Lemma 3.1. Let . For any , there holds
Now we give the one-step analysis.
Lemma 3.2. For , one has where .
Proof. Observe that
where the first and the second inequalities are derived from the convexity of and the Schwarz inequality, respectively.
By Lemma 3.1, we know that Thus, the desired result follows by combining (3.5) and (3.6) with (3.4).
To deal with the sample error iteratively by applying (3.3), we need to bound the quantity by the theory of uniform convergence. To this end, a bound for the norm of is required.
Definition 3.3. One says that is locally Lipschitz at the origin if the local Lipschitz constant is finite for any .
Now we estimate the bound of from the ideas given in .
Lemma 3.4. Assume that is locally Lipschitz at the origin. If the step size satisfies for each t, then .
Proof. We prove by induction. It is trivial that satisfies the bound.
Suppose that this bound holds true for , . Consider Let . Since we have .
Meanwhile, . Then, is a positive linear operator on and its norm is bounded by .
Since , the operator on is positive and .
Thus, This proves the lemma.
For , denote . Meanwhile, denote and .
Lemma 3.5. For every and , one has
Proof. Because of the feature of , four cases of sample change must be taken into account in order to apply McDiarmid's inequality. Denote by the sample coinciding with except for (or ) replaced by (or ). It is easy to verify that
Based on McDiarmid's inequality in , we can derive the first result in Lemma 3.5. To derive the second result, we denote . Then, and . Observe that
Denote . Then,
Since , we have
In the same fashion, we can also derive Thus, the second desired result follows by combining (3.17) and (3.18).
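For reference, the bounded-differences inequality of McDiarmid used in this proof can be stated in its standard form as follows. The symbols g, N, z_k, and c_k are generic notation introduced here for illustration and are not taken from the lemma itself.

```latex
% McDiarmid's bounded-differences inequality (standard form).
% Let z_1, \dots, z_N be independent random variables and let g satisfy,
% for every index k and all values z_1, \dots, z_N, z_k',
\[
  \bigl| g(z_1, \dots, z_k, \dots, z_N) - g(z_1, \dots, z_k', \dots, z_N) \bigr| \le c_k .
\]
% Then, for every \varepsilon > 0,
\[
  \Pr\bigl\{ g(z_1, \dots, z_N) - \mathbb{E}\, g(z_1, \dots, z_N) \ge \varepsilon \bigr\}
  \le \exp\!\left( - \frac{2 \varepsilon^{2}}{\sum_{k=1}^{N} c_k^{2}} \right).
\]
```

In the bipartite setting the sample consists of positive and negative instances, so replacing a positive instance and replacing a negative one give different constants c_k, which is why several cases of sample change are treated separately.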
Now we can derive the estimate of .
Lemma 3.6. If satisfies (1) for each and , then with confidence at least one has
Proof. By Lemma 3.5, we have, with confidence at least , By taking in the definition of , we see that Then, for any , we have . Thus, for . So, for each . This completes the proof.
Lemma 3.7. (1) For and ,
(2) Let and . Then
(3) For any and , there holds
Proposition 3.8. Let for some , and let satisfy . Set and as in Lemma 3.6. Denote . Then, with confidence at least , the following bound holds for : when , when ,
Proof. Since , we have . From the definition of , we know that . Thus, when , we have from Lemma 3.2 Applying this relation iteratively, we have Since , by Lemma 3.7(2), we have for and for Lemma 3.7(1) yields By Lemma 3.7(3), we also have for and for Combining the above estimates with Lemma 3.6, we derive the desired results.
Now we present the proof of Theorem 2.1.
Proof of Theorem 2.1. First we derive explicit expressions for the quantities in Proposition 3.8. Since , we have , where . By (2.1), we find that
where and .
Next, we bound . When , we have Hence, It follows that the condition in Proposition 3.8 holds true when and . Based on Proposition 3.8 and , we have, with confidence at least , where , and are constants independent of and .
Thus, when , we can derive the desired result in Theorem 2.1.
This work was supported partially by the National Natural Science Foundation of China (NSFC) under Grant no. 11001092 and the Fundamental Research Funds for the Central Universities (Program no. 2011PY130, 2011QC022). The authors are indebted to the anonymous reviewers for their constructive comments.
N. Aronszajn, “Theory of reproducing kernels,” Transactions of the American Mathematical Society, vol. 68, pp. 337–404, 1950.
S. Agarwal and P. Niyogi, “Stability and generalization of bipartite ranking algorithms,” in COLT, 2005.
C. Burges, T. Shaked, E. Renshaw et al., “Learning to rank using gradient descent,” in Proceedings of the 22nd International Conference on Machine Learning, 2005.
H. Chen, Y. Tang, L. Q. Li, and X. Li, “Ranking by a gradient descent algorithm,” manuscript.
S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, and D. Roth, “Generalization bounds for the area under the ROC curve,” Journal of Machine Learning Research, vol. 6, pp. 393–425, 2005.
C. Rudin and R. E. Schapire, “Margin-based ranking and an equivalence between AdaBoost and RankBoost,” Journal of Machine Learning Research, vol. 10, pp. 2193–2232, 2009.
C. Rudin, “The P-norm push: a simple convex ranking algorithm that concentrates at the top of the list,” Journal of Machine Learning Research, vol. 10, pp. 2233–2271, 2009.
S. Agarwal and P. Niyogi, “Generalization bounds for ranking algorithms via algorithmic stability,” Journal of Machine Learning Research, vol. 10, pp. 441–474, 2009.
S. Mukherjee and Q. Wu, “Estimation of gradients and coordinate covariation in classification,” Journal of Machine Learning Research, vol. 7, pp. 2481–2514, 2006.
S. Mukherjee and D. X. Zhou, “Learning coordinate covariances via gradients,” Journal of Machine Learning Research, vol. 7, pp. 519–549, 2006.
C. McDiarmid, “On the method of bounded differences,” in Surveys in Combinatorics, 1989 (Norwich, 1989), vol. 141, pp. 148–188, Cambridge University Press, Cambridge, UK, 1989.