Research Article | Open Access
Regularized Ranking with Convex Losses and ℓ1-Penalty
In the ranking problem, one has to compare two different observations and decide the ordering between them. It has received increasing attention in both the statistical and machine learning literature. This paper considers ℓ1-regularized ranking rules with convex loss. Under some mild conditions, a learning rate is established.
1. Introduction
In the ranking problem, one has to compare two different observations and decide the ordering between them. Ranking has become an active topic in the machine learning community and has received increasing attention in both the statistical and machine learning literature.
The problem of ranking may be modeled in the framework of statistical learning (see [1, 2]). Let (X, Y) be a pair of random variables taking values in 𝒳 × ℝ. The random observation X models some object and Y denotes its real-valued label. Let (X′, Y′) denote a pair of random variables identically distributed with (X, Y) (with respect to the probability P) and independent of it. In the ranking problem one observes X and X′ but not their labels Y and Y′. X is “better” than X′ if Y > Y′. We are to construct a measurable function f : 𝒳 × 𝒳 → ℝ, called a ranking rule, which predicts the ordering between objects in the following way: if f(X, X′) > 0, we predict that X is better than X′. A ranking rule f has the property f(x, x′) = −f(x′, x). The performance of a ranking rule f is measured by the ranking error L(f) = P(sgn(Y − Y′) f(X, X′) < 0), that is, the probability that f ranks two randomly drawn instances incorrectly. It is easily seen that L attains its minimum L*, over the class of all measurable functions, at the ranking rule
In practice, the best rule is unknown since the probability is unknown. A widely used approach for estimating is empirical risk minimization with a convex loss.
Definition 1. One says that φ : ℝ → ℝ₊ is a ranking loss (function) if it is convex, differentiable at 0 with φ′(0) < 0, and the smallest zero of φ is 1.
Examples of ranking loss include the least square loss and -norm SVM loss , where and for .
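The two example losses and an empirical pairwise risk can be sketched numerically as follows. This is only an illustration under stated assumptions: the loss forms (1 − t)^2 and max(0, 1 − t)^q are the standard ones from the classification literature, the pairwise average over i ≠ j follows the usual U-statistic form of the empirical ranking risk, and the names `least_square_loss`, `q_norm_svm_loss`, `empirical_ranking_risk`, and `score_difference` are hypothetical.

```python
import numpy as np

def least_square_loss(t):
    """Least square loss, assumed here in the form (1 - t)^2."""
    return (1.0 - t) ** 2

def q_norm_svm_loss(t, q=1.0):
    """SVM-type loss, assumed form max(0, 1 - t)^q; q = 1 is the hinge loss."""
    return np.maximum(0.0, 1.0 - t) ** q

def empirical_ranking_risk(f, X, y, loss):
    """Average of loss(sgn(y_i - y_j) * f(x_i, x_j)) over all ordered pairs i != j."""
    n = len(y)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                margin = np.sign(y[i] - y[j]) * f(X[i], X[j])
                total += loss(margin)
    return total / (n * (n - 1))

def score_difference(x, x_prime):
    """A hypothetical antisymmetric ranking rule: f(x, x') = x - x'."""
    return x - x_prime

# Toy sample in which larger x always carries a larger label.
X = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 2.0])

hinge_risk = empirical_ranking_risk(score_difference, X, y, q_norm_svm_loss)
square_risk = empirical_ranking_risk(score_difference, X, y, least_square_loss)
print(hinge_risk, square_risk)
```

On this toy sample every pair is ranked correctly with margin at least 1, so the hinge-type risk vanishes, while the least square loss still penalizes the pairs whose margin exceeds 1.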
The risk of a measurable function is defined as . Denote by a minimizer of over the set of all measurable and antisymmetric functions. For example, as in the classification case (see [3, 4]), , and for ,
The following inequality holds for any : where and is some constant.
Before proceeding further, we introduce the notion of a reproducing kernel Hilbert space (RKHS). Recall that a continuous function is a Mercer kernel on a set , if , and given an arbitrary finite set , the matrix is positive semidefinite. The RKHS associated with the Mercer kernel is the completion of , with respect to the inner product given by . See [5] and [6, Ch. 4] for details.
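The positive semidefiniteness condition in the definition of a Mercer kernel can be checked numerically on a finite set. The sketch below is an illustration only: it assumes a Gaussian kernel acting on pairs (x, x′) flattened into a single vector, and the helper names `gaussian_kernel`, `gram_matrix`, and `is_positive_semidefinite` are hypothetical.

```python
import numpy as np

def gaussian_kernel(u, v, sigma=1.0):
    """Gaussian kernel exp(-|u - v|^2 / (2 sigma^2)) on Euclidean vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2)))

def gram_matrix(kernel, points):
    """Matrix (K(u_i, u_j)) for a finite list of points."""
    m = len(points)
    G = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            G[i, j] = kernel(points[i], points[j])
    return G

def is_positive_semidefinite(G, tol=1e-10):
    """True if the smallest eigenvalue of the symmetrized matrix is >= -tol."""
    eigenvalues = np.linalg.eigvalsh((G + G.T) / 2.0)
    return bool(eigenvalues.min() >= -tol)

rng = np.random.default_rng(0)
# Each row stands for a pair (x, x') with x, x' in R^2, flattened to R^4 (assumed layout).
points = rng.normal(size=(8, 4))
G = gram_matrix(gaussian_kernel, points)
print(is_positive_semidefinite(G))
```

A finite check of this kind can only refute the Mercer property for a candidate kernel; it cannot prove it, since the definition quantifies over all finite sets.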
For convenience, we assume hereafter that the Mercer kernels on are symmetric in the sense that Such examples are Mercer kernels of either form or , where and are the Euclidean norm and inner product, respectively.
Since the best ranking rule is antisymmetric, it is reasonable to restrict ourselves to the subspace of antisymmetric functions in ; that is, For any , and with , it is easily seen that Conversely, any antisymmetric function with the above expression should satisfy , provided det.
For a set of samples , let
For , the ℓ1-penalty regularized ranking rule is the minimizer of the problem where , known as the empirical risk, is given by
Associated with any ranking rule , we construct another ranking rule as follows: Clearly, gives the same ranking rule as and it satisfies
Hereafter, we denote . The goal of this paper is to bound the excess error , which, together with (4), in turn gives an upper bound on the excess ranking error . The main result of this paper establishes, under mild conditions, a learning rate for ℓ1-penalty regularized ranking rules with convex loss.
Classification with convex loss, in particular for -norm SVMs, has been the subject of many theoretical considerations in recent years. The 1-norm SVM for ranking, with the RKHS norm as regularizer, was investigated in [2, 7]. The ℓ1-penalty has been used in [8, 9] for classification problems under the framework of SVMs. It is well known that ℓ1-regularization usually leads to solutions with sparse representations (see, e.g., [10–13]). In this paper, we consider ranking with convex loss and ℓ1-penalty.
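The link between an ℓ1 penalty and sparsity can be seen in the simplest separable case. The sketch below is not the algorithm analyzed in this paper; it only illustrates the soft-thresholding operator, which minimizes 0.5 (a − z)^2 + λ|a| coordinate by coordinate and is the basic step of iterative thresholding schemes such as the one in [13]. The name `soft_threshold` and the sample vector are hypothetical.

```python
import numpy as np

def soft_threshold(z, lam):
    """Coordinatewise minimizer of 0.5 * (a - z)^2 + lam * |a|."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Unpenalized coefficients (illustrative values only).
z = np.array([3.0, -0.4, 0.2, -2.5, 0.05])
alpha = soft_threshold(z, lam=0.5)

# Coefficients with |z_i| <= lam are set exactly to zero.
print(alpha, np.count_nonzero(alpha))
```

Every coefficient whose magnitude does not exceed the threshold becomes exactly zero, which is the mechanism behind the sparse representations mentioned above.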
In , the RKHS-norm SVM for ranking was proposed. However, it was implemented over a ball of , not the whole RKHS ; that is, it solves the minimization problem A convergence rate for has been established for the Gaussian kernel. The approximation error was not considered there. The asymptotic behavior of the same algorithm implemented over the whole RKHS was investigated in . Moreover, a fast learning rate was obtained under some conditions.
We would like to mention a recent paper [14], where the error of a function is defined by . A convergence rate for the minimizer of the regularized empirical error was established. The author made use of the technique of estimation via integral operators developed in [15].
The rest of the paper is organized as follows. In Section 2, after making some assumptions, we state the main result, an upper bound for . As usual, it is decomposed into a sum of three terms: the sample error, the hypothesis error, and the approximation error. Sections 3 and 4 are devoted to estimating the hypothesis error and the sample error, respectively. A proof of the main result is given in Section 5.
2. Assumptions and Main Results
For the statement of the main results, we need to introduce some notions and make some assumptions.
Denote . The following assumption is a bound for variance of , which is adopted by many authors.
Assumption 2. There is a constant such that for any , where is a constant.
For , the assumption is satisfied  for
It is known in  that if there is some positive constant such that then the assumption is satisfied for .
Suppose hereafter . We note that, for any Mercer kernel , We now construct a set of functions which contains and is independent of the samples.
Definition 3. The Banach space is defined as the function set on containing all functions of the form with the norm
Obviously, . By the definition of and (17), one has which implies that the series converges in . Consequently, and . The following also holds: , where for .
Denote . The approximation error of by with is defined as
Denote the minimizer
The next assumption is concerned with the approximation power of to .
Assumption 4. There are positive constants and such that
Recall . The above assumption is not too restrictive.
Assumption 5. (i) The kernel satisfies a Lipschitz condition of order with ; that is, there exists some such that (ii) The ranking loss has an increment exponent ; that is, there exist some constants such that where denotes the right- and left-sided derivatives of , respectively.
Assumption 6. The margin distribution satisfies condition with ; that is, for some and any ball , one has
The last assumption concerns covering numbers. For a subset of a space with pseudometric and , the covering number is defined to be the minimal number such that there exist disks with radius covering . When is compact, this number is finite.
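For a finite point set, an upper bound on the covering number can be computed greedily. The sketch below is an illustration under assumptions not made in the paper: the metric is Euclidean, disks are closed, centers are restricted to the set itself, and the function name `greedy_cover_size` is hypothetical. The greedy count is only an upper bound on the minimal number of disks.

```python
import numpy as np

def greedy_cover_size(points, radius):
    """Number of closed Euclidean disks of the given radius, centered at
    points of the set, chosen greedily until every point is covered."""
    points = np.asarray(points, dtype=float)
    uncovered = np.ones(len(points), dtype=bool)
    centers = 0
    while uncovered.any():
        # Take the first point that is still uncovered as a new center.
        center = points[np.argmax(uncovered)]
        distances = np.linalg.norm(points - center, axis=1)
        uncovered &= distances > radius
        centers += 1
    return centers

# Ten equally spaced points on a line, stored as a 10 x 1 array.
grid = np.arange(10, dtype=float).reshape(-1, 1)
print(greedy_cover_size(grid, radius=1.0))
```

Shrinking the radius can only increase the count, which mirrors the growth of the covering number as the radius tends to zero.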
Assumption 7. (i) There are some and such that (ii) For , let . There are some constant such that
We are in a position to state the main result of this paper. The proof is given in Section 5.
The first step of the proof is to decompose into errors of different types as follows: where is referred to as the sample error and is referred to as the hypothesis error. We bound the hypothesis error and the sample error in the next two sections, respectively.
In the estimation of the sample error, Hoeffding's decomposition of a U-statistic, which breaks the U-statistic into a sum of iid random variables and a degenerate U-statistic (see Section 4 for details), is a useful tool.
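Hoeffding's decomposition can be verified numerically on a toy kernel. The sketch below does not use the ranking loss of this paper: it assumes the symmetric kernel h(a, b) = ab and a sample from the uniform distribution on [0, 1], whose mean 0.5 is known, so that the linear part and the degenerate part have closed forms. All names (`u_statistic`, `h`, `h1`, `h2`) are hypothetical.

```python
import numpy as np

def u_statistic(kernel, x):
    """Order-2 U-statistic: average of kernel(x_i, x_j) over ordered pairs i != j."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += kernel(x[i], x[j])
    return total / (n * (n - 1))

mu = 0.5                 # mean of Uniform(0, 1), known in closed form
theta = mu ** 2          # E[h(X, X')] for independent X, X'

def h(a, b):
    return a * b

def h1(a):
    # Conditional expectation E[h(a, X')] minus theta.
    return mu * a - theta

def h2(a, b):
    # Degenerate part: its expectation in either argument alone is zero.
    return (a - mu) * (b - mu)

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=30)

lhs = u_statistic(h, x) - theta
linear_part = 2.0 * np.mean([h1(a) for a in x])   # sum of iid terms
degenerate_part = u_statistic(h2, x)              # degenerate U-statistic
print(lhs, linear_part + degenerate_part)
```

The identity holds exactly for every sample because h(a, b) − θ = h1(a) + h1(b) + h2(a, b); in typical samples the degenerate part is of smaller order than the linear part.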
3. Hypothesis Error
In this section, we bound the hypothesis error . This error arises when we switch from the minimizer of in to the minimizer of in . Such errors have been estimated in several papers, for example, [7, 18]. We note that, in contrast to [18, 19], the underlying spaces and are sets of antisymmetric functions. We begin with the representations of the functions.
Lemma 9. Let . For any , one has a representation:
Proof. For any , there are sequences and such that and .
Denote . It follows from (5) that The proof is complete by .
A set is said to be -dense in if for any there exists some such that .
Proposition 10 (, Proposition 3.1). Let be drawn independently according to . Then for any , under Assumption 6 and (i) in Assumption 7, with confidence at least is -dense in , where is a constant that depends only on and .
The hypothesis error is bounded by the following proposition.
Proposition 11. Suppose Assumptions 5 and 6 hold. Then for any , with confidence at least , there holds where is a constant independent of , and . (Hereafter, and are constants which are independent of , or , and may change from line to line.)
Proof. The proof follows the line of [19, 20]. For any , let the representation with be given in Lemma 9.
By Proposition 10, with confidence at least , for any , there are some such that . For an integer , which will be determined later, denote . So by Assumption 7, where .
Choose such that . Therefore Consequently, , which together with yields, with confidence at least , On the other hand, since , the following holds by (12) and (9), with confidence at least : which together with (40) completes the proof by letting .
The above bound for is the same as in , both using the same density of in . However, the functions considered there are defined on , instead of as in the present setting.
4. Sample Error
Moreover, for any ranking rule , we denote . Then It is the deviation of a sum of independent random variables from its mean.
As seen, in Hoeffding's decomposition (43), the first term is a sum of iid variables and the second term is a degenerate U-statistic. Degeneracy means = .
Denote so that the sample error .
With the above decomposition and the methods in , the following proposition is established for the ball and in . Assumption 2 and the condition on the covering number of play a crucial role. The arguments also work in the present setting; that is, for the ball . For this, we note that satisfies ≤ . Therefore, under (ii) of Assumption 5 and (ii) of Assumption 7, the covering number of satisfies . The interested reader may refer to [7, 21] for the details.
Estimating the supremum of a U-process is much more involved. The supremum of U-processes has been studied in a few papers. The following lemma follows from the proof of ([22, Theorem 3.2, ]).
Lemma 13. Suppose that a function class satisfies the following conditions. (i) For any , and . (ii) is uniformly bounded by a universal constant . (iii) . Then where , are some constants, independent of , and
Proposition 14. Suppose that (ii) in Assumption 7 holds. Then one has with confidence at least where and are some positive constants.
Proof. We first claim that satisfies the conditions in Lemma 13. Indeed, condition (i) holds by definition of and for any . Also, by definition of , for any , , implying (ii). Moreover, by (ii) of Assumption 5
provided (ii) in Assumption 7. This establishes (iii), as claimed.
Applying Markov's inequality to , with and appealing to Lemma 13, we have ; that is, For a single function , it is known from [23, Proposition 2.3] that which, together with (55), completes the proof.
The estimation for is easy since does not change with the set of samples.
Proposition 15. Assume Assumption 2. For any , one has with confidence at least where are some constants, independent of and .
Proof. Clearly, the function satisfies . Then, by Assumption 2, we conclude, as in [20, 21], with confidence at least , that
It remains to estimate . For any single function with and , we have by [23, Proposition 2.3] which together with implies, with confidence at least , that where and are constants. The proof is complete.
5. Proof of Theorem 8
Proof. We note that
By , a combination of Propositions 11, 12, 14, and 15 yields that, with confidence at least ,
where , are constants, and is bounded by (49).
Putting and into the implication relation we obtain (61) from (63). Therefore, the conditional probability of the event that inequality (61) holds, given the event , is at least . The proof is complete.
Recall that is given in Theorem 8. It is easily seen that, for , Therefore, we have, by Theorem 16, with the conditional probability at least , given , If , the above inequality becomes Consequently, given event with , we have with confidence at least ,
Let . By induction, it is easy to prove = . Since , we have with confidence at least ≤ ,. Clearly, where .
The authors thank Professor Di-Rong Chen for his help.
References
[1] S. Clémençon, G. Lugosi, and N. Vayatis, “Ranking and empirical minimization of U-statistics,” The Annals of Statistics, vol. 36, no. 2, pp. 844–874, 2008.
[2] W. Rejchel, “On ranking and generalization bounds,” Journal of Machine Learning Research, vol. 13, pp. 1373–1392, 2012.
[3] Y. Lin, “Support vector machines and the Bayes rule in classification,” Data Mining and Knowledge Discovery, vol. 6, no. 3, pp. 259–275, 2002.
[4] D.-R. Chen, Q. Wu, Y. Ying, and D.-X. Zhou, “Support vector machine soft margin classifiers: error analysis,” Journal of Machine Learning Research, vol. 5, pp. 1143–1175, 2003/04.
[5] N. Aronszajn, “Theory of reproducing kernels,” Transactions of the American Mathematical Society, vol. 68, pp. 337–404, 1950.
[6] F. Cucker and D.-X. Zhou, Learning Theory: An Approximation Theory Viewpoint, vol. 24, Cambridge University Press, Cambridge, UK, 2007.
[7] H. Chen and J. T. Wu, “Support vector machine for ranking,” submitted.
[8] P. S. Bradley and O. L. Mangasarian, “Feature selection via concave minimization and support vector machines,” in Proceedings of the 15th International Conference on Machine Learning (ICML '98), J. Shavlik, Ed., Morgan Kaufmann, 1998.
[9] M. Song, C. M. Breneman, J. Bi et al., “Prediction of protein retention times in anion-exchange chromatography systems using support vector regression,” Journal of Chemical Information and Computer Sciences, vol. 42, no. 6, pp. 1347–1357, 2002.
[10] D. L. Donoho, “For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution,” Communications on Pure and Applied Mathematics, vol. 59, no. 6, pp. 797–829, 2006.
[11] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani, “1-norm support vector machines,” Advances in Neural Information Processing Systems, vol. 16, pp. 49–56, 2004.
[12] B. Tarigan and S. A. van de Geer, “Classifiers of support vector machine type with ℓ1 complexity regularization,” Bernoulli, vol. 12, no. 6, pp. 1045–1076, 2006.
[13] I. Daubechies, M. Defrise, and C. De Mol, “An iterative thresholding algorithm for linear inverse problems with a sparsity constraint,” Communications on Pure and Applied Mathematics, vol. 57, no. 11, pp. 1413–1457, 2004.
[14] H. Chen, “The convergence rate of a regularized ranking algorithm,” Journal of Approximation Theory, vol. 164, no. 12, pp. 1513–1519, 2012.
[15] S. Smale and D.-X. Zhou, “Learning theory estimates via integral operators and their approximations,” Constructive Approximation, vol. 26, no. 2, pp. 153–172, 2007.
[16] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe, “Convexity, classification, and risk bounds,” Journal of the American Statistical Association, vol. 101, no. 473, pp. 138–156, 2006.
[17] I. Steinwart and C. Scovel, “Fast rates for support vector machines using Gaussian kernels,” The Annals of Statistics, vol. 35, no. 2, pp. 575–607, 2007.
[18] Q.-W. Xiao and D.-X. Zhou, “Learning by nonsymmetric kernels with data dependent spaces and ℓ1-regularizer,” Taiwanese Journal of Mathematics, vol. 14, no. 5, pp. 1821–1836, 2010.
[19] H. Tong, D.-R. Chen, and F. Yang, “Support vector machines regression with ℓ1-regularizer,” Journal of Approximation Theory, vol. 164, no. 10, pp. 1331–1344, 2012.
[20] H. Tong, D.-R. Chen, and F. Yang, “Learning rates for ℓ1-regularized kernel classifiers,” Journal of Applied Mathematics, vol. 2013, Article ID 496282, 11 pages, 2013.
[21] H. Tong, D.-R. Chen, and L. Peng, “Analysis of support vector machines regression,” Foundations of Computational Mathematics, vol. 9, no. 2, pp. 243–257, 2009.
[22] M. A. Arcones and E. Giné, “U-processes indexed by Vapnik-Červonenkis classes of functions with applications to asymptotics and bootstrap of U-statistics with estimated parameters,” Stochastic Processes and their Applications, vol. 52, no. 1, pp. 17–38, 1994.
[23] M. A. Arcones and E. Giné, “Limit theorems for U-processes,” The Annals of Probability, vol. 21, no. 3, pp. 1494–1542, 1993.
Copyright © 2013 Heng Chen and Jitao Wu. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.