Abstract

In the ranking problem, one has to compare two different observations and decide the ordering between them. It has received increasing attention in both the statistical and machine learning literature. This paper considers $\ell^1$-regularized ranking rules with convex loss. Under some mild conditions, a learning rate is established.

1. Introduction

In the ranking problem, one has to compare two different observations and decide the ordering between them. The problem of ranking has become an active research topic in the machine learning community and has received increasing attention in both the statistical and machine learning literature.

The problem of ranking may be modeled in the framework of statistical learning (see [1, 2]). Let be a pair of random variables taking values in . The random observation models some object and denotes its real-valued label. Let denote a pair of random variables identically distributed with (with respect to the probability ) and independent of it. In the ranking problem one observes and but not their labels and . is “better” than if . We are to construct a measurable function , called a ranking rule, which predicts the ordering between objects in the following way: if , we predict that is better than . A ranking rule has the property . The performance of a ranking rule is measured by the ranking error: that is, the probability that ranks two randomly drawn instances incorrectly. It is easily seen that attains its minimum , over the class of all measurable functions, at the ranking rule
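For concreteness, writing $(X,Y)$ and $(X',Y')$ for the two independent labeled observations and $f$ for an antisymmetric ranking rule, one standard way (cf. [1, 2]) to express the ranking error is
\[
R(f) \;=\; \mathbb{P}\bigl\{(Y-Y')\,f(X,X') < 0\bigr\},
\]
and a minimizing rule can be taken as $f^{*}(x,x') = \operatorname{sgn}\bigl(\mathbb{P}\{Y>Y' \mid X=x,\,X'=x'\} - \mathbb{P}\{Y<Y' \mid X=x,\,X'=x'\}\bigr)$; the notation here is generic and only serves to fix ideas.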

In practice, the best rule is unknown since the probability is unknown. A widely used approach for estimating is the empirical risk minimization with convex loss.

Definition 1. One says that is a ranking loss (function) if it is convex, differentiable at with , and the smallest zero of is 1.

Examples of ranking loss functions include the least squares loss and the -norm SVM loss , where and for .
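For instance, these two losses are commonly written as
\[
\phi_{\mathrm{ls}}(t) = (1-t)^{2}, \qquad \phi_{q}(t) = \bigl(\max\{1-t,\,0\}\bigr)^{q}, \quad q \ge 1,
\]
and both are convex, differentiable at $0$ with a negative derivative there, and have $1$ as their smallest zero, as required by Definition 1; the subscripts are only labels.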

The risk of a measurable function is defined as . Denote by a minimizer of over the set of all measurable and antisymmetric functions. For example, as in the classification case (see [3, 4]), , and for ,
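To fix ideas, suppose that ties $Y = Y'$ occur with probability zero and that the $\phi$-risk is formed by applying $\phi$ to $\operatorname{sgn}(Y-Y')\,f(X,X')$, a convention that parallels the classification case. Then, for the least squares loss, minimizing the conditional risk gives
\[
f_\phi(x,x') \;=\; \mathbb{E}\bigl[\operatorname{sgn}(Y-Y') \,\big|\, X=x,\ X'=x'\bigr],
\]
while for the hinge loss the minimizer can be taken to be the Bayes ranking rule itself.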

The following inequality holds for any : where and is some constant.

Before proceeding further, we introduce the notion of a Reproducing Kernel Hilbert Space (RKHS). Recall that a continuous function is a Mercer kernel on a set , if , and, given an arbitrary finite set , the matrix is positive semidefinite. The RKHS associated with the Mercer kernel is the completion of , with respect to the inner product given by . See [5] and [6, Ch. 4] for details.
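In particular, writing $K_{u} = K(u,\cdot)$, the reproducing property states that $f(u) = \langle f, K_{u}\rangle_{K}$ for every $f$ in the RKHS, so that
\[
|f(u)| \;\le\; \|f\|_{K}\,\sqrt{K(u,u)}
\]
by the Cauchy–Schwarz inequality, which is the standard way to pass from norm bounds to uniform bounds.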

For convenience, we assume hereafter that the Mercer kernels on are symmetric in the sense that Examples include Mercer kernels of the form or , where and denote the Euclidean norm and inner product, respectively.

Since the best ranking rule is antisymmetric, it is reasonable to restrict ourselves to the subspace of antisymmetric functions in ; that is, For any , and with , it is easily seen that Conversely, any antisymmetric function with the above expression should satisfy , provided that the corresponding determinant does not vanish.

For a set of samples , let

For , the $\ell^1$-penalty regularized ranking rule is defined as the solution of the minimization problem where , known as the empirical risk, is given by
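For orientation, with a sample $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^{m}$, a common form of the empirical $\phi$-risk for ranking is the U-statistic
\[
\mathcal{E}_{\mathbf z}(f) \;=\; \frac{1}{m(m-1)} \sum_{i \ne j} \phi\bigl(\operatorname{sgn}(y_i - y_j)\, f(x_i, x_j)\bigr),
\]
and, in a coefficient-based scheme of this type, the penalty is typically imposed on the coefficients of the sample-dependent expansion of $f$ rather than on an RKHS norm; the notation $\mathcal{E}_{\mathbf z}$ is used here only for illustration.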

Associated with any ranking rule , we construct another ranking rule as follows: Clearly, gives the same ranking rule as and it satisfies

Hereafter, we denote . The goal of this paper is to bound the excess error , which, together with (4), in turn upper bounds the excess ranking error . The main result of this paper is to establish, under mild conditions, a learning rate for $\ell^1$-penalty regularized ranking rules with convex loss.

Classification with convex loss, in particular for -norm SVMs, has been the subject of many theoretical studies in recent years. The 1-norm SVMs for ranking, with the RKHS norm as regularizer, were investigated in [2, 7]. The $\ell^1$-penalty has been used in [8, 9] for classification problems in the framework of SVMs. It is well known that $\ell^1$-regularization usually leads to solutions with sparse representations (see, e.g., [10–13]). In this paper, we consider ranking with convex loss and the $\ell^1$-penalty.

In [2], the RKHS-norm SVM for ranking was proposed. However, it was implemented over a ball of rather than the whole RKHS ; that is, it solves the minimization problem A convergence rate for was established for the Gaussian kernel. The approximation error was not considered there. The asymptotic behavior of the same algorithm implemented over the whole RKHS was investigated in [7]. Moreover, a fast learning rate was obtained under some conditions.

We would like to mention the recent paper [14], where the error of a function is defined by . A convergence rate for the minimizer of the regularized empirical error was established there. The author made use of the estimation technique via integral operators developed in [15].

The rest of the paper is organized as follows. In Section 2, after making some assumptions, we state the main result, an upper bound for . As usual, it is decomposed into a sum of three terms: the sample error, the hypothesis error, and the approximation error. Sections 3 and 4 are devoted to the estimation of the hypothesis error and the sample error, respectively. A proof of the main result is given in Section 5.

2. Assumptions and Main Results

For the statement of the main results, we need to introduce some notions and make some assumptions.

Denote . The following assumption is a bound on the variance of , which has been adopted by many authors.

Assumption 2. There is a constant such that for any , where is a constant.
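A typical bound of this kind states that, for some exponent $\theta \in [0,1]$ and constant $c > 0$,
\[
\mathbb{E}\bigl[(q_f - q_{f_\phi})^{2}\bigr] \;\le\; c\,\bigl(\mathcal{E}(f) - \mathcal{E}(f_\phi)\bigr)^{\theta}
\qquad \text{for every } f,
\]
where $q_f$ denotes the loss of $f$ on a pair of observations; the symbols $\theta$, $c$, and $q_f$ are used here only for illustration.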

For , the assumption is satisfied [16] for

It was shown in [17] that if there is some positive constant such that then the assumption is satisfied for .

Suppose hereafter . We note that, for any Mercer kernel , We now construct a set of functions which contains and is independent of the samples.

Definition 3. The Banach space is defined as the function set on containing all functions of the form with the norm

Obviously, . By the definition of and (17), one has which implies that the series converges in . Consequently, and . The following also holds: , where for .

Denote . The approximation error of by with is defined as

Denote the minimizer

The next assumption is concerned with the approximation power of to .

Assumption 4. There are positive constants and such that
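In the form commonly used for regularization schemes, such an assumption requires a polynomial decay of the approximation error, say
\[
\mathcal{D}(\lambda) \;\le\; c_{\beta}\, \lambda^{\beta} \qquad \text{for all } \lambda > 0,
\]
for some $\beta \in (0,1]$ and $c_{\beta} > 0$; here $\mathcal{D}(\lambda)$ stands for the approximation error introduced above, and $\beta$, $c_{\beta}$ are illustrative names for the two constants.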

Recall . The above assumption is not too restrictive.

Assumption 5. (i) The kernel satisfies a Lipschitz condition of order with ; that is, there exists some such that (ii) The ranking loss has an increment exponent ; that is, there exist some constants such that where and denote the right- and left-sided derivatives of , respectively.
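For instance, the order-$s$ Lipschitz condition in (i) is often written as
\[
|K(u, w) - K(v, w)| \;\le\; c_{0}\, \|u - v\|^{\,s} \qquad \text{for all } u, v, w,
\]
with $0 < s \le 1$ and some constant $c_{0} > 0$; the names $c_{0}$ and $s$ are only illustrative.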

Assumption 6. The marginal distribution satisfies condition with ; that is, for some and any ball , one has

The last assumption concerns covering numbers. For a subset of a space with pseudometric and , the covering number is defined to be the minimal number of disks with radius needed to cover . When is compact, this number is finite.
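In symbols, for a set $S$, a pseudometric $d$, and a radius $\varepsilon > 0$, the covering number may be written as
\[
\mathcal{N}(S, \varepsilon) \;=\; \min\Bigl\{\, \ell \in \mathbb{N} \;:\; S \subset \bigcup_{j=1}^{\ell} \{\, t : d(t, s_j) \le \varepsilon \,\} \ \text{for some } s_1, \dots, s_\ell \,\Bigr\},
\]
where the notation $\mathcal{N}(S,\varepsilon)$ is used here generically.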

Assumption 7. (i) There are some and such that (ii) For , let . There are some constants such that

It was shown in [18], under Assumptions 5(i), 6, and 7(i), that the following holds: Therefore, (ii) in Assumption 7 holds provided that .

We are in a position to state the main result of this paper. The proof is given in Section 5.

Theorem 8. For any , under Assumptions 2–7, one has, with confidence at least , where and , are constants independent of or .

The first step of the proof is to decompose into errors of different types as follows: where is referred to as the sample error and as the hypothesis error. We bound the hypothesis error and the sample error in the next two sections, respectively.

In the estimation of the sample error, Hoeffding's decomposition of U-statistics, which breaks a U-statistic into a sum of i.i.d. random variables and a degenerate U-statistic (see Section 4 for details), is a useful tool.

3. Hypothesis Error

In this section, we bound the hypothesis error . This error is incurred when we switch from the minimizer of in to the minimizer of in . Such errors have been estimated in several papers, for example, [7, 18]. We note that, in contrast to [18, 19], the underlying spaces and here are sets of antisymmetric functions. We begin with the representations of the functions.

Lemma 9. Let . For any , one has a representation:

Proof. For any , there are sequences and such that and .
Denote . It follows from (5) that The proof is complete by .

A set is said to be -dense in if for any there exists some such that .
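Equivalently, in symbols, $S$ is $\varepsilon$-dense in $T$ when
\[
\sup_{t \in T}\ \inf_{s \in S}\ d(s, t) \;\le\; \varepsilon ,
\]
with $d$ the underlying (pseudo)metric.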

Proposition 10 ([19, Proposition 3.1]). Let be drawn independently according to . Then, for any , under Assumption 6 and (i) in Assumption 7, with confidence at least , is -dense in , where is a constant that depends only on and .

The hypothesis error is bounded by the following proposition.

Proposition 11. Suppose that Assumptions 5 and 6 hold. Then, for any , with confidence at least , there holds where is a constant independent of , and . (Hereafter, and denote constants which are independent of , or , and may change from line to line.)

Proof. The proof follows the line of [19, 20]. For any , let the representation with be given in Lemma 9.
By Proposition 10, with confidence at least , for any , there are some such that . For an integer , which will be determined later, denote . Then, by Assumption 7, where .
Choose such that . Therefore Consequently, , which together with yields, with confidence at least , On the other hand, since , the following holds by (12) and (9), with confidence at least : which together with (40) completes the proof by letting .

The above bound for is the same as that in [20]; both rely on the same density of in . However, the functions considered there are defined on , rather than on as in the present setting.

4. Sample Error

As in [2, 7], Hoeffding's decomposition plays an important role in the estimation of the sample error. For any , denote By Hoeffding's decomposition of U-statistics, we have where and, for any ,

Moreover, for any ranking rule , we denote . Then It is the deviation of a sum of independent random variables from its mean.

As seen in Hoeffding's decomposition (43), the first term is a sum of i.i.d. random variables and the second term is a degenerate U-statistic. The degeneracy means that = .
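In the standard notation, for a symmetric kernel $q$ of two observations and an i.i.d. sample $z_1, \dots, z_m$, Hoeffding's decomposition reads
\[
\frac{1}{m(m-1)} \sum_{i \ne j} q(z_i, z_j)
\;=\; \mathbb{E}\,q
\;+\; \frac{2}{m} \sum_{i=1}^{m} h_1(z_i)
\;+\; \frac{1}{m(m-1)} \sum_{i \ne j} h_2(z_i, z_j),
\]
where $h_1(z) = \mathbb{E}[q(z, Z')] - \mathbb{E}\,q$ and $h_2(z, z') = q(z, z') - h_1(z) - h_1(z') - \mathbb{E}\,q$; the last term is a degenerate U-statistic since $\mathbb{E}[h_2(z, Z')] = 0$ for every fixed $z$. The names $h_1$ and $h_2$ are used here only to make the structure explicit.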

Denote so that the sample error .

We first estimate . Since depends on , by (43) and (45), we need to consider the suprema of the sets where , containing , is (a subset of) a ball in .

With the above decomposition and the methods of [21], the following proposition was established for the ball and in [7]. Assumption 2 and the condition on the covering number of play a crucial role there. The arguments also work in the present setting, that is, for the ball . To see this, we note that satisfies . Therefore, under (ii) of Assumption 5 and (ii) of Assumption 7, the covering number of satisfies . The interested reader may refer to [7, 21] for the details.

Proposition 12. Let and . Under Assumption 2, (ii) of Assumption 5, and (ii) of Assumption 7, one has confidence at least , where is bounded by with being a constant independent of , and .

The estimation of the supremum of the U-process is more involved. The suprema of U-processes have been studied in several papers. The following lemma follows from the proof of [22, Theorem 3.2].

Lemma 13. Suppose that a function class satisfies the following conditions: (i) for any , and ; (ii) is uniformly bounded by a universal constant ; (iii) . Then where , are some constants independent of , and

Proposition 14. Suppose that (ii) in Assumption 7 holds. Then one has with confidence at least where and are some positive constants.

Proof. We first claim that satisfies the conditions in Lemma 13. Indeed, condition (i) holds by definition of and for any . Also, by definition of , for any , , implying (ii). Moreover, by (ii) of Assumption 5 we have provided (ii) in Assumption 7. This establishes (iii), as claimed.
Applying Markov's inequality to , with , and appealing to Lemma 13, we have ; that is, For a single function , it is known from [23, Proposition 2.3] that which together with (55) completes the proof.

The estimation for is easy since does not change with the set of samples.

Proposition 15. Suppose that Assumption 2 holds. For any , one has, with confidence at least , where are some constants independent of and .

Proof. Clearly, the function satisfies . Then, by Assumption 2, we conclude, as in [20, 21], with confidence at least , that
It remains to estimate . For any single function with and , we have by [23, Proposition 2.3] which together with implies, with confidence at least , that where and are constants. The proof is complete.

5. Proof of Theorem 8

Theorem 16. For such that and any , under Assumptions 2–7, one has, with confidence at least , where are constants independent of , or .

Proof. We note that By , a combination of Propositions 11, 12, 14, and 15 yields that, with confidence at least , where , are constants, and is bounded by (49).
Putting and into the implication relation we obtain (61) from (63). Therefore, the conditional probability of the event that inequality (61) holds, given the event , is at least . The proof is complete.

For any , denote the random event by . Obviously, for . However, to prove Theorem 8, a smaller with is desired. To this end, we apply the iteration technique for estimation of introduced in [17].

Recall that is given in Theorem 8. It is easily seen that, for , Therefore, we have, by Theorem 16, with the conditional probability at least , given , If , the above inequality becomes Consequently, given event with , we have with confidence at least ,

Let . By induction, it is easy to prove that = . Since , we have, with confidence at least , . Clearly, , where .

For any , let be the smallest integer such that . Substituting into (67), we bound the right-hand side of (67) by , with confidence at least , where . This completes the proof.

Acknowledgment

The authors thank Professor Di-Rong Chen for his help.