Research Article | Open Access

# Convergence Analysis of an Empirical Eigenfunction-Based Ranking Algorithm with Truncated Sparsity

**Academic Editor:** Sergei V. Pereverzyev

#### Abstract

We study an empirical eigenfunction-based algorithm for ranking with a data-dependent hypothesis space. The space is spanned by empirical eigenfunctions selected by means of a truncation parameter. We establish a representer theorem and a convergence analysis for the algorithm. In particular, we show that, under a mild condition, the algorithm achieves a satisfactory convergence rate while producing sparse representations with respect to the empirical eigenfunctions.

#### 1. Introduction

Motivated by applications in information retrieval, user-preference modeling, and computational biology, the problem of ranking has recently received much attention in machine learning (see, e.g., [1–4]). This paper proposes a kernel-based ranking algorithm that searches for a ranking function in a data-dependent hypothesis space. The space is spanned by empirical eigenfunctions selected by means of a truncation parameter. The notion of empirical eigenfunctions, first studied for learning algorithms in [5], has been used to develop classification and regression algorithms with strong learning ability [6, 7]. We use this idea to develop learning algorithms for ranking.

##### 1.1. The Ranking Problem

The problem of ranking is distinct from both classification and regression. In ranking, one learns a real-valued function that assigns scores to instances, but the scores themselves do not matter; instead, what is important is the relative ranking of instances induced by those scores.

Formally, the problem of ranking may be modeled in the framework of statistical learning theory (see, e.g., [8] for more details). Assume $\rho$ is a Borel probability measure on $Z = X \times Y$, where $X$ (the input or instance space) is a compact metric space and $Y = [-M, M]$ (the output space) for some $M > 0$. Let $\rho_X$ be its marginal distribution on $X$ and let $\rho(\cdot \mid x)$ be the conditional distribution on $Y$ at given $x$. The learner is given a set of samples $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m$ drawn independently and identically according to $\rho$, and the goal is to find a function $f : X \to \mathbb{R}$ that ranks future instances with larger labels higher than those with smaller labels. In other words, $x$ is to be ranked as preferred over $x'$ if $y > y'$ and lower than $x'$ if $y < y'$ ($y = y'$ indicates that there is no difference in ranking preference between the two instances). In this setting, the penalty of a ranking function $f$ on a pair of instances $(x, x')$ with corresponding labels $y$ and $y'$ can be taken to be the least squares ranking loss
$$\bigl(y - y' - (f(x) - f(x'))\bigr)^2 \tag{1}$$
and, as a result, the quality of $f$ can be measured by its expected ranking error:
$$\mathcal{E}(f) = \iint_{Z \times Z} \bigl(y - y' - (f(x) - f(x'))\bigr)^2 \, d\rho(x, y) \, d\rho(x', y'). \tag{2}$$

Let $L^2_{\rho_X}$ be the space of square integrable functions on $X$ with respect to the measure $\rho_X$. Let $\mathcal{F}_\rho$ be the collection of target functions, defined to be the functions minimizing the error (2) over $L^2_{\rho_X}$. It is apparent from (2) that the error of any ranking function of the form $f + c$ is the same as that of $f$, where $c$ is some constant. Therefore, unlike the target function in classification or regression, the target function in ranking is not unique in general. It is easy to show that the regression function $f_\rho$ defined by $f_\rho(x) = \int_Y y \, d\rho(y \mid x)$ is a minimizer of the error (2), which indicates that any function of the form $f_\rho + c$ is also a target function. On the other hand, any target function in $\mathcal{F}_\rho$ must have the form $f_\rho + c$, which can be checked from Lemma 11 in [9]. Thus we conclude that $\mathcal{F}_\rho$ consists exactly of the functions of the form $f_\rho + c$ with $c$ an arbitrary constant; that is, $\mathcal{F}_\rho = \{f_\rho + c : c \in \mathbb{R}\}$.
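
The invariance under additive constants noted above is a one-line computation, assuming the standard least squares ranking error:

```latex
% Adding a constant c to a ranking function leaves the expected error
% unchanged, because c cancels in the pairwise difference:
\mathcal{E}(f + c)
  = \iint_{Z \times Z} \bigl( y - y' - ((f(x) + c) - (f(x') + c)) \bigr)^2
      \, d\rho(x, y) \, d\rho(x', y')
  = \iint_{Z \times Z} \bigl( y - y' - (f(x) - f(x')) \bigr)^2
      \, d\rho(x, y) \, d\rho(x', y')
  = \mathcal{E}(f).
```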

##### 1.2. The Mercer Kernel and Empirical Eigenpair

Our algorithm is based on a Mercer kernel and we need to introduce some notions related to kernels (see [10, 11] for more details). Recall that a Mercer kernel is defined to be a symmetric continuous function $K : X \times X \to \mathbb{R}$ such that, for any finite subset $\{x_1, \ldots, x_n\}$ of $X$, the matrix $(K(x_i, x_j))_{i,j=1}^n$ is positive semidefinite. The reproducing kernel Hilbert space $\mathcal{H}_K$ associated with a Mercer kernel $K$ is the Hilbert space obtained by completing the span of $\{K_x := K(x, \cdot) : x \in X\}$ under the norm induced by the inner product satisfying $\langle K_x, K_{x'} \rangle_K = K(x, x')$. The reproducing property in $\mathcal{H}_K$ takes the form $f(x) = \langle f, K_x \rangle_K$ for all $f \in \mathcal{H}_K$, $x \in X$, which indicates that $\|f\|_\infty \le \kappa \|f\|_K$, where $\kappa := \sup_{x \in X} \sqrt{K(x, x)}$.
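
As a quick numerical illustration of the Mercer (positive semidefiniteness) property, the following sketch builds the Gram matrix of a Gaussian kernel on a few points and checks symmetry and nonnegativity of its spectrum; the kernel and sample points are illustrative assumptions, not tied to the kernel used in the paper:

```python
import numpy as np

def gaussian_kernel(s, t, sigma=1.0):
    """Gaussian (RBF) kernel: a standard example of a Mercer kernel."""
    return np.exp(-np.abs(s - t) ** 2 / (2 * sigma ** 2))

# A finite subset of the input space (here X is a subset of the real line).
x = np.array([0.0, 0.3, 1.1, 2.5, 4.0])

# Gram matrix (K(x_i, x_j))_{i,j}.
K = gaussian_kernel(x[:, None], x[None, :])

# Mercer property: the Gram matrix is symmetric and positive semidefinite.
assert np.allclose(K, K.T)
eigvals = np.linalg.eigvalsh(K)
assert eigvals.min() >= -1e-10  # all eigenvalues (numerically) nonnegative
```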

Let be a Mercer kernel. The integral operator given by is introduced in [12] to analyze the following regularized ranking algorithm: This operator is compact, positive, and self-adjoint. In particular, it has at most countably many nonzero eigenvalues, all of which are nonnegative. Let us arrange these eigenvalues (with multiplicities) as a nonincreasing sequence tending to zero and take an associated sequence of eigenfunctions to be an orthonormal basis of . In the remainder of this paper, we will use the general assumption that where the power of is defined in terms of and by The assumption (7) is equivalent to , where are the Fourier coefficients of with respect to ; that is, . Note from (8) that the exponent measures the decay of the coefficients of with respect to the orthonormal basis of . Thus it can be regarded as a measure of the regularity of the regression function .

Let be the unlabeled part of the samples . An empirical version of the operator with respect to is given by The operator is self-adjoint and positive with rank at most . We denote its eigensystem to be , where the eigenvalues are arranged in nonincreasing order with whenever and the corresponding eigenfunctions form an orthonormal basis of . It can be proved that , which means that the eigenfunctions can be approximated by the empirical eigenfunctions . This fact indicates that the first eigenfunctions are reasonably promising for ranking.
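
The eigenstructure described above can be explored numerically. The following sketch uses the classical empirical integral operator, represented on the samples by the scaled Gram matrix $K/m$, as a stand-in for the operator in the text (whose pairwise form is not reproduced here); the Gaussian kernel, bandwidth, and sampling distribution are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def gram(x, sigma=0.5):
    """Gram matrix of a Gaussian kernel on the sample points."""
    return np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

def empirical_spectrum(m):
    """Eigenvalues of the empirical operator, represented by K/m on the samples."""
    x = rng.uniform(0.0, 1.0, size=m)
    return np.linalg.eigvalsh(gram(x) / m)[::-1]  # nonincreasing order

lam = empirical_spectrum(200)

# The empirical operator is positive with rank at most m: its eigenvalues are
# nonnegative (up to round-off) and arranged in nonincreasing order.
assert np.all(lam >= -1e-10)
assert np.all(np.diff(lam) <= 1e-12)

# The spectrum decays fast: only a handful of eigenvalues are non-negligible,
# which is what makes truncation to the leading eigenfunctions attractive.
print(int(np.sum(lam > 1e-6 * lam[0])))
```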

##### 1.3. The Computation of Empirical Eigenpair

Before proceeding further, we need to show how the empirical eigenpairs can be computed explicitly. The main difficulty here is that the kernel of is not symmetric even though it is a self-adjoint operator on , which makes the computation of the empirical eigenfunctions relatively difficult. (We refer the reader to [13] for some results on regression learning with indefinite kernels.)

Denote the symmetric matrix by and define and , where is the th-order unit matrix and . The proofs of Lemmas 1–5 can be found in [14].

Lemma 1. *Let be an eigenpair of . If , then is an eigenpair of .*

It is easy to see that is a positive semidefinite matrix. Denote its eigenvalues as with being the rank of and the corresponding orthonormal eigenvectors as .

Lemma 2. *Let be eigenpairs of . Then, for , one has
*

Lemma 3. *Let be eigenpairs of . Then, for , , one has
*

It follows from Lemmas 2 and 3 that the numbers are eigenvalues of the operator and the functions are the corresponding orthonormal eigenfunctions. Moreover, we have , where denotes the rank of .

Lemma 4. *Let be eigenpairs of . Then, for , one has
**
where is the vector in obtained by restricting the function onto the sampling points.*

Lemma 5. *Let be eigenpairs of . Then, for , , one has
*

By using Lemmas 4 and 5, we come to a conclusion that the numbers are eigenvalues of the matrix , the vectors are corresponding orthonormal eigenvectors, and .

Based on the above arguments, we arrive at the following theorem, which yields a method for computing the empirical eigenpairs explicitly.

Theorem 6. *The number of positive eigenvalues of is equal to that of . Moreover, the empirical eigenpairs of can be computed by using the eigenpairs of as follows:
**
for with denoting the rank of .*
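
Theorem 6 recovers operator eigenpairs from matrix eigenpairs. The same Nyström-type recipe is easiest to verify for the classical empirical integral operator $Lf = \frac{1}{m}\sum_{i} f(x_i)K(\cdot, x_i)$: a unit eigenvector $u$ of $K/m$ with eigenvalue $\hat{\lambda} > 0$ extends to the eigenfunction $\hat{\varphi}(x) = (m\hat{\lambda})^{-1/2}\sum_{j} u_j K(x, x_j)$. The sketch below checks this identity for that classical case (not the paper's specific pairwise operator); the kernel and data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def k(s, t, sigma=0.5):
    return np.exp(-(s - t) ** 2 / (2 * sigma ** 2))

m = 60
x = rng.uniform(0.0, 1.0, size=m)
K = k(x[:, None], x[None, :])

# Eigenpairs of the matrix K/m, largest eigenvalue first.
lam, U = np.linalg.eigh(K / m)
lam, U = lam[::-1], U[:, ::-1]

def eigenfunction(i):
    """Nystrom extension of the i-th empirical eigenfunction to all of X."""
    return lambda s: (k(s[:, None], x[None, :]) @ U[:, i]) / np.sqrt(m * lam[i])

# Check the eigenfunction equation (L phi)(s) = lam * phi(s) on a test grid,
# where (L phi)(s) = (1/m) * sum_j phi(x_j) * k(s, x_j).
phi = eigenfunction(0)
s = np.linspace(0.0, 1.0, 7)
L_phi = (k(s[:, None], x[None, :]) @ phi(x)) / m
assert np.allclose(L_phi, lam[0] * phi(s))
```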

##### 1.4. The Ranking Algorithm

Prompted by the above analysis, we propose a learning algorithm for ranking as follows. Let be a positive number (called a truncation parameter) and let denote the set of empirical eigenfunctions whose corresponding eigenvalues are not less than ; that is, . Let be the number of eigenfunctions in . Our ranking algorithm now takes the form and the output function is
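
Since the display defining problem (15) is not reproduced above, the following sketch illustrates only the truncation mechanism, with a generic least squares fit in the span of the retained empirical eigenvectors; the kernel, data, target, and threshold are all illustrative assumptions, not the paper's objective:

```python
import numpy as np

rng = np.random.default_rng(2)

def gram(x, sigma=0.5):
    return np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

# Toy labeled sample; the target function here is illustrative.
m = 80
x = rng.uniform(0.0, 1.0, size=m)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(m)

lam, U = np.linalg.eigh(gram(x) / m)
lam, U = lam[::-1], U[:, ::-1]

t = 1e-3                 # truncation parameter
keep = lam >= t          # retain eigenfunctions with eigenvalues >= t
n_t = int(keep.sum())    # dimension of the truncated hypothesis space

# Least squares fit of y in the span of the retained eigenvectors
# (a stand-in for problem (15), whose exact objective is not shown here).
coef = U[:, keep].T @ y
y_hat = U[:, keep] @ coef

print(n_t, m)  # the retained dimension is typically much smaller than m
```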

We are concerned in this paper with the representer theorem, that is, the explicit solution to problem (15), and the convergence analysis in the -norm of the above algorithm. Previous work on error analysis of ranking algorithms, such as [3, 8], deals only with generalization properties of the algorithms. Though convergence analysis of classification and regression algorithms has been well studied (see, e.g., [15, 16]), little research has been conducted in establishing similar results in the setting of ranking. Perhaps the first work is that of Chen [12], who derives the convergence rate of a regularized ranking algorithm by means of the technique of operator approximation. Our results can be considered as another attempt in this direction. It should be pointed out that, for the sake of simplicity, rather than taking all target functions into consideration, we will here restrict ourselves to the regression function . In other words, we will consider the convergence bounds for , instead of .

Notice that, compared with classification or regression problems, the main difference in the formulation of ranking problems is that its performance or loss is measured on pairs of examples, rather than on individual examples. This results in the double-index summation in algorithm (15), which prevents us from directly applying the standard Hoeffding inequality used to obtain convergence bounds for classification and regression. We will tackle this problem by a McDiarmid-Bernstein type probability inequality for vector-valued random variables [17] as is done in [8, 12]. Finally, we show that when the eigenvalues decay polynomially, the algorithm produces sparse representations with respect to the empirical eigenfunctions by choosing a suitable parameter .
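
The pairwise structure can be made concrete: the empirical least squares ranking error is a double-indexed average over all sample pairs, which is exactly what blocks a direct application of the standard Hoeffding inequality for sums of independent terms. A minimal sketch (data and scorers are illustrative):

```python
import numpy as np

def empirical_ranking_error(f, x, y):
    """Empirical least squares ranking error: average of
    (y_i - y_j - (f(x_i) - f(x_j)))^2 over all ordered pairs (i, j)."""
    fx = f(x)
    d_y = y[:, None] - y[None, :]    # y_i - y_j
    d_f = fx[:, None] - fx[None, :]  # f(x_i) - f(x_j)
    return np.mean((d_y - d_f) ** 2)

x = np.array([0.1, 0.4, 0.6, 0.9])
y = np.array([0.0, 1.0, 1.0, 2.0])

# A scorer that reproduces the labels up to an additive constant incurs zero
# pairwise error: the ranking loss only sees differences of scores.
assert empirical_ranking_error(lambda s: y + 5.0, x, y) == 0.0
```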

#### 2. The Representer Theorem

In this section we provide the representer theorem for algorithm (15). The key point in proving the representer theorem is an equality involving the empirical eigenfunctions and eigenvalues.

Theorem 7. *The solution to problem (15) is given by
**
where .*

*Proof.* The empirical error part takes the form
A routine computation gives rise to
By using (19), one can carry on with the above equality chain as follows:
Hence we have an equivalent form of (15) as
The component can be found by solving the following optimization problem:
which has the solution given by (17). This proves the theorem.

#### 3. The Error Analysis

In order to derive an error bound for algorithm (15), we need some preliminary inequalities. The following Hoffman-Wielandt inequality establishes the relationship between and , which has been investigated in [18–21].

Lemma 8. *One has
**
where is the Hilbert-Schmidt norm of , the Hilbert space of all Hilbert-Schmidt operators on , with inner product . Here denotes the trace of a linear operator.*

The inner product in can also be defined by , where is an orthonormal basis of . The space is a subspace of the space of bounded linear operators on , denoted as , with the norm relations and .

To bound the quantity , we introduce the following McDiarmid-Bernstein type probability inequality for vector-valued random variables, established in [17].

Lemma 9. *Let be independently drawn according to a probability distribution on , a Hilbert space, and measurable. If there is such that for each and almost every , then for every ,
**
where . For any , with confidence , there holds
*
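
Lemma 9 controls deviations of empirical means of bounded Hilbert-space-valued random variables at the scale $1/\sqrt{m}$. This qualitative behavior can be checked by simulation; the distribution, dimension, and sample sizes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def mean_deviation(m, trials=200, dim=5):
    """Average norm of (empirical mean - true mean) for bounded random vectors."""
    devs = []
    for _ in range(trials):
        xi = rng.uniform(-1.0, 1.0, size=(m, dim))  # bounded, mean zero
        devs.append(np.linalg.norm(xi.mean(axis=0)))
    return float(np.mean(devs))

d_small, d_large = mean_deviation(50), mean_deviation(800)

# Increasing m by a factor of 16 should shrink the deviation by about 4,
# consistent with the 1/sqrt(m) rate in the lemma.
assert d_large < d_small
```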

From the fact (see [22]) and after tedious calculations, one can derive that, for each , . By Lemma 9, we obtain the following.

Lemma 10. *For any , with confidence , there holds
*

Define a vector-valued function by It is easy to show that and for each , . By Lemma 9, one has the following.

Lemma 11. *For any , with confidence , there holds
*

With the help of the preceding five lemmas, we are now in a position to derive an error estimate for the algorithm. We will conduct analysis for the error in the -metric, which makes the corresponding error estimate stronger than that performed in the -metric [16].

Theorem 12. *Assume (7). For any , with confidence , one has
**
where , , and are constants independent of and (given explicitly in the proof).*

*Proof.* By Lemmas 10 and 11, we know that for any there exists a subset of of measure at least such that both (26) and (28) hold for each .

Let . It follows from the orthogonal expansion in terms of the orthonormal basis that
We bound the first term on the right-hand side of (30) by decomposing it further into two parts with :
The part with is easy to deal with since is an orthonormal basis; we have
where the last inequality follows from . The part with can be estimated by the Schwarz inequality as

We continue to bound in two cases.

*Case 1.* For , we observe that
From the definition of the Hilbert-Schmidt norm, we have
By Lemma 8, we get
*Case 2.* We notice that and obtain from the above estimate

The bounds for the two cases together with (32) give a bound for as
Here

Now we turn to the second term on the right-hand side of (30). Note that
By Theorem 7, we have
Thus, for any , with confidence , we have
Putting the bounds for and into (26), we know that, with confidence , can be bounded by
Let , , and . Then the conclusion of Theorem 12 follows by scaling to .

To illustrate the error estimate, we establish a learning rate for algorithm (15) in the special case when the eigenvalues decay polynomially.

Theorem 13. *Assume (7) and for some and , , the eigenvalues decay polynomially as
**
Let such that for or for . Then, for any , with confidence , we have
**
where and are constants independent of and (given explicitly in the proof).*

*Proof.* It follows from the asymptotic behavior of the eigenvalues that
For , we have
Combining (46) and (48), we see that
By choosing such that , we have

Similarly, for , we have
Thus, we have, in light of ,
By choosing such that , we have
This completes the proof of the theorem with and .

*Remark 14.* The truncation parameter in algorithm (15) plays the role of the regularization parameter instead of . Thus error bounds for our algorithm are closely related to the truncation parameter. Note that our learning rates are given in terms of special choices of the truncation parameter, which depend on the a priori condition (7). However, methods for determining the truncation parameter directly from the data would be preferable for practical learners. This is a direction for future research.

*Remark 15.* Note that when is large enough (meaning that has high regularity), the learning rate behaves like . Moreover, the nonzero coefficients in are at most , which is much smaller than the sample size when is large. Thus, our algorithm produces sparse representations with respect to the empirical eigenfunctions under a mild condition.
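
The sparsity count behind Remark 15 is elementary: under polynomial eigenvalue decay, only finitely many eigenvalues clear the truncation threshold. Assuming decay of the form $\lambda_i \le C i^{-\alpha}$ with illustrative constants, one obtains:

```latex
% Number of empirical eigenfunctions surviving the truncation at level t:
% if \lambda_i \le C i^{-\alpha}, then \lambda_i \ge t forces
% i \le (C/t)^{1/\alpha}, so
N_t \;=\; \#\{\, i : \lambda_i \ge t \,\} \;\le\; \Bigl(\frac{C}{t}\Bigr)^{1/\alpha}.
% With t chosen as a negative power of the sample size m (as in Theorem 13),
% N_t grows only as a small power of m, far below the sample size itself.
```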

#### Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

#### Acknowledgments

This work was supported by the National Natural Science Foundation of China (nos. 11301052, 11301045, 11271060, 11290143, 61227004, and 61100130), the Fundamental Research Funds for the Central Universities (no. DUT13LK45), and the Fundamental Research of Civil Aircraft (no. MJ-F-2012-04).

#### References

1. S. Clemencon, G. Lugosi, and N. Vayatis, "Ranking and empirical minimization of $U$-statistics," *The Annals of Statistics*, vol. 36, no. 2, pp. 844–874, 2008.
2. S. Clémençon and N. Vayatis, "Ranking the best instances," *Journal of Machine Learning Research*, vol. 8, pp. 2671–2699, 2007.
3. W. Rejchel, "On ranking and generalization bounds," *Journal of Machine Learning Research*, vol. 13, pp. 1373–1392, 2012.
4. A. Slivkins, F. Radlinski, and S. Gollapudi, "Ranked bandits in metric spaces: learning diverse rankings over large document collections," *Journal of Machine Learning Research*, vol. 14, pp. 399–436, 2013.
5. G. Blanchard, P. Massart, R. Vert, and L. Zwald, "Kernel projection machine: a new tool for pattern recognition," in *Advances in Neural Information Processing Systems (NIPS '04)*, pp. 1649–1656, 2004.
6. H. Chen, H. Xiang, Y. Tang, Z. Yu, and X. Zhang, "Approximation analysis of empirical feature-based learning with truncated sparsity," in *Proceedings of the International Conference on Wavelet Analysis and Pattern Recognition (ICWAPR '12)*, pp. 118–124, Xi'an, China, July 2012.
7. X. Guo and D. X. Zhou, "An empirical feature-based learning algorithm producing sparse approximations," *Applied and Computational Harmonic Analysis*, vol. 32, pp. 389–400, 2012.
8. S. Agarwal and P. Niyogi, "Generalization bounds for ranking algorithms via algorithmic stability," *Journal of Machine Learning Research*, vol. 10, pp. 441–474, 2009.
9. T. Hu, J. Fan, Q. Wu, and D. X. Zhou, "Learning theory approach to minimum error entropy criterion," *Journal of Machine Learning Research*, vol. 14, pp. 377–397, 2013.
10. N. Aronszajn, "Theory of reproducing kernels," *Transactions of the American Mathematical Society*, vol. 68, pp. 337–404, 1950.
11. F. Cucker and D. X. Zhou, *Learning Theory: An Approximation Theory Viewpoint*, vol. 24 of *Cambridge Monographs on Applied and Computational Mathematics*, Cambridge University Press, Cambridge, UK, 2007.
12. H. Chen, "The convergence rate of a regularized ranking algorithm," *Journal of Approximation Theory*, vol. 164, no. 12, pp. 1513–1519, 2012.
13. H. W. Sun and Q. Wu, "Indefinite kernel network with dependent sampling," *Analysis and Applications*, vol. 11, no. 5, Article ID 1350020, 15 pages, 2013.
14. M. Xu, Q. Fang, and S. F. Wang, "On empirical eigenfunction-based ranking with ${l}^{1}$ norm regularization," submitted.
15. F. Bauer, S. Pereverzev, and L. Rosasco, "On regularization algorithms in learning theory," *Journal of Complexity*, vol. 23, no. 1, pp. 52–72, 2007.
16. S. Smale and D. X. Zhou, "Learning theory estimates via integral operators and their approximations," *Constructive Approximation*, vol. 26, no. 2, pp. 153–172, 2007.
17. S. Mukherjee and D. X. Zhou, "Learning coordinate covariances via gradients," *Journal of Machine Learning Research*, vol. 7, pp. 519–549, 2006.
18. R. Bhatia and L. Elsner, "The Hoffman-Wielandt inequality in infinite dimensions," *Proceedings of the Indian Academy of Sciences: Mathematical Sciences*, vol. 104, no. 3, pp. 483–494, 1994.
19. A. J. Hoffman and H. W. Wielandt, "The variation of the spectrum of a normal matrix," *Duke Mathematical Journal*, vol. 20, pp. 37–39, 1953.
20. T. Kato, "Variation of discrete spectra," *Communications in Mathematical Physics*, vol. 111, no. 3, pp. 501–504, 1987.
21. V. Koltchinskii and E. Giné, "Random matrix approximation of spectra of integral operators," *Bernoulli*, vol. 6, no. 1, pp. 113–167, 2000.
22. E. De Vito, A. Caponnetto, and L. Rosasco, "Model selection for regularized least-squares algorithm in learning theory," *Foundations of Computational Mathematics*, vol. 5, no. 1, pp. 59–85, 2005.

#### Copyright

Copyright © 2014 Min Xu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.