Abstract
Learning to rank has become important in recent years due to its successful application in information retrieval, recommender systems, computational biology, and other fields. Ranking support vector machine (RankSVM) is one of the state-of-the-art ranking models and has been widely used. Nonlinear RankSVM (RankSVM with nonlinear kernels) can give higher accuracy than linear RankSVM (RankSVM with a linear kernel) for complex nonlinear ranking problems. However, the learning methods for nonlinear RankSVM are still time-consuming because of the calculation of the kernel matrix. In this paper, we propose a fast ranking algorithm based on kernel approximation to avoid computing the kernel matrix. We explore two types of kernel approximation methods, namely, the Nyström method and random Fourier features. The primal truncated Newton method is used to optimize the pairwise L2-loss (squared hinge loss) objective function of the ranking model after the nonlinear kernel approximation. Experimental results demonstrate that our proposed method trains much faster than kernel RankSVM and achieves comparable or better performance than state-of-the-art ranking algorithms.
1. Introduction
Learning to rank is an important research area in machine learning. It has attracted the interest of many researchers because of its growing application in areas such as information retrieval systems [1], recommender systems [2, 3], machine translation, and computational biology [4]. For example, in the document retrieval domain, a ranking model is trained on the training data of some queries. Each query contains a group of corresponding retrieved documents and their relevance levels labeled by humans. When a new query arrives for prediction, the trained model is used to rank the corresponding retrieved documents for the query.
Many types of machine learning algorithms have been proposed for the ranking problem. Among them, RankSVM [5], which is extended from the basic support vector machine (SVM) [6], is one of the most commonly used methods. The basic idea of RankSVM is to transform the ranking problem into a pairwise classification problem. The early implementation of RankSVM [7] was slow because the explicit pairwise transformation led to a large number of training samples. In order to accelerate the training process, [8] proposed a primal Newton method to solve the linear RankSVM problem without the need for an explicit pairwise transformation, and [9] proposed a RankSVM based on the structured output learning framework.
As with the SVM, the kernel trick can be used to generalize the linear ranking problem to the nonlinear case for RankSVM [7, 9]. Kernel RankSVM can give higher accuracy than linear RankSVM for complex nonlinear ranking problems [10]. The nonlinear kernel maps the original features into a high-dimensional space where the nonlinear problem can be ranked linearly. However, the training time of kernel RankSVM grows dramatically as the training data set increases in size. The computational complexity is at least quadratic in the number of training examples because of the calculation of the kernel matrix. Kernel approximation is an efficient way to solve this problem. It avoids computing the kernel matrix by explicitly generating a vector representation of the data that approximates the kernel similarity between any two data points.
The approximation methods can be classified into two categories: the Nyström method [11, 12] and random Fourier features [13, 14]. The Nyström method approximates the kernel matrix by a low-rank matrix. The random Fourier features method approximates a shift-invariant kernel based on the Fourier transform of a nonnegative measure [15]. In this paper, we use kernel approximation to solve the problem of the lengthy training time of kernel RankSVM.
To the best of our knowledge, this is the first work using kernel approximation to solve the learning to rank problem. We use two types of approximation methods, namely, the Nyström method and random Fourier features, to map the features into a high-dimensional space. After the approximation mapping, the primal truncated Newton method is used to optimize the pairwise L2-loss (squared hinge loss) function of the RankSVM model. Experimental results demonstrate that our proposed method achieves better performance and faster training than kernel RankSVM. Compared to state-of-the-art ranking algorithms, our proposed method also obtains comparable or better performance. MATLAB code for our algorithm is available online (https://github.com/KaenChan/rankkernelappr).
2. Background and Related Works
In this section, we present the background and related work on learning to rank algorithms and RankSVM.
2.1. Learning to Rank Algorithms
Learning to rank algorithms can be classified into three categories: the pointwise approach, the pairwise approach, and the listwise approach.
(i) Pointwise: it transforms the ranking problem into regression or classification on single objects. Existing regression or classification algorithms are then applied directly to model the labels of single objects. This approach includes McRank [16] and OC SVM [17].
(ii) Pairwise: it transforms the ranking problem into regression or classification on object pairs. It can model the preferences within the object pairs. This approach includes RankSVM [5] and RankBoost [18].
(iii) Listwise: it takes ranking lists as instances in both learning and prediction and can optimize the listwise loss function directly. This approach includes ListNet [19], AdaRank [20], BoltzRank [21], and SVM MAP [22].
In this paper, we focus on the pairwise ranking algorithm based on SVM.
2.2. Linear RankSVM
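Linear RankSVM is a commonly used pairwise ranking algorithm [5]. For the web search problem with $n$ queries and a set of documents for each query, features $x_i \in \mathbb{R}^d$ are extracted from the query-document pair $(q_i, d_i)$, and the label $y_i$ is the relevance level of the document $d_i$ to the query $q_i$. Thus, the training data is a set of label-query-instance tuples $(y_i, q_i, x_i)$, $i = 1, \ldots, l$. Let $P$ denote the set of preference pairs. If $(i, j) \in P$, then $x_i$ and $x_j$ are in the same query ($q_i = q_j$) and $x_i$ is preferred over $x_j$ ($y_i > y_j$). The goal of linear RankSVM is to get a ranking function
$$f(x) = w^T x \quad (1)$$
such that $f(x_i) > f(x_j)$ for all $(i, j) \in P$, where $w \in \mathbb{R}^d$.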
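The construction of the preference-pair set can be illustrated with a short Python sketch. It assumes that labels and query identifiers are given as plain lists; the function name `preference_pairs` and the toy data are hypothetical and are not taken from the authors' code.

```python
from itertools import combinations

def preference_pairs(y, q):
    """Return index pairs (i, j) from the same query where item i has the higher label."""
    by_query = {}
    for idx, qid in enumerate(q):
        by_query.setdefault(qid, []).append(idx)
    pairs = []
    for members in by_query.values():
        for a, b in combinations(members, 2):
            if y[a] > y[b]:
                pairs.append((a, b))
            elif y[b] > y[a]:
                pairs.append((b, a))
            # Items with equal labels produce no preference pair.
    return pairs

# Toy data: two queries with three documents each, labels in {0, 1, 2}.
y = [2, 1, 0, 1, 1, 0]
q = [1, 1, 1, 2, 2, 2]
P = preference_pairs(y, q)
print(P)
```

Because pairs are only formed within a query, the number of pairs grows quadratically with the number of documents per query, which is why the explicit pairwise transformation becomes expensive.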
RankSVM generalizes well because of its margin-maximization property. According to [27], the margin is defined as the closest distance between two data points when the data points are projected onto the ranking vector $w$:
$$\text{margin} = \min_{(i,j) \in P} \frac{w^T (x_i - x_j)}{\|w\|}. \quad (2)$$
Maximizing the margin is good because data point pairs with small margins represent very uncertain ranking decisions. RankSVM is guaranteed to find a ranking vector with the maximum margin [27]. Figure 1 shows the margin maximization of four data points for linear RankSVM. The two linear ranking weight vectors, namely, $w_1$ and $w_2$, can both rank the four data points correctly, but $w_1$ generalizes better than $w_2$ because the margin of $w_1$ is larger than the margin of $w_2$.
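For L1-loss (hinge loss) linear RankSVM [5], the objective function is
$$\min_{w} \; \frac{1}{2} w^T w + C \sum_{(i,j) \in P} \max\left(0,\, 1 - w^T (x_i - x_j)\right), \quad (3)$$
where $C > 0$ is the regularization parameter. Equation (3) can be solved by standard SVM classification on the pairwise difference vectors $x_i - x_j$. However, this method is very slow because of the large size of $P$.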
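In [8], an efficient algorithm was proposed to solve the L2-loss (squared hinge loss) linear RankSVM problem
$$\min_{w} \; \frac{1}{2} w^T w + C \sum_{(i,j) \in P} \max\left(0,\, 1 - w^T (x_i - x_j)\right)^2. \quad (4)$$
They used a sparse matrix $A \in \mathbb{R}^{p \times l}$, with $p = |P|$, to obtain the pairwise difference training samples implicitly (as the rows of $AX$). If $(i, j) \in P$, there exists a row index $k$ such that $A_{ki} = 1$ and $A_{kj} = -1$, and the rest of the row is 0. Let $X = [x_1, \ldots, x_l]^T$. Equation (4) can be written as
$$\min_{w} \; \frac{1}{2} w^T w + C\, (e - AXw)^T D\, (e - AXw), \quad (5)$$
where $e$ is the vector of all ones and $D$ is a $p \times p$ diagonal matrix with $D_{kk} = 1$ if $(AXw)_k < 1$ and 0 otherwise. Then, (5) is optimized by the primal truncated Newton method, in which each Hessian-vector product costs $O(ld + p)$.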
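As a minimal sketch of the implicit pairwise transformation, the following Python code builds the pair matrix with one row per preference pair (+1 for the preferred item, -1 for the other) and evaluates the squared hinge ranking objective. A dense NumPy array is used only to keep the example short; a sparse matrix would be used in practice. The names `pair_matrix` and `l2_rank_loss` are hypothetical.

```python
import numpy as np

def pair_matrix(pairs, n_samples):
    """Row k has +1 at column i and -1 at column j for the k-th pair (i, j)."""
    A = np.zeros((len(pairs), n_samples))
    for k, (i, j) in enumerate(pairs):
        A[k, i] = 1.0
        A[k, j] = -1.0
    return A

def l2_rank_loss(w, X, A, C):
    """0.5 * ||w||^2 + C * sum of squared hinge losses over all pairs."""
    margins = A @ (X @ w)                    # w^T (x_i - x_j) for every pair
    slack = np.maximum(0.0, 1.0 - margins)   # only violated pairs contribute
    return 0.5 * (w @ w) + C * np.sum(slack ** 2)

pairs = [(0, 1), (0, 2), (1, 2)]
X = np.array([[2.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
A = pair_matrix(pairs, X.shape[0])
w = np.array([1.0, 0.0])
loss = l2_rank_loss(w, X, A, C=1.0)
```

Note that the product is evaluated as `A @ (X @ w)` rather than `(A @ X) @ w`, so the pairwise difference vectors are never materialized.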
2.3. Kernel RankSVM
The key idea of the kernel method is that if the kernel function $k$ is positive definite, there exists a mapping $\phi$ into a reproducing kernel Hilbert space (RKHS) such that
$$k(x, x') = \langle \phi(x), \phi(x') \rangle, \quad (6)$$
where $\langle \cdot, \cdot \rangle$ denotes the inner product. The advantage of the kernel method is that the mapping $\phi$ never has to be calculated explicitly.
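For L1-loss RankSVM, the objective function with the kernel mapping has the form [7]
$$\min_{w} \; \frac{1}{2} w^T w + C \sum_{(i,j) \in P} \max\left(0,\, 1 - w^T \left(\phi(x_i) - \phi(x_j)\right)\right). \quad (7)$$
The primal problem (7) can be transformed into the dual problem using Lagrange multipliers:
$$\min_{\alpha} \; \frac{1}{2} \alpha^T Q \alpha - e^T \alpha \quad \text{subject to } 0 \le \alpha_{(i,j)} \le C, \; (i,j) \in P, \quad (8)$$
where each Lagrange multiplier $\alpha_{(i,j)}$ corresponds to the pair $(i, j)$ in $P$ and
$$Q_{(i,j),(u,v)} = \left(\phi(x_i) - \phi(x_j)\right)^T \left(\phi(x_u) - \phi(x_v)\right) = k(x_i, x_u) + k(x_j, x_v) - k(x_i, x_v) - k(x_j, x_u).$$
Solving the kernel RankSVM is a large quadratic programming problem. Instead of computing the matrix $Q$ directly, we can save cost by writing $Q = A K A^T$ with the matrix $A$ in (5), where $K$ is the kernel matrix with $K_{ij} = k(x_i, x_j)$. The ranking function of the kernel RankSVM has the form
$$f(x) = \sum_{(i,j) \in P} \alpha_{(i,j)} \left(k(x_i, x) - k(x_j, x)\right).$$
Even with this factorization, computing $Q$ requires $O(l^2)$ kernel evaluations, so it is difficult to scale kernel RankSVM to large data sets by solving (8).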
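One way to read the remark about reusing the pair matrix is that the dual matrix over pairs can be obtained from the ordinary kernel matrix over samples by multiplying with the pair matrix on both sides. The sketch below checks this identity numerically for an RBF kernel. It is an illustration under that assumption, with hypothetical helper names, and not the authors' implementation.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """RBF kernel matrix with entries exp(-gamma * ||x - y||^2)."""
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def q_entry(K, pair_a, pair_b):
    """Kernel between two pairwise difference vectors, written with four kernel values."""
    (i, j), (u, v) = pair_a, pair_b
    return K[i, u] + K[j, v] - K[i, v] - K[j, u]

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
pairs = [(0, 1), (0, 2), (3, 1)]

A = np.zeros((len(pairs), X.shape[0]))
for k, (i, j) in enumerate(pairs):
    A[k, i], A[k, j] = 1.0, -1.0

K = rbf_kernel(X, X, gamma=0.5)   # l x l kernel matrix: l^2 kernel evaluations
Q = A @ K @ A.T                   # p x p matrix over preference pairs
```

The factorized form needs only the sample-level kernel matrix, but that matrix is still quadratic in the number of training examples, which is the bottleneck that kernel approximation removes.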
Several methods have been proposed to accelerate the training of kernel RankSVM, such as the 1-slack structural method [9], representer theorem reformulation [27], and pairwise problem reformulation [10]. However, these methods are still slow for large-scale ranking problems because the computational cost is at least quadratic in the number of training examples.
3. RankSVM with Kernel Approximation
3.1. A Unified Model
The drawback of kernel RankSVM is that it needs to store many kernel values $k(x_i, x_j)$ during optimization. Moreover, $k(x_i, x)$ needs to be computed for new data $x$ during prediction, possibly for many vectors $x_i$. This problem can be solved by approximating the kernel mapping explicitly:
$$k(x, x') \approx \tilde{\phi}(x)^T \tilde{\phi}(x'),$$
where $\tilde{\phi}$ is the mapping of the kernel approximation. The original feature $x$ can be mapped into the approximated Hilbert space by $\tilde{\phi}$. The objective function of RankSVM with the kernel approximation can be written as
$$\min_{w} \; \frac{1}{2} w^T w + C \sum_{(i,j) \in P} \ell\left(w^T \left(\tilde{\phi}(x_i) - \tilde{\phi}(x_j)\right)\right), \quad (13)$$
where $\ell$ is a loss function for SVM, such as $\ell(t) = \max(0, 1 - t)$ for L1-loss SVM and $\ell(t) = \max(0, 1 - t)^2$ for L2-loss SVM. The problem in (13) can be solved using linear RankSVM after the approximation mapping. The kernel never needs to be calculated during the training process. Moreover, the weights $w$ can be computed directly without the need to store any training sample. For new data $x$, the ranking function is
$$f(x) = w^T \tilde{\phi}(x).$$
Our proposed method mainly includes a mapping process and a ranking process.
(i) Mapping process: the kernel approximation is used to map the original data into a high-dimensional space. We use two kinds of kernel approximation methods, namely, the Nyström method and random Fourier features, which will be discussed in Section 3.2.
(ii) Ranking process: the linear RankSVM is used to train a ranking model. We use the L2-loss RankSVM because of its high accuracy and fast training speed. The optimization procedure will be described in Section 3.3.
The Nyström method is data dependent, whereas the random Fourier features method is data independent [28]. The Nyström method usually gives a better approximation than random Fourier features but is slightly slower. Additionally, in the ranking process, the L2-loss RankSVM can be replaced with any other linear ranking algorithm, such as ListNet [19] or FRank [23].
3.2. Kernel Approximation
3.2.1. Nyström Method
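The Nyström method obtains a low-rank approximation of the kernel matrix $K \in \mathbb{R}^{l \times l}$, $K_{ij} = k(x_i, x_j)$, by uniformly sampling $m$ examples from $X$, denoted by $\hat{X} = \{\hat{x}_1, \ldots, \hat{x}_m\}$. Let $E = [k(x_i, \hat{x}_j)]_{l \times m}$ and $W = [k(\hat{x}_i, \hat{x}_j)]_{m \times m}$. The rows and columns of $K$ and $E$ can be rearranged as
$$K = \begin{bmatrix} W & K_{21}^T \\ K_{21} & K_{22} \end{bmatrix}, \qquad E = \begin{bmatrix} W \\ K_{21} \end{bmatrix},$$
where $K_{21} \in \mathbb{R}^{(l-m) \times m}$ and $K_{22} \in \mathbb{R}^{(l-m) \times (l-m)}$. Then the rank-$r$ approximation matrix of $K$ can be calculated as [11]
$$\tilde{K} = E\, W_r^{+} E^T,$$
where $W_r^{+}$ is the pseudoinverse of $W_r$ and $W_r$ is the best rank-$r$ approximation of $W$. The solution of $W_r$ can be obtained by singular value decomposition (SVD) of $W$, $W = U \Sigma U^T$, where $U$ is an orthonormal matrix and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_m)$ is the diagonal matrix with $\sigma_1 \ge \cdots \ge \sigma_m \ge 0$. The solution of $W_r^{+}$ can be obtained as
$$W_r^{+} = U_r \Sigma_r^{-1} U_r^T,$$
where $U_r$ is the first $r$ columns of $U$ and $\Sigma_r = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$. Thus, the nonlinear feature mapping of the Nyström method can be written as [28]
$$\tilde{\phi}(x) = \Sigma_r^{-1/2} U_r^T \left(k(x, \hat{x}_1), \ldots, k(x, \hat{x}_m)\right)^T.$$
The Nyström method is described in Algorithm 1. Counting $O(d)$ per kernel evaluation, the total time complexity of computing the approximation for $l$ samples is $O(lmd + lmk' + m^3)$ with $k' = r$. An approximation error bound for the Nyström method is given in [11].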
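The Nyström mapping described above can be sketched in NumPy as follows. This is a minimal illustration and not the authors' Algorithm 1: it assumes an RBF kernel with a hypothetical parameter `gamma`, uses an eigendecomposition of the landmark kernel matrix (which coincides with the SVD for a symmetric positive semidefinite matrix), and clips tiny eigenvalues for numerical safety. All function names are hypothetical.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """RBF kernel matrix with entries exp(-gamma * ||x - y||^2)."""
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_fit(X, m, r, gamma, rng):
    """Sample m landmarks uniformly and build the rank-r projection matrix."""
    idx = rng.choice(X.shape[0], size=m, replace=False)
    landmarks = X[idx]
    W = rbf_kernel(landmarks, landmarks, gamma)
    eigvals, eigvecs = np.linalg.eigh(W)          # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:r]           # indices of the r largest
    sigma_r = np.maximum(eigvals[top], 1e-12)     # clip to avoid division by zero
    proj = eigvecs[:, top] / np.sqrt(sigma_r)     # U_r * Sigma_r^{-1/2}, shape (m, r)
    return landmarks, proj

def nystrom_transform(X, landmarks, proj, gamma):
    """Map each row of X to its r-dimensional approximate feature vector."""
    return rbf_kernel(X, landmarks, gamma) @ proj

rng = np.random.default_rng(0)
X_train = rng.normal(size=(15, 3))
landmarks, proj = nystrom_fit(X_train, m=8, r=5, gamma=0.5, rng=rng)
Z_train = nystrom_transform(X_train, landmarks, proj, gamma=0.5)   # shape (15, 5)
```

The same `landmarks` and `proj` are reused to transform new data at prediction time, so no training sample other than the landmarks needs to be stored.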

3.2.2. Random Fourier Features
Random Fourier features is an efficient feature transformation method that approximates the kernel matrix by the inner product of relatively low-dimensional mappings.
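When the kernel $k(x, x') = k(x - x')$ is shift-invariant, continuous, and positive definite, the Fourier transform of the kernel can be written as
$$k(x - x') = \int p(\omega)\, e^{j \omega^T (x - x')}\, d\omega,$$
where $p(\omega)$ is a probability density function and $\omega \in \mathbb{R}^d$. According to Bochner's theorem [15], the kernel can be written as the expectation
$$k(x - x') = E_{\omega}\left[ e^{j \omega^T x} \left( e^{j \omega^T x'} \right)^{*} \right], \quad (20)$$
where $\omega$ is sampled from $p(\omega)$. Since $p(\omega)$ and $k$ are real, the complex exponentials can be replaced by cosines, $k(x - x') = E_{\omega, b}\left[ z_{\omega, b}(x)\, z_{\omega, b}(x') \right]$ with $z_{\omega, b}(x) = \sqrt{2} \cos(\omega^T x + b)$, where $b$ is drawn uniformly from $[0, 2\pi]$ [13]. The expectation in (20) can be approximated by the mean over $m$ Fourier components as
$$k(x, x') \approx \frac{1}{m} \sum_{i=1}^{m} z_{\omega_i, b_i}(x)\, z_{\omega_i, b_i}(x') = \tilde{\phi}(x)^T \tilde{\phi}(x'), \qquad \tilde{\phi}(x) = \sqrt{\frac{2}{m}} \left[ \cos(\omega_1^T x + b_1), \ldots, \cos(\omega_m^T x + b_m) \right]^T,$$
where $\omega_i$ is sampled from the distribution $p(\omega)$ and $b_i$ is uniformly sampled from $[0, 2\pi]$. The algorithm is described in Algorithm 2. The total time complexity of computing the approximation for $l$ samples is $O(lmd)$. An approximation error bound for the random Fourier features method is given in [14].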
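For concreteness, the following NumPy sketch builds random Fourier features for the RBF kernel $\exp(-\gamma \|x - x'\|^2)$, whose spectral density is a Gaussian with variance $2\gamma$ per dimension. The kernel parametrization by `gamma` and the function names are assumptions made for this illustration; this is not the authors' Algorithm 2.

```python
import numpy as np

def rff_fit(d, m, gamma, rng):
    """Draw m random frequencies and phases for the RBF kernel exp(-gamma * ||x - y||^2)."""
    omega = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, m))  # omega ~ N(0, 2*gamma*I)
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)                    # b ~ Uniform[0, 2*pi]
    return omega, b

def rff_transform(X, omega, b):
    """Map each row of X to sqrt(2/m) * cos(omega^T x + b)."""
    m = omega.shape[1]
    return np.sqrt(2.0 / m) * np.cos(X @ omega + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
omega, b = rff_fit(d=3, m=20000, gamma=0.5, rng=rng)
Z = rff_transform(X, omega, b)

# Compare the approximate kernel Z Z^T with the exact RBF kernel.
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K_exact = np.exp(-0.5 * sq_dist)
max_err = np.max(np.abs(Z @ Z.T - K_exact))
```

The frequencies and phases are drawn without looking at the data, which is what makes this method data independent; the error shrinks as the number of components `m` grows.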

3.3. Ranking Optimization
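In this section, we solve the L2-loss (squared hinge loss) ranking problem of (13) after the kernel approximation mapping of the training data:
$$\min_{w} \; \frac{1}{2} w^T w + C \sum_{(i,j) \in P} \max\left(0,\, 1 - w^T \left(\tilde{\phi}(x_i) - \tilde{\phi}(x_j)\right)\right)^2.$$
Similarly to (5), the loss function can be rewritten as
$$L(w) = \frac{1}{2} w^T w + C\, (e - A \tilde{X} w)^T D\, (e - A \tilde{X} w), \quad (23)$$
where $\tilde{X} = [\tilde{\phi}(x_1), \ldots, \tilde{\phi}(x_l)]^T$, $A$ is the sparse pair matrix from (5), $e$ is the vector of all ones, and $D$ is the diagonal matrix with $D_{kk} = 1$ if $(A \tilde{X} w)_k < 1$ and 0 otherwise. The gradient and the generalized Hessian matrix of (23) are
$$g = w + 2C\, \tilde{X}^T A^T D\, (A \tilde{X} w - e), \qquad H = I + 2C\, \tilde{X}^T A^T D A \tilde{X},$$
where $I$ is the identity matrix. With the truncated Newton method, the Hessian matrix does not need to be computed explicitly [8]. The Newton step $-H^{-1} g$ can be approximately computed using linear conjugate gradient (CG). The main computation of the linear CG method is the Hessian-vector multiplication
$$H v = v + 2C\, \tilde{X}^T A^T D A \tilde{X} v$$
for some vector $v$. Assuming that the embedding space has $m$ dimensions, each Hessian-vector multiplication costs $O(lm + p)$, where $p = |P|$. The main steps of our proposed algorithm are described in Algorithm 3. We calculate the approximation embedding $\tilde{\phi}$ using the Nyström method or random Fourier features in line (1). Then $\tilde{\phi}$ is applied to all training samples in line (2). The linear RankSVM model with the primal truncated Newton method is applied in the embedding space in lines (3)–(11).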
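The optimization step can be sketched as follows. This is a simplified illustration of a truncated Newton iteration for the squared hinge ranking loss, not the authors' Algorithm 3: it uses a dense pair matrix, a fixed number of CG iterations, and a simple backtracking line search, all of which are assumptions made for brevity. The function names are hypothetical.

```python
import numpy as np

def conjugate_gradient(hess_vec, g, max_iter=50, tol=1e-10):
    """Approximately solve H s = -g using only Hessian-vector products."""
    s = np.zeros_like(g)
    r = -g.copy()                 # residual of H s = -g at s = 0
    d = r.copy()
    rr = r @ r
    for _ in range(max_iter):
        if rr < tol:
            break
        Hd = hess_vec(d)
        alpha = rr / (d @ Hd)
        s += alpha * d
        r -= alpha * Hd
        rr_new = r @ r
        d = r + (rr_new / rr) * d
        rr = rr_new
    return s

def objective(w, Z, A, C):
    slack = np.maximum(0.0, 1.0 - A @ (Z @ w))
    return 0.5 * (w @ w) + C * np.sum(slack ** 2)

def train_rank_l2(Z, A, C, n_newton=20):
    """Truncated Newton training on embedded features Z with pair matrix A."""
    w = np.zeros(Z.shape[1])
    for _ in range(n_newton):
        margins = A @ (Z @ w)
        active = (margins < 1.0).astype(float)     # diagonal of D
        g = w + 2.0 * C * (Z.T @ (A.T @ (active * (margins - 1.0))))
        if np.linalg.norm(g) < 1e-8:
            break

        def hess_vec(v):
            # H v = v + 2C Z^T A^T D A Z v, evaluated right to left.
            return v + 2.0 * C * (Z.T @ (A.T @ (active * (A @ (Z @ v)))))

        step = conjugate_gradient(hess_vec, g)
        # Backtracking line search (an assumption of this sketch).
        t, f0 = 1.0, objective(w, Z, A, C)
        while objective(w + t * step, Z, A, C) > f0 + 1e-4 * t * (g @ step) and t > 1e-8:
            t *= 0.5
        w = w + t * step
    return w

# Toy problem: three items in one query, item 0 preferred over 1 over 2.
Z = np.array([[2.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
A = np.array([[1.0, -1.0, 0.0], [1.0, 0.0, -1.0], [0.0, 1.0, -1.0]])
w = train_rank_l2(Z, A, C=10.0)
```

The Hessian is never formed: every product is evaluated from right to left through matrix-vector multiplications, which is what keeps the per-iteration cost linear in the number of samples and pairs.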

4. Experiments
4.1. Experimental Settings
We use three data sets from LETOR (http://research.microsoft.com/en-us/um/beijing/projects/letor), namely, OHSUMED, MQ2007, and MQ2008, to validate our proposed ranking algorithm. The examples in the data sets are extracted from information retrieval data collections. These data sets are often used for evaluating new learning to rank algorithms. Table 1 lists the properties of the data sets. Mean average precision (MAP) [29] and normalized discounted cumulative gain (NDCG) [30] are chosen as the evaluation metrics for the performance of the ranking models.
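As a reference for how these metrics are computed, the following pure-Python sketch implements one common formulation of NDCG at a cutoff and of average precision for a single query. It assumes the gain $2^{rel} - 1$ and the discount $1 / \log_2(\text{rank} + 1)$; the official LETOR evaluation script may use a slightly different convention, so the numbers are illustrative only. MAP is the mean of the average precision over all queries.

```python
import math

def dcg_at_k(rels, k):
    """Discounted cumulative gain of the first k relevance labels, in ranked order."""
    return sum((2 ** r - 1) / math.log2(pos + 2) for pos, r in enumerate(rels[:k]))

def ndcg_at_k(rels, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

def average_precision(rels):
    """Average of precision values at the positions of relevant items (label > 0)."""
    hits, total = 0, 0.0
    for pos, r in enumerate(rels, start=1):
        if r > 0:
            hits += 1
            total += hits / pos
    return total / hits if hits else 0.0

# Relevance labels of one query's documents, listed in the order the model ranked them.
ranked_labels = [1, 0, 1]
ap = average_precision(ranked_labels)
ndcg = ndcg_at_k(ranked_labels, k=3)
```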
We compare our proposed method with linear and kernel RankSVM as follows:
(i) RankSVMPrimal [8]: it is discussed in Section 2.2 and solves the primal problem of linear L2-loss RankSVM (http://olivier.chapelle.cc/primal/).
(ii) RankSVMStruct [9]: it solves an equivalent 1-slack structural SVM problem with a linear kernel (http://www.cs.cornell.edu/People/tj/svm_light/svm_rank.html).
(iii) RankSVMTRON [10]: it solves the linear or kernel ranking SVM problem by a trust region Newton method (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/).
(iv) RankNyström: our proposed RankSVM with the Nyström kernel approximation.
(v) RankRandomFourier: our proposed RankSVM with the random Fourier features kernel approximation.
The hyperparameters of the algorithms are selected by grid search. The regularization parameter $C$ of each algorithm is chosen from a grid of candidate values. For kernel RankSVM and our approximation methods, the parameter $\gamma$ of the RBF kernel is likewise chosen from a grid of candidate values. For the MQ2007 dataset, the number of samples $m$ for kernel approximation is set to 2000, whereas $m = 500$ for the other datasets. All experiments are conducted on a high-performance server with a 2.0 GHz 16-core CPU and 64 GB of memory.
4.2. Comparison of the Nyström Method and Random Fourier Features
Figure 2 shows the performance comparison of RankSVM with the Nyström method and with random Fourier features on the MQ2007 dataset. We take the linear RankSVM algorithm, RankSVMPrimal, as the baseline, which is plotted as a dotted line. The remaining two lines represent RankNyström and RankRandomFourier, respectively. For small $m$ (the number of samples used for the approximation), the kernel approximation methods perform worse than linear RankSVM. As $m$ increases, however, both kernel approximation methods outperform linear RankSVM. We also observe that RankNyström obtains better results than RankRandomFourier when $m$ is small and that the two methods obtain similar results when $m$ is large.
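Figure 2: Performance comparison of RankNyström and RankRandomFourier against the linear RankSVMPrimal baseline on the MQ2007 dataset, panels (a) to (d).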
4.3. Comparison with Linear and Kernel RankSVM
In this part, we compare our proposed kernel approximation ranking algorithms with other linear and kernel RankSVM algorithms. We take $m = 2000$ for the kernel approximation. Table 2 gives the results of the different RankSVM algorithms on the first fold of the MQ2007 dataset. The linear RankSVM algorithms need less training time, but their MeanNDCG values are lower than those of the kernel RankSVM algorithms. Our kernel approximation methods obtain better performance than the kernel RankSVMTRON with much faster training on this dataset. The training time of our kernel approximation methods is about ten seconds, whereas the training time of the kernel RankSVMTRON is more than 13 hours. RankRandomFourier performs slightly better than RankNyström. Moreover, the L2-loss RankSVM obtains better performance than the L1-loss RankSVM on this dataset: the MeanNDCG of RankSVMPrimal (linear) is slightly higher than that of RankSVMTRON (linear). The kernel approximation methods also obtain a better MeanNDCG than RankSVMTRON with the RBF kernel.
4.4. Comparison with StateoftheArt
In this part, we compare our proposed algorithm with state-of-the-art ranking algorithms. Most of the results of the comparison algorithms come from the LETOR baselines. The remaining results come from the papers describing the algorithms. The hyperparameters $C$ and $\gamma$ of our proposed kernel approximation RankSVM are selected by grid search as in Section 4.1.
Table 3 provides the comparison of the testing NDCG and MAP results of different ranking algorithms on the TD2004 dataset. The number of samples $m$ for kernel approximation is set to 500. We observe that the kernel approximation ranking methods achieve the best performance on 3 of the 6 metrics. Also, the results of RankNyström and RankRandomFourier are similar.
Table 4 provides the performance comparison on the OHSUMED dataset, where $m$ is set to 500. We again observe that RankRandomFourier achieves the best performance on 3 of the 6 metrics. RankNyström obtains the best results on 2 metrics.
Table 5 provides the comparison of results on the MQ2007 dataset, where $m$ is set to 2000. We observe that RankNyström obtains the best scores on 3 metrics on the MQ2007 dataset. BLMART also achieves the best scores on 3 metrics. However, BLMART trains 10,000 LambdaMART models and creates a bagged model by randomly selecting a subset of these models, whereas our proposed RankNyström algorithm trains only one model.
5. Conclusions
In this paper, we propose a fast RankSVM algorithm with kernel approximation to solve the problem of the lengthy training time of kernel RankSVM. First, we propose a unified model for kernel approximation RankSVM. The approximation avoids computing the kernel matrix by explicitly approximating the kernel similarity between any two data points. Then, two types of methods, namely, the Nyström method and random Fourier features, are explored to approximate the kernel matrix. Also, the primal truncated Newton method is used to optimize the L2-loss (squared hinge loss) objective function of the ranking model. Experimental results indicate that our proposed method requires much less computational cost than kernel RankSVM and achieves comparable or better performance than state-of-the-art ranking algorithms. In the future, we plan to use more efficient kernel approximation and ranking models for large-scale ranking problems.
Competing Interests
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was mainly supported by Natural Science Foundation of China (61125201, 61303070, U1435219).