Abstract

We propose an adaptive reordered method for the PageRank problem. It has been shown that one can reorder the hyperlink matrix of the PageRank problem to obtain a reduced linear system and then recover the full PageRank vector through forward substitutions, which speeds up the computation of the PageRank vector. We observe that in the existing reordered method, the cost of the recursive reordering procedure can offset the computational savings gained by reducing the dimension of the linear system. Based on this observation, we introduce an adaptive reordered method that accelerates the total calculation by terminating the reordering procedure at an appropriate point instead of reordering to the end. Numerical experiments show the effectiveness of this adaptive reordered method.

1. Introduction

As the Internet develops rapidly, web search engines are becoming more and more important. As one of the most popular search engines, Google uses the famous PageRank algorithm to compute the PageRank vector, which amounts to computing, with the power method, the stationary distribution of the transition probability matrix that describes the web graph. The web graph contains many dangling nodes, that is, pages without links to other pages. These dangling nodes can cause storage and computational problems during the PageRank computation [1].

First, let us introduce some notation for the PageRank problem. The web structure with $n$ nodes can be expressed as a sparse hyperlink matrix $H$. The $(i,j)$th element of $H$ is $1/O_i$ if there is a link from node $i$ to node $j$, and $0$ otherwise, where $O_i$ is the number of outlinks of node $i$. This sparse matrix is called the transition probability matrix. The row sums are 1 for the nondangling nodes and 0 for the dangling nodes. The PageRank vector $\pi^T$ is the stationary distribution of the Markov chain related to $H$. Because of the dangling nodes, $H$ has rows whose entries are all 0, so $H$ is not stochastic. To remedy this, one can replace the $0^T$ rows with $v^T = e^T/n$, called the personalization vector, where $e$ is the vector of all ones. This replacement gives $S = H + a v^T$, where $a$ is an indexing vector with element $a_i = 1$ if row $i$ of $H$ is a $0^T$ row and $a_i = 0$ otherwise. To guarantee the existence and uniqueness of the PageRank vector, one more rank-1 update is needed, which yields the stochastic and irreducible Google matrix $G = \alpha S + (1 - \alpha) e v^T$, where $0 < \alpha < 1$. Then, the power method is applied to $G$ to calculate the stationary vector, that is, the PageRank vector [2]. Because the power method converges slowly, many acceleration methods have been proposed, such as extrapolation methods [3], the adaptive method [4], and other numerical methods [5-11].
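The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments: the adjacency format `links`, the function name `pagerank_power`, the default damping factor 0.85, and the tolerance are all assumptions made for the example. The Google matrix is never formed explicitly; the two rank-1 updates are applied through the scalar mass of the dangling nodes.

```python
def pagerank_power(links, n, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power method on G = alpha*(H + a v^T) + (1 - alpha) e v^T, uniform v.

    links: dict mapping node i -> list of distinct nodes that i links to
    (hypothetical input format); nodes are numbered 0..n-1.
    alpha and tol are assumed values chosen for illustration.
    """
    v = [1.0 / n] * n
    pi = v[:]
    for _ in range(max_iter):
        new = [0.0] * n
        dangling_mass = 0.0
        for i in range(n):
            out = links.get(i, [])
            if out:
                share = pi[i] / len(out)      # H_ij = 1 / O_i
                for j in out:
                    new[j] += share           # accumulates pi^T H
            else:
                dangling_mass += pi[i]        # accumulates pi^T a
        # pi^T G = alpha pi^T H + (alpha pi^T a + 1 - alpha) v^T, as pi^T e = 1
        coeff = alpha * dangling_mass + (1.0 - alpha)
        new = [alpha * new[j] + coeff * v[j] for j in range(n)]
        diff = sum(abs(new[j] - pi[j]) for j in range(n))
        pi = new
        if diff < tol:
            break
    return pi


if __name__ == "__main__":
    # Node 3 has no outlinks, so it is a dangling node.
    print(pagerank_power({0: [1, 2], 1: [2], 2: [0, 3]}, 4))
```

Working with the dangling mass instead of the dense matrix keeps each iteration at the cost of one sparse matrix-vector product.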

Because of the large number of dangling nodes in the web, there are many zero rows in the hyperlink matrix $H$. It is worthwhile to consider methods that take advantage of these identical rows. Lee et al. give a two-stage algorithm that improves the PageRank computation by lumping the dangling nodes and aggregating the nondangling nodes [12]. By applying the lumping procedure recursively, Langville and Meyer propose a reordered algorithm [13]. In their algorithm, the zero rows of the top left submatrix are recursively moved to the bottom, so the top left submatrix gets smaller as the reordering proceeds, until there are no zero rows in the new top left submatrix. The linear system associated with the final top left submatrix is easy to solve since its dimension is much smaller than that of the original matrix. Then, forward substitutions are carried out to obtain the full PageRank vector.

Although it is much easier to calculate the PageRank vector with a reduced matrix, the repeated reorderings are time consuming. In this paper, we look for a compromise between obtaining a reduced submatrix and saving the computation spent on reordering. We give a threshold for the recursive reordered method and terminate the reordering procedure partway, which we call the adaptive reordered method.

This paper is organized as follows. We briefly review the reordered method [13] in Section 2. In Section 3, an improved reordered method with an adaptive stopping criterion is presented. Numerical experiments on two realistic matrices, using both the original reordered method and the adaptive reordered method, are presented in Section 4. Conclusions and future work are presented in Section 5.

2. Reordered Method for the PageRank Problem

In this section, we briefly introduce the reordered method given by Langville and Meyer [13].

Theorem 2.1 in [13] reveals the relationship between the PageRank vector and the solution of a sparse linear system: solving $x^T (I - \alpha H) = v^T$ and letting $\pi^T = x^T / (x^T e)$ produces the PageRank vector.
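One way to see why the normalized solution is the PageRank vector is the short check below. It is a sketch under the usual assumptions that $v^T e = 1$, that $a$ is the dangling-node indicator (so $He = e - a$), and that $G = \alpha (H + a v^T) + (1 - \alpha) e v^T$; the scalar $s$ is introduced here only for convenience.

```latex
\begin{align*}
& x^T (I - \alpha H) = v^T, \qquad s = x^T e, \qquad \pi^T = x^T / s. \\
& \text{Multiplying by } e \text{ and using } He = e - a: \quad
  s - \alpha (s - x^T a) = 1
  \;\Longrightarrow\; (1 - \alpha) + \alpha\, \pi^T a = \tfrac{1}{s}. \\
& \pi^T G = \alpha\, \pi^T H + \bigl( \alpha\, \pi^T a + 1 - \alpha \bigr) v^T
          = \alpha\, \pi^T H + \tfrac{1}{s}\, v^T
          = \tfrac{1}{s} \bigl( \alpha\, x^T H + v^T \bigr)
          = \tfrac{1}{s}\, x^T = \pi^T .
\end{align*}
```

So $\pi^T$ is a stationary vector of $G$ with unit sum, and the dense rank-1 terms of $G$ never have to be formed.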

On the other hand, the nodes of the web can be classified into dangling nodes (D) and nondangling nodes (ND). In the hyperlink matrix $H$, the rows corresponding to dangling nodes are all $0^T$ rows. By permuting the rows and columns of $H$ so that the nondangling nodes come first, all the $0^T$ rows are moved to the bottom:

$$H = \begin{pmatrix} H_{11} & H_{12} \\ 0 & 0 \end{pmatrix},$$

where $H_{11}$ represents the links among nondangling nodes and $H_{12}$ represents the links from nondangling nodes to dangling nodes. The zero rows at the bottom correspond to the dangling nodes. According to the definition of the hyperlink matrix, $H_{11} \ge 0$, $H_{12} \ge 0$, and each row of $H_{11}$ and $H_{12}$ taken together sums to 1. Then, the coefficient matrix of the linear system $x^T (I - \alpha H) = v^T$ is

$$I - \alpha H = \begin{pmatrix} I - \alpha H_{11} & -\alpha H_{12} \\ 0 & I \end{pmatrix},$$

and its inverse is

$$(I - \alpha H)^{-1} = \begin{pmatrix} (I - \alpha H_{11})^{-1} & \alpha (I - \alpha H_{11})^{-1} H_{12} \\ 0 & I \end{pmatrix}.$$

The solution to the linear system can therefore be written as

$$x^T = \bigl( v_1^T (I - \alpha H_{11})^{-1} \;\big|\; \alpha v_1^T (I - \alpha H_{11})^{-1} H_{12} + v_2^T \bigr),$$

where the personalization vector $v^T = (v_1^T \;\; v_2^T)$ has been partitioned into nondangling and dangling parts, $v_1^T$ and $v_2^T$, respectively.
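The block inverse can be checked by direct multiplication. The sketch below writes $H_{11}$ and $H_{12}$ for the nondangling-to-nondangling and nondangling-to-dangling blocks and $M = I - \alpha H_{11}$ as a shorthand introduced only for this check.

```latex
\begin{align*}
\begin{pmatrix} M & -\alpha H_{12} \\ 0 & I \end{pmatrix}
\begin{pmatrix} M^{-1} & \alpha M^{-1} H_{12} \\ 0 & I \end{pmatrix}
&=
\begin{pmatrix} M M^{-1} & \alpha M M^{-1} H_{12} - \alpha H_{12} \\ 0 & I \end{pmatrix}
=
\begin{pmatrix} I & 0 \\ 0 & I \end{pmatrix}, \\[4pt]
x_1^T M = v_1^T, \qquad
x_2^T &= \alpha\, x_1^T H_{12} + v_2^T .
\end{align*}
```

Read this way, only the nondangling part $x_1^T$ requires a linear solve; the dangling part $x_2^T$ follows from one sparse matrix-vector product.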

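Based on this idea, Langville and Meyer propose a method that recursively permutes the top left block until the final $H_{11}$ has no zero rows. The recursive permutations make the dimension of $H_{11}$ smaller than permuting only once. Eventually, one obtains a matrix of the following form:

$$H = \begin{pmatrix}
H_{11} & H_{12} & H_{13} & \cdots & H_{1b} \\
0 & 0 & H_{23} & \cdots & H_{2b} \\
0 & 0 & 0 & \cdots & H_{3b} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 0
\end{pmatrix},$$

where $b$ is the number of blocks in each row. Then, the corresponding coefficient matrix of the linear system is

$$I - \alpha H = \begin{pmatrix}
I - \alpha H_{11} & -\alpha H_{12} & -\alpha H_{13} & \cdots & -\alpha H_{1b} \\
0 & I & -\alpha H_{23} & \cdots & -\alpha H_{2b} \\
0 & 0 & I & \cdots & -\alpha H_{3b} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & I
\end{pmatrix}.$$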

This reordered method can be described in Algorithm 1.

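(1) Reorder the hyperlink matrix so that in the top left submatrix all the zero rows
  are at the bottom in every reordering procedure, until there are no zero rows in the
  new top left submatrix.
(2) Solve $x_1^T (I - \alpha H_{11}) = v_1^T$ using the Jacobi method.
(3) For $i = 2$ to $b$, compute $x_i^T = \alpha \sum_{j=1}^{i-1} x_j^T H_{ji} + v_i^T$.
(4) Compute $\pi^T = x^T / \|x^T\|_1$, where $x^T = (x_1^T \;\; x_2^T \;\; \cdots \;\; x_b^T)$.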
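The four steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code used in [13] or in the experiments: the adjacency format `links`, the function names, the uniform personalization vector, the default `alpha = 0.85`, and the tolerance are all choices made for the example. It also assumes there are no self-links, so the diagonal of $I - \alpha H_{11}$ is the identity and the Jacobi update reduces to $x_1^T \leftarrow \alpha x_1^T H_{11} + v_1^T$.

```python
def reorder_blocks(links, n):
    """Step 1: peel off zero-row nodes until the top left block has none.

    links maps node -> list of distinct out-neighbours (hypothetical format).
    Returns the final top left block first, then the peeled blocks, ordered
    so that the original dangling nodes form the last block.
    """
    active = set(range(n))
    peeled = []
    while True:
        zero = [i for i in sorted(active)
                if not any(j in active for j in links.get(i, []))]
        if not zero or len(zero) == len(active):
            break  # no zero rows left, or nothing would remain on top
        peeled.append(zero)
        active -= set(zero)
    return [sorted(active)] + peeled[::-1]


def pagerank_reordered(links, n, alpha=0.85, tol=1e-12, max_iter=10000):
    blocks = reorder_blocks(links, n)
    block_of = {i: b for b, nodes in enumerate(blocks) for i in nodes}
    v = 1.0 / n                      # uniform personalization entry (assumed)
    x = [v] * n
    top = blocks[0]
    # Step 2: Jacobi iteration for x_1^T (I - alpha H_11) = v_1^T.
    for _ in range(max_iter):
        new = {i: v for i in top}
        for i in top:
            out = links.get(i, [])
            for j in out:
                if block_of[j] == 0:
                    new[j] += alpha * x[i] / len(out)
        diff = sum(abs(new[i] - x[i]) for i in top)
        for i in top:
            x[i] = new[i]
        if diff < tol:
            break
    # Step 3: forward substitution. Block b only receives from blocks < b,
    # so its entries are final by the time its own rows are pushed forward.
    for b in range(len(blocks) - 1):
        for i in blocks[b]:
            out = links.get(i, [])
            for j in out:
                if block_of[j] > b:
                    x[j] += alpha * x[i] / len(out)
    # Step 4: normalize to unit 1-norm.
    s = sum(x)
    return [xi / s for xi in x], blocks


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [0, 3], 2: [3, 4], 3: [4], 4: []}
    print(pagerank_reordered(graph, 5))
```

On the small example graph, three reordering passes peel off nodes 4, 3, and 2 in turn, leaving a two-node top left block for the Jacobi solve.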

This reordered method reduces the computation of the PageRank vector, since one only needs to solve a much smaller linear system, $x_1^T (I - \alpha H_{11}) = v_1^T$, and the forward substitution in the third step gives the full PageRank vector. However, the recursive reordering procedure in the first step is time consuming and may bring too much overhead if $b$, the number of blocks, is too large. It is therefore necessary to introduce a mechanism to overcome this drawback, which we describe in the following section.

3. Adaptive Reordered Method for the PageRank Problem

3.1. Adaptive Reordered Method

In this section, we introduce an adaptive reordered method. This method is based on Langville and Meyer's work in [13] and adds a stopping criterion: once the criterion is met, the reordering procedure is terminated. The adaptive reordered method avoids the potential overhead of the recursive reordering while keeping the benefit of reduced computation. The adaptive reordered algorithm is shown in Algorithm 2.

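(1) Reorder the hyperlink matrix so that in the top left submatrix all the zero rows
  are at the bottom in every reordering procedure, until the given stopping criterion
  is reached.
(2) Solve $x_1^T (I - \alpha H_{11}) = v_1^T$ using the Jacobi method.
(3) For $i = 2$ to $b$, compute $x_i^T = \alpha \sum_{j=1}^{i-1} x_j^T H_{ji} + v_i^T$.
(4) Compute $\pi^T = x^T / \|x^T\|_1$, where $x^T = (x_1^T \;\; x_2^T \;\; \cdots \;\; x_b^T)$.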

According to Langville and Meyer's work, the merit of the reordered method lies in the second step of Algorithm 1: the computation is reduced because a smaller linear system is obtained. However, the time-consuming recursive reordering procedure can bring so much overhead that it offsets this merit, or even outweighs it. We therefore introduce a stopping criterion to strike a compromise between the merit of reordering and its overhead.

3.2. Analysis of the Adaptive Reordered Method

From the preceding merit-and-overhead analysis, once the overhead of the first step exceeds the computational reduction brought by the smaller linear system in the second step, the recursive reordering procedure should be terminated. We assume that, before deciding whether the next reordering step is necessary, the dimension of the top left block is , and after the following reordering step, the dimension of the new top left block is . Apparently, . The new reordering step will bring an extra reordering expense which is approximately . It also brings a computational reduction which is approximately , in which 130 is the approximate number of matrix-vector products that the Jacobi method needs before the result reaches the required precision. With these analyses, we obtain the following stopping criterion.

Stopping Criterion. Once , the reordering procedure in the first step of Algorithm 2 stops.

With this stopping criterion, our adaptive reordered method stops at an appropriate point, which both reduces the dimension of the linear system to be solved and avoids the excessive expense of the recursive reordering procedure.
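The early-termination idea can be sketched as a reordering loop that consults a pluggable predicate before each pass. Everything specific in this sketch is an assumption made for illustration: the adjacency format `links`, the names `adaptive_reorder` and `placeholder_criterion`, the predicate's signature, and in particular the cost model, which is a placeholder (one pass over the stored entries versus 130 matrix-vector products on the entries removed) and is not the estimate used in the stopping criterion above.

```python
def adaptive_reorder(links, n, should_stop):
    """Recursive reordering that lets should_stop end the process early.

    links maps node -> list of distinct out-neighbours (hypothetical format).
    should_stop(n_old, n_new, nnz_old, nnz_new) is a caller-supplied
    predicate (hypothetical interface) that sees the size and the number of
    nonzeros of the top left block before and after the candidate pass.
    """
    def nnz(nodes):
        return sum(1 for i in nodes for j in links.get(i, []) if j in nodes)

    active = set(range(n))
    peeled = []
    while True:
        zero = [i for i in sorted(active)
                if not any(j in active for j in links.get(i, []))]
        if not zero or len(zero) == len(active):
            break                      # nothing left to reorder
        remaining = active - set(zero)
        if should_stop(len(active), len(remaining),
                       nnz(active), nnz(remaining)):
            break                      # further reordering judged not worthwhile
        peeled.append(zero)
        active = remaining
    return [sorted(active)] + peeled[::-1]


def placeholder_criterion(n_old, n_new, nnz_old, nnz_new, jacobi_products=130):
    """Illustrative cost model only, not the criterion from the text."""
    reorder_cost = nnz_old                                 # assumed: one pass
    solve_saving = jacobi_products * (nnz_old - nnz_new)   # assumed saving
    return reorder_cost > solve_saving


if __name__ == "__main__":
    graph = {0: [1, 2], 1: [0, 3], 2: [3, 4], 3: [4], 4: []}
    # Stop as soon as the top left block would drop below three nodes.
    print(adaptive_reorder(graph, 5, lambda n_old, n_new, *_: n_new < 3))
```

When the loop stops early, the top left block may still contain zero rows, but the block upper triangular structure is preserved, so the solve, forward substitution, and normalization steps apply unchanged.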

4. Numerical Experiments

In this section, we present numerical experiments that compare the adaptive reordered method with the original reordered method [13]. The experiments use two matrices. The first matrix is wb-cs-stanford.mat [14], which contains 9914 pages, and the second matrix is Stanford.mat [15], which contains 281903 pages. Both matrices represent directed graphs, and before the experiments they are first turned into row-stochastic matrices whose row sums are all 1. As in [13], we choose the Jacobi method to solve the linear system. We use as the damping factor and as the convergence tolerance. The numerical experiments are carried out in MATLAB. We refer to the reordered method and the adaptive reordered method as RD and aRD for short. The performance of both methods is as follows.

Figures 1 and 2 show the structures of the original matrices and of the matrices reordered using RD and aRD. One can observe that the rough shapes of the matrices reordered by both methods are similar, which means that the dimensions of the linear systems solved in the second step are almost equal and the costs of solving them are close. Indeed, in the experiment on wb-cs-stanford.mat, the dimension of $H_{11}$ is 6592 when using aRD and 6585 when using RD.

Table 1 shows and the time spent on the reordering step of the two methods applied to the two matrices. In this table, represents the time spent on the reordering step and the time on the remaining steps, and is the sum of and . For wb-cs-stanford.mat, there are four blocks if aRD is used and their sizes are 6592, 88, 356, and 2861 successively, while there are seven blocks when RD is used and their sizes are 6585, 3, 4, 17, 88, 356, and 2861. Comparing these two sets of data, one can tell that, when using RD, the follow-up reorderings that produce the subdivided blocks, 6585, 3, 4, and 17, do not reduce the dimension of the top left block substantially, which means that these reorderings are not very useful. Our adaptive reordered method also saves much expense when it is applied to Stanford.mat. There are 141 blocks when using RD, while there are only 3 blocks when using aRD. After reviewing the data, we find that the sizes of the blocks of aRD are 258177, 3338, and 20315 and those of RD are 257824, …, 14, 73, 3338, and 20315, where the omitted sizes are all trivial and the reorderings spent on those blocks are not worthwhile. As each additional block requires one more reordering operation, the data in this table show that aRD spends much less on the reordering procedure than RD does.

5. Conclusion

In this paper, we introduce an adaptive reordered method for solving the PageRank problem. Compared with the original reordered method [13], we bring in an effective stopping criterion that avoids the overhead of the recursive reordering procedure while keeping its fast computational speed. Numerical results on two examples demonstrate the good performance of this method. The Jacobi method is used in this paper, but other acceleration methods and preconditioning approaches can also be combined with it, which could further enhance the effect of the reordered method. These topics deserve further study and will be the subject of our future work.

Acknowledgments

This research is supported by 973 Program (2013CB329404) and Chinese Universities Specialized Research Fund for the Doctoral Program (20110185110020).