Abstract

We introduce a partition of the web pages that is particularly suited to PageRank problems in which the web link graph has a nested block structure. Based on this partition of the web pages into dangling nodes, common nodes, and general nodes, the hyperlink matrix can be reordered into a simpler block structure. We then propose a parallel algorithm for the PageRank problem. In this algorithm, the dimension of the linear system that must be solved becomes smaller, and the vector for the general nodes in each block can be calculated separately in every iteration. Numerical experiments show that this approach speeds up the computation of PageRank.

1. Introduction

The rapid growth of the World Wide Web has created a need for search tools. One of the best-known algorithms in web search is Google's PageRank algorithm [1]. PageRank is based on a random surfer model [1], and the PageRank vector can be viewed as the stationary distribution of a Markov chain. Around the same time as the random surfer model, a different but closely related approach, the HITS algorithm, was invented in [2]. Another model, SALSA [3], incorporated ideas from both HITS and PageRank to create another ranking of web pages.

In this paper, we focus on Google's PageRank algorithm. Let us introduce some notation. We can model the web as a directed graph with the web pages as the nodes and the hyperlinks as the directed edges. In this graph, if there is a link from page $i$ to page $j$, then page $i$ has an outlink to page $j$, and page $j$ has an inlink from page $i$. We can then define the elements of a hyperlink matrix $P$ as follows.

If web page $i$ has $d_i$ outlinks, then, for each link from page $i$ to another page $j$, the element $P_{ij}$ of the matrix $P$ is $1/d_i$. If there is no link from page $i$ to page $j$, then $P_{ij} = 0$. The scalar $d_i$ is the number of outlinks from page $i$. Thus, each nonzero row of $P$ sums to 1. If page $i$ has no outlinks at all (such as a pdf, image, or audio file), it is called a dangling node, and all elements in the $i$th row of $P$ are set to 0.
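As a concrete illustration (a minimal sketch in Python assuming an edge-list representation; the function name and layout are ours, not from the paper), the hyperlink matrix $P$ can be assembled as follows, with the rows of dangling nodes left identically zero:

```python
import numpy as np
from scipy.sparse import csr_matrix

def hyperlink_matrix(edges, n):
    """Build the row-normalized hyperlink matrix P from (i, j) link pairs.

    Rows of pages with no outlinks (dangling nodes) stay identically zero.
    """
    src = np.array([i for i, _ in edges])
    dst = np.array([j for _, j in edges])
    outdeg = np.bincount(src, minlength=n)   # d_i, the number of outlinks
    vals = 1.0 / outdeg[src]                 # P_ij = 1 / d_i
    return csr_matrix((vals, (src, dst)), shape=(n, n))

# Tiny example: node 2 is dangling (no outlinks).
P = hyperlink_matrix([(0, 1), (0, 2), (1, 2)], n=3)
print(P.toarray())   # row sums are 1 for nondangling rows, 0 for row 2
```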

The problem is that if at least one node has zero outdegree, that is, no outlinks, then the Markov chain is absorbing, so a modification to $P$ is needed. In order to resolve this, the founders of Google, Brin and Page, suggested replacing each zero row (corresponding to a dangling node) of the sparse hyperlink matrix with a dense nonnegative vector $v^T$ ($v = e/n$, where $e$ is the column vector of all ones; $v$ could also be a personalized vector, see [4, 5]) and creating the new stochastic matrix denoted by $\bar{P} = P + d v^T$. In the vector $d$, the element $d_i = 1$ if the $i$th row of $P$ corresponds to a dangling node, and $d_i = 0$ otherwise. Another problem is that nothing in our definition so far guarantees the convergence of the PageRank algorithm or the uniqueness of the PageRank vector for the matrix $\bar{P}$. In general, if the matrix is irreducible, this problem is settled. Thus, Brin and Page added another dense perturbation matrix $e v^T$ that creates direct connections between all pages to force the matrix to be irreducible. The resulting stochastic, irreducible matrix, called the Google matrix, is given by $G = \alpha \bar{P} + (1 - \alpha) e v^T$, where $0 < \alpha < 1$ (a typical value for $\alpha$ is between 0.85 and 0.95; it is shown in [6] that $\alpha$ controls the convergence rate of the PageRank algorithm). Mathematically, the PageRank vector $\pi$ is the stationary distribution of the so-called Google matrix $G$.
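For a tiny graph, the construction of $\bar{P}$ and $G$ can be written out densely as below (a sketch for exposition only, assuming the uniform $v = e/n$; at web scale one never forms $G$ explicitly, since the rank-one terms are applied implicitly):

```python
import numpy as np

def google_matrix(P, alpha=0.85):
    """Dense illustration of G = alpha*(P + d v^T) + (1 - alpha) e v^T.

    Only sensible for tiny examples; real computations keep P sparse and
    apply the rank-one corrections implicitly.
    """
    n = P.shape[0]
    v = np.full(n, 1.0 / n)                  # uniform personalization vector
    d = (P.sum(axis=1) == 0).astype(float)   # dangling indicator vector
    P_bar = P + np.outer(d, v)               # replace zero rows with v^T
    return alpha * P_bar + (1 - alpha) * np.outer(np.ones(n), v)

P = np.array([[0, .5, .5], [0, 0, 1], [0, 0, 0]])  # row 2 is dangling
G = google_matrix(P)
print(G.sum(axis=1))    # every row of G sums to 1 (stochastic)
```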

There are now many methods for computing the PageRank vector $\pi$, such as the famous power method [1, 7, 8]. Due to the sheer size of the web (over 3 billion pages), this computation can take several days. In [9], Arasu et al. used values from the current iteration as they become available, rather than using only values from the previous iteration. They also suggested that exploiting the "bow-tie" structure of the web [10] would be useful in computing PageRank. In [11], Kamvar et al. presented a variety of extrapolation methods. In [12], Avrachenkov et al. showed that Monte Carlo methods already provide a good estimate of the PageRank of relatively important pages after one iteration. Gleich et al. [13] presented an inner-outer iterative algorithm for accelerating PageRank computations. In another line of work, motivated by the existence of the dangling nodes, Lee et al. [14] partitioned the web into dangling and nondangling nodes and applied an aggregation method to this partition.
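For reference, a minimal power-method sketch follows. It uses the identity $\pi^T G = \alpha \pi^T P + (\alpha \pi^T d + 1 - \alpha) v^T$, which holds because $\pi^T e = 1$, so only sparse products with $P$ are needed; the tolerance and iteration limit are illustrative choices:

```python
import numpy as np

def pagerank_power(P, alpha=0.85, tol=1e-8, max_iter=1000):
    """Power iteration: pi^T <- alpha pi^T P + (alpha pi^T d + 1 - alpha) v^T.

    P is the sparse hyperlink matrix with zero rows for dangling nodes;
    the rank-one dangling and teleportation terms are applied implicitly.
    """
    n = P.shape[0]
    v = np.full(n, 1.0 / n)                  # uniform personalization
    d = (np.asarray(P.sum(axis=1)).ravel() == 0).astype(float)
    pi = v.copy()
    for _ in range(max_iter):
        pi_new = alpha * (P.T @ pi) + (alpha * (pi @ d) + 1 - alpha) * v
        if np.abs(pi_new - pi).sum() < tol:  # 1-norm convergence test
            return pi_new
        pi = pi_new
    return pi
```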

Recently, attention has turned to the structure of the web link graph. Kamvar et al. [4] brilliantly exploited the block structure of the web for computing PageRank. They also exploited the fact that pages with lower PageRank tend to converge faster and proposed adaptive methods in [15]. Based on the characteristics of the web link graph, research on the parallelization of PageRank can be found in [16-21]. In [21], Manaskasemsak and Rungsawang discussed a parallelization of the power method. In [17], Gleich et al. introduced a method to compare various linear system formulations in terms of parallel runtime performance. Cevahir et al. [16] proposed site-based partitioning and repartitioning techniques for parallel PageRank computation. Some special models for parallel PageRank were proposed in [18-20].

In this paper, we combine ideas from the existence of the dangling nodes and the block structure of the web and exploit a new structure for the hyperlink matrix $P$. Parallel computation methods are then applied to speed up the computation of PageRank by using a partition of the nodes. Firstly, we show that our target is to compute the PageRank of the nondangling nodes in the linear system formulation of the Google problem [22] (Section 2). Secondly, according to the partition of the web pages, we obtain a special structure of the hyperlink matrix and propose an algorithm (Section 3). Finally, we analyze our algorithm, and some numerical results are given (Sections 4 and 5).

2. The Problem

Generally, the Google problem is to solve for the stationary eigenvector $\pi$ of the matrix $G$ in the following equation:

$\pi^T G = \pi^T, \qquad \pi^T e = 1.$  (1)

Here, we introduce some theorems to show that the Google problem can be turned into a linear system problem in which only the unnormalized PageRank subvector of the nondangling nodes needs to be computed. In the following, $I$ denotes the identity matrix.

Theorem 1 (see [22], linear system for the Google problem). Suppose that the matrix $P$ is a hyperlink matrix. Solving the linear system

$x^T (I - \alpha P) = v^T$  (2)

and letting $\pi^T = x^T / (x^T e)$ produce the PageRank vector.

Since the coefficient matrix $I - \alpha P$ in (2) is an M-matrix (Theorem 8.4.2 in [23]) as well as nonsingular and irreducible, the solution of the linear system in Theorem 1 exists and is unique.
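In code, Theorem 1 amounts to one sparse solve followed by a normalization. The sketch below uses a direct solver, which is feasible only for modest dimensions and stands in for whatever iterative solver one would use at scale:

```python
import numpy as np
from scipy.sparse import identity
from scipy.sparse.linalg import spsolve

def pagerank_linear(P, alpha=0.85):
    """Solve x^T (I - alpha P) = v^T, i.e. (I - alpha P)^T x = v, then scale."""
    n = P.shape[0]
    v = np.full(n, 1.0 / n)
    A = (identity(n, format="csr") - alpha * P).T.tocsr()
    x = spsolve(A, v)          # unnormalized PageRank
    return x / x.sum()         # pi^T = x^T / (x^T e)
```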

The rows in the matrix $P$ corresponding to the dangling nodes are zero. It is natural as well as efficient to exclude the dangling nodes from the PageRank computation. This can be done by partitioning the web nodes into nondangling nodes and dangling nodes, which is similar to the method of "lumping" all the dangling nodes into a single node [24]. Supposing that the rows and columns of $P$ are permuted according to this partition, the rows corresponding to the dangling nodes are at the bottom of the matrix:

$P = \begin{pmatrix} P_{11} & P_{12} \\ 0 & 0 \end{pmatrix}$,  (3)

where the first block row and column correspond to the set $ND$ of the nondangling nodes and the second to the set $D$ of the dangling nodes.

Then, the coefficient matrix in (2) becomes

$I - \alpha P = \begin{pmatrix} I - \alpha P_{11} & -\alpha P_{12} \\ 0 & I \end{pmatrix}$,  (4)

and the inverse of this matrix is

$(I - \alpha P)^{-1} = \begin{pmatrix} (I - \alpha P_{11})^{-1} & \alpha (I - \alpha P_{11})^{-1} P_{12} \\ 0 & I \end{pmatrix}$.  (5)

Therefore, with $x^T = (x_1^T, x_2^T)$ and $v^T = (v_1^T, v_2^T)$ partitioned conformally, the unnormalized PageRank vector in (2) can be written as

$x_1^T = v_1^T (I - \alpha P_{11})^{-1}, \qquad x_2^T = \alpha x_1^T P_{12} + v_2^T$.  (6)

Langville and Meyer [22] proposed two reordered PageRank algorithms for computing the PageRank vector. One is Algorithm 1, called the reordered PageRank algorithm, and the other is a recursively reordered PageRank algorithm. However, unfortunately, the recursively reordered PageRank algorithm is not necessarily an improvement over Algorithm 1 in some cases.

(1) Partition the web nodes into dangling and nondangling nodes, so that the hyperlink
  matrix has the structure of (3).
(2) Solve for $x_1^T$ in $x_1^T (I - \alpha P_{11}) = v_1^T$.
(3) Compute $x_2^T = \alpha x_1^T P_{12} + v_2^T$.
(4) $\pi^T = (x_1^T, x_2^T) / ((x_1^T, x_2^T) e)$.

In the reordered PageRank Algorithm 1, the only system that must be solved is $x_1^T (I - \alpha P_{11}) = v_1^T$. The recursively reordered PageRank algorithm is based on a process of locating zero rows which is repeated on smaller and smaller submatrices of $P$, continuing until a submatrix is created that has no zero rows. Interested readers can find the details of the reordered PageRank algorithms in [22]. However, the structure of the web exploited by the recursive algorithm is not practical, as reordering the web matrix according to it requires a depth-first search, which is prohibitively costly on the web. Moreover, even though some hyperlink matrices suit the recursive algorithm, the required structure may not exist at all for others, and in this worst case the recursive algorithm has no advantage over Algorithm 1; the same conclusion can be seen in their experiments. Thus, we come back to (3) and reorder the structure of the matrix to speed up the computation of the PageRank vector. The objective function becomes

$x_1^T (I - \alpha P_{11}) = v_1^T$,  (7)

where the coefficient matrix $I - \alpha P_{11}$ is the nontrivial leading principal submatrix of $I - \alpha P$ and is nonsingular (Theorem 6.4.16 of [23]).
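A compact sketch of Algorithm 1 (assuming the permutation of (3) has already been applied, with `nd` counting the nondangling nodes; this rendering is ours, not the authors' code):

```python
import numpy as np
from scipy.sparse import identity
from scipy.sparse.linalg import spsolve

def reordered_pagerank(P, nd, alpha=0.85):
    """Algorithm 1: solve only the nondangling system, then recover x2.

    P is assumed permuted as in (3): nondangling rows first, dangling last.
    """
    n = P.shape[0]
    v = np.full(n, 1.0 / n)
    P11 = P[:nd, :nd].tocsr()           # links among nondangling nodes
    P12 = P[:nd, nd:].tocsr()           # links from nondangling to dangling
    A = (identity(nd, format="csr") - alpha * P11).T.tocsr()
    x1 = spsolve(A, v[:nd])             # x1^T (I - alpha P11) = v1^T
    x2 = alpha * (P12.T @ x1) + v[nd:]  # x2^T = alpha x1^T P12 + v2^T
    x = np.concatenate([x1, x2])
    return x / x.sum()
```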

(1) Partition the web nodes, which form $k$ blocks $B_1, B_2, \ldots, B_k$, into
   $k + 2$ blocks: $B_1', \ldots, B_k', B_c, D$, so the hyperlink matrix has the structure of (11).
(2) Partition the given vector $v$ and the PageRank vector $x$
   according to the size of the blocks:
    $v^T = (v_1^T, v_2^T)$, $v_1^T = (w_1^T, \ldots, w_k^T, w_c^T)$;
    $x^T = (x_1^T, x_2^T)$, $x_1^T = (y_1^T, \ldots, y_k^T, z^T)$.
(3) Compute the limiting vector $z$ by iterations as follows:
 (a) Compute $y_i^{(t)T}$, $i = 1, \ldots, k$, from $y_i^{(t)T} (I - \alpha Q_i) = w_i^T + \alpha z^{(t)T} S_i$;
 (b) Solve for $z^{(t+1)}$ in
          $z^{(t+1)T} (I - \alpha C) = w_c^T + \alpha \sum_{i=1}^{k} y_i^{(t)T} R_i$.
(4) Compute
        $x_1^T = (y_1^T, \ldots, y_k^T, z^T)$,
        $x_2^T = \alpha x_1^T P_{12} + v_2^T$.
(5) Normalize $\pi^T = x^T / (x^T e)$.

3. PageRank Algorithms Based on a Separation of the Common Nodes

3.1. The Block Structure of the Web

It is noted in [4] that, when sorted by Uniform Resource Locator (URL), the web link graph has a nested block structure: the vast majority of hyperlinks link pages on a host to other pages on the same host. This property was demonstrated by examination of realistic datasets. So in the following sections, we consider webs that have block structure. To simplify notation, and without loss of generality, we will assume that a web link graph has a block structure of $k$ blocks: $B_1, B_2, \ldots, B_k$. The hyperlink matrix is then

$P = \begin{pmatrix} P_{11} & P_{12} & \cdots & P_{1k} \\ P_{21} & P_{22} & \cdots & P_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ P_{k1} & P_{k2} & \cdots & P_{kk} \end{pmatrix}$.  (8)

Then, we separate the dangling nodes from each of the blocks. Thus, we get the new blocks $\tilde{B}_i$, $i = 1, \ldots, k$, which are the original blocks with dangling nodes removed. The set of nodes is $ND \cup D$, where $ND = \tilde{B}_1 \cup \cdots \cup \tilde{B}_k$ and $D$ is the set of the dangling nodes. The rows and columns of $P$ can be permuted, making the rows corresponding to the dangling nodes sit at the bottom of the matrix just like (3) in Section 2:

$P = \begin{pmatrix} P_{11} & P_{12} \\ 0 & 0 \end{pmatrix}$.  (9)

In the above equation, the submatrix $P_{11}$ is

$P_{11} = \begin{pmatrix} Q_{11} & Q_{12} & \cdots & Q_{1k} \\ Q_{21} & Q_{22} & \cdots & Q_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ Q_{k1} & Q_{k2} & \cdots & Q_{kk} \end{pmatrix}$,  (10)

where $Q_{ij}$ contains the links from the nondangling nodes of $\tilde{B}_i$ to those of $\tilde{B}_j$.

3.2. A Separation of the Common Nodes

To investigate the details of the web structure, consider the experiments in [4]. They used the LARGEWEB link graph [25] and considered the version of LARGEWEB with dangling nodes removed, which contains roughly 70 M nodes and over 600 M edges and requires 3.6 GB of storage. They partitioned the links in the graph into "intrahost" links, from a page to another page on the same host, and "interhost" links, from a page to a page on a different host. Counting the two kinds of links separately, Table 2 in [4] shows that 93.6% of the links in the dataset are intrahost links and 6.4% are interhost links; that is, the large majority of links are intrahost links and only a minority are interhost links. They found the same result when partitioning the links by domain. This result leads to a deeper study of the structure of the hyperlink matrix $P$: if the pages are grouped by domain, host, or otherwise, the graph of the pages takes on a block structure. In each subblock, a minority of the nodes have links to other blocks, and in this paper we call them common nodes. The definition of a common node is given as follows.

Definition 2 (common node). Assume that a web link graph with dangling nodes removed has $k$ blocks $\tilde{B}_1, \ldots, \tilde{B}_k$. If a node in a block $\tilde{B}_i$ ($1 \le i \le k$) has at least one outlink to a different block $\tilde{B}_j$ ($j \ne i$) or at least one inlink from a different block $\tilde{B}_j$ ($j \ne i$), we call it a common node.

If a node in a web link graph is neither a dangling node nor a common node, then we call it a general node. The nodes in a web link graph are thus divided into three classes: dangling nodes, common nodes, and general nodes. In particular, the common nodes and the general nodes together form the nondangling nodes.
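Given a block label for every page, the three classes can be read off the edge list directly. The helper below is a hypothetical sketch of this classification; all names are ours:

```python
import numpy as np

def classify_nodes(edges, block, n):
    """Split nodes into dangling / common / general per Definition 2.

    `block[i]` is the block index of page i; a nondangling node is common
    if it has an outlink or inlink crossing a block boundary.
    """
    has_outlink = np.zeros(n, dtype=bool)
    crosses = np.zeros(n, dtype=bool)
    for i, j in edges:
        has_outlink[i] = True
        if block[i] != block[j]:          # cross-block (interhost-style) link
            crosses[i] = crosses[j] = True
    dangling = ~has_outlink
    common = crosses & ~dangling          # Definition 2 excludes dangling nodes
    general = ~dangling & ~common
    return dangling, common, general
```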

There is no dangling node in the blocks $\tilde{B}_1, \ldots, \tilde{B}_k$, so we consider separating all the common nodes from these blocks to form a new block denoted by $B_c$. Hence, the set of nondangling nodes is $ND = B_1' \cup \cdots \cup B_k' \cup B_c$, where each new block $B_i'$ ($i = 1, \ldots, k$) is the block $\tilde{B}_i$ with its common nodes removed. Thus, any hyperlink submatrix corresponding to two different blocks $B_i'$ and $B_j'$ becomes a zero matrix, because there are no interlinks between different blocks once the common nodes are removed.

In Figure 1, a simple example illustrates the change after a separation of the common nodes. In Figure 1(a), there are four blocks $B_1$, $B_2$, $B_3$, and $B_4$ in a web link graph, and each of them has links to the others. In Figure 1(b), after separating the common nodes from the four blocks and lumping them into a block denoted by $B_c$, there are no links among the four new blocks; links exist only between $B_c$ and the four new blocks. Once this is done, the hyperlink matrix corresponding to the partition of the web nodes $B_1', \ldots, B_k', B_c, D$ has the following structure:

$P = \begin{pmatrix} P_{11} & P_{12} \\ 0 & 0 \end{pmatrix}$.  (11)

The submatrix $P_{11}$, corresponding to the hyperlinks among the nondangling nodes, turns out to be

$P_{11} = \begin{pmatrix} Q_1 & & & R_1 \\ & \ddots & & \vdots \\ & & Q_k & R_k \\ S_1 & \cdots & S_k & C \end{pmatrix}$,  (12)

with $Q_i$ holding the links among the general nodes of $B_i'$, $R_i$ the links from $B_i'$ to the common nodes, $S_i$ the links from the common nodes to $B_i'$, and $C$ the links among the common nodes. It is apparent that, after the separation of the common nodes, the structure of this matrix is much simpler than the former one in (10).

3.3. A PageRank Algorithm

Notice that the matrix $P_{11}$ in (12) has nonzero submatrices only on the diagonal, in the last row, and in the last column. This special structure can reduce the computation in every iteration. Let

$x_1^T = (y_1^T, \ldots, y_k^T, z^T), \qquad v_1^T = (w_1^T, \ldots, w_k^T, w_c^T)$,

where $x_1$ and $v_1$ are divided into general and common sections. Then (7) becomes $(y_1^T, \ldots, y_k^T, z^T)(I - \alpha P_{11}) = (w_1^T, \ldots, w_k^T, w_c^T)$. The coefficient matrix has the following structure:

$I - \alpha P_{11} = \begin{pmatrix} I - \alpha Q_1 & & & -\alpha R_1 \\ & \ddots & & \vdots \\ & & I - \alpha Q_k & -\alpha R_k \\ -\alpha S_1 & \cdots & -\alpha S_k & I - \alpha C \end{pmatrix}$.  (13)

Therefore, after Gaussian elimination of the general-node unknowns, (7) can be written as

$z^T (I - \alpha C) = w_c^T + \alpha \sum_{i=1}^{k} y_i^T R_i$.  (14)

The only system that must be solved is (14).
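To prepare the blockwise computation, the submatrices $Q_i$, $R_i$, $S_i$, and $C$ of (12) can be sliced out of $P_{11}$ once the general nodes are grouped by block and the common nodes are placed last; a sketch (our helper, with `sizes` giving the number of general nodes per block):

```python
def split_P11(P11, sizes):
    """Split P11 (general nodes of blocks 1..k first, common nodes last)
    into the diagonal blocks Q_i, the borders R_i, S_i, and the corner C
    of (12). `sizes` lists the number of general nodes in each block.
    """
    P11 = P11.tocsr()
    offs = [0]
    for s in sizes:
        offs.append(offs[-1] + s)
    m = offs[-1]                         # total number of general nodes
    k = len(sizes)
    Q = [P11[offs[i]:offs[i+1], offs[i]:offs[i+1]] for i in range(k)]
    R = [P11[offs[i]:offs[i+1], m:] for i in range(k)]
    S = [P11[m:, offs[i]:offs[i+1]] for i in range(k)]
    C = P11[m:, m:]
    return Q, R, S, C
```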

Notice that the matrix $\mathrm{diag}(Q_1, \ldots, Q_k)$ is a block diagonal matrix. Therefore, the subvectors $y_i$, partitioned according to the number and size of the blocks, can be calculated independently in each iteration. For example, in the $t$th iteration, take $z^{(t)}$ and divide the general-node vector into $y_1^{(t)}, \ldots, y_k^{(t)}$ according to the number and size of the blocks; then, for the vectors $y_i^{(t)}$, we have

$y_i^{(t)T} (I - \alpha Q_i) = w_i^T + \alpha z^{(t)T} S_i$  (15)

or

$y_i^{(t)T} = (w_i^T + \alpha z^{(t)T} S_i)(I - \alpha Q_i)^{-1}$.  (16)

As a result, the PageRank system in (7) can be reduced to the smaller linear system formulation in (14), in which the subvectors $y_i$ can be calculated independently in each iteration by (15) or (16). In summary, we now have an algorithm, Algorithm 2, based on the separation of the common nodes. Meanwhile, this algorithm is an extension of the dangling-node method in Section 2.
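Putting the pieces together, the following sketch implements steps 3-5 of Algorithm 2, reusing `split_P11` from above. The loop over blocks in step (a) is embarrassingly parallel; we write it serially for clarity, and the stopping rule and starting vector are illustrative choices:

```python
import numpy as np
from scipy.sparse import identity
from scipy.sparse.linalg import spsolve

def algorithm2(P11, P12, v, sizes, alpha=0.85, tol=1e-8, max_iter=200):
    """Steps 3-5 of Algorithm 2: iterate (15) blockwise and (14) for z.

    v is the personalization vector ordered as: general nodes block by
    block, then common nodes, then dangling nodes (matching (11)).
    """
    Q, R, S, C = split_P11(P11, sizes)
    k, m = len(sizes), sum(sizes)
    nd = P11.shape[0]                        # number of nondangling nodes
    offs = np.cumsum([0] + list(sizes[:-1]))
    w = [v[o:o + s] for o, s in zip(offs, sizes)]
    wc = v[m:nd]                             # personalization of common nodes
    Ac = (identity(nd - m, format="csc") - alpha * C).T
    z = np.full(nd - m, 1.0 / nd)            # illustrative starting vector
    for _ in range(max_iter):
        # (a) general-node subvectors: independent across blocks (parallel)
        y = [spsolve((identity(s, format="csc") - alpha * Q[i]).T,
                     w[i] + alpha * (S[i].T @ z))
             for i, s in enumerate(sizes)]
        # (b) the small common-node system (14)
        rhs = wc + alpha * sum(R[i].T @ y[i] for i in range(k))
        z_new = spsolve(Ac, rhs)
        done = np.abs(z_new - z).sum() < tol
        z = z_new
        if done:
            break
    x1 = np.concatenate(y + [z])             # general parts, then common part
    x2 = alpha * (P12.T @ x1) + v[nd:]       # dangling part, as in Algorithm 1
    x = np.concatenate([x1, x2])
    return x / x.sum()                       # step 5: normalize
```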

4. Analysis of Algorithm 2

As we know, some web link graphs appear to have a nested block structure. According to the definition of a common node, it is then not difficult to find the common nodes among the different blocks. This can be done by locating nonzero entries in the off-diagonal submatrices $Q_{ij}$ ($i \ne j$, $i, j = 1, \ldots, k$) of $P_{11}$ in (10). For example, if the $(s, t)$ entry of $Q_{ij}$ is nonzero, then the $s$th node of block $\tilde{B}_i$ and the $t$th node of block $\tilde{B}_j$ are common nodes. This process can be run on different submatrices of $P_{11}$ at the same time by using separate computers. At the end, the common nodes are gathered from the different computers, repeated nodes are removed, and we obtain the final set of common nodes. Since the dimension of each $Q_{ij}$ is much smaller and we can search in parallel, step 1 of Algorithm 2 does not take much time for separating the common nodes.
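The locating step is a scan of the off-diagonal blocks $Q_{ij}$ and distributes trivially across workers. The sketch below uses a thread pool as a stand-in for the separate computers described above; at scale one would ship the submatrices to different machines:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import permutations

def _scan_block(Qij, rows, cols):
    """Common nodes revealed by one off-diagonal block Q_ij of (10)."""
    r, c = Qij.nonzero()                 # a nonzero (s, t) entry marks two
    return set(rows[r]) | set(cols[c])   # common nodes, one in each block

def find_common_nodes(Q, ids):
    """Q[i][j]: submatrices of P11; ids[i]: array of global ids of block i."""
    pairs = list(permutations(range(len(ids)), 2))
    with ThreadPoolExecutor() as pool:   # stand-in for separate machines
        parts = pool.map(lambda p: _scan_block(Q[p[0]][p[1]],
                                               ids[p[0]], ids[p[1]]), pairs)
    return set().union(*parts)           # gather and drop repeated nodes
```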

Note that there are no links among the new blocks after the separation of the common nodes, as the zero submatrices in the matrix in (12) show. In effect, step 3 of Algorithm 2 reduces the time consumed on large matrices by turning one large matrix into many smaller submatrices $Q_1, \ldots, Q_k$. The vectors $y_i^{(t)}$, $i = 1, \ldots, k$, can be computed separately by (15), and the results are used together to yield the new vector $z^{(t+1)}$ for the next iteration. The parallel computation in this step can save much time.

Since the $y_i^{(t)}$ are not required to be highly accurate in each iteration, we can compute them from (15) or (16) by any appropriate direct or iterative method. Meanwhile, it was found in [22] that acceleration methods [9, 11, 15, 26], such as extrapolation and preconditioners, can be applied to the small system (14) to achieve even greater speedups.
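For example, since $\alpha Q_i$ has spectral radius below 1, a few fixed-point sweeps of $y \leftarrow \mathrm{rhs} + \alpha Q_i^T y$ (a truncated Neumann series for (16), written here in column form) already give a usable inexact $y_i^{(t)}$; the sweep count is an illustrative choice:

```python
def solve_block_inexact(Qi, rhs, alpha=0.85, sweeps=5):
    """A few sweeps of y <- rhs + alpha Q_i^T y, the column form of (15).

    High accuracy is unnecessary inside the outer iteration, so a short
    truncated Neumann series suffices; `sweeps` is an illustrative choice.
    """
    y = rhs.copy()
    for _ in range(sweeps):
        y = rhs + alpha * (Qi.T @ y)
    return y
```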

5. Numerical Experiments

5.1. Experiment Foundation

In this section, we give an example to present our algorithms.

Example. We consider three experiments based on three web link graphs: graph 1, graph 2, and graph 3. Each graph contains 200 nodes and four blocks of equal size. Based on our classification of web pages, there are three classes of pages in a web: dangling nodes, common nodes, and general nodes. In order to make the experiments comparable, we suppose that the numbers of dangling nodes are the same in the three graphs. We then set different proportions of general nodes to common nodes: 3 : 7 in graph 1, 5 : 5 in graph 2, and 7 : 3 in graph 3, so that the number of common nodes relatively decreases while the number of general nodes relatively increases. We also assume that, in each graph, the proportion of general nodes to common nodes within each subblock is similar to the proportion in the whole graph. Meanwhile, in all three graphs, the choice of the common nodes and of the links within and between the subblocks is random.
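The paper does not publish its graph generator, so the following sketch only mirrors the stated setup (200 nodes, four equal blocks, a fixed dangling share, a chosen general-to-common proportion); every parameter and the linking rule are our assumptions:

```python
import numpy as np

def make_block_web(n=200, k=4, frac_dangling=0.1, frac_common=0.3,
                   outdeg=5, seed=0):
    """Random block-structured web: k equal blocks, a fixed dangling share,
    and a chosen general:common proportion (e.g. 7:3 via frac_common=0.3).

    Cross-block links are drawn only between common nodes, keeping the
    classes consistent with Definition 2 (though a designated common node
    may happen to receive no cross link). All parameters are illustrative.
    """
    rng = np.random.default_rng(seed)
    block = np.repeat(np.arange(k), n // k)          # equal-sized blocks
    perm = rng.permutation(n)
    nondangling = perm[: int(n * (1 - frac_dangling))]
    common = set(rng.choice(nondangling,
                            int(len(nondangling) * frac_common),
                            replace=False))
    common_arr = np.array(sorted(common))
    edges = []
    for i in nondangling:
        same = np.where(block == block[i])[0]        # nodes in i's own block
        pool = np.union1d(same, common_arr) if i in common else same
        pool = pool[pool != i]
        for j in rng.choice(pool, size=min(outdeg, len(pool)), replace=False):
            edges.append((int(i), int(j)))
    return edges, block
```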

For the dot plots of these three web link graphs, if there exists a link from node $i$ to node $j$, then the point $(i, j)$ is colored; otherwise, the point is white. We ensure that the three web link graphs satisfy the three characteristics identified in [4]:
(1) there is a definite block structure to the web;
(2) the individual blocks are much smaller than the entire web;
(3) there are clear nested blocks.

For example, Figure 2 shows graph 3, which contains 200 pages and has a nested block structure of four blocks; the proportion of general nodes to common nodes is 7 : 3 in the whole graph.

Then, in each experiment, we separate the nodes into dangling nodes, common nodes, and the rest (the general nodes). The result of this process is a decomposition of the matrix. Figure 3 shows the change in the structure of $P$ in (3) after this process, based on the dataset of Figure 2: Figure 3(a) is the web link graph before reordering, and Figure 3(b) is the new web link graph after reordering. The process amounts to a simple reordering of the indices of the Markov chain, and the new structure is clearly better suited to computation than the original one.

5.2. Experimental Results and Analysis

Based on the three experimental datasets, we compare Algorithm 2 with two other algorithms: the original PageRank (power method) and the reordered PageRank. The scaling factor $\alpha$ and the convergence tolerance are fixed across all runs. The experimental results are shown in Figure 4 and Table 1. Figures 4(a), 4(b), and 4(c) compare the convergence acceleration of the three algorithms in the three experiments. They show that Algorithm 2 both finds the PageRank vector reliably and converges faster than the reordered PageRank; that is because the dimension of the linear system in Algorithm 2 is smaller than the dimension of the linear system in the reordered PageRank. The results in Table 1 imply that Algorithm 2 needs more iterations than the power method; however, thanks to its use of parallel computation, Algorithm 2 can still greatly reduce the computation time of PageRank. In future work, we will experiment on real data.

6. Conclusion

As has been observed in [4], the hyperlink graphs of some webs have a nested block structure. We exploit a reordered block structure and present an algorithm to compute PageRank quickly. Algorithm 2 has basically two stages. In stage 1, the focus is on the partition of the nodes of the web. In stage 2, the vector of general nodes in each block is computed independently for the next iteration; we then calculate the unnormalized PageRank subvectors for the common nodes and the dangling nodes directly and finally normalize the vector to give the PageRank. The numerical experiments show that Algorithm 2 outperforms the other two algorithms as long as an appropriate block structure of the web exists. However, in real data, the number of common nodes may increase as the number of blocks increases, and the dimension of the submatrix $C$ could become larger; it would then take much more time to compute the vector $z$. In this case, similarly to Algorithm 2, we will consider calculating the vector for the common nodes first and then calculating the vector for the general nodes in each block independently. We also need to experiment on real data and compare with more existing methods in future work.

Acknowledgments

The authors would like to express their great gratitude to the referees and the editor for their very helpful suggestions for revising this paper. This research is supported by the 973 Program (2013CB329404), NSFC (61370147, 61170311, and 61170309), the Chinese Universities Specialized Research Fund for the Doctoral Program (20110185110020), and the Sichuan Province Sci. & Tech. Research Project (2012GZX0080).