Special Issue: Applications of Methods of Numerical Linear Algebra in Engineering
Research Article | Open Access
A Model Based on Cocitation for Web Information Retrieval
Based on the relationship between authority and cocitation in HITS, we propose a new hyperlink weighting scheme to describe the strength of the relevancy between any two webpages. We then combine hyperlink weight normalization and random surfing schemes, as used in PageRank, to justify the new model. In the new model based on cocitation (MBCC), pages with stronger relevancy are assigned higher values, rather than values that depend only on the outlinks. The model thus combines features of both HITS and PageRank. Finally, we present the results of some numerical experiments, showing that the MBCC ranking agrees with the HITS ranking, especially in the top 10. Meanwhile, MBCC retains the advantage of PageRank, namely, the existence and uniqueness of ranking vectors.
In the past, search engines ranked pages by word frequency or similar measures. However, the relevancy of the webpages returned by this traditional web information retrieval is often poor, because webpages are created with varying quality. More recently, new algorithms have been created that greatly improve rankings. One of the popular ideas is to use hyperlinks to determine the value of different webpages. The hyperlink graph contains useful information: if webpage $i$ has a link pointing to webpage $j$, it usually indicates that the creator of $i$ considers $j$ to contain relevant information for $i$. Such opinions and knowledge are therefore registered in the form of an adjacency matrix, denoted by $L$: $L_{ij} = 1$ if there is a link from $i$ to $j$, and $L_{ij} = 0$ otherwise.
The two most popular ranking algorithms based on hyperlink analysis are the PageRank algorithm [1, 2] and the HITS (Hyper-text Induced Topic Selection) algorithm [3]. Generally, PageRank uses hyperlink weight normalization and takes the equilibrium distribution of random surfers as the citation score. For more information about the calculation methods of PageRank, refer to [4–6]. HITS distinguishes between hubs and authorities and computes them in a mutually reinforcing way. For each of these two algorithms, the ranking vector is the dominant eigenvector of some matrix describing the network; how this matrix is defined differs in each method. Other works have also recognized that the hyperlink structure can be very valuable for locating information [3, 7, 8].
This paper is organized as follows. In Section 2, we introduce the PageRank and HITS algorithms and briefly discuss the limitations of HITS. Then in Section 3, we emphasize the role of cocitation (Figure 1) and provide a hyperlink weighting scheme to describe the strength of the relevancy between any two webpages. In order to ensure the existence and uniqueness of solutions in the new model (MBCC), we also combine ideas from PageRank. In Section 4, some experiments are presented. The results show that the MBCC ranking is close to the HITS ranking. Conclusions are given in Section 5.
2. PageRank and HITS
We treat the web as a directed graph $G = (V, E)$: the nodes in $V$ correspond to the pages, and a directed edge $(i, j) \in E$ indicates the existence of a link from $i$ to $j$. The out-degree of a node $i$, denoted by $d_i^{\mathrm{out}}$, is the number of nodes it links to, and the in-degree of $i$, denoted by $d_i^{\mathrm{in}}$, is the number of nodes that link to it. We also denote that
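As a small illustration of these definitions, the following sketch builds the adjacency matrix of a hypothetical four-page web and reads off the out-degrees and in-degrees. The edge list, the 0-based page labels, and the function names are assumptions made for this example only; they do not come from the paper's dataset.

```python
def adjacency_matrix(n, edges):
    """Return the n x n 0/1 adjacency matrix: L[i][j] = 1 iff page i links to page j."""
    L = [[0] * n for _ in range(n)]
    for i, j in edges:
        L[i][j] = 1
    return L


def out_degrees(L):
    """Out-degree of page i = sum of row i."""
    return [sum(row) for row in L]


def in_degrees(L):
    """In-degree of page j = sum of column j."""
    return [sum(col) for col in zip(*L)]


# Hypothetical web: 0 -> 1, 0 -> 2, 1 -> 2, 3 -> 2 (page 2 has no outlinks).
edges = [(0, 1), (0, 2), (1, 2), (3, 2)]
L = adjacency_matrix(4, edges)
print(out_degrees(L))  # [2, 1, 0, 1]
print(in_degrees(L))   # [0, 1, 3, 0]
```

Page 2 is a dangling node (zero out-degree), while pages 0 and 3 have zero in-degree; both situations matter for the models discussed below.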
2.1. Review of PageRank
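PageRank [1, 2] uses a web surfing model based on a random walk process. Suppose there is a link from page $i$ to page $j$; that is, $(i, j) \in E$. Consider a random surfer visiting page $i$ at time $t$. Then at the next time $t + 1$, the surfer lands at page $j$ with probability $1/d_i^{\mathrm{out}}$, where $d_i^{\mathrm{out}}$ is the out-degree of $i$. Accordingly, the PageRank algorithm assigns a rank value $r_j$ to page $j$ as a function of the ranks of the pages that point to it:
$$r_j = \sum_{i : (i, j) \in E} \frac{r_i}{d_i^{\mathrm{out}}}. \tag{2}$$
If the page $i$ has no outlink, that is, $d_i^{\mathrm{out}} = 0$, then, at time $t + 1$, the surfer chooses any page with probability $1/n$, where $n$ is the number of pages. Thus, we replace the transition matrix $P$ with $\bar{P}$. Then the stationary distribution $\pi$ is determined by the following matrix form:
$$\pi^{T} = \pi^{T} \bar{P}. \tag{3}$$
Here $\bar{P} = P + \frac{1}{n} a e^{T}$, $L$ is the adjacency matrix of the directed web graph, $P$ is obtained from $L$ by dividing each nonzero row $i$ by $d_i^{\mathrm{out}}$, and $e$ is the vector of all ones. In the vector $a$, the element $a_i = 1$ if the $i$th row of $P$ corresponds to a dangling node ($d_i^{\mathrm{out}} = 0$), and $a_i = 0$ otherwise.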
To compute the above recursion and obtain a unique stationary probability distribution, it is important to guarantee that (3) converges. Convergence is guaranteed if the directed graph is strongly connected, which is generally not the case for the web graph. In the context of computing PageRank, the standard way of ensuring this property is to add a new set of complete outgoing transitions, with small transition probabilities (in this work, each is set to $(1 - \alpha)/n$), to all nodes in $G$. Then the modified transition probability matrix, called the Google matrix, is
$$\bar{\bar{P}} = \alpha \bar{P} + (1 - \alpha) \frac{1}{n} e e^{T}, \tag{4}$$
where $0 < \alpha < 1$. Here $e = (1, \ldots, 1)^{T}$; thus $e e^{T}$ is a matrix of all 1's. The PageRank algorithm computes the dominant eigenvector of the Google matrix, which is stochastic and irreducible.
PageRank thus models two types of random moves on the Internet. With probability $1 - \alpha$, a surfer randomly chooses a new page. Otherwise, the surfer follows one of the directed edges from the present node.
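The random-surfer process described above can be sketched as a power iteration. The code below is a minimal illustration rather than the authors' implementation: the adjacency matrix is a nested list `L`, dangling pages spread their rank uniformly, and the damping value `alpha = 0.85`, the tolerance, and the four-page example graph are assumptions chosen for this sketch (the values used in the experiments are not preserved in this text).

```python
def pagerank(L, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration for the PageRank vector of the 0/1 adjacency matrix L.

    alpha is the probability of following a link; 1 - alpha is the
    probability of jumping to a uniformly chosen page (assumed values).
    """
    n = len(L)
    out_deg = [sum(row) for row in L]
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        # Rank currently held by dangling pages is redistributed uniformly.
        dangling = sum(pi[i] for i in range(n) if out_deg[i] == 0)
        new = [(1.0 - alpha) / n + alpha * dangling / n] * n
        for i in range(n):
            if out_deg[i] > 0:
                share = alpha * pi[i] / out_deg[i]  # equal split over outlinks
                for j in range(n):
                    if L[i][j]:
                        new[j] += share
        residual = sum(abs(new[k] - pi[k]) for k in range(n))  # L1 norm
        pi = new
        if residual < tol:
            break
    return pi


# Hypothetical web: 0 -> 1, 0 -> 2, 1 -> 2, 3 -> 2.
L = [[0, 1, 1, 0], [0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 1, 0]]
ranks = pagerank(L)
print(ranks)
```

Because every page receives at least the uniform jump probability, all entries of the result are positive and they sum to one, which matches the existence and positivity properties discussed for the Google matrix.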
2.2. Review of HITS
In the HITS algorithm [3], each webpage has both a hub score (based on the links going from the page) and an authority score (based on the links going to the page). Let $x$ denote the vector of all authority weights, let $y$ denote the vector of all hub weights, and let $L$ be the adjacency matrix of the directed web graph. In HITS, there are two operations at each iteration. One is the $\mathcal{I}$ operation, which sets the authority vector to $x = L^{T} y$. It indicates that a good authority is pointed to by many good hubs. The other is the $\mathcal{O}$ operation, which sets the hub vector to $y = L x$. It indicates that a good hub points to many good authorities. This mutually reinforcing relationship can be written in the following matrix representations:
$$x^{(k+1)} = L^{T} L \, x^{(k)}, \qquad y^{(k+1)} = L L^{T} y^{(k)}.$$
The final authority and hub scores are the principal eigenvectors of $L^{T} L$ and $L L^{T}$, which correspond to the dominant eigenvalue $\lambda$. Since $L^{T} L$ and $L L^{T}$ determine the authority ranking and the hub ranking, we call $L^{T} L$ the authority matrix and $L L^{T}$ the hub matrix.
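The two mutually reinforcing operations can be sketched as follows. This is an illustrative implementation under stated assumptions: the adjacency matrix is a nested list `L`, both vectors start from all ones, each is rescaled to unit Euclidean length after every update, and the stopping tolerance and the example graph are chosen only for this sketch.

```python
import math


def _normalize(v):
    """Scale v to unit Euclidean length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def hits(L, tol=1e-12, max_iter=1000):
    """Return (authority, hub) vectors for the 0/1 adjacency matrix L."""
    n = len(L)
    x = _normalize([1.0] * n)  # authority weights
    y = _normalize([1.0] * n)  # hub weights
    for _ in range(max_iter):
        # Authority update: x_j = sum of hub weights of pages linking to j.
        new_x = _normalize(
            [sum(L[i][j] * y[i] for i in range(n)) for j in range(n)])
        # Hub update: y_i = sum of authority weights of pages i links to.
        new_y = _normalize(
            [sum(L[i][j] * new_x[j] for j in range(n)) for i in range(n)])
        change = (sum(abs(p - q) for p, q in zip(new_x, x))
                  + sum(abs(p - q) for p, q in zip(new_y, y)))
        x, y = new_x, new_y
        if change < tol:
            break
    return x, y


# Hypothetical web: 0 -> 1, 0 -> 2, 1 -> 2, 3 -> 2.
L = [[0, 1, 1, 0], [0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 1, 0]]
authority, hub = hits(L)
print(authority)
print(hub)
```

In this example pages 0 and 3 have no in-links and therefore receive an authority weight of exactly zero, while page 2, which has no outlinks, receives a hub weight of zero.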
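In the fields of citation analysis and bibliometrics, it has been shown that the authority matrix has interesting connections to cocitation [9]. Here the cocitation $c_{ij}$ is defined as the number of webpages that cocite $i$ and $j$. In the authority matrix, the diagonal entry $(L^{T} L)_{ii}$ is the in-degree of page $i$; that is,
$$(L^{T} L)_{ii} = \sum_{k} L_{ki} L_{ki} = \sum_{k} L_{ki} = d_i^{\mathrm{in}}.$$
For $i \neq j$, $(L^{T} L)_{ij} = \sum_{k} L_{ki} L_{kj}$ is the number of webpages that cocite $i$ and $j$, that is, $c_{ij}$. Therefore the authority matrix is the sum of in-degree and cocitation [10, 11]:
$$L^{T} L = D + C, \tag{9}$$
where $D = \mathrm{diag}(d_1^{\mathrm{in}}, \ldots, d_n^{\mathrm{in}})$ and $C = (c_{ij})$. The self-cocitation $c_{ii}$ in $C$ is not defined and is usually set to $0$.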
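The decomposition of the authority matrix into in-degree and cocitation can be checked numerically on a small example. The sketch below assumes a nested-list adjacency matrix `L` for a hypothetical four-page web and counts cocitations directly, with the self-cocitation set to zero; the function names are illustrative.

```python
def authority_matrix(L):
    """Entry (i, j) of L^T L: sum over k of L[k][i] * L[k][j]."""
    n = len(L)
    return [[sum(L[k][i] * L[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def cocitation_matrix(L):
    """c[i][j] = number of pages linking to both i and j (diagonal set to 0)."""
    n = len(L)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                C[i][j] = sum(1 for k in range(n) if L[k][i] and L[k][j])
    return C


# Hypothetical web: 0 -> 1, 0 -> 2, 1 -> 2, 3 -> 2.
L = [[0, 1, 1, 0], [0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 1, 0]]
n = len(L)
A = authority_matrix(L)
C = cocitation_matrix(L)
in_deg = [sum(col) for col in zip(*L)]
D_plus_C = [[C[i][j] + (in_deg[i] if i == j else 0) for j in range(n)]
            for i in range(n)]
print(A == D_plus_C)  # True
```

Here page 0 is the only page that cites both page 1 and page 2, so the cocitation of that pair is 1, and the diagonal of the authority matrix reproduces the in-degrees 0, 1, 3, 0.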
2.3. Existence and Uniqueness of Ranking Vectors
In this section, we present the existence and uniqueness of ranking vectors in the above two algorithms.
Since the Google matrix in (4) is stochastic and irreducible, the PageRank ranking vector exists, and it is unique and positive. See the equivalent theorem in [12, Theorem 3.8]. For the HITS algorithm, it has been proved that the hub and authority ranking vectors exist but may not be unique. In [12], it is shown that the HITS algorithm is badly behaved on certain networks, meaning that (i) it can return ranking vectors that are not unique but depend on the initial seed vector or (ii) it can return ranking vectors that inappropriately assign zero weights to parts of the network.
There are also other limitations of HITS; see [12, 13]. To address these limitations, a modification of HITS is needed, for example, the exponentiated input method in [12]. In the next section, we combine features of both HITS and PageRank. The ranking produced by the new model is expected to be unique and close to the HITS ranking.
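The dependence of HITS on the seed vector can be seen on a very small disconnected graph. The following sketch is an illustration constructed for this purpose, not an example taken from [12]: it assumes a hypothetical web with only the links 0 -> 1 and 2 -> 3, for which the authority matrix has a repeated dominant eigenvalue, and it runs the authority and hub updates from two different hub seeds.

```python
import math


def authority_scores(L, hub_seed, iters=100):
    """Run HITS updates from a given hub seed and return the authority vector."""
    n = len(L)
    y = list(hub_seed)
    x = [0.0] * n
    for _ in range(iters):
        x = [sum(L[i][j] * y[i] for i in range(n)) for j in range(n)]
        x_norm = math.sqrt(sum(v * v for v in x)) or 1.0
        x = [v / x_norm for v in x]
        y = [sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        y_norm = math.sqrt(sum(v * v for v in y)) or 1.0
        y = [v / y_norm for v in y]
    return x


# Hypothetical disconnected web: 0 -> 1 and 2 -> 3.
L = [[0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
from_uniform = authority_scores(L, [1.0, 1.0, 1.0, 1.0])
from_page_0 = authority_scores(L, [1.0, 0.0, 0.0, 0.0])
print(from_uniform)
print(from_page_0)
```

With the uniform seed, pages 1 and 3 share the authority weight equally; with the seed concentrated on page 0, page 3 receives zero weight even though it has the same in-degree as page 1. Both outputs are valid limits of the iteration, which illustrates behaviors (i) and (ii) above.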
3. A Model Based on Cocitation (MBCC)
In HITS, according to (9), the authority ranking value can be expressed as
$$\lambda x_i = d_i^{\mathrm{in}} x_i + \sum_{j \neq i} c_{ij} x_j, \tag{10}$$
where $x_i$ is the authority value of page $i$, $d_i^{\mathrm{in}}$ is its in-degree, $c_{ij}$ is the number of webpages that cocite $i$ and $j$, and $\lambda$ is the dominant eigenvalue of the authority matrix. This reveals the close relationship between authorities and cocitations. It also implies that, if two distinct webpages $i$, $j$ are cocited by many other webpages as shown in Figure 1, then $i$, $j$ are likely to be related in some sense. In this paper, we present a property for HITS corresponding to (10).
Property 1 (relationship between authority value and cocitation). The larger the number of webpages that cocite webpages $i$ and $j$, that is, $c_{ij}$, the more authority value the page $i$ can receive from the page $j$, even if there are no links between $i$ and $j$.
The fact that other webpages cocite two distinct webpages $i$ and $j$ indicates that $i$, $j$ have a certain commonality. Therefore, we say that the number of cocitations represents the relevancy among the pages. In the following, we focus on the use of cocitation for analyzing the relevancy among the pages.
Note that, in Section 2.1, the rank of a page in PageRank is divided evenly among its forward links; see (2). That is, a web surfer chooses among the outlinks uniformly at random. However, this process of dividing the rank equally may be unrealistic: a web surfer may have an a priori idea of the value of pages, favoring pages from relevant sites. Since the number of cocitations can represent the relevancy among the pages, we say that the number of cocitations between two pages can influence the behavior of web surfers. Therefore, we define a new hyperlink weighting scheme based on cocitation as follows.
Definition 2 (hyperlink weighting scheme based on cocitation). Let $c_{ij}$ be the number of webpages that cocite two webpages $i$, $j$. Specially, $c_{ii} = d_i^{\mathrm{in}}$, where $d_i^{\mathrm{in}}$ is the in-degree of webpage $i$. Then we define the following function as the value which page $i$ will receive from page $j$: where .
Under this assignment method, the rank value for the page is determined by
The matrix form of above equation is , where . The problem is that, if at least one page has zero in-degree, that is, no in-links and , then the matrix is absorbing and its dominant eigenvector does not exist. In order to resolve this, similarly to PageRank, we assume that, if the page has no link that points to it, then at time , the page divides its value equally to any other page with probability . The modified matrix is given by where we replace with , and . can be computed as In the vector , the element if the -th row of corresponds to a page with no in-degree, or 0, otherwise. Therefore, the modified matrix becomes a stochastic matrix, that is, each column in sum to 1.
In order to get a unique stationary probability distribution, it is important to guarantee that is strongly connected. Similarly to PageRank, we add a new set of complete outgoing transitions. The final transition probability matrix based on using cocitation as a hyperlink weighting scheme is where and . The model based on cocitation (MBCC) is to solve the following function:
We assume that the solution of (16) denoted by is the MBCC authority ranking vector, and is the MBCC hub ranking vector. Since the matrix in (15) is stochastic and irreducible, just like the Google matrix in PageRank, the solution of (16) exists, and it is unique and positive.
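Since the formulas (11) to (16) are not reproduced in this text, the following is only a hedged sketch of how a model of this kind could be computed, not the authors' definition. It assumes that the weight page $i$ receives from page $j$ is the cocitation count $c_{ij}$ (with $c_{jj}$ equal to the in-degree of $j$) divided by the sum of column $j$, that a page with zero in-degree spreads its value uniformly with probability $1/n$, and that a uniform jump with probability `1 - alpha` is added as in PageRank. The normalization, the value `alpha = 0.85`, the tolerance, the function name, and the example graph are all assumptions.

```python
def mbcc_sketch(L, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power iteration on an assumed column-stochastic cocitation matrix."""
    n = len(L)
    # c[i][j]: pages citing both i and j; diagonal c[i][i] = in-degree of i.
    c = [[sum(L[k][i] * L[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    col_sum = [sum(c[i][j] for i in range(n)) for j in range(n)]
    x = [1.0 / n] * n
    for _ in range(max_iter):
        new = [(1.0 - alpha) / n] * n  # uniform jump
        for j in range(n):
            if col_sum[j] == 0:
                # Page j has zero in-degree: spread its value uniformly.
                for i in range(n):
                    new[i] += alpha * x[j] / n
            else:
                # Page j passes value to page i in proportion to c[i][j].
                for i in range(n):
                    new[i] += alpha * x[j] * c[i][j] / col_sum[j]
        residual = sum(abs(new[k] - x[k]) for k in range(n))  # L1 norm
        x = new
        if residual < tol:
            break
    return x


# Hypothetical web: 0 -> 1, 0 -> 2, 1 -> 2, 3 -> 2.
L = [[0, 1, 1, 0], [0, 0, 1, 0], [0, 0, 0, 0], [0, 0, 1, 0]]
scores = mbcc_sketch(L)
print(scores)
```

Under these assumptions every column of the iteration matrix sums to one and every entry is positive, so the computed vector is positive and sums to one, in line with the existence and uniqueness argument made for the stochastic and irreducible matrix in (15).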
4. Numerical Experiments
First, we present an example to describe the assignment process in Definition 2.
Example 1. Suppose that there are six webpages , and the directed graph is shown in Figure 2. The conclusion can be found from Table 1 and Figure 2. In Table 1, is the number of webpages that cocite webpages and ; is obtained by (11). In Figure 2, the left one is the original link structure of PageRank where the value of the page is divided equally to the pages that it points to, and the right one divides the value of based on cocitation.
Then, we compare the MBCC model with HITS and PageRank, using a dataset from http://www.cs.toronto.edu/~tsap/experiments/datasets/. The dataset concerns the topic "computational geometry" and contains a total of 1100 webpages. We set . Meanwhile, we use as the convergence tolerance and measure the convergence rates of the three algorithms using the L1 norm of the residual vector. Table 2 shows the list of the top 20 authorities with HITS, MBCC, and PageRank. Table 3 shows the list of the top 20 hubs with HITS and MBCC. These tables show that the MBCC authority ranking is closer to the HITS authority ranking than the PageRank ranking is. The comparison between the MBCC and HITS ranking vectors in Table 4 indicates that the MBCC ranking agrees well with the HITS ranking, especially in the top 10.
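One simple way to quantify how closely two rankings agree in the top positions is the fraction of pages they share among their top $k$ entries. The sketch below illustrates that idea; it is a hypothetical measure introduced here for illustration and is not necessarily the comparison reported in Table 4. The score vectors in the example are made up.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring pages (ties broken by lower index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:k]


def top_k_overlap(scores_a, scores_b, k):
    """Fraction of pages common to the top-k lists of two score vectors."""
    shared = set(top_k(scores_a, k)) & set(top_k(scores_b, k))
    return len(shared) / k


# Made-up score vectors for four pages.
a = [0.1, 0.4, 0.3, 0.2]
b = [0.4, 0.3, 0.2, 0.1]
print(top_k(a, 2))             # [1, 2]
print(top_k_overlap(a, b, 2))  # 0.5
```

An overlap of 1.0 means the two methods select the same set of top-$k$ pages, although possibly in a different order.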
5. Conclusions
In this work, we emphasize the role of cocitation in defining authorities. First, we observe that, in the HITS algorithm, if two distinct webpages $i$, $j$ are cocited by many other webpages, then $i$, $j$ are likely to be related in some sense or to have a certain commonality. From this close relationship, we conclude that the higher the number of webpages that cocite webpages $i$ and $j$, the stronger the relevancy between the two pages. A page $i$ with stronger relevancy should obtain more value from page $j$. Therefore, we develop a hyperlink weighting scheme for extracting information from the link structure. Then we combine hyperlink weight normalization and random surfing schemes, as used in PageRank, to justify the model.
The experimental results show that the MBCC authority (hub) ranking is close to the HITS authority (hub) ranking in the top 20, and in general a surfer seldom browses beyond the top 20 webpages. Moreover, MBCC retains the advantage of PageRank: the authority vector of MBCC in (16) exists, and it is unique and positive, while the authority and hub vectors of HITS may not be unique. Therefore, the authority (hub) ranking vector of MBCC can be used in place of the authority (hub) ranking vector of HITS.
Conflict of Interests
The authors declare that they have no conflict of interests regarding the publication of this paper.
Acknowledgments
This research is supported by NSFC (61370147, 61170309) and the Chinese Universities Specialized Research Fund for the Doctoral Program (20110185110020).
References
1. S. Brin, L. Page, R. Motwani, and T. Winograd, “The PageRank citation ranking: bringing order to the web,” Tech. Rep. 1999-0120, Computer Science Department, Stanford University, Stanford, Calif, USA, 1999.
2. S. Brin and L. Page, “The anatomy of a large-scale hypertextual web search engine,” Computer Networks and ISDN Systems, vol. 30, no. 1–7, pp. 107–117, 1998.
3. J. M. Kleinberg, “Authoritative sources in a hyperlinked environment,” Journal of the ACM, vol. 46, no. 5, pp. 604–632, 1999.
4. K. Avrachenkov, N. Litvak, D. Nemirovsky, and N. Osipova, “Monte Carlo methods in PageRank computation: when one iteration is sufficient,” SIAM Journal on Numerical Analysis, vol. 45, no. 2, pp. 890–904, 2007.
5. D. F. Gleich, A. P. Gray, C. Greif, and T. Lau, “An inner-outer iteration for computing PageRank,” SIAM Journal on Scientific Computing, vol. 32, no. 1, pp. 349–371, 2010.
6. R. S. Wills and I. C. F. Ipsen, “Ordinal ranking for Google’s PageRank,” SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 4, pp. 1677–1696, 2009.
7. A. N. Langville and C. D. Meyer, “A reordering for the PageRank problem,” SIAM Journal on Scientific Computing, vol. 27, no. 6, pp. 2112–2120, 2006.
8. S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub, “Exploiting the block structure of the web for computing PageRank,” Tech. Rep. 2003-17, Stanford University, Stanford, Calif, USA, 2003.
9. H. Small, “Co-citation in the scientific literature: a new measure of the relationship between two documents,” Journal of the American Society for Information Science, vol. 24, no. 4, pp. 265–269, 1973.
10. C. Ding, H. Zha, X. He, P. Husbands, and H. Simon, “Analysis of hubs and authorities on the web,” Tech. Rep. 47847, Lawrence Berkeley National Laboratory, 2001.
11. C. Ding, X. He, P. Husbands, H. Zha, and H. Simon, “PageRank, HITS and a unified framework for link analysis,” Tech. Rep. 49372, Lawrence Berkeley National Laboratory, 2001.
12. A. Farahat, T. Lofaro, J. C. Miller, G. Rae, and L. A. Ward, “Authority rankings from HITS, PageRank, and SALSA: existence, uniqueness, and effect of initialization,” SIAM Journal on Scientific Computing, vol. 27, no. 4, pp. 1181–1201, 2006.
13. A. N. Langville and C. D. Meyer, “A survey of eigenvector methods for web information retrieval,” SIAM Review, vol. 47, no. 1, pp. 135–161, 2005.
Copyright © 2014 Yue Xie and Ting-Zhu Huang. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.