Abstract

A focused crawler is topic-specific and aims to selectively collect web pages relevant to a given topic from the Internet. However, the performance of current focused crawlers easily suffers from the environment of web pages and from pages that cover multiple topics. In the crawling process, a highly relevant region may be ignored owing to the low overall relevance of its page, and anchor text or link-context may misguide crawlers. To solve these problems, this paper proposes a new focused crawler. First, we build a web page classifier based on an improved term weighting approach (ITFIDF) in order to gain highly relevant web pages. In addition, this paper introduces a link evaluation approach, link priority evaluation (LPE), which combines a web page content block partition algorithm with the strategy of joint feature evaluation (JFE) to better judge the relevance between the URLs on a web page and the given topic. The experimental results demonstrate that the classifier using ITFIDF outperforms TFIDF and that our focused crawler is superior to focused crawlers based on breadth-first, best-first, anchor text only, link-context only, and content block partition strategies in terms of harvest rate and target recall. In conclusion, our methods are significant and effective for focused crawling.

1. Introduction

With the rapid growth of network information, the Internet has become the largest information base. How to extract the knowledge of interest from massive information has become a hot topic in current research, and the first task of such research is to collect relevant information from the Internet, namely, to crawl web pages. Therefore, in order to crawl web pages effectively, researchers proposed web crawlers. Web crawlers are programs that collect information from the Internet. They can be divided into general-purpose web crawlers and special-purpose web crawlers [1, 2]. General-purpose web crawlers retrieve enormous numbers of web pages in all fields from the huge Internet. To find and store these web pages, general-purpose web crawlers must have long running times and immense hard-disk space. In contrast, special-purpose web crawlers, known as focused crawlers, yield good recall as well as good precision by restricting themselves to a limited domain [3-5]. Compared with general-purpose web crawlers, focused crawlers obviously need far less runtime and hardware resources. Therefore, focused crawlers have become increasingly important for gathering information from web pages with finite resources and have been used in a variety of applications such as search engines, information extraction, digital libraries, and text classification.

Classifying the web pages and selecting the URLs are the two most important steps of a focused crawler. Hence, the primary task of an effective focused crawler is to build a good web page classifier to filter out web pages irrelevant to a given topic and guide the search. Term Frequency Inverse Document Frequency (TFIDF) [6, 7] is the most common term weighting approach for text classification. However, TFIDF takes into account neither the different expressive ability of different page positions nor the proportion of feature distribution when computing weights. Therefore, this paper presents an improved TFIDF approach, ITFIDF, to make up for these defects in web page classification. In ITFIDF, the page content is divided into four sections: headline, keywords, anchor text, and body. We then assign different weights to the sections according to how well they express the page content: the stronger a section's expressive ability, the higher the weight it receives. In addition, ITFIDF develops a new weighting equation that improves the convergence of the algorithm by introducing the information gain of the term.

The approach used to select URLs also has a direct impact on the performance of focused crawling. This approach ensures that the crawler acquires more web pages that are relevant to the given topic. The URLs are selected from the unvisited list, where they are ranked in descending order of weights reflecting their relevance to the given topic. At present, most weighting methods are based on link features [8, 9] that include the current page, anchor text, link-context, and URL string. In particular, the current page is the most frequently used link feature. For example, Chakrabarti et al. [10] suggested a new approach to topic-specific Web resource discovery, and Diligenti et al. [11] suggested focused crawling using context graphs. Motivated by this, we propose the link priority evaluation (LPE) algorithm. In LPE, web pages are partitioned into smaller content blocks by the content block partition (CBP) algorithm. After partitioning the web page, we take each content block as a unit and evaluate it separately. If a block is relevant, all of its unvisited URLs are extracted and added into the frontier, and the block relevance is treated as their priority weight; otherwise, the links in the block are further evaluated by the joint feature evaluation (JFE) strategy, and only the links judged irrelevant are discarded.

The rest of this paper is organized as follows: Section 2 briefly introduces the related work. In Section 3, the approach of web page classification based on ITFIDF is proposed. Section 4 illustrates how the LPE algorithm is used to extract URLs and calculate their relevance. The whole crawling architecture is presented in Section 5. Several experiments are performed to evaluate the effectiveness of our method in Section 6. Finally, Section 7 concludes the paper.

2. Related Work

Since the birth of the WWW, researchers have explored different methods of collecting information from the Internet. Focused crawlers are commonly used instruments for information collection, and their performance depends heavily on the method used to select URLs. In what follows, we briefly review some work on selecting URLs.

Focused crawlers must calculate priorities for unvisited links to guide themselves to retrieve web pages related to a given topic from the Internet. The priority of a link is affected by the topical similarities of the full texts and the features (anchor texts, link-context) of the hyperlinks [12]. The formula is defined as
$$\mathrm{priority}(l_i) = \frac{1}{m}\sum_{j=1}^{m}\mathrm{sim}(t, p_j) + \mathrm{sim}(t, a_i), \quad i = 1, 2, \ldots, n,$$
where $\mathrm{priority}(l_i)$ is the priority of the link $l_i$ and $n$ is the number of links; $m$ is the number of retrieved web pages including the link $l_i$; $\mathrm{sim}(t, p_j)$ is the similarity between the topic $t$ and the full text $p_j$ of a retrieved web page that includes the link $l_i$; and $\mathrm{sim}(t, a_i)$ is the similarity between the topic $t$ and the anchor text $a_i$ corresponding to the link $l_i$.

Many variants of the above formula have been proposed to improve the prediction of link priorities. Early researchers took the topical similarities of the full texts of the pages containing the links as the strategy for prioritizing links, as in Fish Search [13], the Shark Search algorithm [14], and other focused crawlers [8, 10, 15, 16]. Because links provide additional features, the anchor texts and link-contexts in web pages have been utilized by many researchers to search the web [17]. Eiron and McCurley [18] presented a statistical study of the nature of anchor text and real user queries on a large corpus of corporate intranet documents. Li et al. [19] presented a focused crawler guided by anchor texts using a decision tree. Chen and Zhang [20] proposed HAWK, which is simply a combination of some well-known content-based and link-based crawling approaches. Peng and Liu [3] suggested an improved focused crawler combining full-text content and the features of unvisited hyperlinks. Du et al. [2] proposed an improved focused crawler based on a semantic similarity vector space model, which combines cosine similarity and semantic similarity and uses the full text and anchor text of a link as its documents.

3. Web Page Classification

The purpose of focused crawling is to retrieve web pages relevant to a given topic and discard irrelevant ones. This can be regarded as a binary classification problem. Therefore, we build a web page classifier with Naive Bayes, the most common algorithm used for text classification [21]. Constructing our classifier takes three steps: pruning the feature space, term weighting, and finally building the web page classifier.

3.1. Pruning the Feature Space

A web page classifier embeds documents into a feature space, which may be extremely large, especially for very large vocabularies. The size of the feature space affects both the efficiency and the effectiveness of the page classifier, so pruning it is necessary and significant. In this paper, we adopt mutual information (MI) [22] to prune the feature space. MI is a measure of information from information theory that represents the correlation between two events: the greater the MI, the stronger the correlation between the two events. Here, MI is used to measure the relationship between a feature $t_k$ and a class $c_i$.

Calculating MI takes two steps: first, compute the MI between a feature $t_k$ of the current page and each class $c_i$ and select the largest value as the MI of the feature $t_k$; then rank the features in descending order of MI and retain those whose values exceed a threshold. The formulas are as follows:
$$MI(t_k, c_i) = \log \frac{P(t_k, c_i)}{P(t_k)\,P(c_i)}, \qquad MI(t_k) = \max_{i} MI(t_k, c_i),$$
where $MI(t_k, c_i)$ denotes the MI between the feature $t_k$ and the class $c_i$; $P(t_k)$ denotes the probability that a document arbitrarily selected from the corpus contains the feature $t_k$; $P(c_i)$ denotes the probability that a document arbitrarily selected from the corpus belongs to the class $c_i$; and $P(t_k, c_i)$ denotes the joint probability that this arbitrarily selected document belongs to the class $c_i$ and contains the feature $t_k$ at the same time.
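As an illustration, the pruning step can be sketched in Python as follows (the paper's system is reported only as a Java application, so the function and variable names below, such as prune_feature_space, docs, labels, and mi_threshold, are assumptions made for this sketch):

import math
from collections import defaultdict

def prune_feature_space(docs, labels, mi_threshold):
    """docs: list of token lists; labels: the class label of each document.
    Keeps the features whose maximum MI over all classes exceeds the threshold."""
    n_docs = len(docs)
    df = defaultdict(int)       # number of documents containing feature t
    cf = defaultdict(int)       # number of documents in class c
    joint = defaultdict(int)    # number of documents of class c containing feature t
    for tokens, c in zip(docs, labels):
        cf[c] += 1
        for t in set(tokens):
            df[t] += 1
            joint[(t, c)] += 1
    kept = set()
    for t in df:
        # MI(t, c) = log( P(t, c) / (P(t) P(c)) ), maximized over the classes
        best = max(
            math.log((joint[(t, c)] / n_docs) / ((df[t] / n_docs) * (cf[c] / n_docs)))
            for c in cf if joint[(t, c)] > 0
        )
        if best > mi_threshold:
            kept.add(t)
    return kept

For example, selected = prune_feature_space(train_docs, train_labels, 0.1) would return the retained feature set for a hypothetical training corpus.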

3.2. Term Weighting

After pruning the feature space, a document $d_j$ is represented by its remaining terms, $d_j = (t_1, t_2, \ldots, t_m)$. We then need to calculate the weight of each term by a weighting method; in this paper we adopt ITFIDF. Compared with TFIDF, the improvements of ITFIDF are as follows.

In ITFIDF, the web page is divided into four sections: headline, keywords, anchor text, and body, and we assign different weights to the sections according to their ability to express the page content. The frequency of term $t_k$ in document $d_j$ is computed as
$$tf(t_k, d_j) = \lambda_1 f_h(t_k, d_j) + \lambda_2 f_k(t_k, d_j) + \lambda_3 f_a(t_k, d_j) + \lambda_4 f_b(t_k, d_j),$$
where $f_h$, $f_k$, $f_a$, and $f_b$ represent the occurrence frequency of term $t_k$ in the headline, keywords, anchor text, and body of the document $d_j$, respectively; $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are weight coefficients, and $\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 = 1$.

Further analysis shows that the TFIDF method does not consider the proportion of feature distribution. We therefore develop a new term weighting equation by introducing the information gain of the term. The new weight is calculated as
$$w(t_k, d_j) = tf(t_k, d_j) \cdot idf(t_k) \cdot IG(t_k),$$
where $w(t_k, d_j)$ is the weight of term $t_k$ in document $d_j$; $tf(t_k, d_j)$ and $idf(t_k)$ are, respectively, the term frequency and inverse document frequency of term $t_k$ in document $d_j$, with $idf(t_k) = \log(N / n_k)$ and $N$ the total number of documents in the set; $IG(t_k)$ is the information gain of term $t_k$, obtained by
$$IG(t_k) = H(D) - H(D \mid t_k).$$
$H(D)$ is the information entropy of the document set, obtained by
$$H(D) = -\sum_{j} P(d_j) \log P(d_j),$$
and $H(D \mid t_k)$ is the conditional entropy given the term $t_k$, obtained by
$$H(D \mid t_k) = -\sum_{j} P(d_j \mid t_k) \log P(d_j \mid t_k),$$
where $P(d_j)$ is the probability of document $d_j$. In this paper, we compute $P(d_j)$ based on [23], and the formula is defined as
$$P(d_j) = \frac{\sum_{k} tf(t_k, d_j)}{\sum_{j}\sum_{k} tf(t_k, d_j)},$$
where $\sum_{k} tf(t_k, d_j)$ refers to the sum of the feature frequencies of all the terms in the document $d_j$.
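A minimal sketch of the ITFIDF weight under the reconstruction above is given below; the position-weighted term frequency is combined with the inverse document frequency and the information gain of the term. The section weights and function names are illustrative assumptions, not values or interfaces reported by the paper:

import math

# Illustrative section weights (lambda_1..lambda_4, summing to 1); not values from the paper.
SECTION_WEIGHTS = {"headline": 0.4, "keywords": 0.3, "anchor": 0.2, "body": 0.1}

def position_weighted_tf(term, sections):
    """sections: dict mapping a section name to the list of tokens in that section."""
    return sum(SECTION_WEIGHTS[name] * sections[name].count(term)
               for name in SECTION_WEIGHTS if name in sections)

def information_gain(doc_probs, doc_probs_given_term):
    """IG(t) = H(D) - H(D|t), both entropies estimated from document probabilities."""
    h_d = -sum(p * math.log(p) for p in doc_probs if p > 0)
    h_d_t = -sum(p * math.log(p) for p in doc_probs_given_term if p > 0)
    return h_d - h_d_t

def itfidf_weight(term, sections, n_docs, doc_freq, ig):
    """ITFIDF weight: position-weighted tf x idf x information gain of the term."""
    idf = math.log(n_docs / doc_freq)
    return position_weighted_tf(term, sections) * idf * ig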

3.3. Building Web Page Classifier

After pruning the feature space and weighting the terms, we build the web page classifier with the Naïve Bayesian algorithm. In order to reduce the complexity of the calculation, we do not consider the dependence and order between terms in a web page. Assume that $N$ is the number of web pages in the set $D$ and $N_i$ is the number of web pages in the class $c_i$. According to Bayes' theorem, the probability that web page $d_j$ belongs to class $c_i$ is
$$P(c_i \mid d_j) = \frac{P(c_i)\,P(d_j \mid c_i)}{P(d_j)},$$
where $P(c_i) = N_i / N$ and this value is constant; $P(d_j)$ is constant too; $t_k$ is a term of the web page $d_j$, and $d_j$ can be represented by the eigenvector of its terms, that is, $d_j = (t_1, t_2, \ldots, t_m)$. Therefore, $P(c_i \mid d_j)$ is mostly determined by $P(d_j \mid c_i)$. Under the independence assumption above, $P(d_j \mid c_i)$ is computed as
$$P(d_j \mid c_i) = \prod_{k=1}^{n(d_j)} P(t_k \mid c_i),$$
where $n(d_j)$ is the number of terms in the document $d_j$ and $P(t_k \mid c_i)$ is estimated over $V_i$, the vocabulary of the class $c_i$.
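The classification step can be sketched as a standard multinomial Naive Bayes scorer; the log-space scoring and Laplace smoothing below are implementation choices of this sketch rather than details stated in the paper:

import math
from collections import defaultdict

class NaiveBayesPageClassifier:
    def fit(self, docs, labels):
        """docs: list of term lists (after pruning); labels: the class of each document."""
        self.class_docs = defaultdict(int)
        self.term_counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()
        for tokens, c in zip(docs, labels):
            self.class_docs[c] += 1
            for t in tokens:
                self.term_counts[c][t] += 1
                self.vocab.add(t)
        self.n_docs = len(docs)

    def predict(self, tokens):
        best_class, best_score = None, float("-inf")
        for c in self.class_docs:
            # log P(c) + sum_k log P(t_k | c), with Laplace smoothing
            score = math.log(self.class_docs[c] / self.n_docs)
            total = sum(self.term_counts[c].values())
            for t in tokens:
                score += math.log((self.term_counts[c][t] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_class, best_score = c, score
        return best_class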

4. Link Priority Evaluation

In many irrelevant web pages, there may be regions that are relevant to a given topic. Therefore, in order to select the URLs relevant to the given topic more completely, we propose the link priority evaluation (LPE) algorithm. In LPE, web pages are partitioned into smaller content blocks by content block partition (CBP) [3, 24, 25]. After partitioning the web page, we take each content block as the unit of relevance calculation and evaluate it separately. A highly relevant region in a page with low overall relevance is thus no longer obscured; however, simply discarding the links in irrelevant content blocks would lose anchors that link to relevant web pages. Hence, to solve this problem, we develop the JFE strategy, a relevance evaluation method between a link and its content block. If a content block is relevant, all unvisited URLs are extracted and added into the frontier, and the content block relevance is treated as their priority weight. Otherwise, LPE adopts JFE to evaluate the links in the block.

4.1. JFE Strategy

Researchers often use the anchor text or the link-context feature to calculate the relevance between a link and the topic, in order to extract relevant links from irrelevant content blocks. However, some web page designers do not summarize the destination web pages in the anchor text; instead, they use words such as “Click here,” “here,” “Read more,” “more,” and “next” and let the surrounding text describe the destination. If we calculate relevance only between the anchor text and the topic, we may omit some destination links. Similarly, if we calculate relevance only between the link-context and the topic, we may also omit some links or extract some irrelevant links. In view of this, we propose the JFE strategy to reduce such omissions and improve the performance of focused crawlers. JFE combines the features of anchor text and link-context:
$$sim_{JFE}(l, t) = \lambda \cdot sim_{anchor}(l, t) + (1 - \lambda) \cdot sim_{context}(l, t),$$
where $sim_{JFE}(l, t)$ is the similarity between the link $l$ and topic $t$; $sim_{anchor}(l, t)$ is the similarity between the link $l$ and topic $t$ when only the anchor text feature is used; $sim_{context}(l, t)$ is the similarity between the link $l$ and topic $t$ when only the link-context feature is used; and $\lambda$ ($0 \le \lambda \le 1$) is an impact factor used to adjust the weighting between them. If $\lambda > 0.5$, the anchor text is more important than the link-context feature in the JFE strategy; if $\lambda < 0.5$, the link-context feature is more important than the anchor text; if $\lambda = 0.5$, the anchor text and link-context feature are equally important. In this paper, $\lambda$ is assigned the constant 0.5.
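The JFE combination itself is a one-line weighted sum. The sketch below assumes a cosine-similarity helper such as the one given for Section 4.2 and uses lambda = 0.5 as in the paper:

def jfe_similarity(anchor_vec, context_vec, topic_vec, lam=0.5):
    """Joint feature evaluation: weighted sum of anchor-text and link-context similarities."""
    return (lam * cosine_similarity(anchor_vec, topic_vec)
            + (1.0 - lam) * cosine_similarity(context_vec, topic_vec))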

4.2. LPE Algorithm

LPE is used to calculate the similarity between the links of the current web page and a given topic. It can be described as follows. First, the current web page is partitioned into content blocks by CBP. Then, we compute the relevance of each content block to the topic using the similarity measure. If a content block is relevant, all unvisited URLs in it are extracted and added into the frontier, and the content block similarity is treated as their priority; if the content block is not relevant, JFE is used to calculate the similarity of each link, and that similarity is treated as the priority weight. Algorithm 1 describes the process of LPE.

Input: current web page, eigenvector T of a given topic, threshold
Output: url_queue
(1)  procedure LPE
(2)    block_list ← CBP(web page)
(3)    for each block in block_list
(4)      extract features from block, compute weights, and generate eigenvector B of block
(5)      sim_block ← sim(B, T)
(6)      if sim_block ≥ threshold then
(7)        link_list ← extract each link of block
(8)        for each link in link_list
(9)          Priority(link) ← sim_block
(10)         enqueue its unvisited urls into url_queue based on priorities
(11)       end for
(12)     else
(13)       temp_queue ← extract all anchor texts and link_contexts
(14)       for each link in temp_queue
(15)         extract features from anchor text, compute weights, and generate eigenvector A of anchor text
(16)         extract features from link_context, compute weights, and generate eigenvector C of link_context
(17)         sim_JFE ← λ · sim(A, T) + (1 − λ) · sim(C, T)
(18)         if sim_JFE ≥ threshold then
(19)           Priority(link) ← sim_JFE
(20)           enqueue its unvisited urls into url_queue based on priorities
(21)         end if
(22)         dequeue url in temp_queue
(23)       end for
(24)     end if
(25)   end for
(26)  end procedure

LPE computes the weight of each term based on the TFC weighting scheme [26] after preprocessing. The TFC weighting equation is
$$w_{ik} = \frac{tf_{ik} \cdot \log(N / n_k)}{\sqrt{\sum_{j=1}^{M} \left( tf_{ij} \cdot \log(N / n_j) \right)^2}},$$
where $tf_{ik}$ is the frequency of term $t_k$ in the unit $i$ (content block, anchor text, or link-context); $N$ is the number of feature units in the collection; $M$ is the number of all the terms; and $n_k$ is the number of units in which the word $t_k$ occurs.

Then we use the cosine measure to compute the similarity between a link feature and the topic:
$$sim(B, T) = \frac{\sum_{k=1}^{m} w_{bk} \cdot w_{tk}}{\sqrt{\sum_{k=1}^{m} w_{bk}^{2}} \cdot \sqrt{\sum_{k=1}^{m} w_{tk}^{2}}},$$
where $B$ is the eigenvector of a unit, that is, $B = (w_{b1}, w_{b2}, \ldots, w_{bm})$; $T$ is the eigenvector of the given topic, that is, $T = (w_{t1}, w_{t2}, \ldots, w_{tm})$; and $w_{bk}$ and $w_{tk}$ are the weights of the $k$th terms of $B$ and $T$, respectively. Hence, when $B$ is the eigenvector of a content block, the above formula gives the block similarity; in the same way, it gives $sim_{anchor}$ and $sim_{context}$.
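Under the reconstructions above, the TFC weight and the cosine measure can be sketched as follows; representing feature vectors as dictionaries mapping terms to weights is a choice of this sketch, not of the paper:

import math

def tfc_weights(unit_tf, collection_tfs):
    """unit_tf: term -> raw frequency in one unit (content block, anchor text, or link-context).
    collection_tfs: one such dictionary per unit in the collection."""
    n_units = len(collection_tfs)
    df = {t: sum(1 for u in collection_tfs if t in u) for t in unit_tf}
    raw = {t: f * math.log(n_units / df[t]) for t, f in unit_tf.items() if df[t] > 0}
    norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
    return {t: w / norm for t, w in raw.items()}

def cosine_similarity(vec_a, vec_b):
    """Cosine measure between two term-weight dictionaries."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    na = math.sqrt(sum(w * w for w in vec_a.values()))
    nb = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (na * nb) if na and nb else 0.0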

5. Improved Focused Crawler

In this section, we present the architecture of the focused crawler enhanced by web page classification and link priority evaluation. Figure 1 shows the architecture of our focused crawler, which proceeds in the following steps (a sketch of this loop is given after the list):
(1) The crawler component dequeues a URL from the url_queue (frontier), which is a priority queue. Initially, the seed URLs are inserted into the url_queue with the highest priority score. Afterwards, items are dequeued on a highest-priority-first basis.
(2) The crawler locates the web page pointed to by the current fetched URL and attempts to download its HTML data.
(3) For each downloaded web page, the crawler applies the web page classifier. Relevant web pages are added into the relevant web page set.
(4) The web page is then parsed into its DOM tree and partitioned into content blocks according to HTML content block tags by the CBP algorithm, and the relevance between each content block and the topic is calculated using the similarity measure. If a content block is relevant, all unvisited URLs are extracted and added into the frontier, and the content block relevance is treated as their priority weight.
(5) If the content block is not relevant, all anchors and link-contexts are extracted and the JFE strategy is used to obtain each link's relevance. If a link is relevant, it is also added into the frontier with its relevance as priority weight; otherwise, it is discarded.
(6) The focused crawler continuously downloads web pages for the given topic until the frontier becomes empty or the number of relevant web pages reaches a preset limit.
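The following sketch shows how the six steps above could fit together in a single crawl loop. Everything here, including the helper names fetch, classify_page, partition_blocks, block_vector, extract_links, and extract_links_with_context and the use of a binary heap as the priority frontier, is an assumed scaffold around the described architecture, not the authors' implementation:

import heapq

def crawl(seed_urls, topic_vec, threshold=0.5, max_relevant=10000):
    frontier = [(-1.0, url) for url in seed_urls]     # negated scores give a max-priority queue
    heapq.heapify(frontier)
    visited, relevant_pages = set(), []
    while frontier and len(relevant_pages) < max_relevant:
        _, url = heapq.heappop(frontier)              # (1) dequeue the highest-priority URL
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)                             # (2) download the HTML data
        if page is None:
            continue
        if classify_page(page):                       # (3) web page classifier
            relevant_pages.append(page)
        for block in partition_blocks(page):          # (4) CBP partition of the DOM tree
            sim = cosine_similarity(block_vector(block), topic_vec)
            if sim >= threshold:
                for link_url in extract_links(block):
                    if link_url not in visited:
                        heapq.heappush(frontier, (-sim, link_url))
            else:                                     # (5) JFE on the links of irrelevant blocks
                for link_url, anchor_vec, context_vec in extract_links_with_context(block):
                    s = jfe_similarity(anchor_vec, context_vec, topic_vec)
                    if s >= threshold and link_url not in visited:
                        heapq.heappush(frontier, (-s, link_url))
    return relevant_pages                             # (6) stop on empty frontier or page budget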

6. Experimental Results and Discussion

In order to verify the effectiveness of the proposed focused crawler, several tests were carried out. The tests are Java applications running on a quad-core 2.4 GHz Core i7 PC with 8 GB of RAM and a SATA disk. The experiments consist of two parts: evaluating the performance of the web page classifier and evaluating the performance of the focused crawler.

6.1. Evaluate the Performance of Web Page Classifier
6.1.1. Experimental Datasets

In this experiment, we used Reuters-21578, Reuters Corpus Volume 1 (RCV1) (http://trec.nist.gov/data/reuters/reuters.html), 20 Newsgroups (http://qwone.com/~jason/20Newsgroups/), and the Open Directory Project (ODP, http://www.dmoz.org/) as our training and test datasets. Of the 135 topics in Reuters-21578, 5,480 documents from 10 topics are used in this paper. RCV1 contains about 810,000 English-language news stories collected from the Reuters newswire. We use the “topic codes” set, which includes four hierarchical groups: CCAT, ECAT, GCAT, and MCAT; among the 789,670 documents, 5,000 are used in this paper. The 20 Newsgroups dataset contains about 20,000 newsgroup documents collected by Ken Lang; of the 20 different newsgroups, 8,540 documents from 10 newsgroups are used in this paper. ODP is the largest, most comprehensive human-edited directory of the Web. Its data structure is organized as a tree whose nodes contain URLs linking to topical web pages; we use the first three layers and consider both the hyperlink text and the corresponding description. We choose ten topics as samples to test the performance of our method, and 500 samples are chosen from each topic.

6.1.2. Performance Metrics

The performance of the web page classifier directly reflects the availability of the focused crawler, so it is essential to evaluate it. Most classification tasks are evaluated using Precision, Recall, and F-Measure. For text classification, Precision is the fraction of documents assigned to a class that are actually relevant to it, which measures how well the classifier rejects irrelevant documents. Recall is the proportion of relevant documents that the classifier assigns to the class, which measures how well it finds all the relevant documents. We assume that $A$ is the set of relevant web pages in the test dataset and $B$ is the set of web pages assigned as relevant by the classifier. We then define Precision [3, 27] and Recall [3, 27] as follows:
$$Precision = \frac{|A \cap B|}{|B|}, \qquad Recall = \frac{|A \cap B|}{|A|}.$$

Recall and Precision play very important roles in the performance evaluation of a classifier. However, they have a well-known limitation: improving one of them often causes the other to decline [27]. To mediate the relationship between Recall and Precision, Lewis [28, 29] proposed the F-Measure, which is also used to measure the performance of our web page classifier in this paper. The F-Measure is defined as
$$F_\beta = \frac{(\beta^2 + 1) \cdot Precision \cdot Recall}{\beta^2 \cdot Precision + Recall},$$
where $\beta$ is a weight reflecting the relative importance of Precision and Recall. If $\beta > 1$, Recall is more important than Precision; if $\beta < 1$, Precision is more important than Recall; if $\beta = 1$, Recall and Precision are equally important. In this paper, $\beta$ is assigned the constant 1.
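For completeness, the three metrics reduce to simple set computations; the arguments below follow the definitions of A and B in the text:

def precision_recall_f(relevant, assigned, beta=1.0):
    """relevant: set A of truly relevant pages; assigned: set B predicted relevant by the classifier."""
    hits = len(relevant & assigned)
    precision = hits / len(assigned) if assigned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f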

6.1.3. Evaluate the Performance of Web Page Classifier

In order to test the performance of ITFIDF, we run the classifier using different term weighting methods. For a fair comparison, we use the same feature-space pruning method and classification model in the experiment. Figure 2 compares the F-Measure achieved by our classification method using ITFIDF and TFIDF weighting for each topic on the four datasets.

As can be seen from Figure 2, the classification method using ITFIDF weighting performs better than TFIDF on each dataset. In Figure 2, the average F-Measure of ITFIDF exceeds that of TFIDF by 5.3, 2.0, 5.6, and 1.1 percentage points on the four datasets, respectively. These experimental results show that our classification method is effective in solving classification problems and that the proposed ITFIDF term weighting is significant and effective for web page classification.

6.2. Evaluate the Performance of Focused Crawler
6.2.1. Experimental Data

In this experiment, we selected the relevant web pages and the seed URLs for the 10 topics above as the input data of our crawler. The topics are basketball, military, football, big data, glasses, web games, cloud computing, digital camera, mobile phone, and robot. The relevant web pages for each topic accurately describe the corresponding topic; they were selected manually, and the number of relevant web pages for each topic was set to 30. The seed URLs for each topic were also selected manually and are shown in Table 1.

6.2.2. Performance Metrics

The performance of the focused crawler also directly reflects the effectiveness of the crawling. Perhaps the most crucial evaluation of a focused crawler is to measure the rate at which relevant web pages are acquired and how effectively irrelevant web pages are filtered out. With this knowledge, we could estimate the precision and recall of the focused crawler after crawling: precision would be the fraction of crawled pages that are relevant to the topic, and recall would be the fraction of relevant pages crawled. However, the relevant set for any given topic is unknown in the web, so the true recall is hard to measure. Therefore, we adopt the harvest rate and the target recall to evaluate the performance of our focused crawler; they are defined as follows (see the sketch after this list):
(1) The harvest rate [30, 31] is the fraction of crawled web pages that are relevant to the given topic, which measures how well the crawler rejects irrelevant web pages. It is given by
$$HR = \frac{\sum_{i=1}^{N} r_i}{N},$$
where $N$ is the number of web pages crawled by the focused crawler so far and $r_i$ is the relevance between web page $p_i$ and the given topic; $r_i$ can only be 0 or 1, with $r_i = 1$ if the page is relevant and $r_i = 0$ otherwise.
(2) The target recall [30, 31] is the fraction of relevant pages crawled, which measures how well the crawler finds all the relevant web pages. However, the relevant set for any given topic is unknown in the Web, so the true target recall is hard to measure. In view of this, we delineate a specific network that is regarded as a virtual WWW in the experiment: given a set of seed URLs and a certain depth, the range reached by a crawler using a breadth-first crawling strategy is the virtual Web. We assume that the target set $T$ is the relevant set in the virtual Web and $C_t$ is the set of the first $t$ pages crawled. The target recall is given by
$$R_t = \frac{|T \cap C_t|}{|T|}.$$
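The two crawling metrics likewise reduce to simple ratios over the crawl log. The sketch below assumes an is_relevant predicate (the page classifier or a manual judgment), which is not specified at this level of detail in the paper:

def harvest_rate(crawled_pages, is_relevant):
    """Fraction of crawled pages judged relevant to the given topic."""
    if not crawled_pages:
        return 0.0
    return sum(1 for p in crawled_pages if is_relevant(p)) / len(crawled_pages)

def target_recall(crawled_urls, target_set):
    """Fraction of the virtual-Web target set reached by the first t crawled pages."""
    return len(set(crawled_urls) & target_set) / len(target_set) if target_set else 0.0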

6.2.3. Evaluation the Performance of Focused Crawler

An experiment was designed to show that the proposed web page classification method and the LPE algorithm improve the performance of focused crawlers. In this experiment, we built crawlers using different techniques (breadth-first, best-first, anchor text only, link-context only, and CBP), described below, to crawl web pages. Different web page content block partition methods have different impacts on focused crawling performance; according to the experimental results in [25], alpha in the CBP algorithm is assigned the constant 0.5 in this paper. The threshold in the LPE algorithm is a very important parameter: if it is too large, the focused crawler finds it hard to collect web pages; conversely, if it is too small, the average harvest rate of the focused crawler is low. Therefore, according to the actual situation, the threshold is assigned the constant 0.5 in the rest of the experiments. To reflect the comprehensiveness of our method, Figures 3 and 4 show the average harvest rate and the average target recall over the ten topics for each crawling strategy, respectively.

Figure 3 shows a comparison of the average harvest rates of the six crawling methods over the ten topics. In Figure 3, the x-axis represents the number of crawled web pages, and the y-axis represents the average harvest rate at that number of crawled pages. As the number of crawled web pages increases, the average harvest rates of all six crawling methods fall. This occurs because the number of crawled web pages and the number of relevant web pages grow at different rates, and the increment of the former is larger than that of the latter. Figure 3 also shows that the harvest rate of the LPE crawler is higher than those of the other five crawlers. In addition, the harvest rates of the breadth-first, best-first, anchor text only, link-context only, CBP, and LPE crawlers are, respectively, 0.16, 0.28, 0.39, 0.48, 0.61, and 0.80 at the point corresponding to 10,000 crawled web pages. These values indicate that the harvest rate of the LPE crawler is 5.0, 2.9, 2.0, 1.7, and 1.3 times as large as those of the breadth-first, best-first, anchor text only, link-context only, and CBP crawlers, respectively. Therefore, the figure indicates that the LPE crawler collects more topical web pages than the other five crawlers.

Figure 4 shows a comparison of the average target recall of the six crawling methods over the ten topics. In Figure 4, the x-axis represents the number of crawled web pages, and the y-axis represents the average target recall at that number of crawled pages. As the number of crawled web pages increases, the average target recall of all six crawling methods rises. This occurs because the number of crawled web pages keeps increasing while the target set is unchanged. The average target recall of the LPE crawler is higher than that of the other five crawlers across the numbers of crawled web pages. In addition, the target recalls of the breadth-first, best-first, anchor text only, link-context only, CBP, and LPE crawlers are, respectively, 0.10, 0.15, 0.19, 0.21, 0.27, and 0.33 at the point corresponding to 10,000 crawled web pages. These values indicate that the target recall of the LPE crawler is 3.3, 2.2, 1.7, 1.6, and 1.2 times as large as those of the breadth-first, best-first, anchor text only, link-context only, and CBP crawlers, respectively. Therefore, the figure indicates that the LPE crawler collects a greater quantity of topical web pages than the other five crawlers.

It can be concluded that the LPE crawler outperforms the other five focused crawlers. For the 10 topics, the LPE crawler crawls greater quantities of topical web pages and predicts the topical priorities of links more accurately than the other crawlers. In short, LPE, through the CBP algorithm and the JFE strategy, improves the performance of focused crawlers.

7. Conclusions

In this paper, we presented a novel focused crawler that improves collection performance by using a web page classifier and the link priority evaluation algorithm. The proposed approaches and the experimental results lead to the following conclusions.

TFIDF takes into account neither the different expressive ability of different page positions nor the proportion of feature distribution when building a web page classifier. ITFIDF was therefore proposed to make up for these defects in web page classification. The performance of the classifier using ITFIDF was compared with that of the classifier using TFIDF on four datasets, and the results show that the ITFIDF classifier outperforms TFIDF on each dataset. In addition, in order to better select relevant URLs, we proposed the link priority evaluation algorithm, which has two stages: first, web pages are partitioned into smaller blocks by the CBP algorithm; second, the relevance between the links of the blocks and the given topic is calculated by LPE. The LPE crawler was compared with other crawlers on 10 topics and is superior to the other techniques in terms of average harvest rate and target recall. In conclusion, the web page classifier and the LPE algorithm are significant and effective for focused crawlers.

Competing Interests

The authors declare that they have no competing interests.