Research Article

Construction of Online English Corpus Based on Web Crawler Technology

Table 2

Comparison of web crawler frameworks.

| Web crawler framework | Programming language | Description |
|---|---|---|
| Apache Nutch | Java | Collects the entire contents of a website (general and whole-web crawling) without regard to the accuracy of collection and analysis, making it suitable for web search engines; however, Nutch's crawler customization ability is relatively weak. |
| WebMagic | Java | Modular design with strong scalability; rich page-extraction APIs; supports multithreaded and distributed crawling, as well as crawling of pages rendered dynamically with JavaScript; has no framework dependency, so it can be flexibly embedded into a project. |
| WebCollector | Java | A Java crawler framework that needs no configuration and is convenient for secondary development; a powerful crawler can be built with only a small amount of code; supports distributed crawling. |
| Heritrix3 | Java | Highly extensible, making it easy for users to implement their own crawl logic. |
| Crawler4j | Java | A lightweight, single-machine, open-source Java crawler framework whose biggest feature is simplicity; also supports multithreading and proxies, and can filter duplicate URLs. |
| Scrapy | Python | A fast, simple, efficient, and extensible web-content scraping framework developed entirely in Python, used to extract structured data from websites; widely applied to data mining, monitoring, information processing, and automated testing; being open source, it can easily be modified to fit specific needs. |
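The table's recurring features (link extraction, the crawl queue, and the duplicate-URL filtering that Crawler4j and the others advertise) can be illustrated with a minimal sketch of the fetch, parse, and enqueue loop that such frameworks automate. This is not any framework's actual API; the names `extract_links` and `crawl` are illustrative, and network access is replaced by a caller-supplied `fetch` function so the core logic stays visible.

```python
# A minimal sketch (standard library only) of the crawl loop that frameworks
# like Scrapy or WebMagic implement for you. Hypothetical names throughout.
from html.parser import HTMLParser
from urllib.parse import urljoin
from collections import deque

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(base_url, html):
    """Return the absolute URLs of all anchors in an HTML document."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]

def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; `fetch` maps a URL to its HTML text (or None).
    Already-seen URLs are skipped: the duplicate-URL filtering step."""
    seen, queue, pages = set(), deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        queue.extend(extract_links(url, html))
    return pages

# Usage with a stubbed-out fetch (a dict standing in for the network):
site = {
    "http://example.test/": '<a href="/a">A</a><a href="/a">A again</a>',
    "http://example.test/a": "corpus text",
}
pages = crawl("http://example.test/", site.get)
# Each page is visited once despite the duplicate link on the start page.
```

A real corpus crawler would replace the `fetch` stub with HTTP requests plus politeness controls (robots.txt, rate limiting), which is precisely the machinery the frameworks in the table provide out of the box.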