Research Article

Construction of Online English Corpus Based on Web Crawler Technology

Table 2

Comparison of web crawler frameworks.

| Web crawler framework | Programming language | Description |
|---|---|---|
| Apache Nutch | Java | Collects the entire contents of a website (general and whole-web crawling) without regard to the accuracy of collection and analysis, making it suitable for web search engines; however, Nutch's crawler customization ability is relatively weak. |
| WebMagic | Java | Modular design with strong scalability; rich page-extraction APIs; supports multithreaded and distributed crawling, as well as crawling of pages rendered dynamically with JavaScript; has no framework dependency, so it can be flexibly embedded into a project. |
| WebCollector | Java | A Java crawler framework that needs no configuration and is convenient for secondary development; a powerful crawler can be built with only a small amount of code; supports distributed crawling. |
| Heritrix3 | Java | Highly extensible, making it easy for users to implement their own crawl logic. |
| Crawler4j | Java | A lightweight, single-machine, open-source Java crawler framework whose biggest feature is simplicity; also supports multithreading and proxies, and can filter duplicate URLs. |
| Scrapy | Python | A fast, simple, efficient, and extensible web-content scraping framework developed entirely in Python, used to extract structured data from websites; widely applied to data mining, monitoring, information processing, and automated testing; being open source, it can easily be modified to fit specific needs. |
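The table's recurring features (link extraction, the crawl queue, and the duplicate-URL filtering that Crawler4j and the others advertise) can be illustrated with a minimal sketch of the fetch, parse, and enqueue loop that such frameworks automate. This is not any framework's actual API; the names `extract_links` and `crawl` are illustrative, and network access is replaced by a caller-supplied `fetch` function so the core logic stays visible.

```python
# A minimal sketch (standard library only) of the crawl loop that frameworks
# like Scrapy or WebMagic implement for you. Hypothetical names throughout.
from html.parser import HTMLParser
from urllib.parse import urljoin
from collections import deque

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(base_url, html):
    """Return the absolute URLs of all anchors in an HTML document."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]

def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; `fetch` maps a URL to its HTML text (or None).
    Already-seen URLs are skipped: the duplicate-URL filtering step."""
    seen, queue, pages = set(), deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        queue.extend(extract_links(url, html))
    return pages

# Usage with a stubbed-out fetch (a dict standing in for the network):
site = {
    "http://example.test/": '<a href="/a">A</a><a href="/a">A again</a>',
    "http://example.test/a": "corpus text",
}
pages = crawl("http://example.test/", site.get)
# Each page is visited once despite the duplicate link on the start page.
```

A real corpus crawler would replace the `fetch` stub with HTTP requests plus politeness controls (robots.txt, rate limiting), which is precisely the machinery the frameworks in the table provide out of the box.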