20 October 2022 How to circumvent common anti-crawler mechanism of target websites via Scrapy
Xinkai Gao, Fengshan Yuan, Jihui Fan
Proceedings Volume 12350, 6th International Workshop on Advanced Algorithms and Control Engineering (IWAACE 2022); 123501O (2022) https://doi.org/10.1117/12.2652837
Event: 6th International Workshop on Advanced Algorithms and Control Engineering (IWAACE 2022), 2022, Qingdao, China
Abstract
In artificial intelligence and machine learning, enterprises can train sufficiently reliable models only when they have access to large amounts of data [1]. Obtaining massive data at low cost has therefore become a key prerequisite for the success of data intelligence enterprises, and holding a large amount of data is an important prerequisite for gaining a competitive advantage [2]. Enterprises that hold massive data increasingly recognize that if the data underpinning their advantage is collected by their peers, that advantage will be weakened or even lost. As a result, more and more owners of massive data adopt various mechanisms to protect the public data in their network applications from being crawled [3]. From the perspective of data collectors, this paper describes some common anti-crawling mechanisms in detail, based on the Scrapy framework and the recruitment website of a well-known internet enterprise, and then presents techniques to circumvent these mechanisms. Using these techniques, all job information on the enterprise's recruitment website was crawled successfully. The experimental results show that the techniques presented in the paper can effectively bypass the anti-crawling mechanisms of some large websites and thereby help collectors obtain massive data.
© (2022) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Xinkai Gao, Fengshan Yuan, and Jihui Fan "How to circumvent common anti-crawler mechanism of target websites via Scrapy", Proc. SPIE 12350, 6th International Workshop on Advanced Algorithms and Control Engineering (IWAACE 2022), 123501O (20 October 2022); https://doi.org/10.1117/12.2652837
KEYWORDS
Data storage
Data modeling
Internet
Databases
Fluctuations and noise
Logic
Machine learning
