Everyone should know about web crawlers. How much do you know?
A web crawler is a program or script that automatically crawls information on the Internet according to certain rules. [2] It can be thought of as a small robot that automatically visits web pages and performs related operations. In essence, it reads and collects network information efficiently and automatically. One of the earliest crawler programs was developed by Eichmann at the University of Houston in 1994. Google's well-known crawler was developed in Python around 1998 by Brin and Page, who were students at Stanford University at the time.
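
As a rough illustration of the "visit a page and collect information" step described above, here is a minimal sketch that extracts the links from an HTML page that has already been downloaded. The names `LinkCollector` and `extract_links` and the example.com URLs are assumptions made for this example. It is not how any particular crawler is implemented, and a real crawler would also need fetching, deduplication, and rate limiting.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html: str, base_url: str) -> list[str]:
    """Return all link targets found in html, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    page = '<a href="/about">About</a> <a href="https://other.example/x">X</a>'
    print(extract_links(page, "https://example.com/index.html"))
```

A crawler would repeat this step for each collected link, subject to the site's rules.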

The legal risks of using crawler technology come mainly from three sources:

(1) Violating the will of the crawled party, for example by circumventing or forcibly breaking through the anti-crawling measures a website has set up;

(2) Using a crawler in a way that actually interferes with the normal operation of the visited website;

(3) Capturing specific types of information that are protected by law. The third risk arises mainly from evading anti-crawler measures in order to grab information that has not been published on the Internet.

A: Crawling that abides by the robots protocol (robots.txt) is not illegal.

A: Append /robots.txt to the website's domain name and open that link to view the file.

For example, for TikTok: /robots.txt (appended to the site's domain name).
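
The lookup described above can be sketched as a small helper that derives the robots.txt address from any page URL on the site. The function name `robots_txt_url` and the example.com address are assumptions made for illustration. The sketch only builds the URL and does not download anything.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {page_url!r}")
    # Keep the scheme and host (including any port); replace the path
    # with /robots.txt and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("https://example.com/some/page.html?id=1"))
```

The file always sits at the root of the host, so the path of the original page is discarded.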

User-agent: the robot (such as "Googlebot") to which the following rules apply.

Disallow: a page that you want to prevent the robot from accessing (use as many Disallow lines as needed).

Block the whole website: Disallow: /

Block a directory and everything in it: Disallow: /private_directory/

Block a page: Disallow: /private_file.html

Block pages and/or directories whose path begins with /private: Disallow: /private

Allow: pages that robots are permitted to access.

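Noindex: pages that you want search engines not to index (or to remove from the index if they were indexed before). This directive was never part of the robots.txt standard: Google honored it unofficially and stopped doing so in September 2019, and Yahoo and Live Search did not support it. Support in other search engines is unknown.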
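
To see how the User-agent and Disallow rules above behave, the following sketch checks URLs against robots.txt text using `urllib.robotparser` from the Python standard library. The helper name `is_allowed`, the crawler name "ExampleBot", the two sample rule sets, and the example.com URLs are assumptions made for this example. The rules are parsed from a string, so nothing is fetched from the network.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_text: str, user_agent: str, url: str) -> bool:
    """Return True if robots_text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


# Blocks every path that begins with /private, for all robots.
BLOCK_PRIVATE = """\
User-agent: *
Disallow: /private
"""

# Blocks the whole website, for all robots.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

if __name__ == "__main__":
    for path in ("/public.html", "/private_file.html", "/private_directory/a.html"):
        url = "https://example.com" + path
        print(path, is_allowed(BLOCK_PRIVATE, "ExampleBot", url))
```

Disallow values are matched as path prefixes, which is why `Disallow: /private` covers both `/private_file.html` and `/private_directory/`. When Allow and Disallow lines overlap, different parsers may resolve the conflict differently, so it is worth testing mixed rule sets against the parser you actually use.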

For example, in order to let the robot check everything/tutorial/Zhang Zhan/2017/061771/
