A Brief Look at the HTTP Status Code Return Mechanism of the Baidu Crawler

2023-12-26

I. Introduction

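An HTTP status code is the three-digit code that a web server sends back to the client (a browser or a crawler) at the start of each response. Standard codes such as 200 OK tell the client what happened to the requested page. To fetch pages and store them in its database, Baidu's crawler has to follow these HTTP rules. This article therefore gives a plain introduction to how HTTP responses are signaled and which status code appears in which situation.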
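To make the idea concrete, here is a minimal sketch of how a client could read the status code from the first line of an HTTP/1.x response. The function name `parse_status_line` is our own choice for illustration; real clients and crawlers use a full HTTP library rather than splitting strings by hand.

```python
def parse_status_line(line: str) -> tuple:
    """Split an HTTP/1.x status line into (version, code, reason phrase)."""
    # The status line has the form "<version> <code> <reason>", and the
    # reason phrase may itself contain spaces, so split at most twice.
    version, code, *reason = line.strip().split(" ", 2)
    return version, int(code), reason[0] if reason else ""


print(parse_status_line("HTTP/1.1 200 OK"))
print(parse_status_line("HTTP/1.1 404 Not Found"))
```

The integer in the middle is what the rest of this article calls the status code; the text after it is only a human-readable label.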

II. HTTP Status Codes

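1. 200 OK: One of the most common and most important HTTP status codes. In most cases it indicates that the web server has successfully processed the request.

2. 301 Moved Permanently: A permanent redirect. A request for a given link, such as www.example.com/old-page.html, is redirected to a new URL, such as www.example.com/new-page.html.

3. 302 Found (Moved Temporarily): A temporary redirect. It works like 301 Moved Permanently, except that the URL change is only temporary.

4. 404 Not Found: The server could not find the requested page.

5. 403 Forbidden: The server refuses to fulfill the request, for example because it blocks a particular IP address or does not allow that address to use a particular method (such as POST).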
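The five codes above can be looked up with Python's standard `http.HTTPStatus` enum. The sketch below pairs each one with a short description; the `MEANINGS` table and the `describe` helper are our own illustrative names, and the wording of each description is a paraphrase of the list above, not official text.

```python
from http import HTTPStatus

# Short descriptions paraphrased from the list above (illustrative wording).
MEANINGS = {
    HTTPStatus.OK: "request handled successfully",
    HTTPStatus.MOVED_PERMANENTLY: "permanent redirect to a new URL",
    HTTPStatus.FOUND: "temporary redirect to a new URL",
    HTTPStatus.NOT_FOUND: "page could not be found",
    HTTPStatus.FORBIDDEN: "server refuses this client or method",
}


def describe(code: int) -> str:
    """Return 'code phrase: meaning' for a numeric status code."""
    status = HTTPStatus(code)  # raises ValueError for codes the enum does not know
    meaning = MEANINGS.get(status, "not covered in this article")
    return f"{status.value} {status.phrase}: {meaning}"


for code in (200, 301, 302, 404, 403):
    print(describe(code))
```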

III. HTTP Status Code Return Mechanism of the Baidu Crawler

1. The Baidu crawler first sends a request to the server and waits for the response in order to fetch the web page or other resource. If no response arrives within a certain time limit, the crawler treats the request as failed and stops crawling that page.

2. When a response arrives, the crawler checks the HTTP status code returned by the server. If it is an error code such as 404 Not Found or 403 Forbidden, the crawler stops processing the page immediately. If it is a normal status code such as 200 OK, the crawler goes on to download the page content.

3. In addition to checking status codes, Baidu checks the Robots Exclusion Protocol file (robots.txt) before sending requests, so that it does not waste resources on pages that the site owner has excluded from crawling.

4. After the content of pages that returned a normal status code has been downloaded, the Baidu spider stores it in a database for later use, such as indexing, so that the pages can appear in search results when users enter related keywords in Baidu's search box.

5. Finally, once all of the steps above have completed without errors, the Baidu spider moves on to the next page in the task queue and repeats the process until every queued page has been handled in turn.
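The five steps above can be sketched as a small loop. This is only an illustration of the flow described in this section, not Baidu's actual implementation: the `crawl` function, the `ExampleSpider` user agent, the `fake_fetch` stand-in for a network request, and the sample pages are all assumptions made for the example. Redirect handling is left out for brevity, so only 200 OK counts as success here. The robots.txt check uses Python's standard `urllib.robotparser`.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleSpider"  # hypothetical name, not Baidu's real user agent


def crawl(urls, fetch, robots, store):
    """Process URLs in order; fetch(url) returns (status, body) or raises TimeoutError."""
    pending = deque(urls)
    while pending:                                  # step 5: work through the task queue
        url = pending.popleft()
        if not robots.can_fetch(USER_AGENT, url):   # step 3: respect robots.txt
            continue
        try:
            status, body = fetch(url)               # step 1: send request, wait for response
        except TimeoutError:
            continue                                # no response in time: give up on this page
        if status != 200:                           # step 2: stop on 404, 403, and so on
            continue
        store[url] = body                           # step 4: keep content for later indexing
    return store


# Demo with made-up data instead of real network traffic.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

PAGES = {
    "http://www.example.com/": (200, "<html>home</html>"),
    "http://www.example.com/missing": (404, ""),
    "http://www.example.com/private/a": (200, "<html>secret</html>"),
}


def fake_fetch(url):
    # Any URL not in PAGES simulates a server that never answers.
    if url not in PAGES:
        raise TimeoutError(url)
    return PAGES[url]


stored = crawl(list(PAGES) + ["http://www.example.com/slow"], fake_fetch, robots, {})
print(sorted(stored))
```

Passing `fetch` in as a parameter keeps the loop independent of any particular HTTP library, so the same logic can be exercised without touching the network.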
