Notes: managing search engine crawler behavior with robots.txt

2022-02-23 08:20

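Keep crawlers from fetching many pages in a short time

You can set a delay with Crawl-delay.
For example, 5 means a crawler may fetch one page every five seconds.
(That said, there are many kinds of crawlers on the web, so several of them may still hit the site within the same second. Crawl-delay is also a non-standard directive, and some crawlers, Googlebot among them, ignore it.)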

User-agent: *
Crawl-delay: 5
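
Seen from the crawler's side, a well-behaved client can read this value before deciding how long to wait between requests. The following is a minimal sketch using Python's standard urllib.robotparser; the bot name ExampleBot is a made-up placeholder, and the rules are parsed from a string rather than fetched from a real site.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
"""

def crawl_delay_for(robots_txt, user_agent):
    """Return the Crawl-delay (in seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)

# "ExampleBot" is a hypothetical crawler name used only for illustration.
delay = crawl_delay_for(ROBOTS_TXT, "ExampleBot")
print(delay)  # 5
```

A polite crawler would then sleep for that many seconds between page requests; nothing forces it to, which is why the directive only helps with cooperative bots.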

Block all crawlers

User-agent: *
Disallow: /
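
To confirm that a rule set like this really blocks everything, it can be run through a parser. The sketch below assumes Python's urllib.robotparser; www.xxx.com is the same placeholder host used elsewhere in these notes, and ExampleBot is a hypothetical bot name.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /",
])

def allowed(user_agent, url):
    """True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)

# Every path starts with "/", so every URL is refused for every agent.
for url in ("https://www.xxx.com/", "https://www.xxx.com/news/1.htm"):
    print(url, allowed("ExampleBot", url))
```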

To block one specific crawler instead:

User-agent: Baiduspider   # crawler name
Disallow: /
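
A quick way to see the effect of a per-crawler group is to ask a parser about two different agents. This is a sketch with Python's urllib.robotparser; the path /index.htm is just a sample.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Baiduspider   # crawler name
Disallow: /
"""

def can_crawl(user_agent, path, rules=RULES):
    """Parse rules and report whether user_agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, path)

print(can_crawl("Baiduspider", "/index.htm"))  # False: its group disallows everything
print(can_crawl("Googlebot", "/index.htm"))    # True: no group matches, so nothing is disallowed
```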

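The following blocks all crawlers,
but lets Googlebot fetch every page or file outside /plugin/
(a crawler obeys only the most specific group that matches its name, so Googlebot follows its own group and ignores the * group):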

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /plugin/
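
The group-selection behavior can be checked with a parser. The sketch below uses Python's urllib.robotparser with the intended /plugin/ path; the sample paths and the ExampleBot name are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /plugin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

def check(user_agent, path):
    """True if user_agent may fetch path under RULES."""
    return parser.can_fetch(user_agent, path)

print(check("Googlebot", "/article/1.htm"))     # True: only /plugin/ is closed to Googlebot
print(check("Googlebot", "/plugin/editor.js"))  # False
print(check("ExampleBot", "/article/1.htm"))    # False: falls back to the * group
```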

Block specific files or directories

User-agent: *
Disallow: /contactus.htm
Disallow: /index.htm
Disallow: /admin/
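
Disallow values are matched as path prefixes, which has one side effect worth knowing: a rule for /index.htm also covers /index.html. A small sketch with Python's urllib.robotparser shows this; the extra sample paths and the ExampleBot name are assumptions.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /contactus.htm
Disallow: /index.htm
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

def blocked(path, user_agent="ExampleBot"):
    """True if path is disallowed for user_agent."""
    return not parser.can_fetch(user_agent, path)

for path in ("/contactus.htm", "/admin/login.php", "/index.html", "/about.htm"):
    print(path, blocked(path))
```

Only /about.htm comes back as not blocked; /index.html is caught because it begins with /index.htm.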

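However, listing the admin (back-end) URL in robots.txt is not recommended.
robots.txt is publicly readable, so doing this tells attackers where the admin pages are and raises the risk of being hacked.

Putting it together, a complete robots.txt might look like this: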

User-agent: *
Disallow: /plugin/

User-agent: msnbot
Disallow: /*.txt
Disallow: /plugin/

User-agent: AhrefsBot
Crawl-delay: 5

User-agent: Baiduspider
Disallow: /

User-agent: Petalbot
Disallow: /

Sitemap: https://www.xxx.com/sitemap.xml
Sitemap: https://www.xxx.com/sitemap2.xml
Sitemap: https://www.xxx.com/sitemap3.xml
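
Before deploying a file like this, it can help to parse it and spot-check a few agents. The sketch below assumes Python 3.8 or later (for site_maps()) and the standard urllib.robotparser; ExampleBot is a hypothetical name. As far as I know, this parser does plain prefix matching and does not interpret the * wildcard in /*.txt, so that msnbot rule is left out of the checks.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /plugin/

User-agent: msnbot
Disallow: /*.txt
Disallow: /plugin/

User-agent: AhrefsBot
Crawl-delay: 5

User-agent: Baiduspider
Disallow: /

User-agent: Petalbot
Disallow: /

Sitemap: https://www.xxx.com/sitemap.xml
Sitemap: https://www.xxx.com/sitemap2.xml
Sitemap: https://www.xxx.com/sitemap3.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.crawl_delay("AhrefsBot"))                   # 5
print(parser.can_fetch("Baiduspider", "/"))              # False
print(parser.can_fetch("Petalbot", "/"))                 # False
print(parser.can_fetch("ExampleBot", "/plugin/a.js"))    # False, via the * group
print(parser.can_fetch("ExampleBot", "/article/1.htm"))  # True
print(parser.site_maps())                                # the three Sitemap URLs
```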

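Common crawler names

Googlebot             Google's crawler
Mediapartners-Google  Google's crawler that shows up when the site carries AdSense ads
Yahoo! Slurp          Yahoo's crawler
bingbot               Microsoft Bing's crawler

AhrefsBot             crawler of a web analytics/SEO company https://ahrefs.com/
MJ12bot               crawler of a web analytics/SEO company https://www.mj12bot.com/
Baiduspider           Baidu's crawler
YandexBot             crawler of the Russian search company Yandex
Petalbot              Huawei's crawler https://aspiegel.com/petalbot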
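
These names also appear inside the User-Agent header that each crawler sends, so they can be used to label requests in access logs. The following is a rough sketch in Python; the function name identify_crawler is hypothetical, the two User-Agent strings are samples, and since the header can be spoofed the result is only a hint.

```python
# Tokens taken from the list above, matched case-insensitively as substrings.
KNOWN_CRAWLERS = [
    "Googlebot",
    "Mediapartners-Google",
    "Slurp",
    "bingbot",
    "AhrefsBot",
    "MJ12bot",
    "Baiduspider",
    "YandexBot",
    "Petalbot",
]

def identify_crawler(user_agent_header):
    """Return the first known crawler token found in the header, or None."""
    lowered = user_agent_header.lower()
    for token in KNOWN_CRAWLERS:
        if token.lower() in lowered:
            return token
    return None

sample = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(identify_crawler(sample))  # Googlebot
print(identify_crawler("Mozilla/5.0 (Windows NT 10.0; Win64; x64)"))  # None
```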

