How to ignore robots.txt for Scrapy spiders
Website owners use a robots.txt file to tell web spiders such as Googlebot what can and can't be crawled on their websites. The file resides in the root directory of a website and contains rules such as the following:
User-agent: *
Disallow: /secret
Disallow: /password.txt
A well-behaved web spider reads the robots.txt file first and adheres to its rules, though compliance is voluntary and nothing technically enforces it.
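To illustrate how a well-behaved spider applies such rules, here is a minimal sketch using Python's standard urllib.robotparser module. It is not how Scrapy implements the check internally. The user agent name "mybot" and the test URLs are hypothetical, and the rules assume the intended path is /password.txt.

```python
from urllib.robotparser import RobotFileParser

# Rules modeled on the example above; the leading slash on
# /password.txt is an assumption about the intended path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /secret
Disallow: /password.txt
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "mybot" is a hypothetical user agent; it matches the "*" rule group.
for url in (
    "https://www.example.com/",
    "https://www.example.com/secret",
    "https://www.example.com/password.txt",
):
    verdict = "allowed" if parser.can_fetch("mybot", url) else "disallowed"
    print(f"{url}: {verdict}")
```

A polite crawler would run a check like can_fetch before every request and skip any URL that is disallowed.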
When you run the scrapy crawl command for a project that has ROBOTSTXT_OBEY enabled, as it is in the settings.py generated by scrapy startproject, Scrapy first fetches the robots.txt file and abides by its rules.
$ scrapy crawl myspider
2018-06-19 12:05:10 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapyproject)
---snipped---
2018-06-19 12:05:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
---snipped---
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/robots.txt> (referer: None)
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/> (referer: None)
---snipped---
You can make your Scrapy spider ignore robots.txt by setting the ROBOTSTXT_OBEY option to False.
Steps to ignore robots.txt for Scrapy spiders:
-
Crawl a website normally with the scrapy crawl command. With the project's default settings, the spider adheres to robots.txt rules.
$ scrapy crawl spidername
-
Use the --set option to set ROBOTSTXT_OBEY to False for a single crawl, so that the spider ignores robots.txt rules.
$ scrapy crawl --set=ROBOTSTXT_OBEY='False' spidername
-
To make the change permanent, open Scrapy's configuration file in your project folder using your favorite editor.
$ vi scrapyproject/settings.py
-
Look for the ROBOTSTXT_OBEY option.
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
-
Set the value to False.
ROBOTSTXT_OBEY = False
-
Scrapy should no longer fetch robots.txt, and your spider will crawl everything it is pointed at, regardless of the rules defined in that file.