Website owners tell web spiders such as Googlebot what can and can't be crawled on their websites using a robots.txt file. The file resides in the root directory of a website and contains rules such as the following:
User-agent: *
Disallow: /secret
Disallow: /password.txt
A well-behaved web spider will first read the robots.txt file and adhere to its rules, though compliance is voluntary and nothing technically enforces it.
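As a rough illustration of what "reading and adhering to" the rules means, the sketch below checks URLs against a robots.txt body using Python's standard urllib.robotparser module. It is not Scrapy's own implementation; the helper name is_allowed and the user agent mybot are hypothetical, and only the /secret rule from the example above is used.

```python
from urllib.robotparser import RobotFileParser

# Rules modelled on the example above (assumption: only the /secret rule).
ROBOTS_TXT = """\
User-agent: *
Disallow: /secret
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "mybot" is a hypothetical user agent name used for illustration.
    for url in (
        "https://www.example.com/",
        "https://www.example.com/secret/page.html",
    ):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, "mybot", url) else "disallowed"
        print(url, verdict)
```

A polite spider would run a check like this before every request and skip any URL that comes back disallowed.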
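If you run the scrapy crawl command for a project, Scrapy first fetches the robots.txt file and abides by all of its rules, provided ROBOTSTXT_OBEY is enabled, as it is in the settings.py that scrapy startproject generates.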
$ scrapy crawl myspider
2018-06-19 12:05:10 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapyproject)
---snipped---
2018-06-19 12:05:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
---snipped---
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/robots.txt> (referer: None)
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/> (referer: None)
---snipped---
You can make your Scrapy spider ignore robots.txt by setting the ROBOTSTXT_OBEY option to False.
Steps to ignore robots.txt for Scrapy spiders:
-
Crawl a website normally using the scrapy crawl command for your project. With the project's default settings, the spider adheres to robots.txt rules.
$ scrapy crawl spidername
-
Use the --set option to set ROBOTSTXT_OBEY to False for a single crawl, so that the run ignores robots.txt rules.
$ scrapy crawl --set=ROBOTSTXT_OBEY='False' spidername
-
To make the change permanent, open Scrapy's configuration file in your project folder using your favorite editor.
$ vi scrapyproject/settings.py
-
Look for the ROBOTSTXT_OBEY option.
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
-
Set the value to False and save the file.
ROBOTSTXT_OBEY = False
-
Scrapy should no longer fetch robots.txt, and your spider will crawl everything it is pointed at regardless of what's defined in the robots.txt file.
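If only one spider in the project should ignore robots.txt, the same option can also be set on that spider through Scrapy's custom_settings class attribute instead of editing settings.py. The following is a minimal sketch, not a complete spider: the class name MySpider and the parse logic are hypothetical, while the spider name and URL are taken from the log output earlier.

```python
import scrapy


class MySpider(scrapy.Spider):
    # Spider name as used with "scrapy crawl myspider".
    name = "myspider"
    start_urls = ["https://www.example.com/"]

    # Per-spider override: takes precedence over the project's settings.py,
    # so only this spider skips the robots.txt check.
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
    }

    def parse(self, response):
        # Hypothetical placeholder logic: record the URL of each crawled page.
        yield {"url": response.url}
```

Other spiders in the project keep whatever value settings.py defines, which can be useful when most targets should still be crawled politely.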