How to ignore robots.txt for Scrapy spiders

webmaster

2 years ago

Website owners tell web spiders such as Googlebot what can and can't be crawled on their websites using a robots.txt file. The file resides in the root directory of a website and contains rules such as the following:

User-agent: *
Disallow: /secret
Disallow: password.txt

A well-behaved web spider first reads the robots.txt file and adheres to its rules, though doing so is not compulsory.
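To make the rule check concrete, here is a minimal sketch of how a spider could test URLs against such rules using Python's standard `urllib.robotparser` module. It is an illustration only, not how Scrapy does it internally. The user agent name `mybot` is hypothetical, and the second rule is written as `/password.txt` on the assumption that the path is meant to be matched from the site root.

```python
from urllib.robotparser import RobotFileParser

# Sample rules modeled on the example above. The leading slash on
# /password.txt is an assumption: this parser matches paths from the root.
RULES = """\
User-agent: *
Disallow: /secret
Disallow: /password.txt
"""


def build_parser(rules: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


parser = build_parser(RULES)

# "mybot" is a hypothetical user agent; the "*" rules apply to it.
for url in (
    "https://www.example.com/",
    "https://www.example.com/secret",
    "https://www.example.com/password.txt",
):
    verdict = "allowed" if parser.can_fetch("mybot", url) else "disallowed"
    print(url, "->", verdict)
```

A spider that honors robots.txt would skip every URL reported as disallowed; one that ignores the file would simply never run this check.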

If you run the scrapy crawl command for a project, Scrapy will first fetch the robots.txt file and abide by its rules, as the following log shows.

$ scrapy crawl myspider
2018-06-19 12:05:10 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapyproject)
---snipped---
2018-06-19 12:05:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
---snipped---
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/robots.txt> (referer: None)
2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/> (referer: None)
---snipped---

You can make your Scrapy spider ignore robots.txt by setting the ROBOTSTXT_OBEY option to False.

Steps to ignore robots.txt for Scrapy spiders:

Crawl a website normally with the scrapy crawl command for your project. With the project's default setting, the spider adheres to robots.txt rules.
```
$ scrapy crawl spidername
```
Use the --set option to set ROBOTSTXT_OBEY to False for a single crawl so that robots.txt rules are ignored.
```
$ scrapy crawl --set=ROBOTSTXT_OBEY='False' spidername
```
Open Scrapy's configuration file in your project folder using your favorite editor.
```
$ vi scrapyproject/settings.py
```

Look for the ROBOTSTXT_OBEY option.

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

Set the value to False.
```
ROBOTSTXT_OBEY = False
```
Scrapy should no longer fetch robots.txt, and your spider will crawl everything regardless of what's defined in the robots.txt file.
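If you would rather script the settings.py edit than do it by hand, the steps above can be sketched in plain Python. This is only one possible approach, not something Scrapy provides: the helper name `disable_robotstxt_obey` is hypothetical, and it assumes the option is written as a simple one-line assignment like the one shown earlier.

```python
import re

# Matches an uncommented, one-line ROBOTSTXT_OBEY assignment.
# Group 1 keeps the indentation, the name, and the "=" untouched.
_ASSIGNMENT = re.compile(r"^([ \t]*ROBOTSTXT_OBEY[ \t]*=[ \t]*).*$", re.MULTILINE)


def disable_robotstxt_obey(settings_text: str) -> str:
    """Return the settings text with ROBOTSTXT_OBEY set to False.

    If no assignment is found, one is appended at the end.
    (Hypothetical helper, not part of Scrapy.)
    """
    if _ASSIGNMENT.search(settings_text):
        return _ASSIGNMENT.sub(r"\g<1>False", settings_text)
    return settings_text.rstrip("\n") + "\nROBOTSTXT_OBEY = False\n"


# Sample text mirroring the lines found in settings.py above.
sample = "# Obey robots.txt rules\nROBOTSTXT_OBEY = True\n"
print(disable_robotstxt_obey(sample))
```

To apply it to a real project, you could read scrapyproject/settings.py with `pathlib.Path.read_text()`, pass the text through the helper, and write the result back with `write_text()`. Keep a backup of the file first.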
