Website owners tell web spiders such as Googlebot what can and can't be crawled on their websites using a robots.txt file. The file resides in the root directory of a website and contains rules such as the following:

    User-agent: *
    Disallow: /secret
    Disallow: /password.txt

A well-behaved web spider will first read the robots.txt file and adhere to its rules, though compliance is voluntary rather than enforced.
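To see how a compliant crawler interprets such rules, here is a minimal sketch using Python's built-in urllib.robotparser module. The www.example.com domain and the paths checked are illustrative assumptions, not part of any real site.

    # Minimal sketch of how a polite crawler checks robots.txt before fetching a URL.
    # The domain and paths below are illustrative only.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    # can_fetch() returns True only if the given user agent may crawl the URL
    print(rp.can_fetch("*", "https://www.example.com/secret"))      # False if /secret is disallowed
    print(rp.can_fetch("*", "https://www.example.com/index.html"))  # True if not disallowed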

When you run the scrapy crawl command for a project, Scrapy first fetches the website's robots.txt file and abides by its rules.

    $ scrapy crawl myspider
    2018-06-19 12:05:10 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: scrapyproject)
    ---snipped---
    2018-06-19 12:05:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
    ---snipped---
    2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/robots.txt> (referer: None)
    2018-06-19 12:05:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/> (referer: None)
    ---snipped---

You can make your Scrapy spider ignore robots.txt by setting the ROBOTSTXT_OBEY option to False.

Steps to ignore robots.txt for Scrapy spiders:

  1. Crawl a website normally using the scrapy crawl command for your project; by default it adheres to robots.txt rules.

    $ scrapy crawl spidername

  2. Use the --set option to set ROBOTSTXT_OBEY to False when crawling to ignore robots.txt rules.

    $ scrapy crawl --set=ROBOTSTXT_OBEY=False spidername

  3. Open Scrapy's configuration file in your project folder using your favorite editor.

    $ vi scrapyproject/settings.py

  4. Look for the ROBOTSTXT_OBEY option.

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True

  5. Set the value to False.

    ROBOTSTXT_OBEY = False

  6. Scrapy will no longer check for robots.txt, and your spider will crawl everything regardless of what's defined in the robots.txt file.
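If you only want a single spider to ignore robots.txt while the rest of the project keeps obeying it, you can also override the setting per spider with the custom_settings attribute instead of editing settings.py. The following is a minimal sketch; the spider name and start URL are hypothetical.

    import scrapy

    class MySpider(scrapy.Spider):
        name = "myspider"  # hypothetical spider name
        start_urls = ["https://www.example.com/"]

        # Per-spider settings override the project-wide values in settings.py,
        # so only this spider skips the robots.txt check.
        custom_settings = {
            "ROBOTSTXT_OBEY": False,
        }

        def parse(self, response):
            # Placeholder callback: just log the crawled URL.
            self.logger.info("Crawled %s", response.url)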