A common way of presenting data on websites is the HTML table, and Scrapy is perfect for the job of scraping it.

An HTML table starts with a table tag, with each row defined by a tr tag and each column by a td tag. Optionally, thead is used to group the header rows and tbody to group the content rows.

To scrape data from an HTML table, we basically need to find the table that we're interested in on a website and iterate over each of its rows, pulling out the data from the columns that we want, as the sketch below shows.
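Before diving into the steps, here's a minimal, self-contained sketch of that pattern using parsel, the selector library Scrapy uses under the hood (the sample HTML and names are illustrative, not from the page scraped below):

    from parsel import Selector

    # Illustrative sample table; any table with tbody rows works the same way.
    html = """
    <table>
      <tbody>
        <tr><td>Ada</td><td>Lovelace</td></tr>
        <tr><td>Alan</td><td>Turing</td></tr>
      </tbody>
    </table>
    """

    selector = Selector(text=html)

    # Find the table, then iterate over its rows and pull each column's text.
    for row in selector.xpath('//table//tbody/tr'):
        first = row.xpath('td[1]//text()').get()  # first column
        last = row.xpath('td[2]//text()').get()   # second column
        print(first, last)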

Steps to scrape an HTML table using Scrapy:

  1. Go to the web page that you want to scrape the table data from using your web browser.

    For this example, we're going to scrape Bootstrap's Table documentation page.

  2. Inspect the table element using your browser's built-in developer tools or by viewing the page source.

    In this case, the table is assigned the classes table and table-striped. Here's the actual HTML code for the table:

    <table class="table table-striped">
      <thead>
        <tr>
          <th scope="col">#</th>
          <th scope="col">First</th>
          <th scope="col">Last</th>
          <th scope="col">Handle</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th scope="row">1</th>
          <td>Mark</td>
          <td>Otto</td>
          <td>@mdo</td>
        </tr>
        <tr>
          <th scope="row">2</th>
          <td>Jacob</td>
          <td>Thornton</td>
          <td>@fat</td>
        </tr>
        <tr>
          <th scope="row">3</th>
          <td>Larry</td>
          <td>the Bird</td>
          <td>@twitter</td>
        </tr>
      </tbody>
    </table>

  3. Launch the Scrapy shell at the terminal with the web page URL as an argument.

    $ scrapy shell https://getbootstrap.com/docs/4.0/content/tables/
    2020-05-26 02:52:01 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapybot)
    2020-05-26 02:52:01 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 18.9.0, Python 3.8.2 (default, Apr 27 2020, 15:53:34) - [GCC 9.3.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1f  31 Mar 2020), cryptography 2.8, Platform Linux-5.4.0-31-generic-x86_64-with-glibc2.29
    2020-05-26 02:52:01 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
    2020-05-26 02:52:01 [scrapy.crawler] INFO: Overridden settings:
    {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
     'LOGSTATS_INTERVAL': 0}
    ##### snipped

  4. Check the HTTP response code to see if the request was successful.

    In [1]: response
    Out[1]: <200 https://getbootstrap.com/docs/4.0/content/tables/>

    200 is the HTTP OK status code, indicating that the request succeeded.
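    In a spider, the status code is also available programmatically as response.status. Here's a small sketch of guarding a callback with it (the warning message is illustrative); note that by default Scrapy's HttpError middleware only passes successful (2xx) responses to callbacks anyway.

    def parse(self, response):
        # response.status holds the integer HTTP status code of the reply.
        if response.status != 200:
            self.logger.warning('Got status %s for %s', response.status, response.url)
            return
        # ... continue parsing the page here ...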

  5. Search for the table you're interested in using an XPath selector.

    In [2]: table = response.xpath('//*[@class="table table-striped"]')

    In [3]: table
    Out[3]: [<Selector xpath='//*[@class="table table-striped"]' data='<table class="table table-striped">\n ...'>]

    In this case, the table is assigned the table and table-striped CSS classes, and that's what we use in our selector. Keep in mind that @class="table table-striped" matches the class attribute string exactly, so a table carrying the same classes in a different order or with extra classes would be missed.
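    If you want a selector that's robust to class order and extra classes, Scrapy selectors also support CSS syntax, which matches individual classes; a small equivalent sketch for this page:

    # CSS class selectors match each class independently, regardless of
    # order or additional classes on the element.
    table = response.css('table.table-striped')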

  6. Narrow down the search to tbody, if applicable.

    In [4]: table = response.xpath('//*[@class="table table-striped"]//tbody')

    In [5]: table
    Out[5]: [<Selector xpath='//*[@class="table table-striped"]//tbody' data='<tbody>\n    <tr>\n      <th scope="row...'>]

  7. Get the table rows by searching for tr.

    In [6]: rows = table.xpath('//tr')

    In [7]: rows
    Out[7]:
    [<Selector xpath='//tr' data='<tr>\n      <th scope="col">#</th>\n   ...'>,
     <Selector xpath='//tr' data='<tr>\n      <th scope="row">1</th>\n   ...'>,
     <Selector xpath='//tr' data='<tr>\n      <th scope="row">2</th>\n   ...'>,
     <Selector xpath='//tr' data='<tr>\n      <th scope="row">3</th>\n   ...'>,
     <Selector xpath='//tr' data='<tr>\n      <th scope="col">#</th>\n   ...'>,
    ##### snipped

    Note that '//tr' is an absolute XPath: even though it's called on the tbody selector, it matches every tr in the whole document, which is why the header row and rows from the page's other tables show up in the result.
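    To keep the search scoped to the selected tbody, prefix the expression with a dot so the XPath is relative; a small sketch:

    # './/tr' is relative to the tbody selector, so only its own rows match
    # and the header row in thead is excluded.
    rows = table.xpath('.//tr')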

  8. Select a row to test.

    In [8]: row = rows[2]

    Multiple matched rows are stored in a list-like object and can be accessed by index. Because '//tr' also matched the header row at index 0, index 2 is the second content row, the one for Jacob.

  9. Access the row's columns via the <td> selector and extract each column's data.

    In [9]: row.xpath('td//text()')[0].extract()
    Out[9]: 'Jacob'

    The first column (the row number) uses <th> instead of <td>, thus index 0 of our td selection corresponds to the First column of the table.
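    If you also need the row number from that leading th cell, select it separately; a small sketch (row_number is just an illustrative name):

    # The first cell is a <th scope="row">, so it needs its own selector.
    row_number = row.xpath('th//text()').get()  # '2' for Jacob's row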

  10. Combine everything into a complete piece of code by iterating over each row with a for loop.

    In [10]: for row in response.xpath('//*[@class="table table-striped"]//tbody//tr'):
        ...:     name = {
        ...:         'first' : row.xpath('td[1]//text()').extract_first(),
        ...:         'last': row.xpath('td[2]//text()').extract_first(),
        ...:         'handle' : row.xpath('td[3]//text()').extract_first(),
        ...:     }
        ...:     print(name)
        ...:
    {'first': 'Mark', 'last': 'Otto', 'handle': '@mdo'}
    {'first': 'Jacob', 'last': 'Thornton', 'handle': '@fat'}
    {'first': 'Larry', 'last': 'the Bird', 'handle': '@twitter'}
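    extract_first() still works, but newer Scrapy releases recommend the shorter get() alias; the same loop could equivalently be written as:

    for row in response.xpath('//*[@class="table table-striped"]//tbody//tr'):
        name = {
            'first': row.xpath('td[1]//text()').get(),   # get() is the modern
            'last': row.xpath('td[2]//text()').get(),    # alias of extract_first()
            'handle': row.xpath('td[3]//text()').get(),
        }
        print(name)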

  11. Create a Scrapy spider from the previous code (optional).

    scrape-table.py
    import scrapy


    class ScrapeTableSpider(scrapy.Spider):
        name = 'scrape-table'
        allowed_domains = ['getbootstrap.com']

        def start_requests(self):
            urls = [
                'https://getbootstrap.com/docs/4.0/content/tables/',
            ]
            for url in urls:
                yield scrapy.Request(url=url, callback=self.parse)

        def parse(self, response):
            for row in response.xpath('//*[@class="table table-striped"]//tbody/tr'):
                yield {
                    'first': row.xpath('td[1]//text()').extract_first(),
                    'last': row.xpath('td[2]//text()').extract_first(),
                    'handle': row.xpath('td[3]//text()').extract_first(),
                }
  12. Run the spider with JSON output.

    $ scrapy crawl --nolog --output -:json scrape-table
    [
    {"first": "Mark", "last": "Otto", "handle": "@mdo"},
    {"first": "Jacob", "last": "Thornton", "handle": "@fat"},
    {"first": "Larry", "last": "the Bird", "handle": "@twitter"}
    ]
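    The crawl command assumes the spider is registered inside a Scrapy project. If scrape-table.py is a standalone file as above, runspider runs it without a project (writing to a file this time):

    $ scrapy runspider scrape-table.py --output output.json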