Getting started with Scrapy: A beginner's guide to web scraping in Python

I’d long been curious about web scraping and am pleased to have finally made a start in this direction. Previously, any scraping job I needed was carried out via import.io, but now I’ve branched out to Scrapy. I’d also wanted to practise my use of Python, so this was a great opportunity to kill two birds with one stone. Here I’ll share my first foray into this area - it may be useful for others who are also starting out (as well as for my future self, as a reminder).
I’m planning to do at least one more post on this topic, but for now we’ll start with a simple case where all the content we are interested in sits on a single page - no need to cycle through multiple pages of results. A good candidate I’ve found is this page on LookFantastic, listing all their currently active discounts and vouchers. There are quite a few, but thankfully the page is fairly tidy, making it a perfect beginner case study. So here we go!
But wait, why should I bother?
Granted, this is just a toy example - but even here, scraping the offers could have practical applications. For instance, we might be watching several e-shops that sell a particular item, so it would be useful to know when an offer becomes active for it. We could schedule tasks via cron to regularly scrape these sites, then trigger an alert or email notification when a relevant offer appears. In fact, web scraping can be used to build entire services of this nature (without knowing their infrastructure, Pouch or Honey may well rely on some form of web scraping).
Libraries and tools
We’ll start by loading the necessary libraries. I should add that for the purposes of this post, I am using Anaconda v.2020.02, a Python 3.7 interpreter, and Scrapy v.1.6.0. Equally, a really handy sidekick is the Web Developer menu → Inspector tool within Firefox (there are equivalent tools in other browsers).
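Something along these lines covers everything used below: requests for fetching the page, Scrapy’s Selector for parsing it, and pandas for the final DataFrame:

```python
# requests fetches the raw HTML, Scrapy's Selector parses it,
# and pandas organises the extracted data at the end
import requests
import pandas as pd
from scrapy import Selector
```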
With that taken care of, let’s get stuck in. The code below works smoothly for the page I chose at the time of writing, but I cannot guarantee this will still be the case in the future (you can see the current lay of the land via this offline copy of the page).
CSS and XPath Selectors in Scrapy
Based on the documentation,
Scrapy comes with its own mechanism for extracting data. They’re called selectors because they “select” certain parts of the HTML document specified either by XPath or CSS expressions.
So this is exactly what we are doing below: with a helping hand from requests, which we loaded earlier, we extract the HTML content of the page:
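A minimal sketch of that step (the exact URL of the offers page is an assumption and may have changed since writing):

```python
# Fetch the offers page and wrap its HTML in a Scrapy Selector
url = "https://www.lookfantastic.com/voucher-codes.list"  # assumed URL; check the live site
response = requests.get(url)
sel = Selector(text=response.text)
```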
Now that we have extracted the HTML from the URL, we have to figure out how to access the specific elements that are of interest to us. This is where Firefox’s Inspector comes in. When the Inspector is active and we hover over various elements on the page, it highlights the relevant HTML code that governs them. For instance, we can quickly see that the class of each offer “chunk” is voucher-info-wrapper.

Bearing this in mind, we can use a selector in Scrapy to grab all the elements whose class is voucher-info-wrapper. The two methods below (CSS and XPath) lead to the same output: a SelectorList object. Depending on the task at hand, you’ll be able to use either CSS or XPath to extract the same information - the choice is yours, although in some cases one version will be more direct than the other:
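For example, assuming the sel object created above:

```python
# CSS: match any element whose class attribute contains "voucher-info-wrapper"
offers_css = sel.css(".voucher-info-wrapper")

# XPath: match the same elements, anywhere in the document
offers_xpath = sel.xpath('//*[contains(@class, "voucher-info-wrapper")]')

print(type(offers_css))  # <class 'scrapy.selector.unified.SelectorList'>
```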
If you are wondering, the XPath notation //* here signifies that we are looking for our given class anywhere within the document. We can then explore the extracted elements using the get() or getall() methods:
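For instance:

```python
# getall() returns a list with the raw HTML of every match...
all_offers = sel.css(".voucher-info-wrapper").getall()
print(len(all_offers))  # number of offers currently on the page

# ...while get() returns only the first match (or None if there is none)
first_offer = sel.css(".voucher-info-wrapper").get()
print(first_offer[:200])  # a peek at the first offer's HTML
```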
Selectors can also take advantage of element characteristics other than class: for instance an element’s id, or its location relative to the rest of the HTML document. Here are some pairs of examples:
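The ids and tags below are purely illustrative, not taken from the LookFantastic page; each CSS expression is paired with its XPath equivalent:

```python
# By id ("footer" here is a hypothetical id, for illustration only)
sel.css("#footer")
sel.xpath('//*[@id="footer"]')

# By position relative to other elements: the first <h2> inside each <div>
sel.css("div > h2:first-of-type")
sel.xpath("//div/h2[1]")

# By attribute value: all links whose href starts with "https"
sel.css('a[href^="https"]')
sel.xpath('//a[starts-with(@href, "https")]')
```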
Systematically extracting info to serve up as a pandas DataFrame
So now that we have a rough idea of how to use XPath and CSS selectors in Scrapy, we can target particular pieces of information from each offer. For (almost) any offer, we’ll observe:
- a title
- a main offer message/text
- a type of offer (discount, nth product free, etc.)
- an end date
- a URL
So let’s start picking these off one by one:
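Here is one way this could look; the inner class names (voucher-title and so on) are placeholders to verify against the Inspector rather than guaranteed matches for the live page:

```python
rows = []
for offer in sel.css(".voucher-info-wrapper"):
    rows.append({
        # the class names below are illustrative; verify them in the Inspector
        "title":    offer.css(".voucher-title::text").get(default="").strip(),
        "message":  offer.css(".voucher-message::text").get(default="").strip(),
        "type":     offer.css(".voucher-type::text").get(default="").strip(),
        "end_date": offer.css(".voucher-end-date::text").get(default="").strip(),
        "url":      offer.css("a::attr(href)").get(default=""),
    })

# one row per offer, one column per field
offers_df = pd.DataFrame(rows)
offers_df.head()
```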
Final output
And voilà: we have scraped all the current offers on the LookFantastic site and organised them into a tidy DataFrame, with one row per offer and a column for each of the fields above.

This is just a start, and I plan to add at least one more (more complex) example. Until then, this should hopefully illustrate some of the basic things you can achieve with Scrapy.