Getting stuck in with Scrapy: A beginner's guide to web scraping in Python


I’d long been curious about web scraping and am pleased to have finally made a start in this direction. Previously, any scraping job I needed was carried out by other means, but now I’ve branched out to Scrapy. I’d also wanted to practise my Python, so this was a great opportunity to kill two birds with one stone. Here I’ll share my first foray into this area; it may be useful for others who are also starting out (as well as for my future self, as a reminder).

Using Generalised Additive Mixed Models (GAMMs) to predict visitors to Edinburgh and Craigmillar Castles

If you attended my talk on “Generalised Additive Models applied to tourism data” at the Newcastle Upon Tyne Data Science Meetup in May 2019, please find my (more detailed) slides below. I’d been curious about generalised additive (mixed) models for some time, and the opportunity to learn more about them finally presented itself when a new project came my way, as part of my work at The Data Lab.

Four tips for creating interactive visualisations with Shiny

I recently presented a toy Shiny app at the Edinburgh Data Visualization Meetup to demonstrate how Shiny can be used to explore data interactively. In my code-assisted walkthrough, I began by discussing the data used: a set of records detailing customer purchases made on Black Friday (i.e., each customer was given a unique ID, which was repeated in long format in the case of multiple purchases). Both the customers and the items purchased are described along various dimensions (e.

Exploring transport routes, journey characteristics and postcode networks using R Shiny

Leaflet map (blurred): the thicker/‘redder’ the route, the more travelled it is. As part of The Data Lab, I worked on a project for visualising the traffic flow within a subsidised transport service, operated by a Scottish council. This visualisation needed to display variations in traffic flow conditional on factors such as the time of day, day of the week, journey purpose, as well as other criteria. The overall aim here was to explore and identify areas of particular activity, as well as provide some insight into how this transport service might be improved.

Using Shiny for interactive displays of health data: The Scottish Burden of Diseases

The Accelerator programme run by The Data Lab from 19 April 2018 to 6 September 2018 was a Scottish Government collaborative project, open to employees of the Scottish Government, the Information Services Division, the National Records of Scotland and Registers of Scotland. Employees applying to take part had a background in statistics, economics, operational research or social research, and sought to improve their data skills across a variety of areas.

Linking Google Analytics data to website changes or GitHub commits

If you’ve wondered how page views may vary in response to website changes, you’ve come to the right place. Setting up your website around a GitHub repo (see options: Netlify + Hugo, and GitHub Pages + Jekyll) is a great way to ensure that this is a smooth process. The beauty of relying on GitHub to store your site is that you are creating an effortless log of site changes as you go, without having to devote attention to this as a separate process.

Using R remotely: some options and tips

Why would you need to do this? Say, for instance, you are dealing with sensitive data that should not leave a specific system. Or, quite simply, you are away on a work retreat with a laptop far less powerful than the work desktop you left behind, so you want to keep using the desktop from a distance. For such reasons, I’ve been looking into what options are available to log in remotely to a machine and run R there for some analysis.

Dealing with many dimensions in historical data: Tracking cooperation & conflict patterns over space and time in R

For this post, I’ve managed to find some extremely interesting historical event data offered by the Cline Center on this page. As you will see, this dataset can be quite challenging because of the sheer number of dimensions you could look at. With so many options, it becomes tricky to create visualisations with the ‘right’ level of granularity: not so high-level that any interesting patterns are obscured, but not too detailed and overcrowded either.

Data guidelines: A set of recommendations for clean and usable data


The extent to which a dataset follows a set of commonly expected guidelines will often determine how much time you have left to spend thinking about your analysis. Ideally, you might intend to spend 20% of your time cleaning the data for a project, and 80% planning and carrying out your actual analysis. But often, it might turn out to be the complete opposite. A messy, non-standardized dataset can end up taking up most of your time, so that when you finally bring it into a usable format, you realize you have to rush and finish up with your project.

LA maps of crime: Using R to map criminal activity in LA since 2010

I’ve recently come across a huge resource for open data. At the time of writing, there are close to 17,000 freely available datasets stored there, including this one offered by the LAPD. Interestingly, this dataset includes almost 1.6M records of criminal activity occurring in LA since 2010, all of them described according to a variety of measures (you can read about them here). Using information like the date and time of a crime, its location (longitude & latitude), and the type of crime committed (among other things), you can come up with some pretty interesting visualizations.