Building a tidy data pipeline: Covid19 Northern Ireland

One of the many scientific benefits to emerge since the start of the Coronavirus pandemic is the availability of large-scale, open-access data describing the dynamics of the virus at a range of scales. Governments, health organizations, and many others have produced data at an unparalleled rate to help us understand how the disease has rampaged through society.

Of course, the quality of this data varies greatly depending on the producer. Ideally, for those interested in data analytics/science, the data would be readily available in a clean and tidy format that allows analysis to begin quickly. These concerns are generally the domain of data engineers (which I, unfortunately, wouldn't count myself as), who build pipelines to collect, clean, and update the data on a regular basis to keep up with the latest figures.

Motivated by this, I have produced a framework for handling such data describing the pandemic in the case of Northern Ireland. The Department of Health NI produces a fantastic summary of the main metrics describing the disease, including cases, testing, hospitalizations, and deaths arising from Covid-19, on a daily-ish basis (more on this in a moment). Unfortunately, for those of us who want to access data as quickly and cleanly as possible, the data is stored in rather clunky .xlsx spreadsheets, with each tab representing a different summary.

So the aim of this post is to describe how one can produce a pipeline to

  1. Look for the latest spreadsheet on a daily basis.
  2. Extract the needed info from each of the sheets of the .xlsx file individually.
  3. Clean this data into a more usable format.
  4. Store the data in a public repository for others to use.

This is exactly the framework implemented in the covid19northernireland repo hosted on my GitHub. Let me now talk you through the main workings of the software and how the individual pieces of code fit together.

First let’s load the required packages in R.

library(tidyverse) # for data wrangling
library(janitor) # for cleaning dataframes
library(padr) # dealing with missing values
library(zoo) # time series handling
library(readxl) # reading xlsx workbooks
library(httr) # for downloading files over HTTP to temporary storage

First of all, the spreadsheets are stored at a URL with the following form:

https://www.health-ni.gov.uk/sites/default/files/publications/health/doh-dd-19-november-2021.xlsx

so we will need to provide the current date to build the URL. However, the files are only updated on weekdays, so we can use a little function to correct for that and then save the file locally for analysis, as follows:

most_recent_weekday <- function(date){
  # function to take a date input and return the most recent weekday
  if(weekdays(date) == "Saturday"){
    return(date - 1)
  }else if(weekdays(date) == "Sunday"){
    return(date - 2)
  }else{
    return(date)
  }
}

data_date <- format(most_recent_weekday(Sys.Date()), "%d%m%y")

fname <- paste0('https://www.health-ni.gov.uk/sites/default/files/publications/health/doh-dd-', data_date, '.xlsx')
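
As a quick sanity check of the helper function (the dates here are purely illustrative), a weekend date should roll back to the preceding Friday while a weekday passes through unchanged:

most_recent_weekday(as.Date("2021-11-20")) # a Saturday  -> "2021-11-19"
most_recent_weekday(as.Date("2021-11-17")) # a Wednesday -> "2021-11-17"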

# download the file and store a connection to it in xls_file
httr::GET(fname, write_disk(xls_file <- tempfile(fileext = ".xlsx")))
## Response [https://www.health-ni.gov.uk/sites/default/files/publications/health/doh-dd-280122.xlsx]
##   Date: 2022-01-30 21:33
##   Status: 200
##   Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
##   Size: 3.14 MB
## <ON DISK>  C:\Users\JOEYOB~1\AppData\Local\Temp\Rtmp2V42oQ\file4a6871c01034.xlsx
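
One caveat worth flagging: on public holidays the spreadsheet for the most recent weekday may not have been published yet. A small defensive check, not part of the original script and assuming the server responds with an error status in that case, is to inspect the response before reading anything from the file:

resp <- httr::GET(fname, write_disk(xls_file <- tempfile(fileext = ".xlsx"), overwrite = TRUE))
if (httr::http_error(resp)) {
  # stop early rather than trying to parse a workbook that isn't there
  stop("No spreadsheet found at ", fname, " - it may not have been published yet.")
}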

Now we have the latest version of the spreadsheet downloaded to a temporary file whose path is stored in the variable xls_file (thanks to the httr package). Generally we work with smaller tabular data as .csv files, but in this case it is an .xlsx workbook and as such it has sheets. Fortunately, the readxl package allows us to read directly from a given sheet by providing its name (we'll look at the Summary Tests sheet), as follows:

test_df <- read_excel(xls_file, sheet = "Summary Tests")
test_df
## # A tibble: 730 x 7
##    Sample_Date         `TOTAL TESTS` `INDIVIDUALS TESTED PO~ `ALL INDIVIDUALS T~
##    <dttm>                      <dbl>                   <dbl>               <dbl>
##  1 2020-01-05 00:00:00             3                       0                   0
##  2 2020-01-07 00:00:00             1                       0                   0
##  3 2020-01-11 00:00:00             1                       0                   0
##  4 2020-01-16 00:00:00             2                       0                   0
##  5 2020-01-18 00:00:00             2                       0                   0
##  6 2020-01-21 00:00:00             1                       0                   0
##  7 2020-01-22 00:00:00             1                       0                   0
##  8 2020-01-25 00:00:00             1                       0                   0
##  9 2020-01-27 00:00:00             1                       0                   0
## 10 2020-01-31 00:00:00             1                       0                   0
## # ... with 720 more rows, and 3 more variables:
## #   ROLLING 7 DAY POSITIVE TESTS <dbl>, ROLLING 7 DAY INDIVIDUALS TESTED <dbl>,
## #   POSITIVITY RATE PER 100K POP <dbl>
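
As an aside, if you want to see what the other tabs of the workbook are called before deciding what to pull, readxl can list them for you (the exact sheet names will of course depend on the version of the workbook):

excel_sheets(xls_file) # list all sheet names in the downloaded workbook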

Now if we take a look through this data we will note that its hygiene (no offence…) isn't the best. First of all, the column names have spaces, and there is nothing a data scientist dislikes more than spaces in variable names (perhaps a slight exaggeration, but then again maybe not). As such, the first thing we will do is use the janitor::clean_names() function, which is fantastic at repairing these troublesome naming conventions. Second, we don't want all the columns in this instance, so we can use dplyr::select() to take what we need. Furthermore, the date column Sample_Date isn't consistent in that some days aren't present; this is quickly fixed with the padr::pad() function which, as the name suggests, pads out the missing days. Lastly, there are some columns where days with no values (cases, deaths, …) are represented by NA, which we don't really want as there is information in these values, so that can be resolved using tools from dplyr as well. Altogether we have gone through a whirlwind of data cleaning to arrive at a much tidier and user-friendly data frame, as seen below:

test_df %>%
  clean_names() %>%                    # tidy up the column names
  select(date = sample_date,           # keep and rename only the columns we need
         cases = individuals_tested_positive,
         tests = all_individuals_tested,
         cases_per_100k = positivity_rate_per_100k_pop) %>%
  pad('day') %>%                       # pad out any missing dates
  mutate(across(where(is.numeric),     # treat missing counts as zero
                ~ ifelse(is.na(.x), 0, .x))) %>%
  filter(date >= '2020-03-05')
## # A tibble: 694 x 4
##    date                cases tests cases_per_100k
##    <dttm>              <dbl> <dbl>          <dbl>
##  1 2020-03-05 00:00:00     2    18          0.106
##  2 2020-03-06 00:00:00     2    14          0.106
##  3 2020-03-07 00:00:00     5    14          0.264
##  4 2020-03-08 00:00:00     3    18          0.158
##  5 2020-03-09 00:00:00     3    35          0.158
##  6 2020-03-10 00:00:00     5    92          0.264
##  7 2020-03-11 00:00:00    11    91          0.581
##  8 2020-03-12 00:00:00     4    72          0.211
##  9 2020-03-13 00:00:00    11    78          0.581
## 10 2020-03-14 00:00:00    10    62          0.528
## # ... with 684 more rows

Now, there is much more data in the file that we could pull, but this would turn into a rather long post if I described every one of the steps, so please do check out the entire code here if you are interested.

The main point, however, is that from this collection and cleaning exercise we create four separate tidy .csv files (found here) which are much easier to analyse in a data science setting.
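
For completeness, the last step in the script is simply writing each cleaned data frame out with readr's write_csv(). A minimal sketch of that step, assuming the tidied testing data above has been assigned to tidy_tests (the object and file names here are illustrative; the actual names in the repo may differ):

# assumes the cleaned pipeline above was assigned to tidy_tests;
# data/ is the folder committed by the workflow below
write_csv(tidy_tests, "data/ni_tests.csv")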

Of course, we want these files to update each time the corresponding spreadsheet does, and no, I do not want to remember to run the script each day. The best option is therefore to set up a pipeline that runs the script automatically on a virtual machine each day. This is done with a GitHub Actions workflow, defined in a .yaml file, which runs my script in the cloud each night at 10pm before committing the updated files to the aforementioned repository for use by others. The workflow looks like this:

name: covid_northernireland_data_update
# Controls when the action will run.
on:
  schedule:
    - cron:  '00 22 * * *'
jobs:
  CovidDataScrape:
    # The type of runner that the job will run on
    runs-on: windows-latest
    # Load repo and install R
    steps:
    - uses: actions/checkout@master
    - uses: r-lib/actions/setup-r@master
    # Install the required R packages
    - name: Install Packages
      run: |
        install.packages('tidyverse')
        install.packages('janitor')
        install.packages('padr')
        install.packages('zoo')
        install.packages('readxl')
        install.packages('httr')
      shell: Rscript {0}
    # Run R script
    - name: Get Data
      run: Rscript code/01_data_update.R
    # Add new files in data folder, commit along with other modified files, push
    - name: Commit Files
      run: |
        git config --local user.name 'obrienjoey'
        git config --local user.email "mr.joeyob@gmail.com"
        git add data/*
        git commit -am "update NI Covid19 data $(date)"
        git push origin main
      env:
        REPO_KEY: ${{secrets.GITHUB_TOKEN}}
        username: obrienjoey

So there we have it: a fully fledged workflow to collect, clean, and store the required data in a systematic and tidy manner. I hope this discussion proves useful when you are creating your own data pipelines (and makes us all appreciate the godly work of data engineers a little more!), allowing clean and automatic data collection for whatever project comes your way!
