Thursday, April 6, 2017

A Homebrew ergast F1 Data Science Environment

As regular fans of F1 know, the data and computing requirements of Formula One now play a major part in team operations. I'm not sure what technology stacks the teams actually use, but as data junkie fans, can we turn our own data wrangling activities into opportunities to help us keep up to speed with innovations in the computing world?

In this post, I'll describe how we can make use of one of the popular approaches to lightweight virtual computing - Docker containers - to run a couple of linked applications - a database, and a data analysis environment - in a platform-independent way.

So let's get started...

One of the most useful sources of openly licensed F1 results and timing data I know of is the ergast motor racing database. As well as making a JSON API available, the data is also regularly released as a MySQL 5.1 database dump. So what's an easy way of working with it?

One of my preferred ways is to set up a linked container environment using Docker containers. Docker containers are like lightweight virtual machines that carry just enough of their own operating system to run the application inside them. (Typically, each Docker container runs just a single application.)

In Windows and Mac environments, Docker containers run inside a virtual machine on the host computer. Docker also runs on a wide variety of Linux operating systems and cloud based stacks such as Microsoft Azure and Amazon Web Services (AWS).

One of the easiest ways to get started with using Docker on your own computer is via the Kitematic graphical user interface to Docker, installed via the legacy Docker Toolbox, or via the Docker menu for more recent versions of Docker.


If you download and install Kitematic, you're presented with an interface that looks a bit like an app store:


Each item displayed is a Docker image that bundles a particular application. You can download an image and then launch an instance of that image as a container. The application - or service - will then run in a Docker virtual machine. When you're finished using it, you can hibernate the container (along with any changed state), or destroy it and fire up a virgin instance of the application in a new container next time you want to run it. You can also mount a linked volume to persist state. This lets you run an application in one container, save the state in the linked volume, destroy the container, then fire up a new one and link it to the data contained in the linked volume.
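
As a minimal command-line sketch of what that looks like (the image name, container name and paths here are purely illustrative):

#Run a container in the background from some image, mounting a directory
#on the host onto a directory inside the container to persist state
docker run -d --name mycontainer -v /path/on/host:/path/in/container myimage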

As well as running independent applications, we can also link two or more applications running in separate containers together using Docker Compose. This is how I tend to run data analysis applications such as Jupyter notebooks or RStudio, in association with a database that contains the data I'm working with.

So let's see how it works.

To begin with, let's create our own Docker image, building on top of one that already exists. In particular, I'm going to make use of the official MySQL image. I can create a so-called Dockerfile that specifies how to build a particular image (Dockerfile is the name of the file).


#Dockerfile

#Build this container based on the pre-existing official MySQL Docker image
FROM mysql

#The official image listens for requests on localhost, IP address 127.0.0.1
#We're going to want it to be a bit more promiscuous
#and listen out for requests coming from *any* IP address (0.0.0.0)
#sed is a Linux command-line text editor - it updates the IP address in the MySQL config file
RUN sed -i -e "s/^bind-address\s*=\s*127.0.0.1/bind-address=0.0.0.0/" /etc/mysql/my.cnf

#Install a command line utility called wget
RUN apt-get update && apt-get install -y wget && apt-get clean

#Use wget to download a copy of the MySQL database dump from ergast.com
RUN wget http://ergast.com/downloads/f1db.sql.gz -P /docker-entrypoint-initdb.d

#Unzip the download
#The original MySQL container is configured to install items
#in the /docker-entrypoint-initdb.d/ directory when it starts up
RUN gunzip /docker-entrypoint-initdb.d/f1db.sql.gz


So that's a Dockerfile - a recipe for building a Docker image, either from a base operating system layer (typically a Linux variant) or from a pre-existing image. We can use this Dockerfile to build our own image - or we can orchestrate its build along with the build of other containers using Docker Compose.
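
For example, assuming the Dockerfile above is saved into an ergastdb/ directory, one way to build and run the image directly from the Docker command line (rather than via Docker Compose) would look something like this - the image and container names are just illustrative, and the environment variables are the ones we set in the composition below:

#Build an image called ergastdb from the Dockerfile in the ergastdb/ directory
docker build -t ergastdb ergastdb/

#Run a container from that image, passing in the settings the MySQL image expects
docker run -d --name ergastdb -e MYSQL_ROOT_PASSWORD=f1 -e MYSQL_DATABASE=ergastdb ergastdb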

So let's see how that works. If we create a file called docker-compose.yaml, we can use it to define a composition that connects together several different containers.

ergastdb:
  container_name: ergastdb
  build: ergastdb/
  environment:
    MYSQL_ROOT_PASSWORD: f1
    MYSQL_DATABASE: ergastdb
  expose:
    - "3306"
    
f1dj_rstudio:
  image: rocker/tidyverse 
  ports:
    - "8787:8787"
  links:
    - ergastdb:ergastdb
  volumes:
    - ./rstudio:/home/rstudio


Here, I'm defining two containers - ergastdb and f1dj_rstudio. The ergastdb container will build an image based on the Dockerfile shown above, placed in the ergastdb/ directory relative to the directory containing the docker-compose.yaml file. The container will run with a couple of settings - the root password will be set to f1, and when the MySQL DBMS starts up it will create the ergastdb database and switch to it. It will also expose port 3306 - the default MySQL port - so that it can be accessed by other linked containers (though the port is not published to the host). (Recall also that in the Dockerfile we put the ergast database export in the startup directory; when the container starts up, it will be seeded with the ergast data in the ergastdb database.)

The other container, f1dj_rstudio, is based on the rocker/tidyverse image, which contains RStudio and a host of Hadleyverse/tidyverse R libraries preinstalled. This container publishes its port 8787 as public port 8787 on the host, and uses the alias ergastdb to link to the ergastdb container. The volumes setting mounts a directory on the host (the rstudio subdirectory relative to the directory we run docker-compose from) onto the /home/rstudio directory inside the container. This lets us both persist files on the host (even if the container is destroyed) and access them from inside the container.

We can set up both images at the same time from the Docker command line - reached via Kitematic. Change directory to the directory containing the docker-compose.yaml file (which should also contain the ergastdb folder, which itself contains the Dockerfile) and run the command:

docker-compose build --force-rm

This will build the images and throw away (remove, rm) any intermediate containers.

Now launch the containers running in detached (-d) mode - that is, in the background:

docker-compose up -d

The running containers should now appear in Kitematic.
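
As a quick sanity check that the ergast data really has been loaded, we can also run the MySQL command-line client inside the ergastdb container from the Docker command line, using the container name, password and database name set in the docker-compose.yaml file:

#List the tables in the ergastdb database inside the running ergastdb container
docker exec -it ergastdb mysql -u root -pf1 -e "SHOW TABLES" ergastdb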



If you select the RStudio container, you should see on the right hand side some administrative information for it, such as the local IP address on which we can access the RStudio application via a browser.
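
If you're running Docker via the legacy Docker Toolbox, you should also be able to look up that IP address from the Docker command line (assuming the Docker virtual machine has the default name):

#Show the IP address of the Docker Toolbox virtual machine
docker-machine ip default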



#Load the RMySQL library
library(RMySQL)

#Connect to the ergastdb database running in the linked ergastdb container
con = dbConnect(MySQL(), user='root', password='f1', host='ergastdb', port=3306, dbname='ergastdb')

#List the tables loaded in from the ergast database dump
dbListTables(con)

If we now load the RMySQL library (installed as part of the tidyverse image), we can connect to the ergastdb database running inside the aliased ergastdb container, using the password we defined in the docker-compose.yaml file. Any files we create will be saved into the mounted rstudio directory on the host.
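
From there, it's just a case of running queries against the database - for example, a quick peek at the first few rows of one of the tables (the table names follow the ergast database schema):

#Preview the first few rows of the races table
dbGetQuery(con, "SELECT * FROM races LIMIT 5")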

Once you've finished with the applications, you can hibernate them using the command:

docker-compose pause

Reawaken them with:

docker-compose unpause

Similarly, we can shutdown and restart the containers using:

docker-compose stop
docker-compose start

To shut the containers down and remove them, use:

docker-compose down

One of the great advantages of Docker is that a wide variety of pre-built images are already available - which can be a boon when figuring out an installation script is otherwise problematic. A second advantage is that you can build on top of those existing images. A third: Docker Compose makes it easy to link multiple applications together, all in the privacy of a virtual machine (so you shouldn't interfere with, or be troubled by, any other applications running on the host). And a fourth: the ability to mount host directories onto directories inside a container means that transferring content into a container, and saving it out again to persist it, is a doddle.

So if you fancy getting started with some F1 data wrangling, but haven't known where to start when it comes to setting up a data-analysis environment, why not start with Docker? And if you want some ideas as to what to do next, why not check out the Wrangling F1 Data With R book (also available in print, on demand, from Lulu)?
