Thursday, January 29, 2015

Calculating Track Position from Laptime Data

When all the cars in a race are on the same lap, the track position of each car and their race positions are all in sequence. However, as cars start to get lapped, the order in which the cars cross the start/finish line (the track position) may bear little, if any, resemblance to their race positions.

So how can we capture the track position of each car - that is, the order in which they cross the start/finish?

The timing sheets published via the FIA website include a Race History Chart that tabulates the order in which cars pass the start/finish line relative to the laps completed by the current leader of the race. As the example below shows, if the leader laps a car on any given lead lap, the passed car does not have a time recorded for the previous leader lap because it did not complete that lap.


Unfortunately, the FIA don't release the timing sheets as data, preferring instead to use immutable PDF documents. (That doesn't mean we can't scrape the data of course...)

So how might we generate the track position given data we do have ready access to? The ergast database, for example, publishes lap time information - so can we use that to recreate track positions? Indeed we can...

One observation we might make is that a race track is a closed circuit; a second is that the accumulated race time to date is measured on the same clock for each driver, given that they all start the race at the same time. (The race clock is not started as each driver passes the start/finish line - the race clock starts when the lights go green. To this extent, drivers lower placed on the grid suffer a positional time penalty compared to cars further up the grid. This effective time penalty corresponds to the time it takes a lower placed car to physically get as far up the track as the cars in the higher placed grid positions.)


If we get hold of all of the lap time data for a particular race, with laptimes described in a milliseconds column, we can find the track position of a car in the following way.

First, identify which leader’s lap each driver is on and then use this as the basis for deciding whether a car is on the same lap - or a different one - compared with any car immediately ahead or behind on track. One way of doing this is on the basis of accumulated race time. If we order the drivers by accumulated race time, and flag whether or not a particular driver is the leader on a particular lap, we can count the accumulated number of “lap leader” flags to give us the current lead lap count irrespective of how many laps a given driver has completed.

library(plyr)

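#For each driver, calculate their accumulated race time at the end of each lap
#cumsum() works down the rows, so make sure each driver's laps are in lap order first
lapTimes=arrange(lapTimes,driverId,lap)
lapTimes=ddply(lapTimes, .(driverId), transform,
               acctime=cumsum(milliseconds))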

#Order the rows by accumulated lap time
lapTimes=arrange(lapTimes,acctime)
#This ordering need not necessarily respect the ordering by lap.

#Flag the leader of a given lap - this will be the first row in new leader lap block
lapTimes$leadlap= (lapTimes$position==1)
head(lapTimes[lapTimes$position<=3,c('driverRef','leadlap')],n=5)

This gives a result of the form:

##             driverRef leadlap
## 1              button    TRUE
## 2            hamilton   FALSE
## 3  michael_schumacher   FALSE
## 22             button    TRUE
## 23           hamilton   FALSE


A Boolean TRUE value has numeric value 1, a Boolean FALSE numeric value 0.
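As a quick illustration of why that matters here, the following sketch uses a made-up vector of flags (not taken from the race data) to show how a running sum of TRUE/FALSE values turns "leader crossed the line" markers into a lap counter:

```r
# Made-up leader flags for five consecutive line crossings
flags <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

# Logical values are coerced to 1/0 when used arithmetically
as.numeric(flags)

# The running total increases by one each time a TRUE is met,
# so every row is labelled with the number of leader crossings so far
leadlap <- cumsum(flags)
leadlap
```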

#Calculate a rolling count of leader lap flags.
#Recall that the cars are ordered by accumulated race time.
#The accumulated count of leader flags is the lead lap number each driver is on.
lapTimes$leadlap=cumsum(lapTimes$leadlap)
head(lapTimes[lapTimes$position<=3,c('driverRef','leadlap')],n=6)

So when we count the flags, we get something like this:

##             driverRef leadlap
## 1              button       1
## 2            hamilton       1
## 3  michael_schumacher       1
## 22             button       2
## 23           hamilton       2
## 24 michael_schumacher       2

Let’s now calculate the track position for a given lead lap, where the leader in a given lap is in both race position and track position 1, the second car through the start/finish line is in track position 2 (irrespective of their race position), and so on. (In your mind’s eye, you might imagine the cars passing the finish line to complete each lap, first the race leader, then either the car in second or a lapped backmarker, and so on.) Specifically, we group by leadlap and then accumulated race time within that lap, and assign track positions in incremental order.

lapTimes=arrange(lapTimes,leadlap,acctime)
lapTimes=ddply(lapTimes,.(leadlap),transform,
               trackpos=1:length(position))
lapTimes[lapTimes$leadlap==33,c('code','lap','position','acctime','leadlap','trackpos')]

We now have track - as well as race - positions:

##     code lap position acctime leadlap trackpos
## 616  BUT  33        1 3100735      33        1
## 617  HAM  33        2 3111538      33        2
## 618  VET  33        3 3113745      33        3
## 619  SEN  32       16 3115035      33        4
## 620  RIC  32       17 3115829      33        5
## 621  ALO  33        4 3125951      33        6
## 622  WEB  33        5 3131009      33        7
## 623  MAL  33        6 3133006      33        8
## 624  RAI  33        7 3141269      33        9
## 625  KOB  33        8 3147051      33       10
## 626  GLO  32       18 3150703      33       11
## 627  PER  33        9 3153818      33       12
## 628  ROS  33       10 3159053      33       13
## 629  VER  33       11 3162088      33       14
## 630  DIR  33       12 3172712      33       15
## 631  MAS  33       13 3177681      33       16
## 632  PET  33       14 3184974      33       17
## 633  PIC  32       19 3186685      33       18
## 634  KOV  33       15 3188375      33       19

In this example, we see Timo Glock (GLO) has only completed 32 laps compared to 33 for the race leader and the majority of the field. On track, he is placed between Kobayashi (KOB) and Perez (PER).
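To check the logic end to end without needing the ergast data, here is a minimal sketch of the same steps on a toy race. The three drivers (A, B, C) and their lap times are invented purely for illustration, and I've used base R's ave() rather than plyr so the fragment runs on its own; the race position column is also derived here, whereas the ergast lap time data already provides it.

```r
# Invented lap times (ms): A leads, B follows closely, C is slow enough to be lapped
lapTimes <- data.frame(
  driverId = c(rep("A", 5), rep("B", 5), rep("C", 3)),
  lap = c(1:5, 1:5, 1:3),
  milliseconds = c(rep(100, 5), rep(105, 5), rep(160, 3))
)

# Accumulated race time for each driver (rows are already in lap order per driver)
lapTimes$acctime <- ave(lapTimes$milliseconds, lapTimes$driverId, FUN = cumsum)

# Race position at the end of each lap: rank by accumulated time within the lap
lapTimes$position <- ave(lapTimes$acctime, lapTimes$lap, FUN = rank)

# Order every line crossing by race clock, then count leader crossings
lapTimes <- lapTimes[order(lapTimes$acctime), ]
lapTimes$leadlap <- cumsum(lapTimes$position == 1)

# Track position is the crossing order within each lead lap
lapTimes$trackpos <- ave(lapTimes$acctime, lapTimes$leadlap, FUN = seq_along)

lapTimes[lapTimes$leadlap == 3, c("driverId", "lap", "position", "leadlap", "trackpos")]
```

In this toy race, driver C has no row at all on lead lap 2, and turns up third on track on lead lap 3 having only completed two laps, which is the same pattern as the lapped cars in the table above.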

This code will form part of the Wrangling F1 Data With R book, initially in a forthcoming chapter that revisits an old idea: battle charts.





Saturday, January 17, 2015

Seasonal Churn

With just a couple of weeks to go until testing begins for the 2015 F1 season, I thought I'd have a quick look at how the championship standings have churned over the years (eg Calculating Churn in Seasonal Leagues).



The churn value shows how different the league standings were from those of the previous year: a low churn (or adjusted churn - a value normalised to the range 0..1) suggests little change in the standings, which may be indicative of low levels of competitive change on a year on year basis.
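For a rough idea of how such a measure can be calculated, here is a minimal sketch. It assumes churn is the mean absolute change in rank between two consecutive seasons, and that adjusted churn divides this by the largest value possible for a league of that size (reached when the standings are completely reversed). The function names and example standings are my own inventions for illustration, and the sketch assumes the same competitors appear in both seasons.

```r
# Churn: mean absolute change in rank from one season to the next (assumed definition)
churn <- function(prev, curr) mean(abs(curr - prev))

# Largest possible churn for n competitors: a complete reversal of the standings
# gives a total absolute rank change of floor(n^2 / 2)
max_churn <- function(n) floor(n^2 / 2) / n

# Adjusted churn: churn scaled to the range 0..1
adjusted_churn <- function(prev, curr) churn(prev, curr) / max_churn(length(prev))

prev <- 1:6                       # invented standings for last season
unchanged <- 1:6                  # nobody moves
reversed <- 6:1                   # the table is turned upside down
shuffled <- c(2, 1, 3, 4, 6, 5)   # a couple of neighbouring swaps

adjusted_churn(prev, unchanged)
adjusted_churn(prev, reversed)
adjusted_churn(prev, shuffled)
```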

For more, see the Keeping an Eye on Competitiveness - Tracking Churn chapter [initial version here] of Wrangling F1 Data With R.

Friday, January 9, 2015

F1 Datajunkie Experimental Virtual Machine

To try to make playing along with the Wrangling F1 Data With R book a little easier, I've posted a simple Docker file for launching a containerised virtual machine that runs RStudio and contains a local copy of a SQLite3 version of the ergast database (to 2013), a scrape of some of the results data from the F1 results website, and some sample R files: Wrangling F1 Data - docker.

The virtual machine allows you to work with the data in a self-contained virtual machine running on your own machine. (I will explore how to run it via a cloud service when I get a chance...) The RStudio application that runs in the container can be accessed via your browser.



I've also started exploring the idea of a code bundle alongside the book that will contain additional software files. The Github repository linked above will probably lag the book extras package, but will be added to over time.

Tuesday, December 16, 2014

Spotting Contested Positions in F1 Races Using Graph Theory

In the latest update to the Wrangling F1 Data With R book, I posted a recipe describing how to automatically identify the positions being contested in a race (which could equally be the championship race) by virtue of positions that had changed hands lap on lap.

The method comes via an answer to a question posted on Stack Overflow about how to spot disjoint sets of grouped items in a list. The trick is to construct a graph in which edges are placed between elements in each subset, and then clusters identified from the whole set of items.

So for example, in this fragment of a lap chart showing race positions going from one lap to another, we see several position changes:


Many of the drivers do not change position at all, but there are position changes between four distinct groups of drivers: those in 1st and 2nd; those in 4th, 5th and 6th; those in 9th and 10th; and those in 17th, 18th and 19th.

If we connect nodes in a graph for each driver going from the position they held in the previous lap to the position they hold in the current lap (and ignore drivers that didn't change position), we get the following groupings:


Notice how the nodes - representing positions - are connected to each other by arrows, showing how a car placed in one position moved to another position. So for example, we see that the cars in positions 9 and 10 changed place with each other, as did those in positions 1 and 2. The car in 19th went to 18th, the one in 18th to 17th, and the one in 17th fell back to 19th. And so on.
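The grouping idea can be sketched in a few lines of base R. This is not the code from the book chapter, just a minimal illustration of finding connected groups: each position change is treated as an edge, and group labels are merged along edges until nothing changes. The moves among 1st/2nd, 9th/10th and 17th-19th follow the description above; the exact moves among 4th, 5th and 6th are not spelled out, so the ones used here are made up, as is the function name.

```r
# Position changes as (previous position -> current position) edges
moves <- data.frame(
  from = c(1, 2, 4, 5, 6, 9, 10, 17, 18, 19),
  to   = c(2, 1, 5, 6, 4, 10, 9, 19, 17, 18)   # 4/5/6 moves are invented
)

# Hypothetical helper: group positions that are linked by any chain of changes
contested_groups <- function(from, to) {
  nodes <- sort(unique(c(from, to)))
  label <- setNames(nodes, nodes)   # every position starts in its own group
  repeat {
    changed <- FALSE
    for (i in seq_along(from)) {
      a <- as.character(from[i])
      b <- as.character(to[i])
      m <- min(label[a], label[b])
      # Pull both ends of the edge into the group with the smaller label
      if (label[a] != m || label[b] != m) {
        label[c(a, b)] <- m
        changed <- TRUE
      }
    }
    if (!changed) break
  }
  unname(split(nodes, label))
}

groups <- contested_groups(moves$from, moves$to)
groups
```

Drivers who held position never appear in the edge list, so they drop out of the result automatically.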

The chapter containing the code for constructing the graph and partitioning it into separate clusters can currently be found as part of the preview for the Wrangling F1 Data With R book... but I'm not sure how long it will remain so...

See also: OUseful.info - Identifying Position Change Groupings in Rank Ordered Lists

Saturday, December 13, 2014

Career Comparison - Championship Position vs Age - Jenson Button and Fernando Alonso

This week finally saw the announcement of Alonso's move to McLaren and the retention of Jenson Button, so with the driver line up sorted there, how do these drivers compare in terms of their F1 careers?

The following diagrams plot each driver's season standings for each year they've spent in F1 up to the end of the 2013 season against age (Button in blue, Alonso in red), along with the team they were driving for at the time.

The best fit lines represent linear, quadratic and cubic performance models respectively, of the form pos ~ I(age-30) +I( (age-30)^2 ) + I( (age-30)^3 ).



The confidence limits around each line show how variable Button's career has been compared to Alonso's more consistent career trajectory.
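For anyone wanting to try models of this form, here is a minimal sketch using lm(). The ages and championship positions below are invented for illustration only and are not either driver's actual record; the model formulas follow the form quoted above.

```r
# Invented career: championship position by age (NOT real data for any driver)
career <- data.frame(
  age = 20:33,
  pos = c(18, 15, 12, 9, 7, 5, 3, 2, 2, 3, 4, 5, 7, 8)
)

# Linear, quadratic and cubic models, centred on age 30
m1 <- lm(pos ~ I(age - 30), data = career)
m2 <- lm(pos ~ I(age - 30) + I((age - 30)^2), data = career)
m3 <- lm(pos ~ I(age - 30) + I((age - 30)^2) + I((age - 30)^3), data = career)

# Compare how much of the variation each model accounts for
r2 <- sapply(list(linear = m1, quadratic = m2, cubic = m3),
             function(m) summary(m)$r.squared)
r2

# Confidence limits around the fitted line at each age
bands <- predict(m3, newdata = career, interval = "confidence")
head(bands)
```

The width of the interval between the lwr and upr columns is what shows up as the shaded band around a best fit line: the more scattered the season standings, the wider the band.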

These charts were generated using code described in the "Career Trajectory" chapter of Wrangling F1 Data With R.

Tuesday, December 2, 2014

Position Change Charts

Inspired by an old Joe Saward post on lap charts I had a quick doodle around the notion of position change charts that plot the names of drivers against laps on just the laps where their position at the end of the lap was different to the position on the previous lap.

This chart shows just the position changes for each driver over the course of the race; the leftmost labels correspond to grid positions. The trick to reading this chart is to look left from a driver label to the previous occurrence of the same label: this gives the position from which the change took place. The intervening gap is the length of time that driver held the position to the left. Emphasising pit stop laps through the use of italics, for example, would add further richness to this chart.
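The data preparation behind a chart like this amounts to keeping only those laps where a driver's position differs from the lap before. Here is a minimal sketch of that filtering step, using two invented drivers and made-up positions, with the grid position standing in as the position "before" lap 1; it is not the recipe from the book.

```r
# Invented end-of-lap positions for two hypothetical drivers
laps <- data.frame(
  driverId = rep(c("A", "B"), each = 5),
  lap = rep(1:5, times = 2),
  position = c(1, 1, 2, 2, 1,
               2, 2, 1, 1, 2)
)

# Invented grid positions, treated as the position held before lap 1
grid <- c(A = 2, B = 1)

# Make sure each driver's laps are in order, then look up the previous lap's position
laps <- laps[order(laps$driverId, laps$lap), ]
prev <- ave(laps$position, laps$driverId,
            FUN = function(p) c(NA, head(p, -1)))

# For each driver's first lap, compare against their grid position instead
first <- is.na(prev)
prev[first] <- grid[as.character(laps$driverId[first])]

# Keep only the laps on which the position changed
changes <- laps[laps$position != prev, ]
changes
```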

For a complete description of how to generate this chart using data from the ergast API, see the Wrangling F1 Data With R book.

Sunday, November 23, 2014

Maximising Team Points Hauls

With the final race of the 2014 season run, the Mercedes drivers' battle for the Drivers' Championship over, and the future of McLaren's drivers still uncertain, now may be a good time to ask how well the drivers supported each other in terms of maximising team points haul.

Let's start with Mercedes. The following charts show how the drivers fared in terms of ranked position in each race, and points taken. The coloured drop line identifies which driver had the upper hand and also clearly indicates how far apart the drivers were.





In terms of points, the team's points haul across the rounds of the 2014 championship can be summarised using the following chart (final race points have been halved for the purposes of this chart):


The horizontal x-axis shows the number of points taken in a particular race by the highest placed driver in the team. The vertical y-axis shows the number of points taken in the corresponding race by the lower placed team-mate. The red line is the points maximisation line - points on the line show that the team maximised points in a race given the position of the highest placed driver in the team.

The numbers represent a count of races where a particular points combination occurred. The size of each circle is proportional to this value.
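One way to think about the points maximisation line is that, given where the higher placed driver finished, the best the team-mate can do is finish in the very next position. The following sketch works that through using the standard 25-18-15-12-10-8-6-4-2-1 points scale (so, as with the chart, ignoring the double points on offer at the final race). The function names and the three example results are made up for illustration.

```r
# Standard points for positions 1 to 10; nothing for 11th and below
points <- c(25, 18, 15, 12, 10, 8, 6, 4, 2, 1)
points_for <- function(pos) ifelse(pos <= 10, points[pmin(pos, 10)], 0)

# Assumption: the most the team-mate could have scored is the points on offer
# for the position immediately behind the higher placed driver
max_team_mate_points <- function(best_pos) points_for(best_pos + 1)

# Invented results: finishing positions of a team's two drivers in three races
race <- data.frame(best_pos = c(1, 1, 3), other_pos = c(2, 4, 12))
race$best_pts <- points_for(race$best_pos)
race$other_pts <- points_for(race$other_pos)

# How far below the maximisation line each race falls
race$shortfall <- max_team_mate_points(race$best_pos) - race$other_pts
race
```

A shortfall of zero corresponds to a mark sitting on the red line; the larger the shortfall, the further below the line the mark falls.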

If we split the drivers out and generate co-ordinate points based on the points taken across the driver pairing for each race, we get the following style of chart.


This time, we have two guides representing the points support each driver offers the other. Marks away from the dotted line show how far away a driver was from maximising the team points haul based on the points taken by the higher placed driver in the team. If there are lots of marks in the lower right half of the chart, the driver on the vertical y-axis is the underperformer. If the marks appear in the top left half, the driver identified on the horizontal x-axis is the underperformer. Marks on the red dotted line show the x-axis driver was better placed, but team points were maximised. Conversely, marks on the blue dotted line show the y-axis driver was higher placed, but again, given that position, team points were maximised. If the team always maximised points, the magenta best fit line would be within the two dotted lines.

Here are the corresponding charts for McLaren.









These charts are working sketches and are likely to appear in some form in the Wrangling F1 Data With R book. Data used to generate the charts was obtained from the ergast API.