Saturday, February 28, 2015

Laps Led Counts

One of the summary statistics that often gets reported is the "laps led" count. This describes the total number of laps each driver led in a particular race, over a season, or over their career. It can also be expressed as a percentage, either of the number of laps they completed or of the total race distance in laps for the races they competed in.
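As a rough illustration of these counts, here is a minimal R sketch. It assumes a data frame with one row per driver per lap and a `position` column giving the race position at the end of that lap; the data and the column names `lapsLed`, `pctOfCompleted` and `pctOfRace` are made up for the example and are not taken from the book.

```r
# Toy lap data (made-up values): one row per driver per lap,
# with the race position held at the end of that lap.
laps <- data.frame(
  driverId = rep(c("A", "B", "C"), each = 4),
  lap      = rep(1:4, times = 3),
  position = c(1, 1, 2, 1,   # driver A
               2, 2, 1, 2,   # driver B
               3, 3, 3, 3)   # driver C
)

# Laps led: count the laps on which each driver was in position 1
lapsLed <- aggregate(position ~ driverId, data = laps,
                     FUN = function(p) sum(p == 1))
names(lapsLed)[2] <- "lapsLed"

# Laps led as a percentage of the laps each driver completed
# (both aggregates are sorted by driverId, so the rows line up)
lapsCompleted <- aggregate(lap ~ driverId, data = laps, FUN = length)
lapsLed$pctOfCompleted <- 100 * lapsLed$lapsLed / lapsCompleted$lap

# Laps led as a percentage of the race distance in laps
raceLaps <- max(laps$lap)
lapsLed$pctOfRace <- 100 * lapsLed$lapsLed / raceLaps

print(lapsLed)
```

Season or career totals would follow the same pattern, grouping by year as well as driver.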

The latest chapter in the Wrangling F1 Data With R book starts to look at various lead lap calculations along with several ways of visualising the results.

The following crude sketch shows lead lap counts as a percentage of race laps for various grid positions by circuit from the years 2011-2013:


There are obviously a few issues with this chart that make it difficult to read, but even so we get the impression that a circuit such as Yeongam (Korean Grand Prix) tends to produce races led from grid to finish by a single driver, whereas Silverstone, Suzuka and Catalunya perhaps make for a more uncertain spectacle. From the ordering of the circuits on the y-axis, it is not clear whether races become more or less uncertain over the course of each season.

As well as circuit based comparisons, we can look to see how the drivers compare, this time by reporting on the counts of the actual number of laps led per season from a particular grid position by year, rather than reporting this figure as a percentage of laps completed, for example.


Although not clearly labelled, the x-axis is actually presented as a log scale to help improve readability - the highest laps led counts tend to come from starting positions at the front of the grid (the chart could be further clarified in this respect by adding a dashed line to identify the front row of the grid). The angular rotation of the text labels also helps to reduce label overlap.

This chart shows how well Alonso in particular performed in 2011 and 2012 from starts behind the front row of the grid.

See a draft preview of the chapter, or find out more about the Wrangling F1 Data With R book. 

Tuesday, February 3, 2015

More Recommended Reading - Making Sense of Squiggly Lines

If you like trying to read line charts as stories, you'll love this book: Making Sense of Squiggly Lines [about].

Intended to support engineers or drivers trying to make sense of their own car data, this book guides you through how to read a variety of charts generated from sensor data traces.


Full review to follow...

Go on, just buy it...you know you want to, even if just to have a copy of a book with such a great title...:-)

...from the author (US/Canada, Australia)...
...from RaceTechMag (Europe/UK)...
... or if you must, on AmazonUK...

Saturday, January 31, 2015

Recommended Reading: F1metrics

I stumbled across a new-to-me blog yesterday that readers of this blog might find interesting: f1metrics - Mathematical and statistical insights into Formula 1, by Andrew Phillips. The site has several long form articles describing models for classifying rankings such as the best ever F1 driver, the best wet weather driver, as well as a novel ranking for drivers in the 2014 season.

The article on building a race simulator describes a process for constructing a simple race simulator that in turn looks like a fun project to try to replicate in R or Python (the blog doesn't currently seem to release any code alongside the posts).

Andrew has also published F1 relevant articles in the academic literature: Uncovering Formula One driver performances from 1950 to 2013 by adjusting for team and competition effects looks interesting, and something I'll have a go at replicating if I can get hold of a copy of the paper.

Thursday, January 29, 2015

Calculating Track Position from Laptime Data

When all the cars in a race are on the same lap, the track position of each car and their race positions are all in sequence. However, as cars start to get lapped, the order in which the cars cross the start/finish line (the track position) may bear little, if any, resemblance to their race positions.

So how can we capture the track position of each car - that is, the order in which they cross the start/finish?

The timing sheets published via the FIA website include a Race History Chart that tabulates the order in which cars pass the start/finish line relative to the laps completed by the current leader of the race. As the example below shows, if the leader laps a car on any given lead lap, the passed car does not have a time recorded for the previous leader lap because it did not complete that lap.


Unfortunately, the FIA don't release the timing sheets as data, preferring instead to use immutable PDF documents. (That doesn't mean we can't scrape the data of course...)

So how might we generate the track position given data we do have ready access to? The ergast database, for example, publishes lap time information - so can we use that to recreate track positions? Indeed we can...

We can make two observations. The first is that a race track is a closed circuit; the second is that the accumulated race time to date is measured from the same moment for each driver, given that they all start the race at the same time. (The race clock is not started as each driver passes the start/finish line - the race clock starts when the lights go green. To this extent, drivers placed lower on the grid serve a positional time penalty compared to cars further up the grid. This effective time penalty corresponds to the time it takes a lower placed car to physically get as far up the track as the cars in the higher placed grid positions.)


If we get hold of all of the lap time data for a particular race, with laptimes described in a milliseconds column, we can find the track position of a car in the following way.

First, identify which leader’s lap each driver is on and then use this as the basis for deciding whether a car is on the same lap - or a different one - compared with any car immediately ahead or behind on track. One way of doing this is on the basis of accumulated race time. If we order the drivers by the accumulated race time, and flag whether or not a particular driver is the leader on a particular lap, we can count the accumulated number of “lap leader” flags to give us the current lead lap count irrespective of how many laps a given driver has completed.

library(plyr)

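#Make sure each driver's laps are in lap order before accumulating
lapTimes=arrange(lapTimes,driverId,lap)
#For each driver, calculate their accumulated race time at the end of each lap
lapTimes=ddply(lapTimes, .(driverId), transform,
               acctime=cumsum(milliseconds))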

#Order the rows by accumulated lap time
lapTimes=arrange(lapTimes,acctime)
#This ordering need not necessarily respect the ordering by lap.

#Flag the leader of a given lap - this will be the first row in new leader lap block
lapTimes$leadlap= (lapTimes$position==1)
head(lapTimes[lapTimes$position<=3,c('driverRef','leadlap')],n=5)

This gives a result of the form:

##             driverRef leadlap
## 1              button    TRUE
## 2            hamilton   FALSE
## 3  michael_schumacher   FALSE
## 22             button    TRUE
## 23           hamilton   FALSE


A Boolean TRUE value has numeric value 1, a Boolean FALSE numeric value 0.
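A tiny illustration of why counting works, using a made-up vector of flags rather than the race data: the cumulative sum steps up by one at every TRUE and stays put at every FALSE.

```r
# Hypothetical leader flags for five successive rows
flags <- c(TRUE, FALSE, FALSE, TRUE, FALSE)

as.numeric(flags)   # 1 0 0 1 0
counts <- cumsum(flags)
counts              # 1 1 1 2 2
```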

#Calculate a rolling count of leader lap flags.
#Recall that the cars are ordered by accumulated race time.
#The accumulated count of leader flags is the lead lap number each driver is on.
lapTimes$leadlap=cumsum(lapTimes$leadlap)
head(lapTimes[lapTimes$position<=3,c('driverRef','leadlap')],n=6)

So when we count the flags, we get something like this:

##             driverRef leadlap
## 1              button       1
## 2            hamilton       1
## 3  michael_schumacher       1
## 22             button       2
## 23           hamilton       2
## 24 michael_schumacher       2

Let’s now calculate the track position for a given lead lap, where the leader in a given lap is in both race position and track position 1, the second car through the start/finish line is in track position 2 (irrespective of their race position), and so on. (In your mind’s eye, you might imagine the cars passing the finish line to complete each lap, first the race leader, then either car in second, or a lapped back marker, and so on.) Specifically, we group by leadlap and then accumulated race time within that lap, and assign track positions in incremental order.

lapTimes=arrange(lapTimes,leadlap,acctime)
#Number the cars within each lead lap in the order they cross the line
lapTimes=ddply(lapTimes,.(leadlap),transform,
               trackpos=seq_along(position))
lapTimes[lapTimes$leadlap==33,c('code','lap','position','acctime','leadlap','trackpos')]

We now have track - as well as race - positions:

##     code lap position acctime leadlap trackpos
## 616  BUT  33        1 3100735      33        1
## 617  HAM  33        2 3111538      33        2
## 618  VET  33        3 3113745      33        3
## 619  SEN  32       16 3115035      33        4
## 620  RIC  32       17 3115829      33        5
## 621  ALO  33        4 3125951      33        6
## 622  WEB  33        5 3131009      33        7
## 623  MAL  33        6 3133006      33        8
## 624  RAI  33        7 3141269      33        9
## 625  KOB  33        8 3147051      33       10
## 626  GLO  32       18 3150703      33       11
## 627  PER  33        9 3153818      33       12
## 628  ROS  33       10 3159053      33       13
## 629  VER  33       11 3162088      33       14
## 630  DIR  33       12 3172712      33       15
## 631  MAS  33       13 3177681      33       16
## 632  PET  33       14 3184974      33       17
## 633  PIC  32       19 3186685      33       18
## 634  KOV  33       15 3188375      33       19

In this example, we see Timo Glock (GLO) has only completed 32 laps compared to 33 for the race leader and the majority of the field. On track, he is placed between Kobayashi (KOB) and Perez (PER).

This code will form part of a forthcoming chapter in the Wrangling F1 Data With R book that revisits an old idea: battle charts.
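To try the whole recipe without the ergast data, here is a self-contained sketch on a made-up three car race. It uses base R (`ave()` and `order()`) in place of the plyr calls above, purely so that it runs without extra packages; the drivers, lap times and positions are invented for illustration.

```r
# Toy race: A leads, B follows, C is slow enough to be lapped.
# One row per driver per completed lap; times in milliseconds.
lapTimes <- data.frame(
  driverId     = c("A", "B", "C", "A", "B", "C", "A", "B"),
  lap          = c(1, 1, 1, 2, 2, 2, 3, 3),
  position     = c(1, 2, 3, 1, 2, 3, 1, 2),
  milliseconds = c(100, 112, 160, 100, 112, 160, 100, 112) * 1000
)

# Accumulated race time per driver, with laps taken in lap order
lapTimes <- lapTimes[order(lapTimes$driverId, lapTimes$lap), ]
lapTimes$acctime <- ave(lapTimes$milliseconds, lapTimes$driverId,
                        FUN = cumsum)

# Order by accumulated time, then count leader flags to get the lead lap
lapTimes <- lapTimes[order(lapTimes$acctime), ]
lapTimes$leadlap <- cumsum(lapTimes$position == 1)

# Within each lead lap, number the cars in the order they cross the line
lapTimes$trackpos <- ave(lapTimes$acctime, lapTimes$leadlap,
                         FUN = seq_along)

print(lapTimes[, c("driverId", "lap", "position", "leadlap", "trackpos")])
```

On lead lap 3 of this toy race, car C crosses the line between A and B while completing only its second lap, which is the same situation as Glock in the table above.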





Saturday, January 17, 2015

Seasonal Churn

With just a couple of weeks to go until testing begins for the 2015 F1 season, I thought I'd have a quick look at how the championship standings have churned over the years (eg Calculating Churn in Seasonal Leagues).



The churn value shows how different the league standings were to the previous year: a low churn (or adjusted churn - a value normalised to the range 0..1) means the standings changed little, which may be indicative of low levels of competitive change on a year on year basis.

For more, see the Keeping an Eye on Competitiveness - Tracking Churn chapter [initial version here] of Wrangling F1 Data With R.
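The post does not spell out the formula, so the following R sketch rests on an assumption: churn is taken to be the mean absolute change in rank between two seasons, and adjusted churn is that value divided by the churn of a completely reversed table (the largest value this measure can take). The function names and standings are hypothetical, and drivers who enter or leave between seasons are ignored.

```r
# Assumed definition: mean absolute change in rank between seasons
churn <- function(prev, curr) mean(abs(curr - prev))

# Churn of a fully reversed table of n competitors (the maximum)
maxChurn <- function(n) {
  ranks <- seq_len(n)
  mean(abs(rev(ranks) - ranks))
}

# Adjusted churn: normalised to the range 0..1
adjustedChurn <- function(prev, curr) {
  churn(prev, curr) / maxChurn(length(prev))
}

# Made-up standings: the top two drivers swap places
standings <- data.frame(
  driver = c("A", "B", "C", "D"),
  prev   = c(1, 2, 3, 4),
  curr   = c(2, 1, 3, 4)
)

churn(standings$prev, standings$curr)          # 0.5
adjustedChurn(standings$prev, standings$curr)  # 0.25
```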

Friday, January 9, 2015

F1 Datajunkie Experimental Virtual Machine

To try to make playing along with the Wrangling F1 Data With R book a little easier, I've posted a simple Docker file for launching a containerised virtual machine that runs RStudio and contains a local copy of a SQLite3 version of the ergast database (to 2013), a scrape of some of the results data from the F1 results website, and some sample R files: Wrangling F1 Data - docker.

The container lets you work with the data in a self-contained environment running on your own machine. (I will explore how to run it via a cloud service when I get a chance...) The RStudio application that runs in the container can be accessed via your browser.



I've also started exploring the idea of a code bundle alongside the book that will contain additional software files. The Github repository linked above will probably lag the book extras package, but will be added to over time.

Tuesday, December 16, 2014

Spotting Contested Positions in F1 Races Using Graph Theory

In the latest update to the Wrangling F1 Data With R book, I posted a recipe describing how to automatically identify the positions being contested in a race (which could equally be the championship race) by virtue of positions that had changed hands lap on lap.

The method comes via an answer to a question posted on Stack Overflow about how to spot disjoint sets of grouped items in a list. The trick is to construct a graph in which edges are placed between elements in each subset, and then to identify the clusters (connected components) across the whole set of items.

So for example, in this fragment of a lap chart showing race positions going from one lap to another, we see several position changes:


Many of the drivers do not change position at all, but there are position changes between four distinct groups of drivers: those in 1st and 2nd; those in 4th, 5th and 6th; those in 9th and 10th; and those in 17th, 18th and 19th.

If we connect nodes in a graph for each driver going from the position they held in the previous lap to the position they hold in the current lap (and ignore drivers that didn't change position), we get the following groupings:


Notice how the nodes - representing positions - are connected to each other by arrows, showing how a car placed in one position moved to another position. So for example, we see that the cars in positions 9 and 10 changed place with each other, as did those in positions 1 and 2. The car in 19th went to 18th, the one in 18th to 17th, and the one in 17th fell back to 19th. And so on.
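The grouping step can be sketched in base R as follows. This is not the code from the book chapter: it is a minimal stand-in that finds connected components by merging group labels along each edge, and a graph package could do the same job. The position data is a toy version of the example above, and the individual moves within the 4th to 6th group are invented, since the text only says that those three positions changed hands.

```r
# Position held on the previous lap (prev) and on the current lap (curr)
moves <- data.frame(
  prev = c(1, 2, 3, 4, 5, 6, 9, 10, 17, 18, 19),
  curr = c(2, 1, 3, 5, 6, 4, 10, 9, 19, 17, 18)
)

# Ignore cars that did not change position; each remaining row is an edge
changed <- moves[moves$prev != moves$curr, ]

# Every position starts in its own group, labelled by its own number
nodes <- sort(unique(c(changed$prev, changed$curr)))
group <- setNames(nodes, nodes)

# Each edge merges the groups of its two endpoints
for (i in seq_len(nrow(changed))) {
  a <- group[as.character(changed$prev[i])]
  b <- group[as.character(changed$curr[i])]
  group[group == max(a, b)] <- min(a, b)
}

# One vector of contested positions per connected component
contested <- unname(split(nodes, group))
print(contested)
```

With this toy data the result is four groups of positions: 1 and 2; 4, 5 and 6; 9 and 10; and 17, 18 and 19.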

The chapter containing the code for constructing the graph and partitioning it into separate clusters can currently be found as part of the preview for the Wrangling F1 Data With R book... but I'm not sure how long it will remain so...

See also: OUseful.info - Identifying Position Change Groupings in Rank Ordered Lists