First up, an admission: I'm not much of a statistician.

Second up, a suspicion: I suspect that statisticians have lots of clever tricks for using maths and stats to help them tell stories to themselves about the distributions of various populations.

Third up, a vision: that it might be possible to use the tools the statisticians use to get a gut feel about how things are distributed without having to know too much about the stats, but just doing a bit of common-sense reasoning...

You might say this is an anti-academic view. But then again, my car is blue...* [UPDATE: It's also late, and I've come to realise there are a couple of obvious errors in this post that I've tried to highlight...]

** I find this a hilarious response, in much the same way I always laugh to myself about the "What's the difference between a duck?" joke. The backstory to it is, if anyone asks me what sort of car I have (let alone how it works), my response is always: "my car's blue"...*

So... here's a few ways of looking at Webber's and Button's laptimes. These may or may not be sensible ways - what I'm looking for is ways that might tell us something interesting about how these drivers' races compared in a purely visual, intuitive way (that is, are there visual tools we can throw at the data, and if anything looks vaguely interesting, use that as the basis for a more considered look? I'm not suggesting the rash application of tools and measures to find things we might claim. I'm interested in the rash application of tools and measures to see if there's something worth exploring in a little more detail...)

If you'd like to engage in debate, principled or otherwise, about what, if anything, any of the following charts hint at, please feel free to chip in in the comments (I'm making this all up as I go along, and would value corrections as well as other observations/contributions. We're all learning together, right?;-)

First up, a straightforward comparison of laptimes lap-by-lap:

Hmm.. nothing too obvious there, except that they're pitting at different times... How about if we look at fuel corrected laptimes (this shouldn't show much difference? Just a compaction of the times perhaps to a smaller range?)

Okay, so the distribution looks sort of flat... so maybe Button has more consistent laptimes than Webber... [UPDATE: arrghhhh **GOTCHA** moment: the axes are different scales, so the 85s distance on the y-axis is different to the 5s axis on the x axis. *Note to self* - force them to be the same (err, how do I do this in R? Fixing the range of each axis and the viewing portal to be the same would seem easiest?)] That's something we can check - here's the distribution of fuel corrected laptimes plotted explicitly:

And here's the boxplot:

Button is more consistent...
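One rough way to put numbers on "more consistent" is to compare a few spread measures alongside the boxplot. What follows is only a sketch: the laptimes are made-up illustrative values (not the real race data), and the `web`, `but` and `spread` names are my own.

```r
# Made-up fuel corrected laptimes in seconds, for illustration only
# (these are NOT the real Spain 2011 timings)
web <- c(86.9, 87.4, 88.1, 87.0, 89.2, 86.5, 88.7, 87.8)
but <- c(87.5, 87.7, 87.9, 87.6, 88.0, 87.4, 87.8, 87.6)

# A few simple measures of spread; smaller means more consistent
spread <- function(x) {
  x <- x[!is.na(x)]
  c(sd = sd(x), iqr = IQR(x), range = diff(range(x)))
}

spreads <- rbind(WEB = spread(web), BUT = spread(but))
print(round(spreads, 3))

# The same comparison as a side-by-side boxplot
boxplot(list(WEB = web, BUT = but), ylab = "Fuel corrected laptime (s)")
```

With the real data you would pass in something like `WEB$fuelCorrectedLaptime` and `BUT$fuelCorrectedLaptime` instead of the toy vectors.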

Okay, how about a slightly different sort of scatterplot, specifically, a QQ plot. This works by ranking each driver's laptimes, then plotting the first ranked against the first ranked, the second ranked against the second ranked, and so on. Two lines are plotted on the chart. One corresponds to x=y. The other line goes through the first and third quartile points. First up, by ranked laptime:

Err, erm.. so, err, the lower line is x=y, so, err... based on the x=y line: the fastest laps are above the line, so Webber does the fastest fastest laps, in absolute terms. More generally, most of the marks appear below the x=y line, so Button is more consistently on the faster lap? The qqline has a gradient greater than one, and it crosses the x=y line between the first and third quartiles, so, err, erm...? It's late, my intuition is failing me... I'll have to come back to that....
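To make the "rank, then plot rank against rank" construction concrete, here's a minimal sketch. The laptimes are made up for illustration, and I'm assuming Webber goes on the x-axis and Button on the y-axis, so a point above the x=y line means Button was the slower of the two at that rank.

```r
# Made-up laptimes in seconds, for illustration only (not the real data)
web <- c(86.9, 87.4, 88.1, 87.0, 89.2, 86.5, 88.7, 87.8)
but <- c(87.5, 87.7, 87.9, 87.6, 88.0, 87.4, 87.8, 87.6)

# Rank each driver's laps from fastest to slowest
web_ranked <- sort(web)
but_ranked <- sort(but)

# Plot fastest against fastest, second fastest against second fastest, ...
lims <- range(c(web, but))
plot(web_ranked, but_ranked, xlim = lims, ylim = lims,
     xlab = "WEB ranked laptime (s)", ylab = "BUT ranked laptime (s)")
abline(0, 1, lty = 2)  # the x = y reference line

# TRUE where the point sits above x = y, i.e. BUT is slower at that rank
but_slower <- but_ranked > web_ranked
print(but_slower)
```

Base R's `qqplot()` does the same sorting and also interpolates when the two drivers have different numbers of laps, which this hand-rolled version does not handle.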

How about the qqplot over the rank ordered fuel corrected laptimes?

(I'm thinking I maybe need to jiggle the axes so the x and y ranges are the same in the above two plots to make for a fairer visual comparison... how'd I set the axis ranges in R then...?)
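For the axis range question, one approach that should work is to compute a single range covering both drivers and hand it to both `xlim` and `ylim`, with a square plotting region so that a second looks the same length in both directions. A sketch, using made-up laptimes rather than the real data:

```r
# Made-up laptimes in seconds, for illustration only (not the real data)
web <- c(86.9, 87.4, 88.1, 87.0, 89.2, 86.5, 88.7, 87.8)
but <- c(87.5, 87.7, 87.9, 87.6, 88.0, 87.4, 87.8, 87.6)

# One shared range for both axes
lims <- range(c(web, but), na.rm = TRUE)

# pty = "s" asks for a square plotting region; keep the old setting to restore
op <- par(pty = "s")
qq <- qqplot(web, but, xlim = lims, ylim = lims,
             xlab = "WEB laptime (s)", ylab = "BUT laptime (s)")
abline(0, 1, lty = 2)  # x = y now runs corner to corner
par(op)
```

An alternative is to pass `asp = 1` to the plot call, which fixes the aspect ratio at one y unit per x unit but may extend the axis limits to fit the device.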

This time, the qqline gradient is way less than 1, but again the cross seems to appear between first and third quartiles, so, err... erm... it's late, I'm slow tonight...?:-(
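One way to read the quartile line: its gradient is just the ratio of the two interquartile ranges, so (assuming Webber on x and Button on y) a gradient below 1 says the middle half of Button's laps is packed into a narrower band than Webber's. Where it crosses x=y is roughly where the two drivers' ranked times match. A sketch on made-up numbers, not the real data:

```r
# Made-up laptimes in seconds, for illustration only (not the real data)
web <- c(86.9, 87.4, 88.1, 87.0, 89.2, 86.5, 88.7, 87.8)
but <- c(87.5, 87.7, 87.9, 87.6, 88.0, 87.4, 87.8, 87.6)

# First and third quartiles for each driver
qx <- quantile(web, c(0.25, 0.75), na.rm = TRUE)
qy <- quantile(but, c(0.25, 0.75), na.rm = TRUE)

# Line through the two quartile points (the same sums the quartile line uses)
slope <- unname(diff(qy) / diff(qx))
int   <- unname(qy[1] - slope * qx[1])

# The gradient is the ratio of interquartile ranges
iqr_ratio <- IQR(but) / IQR(web)

# x value where the quartile line meets x = y (only defined if slope != 1)
cross_x <- int / (1 - slope)

print(c(slope = slope, iqr_ratio = iqr_ratio, cross_x = cross_x))
```

With a gradient below 1, the quartile line sits above x=y to the left of the crossing and below it to the right: on this reading Webber has the quicker quick laps and the slower slow laps, and Button's times are the more tightly bunched.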

Should I maybe try a best fit line as well? How do I do that in R?
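A best fit line can be added by fitting a linear model to the ranked pairs with `lm()` and passing the fit to `abline()`. This is just a sketch on made-up laptimes; the `ranked` data frame and its column names are my own invention.

```r
# Made-up laptimes in seconds, for illustration only (not the real data)
web <- c(86.9, 87.4, 88.1, 87.0, 89.2, 86.5, 88.7, 87.8)
but <- c(87.5, 87.7, 87.9, 87.6, 88.0, 87.4, 87.8, 87.6)

# Get the ranked pairs without drawing anything yet
qq <- qqplot(web, but, plot.it = FALSE)
ranked <- data.frame(x = qq$x, y = qq$y)

# Least squares fit of BUT's ranked times on WEB's ranked times
fit <- lm(y ~ x, data = ranked)

plot(ranked$x, ranked$y,
     xlab = "WEB ranked laptime (s)", ylab = "BUT ranked laptime (s)")
abline(fit, col = "red")  # best fit line
abline(0, 1, lty = 2)     # x = y for reference
print(coef(fit))
```

Unlike the quartile line, a least squares fit uses every point, so a few very slow laps (pit stops, traffic) can drag it around quite a bit.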

As far as the pitstop times go, because the drivers ran different strategies (3 stops for BUT vs 4 for WEB) I'm not sure we can do much of a comparison (unless we compare slowest 3 stops... For a more literal comparison, see the F1 2011 Spain Race - Pit Stop Analysis.)

Assuming that when I come to the qqline/quartile line with a clear head I can make some sense of it, here's a quick look at the ranked fuel corrected laptimes for Hamilton and Vettel:

Recalling that in F1 2011 Spain Race - Driver Comparisons I couldn't really identify who was "faster", the closeness of the two lines in the above plot suggests the drivers were pretty evenly matched?

Note: I couldn't get the *qqline()* R function to work (it wouldn't plot anything?), but instead I found this function:

    qqline2 <- function (x, y, ...)
    {
      # First and third quartiles of each sample, ignoring NAs
      y <- quantile(y[!is.na(y)], c(0.25, 0.75))
      x <- quantile(x[!is.na(x)], c(0.25, 0.75))
      # Line through the two quartile points
      slope <- diff(y)/diff(x)
      int <- y[1L] - slope * x[1L]
      abline(int, slope, ...)
    }

The lines were then plotted using the commands:

    qqline2(WEB$fuelCorrectedLaptime,BUT$fuelCorrectedLaptime)
    abline(0,1)

Please could you change the white-on-black, it's very hard on the eyes. Thanks.

Two points.

1. Statisticians generally prefer graphical to numerical tools as these better reflect the subtleties of datasets and are less subject to catastrophic failure [e.g. vs shapiro.test() ]. QQplots fit a normal distribution to a straight line - obviously it is easy to see deviation here, c.f. a histogram with a normal probability overlay (all nasty curves). This is good: http://www.scribd.com/doc/51255252/Reif-Regression-Diagnostics-II

2. I don't believe the differences in lap times are significant, so I'd be surprised if you see much. Wobbles in the data are caused by non-driver factors such as traffic. In race terms, pit stops are much more significant, especially considering the aero difficulties caused by a track like Spain.

Personally, I'd ban pit stops and do more to clean up the airflow.

@AJ Cann

>Personally, I'd ban pit stops and do more to clean up the airflow

Is this the scientist in you coming out? Remove as many variables as you can to improve the accuracy of the result. Why not just give everyone the same car as well?

;)

Martin

Actually I've been recommending random car shuffling with old-style Le Mans starts for years. Teams/drivers find out which driver/car they've got 30 minutes before the race. Two competitions over the season, best driver and best team.
