Moneyball and Formula 1: A Model for Predicting Qualifying Results
I should say up front: I'm not an IT specialist, just a statistics enthusiast. I have also taken part in various Formula 1 prediction competitions for many years. That defines the challenge for my model: its forecasts should be no worse than those made “by eye”. Ideally, of course, the model should beat its human opponents.
This model focuses solely on predicting qualifying results, since qualifying is more predictable than the race and easier to model. In the future, though, I plan to build a model that predicts race results with reasonably good accuracy.
To build the model, I collected all practice and qualifying results for the 2018 and 2019 seasons in one table. The 2018 season served as the training sample and the 2019 season as the test sample. On this data I built a linear regression. Put as simply as possible, our data is a collection of points on the coordinate plane. We draw the straight line that deviates least from these points taken together, and the function whose graph is that line is our linear regression.
Our function differs from the formula $y = kx + b$ known from the school curriculum only in that it has two variables. The first variable (X1) is the gap in the third practice, and the second (X2) is the average gap in previous qualifying sessions. These variables are not equally important, and one of our goals is to determine the weight of each on a scale from 0 to 1. The farther a weight is from zero, the more that variable matters in explaining the dependent variable. In our case the dependent variable is the lap time, expressed as a gap to the leader (or rather to some “ideal lap”, since this value was positive for all drivers).
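For readers who prefer code to a spreadsheet, here is a minimal sketch of fitting such a two-variable regression by least squares in Python. The numbers are invented for illustration and are not taken from my table, the function name `fit_two_factor` is hypothetical, and the sketch assumes a regression without an intercept.

```python
import numpy as np

def fit_two_factor(x1, x2, y):
    """Least-squares weights (w1, w2) for y ~ w1*x1 + w2*x2, no intercept (an assumption)."""
    X = np.column_stack([x1, x2])
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights

# Made-up gaps in seconds, one row per team and Grand Prix (not real data).
fp3_gap = np.array([0.10, 0.45, 0.80, 1.20, 1.60, 2.10])         # X1: gap in the third practice
prev_quali_gap = np.array([0.20, 0.40, 0.95, 1.00, 1.75, 1.90])  # X2: average gap in previous qualifying

# Synthetic target built from known weights, so the fit can be checked.
quali_gap = 0.6 * fp3_gap + 0.4 * prev_quali_gap

w1, w2 = fit_two_factor(fp3_gap, prev_quali_gap, quali_gap)
print(round(float(w1), 3), round(float(w2), 3))
```

Because the target here is generated from the weights 0.6 and 0.4 without noise, the fit should recover values very close to them. Real data is noisy, so the weights it yields are only estimates.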
Fans of the book Moneyball (this point is not explained in the film) may recall that linear regression was used there to establish that on-base percentage (OBP) is more closely related to runs scored than other statistical indicators. We have much the same goal: to understand which factors are most closely related to qualifying results. One of the great advantages of regression is that it does not require advanced mathematics: we just supply the data, and Excel or another spreadsheet editor returns ready-made coefficients.
With linear regression we really want to learn two things. First, how well the independent variables we have chosen explain the variation in the dependent variable. Second, how significant each of those independent variables is. In other words, what explains qualifying results better: qualifying results at previous tracks or practice results at the same track.
One important point should be noted here. The final result was built from two independent parameters, each obtained from its own regression. The first parameter is the team's strength at a given Grand Prix, more precisely the gap of the team's best driver to the leader. The second parameter is the balance of power within the team.
What does this mean in practice? Take the 2019 Hungarian Grand Prix. The model shows that Ferrari's gap to the leader will be 0.218 seconds. But that is the gap of the team's first driver; who that will be, Vettel or Leclerc, and what the gap between them will be, is determined by the second parameter. In this example the model showed that Vettel would be ahead and Leclerc would lose 0.966 seconds to him.
Why make it so complicated? Wouldn't it be easier to consider each driver separately instead of splitting the forecast into the team's gap and the gap between the first and second driver within the team? Perhaps, but my own observations show that team results are much more reliable than the results of individual drivers. A driver can make a mistake, go off the track, or have technical problems. All of this brings chaos into the model unless you track every such incident by hand, which takes too much time. Incidents of this kind affect the team's result much less.
But back to the question of how well the independent variables we have chosen explain the variation in the dependent variable. This can be measured with the coefficient of determination. It shows the extent to which qualifying results are explained by practice results and previous qualifying sessions.
Since we built two regressions, we also have two coefficients of determination. The first regression is responsible for the team's level at a given Grand Prix, the second for the duel between the drivers of the same team. In the first case the coefficient of determination is 0.82, that is, 82% of the variation in qualifying results is explained by the factors we have chosen, and the remaining 18% by other factors that we did not take into account. This is a pretty good result. In the second case the coefficient of determination was only 0.13.
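The coefficient of determination is normally produced by the spreadsheet, but the usual textbook definition is easy to compute by hand. The sketch below assumes that definition (one minus the ratio of the residual sum of squares to the total sum of squares); the gaps are invented and the function name `r_squared` is hypothetical.

```python
import numpy as np

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((actual - predicted) ** 2)      # variation the model failed to explain
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Invented qualifying gaps in seconds (not real data).
actual_gaps = [0.0, 0.3, 0.6, 0.9, 1.2]
predicted_gaps = [0.1, 0.3, 0.5, 0.9, 1.2]

score = r_squared(actual_gaps, predicted_gaps)
print(round(float(score), 3))
```

A value close to 1 means the predictions track the actual gaps closely, while a value close to 0 means the model does little better than always predicting the average gap.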
In essence, these figures mean that the model predicts the team's level quite well but struggles to determine the gap between teammates. For the final goal, however, we do not need the gap itself; we only need to know which of the two drivers will be ahead, and the model mostly copes with that. In 62% of cases, the driver the model placed ahead really did finish ahead of his teammate in qualifying.
At the same time, when assessing the team's strength, the results of the final practice were about one and a half times more important than the results of previous qualifying sessions, whereas in the duels within a team it was the other way around. This trend showed up in both the 2018 and the 2019 data.
The final formula looks like this:
$$ Y1 = 0.618 \cdot X1 + 0.445 \cdot X2 $$
$$ Y2 = Y1 + (0.313 \cdot X1 + 0.511 \cdot X2) $$
As a reminder, X1 is the gap in the third practice, and X2 is the average gap in previous qualifying sessions.
What do these numbers mean? They mean that the team's level in qualifying is determined roughly 60% by the results of the third practice (0.618 / (0.618 + 0.445) ≈ 0.58) and roughly 40% by qualifying results at previous Grands Prix. Accordingly, the third practice is about one and a half times more significant a factor than previous qualifying sessions (0.618 / 0.445 ≈ 1.4).
Formula 1 fans probably know the answer already, but for everyone else I should explain why I took the results of the third practice. There are three practice sessions in Formula 1, and teams traditionally rehearse their qualifying runs in the last of them. When the third practice was disrupted by rain or other unforeseen circumstances, I took the results of the second practice instead. As far as I remember, there was only one such case in 2019: the Japanese Grand Prix, where the weekend was run in a shortened format because of a typhoon.
Someone has surely also noticed that the model uses the average gap in previous qualifying sessions. So what about the first Grand Prix of the season? I used the gaps from the previous year, not as they were, but corrected by hand on the basis of common sense. For example, in 2019 Ferrari was on average 0.3 seconds faster than Red Bull. It seems, however, that the Italian team will not have such an advantage this year, and may even fall behind. So for the first round of the 2020 season, the Austrian Grand Prix, I manually moved Red Bull closer to Ferrari.
In this way I obtained a gap for each driver, ranked the drivers by gap and got the final qualifying forecast. It is important to understand that “first” and “second” driver are pure conventions. Returning to the example of Vettel and Leclerc: at the Hungarian Grand Prix the model considered Sebastian the first driver, but at many other Grands Prix it preferred Leclerc.
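The ranking step can be sketched in Python as follows. The coefficients are the ones from the final formula, but everything else is an assumption made for illustration: the sketch feeds the second bracket with signed gaps between teammates (positive when driver B is slower than driver A) and lets the sign decide who counts as the first driver. The driver names, input numbers and the function name `predict_team` are hypothetical.

```python
def predict_team(team_fp3, team_quali, mate_fp3, mate_quali):
    """Return predicted gaps to the leader for drivers A and B of one team.

    team_*: gap of the team's best driver to the leader (X1, X2 of the first formula).
    mate_*: signed gap of driver B to driver A (assumed inputs of the second bracket).
    """
    y1 = 0.618 * team_fp3 + 0.445 * team_quali   # first driver of the team
    delta = 0.313 * mate_fp3 + 0.511 * mate_quali
    y2 = y1 + abs(delta)                         # second driver of the team
    # Assumption: a non-negative delta means driver A is the first driver.
    return (y1, y2) if delta >= 0 else (y2, y1)

# Invented inputs in seconds (not real data).
teams = {
    ("Driver A1", "Driver A2"): (0.00, 0.05, 0.10, -0.02),
    ("Driver B1", "Driver B2"): (0.30, 0.20, -0.15, -0.05),
}

forecast = {}
for (driver_a, driver_b), inputs in teams.items():
    gap_a, gap_b = predict_team(*inputs)
    forecast[driver_a] = gap_a
    forecast[driver_b] = gap_b

# Smallest predicted gap first: this is the predicted qualifying order.
grid = sorted(forecast, key=forecast.get)
print(grid)
```

In the second invented team the signed inputs are negative, so driver B2 is treated as the first driver, which illustrates why the labels “first” and “second” are only conventions.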
As I said, the task was to create a model that predicts no worse than people do. As the benchmark I took my own forecasts and those of my teammates, which were made “by eye”, but after careful study of the practice results and joint discussion.
The scoring system was as follows. Only the first ten drivers were taken into account. A forecast received 9 points for an exact hit, 6 points for a miss by 1 position, 4 points for a miss by 2 positions, 2 points for a miss by 3 positions and 1 point for a miss by 4 positions. That is, if a driver is 3rd in the forecast and actually takes pole position, the forecast receives 4 points.
With this system, the maximum number of points over 21 Grands Prix is 1890 (10 drivers × 9 points × 21 rounds).
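The scoring rules can be written down as a short Python sketch. It assumes that “the first ten drivers” means the first ten places of the forecast, which is one possible reading; the driver labels and the function name `score_forecast` are hypothetical.

```python
# Points by how many positions the forecast missed; bigger misses score nothing.
POINTS_BY_MISS = {0: 9, 1: 6, 2: 4, 3: 2, 4: 1}

def score_forecast(forecast, result):
    """Score a predicted order against the actual qualifying order."""
    actual_pos = {driver: pos for pos, driver in enumerate(result)}
    total = 0
    # Assumption: only the first ten places of the forecast are scored.
    for predicted_pos, driver in enumerate(forecast[:10]):
        if driver in actual_pos:
            miss = abs(predicted_pos - actual_pos[driver])
            total += POINTS_BY_MISS.get(miss, 0)
    return total

result = [f"D{i}" for i in range(1, 13)]         # invented finishing order
perfect = score_forecast(result, result)          # every place guessed exactly
swapped = score_forecast(["D2", "D3", "D1"] + result[3:], result)

print(perfect, perfect * 21, swapped)
```

In the second forecast the pole-sitter D1 is predicted 3rd, so that pick earns 4 points, as in the example above; the two drivers placed ahead of him each miss by one position and earn 6 points.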
The human participants scored 1056, 1048 and 1034 points.
The model scored 1031 points, although with minor tweaks to the coefficients I also obtained 1045 and 1053 points.
Personally, I am pleased with the results, since this is my first experience of building regressions and it led to quite acceptable results. Of course I would like to improve them, because I am sure that by building models, even ones as simple as this, you can achieve a better result than by simply evaluating the data “by eye”. Within this model one could, for example, account for the fact that some teams are weak in practice but come alive in qualifying. For instance, it has been observed that Mercedes was often not the best team in practice but performed much better in qualifying. Such human observations, however, were not reflected in the model. So in the 2020 season, which will begin in July (if nothing unexpected happens), I want to test this model in competition against live forecasters and also find ways to make it better.
I also hope to get a response from the community of Formula 1 fans. I believe that by exchanging ideas we can better understand what shapes the results of qualifying and races, and that is ultimately the goal of anyone who makes predictions.