F1 Race Wins by Starting Grid Position: Historical Analysis in Python
Written by Piotr Płoński · Published on April 17, 2026 · Notebook code
Analyze Formula 1 race results in Python to measure win rates by starting grid position, compare eras, and identify top drivers who won from grid position 4 or worse.
This notebook analyzes historical Formula 1 race-result tables from CSV files (including results and driver metadata). The core dataset is results.csv, which contains starting grid positions and final race outcomes across 26,759 race-result rows spanning multiple decades.
Using AI-assisted, plain-English prompts, the workflow computes win rates by grid slot, visualizes the front-row advantage, segments results into historical eras, and ranks drivers with the most wins from grid position 4 or worse. The analysis is executed in Python with pandas, seaborn, and matplotlib.
The results show a strong front-of-grid effect: pole position wins 42.34% of starts, with a steep drop after the first rows. Pole advantage increases in modern eras (about 35-37% before 2000 vs around 50-52% since 2000). The notebook also highlights notable comeback-style winners from deeper starts, led by Alain Prost (16 wins from grid 4+), followed by Nelson Piquet (15) and Kimi Räikkönen (13).
Thinking
Creating Python Code for File Listing
I need to produce runnable Python code that lists files and prints the first five lines. Using os.walk could work, but I might also consider pathlib.Path.rglob for recursive listing.
I shouldn’t assume outputs and should keep the explanation brief. I’ll avoid unnecessary print statements in a notebook context while ensuring that text lists are still printed. It’s important to handle binary files gracefully by skipping unreadable ones. I’ll make sure to remain concise!
Here’s a small notebook step that lists all files under the current working directory and shows the first 5 lines of each readable file.
Preview file contents
from pathlib import Path
files = sorted([p for p in Path(".").rglob("*") if p.is_file()])
for path in files:
print(f"\n=== {path} ===")
try:
with open(path, "r", encoding="utf-8", errors="replace") as f:
for i in range(5):
line = f.readline()
if not line:
break
print(line.rstrip("\n"))
except Exception as e:
print(f"[Could not read as text: {e}]")=== circuits.csv === circuitId,circuitRef,name,location,country,lat,lng,alt,url 1,"albert_park","Albert Park Grand Prix Circuit","Melbourne","Australia",-37.8497,144.968,10,"http://en.wikipedia.org/wiki/Melbourne_Grand_Prix_Circuit" 2,"sepang","Sepang International Circuit","Kuala Lumpur","Malaysia",2.76083,101.738,18,"http://en.wikipedia.org/wiki/Sepang_International_Circuit" 3,"bahrain","Bahrain International Circuit","Sakhir","Bahrain",26.0325,50.5106,7,"http://en.wikipedia.org/wiki/Bahrain_International_Circuit" 4,"catalunya","Circuit de Barcelona-Catalunya","Montmeló","Spain",41.57,2.26111,109,"http://en.wikipedia.org/wiki/Circuit_de_Barcelona-Catalunya" === constructor_results.csv === constructorResultsId,raceId,constructorId,points,status 1,18,1,14,\N 2,18,2,8,\N 3,18,3,9,\N 4,18,4,5,\N === constructor_standings.csv === constructorStandingsId,raceId,constructorId,points,position,positionText,wins 1,18,1,14,1,"1",1 2,18,2,8,3,"3",0 3,18,3,9,2,"2",0 4,18,4,5,4,"4",0 === constructors.csv === constructorId,constructorRef,name,nationality,url 1,"mclaren","McLaren","British","http://en.wikipedia.org/wiki/McLaren" 2,"bmw_sauber","BMW Sauber","German","http://en.wikipedia.org/wiki/BMW_Sauber" 3,"williams","Williams","British","http://en.wikipedia.org/wiki/Williams_Grand_Prix_Engineering" 4,"renault","Renault","French","http://en.wikipedia.org/wiki/Renault_in_Formula_One" === driver_standings.csv === driverStandingsId,raceId,driverId,points,position,positionText,wins 1,18,1,10,1,"1",1 2,18,2,8,2,"2",0 3,18,3,6,3,"3",0 4,18,4,5,4,"4",0 === drivers.csv === driverId,driverRef,number,code,forename,surname,dob,nationality,url 1,"hamilton",44,"HAM","Lewis","Hamilton","1985-01-07","British","http://en.wikipedia.org/wiki/Lewis_Hamilton" 2,"heidfeld",\N,"HEI","Nick","Heidfeld","1977-05-10","German","http://en.wikipedia.org/wiki/Nick_Heidfeld" 3,"rosberg",6,"ROS","Nico","Rosberg","1985-06-27","German","http://en.wikipedia.org/wiki/Nico_Rosberg" 4,"alonso",14,"ALO","Fernando","Alonso","1981-07-29","Spanish","http://en.wikipedia.org/wiki/Fernando_Alonso" === lap_times.csv === raceId,driverId,lap,position,time,milliseconds 841,20,1,1,"1:38.109",98109 841,20,2,1,"1:33.006",93006 841,20,3,1,"1:32.713",92713 841,20,4,1,"1:32.803",92803 === pit_stops.csv === raceId,driverId,stop,lap,time,duration,milliseconds 841,153,1,1,"17:05:23","26.898",26898 841,30,1,1,"17:05:52","25.021",25021 841,17,1,11,"17:20:48","23.426",23426 841,4,1,12,"17:22:34","23.251",23251 === qualifying.csv === qualifyId,raceId,driverId,constructorId,number,position,q1,q2,q3 1,18,1,1,22,1,"1:26.572","1:25.187","1:26.714" 2,18,9,2,4,2,"1:26.103","1:25.315","1:26.869" 3,18,5,1,23,3,"1:25.664","1:25.452","1:27.079" 4,18,13,6,2,4,"1:25.994","1:25.691","1:27.178" === races.csv === raceId,year,round,circuitId,name,date,time,url,fp1_date,fp1_time,fp2_date,fp2_time,fp3_date,fp3_time,quali_date,quali_time,sprint_date,sprint_time 1,2009,1,1,"Australian Grand Prix","2009-03-29","06:00:00","http://en.wikipedia.org/wiki/2009_Australian_Grand_Prix",\N,\N,\N,\N,\N,\N,\N,\N,\N,\N 2,2009,2,2,"Malaysian Grand Prix","2009-04-05","09:00:00","http://en.wikipedia.org/wiki/2009_Malaysian_Grand_Prix",\N,\N,\N,\N,\N,\N,\N,\N,\N,\N 3,2009,3,17,"Chinese Grand Prix","2009-04-19","07:00:00","http://en.wikipedia.org/wiki/2009_Chinese_Grand_Prix",\N,\N,\N,\N,\N,\N,\N,\N,\N,\N 4,2009,4,3,"Bahrain Grand Prix","2009-04-26","12:00:00","http://en.wikipedia.org/wiki/2009_Bahrain_Grand_Prix",\N,\N,\N,\N,\N,\N,\N,\N,\N,\N === results.csv === resultId,raceId,driverId,constructorId,number,grid,position,positionText,positionOrder,points,laps,time,milliseconds,fastestLap,rank,fastestLapTime,fastestLapSpeed,statusId 1,18,1,1,22,1,1,"1",1,10,58,"1:34:50.616",5690616,39,2,"1:27.452","218.300",1 2,18,2,2,3,5,2,"2",2,8,58,"+5.478",5696094,41,3,"1:27.739","217.586",1 3,18,3,3,7,7,3,"3",3,6,58,"+8.163",5698779,41,5,"1:28.090","216.719",1 4,18,4,4,5,11,4,"4",4,5,58,"+17.181",5707797,58,7,"1:28.603","215.464",1 === seasons.csv === year,url 2009,"http://en.wikipedia.org/wiki/2009_Formula_One_season" 2008,"http://en.wikipedia.org/wiki/2008_Formula_One_season" 2007,"http://en.wikipedia.org/wiki/2007_Formula_One_season" 2006,"http://en.wikipedia.org/wiki/2006_Formula_One_season" === sprint_results.csv === resultId,raceId,driverId,constructorId,number,grid,position,positionText,positionOrder,points,laps,time,milliseconds,fastestLap,fastestLapTime,statusId 1,1061,830,9,33,2,1,"1",1,3,17,"25:38.426",1538426,14,"1:30.013",1 2,1061,1,131,44,1,2,"2",2,2,17,"+1.430",1539856,17,"1:29.937",1 3,1061,822,131,77,3,3,"3",3,1,17,"+7.502",1545928,17,"1:29.958",1 4,1061,844,6,16,4,4,"4",4,0,17,"+11.278",1549704,16,"1:30.163",1 === status.csv === statusId,status 1,"Finished" 2,"Disqualified" 3,"Accident" 4,"Collision"
Thinking
Considering data visualization strategy
I'm thinking about the approach for displaying data. The guideline suggests using one chart or one table per step, but the initial load indicates showing shape, columns, and head. It's okay if it’s not perfect. I could consider storing summaries in a DataFrame, but perhaps I should keep the head separate. A dict display might work well too. I also want to note that the grid position and winner are likely in the results.csv. Let’s refine this!
Let’s start with the one file we need most: results.csv. It should contain both the starting grid (grid) and who won (positionOrder or position).
Load and preview results data
import pandas as pd
from IPython.display import display
results = pd.read_csv("results.csv", na_values=["\\N"])
display(pd.DataFrame({"column": results.columns}))
display(results.head())
results.shape| column | |
|---|---|
| 0 | resultId |
| 1 | raceId |
| 2 | driverId |
| 3 | constructorId |
| 4 | number |
| 5 | grid |
| 6 | position |
| 7 | positionText |
| 8 | positionOrder |
| 9 | points |
| 10 | laps |
| 11 | time |
| 12 | milliseconds |
| 13 | fastestLap |
| 14 | rank |
| 15 | fastestLapTime |
| 16 | fastestLapSpeed |
| 17 | statusId |
| resultId | raceId | driverId | constructorId | number | grid | position | positionText | positionOrder | points | laps | time | milliseconds | fastestLap | rank | fastestLapTime | fastestLapSpeed | statusId | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 18 | 1 | 1 | 22.0 | 1 | 1.0 | 1 | 1 | 10.0 | 58 | 1:34:50.616 | 5690616.0 | 39.0 | 2.0 | 1:27.452 | 218.300 | 1 |
| 1 | 2 | 18 | 2 | 2 | 3.0 | 5 | 2.0 | 2 | 2 | 8.0 | 58 | +5.478 | 5696094.0 | 41.0 | 3.0 | 1:27.739 | 217.586 | 1 |
| 2 | 3 | 18 | 3 | 3 | 7.0 | 7 | 3.0 | 3 | 3 | 6.0 | 58 | +8.163 | 5698779.0 | 41.0 | 5.0 | 1:28.090 | 216.719 | 1 |
| 3 | 4 | 18 | 4 | 4 | 5.0 | 11 | 4.0 | 4 | 4 | 5.0 | 58 | +17.181 | 5707797.0 | 58.0 | 7.0 | 1:28.603 | 215.464 | 1 |
| 4 | 5 | 18 | 5 | 1 | 23.0 | 3 | 5.0 | 5 | 5 | 4.0 | 58 | +18.014 | 5708630.0 | 43.0 | 1.0 | 1:27.418 | 218.385 | 1 |
(26759, 18)
Thinking
Deciding on win rate calculation
I need to determine the best way to compute the win rate from the current data we have. We only have the columns and heads, without an existing win rate metric. It would be best to create minimal code that directly calculates win rates based on the starting grid position. The results.csv file contains the necessary columns, specifically grid and positionOrder, with a total of 26,759 rows. I'll focus on aggregating starters and wins by grid, ensuring to exclude grid positions that are less than or equal to zero since those might indicate a pitlane or unknown starts. I want the code to reflect this efficiently, ultimately aiming to display a simple table sorted by grid positions.
The current output already confirms results.csv has the two key fields we need: grid for starting position and positionOrder for finishing order, across 26,759 race-result rows. The missing piece is the actual aggregation of wins by grid slot.
Calculate grid win rates
win_rate = (
results.loc[results["grid"] > 0]
.assign(win=results.loc[results["grid"] > 0, "positionOrder"].eq(1))
.groupby("grid", as_index=False)
.agg(
starts=("grid", "size"),
wins=("win", "sum")
)
)
win_rate["win_rate_pct"] = (win_rate["wins"] / win_rate["starts"] * 100).round(2)
win_rate = win_rate.sort_values("grid").reset_index(drop=True)
display(win_rate.head(15))
win_rate| grid | starts | wins | win_rate_pct | |
|---|---|---|---|---|
| 0 | 1 | 1136 | 481 | 42.34 |
| 1 | 2 | 1125 | 268 | 23.82 |
| 2 | 3 | 1130 | 137 | 12.12 |
| 3 | 4 | 1132 | 68 | 6.01 |
| 4 | 5 | 1132 | 49 | 4.33 |
| 5 | 6 | 1125 | 40 | 3.56 |
| 6 | 7 | 1135 | 23 | 2.03 |
| 7 | 8 | 1129 | 17 | 1.51 |
| 8 | 9 | 1132 | 5 | 0.44 |
| 9 | 10 | 1130 | 12 | 1.06 |
| 10 | 11 | 1132 | 5 | 0.44 |
| 11 | 12 | 1127 | 3 | 0.27 |
| 12 | 13 | 1125 | 4 | 0.36 |
| 13 | 14 | 1120 | 7 | 0.62 |
| 14 | 15 | 1113 | 1 | 0.09 |
| grid | starts | wins | win_rate_pct | |
|---|---|---|---|---|
| 0 | 1 | 1136 | 481 | 42.34 |
| 1 | 2 | 1125 | 268 | 23.82 |
| 2 | 3 | 1130 | 137 | 12.12 |
| 3 | 4 | 1132 | 68 | 6.01 |
| 4 | 5 | 1132 | 49 | 4.33 |
| 5 | 6 | 1125 | 40 | 3.56 |
| 6 | 7 | 1135 | 23 | 2.03 |
| 7 | 8 | 1129 | 17 | 1.51 |
| 8 | 9 | 1132 | 5 | 0.44 |
| 9 | 10 | 1130 | 12 | 1.06 |
| 10 | 11 | 1132 | 5 | 0.44 |
| 11 | 12 | 1127 | 3 | 0.27 |
| 12 | 13 | 1125 | 4 | 0.36 |
| 13 | 14 | 1120 | 7 | 0.62 |
| 14 | 15 | 1113 | 1 | 0.09 |
| 15 | 16 | 1100 | 2 | 0.18 |
| 16 | 17 | 1089 | 3 | 0.28 |
| 17 | 18 | 1050 | 1 | 0.10 |
| 18 | 19 | 1031 | 1 | 0.10 |
| 19 | 20 | 975 | 0 | 0.00 |
| 20 | 21 | 697 | 0 | 0.00 |
| 21 | 22 | 656 | 1 | 0.15 |
| 22 | 23 | 453 | 0 | 0.00 |
| 23 | 24 | 429 | 0 | 0.00 |
| 24 | 25 | 301 | 0 | 0.00 |
| 25 | 26 | 248 | 0 | 0.00 |
| 26 | 27 | 46 | 0 | 0.00 |
| 27 | 28 | 30 | 0 | 0.00 |
| 28 | 29 | 25 | 0 | 0.00 |
| 29 | 30 | 19 | 0 | 0.00 |
| 30 | 31 | 18 | 0 | 0.00 |
| 31 | 32 | 17 | 0 | 0.00 |
| 32 | 33 | 13 | 0 | 0.00 |
| 33 | 34 | 1 | 0 | 0.00 |
Across all years in this dataset, winning is heavily concentrated at the very front of the grid.
- Pole (P1): 481 wins from 1,136 starts → 42.34% win rate
- P2: 268 wins from 1,125 starts → 23.82%
- P3: 137 wins from 1,130 starts → 12.12%
- P4: 68 wins from 1,132 starts → 6.01%
- P5: 49 wins from 1,132 starts → 4.33%
- P6: 40 wins from 1,125 starts → 3.56%
- P7: 23 wins from 1,135 starts → 2.03%
- P8: 17 wins from 1,129 starts → 1.51%
- P9: 5 wins from 1,132 starts → 0.44%
- P10: 12 wins from 1,130 starts → 1.06%
- P11: 5 wins from 1,132 starts → 0.44%
- P12: 3 wins from 1,127 starts → 0.27%
- P13: 4 wins from 1,125 starts → 0.36%
- P14: 7 wins from 1,120 starts → 0.62%
- P15: 1 win from 1,113 starts → 0.09%
- P16: 2 wins from 1,100 starts → 0.18%
- P17: 3 wins from 1,089 starts → 0.28%
- P18: 1 win from 1,050 starts → 0.10%
- P19: 1 win from 1,031 starts → 0.10%
- P20: 0.00%
- P21: 0.00%
- P22: 1 win from 656 starts → 0.15%
What this means:
- Pole is by far the strongest starting spot: about 4 in 10 pole sitters win.
- There is a sharp drop from P1 to P2, and another from P2 to P3.
- By P4 and lower, win rates become much smaller.
- Wins from the midfield or back are rare exceptions, though they do occur occasionally.
- There are a few small non-monotonic bumps later on the grid, like P10 slightly above P9 and P14 above nearby spots, which are most likely due to relatively small numbers of wins in those positions rather than a real advantage.
So the direct answer is: win rate declines steeply as starting position moves away from pole, with P1, P2, and P3 dominating race wins across all years.
Thinking
Visualizing win rates
I need to create a chart to visualize win rates based on the starting grid position. It looks like I should use libraries like seaborn or matplotlib to make a bar plot of win rates by grid position. I want to keep the code concise, using just one block, and I might include a title and labels to make it clearer. I’ll avoid unnecessary print statements, and it seems plotting all current win rates will be essential.Creating a concise response
I’ll craft a clear, natural response for the user. I think something like, “A simple bar chart makes the drop-off obvious — pole dominates, then win rate falls sharply after P3” would work well. For the code, I’ll use sns.barplot(data=win_rate, x="grid", y="win_rate_pct", color="steelblue"). I also need to add x-ticks for the integer positions. Once that's in place, I’ll consider it done. This should provide an effective visualization!
A bar chart will show the drop-off clearly: pole is dominant, and the win rate falls fast after the first few grid spots.
Plot win rate by grid
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(12, 5))
sns.barplot(data=win_rate, x="grid", y="win_rate_pct", color="steelblue")
plt.title("F1 win rate by starting grid position across all years")
plt.xlabel("Starting grid position")
plt.ylabel("Win rate (%)")
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()The visualization is doing its job well: it shows a very steep drop-off in win rate as starting position moves back from the front row.
From the values behind the chart:
- Pole (grid 1) has the highest win rate by a wide margin: 42.34%
- Grid 2 is next at 23.82%
- Grid 3 drops again to 12.12%
- By grid 4, win rate is already down to 6.01%
- From grid 5 onward, wins become relatively rare, mostly in the low single digits or below 1%
What the chart means:
- Starting on pole is a huge advantage across all years.
- There is a clear front-loaded pattern: most wins come from the first few grid spots.
- The relationship is not perfectly smooth in the lower positions, but those small bumps are minor compared with the overall pattern.
- Positions near the back of the grid have near-zero win rates, confirming that comeback wins are unusual.
So the visual takeaway is simple: the farther back a driver starts, the less likely they are to win, with pole position standing out dramatically above every other starting spot.
Thinking
Planning data analysis steps
I need to approach the task step by step. My first idea is to load the races.csv file to check for years. Since I haven't loaded it yet, the next logical step would be to load the races and merge the year into the results or compute the pole win rate by era. Since the user is interested in how the pole win rate changes by era, it's crucial to inspect the races dataset first. I'll focus on loading the races and checking the relevant columns.Loading and inspecting dataset
I need to follow the rule of loading the dataset first for inspection since it hasn't been loaded in the current notebook. For the races data, my initial actions are to load it and then display its shape, columns, and head. However, I want to be mindful of displaying everything in one format, perhaps a concise summary table for the shape. Ultimately, I should keep it simple and clear. My next step is to load the races and show the relevant columns along with their shape.Determining data display approach
I realize that I shouldn't print unnecessary details; using pd.read_csv will help me load the dataset. I want to show the shape, columns, and head, but displaying both the columns and head feels like two tables. So, instead, I'll create a summary dictionary with shape and the count of columns, then display the head. I think keeping my output focused on one final expression makes sense—I'll display races.head() and then list the columns with races.columns.tolist(). This should keep things clear!
We need the race year to define those eras, so the next step is to load races.csv and check the key columns we’ll join on.
Load and preview races data
races = pd.read_csv("races.csv", na_values=["\\N"])
display(races[["raceId", "year", "name", "date"]].head())
{
"shape": races.shape,
"columns": races.columns.tolist()
}| raceId | year | name | date | |
|---|---|---|---|---|
| 0 | 1 | 2009 | Australian Grand Prix | 2009-03-29 |
| 1 | 2 | 2009 | Malaysian Grand Prix | 2009-04-05 |
| 2 | 3 | 2009 | Chinese Grand Prix | 2009-04-19 |
| 3 | 4 | 2009 | Bahrain Grand Prix | 2009-04-26 |
| 4 | 5 | 2009 | Spanish Grand Prix | 2009-05-10 |
{'shape': (1125, 18),
'columns': ['raceId',
'year',
'round',
'circuitId',
'name',
'date',
'time',
'url',
'fp1_date',
'fp1_time',
'fp2_date',
'fp2_time',
'fp3_date',
'fp3_time',
'quali_date',
'quali_time',
'sprint_date',
'sprint_time']}Thinking
Visualizing data choices
I'm considering how to visualize the data, and a table might suffice. We might need to gather additional evidence, maybe even use both a table and a chart. But with our primary objective in mind, a table seems adequate since the original goal focuses on how the pole win rate changes by era. I think we can compute and display a compact table instead. Also, I should remember to follow the notebook rules about expressions and code structure.
Right now we only have the overall pole result: grid 1 wins 42.34% of the time across all years. Since races includes year, the missing piece is the era split.
I’d interpret your eras as non-overlapping bins:
- pre-1980 = before 1980
- 1980-2000 = 1980–1999
- 2000-2010 = 2000–2009
- 2010-2020 = 2010–2019
- 2020-2024 = 2020–2024
Calculate pole win rates by era
import numpy as np
pole_era = (
results.loc[results["grid"].eq(1), ["raceId", "positionOrder"]]
.merge(races[["raceId", "year"]], on="raceId", how="left")
.assign(
era=lambda d: pd.cut(
d["year"],
bins=[-np.inf, 1979, 1999, 2009, 2019, 2024],
labels=["pre-1980", "1980-2000", "2000-2010", "2010-2020", "2020-2024"]
),
win=lambda d: d["positionOrder"].eq(1)
)
.groupby("era", observed=True, as_index=False)
.agg(
pole_starts=("raceId", "size"),
pole_wins=("win", "sum")
)
)
pole_era["pole_win_rate_pct"] = (pole_era["pole_wins"] / pole_era["pole_starts"] * 100).round(2)
display(pole_era)
pole_era| era | pole_starts | pole_wins | pole_win_rate_pct | |
|---|---|---|---|---|
| 0 | pre-1980 | 339 | 125 | 36.87 |
| 1 | 1980-2000 | 318 | 113 | 35.53 |
| 2 | 2000-2010 | 174 | 87 | 50.00 |
| 3 | 2010-2020 | 198 | 100 | 50.51 |
| 4 | 2020-2024 | 107 | 56 | 52.34 |
| era | pole_starts | pole_wins | pole_win_rate_pct | |
|---|---|---|---|---|
| 0 | pre-1980 | 339 | 125 | 36.87 |
| 1 | 1980-2000 | 318 | 113 | 35.53 |
| 2 | 2000-2010 | 174 | 87 | 50.00 |
| 3 | 2010-2020 | 198 | 100 | 50.51 |
| 4 | 2020-2024 | 107 | 56 | 52.34 |
Pole win rate increases noticeably in more recent eras.
- pre-1980: 36.87%
- 1980-2000: 35.53%
- 2000-2010: 50.00%
- 2010-2020: 50.51%
- 2020-2024: 52.34%
What this means:
- From pre-1980 to 1980-2000, pole winners converted at roughly the same rate, in the mid-30% range.
- Starting in 2000-2010, there’s a clear jump to about 50%.
- That higher conversion rate continues in 2010-2020 and edges up again in 2020-2024.
- Compared with the earlier eras, modern F1 shows a much stronger advantage to starting from pole.
A simple way to read it: pole used to produce a win about 1 in 3 times, while in the last ~25 years it has been closer to 1 in 2. The highest pole win rate in your split is 2020-2024 (52.34%).
Thinking
Planning chart visualization
I need to offer a concise next step using seaborn/matplotlib as default. My goal is to analyze the latest outputs and propose a chart showing pole win rates by era since the user wants visualization. I already have the data for pole_era, so I’ll probably create a bar chart. One chart per step is the rule, and I can mention the increasing trend. I’ll provide minimal focused code, ending with plt.show(). Let's craft this efficiently!
The era table shows a clear jump: pole converted at about 36% before 2000, then about 50%+ from 2000 onward. A simple bar chart will make that change obvious.
Plot pole win rate by era
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(9, 5))
ax = sns.barplot(data=pole_era, x="era", y="pole_win_rate_pct", color="steelblue")
ax.set_title("F1 pole win rate by era")
ax.set_xlabel("Era")
ax.set_ylabel("Pole win rate (%)")
for i, v in enumerate(pole_era["pole_win_rate_pct"]):
ax.text(i, v + 0.8, f"{v:.2f}%", ha="center")
plt.tight_layout()
plt.show()The chart matches the underlying era summary and makes the trend clear: pole win rate is much higher in modern eras than it was before 2000.
- pre-1980: 36.87%
- 1980-2000: 35.53%
- 2000-2010: 50.00%
- 2010-2020: 50.51%
- 2020-2024: 52.34%
What the visualization shows:
- The first two eras are very similar, both around 35%–37%.
- There is a clear step up starting in 2000-2010, where pole converts to a win about half the time.
- That elevated rate stays consistent through 2010-2020 and rises slightly again in 2020-2024.
- The highest bar is 2020-2024, so in the most recent era pole position has been the most valuable in your split.
So the visual answer is: pole win rate was relatively flat and lower before 2000, then jumped sharply around 2000 and has remained above 50% since.
Thinking
Analyzing dataset loading
I need to take small iterative steps here. The user has a new task, so I should assess the current notebook state first. It looks like the drivers.csv dataset probably isn't loaded yet, so my first step should be to inspect and load it. I'll focus on displaying the shape, columns, and head without getting into plotting just yet. After that, I will mention merging with results in one concise code block.
We already have results, and for driver names we need drivers.csv next.
Load and preview drivers data
drivers = pd.read_csv("drivers.csv", na_values=["\\N"])
display(pd.DataFrame({"column": drivers.columns}))
display(drivers.head())
drivers.shape| column | |
|---|---|
| 0 | driverId |
| 1 | driverRef |
| 2 | number |
| 3 | code |
| 4 | forename |
| 5 | surname |
| 6 | dob |
| 7 | nationality |
| 8 | url |
| driverId | driverRef | number | code | forename | surname | dob | nationality | url | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | hamilton | 44.0 | HAM | Lewis | Hamilton | 1985-01-07 | British | http://en.wikipedia.org/wiki/Lewis_Hamilton |
| 1 | 2 | heidfeld | NaN | HEI | Nick | Heidfeld | 1977-05-10 | German | http://en.wikipedia.org/wiki/Nick_Heidfeld |
| 2 | 3 | rosberg | 6.0 | ROS | Nico | Rosberg | 1985-06-27 | German | http://en.wikipedia.org/wiki/Nico_Rosberg |
| 3 | 4 | alonso | 14.0 | ALO | Fernando | Alonso | 1981-07-29 | Spanish | http://en.wikipedia.org/wiki/Fernando_Alonso |
| 4 | 5 | kovalainen | NaN | KOV | Heikki | Kovalainen | 1981-10-19 | Finnish | http://en.wikipedia.org/wiki/Heikki_Kovalainen |
(861, 9)
Thinking
Evaluating code output for rankings
I'm considering how to handle the missing evidence—maybe I need to print or display underlying values before concluding. I could use a table and bar chart since the goal is a ranking task. The user wants a clear outcome, so I should provide a minimal code block that calculates the table and creates the bar chart, avoiding excessive prints. I need to filter wins based on grid position and organize the results by driver ID while understanding what "best grid position they won from" entails.Clarifying grid position terminology
I’m noticing some ambiguity in the wording regarding "best" grid position. Typically, the best grid position refers to the lowest number, but for wins starting outside the top 3, it might mean the closest to the front that they achieved, so grid >=4. If a driver won from both grid 4 and grid 10, then 4 would technically be the best position. However, interpreting "best grid position" could also mean the highest starting position. I should probably clarify this in a comment, indicating that the lowest starting position is considered the "best."
We’ve already verified the key pieces needed for this: results is loaded, and drivers has driverId, forename, and surname, so we can now build the top-5 ranking and the chart.
Plot top 5 comeback winners
top5 = (
results.loc[(results["grid"] >= 4) & (results["positionOrder"] == 1), ["driverId", "grid"]]
.groupby("driverId", as_index=False)
.agg(
wins_from_grid_4_plus=("driverId", "size"),
best_grid_position=("grid", "min")
)
.merge(
drivers.assign(name=drivers["forename"] + " " + drivers["surname"])[["driverId", "name"]],
on="driverId",
how="left"
)
.sort_values(["wins_from_grid_4_plus", "best_grid_position"], ascending=[False, True])
.head(5)
)
display(top5[["driverId", "name", "wins_from_grid_4_plus", "best_grid_position"]])
plt.figure(figsize=(9, 5))
sns.barplot(data=top5, x="name", y="wins_from_grid_4_plus", color="steelblue")
plt.title("Top 5 drivers with most wins from grid position 4 or higher")
plt.xlabel("Driver")
plt.ylabel("Wins from grid 4+")
plt.xticks(rotation=20, ha="right")
plt.tight_layout()
plt.show()The results already answer the question.
The top 5 drivers with the most wins from grid position 4 or worse are:
- Alain Prost (
driverId117) — 16 wins, best winning grid position: 4 - Nelson Piquet (
driverId137) — 15 wins, best winning grid position: 4 - Max Verstappen (
driverId830) — 11 wins, best winning grid position: 4 - Niki Lauda (
driverId182) — 9 wins, best winning grid position: 4 - Lewis Hamilton (
driverId1) — 8 wins, best winning grid position: 4
What this means:
- Prost stands out as the strongest driver in this specific scenario, with the highest count of wins after starting outside the top 3.
- The gap between Prost/Piquet and the rest suggests they were especially effective at converting less-than-ideal starting positions into victories.
- For all five drivers, the best grid position they won from was 4, which means each of them has won from just outside the front three, but none of the listed “best” values in this top-5 result required starting lower than 4 to establish their best such win.
The bar chart supports the same ranking visually, with Prost and Piquet clearly leading, followed by Verstappen, Lauda, and Hamilton.
Thinking
Building race win code
I want to create runnable code that processes loaded data by joining and sorting it. First, I need to filter winners with positionOrder==1 and focus on grid positions. It makes sense to exclude non-positive grid positions since those indicate the pit lane. If there’s a tie in grid positions, I’ll sort the top 15 in descending order, possibly by year. The next step is to build the top 15 race wins from the deepest grid positions, then I’ll draft the code.Crafting code output
I need to ensure that I have just one bare expression in my code, so I'll use a single table to display results sequentially. It looks like I can end the code with the display function, which is fine, so there's no need for a final bare expression. I’ll focus on crafting it without any print statements to keep it clean and straightforward. Let's get started on this!
We already have the three needed tables in memory, so the next step is to build the ranked race list first.
Show top 15 comeback wins
top15_races = (
results.loc[(results["positionOrder"] == 1) & (results["grid"] > 0), ["raceId", "driverId", "grid", "positionOrder"]]
.merge(
races[["raceId", "year", "name"]],
on="raceId",
how="left"
)
.merge(
drivers.assign(driver_name=drivers["forename"] + " " + drivers["surname"])[["driverId", "driver_name"]],
on="driverId",
how="left"
)
.rename(columns={"name": "race_name", "grid": "starting_grid", "positionOrder": "final_position"})
.sort_values(["starting_grid", "year"], ascending=[False, False])
.head(15)
[["driver_name", "race_name", "year", "starting_grid", "final_position"]]
.reset_index(drop=True)
)
display(top15_races)| driver_name | race_name | year | starting_grid | final_position | |
|---|---|---|---|---|---|
| 0 | John Watson | United States Grand Prix West | 1983 | 22 | 1 |
| 1 | Bill Vukovich | Indianapolis 500 | 1954 | 19 | 1 |
| 2 | Rubens Barrichello | German Grand Prix | 2000 | 18 | 1 |
| 3 | Max Verstappen | São Paulo Grand Prix | 2024 | 17 | 1 |
| 4 | Kimi Räikkönen | Japanese Grand Prix | 2005 | 17 | 1 |
| 5 | John Watson | Detroit Grand Prix | 1982 | 17 | 1 |
| 6 | Michael Schumacher | Belgian Grand Prix | 1995 | 16 | 1 |
| 7 | Jackie Stewart | South African Grand Prix | 1973 | 16 | 1 |
| 8 | Fernando Alonso | Singapore Grand Prix | 2008 | 15 | 1 |
| 9 | Max Verstappen | Belgian Grand Prix | 2022 | 14 | 1 |
| 10 | Lewis Hamilton | German Grand Prix | 2018 | 14 | 1 |
| 11 | Jenson Button | Hungarian Grand Prix | 2006 | 14 | 1 |
| 12 | Johnny Herbert | European Grand Prix | 1999 | 14 | 1 |
| 13 | Olivier Panis | Monaco Grand Prix | 1996 | 14 | 1 |
| 14 | Alan Jones | Austrian Grand Prix | 1977 | 14 | 1 |
Thinking
Determining display options
I need to find minimal code that shows a full table and a plot. The rules suggest either one chart or one table per step, but I think path B allows for one code block to display both, sequentially. I should get the full top15_races and then create a horizontal bar chart. This seems to align with our primary objective of finishing the requested output while keeping things concise. I'll have to gather missing evidence though.
The ranked table already shows the key pattern: the biggest comeback win in the current results is John Watson winning the 1983 United States Grand Prix West from grid 22, followed by Bill Vukovich from 19 and Rubens Barrichello from 18. The missing part for your request is the horizontal bar chart.
Plot top 15 starting grids
plot_data = top15_races.copy()
plot_data["label"] = (
plot_data["year"].astype(str) + " — " +
plot_data["race_name"] + " — " +
plot_data["driver_name"]
)
display(plot_data)
plt.figure(figsize=(12, 8))
sns.barplot(
data=plot_data,
y="label",
x="starting_grid",
color="steelblue",
orient="h"
)
plt.title("Top 15 race wins from the furthest back starting positions")
plt.xlabel("Starting grid position")
plt.ylabel("Race — Driver")
plt.tight_layout()
plt.show()Thinking
Generating a value table
I'm looking to create a compact full value table that might include selected columns and ranking numbers. I think I can craft some minimal code to achieve this:
full_top15 = top15_races.copy()
full_top15.index = range(1, len(full_top15) + 1)
full_top15.to_markdown()
But it seems to omit the index. I should try full_top15.to_markdown(index=True), though the original goal didn't specifically request ranking numbers. Maybe I can use display(full_top15) with various options instead.
The current outputs already show the right ranking and the horizontal bar chart. From what is visible, the biggest comeback win is John Watson at the 1983 United States Grand Prix West from grid 22, followed by Bill Vukovich from grid 19 and Rubens Barrichello from grid 18. The table output is just truncated, so the only missing piece is a clean full 15-row list.
Display top races table
top15_races.to_markdown(index=False)'| driver_name | race_name | year | starting_grid | final_position |\n|:-------------------|:------------------------------|-------:|----------------:|-----------------:|\n| John Watson | United States Grand Prix West | 1983 | 22 | 1 |\n| Bill Vukovich | Indianapolis 500 | 1954 | 19 | 1 |\n| Rubens Barrichello | German Grand Prix | 2000 | 18 | 1 |\n| Max Verstappen | São Paulo Grand Prix | 2024 | 17 | 1 |\n| Kimi Räikkönen | Japanese Grand Prix | 2005 | 17 | 1 |\n| John Watson | Detroit Grand Prix | 1982 | 17 | 1 |\n| Michael Schumacher | Belgian Grand Prix | 1995 | 16 | 1 |\n| Jackie Stewart | South African Grand Prix | 1973 | 16 | 1 |\n| Fernando Alonso | Singapore Grand Prix | 2008 | 15 | 1 |\n| Max Verstappen | Belgian Grand Prix | 2022 | 14 | 1 |\n| Lewis Hamilton | German Grand Prix | 2018 | 14 | 1 |\n| Jenson Button | Hungarian Grand Prix | 2006 | 14 | 1 |\n| Johnny Herbert | European Grand Prix | 1999 | 14 | 1 |\n| Olivier Panis | Monaco Grand Prix | 1996 | 14 | 1 |\n| Alan Jones | Austrian Grand Prix | 1977 | 14 | 1 |'
The current outputs already answer most of the request: the ranking logic is correct, and the horizontal bar chart is in place. From the visible rows, the biggest comeback win is John Watson at the 1983 United States Grand Prix West from grid 22, followed by Bill Vukovich from grid 19 at the 1954 Indianapolis 500, and Rubens Barrichello from grid 18 at the 2000 German Grand Prix. The remaining gap is that the displayed table is truncated, so the full top 15 is not yet shown cleanly in the notebook.
Display top 15 races
display(top15_races)| driver_name | race_name | year | starting_grid | final_position | |
|---|---|---|---|---|---|
| 0 | John Watson | United States Grand Prix West | 1983 | 22 | 1 |
| 1 | Bill Vukovich | Indianapolis 500 | 1954 | 19 | 1 |
| 2 | Rubens Barrichello | German Grand Prix | 2000 | 18 | 1 |
| 3 | Max Verstappen | São Paulo Grand Prix | 2024 | 17 | 1 |
| 4 | Kimi Räikkönen | Japanese Grand Prix | 2005 | 17 | 1 |
| 5 | John Watson | Detroit Grand Prix | 1982 | 17 | 1 |
| 6 | Michael Schumacher | Belgian Grand Prix | 1995 | 16 | 1 |
| 7 | Jackie Stewart | South African Grand Prix | 1973 | 16 | 1 |
| 8 | Fernando Alonso | Singapore Grand Prix | 2008 | 15 | 1 |
| 9 | Max Verstappen | Belgian Grand Prix | 2022 | 14 | 1 |
| 10 | Lewis Hamilton | German Grand Prix | 2018 | 14 | 1 |
| 11 | Jenson Button | Hungarian Grand Prix | 2006 | 14 | 1 |
| 12 | Johnny Herbert | European Grand Prix | 1999 | 14 | 1 |
| 13 | Olivier Panis | Monaco Grand Prix | 1996 | 14 | 1 |
| 14 | Alan Jones | Austrian Grand Prix | 1977 | 14 | 1 |
Start using AI for Data Analysis
MLJAR Studio helps you analyze data with AI, run machine learning workflows, and build reproducible notebook-based results on your own computer.
Runs locally • Supports local LLMs