Formula 1 - Predicting the winner of a race¶

In this notebook we will use getML to predict the winner of a Formula 1 race.

Summary:

Prediction type: Classification model
Domain: Sports
Prediction target: Win
Population size: 31578

Author: Dr. Patrick Urbanke

Background¶

We would like to develop a prediction model for Formula 1 races, that would allow us to predict the winner of a race before the race has started.

We use dataset of all Formula 1 races from 1950 to 2017. The dataset includes information such as the time taken in each lap, the time taken for pit stops, the performance in the qualifying rounds etc.

The dataset has been downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015).

Analysis¶

Let's get started with the analysis and set up your session:

In [1]:

  Copied!     
 
%pip install -q "getml==1.5.0" "matplotlib~=3.9"

import matplotlib.pyplot as plt
%matplotlib inline  

import getml

getml.engine.launch()
getml.engine.set_project('formula1')
%pip install -q "getml==1.5.0" "matplotlib~=3.9" import matplotlib.pyplot as plt %matplotlib inline import getml getml.engine.launch() getml.engine.set_project('formula1')

Note: you may need to restart the kernel to use updated packages.
Launching ./getML --allow-push-notifications=true --allow-remote-ips=false --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false in /home/user/.getML/getml-1.5.0-x64-community-edition-linux...
Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912122602.log.
  Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

Connected to project 'formula1'.

1. Loading data¶

1.1 Download from source¶

We begin by downloading the data:

In [2]:

  Copied!     
 
conn = getml.database.connect_mariadb(
    host="relational.fel.cvut.cz",
    dbname="ErgastF1",
    port=3306,
    user="guest",
    password="relational"
)

conn
conn = getml.database.connect_mariadb( host="relational.fel.cvut.cz", dbname="ErgastF1", port=3306, user="guest", password="relational" ) conn

Out[2]:

Connection(dbname='ErgastF1',
           dialect='mysql',
           host='relational.fel.cvut.cz',
           port=3306)

In [3]:

  Copied!     
 
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if not getml.data.exists(name):
        data_frame = getml.data.DataFrame.from_db(
            name=name,
            table_name=name,
            conn=conn
        )
        data_frame.save()
    else:
        data_frame = getml.data.load_data_frame(name)
    return data_frame
def load_if_needed(name): """ Loads the data from the relational learning repository, if the data frame has not already been loaded. """ if not getml.data.exists(name): data_frame = getml.data.DataFrame.from_db( name=name, table_name=name, conn=conn ) data_frame.save() else: data_frame = getml.data.load_data_frame(name) return data_frame

In [4]:

  Copied!     
 
driverStandings = load_if_needed("driverStandings")
drivers = load_if_needed("drivers")
lapTimes = load_if_needed("lapTimes")
pitStops = load_if_needed("pitStops")
races = load_if_needed("races")
qualifying = load_if_needed("qualifying")
driverStandings = load_if_needed("driverStandings") drivers = load_if_needed("drivers") lapTimes = load_if_needed("lapTimes") pitStops = load_if_needed("pitStops") races = load_if_needed("races") qualifying = load_if_needed("qualifying")

In [5]:

  Copied!     
 
driverStandings
driverStandings

Out[5]:

name	driverStandingsId	raceId	driverId	points	position	wins	positionText
role	unused_float	unused_float	unused_float	unused_float	unused_float	unused_float	unused_string
0	1	18	1	10	1	1	1
1	2	18	2	8	2	0	2
2	3	18	3	6	3	0	3
3	4	18	4	5	4	0	4
4	5	18	5	4	5	0	5
	...	...	...	...	...	...	...
31573	68456	982	835	8	16	0	16
31574	68457	982	154	26	13	0	13
31575	68458	982	836	5	18	0	18
31576	68459	982	18	0	22	0	22
31577	68460	982	814	0	23	0	23

31578 rows x 7 columns
memory usage: 1.85 MB
name: driverStandings
type: getml.DataFrame

In [6]:

  Copied!     
 
drivers
drivers

Out[6]:

name	driverId	number	driverRef	code	forename	surname	dob	nationality	url
role	unused_float	unused_float	unused_string	unused_string	unused_string	unused_string	unused_string	unused_string	unused_string
0	1	44	hamilton	HAM	Lewis	Hamilton	1985-01-07	British	http://en.wikipedia.org/wiki/Lew...
1	2	nan	heidfeld	HEI	Nick	Heidfeld	1977-05-10	German	http://en.wikipedia.org/wiki/Nic...
2	3	6	rosberg	ROS	Nico	Rosberg	1985-06-27	German	http://en.wikipedia.org/wiki/Nic...
3	4	14	alonso	ALO	Fernando	Alonso	1981-07-29	Spanish	http://en.wikipedia.org/wiki/Fer...
4	5	nan	kovalainen	KOV	Heikki	Kovalainen	1981-10-19	Finnish	http://en.wikipedia.org/wiki/Hei...
	...	...	...	...	...	...	...	...	...
835	837	88	haryanto	HAR	Rio	Haryanto	1993-01-22	Indonesian	http://en.wikipedia.org/wiki/Rio...
836	838	2	vandoorne	VAN	Stoffel	Vandoorne	1992-03-26	Belgian	http://en.wikipedia.org/wiki/Sto...
837	839	31	ocon	OCO	Esteban	Ocon	1996-09-17	French	http://en.wikipedia.org/wiki/Est...
838	840	18	stroll	STR	Lance	Stroll	1998-10-29	Canadian	http://en.wikipedia.org/wiki/Lan...
839	841	36	giovinazzi	GIO	Antonio	Giovinazzi	1993-12-14	Italian	http://en.wikipedia.org/wiki/Ant...

840 rows x 9 columns
memory usage: 0.13 MB
name: drivers
type: getml.DataFrame

In [7]:

  Copied!     
 
lapTimes
lapTimes

Out[7]:

name	raceId	driverId	lap	position	milliseconds	time
role	unused_float	unused_float	unused_float	unused_float	unused_float	unused_string
0	1	1	1	13	109088	1:49.088
1	1	1	2	12	93740	1:33.740
2	1	1	3	11	91600	1:31.600
3	1	1	4	10	91067	1:31.067
4	1	1	5	10	92129	1:32.129
	...	...	...	...	...	...
420364	982	840	54	8	107528	1:47.528
420365	982	840	55	8	107512	1:47.512
420366	982	840	56	8	108143	1:48.143
420367	982	840	57	8	107848	1:47.848
420368	982	840	58	8	108699	1:48.699

420369 rows x 6 columns
memory usage: 23.96 MB
name: lapTimes
type: getml.DataFrame

In [8]:

  Copied!     
 
pitStops
pitStops

Out[8]:

name	raceId	driverId	stop	lap	milliseconds	time	duration
role	unused_float	unused_float	unused_float	unused_float	unused_float	unused_string	unused_string
0	841	1	1	16	23227	17:28:24	23.227
1	841	1	2	36	23199	17:59:29	23.199
2	841	2	1	15	22994	17:27:41	22.994
3	841	2	2	30	25098	17:51:32	25.098
4	841	3	1	16	23716	17:29:00	23.716
	...	...	...	...	...	...	...
6065	982	839	6	38	29134	21:29:07	29.134
6066	982	840	1	1	37403	20:06:43	37.403
6067	982	840	2	2	29294	20:10:07	29.294
6068	982	840	3	3	25584	20:13:16	25.584
6069	982	840	4	26	29412	21:05:07	29.412

6070 rows x 7 columns
memory usage: 0.44 MB
name: pitStops
type: getml.DataFrame

In [9]:

  Copied!     
 
races
races

Out[9]:

name	raceId	year	round	circuitId	name	date	time	url
role	unused_float	unused_float	unused_float	unused_float	unused_string	unused_string	unused_string	unused_string
0	1	2009	1	1	Australian Grand Prix	2009-03-29	06:00:00	http://en.wikipedia.org/wiki/200...
1	2	2009	2	2	Malaysian Grand Prix	2009-04-05	09:00:00	http://en.wikipedia.org/wiki/200...
2	3	2009	3	17	Chinese Grand Prix	2009-04-19	07:00:00	http://en.wikipedia.org/wiki/200...
3	4	2009	4	3	Bahrain Grand Prix	2009-04-26	12:00:00	http://en.wikipedia.org/wiki/200...
4	5	2009	5	4	Spanish Grand Prix	2009-05-10	12:00:00	http://en.wikipedia.org/wiki/200...
	...	...	...	...	...	...	...	...
971	984	2017	16	22	Japanese Grand Prix	2017-10-08	05:00:00	https://en.wikipedia.org/wiki/20...
972	985	2017	17	69	United States Grand Prix	2017-10-22	19:00:00	https://en.wikipedia.org/wiki/20...
973	986	2017	18	32	Mexican Grand Prix	2017-10-29	19:00:00	https://en.wikipedia.org/wiki/20...
974	987	2017	19	18	Brazilian Grand Prix	2017-11-12	16:00:00	https://en.wikipedia.org/wiki/20...
975	988	2017	20	24	Abu Dhabi Grand Prix	2017-11-26	17:00:00	https://en.wikipedia.org/wiki/20...

976 rows x 8 columns
memory usage: 0.15 MB
name: races
type: getml.DataFrame

In [10]:

  Copied!     
 
qualifying
qualifying

Out[10]:

name	qualifyId	raceId	driverId	constructorId	number	position	q1	q2	q3
role	unused_float	unused_float	unused_float	unused_float	unused_float	unused_float	unused_string	unused_string	unused_string
0	1	18	1	1	22	1	1:26.572	1:25.187	1:26.714
1	2	18	9	2	4	2	1:26.103	1:25.315	1:26.869
2	3	18	5	1	23	3	1:25.664	1:25.452	1:27.079
3	4	18	13	6	2	4	1:25.994	1:25.691	1:27.178
4	5	18	2	2	3	5	1:25.960	1:25.518	1:27.236
	...	...	...	...	...	...	...	...	...
7392	7415	982	825	210	20	16	1:43.756	NULL	NULL
7393	7416	982	13	3	19	17	1:44.014	NULL	NULL
7394	7417	982	840	3	18	18	1:44.728	NULL	NULL
7395	7418	982	836	15	94	19	1:45.059	NULL	NULL
7396	7419	982	828	15	9	20	1:45.570	NULL	NULL

7397 rows x 9 columns
memory usage: 0.66 MB
name: qualifying
type: getml.DataFrame

1.2 Prepare data for getML¶

In [11]:

  Copied!     
 
racesPd = races.to_pandas()
racesPd
racesPd = races.to_pandas() racesPd

Out[11]:

	raceId	year	round	circuitId	name	date	time	url
0	1.0	2009.0	1.0	1.0	Australian Grand Prix	2009-03-29	06:00:00	http://en.wikipedia.org/wiki/2009_Australian_G...
1	2.0	2009.0	2.0	2.0	Malaysian Grand Prix	2009-04-05	09:00:00	http://en.wikipedia.org/wiki/2009_Malaysian_Gr...
2	3.0	2009.0	3.0	17.0	Chinese Grand Prix	2009-04-19	07:00:00	http://en.wikipedia.org/wiki/2009_Chinese_Gran...
3	4.0	2009.0	4.0	3.0	Bahrain Grand Prix	2009-04-26	12:00:00	http://en.wikipedia.org/wiki/2009_Bahrain_Gran...
4	5.0	2009.0	5.0	4.0	Spanish Grand Prix	2009-05-10	12:00:00	http://en.wikipedia.org/wiki/2009_Spanish_Gran...
...	...	...	...	...	...	...	...	...
971	984.0	2017.0	16.0	22.0	Japanese Grand Prix	2017-10-08	05:00:00	https://en.wikipedia.org/wiki/2017_Japanese_Gr...
972	985.0	2017.0	17.0	69.0	United States Grand Prix	2017-10-22	19:00:00	https://en.wikipedia.org/wiki/2017_United_Stat...
973	986.0	2017.0	18.0	32.0	Mexican Grand Prix	2017-10-29	19:00:00	https://en.wikipedia.org/wiki/2017_Mexican_Gra...
974	987.0	2017.0	19.0	18.0	Brazilian Grand Prix	2017-11-12	16:00:00	https://en.wikipedia.org/wiki/2017_Brazilian_G...
975	988.0	2017.0	20.0	24.0	Abu Dhabi Grand Prix	2017-11-26	17:00:00	https://en.wikipedia.org/wiki/2017_Abu_Dhabi_G...

976 rows × 8 columns

We actually need some set-up, because the target variable is not readily available. The wins column in driverStandings is actually the accumulated number of wins over a year, but what we want is a boolean variable indicated whether someone has one a particular race or not.

In [12]:

  Copied!     
 
driverStandingsPd = driverStandings.to_pandas()

driverStandingsPd = driverStandingsPd.merge(
    racesPd[["raceId", "year", "date", "round"]],
    on="raceId"
)

previousStanding = driverStandingsPd.merge(
    driverStandingsPd[["driverId", "year", "wins", "round"]],
    on=["driverId", "year"],
)

isPreviousRound = (previousStanding["round_x"] - previousStanding["round_y"] == 1.0)

previousStanding = previousStanding[isPreviousRound]

previousStanding["win"] = previousStanding["wins_x"] - previousStanding["wins_y"]

driverStandingsPd = driverStandingsPd.merge(
    previousStanding[["raceId", "driverId", "win"]],
    on=["raceId", "driverId"],
    how="left",
)

driverStandingsPd["win"] = [win if win == win else wins for win, wins in zip(driverStandingsPd["win"], driverStandingsPd["wins"])]

driver_standings = getml.data.DataFrame.from_pandas(driverStandingsPd, "driver_standings")

driver_standings
driverStandingsPd = driverStandings.to_pandas() driverStandingsPd = driverStandingsPd.merge( racesPd[["raceId", "year", "date", "round"]], on="raceId" ) previousStanding = driverStandingsPd.merge( driverStandingsPd[["driverId", "year", "wins", "round"]], on=["driverId", "year"], ) isPreviousRound = (previousStanding["round_x"] - previousStanding["round_y"] == 1.0) previousStanding = previousStanding[isPreviousRound] previousStanding["win"] = previousStanding["wins_x"] - previousStanding["wins_y"] driverStandingsPd = driverStandingsPd.merge( previousStanding[["raceId", "driverId", "win"]], on=["raceId", "driverId"], how="left", ) driverStandingsPd["win"] = [win if win == win else wins for win, wins in zip(driverStandingsPd["win"], driverStandingsPd["wins"])] driver_standings = getml.data.DataFrame.from_pandas(driverStandingsPd, "driver_standings") driver_standings

Out[12]:

name	driverStandingsId	raceId	driverId	points	position	wins	year	round	win	positionText	date
role	unused_float	unused_float	unused_float	unused_float	unused_float	unused_float	unused_float	unused_float	unused_float	unused_string	unused_string
0	1	18	1	10	1	1	2008	1	1	1	2008-03-16
1	2	18	2	8	2	0	2008	1	0	2	2008-03-16
2	3	18	3	6	3	0	2008	1	0	3	2008-03-16
3	4	18	4	5	4	0	2008	1	0	4	2008-03-16
4	5	18	5	4	5	0	2008	1	0	5	2008-03-16
	...	...	...	...	...	...	...	...	...	...	...
31573	68456	982	835	8	16	0	2017	14	0	16	2017-09-17
31574	68457	982	154	26	13	0	2017	14	0	13	2017-09-17
31575	68458	982	836	5	18	0	2017	14	0	18	2017-09-17
31576	68459	982	18	0	22	0	2017	14	0	22	2017-09-17
31577	68460	982	814	0	23	0	2017	14	0	23	2017-09-17

31578 rows x 11 columns
memory usage: 3.21 MB
name: driver_standings
type: getml.DataFrame

We also need to include the date of the race to lapTimes and pitStops, because we cannot use this data for the race we would like to predict. We can only take lap times and pit stops from previous races.

In [13]:

  Copied!     
 
lapTimesPd = lapTimes.to_pandas()

lapTimesPd = lapTimesPd.merge(
    racesPd[["raceId", "date", "year"]],
    on="raceId"
)

lap_times = getml.data.DataFrame.from_pandas(lapTimesPd, "lap_times")

lap_times
lapTimesPd = lapTimes.to_pandas() lapTimesPd = lapTimesPd.merge( racesPd[["raceId", "date", "year"]], on="raceId" ) lap_times = getml.data.DataFrame.from_pandas(lapTimesPd, "lap_times") lap_times

Out[13]:

name	raceId	driverId	lap	position	milliseconds	year	time	date
role	unused_float	unused_float	unused_float	unused_float	unused_float	unused_float	unused_string	unused_string
0	1	1	1	13	109088	2009	1:49.088	2009-03-29
1	1	1	2	12	93740	2009	1:33.740	2009-03-29
2	1	1	3	11	91600	2009	1:31.600	2009-03-29
3	1	1	4	10	91067	2009	1:31.067	2009-03-29
4	1	1	5	10	92129	2009	1:32.129	2009-03-29
	...	...	...	...	...	...	...	...
420364	982	840	54	8	107528	2017	1:47.528	2017-09-17
420365	982	840	55	8	107512	2017	1:47.512	2017-09-17
420366	982	840	56	8	108143	2017	1:48.143	2017-09-17
420367	982	840	57	8	107848	2017	1:47.848	2017-09-17
420368	982	840	58	8	108699	2017	1:48.699	2017-09-17

420369 rows x 8 columns
memory usage: 35.31 MB
name: lap_times
type: getml.DataFrame

In [14]:

  Copied!     
 
pitStopsPd = pitStops.to_pandas()

pitStopsPd = pitStopsPd.merge(
    racesPd[["raceId", "date", "year"]],
    on="raceId"
)

pit_stops = getml.data.DataFrame.from_pandas(pitStopsPd, "pit_stops")

pit_stops
pitStopsPd = pitStops.to_pandas() pitStopsPd = pitStopsPd.merge( racesPd[["raceId", "date", "year"]], on="raceId" ) pit_stops = getml.data.DataFrame.from_pandas(pitStopsPd, "pit_stops") pit_stops

Out[14]:

name	raceId	driverId	stop	lap	milliseconds	year	time	duration	date
role	unused_float	unused_float	unused_float	unused_float	unused_float	unused_float	unused_string	unused_string	unused_string
0	841	1	1	16	23227	2011	17:28:24	23.227	2011-03-27
1	841	1	2	36	23199	2011	17:59:29	23.199	2011-03-27
2	841	2	1	15	22994	2011	17:27:41	22.994	2011-03-27
3	841	2	2	30	25098	2011	17:51:32	25.098	2011-03-27
4	841	3	1	16	23716	2011	17:29:00	23.716	2011-03-27
	...	...	...	...	...	...	...	...	...
6065	982	839	6	38	29134	2017	21:29:07	29.134	2017-09-17
6066	982	840	1	1	37403	2017	20:06:43	37.403	2017-09-17
6067	982	840	2	2	29294	2017	20:10:07	29.294	2017-09-17
6068	982	840	3	3	25584	2017	20:13:16	25.584	2017-09-17
6069	982	840	4	26	29412	2017	21:05:07	29.412	2017-09-17

6070 rows x 9 columns
memory usage: 0.60 MB
name: pit_stops
type: getml.DataFrame

getML requires that we define roles for each of the columns.

In [15]:

  Copied!     
 
driver_standings.set_role("win", getml.data.roles.target)
driver_standings.set_role(["raceId", "driverId", "year"], getml.data.roles.join_key)
driver_standings.set_role("position", getml.data.roles.numerical)
driver_standings.set_role("date", getml.data.roles.time_stamp)

driver_standings
driver_standings.set_role("win", getml.data.roles.target) driver_standings.set_role(["raceId", "driverId", "year"], getml.data.roles.join_key) driver_standings.set_role("position", getml.data.roles.numerical) driver_standings.set_role("date", getml.data.roles.time_stamp) driver_standings

Out[15]:

name	date	raceId	driverId	year	win	position	driverStandingsId	points	wins	round	positionText
role	time_stamp	join_key	join_key	join_key	target	numerical	unused_float	unused_float	unused_float	unused_float	unused_string
unit	time stamp, comparison only
0	2008-03-16	18	1	2008	1	1	1	10	1	1	1
1	2008-03-16	18	2	2008	0	2	2	8	0	1	2
2	2008-03-16	18	3	2008	0	3	3	6	0	1	3
3	2008-03-16	18	4	2008	0	4	4	5	0	1	4
4	2008-03-16	18	5	2008	0	5	5	4	0	1	5
	...	...	...	...	...	...	...	...	...	...	...
31573	2017-09-17	982	835	2017	0	16	68456	8	0	14	16
31574	2017-09-17	982	154	2017	0	13	68457	26	0	14	13
31575	2017-09-17	982	836	2017	0	18	68458	5	0	14	18
31576	2017-09-17	982	18	2017	0	22	68459	0	0	14	22
31577	2017-09-17	982	814	2017	0	23	68460	0	0	14	23

31578 rows x 11 columns
memory usage: 2.49 MB
name: driver_standings
type: getml.DataFrame

In [16]:

  Copied!     
 
drivers.set_role("driverId", getml.data.roles.join_key)
drivers.set_role(["nationality", "driverRef"], getml.data.roles.categorical)

drivers
drivers.set_role("driverId", getml.data.roles.join_key) drivers.set_role(["nationality", "driverRef"], getml.data.roles.categorical) drivers

Out[16]:

name	driverId	nationality	driverRef	number	code	forename	surname	dob	url
role	join_key	categorical	categorical	unused_float	unused_string	unused_string	unused_string	unused_string	unused_string
0	1	British	hamilton	44	HAM	Lewis	Hamilton	1985-01-07	http://en.wikipedia.org/wiki/Lew...
1	2	German	heidfeld	nan	HEI	Nick	Heidfeld	1977-05-10	http://en.wikipedia.org/wiki/Nic...
2	3	German	rosberg	6	ROS	Nico	Rosberg	1985-06-27	http://en.wikipedia.org/wiki/Nic...
3	4	Spanish	alonso	14	ALO	Fernando	Alonso	1981-07-29	http://en.wikipedia.org/wiki/Fer...
4	5	Finnish	kovalainen	nan	KOV	Heikki	Kovalainen	1981-10-19	http://en.wikipedia.org/wiki/Hei...
	...	...	...	...	...	...	...	...	...
835	837	Indonesian	haryanto	88	HAR	Rio	Haryanto	1993-01-22	http://en.wikipedia.org/wiki/Rio...
836	838	Belgian	vandoorne	2	VAN	Stoffel	Vandoorne	1992-03-26	http://en.wikipedia.org/wiki/Sto...
837	839	French	ocon	31	OCO	Esteban	Ocon	1996-09-17	http://en.wikipedia.org/wiki/Est...
838	840	Canadian	stroll	18	STR	Lance	Stroll	1998-10-29	http://en.wikipedia.org/wiki/Lan...
839	841	Italian	giovinazzi	36	GIO	Antonio	Giovinazzi	1993-12-14	http://en.wikipedia.org/wiki/Ant...

840 rows x 9 columns
memory usage: 0.11 MB
name: drivers
type: getml.DataFrame

In [17]:

  Copied!     
 
lap_times.set_role(["raceId", "driverId", "year"], getml.data.roles.join_key)
lap_times.set_role(["lap", "milliseconds", "position"], getml.data.roles.numerical)
lap_times.set_role("date", getml.data.roles.time_stamp)

lap_times
lap_times.set_role(["raceId", "driverId", "year"], getml.data.roles.join_key) lap_times.set_role(["lap", "milliseconds", "position"], getml.data.roles.numerical) lap_times.set_role("date", getml.data.roles.time_stamp) lap_times

Out[17]:

name	date	raceId	driverId	year	lap	milliseconds	position	time
role	time_stamp	join_key	join_key	join_key	numerical	numerical	numerical	unused_string
unit	time stamp, comparison only
0	2009-03-29	1	1	2009	1	109088	13	1:49.088
1	2009-03-29	1	1	2009	2	93740	12	1:33.740
2	2009-03-29	1	1	2009	3	91600	11	1:31.600
3	2009-03-29	1	1	2009	4	91067	10	1:31.067
4	2009-03-29	1	1	2009	5	92129	10	1:32.129
	...	...	...	...	...	...	...	...
420364	2017-09-17	982	840	2017	54	107528	8	1:47.528
420365	2017-09-17	982	840	2017	55	107512	8	1:47.512
420366	2017-09-17	982	840	2017	56	108143	8	1:48.143
420367	2017-09-17	982	840	2017	57	107848	8	1:47.848
420368	2017-09-17	982	840	2017	58	108699	8	1:48.699

420369 rows x 8 columns
memory usage: 25.64 MB
name: lap_times
type: getml.DataFrame

In [18]:

  Copied!     
 
pit_stops.set_role(["raceId", "driverId", "year"], getml.data.roles.join_key)
pit_stops.set_role(["lap", "milliseconds", "stop"], getml.data.roles.numerical)
pit_stops.set_role("date", getml.data.roles.time_stamp)

pit_stops
pit_stops.set_role(["raceId", "driverId", "year"], getml.data.roles.join_key) pit_stops.set_role(["lap", "milliseconds", "stop"], getml.data.roles.numerical) pit_stops.set_role("date", getml.data.roles.time_stamp) pit_stops

Out[18]:

name	date	raceId	driverId	year	lap	milliseconds	stop	time	duration
role	time_stamp	join_key	join_key	join_key	numerical	numerical	numerical	unused_string	unused_string
unit	time stamp, comparison only
0	2011-03-27	841	1	2011	16	23227	1	17:28:24	23.227
1	2011-03-27	841	1	2011	36	23199	2	17:59:29	23.199
2	2011-03-27	841	2	2011	15	22994	1	17:27:41	22.994
3	2011-03-27	841	2	2011	30	25098	2	17:51:32	25.098
4	2011-03-27	841	3	2011	16	23716	1	17:29:00	23.716
	...	...	...	...	...	...	...	...	...
6065	2017-09-17	982	839	2017	38	29134	6	21:29:07	29.134
6066	2017-09-17	982	840	2017	1	37403	1	20:06:43	37.403
6067	2017-09-17	982	840	2017	2	29294	2	20:10:07	29.294
6068	2017-09-17	982	840	2017	3	25584	3	20:13:16	25.584
6069	2017-09-17	982	840	2017	26	29412	4	21:05:07	29.412

6070 rows x 9 columns
memory usage: 0.46 MB
name: pit_stops
type: getml.DataFrame

In [19]:

  Copied!     
 
qualifying.set_role(["raceId", "driverId", "qualifyId"], getml.data.roles.join_key)
qualifying.set_role(["position", "number"], getml.data.roles.numerical)

qualifying
qualifying.set_role(["raceId", "driverId", "qualifyId"], getml.data.roles.join_key) qualifying.set_role(["position", "number"], getml.data.roles.numerical) qualifying

Out[19]:

name	raceId	driverId	qualifyId	position	number	constructorId	q1	q2	q3
role	join_key	join_key	join_key	numerical	numerical	unused_float	unused_string	unused_string	unused_string
0	18	1	1	1	22	1	1:26.572	1:25.187	1:26.714
1	18	9	2	2	4	2	1:26.103	1:25.315	1:26.869
2	18	5	3	3	23	1	1:25.664	1:25.452	1:27.079
3	18	13	4	4	2	6	1:25.994	1:25.691	1:27.178
4	18	2	5	5	3	2	1:25.960	1:25.518	1:27.236
	...	...	...	...	...	...	...	...	...
7392	982	825	7415	16	20	210	1:43.756	NULL	NULL
7393	982	13	7416	17	19	3	1:44.014	NULL	NULL
7394	982	840	7417	18	18	3	1:44.728	NULL	NULL
7395	982	836	7418	19	94	15	1:45.059	NULL	NULL
7396	982	828	7419	20	9	15	1:45.570	NULL	NULL

7397 rows x 9 columns
memory usage: 0.57 MB
name: qualifying
type: getml.DataFrame

2. Predictive modeling¶

We loaded the data and defined the roles and units. Next, we create a getML pipeline for relational learning.

In [20]:

  Copied!     
 
split = getml.data.split.random(train=0.8, test=0.2)
split
split = getml.data.split.random(train=0.8, test=0.2) split

Out[20]:


0	train
1	train
2	train
3	test
4	train
	...

infinite number of rows
type: StringColumnView

2.1 Define relational model¶

In [21]:

  Copied!     
 
star_schema = getml.data.StarSchema(population=driver_standings.drop(["position"]), alias="population", split=split)

star_schema.join(
    driver_standings,
    on=["driverId"],
    time_stamps="date",
    horizon=getml.data.time.days(1),
    lagged_targets=True,
)

# We cannot use lap times for the race
# we would like to predict, so we set
# a non-zero horizon.
star_schema.join(
    lap_times,
    on=["driverId"],
    time_stamps="date",
    horizon=getml.data.time.days(1),
)

# We cannot use pit stops for the race
# we would like to predict, so we set
# a non-zero horizon.
star_schema.join(
    pit_stops,
    on=["driverId"],
    time_stamps="date",
    horizon=getml.data.time.days(1),
)

star_schema.join(
    qualifying,
    on=["driverId", "raceId"],
    relationship=getml.data.relationship.many_to_one,
)

star_schema.join(
    drivers,
    on=["driverId"],
    relationship=getml.data.relationship.many_to_one,
)

star_schema
star_schema = getml.data.StarSchema(population=driver_standings.drop(["position"]), alias="population", split=split) star_schema.join( driver_standings, on=["driverId"], time_stamps="date", horizon=getml.data.time.days(1), lagged_targets=True, ) # We cannot use lap times for the race # we would like to predict, so we set # a non-zero horizon. star_schema.join( lap_times, on=["driverId"], time_stamps="date", horizon=getml.data.time.days(1), ) # We cannot use pit stops for the race # we would like to predict, so we set # a non-zero horizon. star_schema.join( pit_stops, on=["driverId"], time_stamps="date", horizon=getml.data.time.days(1), ) star_schema.join( qualifying, on=["driverId", "raceId"], relationship=getml.data.relationship.many_to_one, ) star_schema.join( drivers, on=["driverId"], relationship=getml.data.relationship.many_to_one, ) star_schema

Out[21]:

data model

diagram

staging

	data frames	staging table
0	population, qualifying, drivers	POPULATION__STAGING_TABLE_1
1	driver_standings	DRIVER_STANDINGS__STAGING_TABLE_2
2	lap_times	LAP_TIMES__STAGING_TABLE_3
3	pit_stops	PIT_STOPS__STAGING_TABLE_4

container

population

	subset	name	rows	type
0	test	driver_standings	unknown	View
1	train	driver_standings	unknown	View

peripheral

	name	rows	type
0	driver_standings	31578	DataFrame
1	lap_times	420369	DataFrame
2	pit_stops	6070	DataFrame
3	qualifying	7397	DataFrame
4	drivers	840	DataFrame

2.2 getML pipeline¶

Set-up the feature learner & predictor

We use the relboost algorithms for this problem. Because of the large number of keywords, we regularize the model a bit by requiring a minimum support for the keywords (min_num_samples).

In [22]:

  Copied!     
 
fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    aggregation=getml.feature_learning.FastProp.agg_sets.All,
    num_threads=1,
)

predictor = getml.predictors.XGBoostClassifier(n_jobs=1)
fast_prop = getml.feature_learning.FastProp( loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss, aggregation=getml.feature_learning.FastProp.agg_sets.All, num_threads=1, ) predictor = getml.predictors.XGBoostClassifier(n_jobs=1)

Build the pipeline

In [23]:

  Copied!     
 
pipe1 = getml.pipeline.Pipeline(
    tags=['fast_prop'],
    data_model=star_schema.data_model,
    feature_learners=[fast_prop],
    predictors=[predictor],
    include_categorical=True,
)

pipe1
pipe1 = getml.pipeline.Pipeline( tags=['fast_prop'], data_model=star_schema.data_model, feature_learners=[fast_prop], predictors=[predictor], include_categorical=True, ) pipe1

Out[23]:

Pipeline(data_model='population',
         feature_learners=['FastProp'],
         feature_selectors=[],
         include_categorical=True,
         loss_function='CrossEntropyLoss',
         peripheral=['driver_standings', 'drivers', 'lap_times', 'pit_stops', 'qualifying'],
         predictors=['XGBoostClassifier'],
         preprocessors=[],
         share_selected_features=0.5,
         tags=['fast_prop'])

2.3 Model training¶

In [24]:

  Copied!     
 
pipe1.check(star_schema.train)
pipe1.check(star_schema.train)

Checking data model...

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

The pipeline check generated 2 issues labeled INFO and 0 issues labeled WARNING.

Out[24]:

	type	label	message
0	INFO	FOREIGN KEYS NOT FOUND	When joining POPULATION__STAGING_TABLE_1 and LAP_TIMES__STAGING_TABLE_3 over 'driverId' and 'driverId', there are no corresponding entries for 68.551028% of entries in 'driverId' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
1	INFO	FOREIGN KEYS NOT FOUND	When joining POPULATION__STAGING_TABLE_1 and PIT_STOPS__STAGING_TABLE_4 over 'driverId' and 'driverId', there are no corresponding entries for 82.527910% of entries in 'driverId' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.

In [25]:

  Copied!     
 
pipe1.fit(star_schema.train)
pipe1.fit(star_schema.train)

Checking data model...

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

The pipeline check generated 2 issues labeled INFO and 0 issues labeled WARNING.

To see the issues in full, run .check() on the pipeline.

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  FastProp: Trying 664 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 01:18
  FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:20
  XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:11

Trained pipeline.

Time taken: 0:01:50.261514.

Out[25]:

Pipeline(data_model='population',
         feature_learners=['FastProp'],
         feature_selectors=[],
         include_categorical=True,
         loss_function='CrossEntropyLoss',
         peripheral=['driver_standings', 'drivers', 'lap_times', 'pit_stops', 'qualifying'],
         predictors=['XGBoostClassifier'],
         preprocessors=[],
         share_selected_features=0.5,
         tags=['fast_prop', 'container-gS8oTa'])

2.4 Model evaluation¶

In [26]:

  Copied!     
 
pipe1.score(star_schema.test)
pipe1.score(star_schema.test)

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:05

Out[26]:

	date time	set used	target	accuracy	auc	cross entropy
0	2024-09-12 12:27:58	train	win	0.9714	0.9448	0.0825
1	2024-09-12 12:28:03	test	win	0.9714	0.9165	0.08801

2.5 Studying features¶

We take a look at the importance of the features FastProp has learned:

In [27]:

  Copied!     
 
names, importances = pipe1.features.importances(target_num=0)

plt.subplots(figsize=(20, 10))

plt.bar(names[:30], importances[:30], color='#6829c2')

plt.title("feature importances")
plt.grid(True)
plt.xlabel("column")
plt.ylabel("importance")
plt.xticks(rotation='vertical')

plt.show()
names, importances = pipe1.features.importances(target_num=0) plt.subplots(figsize=(20, 10)) plt.bar(names[:30], importances[:30], color='#6829c2') plt.title("feature importances") plt.grid(True) plt.xlabel("column") plt.ylabel("importance") plt.xticks(rotation='vertical') plt.show()

No description has been provided for this image

We take a look at the most important features, to get an idea where the predictive power comes from:

In [28]:

  Copied!     
 
pipe1.features.to_sql().find(names[0])[0]
pipe1.features.to_sql().find(names[0])[0]

Out[28]:

DROP TABLE IF EXISTS "FEATURE_1_38";

CREATE TABLE "FEATURE_1_38" AS
SELECT EWMA_90D( t2."win" ORDER BY t1."date" - t2."date, '+1.000000 days'" ) AS "feature_1_38",
       t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "DRIVER_STANDINGS__STAGING_TABLE_2" t2
ON t1."driverid" = t2."driverid"
WHERE t2."date, '+1.000000 days'" <= t1."date"
GROUP BY t1.rowid;

In [29]:

  Copied!     
 
pipe1.features.to_sql()[names[1]]
pipe1.features.to_sql()[names[1]]

Out[29]:

DROP TABLE IF EXISTS "FEATURE_1_60";

CREATE TABLE "FEATURE_1_60" AS
SELECT EWMA_365D( t2."win" ORDER BY t1."date" - t2."date, '+1.000000 days'" ) AS "feature_1_60",
       t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "DRIVER_STANDINGS__STAGING_TABLE_2" t2
ON t1."driverid" = t2."driverid"
WHERE t2."date, '+1.000000 days'" <= t1."date"
GROUP BY t1.rowid;

In [30]:

  Copied!     
 
pipe1.features.to_sql()[names[2]]
pipe1.features.to_sql()[names[2]]

Out[30]:

DROP TABLE IF EXISTS "FEATURE_1_5";

CREATE TABLE "FEATURE_1_5" AS
SELECT EWMA_7D( t2."position" ORDER BY t1."date" - t2."date, '+1.000000 days'" ) AS "feature_1_5",
       t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "DRIVER_STANDINGS__STAGING_TABLE_2" t2
ON t1."driverid" = t2."driverid"
WHERE t2."date, '+1.000000 days'" <= t1."date"
GROUP BY t1.rowid;

In [31]:

  Copied!     
 
getml.engine.shutdown()
getml.engine.shutdown()

3. Conclusion¶

What we can learn from these features is the following: Knowing which driver we are talking about and who won the most recent races is the best predictor for whether a driver will win this race.

References¶

Motl, Jan, and Oliver Schulte. "The CTU prague relational learning repository." arXiv preprint arXiv:1511.03086 (2015).