Formula 1 - Predicting the winner of a race¶
In this notebook we will benchmark getML against featuretools to predict the winner of a Formula 1 race.
Summary:
- Prediction type: Classification model
- Domain: Sports
- Prediction target: Win
- Population size: 31578
Background¶
We would like to develop a prediction model for Formula 1 races, that would allow us to predict the winner of a race before the race has started.
We use dataset of all Formula 1 races from 1950 to 2017. The dataset includes information such as the time taken in each lap, the time taken for pit stops, the performance in the qualifying rounds etc.
The dataset has been downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015) (Now residing at relational-data.org.).
We will benchmark getML's feature learning algorithms against featuretools, an open-source implementation of the propositionalization algorithm, similar to getML's FastProp.
Analysis¶
%pip install -q "getml==1.5.0" "matplotlib==3.9.2" "featuretools==1.31.0" "ipywidgets==8.1.5"
Let's get started with the analysis and set up your session:
import os
import warnings
import pandas as pd
import matplotlib.pyplot as plt
import featuretools
import woodwork as ww
import getml
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
warnings.simplefilter(action='ignore', category=FutureWarning)
%matplotlib inline
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
getml.engine.launch(allow_remote_ips=True, token='token')
getml.engine.set_project('formula1')
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912140548.log. Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Connected to project 'formula1'.
1. Loading data¶
1.1 Download from source¶
We begin by downloading the data:
conn = getml.database.connect_mysql(
host="db.relational-data.org",
dbname="ErgastF1",
port=3306,
user="guest",
password="relational"
)
conn
Connection(dbname='ErgastF1', dialect='mysql', host='db.relational-data.org', port=3306)
def load_if_needed(name):
"""
Loads the data from the relational learning
repository, if the data frame has not already
been loaded.
"""
if not getml.data.exists(name):
data_frame = getml.data.DataFrame.from_db(
name=name,
table_name=name,
conn=conn
)
data_frame.save()
else:
data_frame = getml.data.load_data_frame(name)
return data_frame
driverStandings = load_if_needed("driverStandings")
drivers = load_if_needed("drivers")
lapTimes = load_if_needed("lapTimes")
pitStops = load_if_needed("pitStops")
races = load_if_needed("races")
qualifying = load_if_needed("qualifying")
driverStandings
name | driverStandingsId | raceId | driverId | points | position | wins | positionText |
---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string |
0 | 1 | 18 | 1 | 10 | 1 | 1 | 1 |
1 | 2 | 18 | 2 | 8 | 2 | 0 | 2 |
2 | 3 | 18 | 3 | 6 | 3 | 0 | 3 |
3 | 4 | 18 | 4 | 5 | 4 | 0 | 4 |
4 | 5 | 18 | 5 | 4 | 5 | 0 | 5 |
... | ... | ... | ... | ... | ... | ... | |
31573 | 68456 | 982 | 835 | 8 | 16 | 0 | 16 |
31574 | 68457 | 982 | 154 | 26 | 13 | 0 | 13 |
31575 | 68458 | 982 | 836 | 5 | 18 | 0 | 18 |
31576 | 68459 | 982 | 18 | 0 | 22 | 0 | 22 |
31577 | 68460 | 982 | 814 | 0 | 23 | 0 | 23 |
31578 rows x 7 columns
memory usage: 1.85 MB
name: driverStandings
type: getml.DataFrame
drivers
name | driverId | number | driverRef | code | forename | surname | dob | nationality | url |
---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | 1 | 44 | hamilton | HAM | Lewis | Hamilton | 1985-01-07 | British | http://en.wikipedia.org/wiki/Lew... |
1 | 2 | nan | heidfeld | HEI | Nick | Heidfeld | 1977-05-10 | German | http://en.wikipedia.org/wiki/Nic... |
2 | 3 | 6 | rosberg | ROS | Nico | Rosberg | 1985-06-27 | German | http://en.wikipedia.org/wiki/Nic... |
3 | 4 | 14 | alonso | ALO | Fernando | Alonso | 1981-07-29 | Spanish | http://en.wikipedia.org/wiki/Fer... |
4 | 5 | nan | kovalainen | KOV | Heikki | Kovalainen | 1981-10-19 | Finnish | http://en.wikipedia.org/wiki/Hei... |
... | ... | ... | ... | ... | ... | ... | ... | ... | |
835 | 837 | 88 | haryanto | HAR | Rio | Haryanto | 1993-01-22 | Indonesian | http://en.wikipedia.org/wiki/Rio... |
836 | 838 | 2 | vandoorne | VAN | Stoffel | Vandoorne | 1992-03-26 | Belgian | http://en.wikipedia.org/wiki/Sto... |
837 | 839 | 31 | ocon | OCO | Esteban | Ocon | 1996-09-17 | French | http://en.wikipedia.org/wiki/Est... |
838 | 840 | 18 | stroll | STR | Lance | Stroll | 1998-10-29 | Canadian | http://en.wikipedia.org/wiki/Lan... |
839 | 841 | 36 | giovinazzi | GIO | Antonio | Giovinazzi | 1993-12-14 | Italian | http://en.wikipedia.org/wiki/Ant... |
840 rows x 9 columns
memory usage: 0.13 MB
name: drivers
type: getml.DataFrame
lapTimes
name | raceId | driverId | lap | position | milliseconds | time |
---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string |
0 | 1 | 1 | 1 | 13 | 109088 | 1:49.088 |
1 | 1 | 1 | 2 | 12 | 93740 | 1:33.740 |
2 | 1 | 1 | 3 | 11 | 91600 | 1:31.600 |
3 | 1 | 1 | 4 | 10 | 91067 | 1:31.067 |
4 | 1 | 1 | 5 | 10 | 92129 | 1:32.129 |
... | ... | ... | ... | ... | ... | |
420364 | 982 | 840 | 54 | 8 | 107528 | 1:47.528 |
420365 | 982 | 840 | 55 | 8 | 107512 | 1:47.512 |
420366 | 982 | 840 | 56 | 8 | 108143 | 1:48.143 |
420367 | 982 | 840 | 57 | 8 | 107848 | 1:47.848 |
420368 | 982 | 840 | 58 | 8 | 108699 | 1:48.699 |
420369 rows x 6 columns
memory usage: 23.96 MB
name: lapTimes
type: getml.DataFrame
pitStops
name | raceId | driverId | stop | lap | milliseconds | time | duration |
---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string |
0 | 841 | 1 | 1 | 16 | 23227 | 17:28:24 | 23.227 |
1 | 841 | 1 | 2 | 36 | 23199 | 17:59:29 | 23.199 |
2 | 841 | 2 | 1 | 15 | 22994 | 17:27:41 | 22.994 |
3 | 841 | 2 | 2 | 30 | 25098 | 17:51:32 | 25.098 |
4 | 841 | 3 | 1 | 16 | 23716 | 17:29:00 | 23.716 |
... | ... | ... | ... | ... | ... | ... | |
6065 | 982 | 839 | 6 | 38 | 29134 | 21:29:07 | 29.134 |
6066 | 982 | 840 | 1 | 1 | 37403 | 20:06:43 | 37.403 |
6067 | 982 | 840 | 2 | 2 | 29294 | 20:10:07 | 29.294 |
6068 | 982 | 840 | 3 | 3 | 25584 | 20:13:16 | 25.584 |
6069 | 982 | 840 | 4 | 26 | 29412 | 21:05:07 | 29.412 |
6070 rows x 7 columns
memory usage: 0.44 MB
name: pitStops
type: getml.DataFrame
races
name | raceId | year | round | circuitId | name | date | time | url |
---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string |
0 | 1 | 2009 | 1 | 1 | Australian Grand Prix | 2009-03-29 | 06:00:00 | http://en.wikipedia.org/wiki/200... |
1 | 2 | 2009 | 2 | 2 | Malaysian Grand Prix | 2009-04-05 | 09:00:00 | http://en.wikipedia.org/wiki/200... |
2 | 3 | 2009 | 3 | 17 | Chinese Grand Prix | 2009-04-19 | 07:00:00 | http://en.wikipedia.org/wiki/200... |
3 | 4 | 2009 | 4 | 3 | Bahrain Grand Prix | 2009-04-26 | 12:00:00 | http://en.wikipedia.org/wiki/200... |
4 | 5 | 2009 | 5 | 4 | Spanish Grand Prix | 2009-05-10 | 12:00:00 | http://en.wikipedia.org/wiki/200... |
... | ... | ... | ... | ... | ... | ... | ... | |
971 | 984 | 2017 | 16 | 22 | Japanese Grand Prix | 2017-10-08 | 05:00:00 | https://en.wikipedia.org/wiki/20... |
972 | 985 | 2017 | 17 | 69 | United States Grand Prix | 2017-10-22 | 19:00:00 | https://en.wikipedia.org/wiki/20... |
973 | 986 | 2017 | 18 | 32 | Mexican Grand Prix | 2017-10-29 | 19:00:00 | https://en.wikipedia.org/wiki/20... |
974 | 987 | 2017 | 19 | 18 | Brazilian Grand Prix | 2017-11-12 | 16:00:00 | https://en.wikipedia.org/wiki/20... |
975 | 988 | 2017 | 20 | 24 | Abu Dhabi Grand Prix | 2017-11-26 | 17:00:00 | https://en.wikipedia.org/wiki/20... |
976 rows x 8 columns
memory usage: 0.15 MB
name: races
type: getml.DataFrame
qualifying
name | qualifyId | raceId | driverId | constructorId | number | position | q1 | q2 | q3 |
---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string |
0 | 1 | 18 | 1 | 1 | 22 | 1 | 1:26.572 | 1:25.187 | 1:26.714 |
1 | 2 | 18 | 9 | 2 | 4 | 2 | 1:26.103 | 1:25.315 | 1:26.869 |
2 | 3 | 18 | 5 | 1 | 23 | 3 | 1:25.664 | 1:25.452 | 1:27.079 |
3 | 4 | 18 | 13 | 6 | 2 | 4 | 1:25.994 | 1:25.691 | 1:27.178 |
4 | 5 | 18 | 2 | 2 | 3 | 5 | 1:25.960 | 1:25.518 | 1:27.236 |
... | ... | ... | ... | ... | ... | ... | ... | ... | |
7392 | 7415 | 982 | 825 | 210 | 20 | 16 | 1:43.756 | NULL | NULL |
7393 | 7416 | 982 | 13 | 3 | 19 | 17 | 1:44.014 | NULL | NULL |
7394 | 7417 | 982 | 840 | 3 | 18 | 18 | 1:44.728 | NULL | NULL |
7395 | 7418 | 982 | 836 | 15 | 94 | 19 | 1:45.059 | NULL | NULL |
7396 | 7419 | 982 | 828 | 15 | 9 | 20 | 1:45.570 | NULL | NULL |
7397 rows x 9 columns
memory usage: 0.66 MB
name: qualifying
type: getml.DataFrame
1.2 Prepare data for getML¶
racesPd = races.to_pandas()
racesPd
raceId | year | round | circuitId | name | date | time | url | |
---|---|---|---|---|---|---|---|---|
0 | 1.0 | 2009.0 | 1.0 | 1.0 | Australian Grand Prix | 2009-03-29 | 06:00:00 | http://en.wikipedia.org/wiki/2009_Australian_G... |
1 | 2.0 | 2009.0 | 2.0 | 2.0 | Malaysian Grand Prix | 2009-04-05 | 09:00:00 | http://en.wikipedia.org/wiki/2009_Malaysian_Gr... |
2 | 3.0 | 2009.0 | 3.0 | 17.0 | Chinese Grand Prix | 2009-04-19 | 07:00:00 | http://en.wikipedia.org/wiki/2009_Chinese_Gran... |
3 | 4.0 | 2009.0 | 4.0 | 3.0 | Bahrain Grand Prix | 2009-04-26 | 12:00:00 | http://en.wikipedia.org/wiki/2009_Bahrain_Gran... |
4 | 5.0 | 2009.0 | 5.0 | 4.0 | Spanish Grand Prix | 2009-05-10 | 12:00:00 | http://en.wikipedia.org/wiki/2009_Spanish_Gran... |
... | ... | ... | ... | ... | ... | ... | ... | ... |
971 | 984.0 | 2017.0 | 16.0 | 22.0 | Japanese Grand Prix | 2017-10-08 | 05:00:00 | https://en.wikipedia.org/wiki/2017_Japanese_Gr... |
972 | 985.0 | 2017.0 | 17.0 | 69.0 | United States Grand Prix | 2017-10-22 | 19:00:00 | https://en.wikipedia.org/wiki/2017_United_Stat... |
973 | 986.0 | 2017.0 | 18.0 | 32.0 | Mexican Grand Prix | 2017-10-29 | 19:00:00 | https://en.wikipedia.org/wiki/2017_Mexican_Gra... |
974 | 987.0 | 2017.0 | 19.0 | 18.0 | Brazilian Grand Prix | 2017-11-12 | 16:00:00 | https://en.wikipedia.org/wiki/2017_Brazilian_G... |
975 | 988.0 | 2017.0 | 20.0 | 24.0 | Abu Dhabi Grand Prix | 2017-11-26 | 17:00:00 | https://en.wikipedia.org/wiki/2017_Abu_Dhabi_G... |
976 rows × 8 columns
We actually need some set-up, because the target variable is not readily available. The wins
column in driverStandings
is actually the accumulated number of wins over a year, but what we want is a boolean variable indicated whether someone has one a particular race or not.
driverStandingsPd = driverStandings.to_pandas()
driverStandingsPd = driverStandingsPd.merge(
racesPd[["raceId", "year", "date", "round"]],
on="raceId"
)
previousStanding = driverStandingsPd.merge(
driverStandingsPd[["driverId", "year", "wins", "round"]],
on=["driverId", "year"],
)
isPreviousRound = (previousStanding["round_x"] - previousStanding["round_y"] == 1.0)
previousStanding = previousStanding[isPreviousRound]
previousStanding["win"] = previousStanding["wins_x"] - previousStanding["wins_y"]
driverStandingsPd = driverStandingsPd.merge(
previousStanding[["raceId", "driverId", "win"]],
on=["raceId", "driverId"],
how="left",
)
driverStandingsPd["win"] = [win if win == win else wins for win, wins in zip(driverStandingsPd["win"], driverStandingsPd["wins"])]
driver_standings = getml.data.DataFrame.from_pandas(driverStandingsPd, "driver_standings")
driver_standings
name | driverStandingsId | raceId | driverId | points | position | wins | year | round | win | positionText | date |
---|---|---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string |
0 | 1 | 18 | 1 | 10 | 1 | 1 | 2008 | 1 | 1 | 1 | 2008-03-16 |
1 | 2 | 18 | 2 | 8 | 2 | 0 | 2008 | 1 | 0 | 2 | 2008-03-16 |
2 | 3 | 18 | 3 | 6 | 3 | 0 | 2008 | 1 | 0 | 3 | 2008-03-16 |
3 | 4 | 18 | 4 | 5 | 4 | 0 | 2008 | 1 | 0 | 4 | 2008-03-16 |
4 | 5 | 18 | 5 | 4 | 5 | 0 | 2008 | 1 | 0 | 5 | 2008-03-16 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
31573 | 68456 | 982 | 835 | 8 | 16 | 0 | 2017 | 14 | 0 | 16 | 2017-09-17 |
31574 | 68457 | 982 | 154 | 26 | 13 | 0 | 2017 | 14 | 0 | 13 | 2017-09-17 |
31575 | 68458 | 982 | 836 | 5 | 18 | 0 | 2017 | 14 | 0 | 18 | 2017-09-17 |
31576 | 68459 | 982 | 18 | 0 | 22 | 0 | 2017 | 14 | 0 | 22 | 2017-09-17 |
31577 | 68460 | 982 | 814 | 0 | 23 | 0 | 2017 | 14 | 0 | 23 | 2017-09-17 |
31578 rows x 11 columns
memory usage: 3.21 MB
name: driver_standings
type: getml.DataFrame
We also need to include the date of the race to lapTimes
and pitStops
, because we cannot use this data for the race we would like to predict. We can only take lap times and pit stops from previous races.
lapTimesPd = lapTimes.to_pandas()
lapTimesPd = lapTimesPd.merge(
racesPd[["raceId", "date", "year"]],
on="raceId"
)
lap_times = getml.data.DataFrame.from_pandas(lapTimesPd, "lap_times")
lap_times
name | raceId | driverId | lap | position | milliseconds | year | time | date |
---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string |
0 | 1 | 1 | 1 | 13 | 109088 | 2009 | 1:49.088 | 2009-03-29 |
1 | 1 | 1 | 2 | 12 | 93740 | 2009 | 1:33.740 | 2009-03-29 |
2 | 1 | 1 | 3 | 11 | 91600 | 2009 | 1:31.600 | 2009-03-29 |
3 | 1 | 1 | 4 | 10 | 91067 | 2009 | 1:31.067 | 2009-03-29 |
4 | 1 | 1 | 5 | 10 | 92129 | 2009 | 1:32.129 | 2009-03-29 |
... | ... | ... | ... | ... | ... | ... | ... | |
420364 | 982 | 840 | 54 | 8 | 107528 | 2017 | 1:47.528 | 2017-09-17 |
420365 | 982 | 840 | 55 | 8 | 107512 | 2017 | 1:47.512 | 2017-09-17 |
420366 | 982 | 840 | 56 | 8 | 108143 | 2017 | 1:48.143 | 2017-09-17 |
420367 | 982 | 840 | 57 | 8 | 107848 | 2017 | 1:47.848 | 2017-09-17 |
420368 | 982 | 840 | 58 | 8 | 108699 | 2017 | 1:48.699 | 2017-09-17 |
420369 rows x 8 columns
memory usage: 35.31 MB
name: lap_times
type: getml.DataFrame
pitStopsPd = pitStops.to_pandas()
pitStopsPd = pitStopsPd.merge(
racesPd[["raceId", "date", "year"]],
on="raceId"
)
pit_stops = getml.data.DataFrame.from_pandas(pitStopsPd, "pit_stops")
pit_stops
name | raceId | driverId | stop | lap | milliseconds | year | time | duration | date |
---|---|---|---|---|---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string |
0 | 841 | 1 | 1 | 16 | 23227 | 2011 | 17:28:24 | 23.227 | 2011-03-27 |
1 | 841 | 1 | 2 | 36 | 23199 | 2011 | 17:59:29 | 23.199 | 2011-03-27 |
2 | 841 | 2 | 1 | 15 | 22994 | 2011 | 17:27:41 | 22.994 | 2011-03-27 |
3 | 841 | 2 | 2 | 30 | 25098 | 2011 | 17:51:32 | 25.098 | 2011-03-27 |
4 | 841 | 3 | 1 | 16 | 23716 | 2011 | 17:29:00 | 23.716 | 2011-03-27 |
... | ... | ... | ... | ... | ... | ... | ... | ... | |
6065 | 982 | 839 | 6 | 38 | 29134 | 2017 | 21:29:07 | 29.134 | 2017-09-17 |
6066 | 982 | 840 | 1 | 1 | 37403 | 2017 | 20:06:43 | 37.403 | 2017-09-17 |
6067 | 982 | 840 | 2 | 2 | 29294 | 2017 | 20:10:07 | 29.294 | 2017-09-17 |
6068 | 982 | 840 | 3 | 3 | 25584 | 2017 | 20:13:16 | 25.584 | 2017-09-17 |
6069 | 982 | 840 | 4 | 26 | 29412 | 2017 | 21:05:07 | 29.412 | 2017-09-17 |
6070 rows x 9 columns
memory usage: 0.60 MB
name: pit_stops
type: getml.DataFrame
getML requires that we define roles for each of the columns.
driver_standings.set_role("win", getml.data.roles.target)
driver_standings.set_role(["raceId", "driverId", "year"], getml.data.roles.join_key)
driver_standings.set_role("position", getml.data.roles.numerical)
driver_standings.set_role("date", getml.data.roles.time_stamp)
driver_standings
name | date | raceId | driverId | year | win | position | driverStandingsId | points | wins | round | positionText |
---|---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | join_key | join_key | join_key | target | numerical | unused_float | unused_float | unused_float | unused_float | unused_string |
unit | time stamp, comparison only | ||||||||||
0 | 2008-03-16 | 18 | 1 | 2008 | 1 | 1 | 1 | 10 | 1 | 1 | 1 |
1 | 2008-03-16 | 18 | 2 | 2008 | 0 | 2 | 2 | 8 | 0 | 1 | 2 |
2 | 2008-03-16 | 18 | 3 | 2008 | 0 | 3 | 3 | 6 | 0 | 1 | 3 |
3 | 2008-03-16 | 18 | 4 | 2008 | 0 | 4 | 4 | 5 | 0 | 1 | 4 |
4 | 2008-03-16 | 18 | 5 | 2008 | 0 | 5 | 5 | 4 | 0 | 1 | 5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
31573 | 2017-09-17 | 982 | 835 | 2017 | 0 | 16 | 68456 | 8 | 0 | 14 | 16 |
31574 | 2017-09-17 | 982 | 154 | 2017 | 0 | 13 | 68457 | 26 | 0 | 14 | 13 |
31575 | 2017-09-17 | 982 | 836 | 2017 | 0 | 18 | 68458 | 5 | 0 | 14 | 18 |
31576 | 2017-09-17 | 982 | 18 | 2017 | 0 | 22 | 68459 | 0 | 0 | 14 | 22 |
31577 | 2017-09-17 | 982 | 814 | 2017 | 0 | 23 | 68460 | 0 | 0 | 14 | 23 |
31578 rows x 11 columns
memory usage: 2.49 MB
name: driver_standings
type: getml.DataFrame
drivers.set_role("driverId", getml.data.roles.join_key)
drivers.set_role(["nationality", "driverRef"], getml.data.roles.categorical)
drivers
name | driverId | nationality | driverRef | number | code | forename | surname | dob | url |
---|---|---|---|---|---|---|---|---|---|
role | join_key | categorical | categorical | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string |
0 | 1 | British | hamilton | 44 | HAM | Lewis | Hamilton | 1985-01-07 | http://en.wikipedia.org/wiki/Lew... |
1 | 2 | German | heidfeld | nan | HEI | Nick | Heidfeld | 1977-05-10 | http://en.wikipedia.org/wiki/Nic... |
2 | 3 | German | rosberg | 6 | ROS | Nico | Rosberg | 1985-06-27 | http://en.wikipedia.org/wiki/Nic... |
3 | 4 | Spanish | alonso | 14 | ALO | Fernando | Alonso | 1981-07-29 | http://en.wikipedia.org/wiki/Fer... |
4 | 5 | Finnish | kovalainen | nan | KOV | Heikki | Kovalainen | 1981-10-19 | http://en.wikipedia.org/wiki/Hei... |
... | ... | ... | ... | ... | ... | ... | ... | ... | |
835 | 837 | Indonesian | haryanto | 88 | HAR | Rio | Haryanto | 1993-01-22 | http://en.wikipedia.org/wiki/Rio... |
836 | 838 | Belgian | vandoorne | 2 | VAN | Stoffel | Vandoorne | 1992-03-26 | http://en.wikipedia.org/wiki/Sto... |
837 | 839 | French | ocon | 31 | OCO | Esteban | Ocon | 1996-09-17 | http://en.wikipedia.org/wiki/Est... |
838 | 840 | Canadian | stroll | 18 | STR | Lance | Stroll | 1998-10-29 | http://en.wikipedia.org/wiki/Lan... |
839 | 841 | Italian | giovinazzi | 36 | GIO | Antonio | Giovinazzi | 1993-12-14 | http://en.wikipedia.org/wiki/Ant... |
840 rows x 9 columns
memory usage: 0.11 MB
name: drivers
type: getml.DataFrame
lap_times.set_role(["raceId", "driverId", "year"], getml.data.roles.join_key)
lap_times.set_role(["lap", "milliseconds", "position"], getml.data.roles.numerical)
lap_times.set_role("date", getml.data.roles.time_stamp)
lap_times
name | date | raceId | driverId | year | lap | milliseconds | position | time |
---|---|---|---|---|---|---|---|---|
role | time_stamp | join_key | join_key | join_key | numerical | numerical | numerical | unused_string |
unit | time stamp, comparison only | |||||||
0 | 2009-03-29 | 1 | 1 | 2009 | 1 | 109088 | 13 | 1:49.088 |
1 | 2009-03-29 | 1 | 1 | 2009 | 2 | 93740 | 12 | 1:33.740 |
2 | 2009-03-29 | 1 | 1 | 2009 | 3 | 91600 | 11 | 1:31.600 |
3 | 2009-03-29 | 1 | 1 | 2009 | 4 | 91067 | 10 | 1:31.067 |
4 | 2009-03-29 | 1 | 1 | 2009 | 5 | 92129 | 10 | 1:32.129 |
... | ... | ... | ... | ... | ... | ... | ... | |
420364 | 2017-09-17 | 982 | 840 | 2017 | 54 | 107528 | 8 | 1:47.528 |
420365 | 2017-09-17 | 982 | 840 | 2017 | 55 | 107512 | 8 | 1:47.512 |
420366 | 2017-09-17 | 982 | 840 | 2017 | 56 | 108143 | 8 | 1:48.143 |
420367 | 2017-09-17 | 982 | 840 | 2017 | 57 | 107848 | 8 | 1:47.848 |
420368 | 2017-09-17 | 982 | 840 | 2017 | 58 | 108699 | 8 | 1:48.699 |
420369 rows x 8 columns
memory usage: 25.64 MB
name: lap_times
type: getml.DataFrame
pit_stops.set_role(["raceId", "driverId", "year"], getml.data.roles.join_key)
pit_stops.set_role(["lap", "milliseconds", "stop"], getml.data.roles.numerical)
pit_stops.set_role("date", getml.data.roles.time_stamp)
pit_stops
name | date | raceId | driverId | year | lap | milliseconds | stop | time | duration |
---|---|---|---|---|---|---|---|---|---|
role | time_stamp | join_key | join_key | join_key | numerical | numerical | numerical | unused_string | unused_string |
unit | time stamp, comparison only | ||||||||
0 | 2011-03-27 | 841 | 1 | 2011 | 16 | 23227 | 1 | 17:28:24 | 23.227 |
1 | 2011-03-27 | 841 | 1 | 2011 | 36 | 23199 | 2 | 17:59:29 | 23.199 |
2 | 2011-03-27 | 841 | 2 | 2011 | 15 | 22994 | 1 | 17:27:41 | 22.994 |
3 | 2011-03-27 | 841 | 2 | 2011 | 30 | 25098 | 2 | 17:51:32 | 25.098 |
4 | 2011-03-27 | 841 | 3 | 2011 | 16 | 23716 | 1 | 17:29:00 | 23.716 |
... | ... | ... | ... | ... | ... | ... | ... | ... | |
6065 | 2017-09-17 | 982 | 839 | 2017 | 38 | 29134 | 6 | 21:29:07 | 29.134 |
6066 | 2017-09-17 | 982 | 840 | 2017 | 1 | 37403 | 1 | 20:06:43 | 37.403 |
6067 | 2017-09-17 | 982 | 840 | 2017 | 2 | 29294 | 2 | 20:10:07 | 29.294 |
6068 | 2017-09-17 | 982 | 840 | 2017 | 3 | 25584 | 3 | 20:13:16 | 25.584 |
6069 | 2017-09-17 | 982 | 840 | 2017 | 26 | 29412 | 4 | 21:05:07 | 29.412 |
6070 rows x 9 columns
memory usage: 0.46 MB
name: pit_stops
type: getml.DataFrame
qualifying.set_role(["raceId", "driverId", "qualifyId"], getml.data.roles.join_key)
qualifying.set_role(["position", "number"], getml.data.roles.numerical)
qualifying
name | raceId | driverId | qualifyId | position | number | constructorId | q1 | q2 | q3 |
---|---|---|---|---|---|---|---|---|---|
role | join_key | join_key | join_key | numerical | numerical | unused_float | unused_string | unused_string | unused_string |
0 | 18 | 1 | 1 | 1 | 22 | 1 | 1:26.572 | 1:25.187 | 1:26.714 |
1 | 18 | 9 | 2 | 2 | 4 | 2 | 1:26.103 | 1:25.315 | 1:26.869 |
2 | 18 | 5 | 3 | 3 | 23 | 1 | 1:25.664 | 1:25.452 | 1:27.079 |
3 | 18 | 13 | 4 | 4 | 2 | 6 | 1:25.994 | 1:25.691 | 1:27.178 |
4 | 18 | 2 | 5 | 5 | 3 | 2 | 1:25.960 | 1:25.518 | 1:27.236 |
... | ... | ... | ... | ... | ... | ... | ... | ... | |
7392 | 982 | 825 | 7415 | 16 | 20 | 210 | 1:43.756 | NULL | NULL |
7393 | 982 | 13 | 7416 | 17 | 19 | 3 | 1:44.014 | NULL | NULL |
7394 | 982 | 840 | 7417 | 18 | 18 | 3 | 1:44.728 | NULL | NULL |
7395 | 982 | 836 | 7418 | 19 | 94 | 15 | 1:45.059 | NULL | NULL |
7396 | 982 | 828 | 7419 | 20 | 9 | 15 | 1:45.570 | NULL | NULL |
7397 rows x 9 columns
memory usage: 0.57 MB
name: qualifying
type: getml.DataFrame
2. Predictive modeling¶
We loaded the data and defined the roles and units. Next, we create a getML pipeline for relational learning.
split = getml.data.split.random(train=0.8, test=0.2)
split
0 | train |
---|---|
1 | train |
2 | train |
3 | test |
4 | train |
... |
infinite number of rows
type: StringColumnView
2.1 Define relational model¶
star_schema = getml.data.StarSchema(population=driver_standings.drop(["position"]), alias="population", split=split)
star_schema.join(
driver_standings,
on=["driverId"],
time_stamps="date",
horizon=getml.data.time.days(1),
lagged_targets=True,
)
# We cannot use lap times for the race
# we would like to predict, so we set
# a non-zero horizon.
star_schema.join(
lap_times,
on=["driverId"],
time_stamps="date",
horizon=getml.data.time.days(1),
)
# We cannot use pit stops for the race
# we would like to predict, so we set
# a non-zero horizon.
star_schema.join(
pit_stops,
on=["driverId"],
time_stamps="date",
horizon=getml.data.time.days(1),
)
star_schema.join(
qualifying,
on=["driverId", "raceId"],
relationship=getml.data.relationship.many_to_one,
)
star_schema.join(
drivers,
on=["driverId"],
relationship=getml.data.relationship.many_to_one,
)
star_schema
data frames | staging table | |
---|---|---|
0 | population, qualifying, drivers | POPULATION__STAGING_TABLE_1 |
1 | driver_standings | DRIVER_STANDINGS__STAGING_TABLE_2 |
2 | lap_times | LAP_TIMES__STAGING_TABLE_3 |
3 | pit_stops | PIT_STOPS__STAGING_TABLE_4 |
subset | name | rows | type | |
---|---|---|---|---|
0 | test | driver_standings | 6229 | View |
1 | train | driver_standings | 25349 | View |
name | rows | type | |
---|---|---|---|
0 | driver_standings | 31578 | DataFrame |
1 | lap_times | 420369 | DataFrame |
2 | pit_stops | 6070 | DataFrame |
3 | qualifying | 7397 | DataFrame |
4 | drivers | 840 | DataFrame |
2.2 getML pipeline¶
Set-up the feature learner & predictor
We use the relboost algorithms for this problem. Because of the large number of keywords, we regularize the model a bit by requiring a minimum support for the keywords (min_num_samples
).
mapping = getml.preprocessors.Mapping()
fast_prop = getml.feature_learning.FastProp(
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
aggregation=getml.feature_learning.FastProp.agg_sets.All,
num_threads=1,
)
relboost = getml.feature_learning.Relboost(
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
num_threads=1,
)
relmt = getml.feature_learning.RelMT(
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
num_threads=1,
)
predictor = getml.predictors.XGBoostClassifier(n_jobs=1)
Build the pipeline
pipe1 = getml.pipeline.Pipeline(
tags=['fast_prop'],
data_model=star_schema.data_model,
preprocessors=[mapping],
feature_learners=[fast_prop],
predictors=[predictor],
include_categorical=True,
)
pipe1
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=[], include_categorical=True, loss_function='CrossEntropyLoss', peripheral=['driver_standings', 'drivers', 'lap_times', 'pit_stops', 'qualifying'], predictors=['XGBoostClassifier'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['fast_prop'])
2.3 Model training¶
pipe1.check(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:05 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 2 issues labeled INFO and 0 issues labeled WARNING.
type | label | message | |
---|---|---|---|
0 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and LAP_TIMES__STAGING_TABLE_3 over 'driverId' and 'driverId', there are no corresponding entries for 68.551028% of entries in 'driverId' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
1 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and PIT_STOPS__STAGING_TABLE_4 over 'driverId' and 'driverId', there are no corresponding entries for 82.527910% of entries in 'driverId' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
pipe1.fit(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 2 issues labeled INFO and 0 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Trying 993 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 02:35 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:35 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:15
Trained pipeline.
Time taken: 0:03:26.520434.
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=[], include_categorical=True, loss_function='CrossEntropyLoss', peripheral=['driver_standings', 'drivers', 'lap_times', 'pit_stops', 'qualifying'], predictors=['XGBoostClassifier'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['fast_prop', 'container-DlsHvD'])
2.4 Model evaluation¶
fastprop_score = pipe1.score(star_schema.test)
fastprop_score
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:09
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 12:53:27 | train | win | 0.9732 | 0.9577 | 0.07529 |
1 | 2024-09-12 12:53:37 | test | win | 0.9724 | 0.9235 | 0.08533 |
2.5 featuretools¶
featuretools requires some additional data preparation before we can start engineering features.
population_train_pd = star_schema.train.population.drop(["position"]).to_pandas()
population_test_pd = star_schema.test.population.drop(["position"]).to_pandas()
driver_standings_pd = driver_standings.drop(driver_standings.roles.unused).to_pandas()
lap_times_pd = lap_times.drop(lap_times.roles.unused).to_pandas()
pit_stops_pd = pit_stops.drop(pit_stops.roles.unused).to_pandas()
qualifying_pd = qualifying.drop(qualifying.roles.unused).to_pandas()
drivers_pd = drivers.drop(drivers.roles.unused).to_pandas()
Because qualifying
and drivers
are many-to-one joins, we can directly join them onto our table.
population_train_pd["id"] = population_train_pd.index
population_train_pd = population_train_pd.merge(
qualifying_pd,
on=["driverId", "raceId"],
how="left",
)
population_train_pd = population_train_pd.merge(
drivers_pd,
on=["driverId"],
how="left",
)
population_train_pd
raceId | driverId | year | win | date | id | qualifyId | position | number | nationality | driverRef | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 18 | 1 | 2008 | 1.0 | 2008-03-16 | 0 | 1 | 1.0 | 22.0 | British | hamilton |
1 | 18 | 2 | 2008 | 0.0 | 2008-03-16 | 1 | 5 | 5.0 | 3.0 | German | heidfeld |
2 | 18 | 3 | 2008 | 0.0 | 2008-03-16 | 2 | 7 | 7.0 | 7.0 | German | rosberg |
3 | 18 | 5 | 2008 | 0.0 | 2008-03-16 | 3 | 3 | 3.0 | 23.0 | Finnish | kovalainen |
4 | 18 | 6 | 2008 | 0.0 | 2008-03-16 | 4 | 14 | 14.0 | 8.0 | Japanese | nakajima |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
25344 | 982 | 835 | 2017 | 0.0 | 2017-09-17 | 25344 | 7410 | 11.0 | 30.0 | British | jolyon_palmer |
25345 | 982 | 154 | 2017 | 0.0 | 2017-09-17 | 25345 | 7414 | 15.0 | 8.0 | French | grosjean |
25346 | 982 | 836 | 2017 | 0.0 | 2017-09-17 | 25346 | 7418 | 19.0 | 94.0 | German | wehrlein |
25347 | 982 | 18 | 2017 | 0.0 | 2017-09-17 | 25347 | NaN | NaN | NaN | British | button |
25348 | 982 | 814 | 2017 | 0.0 | 2017-09-17 | 25348 | NaN | NaN | NaN | British | resta |
25349 rows × 11 columns
Same for the testing set.
population_test_pd["id"] = population_test_pd.index
population_test_pd = population_test_pd.merge(
qualifying_pd,
on=["driverId", "raceId"],
how="left",
)
population_test_pd = population_test_pd.merge(
drivers_pd,
on=["driverId"],
how="left",
)
population_test_pd
raceId | driverId | year | win | date | id | qualifyId | position | number | nationality | driverRef | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 18 | 4 | 2008 | 0.0 | 2008-03-16 | 0 | 12 | 12.0 | 5.0 | Spanish | alonso |
1 | 18 | 7 | 2008 | 0.0 | 2008-03-16 | 1 | 18 | 18.0 | 14.0 | French | bourdais |
2 | 19 | 3 | 2008 | 0.0 | 2008-03-23 | 2 | 38 | 16.0 | 7.0 | German | rosberg |
3 | 19 | 11 | 2008 | 0.0 | 2008-03-23 | 3 | 42 | 20.0 | 18.0 | Japanese | sato |
4 | 20 | 7 | 2008 | 0.0 | 2008-04-06 | 4 | 59 | 15.0 | 14.0 | French | bourdais |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6224 | 981 | 835 | 2017 | 0.0 | 2017-09-03 | 6224 | 7396 | 17.0 | 30.0 | British | jolyon_palmer |
6225 | 981 | 814 | 2017 | 0.0 | 2017-09-03 | 6225 | NaN | NaN | NaN | British | resta |
6226 | 982 | 822 | 2017 | 0.0 | 2017-09-17 | 6226 | 7405 | 6.0 | 77.0 | Finnish | bottas |
6227 | 982 | 832 | 2017 | 0.0 | 2017-09-17 | 6227 | 7409 | 10.0 | 55.0 | Spanish | sainz |
6228 | 982 | 826 | 2017 | 0.0 | 2017-09-17 | 6228 | 7412 | 13.0 | 26.0 | Russian | kvyat |
6229 rows × 11 columns
featuretools requires us to expand our peripheral tables so they can be joined using the unique id from the population table. Luckily, we can write a simple helper function that works for all three remaining peripheral tables.
def prepare_peripheral(peripheral, train_or_test):
"""
Helper function that imitates the behavior of
the data model defined above.
"""
peripheral_new = peripheral.merge(
train_or_test[["id", "driverId", "date"]],
on=["driverId"],
)
peripheral_new = peripheral_new[
peripheral_new["date_x"] < peripheral_new["date_y"]
]
del peripheral_new["date_y"]
del peripheral_new["driverId"]
del peripheral_new["raceId"]
return peripheral_new.rename(columns={"date_x": "date"})
driver_standings_train_pd = prepare_peripheral(driver_standings_pd, population_train_pd)
driver_standings_test_pd = prepare_peripheral(driver_standings_pd, population_test_pd)
driver_standings_train_pd
year | position | win | date | id | |
---|---|---|---|---|---|
1 | 2008 | 1.0 | 1.0 | 2008-03-16 | 6 |
2 | 2008 | 1.0 | 1.0 | 2008-03-16 | 22 |
3 | 2008 | 1.0 | 1.0 | 2008-03-16 | 39 |
4 | 2008 | 1.0 | 1.0 | 2008-03-16 | 54 |
5 | 2008 | 1.0 | 1.0 | 2008-03-16 | 72 |
... | ... | ... | ... | ... | ... |
2653710 | 2017 | 19.0 | 0.0 | 2017-09-03 | 25344 |
2653802 | 2017 | 13.0 | 0.0 | 2017-09-03 | 25345 |
2653830 | 2017 | 16.0 | 0.0 | 2017-09-03 | 25346 |
2654096 | 2017 | 22.0 | 0.0 | 2017-09-03 | 25347 |
2654136 | 2017 | 23.0 | 0.0 | 2017-09-03 | 25348 |
1317941 rows × 5 columns
lap_times_train_pd = prepare_peripheral(lap_times_pd, population_train_pd)
lap_times_test_pd = prepare_peripheral(lap_times_pd, population_test_pd)
lap_times_train_pd
year | lap | milliseconds | position | date | id | |
---|---|---|---|---|---|---|
28 | 2009 | 1.0 | 109088.0 | 13.0 | 2009-03-29 | 5874 |
29 | 2009 | 1.0 | 109088.0 | 13.0 | 2009-03-29 | 5904 |
30 | 2009 | 1.0 | 109088.0 | 13.0 | 2009-03-29 | 5918 |
31 | 2009 | 1.0 | 109088.0 | 13.0 | 2009-03-29 | 5936 |
32 | 2009 | 1.0 | 109088.0 | 13.0 | 2009-03-29 | 5954 |
... | ... | ... | ... | ... | ... | ... |
54748152 | 2017 | 49.0 | 85803.0 | 7.0 | 2017-09-03 | 25341 |
54748164 | 2017 | 50.0 | 85788.0 | 7.0 | 2017-09-03 | 25341 |
54748176 | 2017 | 51.0 | 85625.0 | 7.0 | 2017-09-03 | 25341 |
54748188 | 2017 | 52.0 | 86258.0 | 7.0 | 2017-09-03 | 25341 |
54748200 | 2017 | 53.0 | 86965.0 | 7.0 | 2017-09-03 | 25341 |
25544456 rows × 6 columns
lap_times
demonstrates one of the greatest dangers of featuretools. Because it is written in pure Python, featuretools requires you expand your tables so that they can be more easily joined. But this comes at the cost of increased memory consumption: In this case a table that used to have about 420,000 rows has now been expanded to over 25 million rows.
pit_stops_train_pd = prepare_peripheral(pit_stops_pd, population_train_pd)
pit_stops_test_pd = prepare_peripheral(pit_stops_pd, population_test_pd)
pit_stops_train_pd
year | lap | milliseconds | stop | date | id | |
---|---|---|---|---|---|---|
57 | 2011 | 16.0 | 23227.0 | 1.0 | 2011-03-27 | 22459 |
58 | 2011 | 16.0 | 23227.0 | 1.0 | 2011-03-27 | 22498 |
59 | 2011 | 16.0 | 23227.0 | 1.0 | 2011-03-27 | 22536 |
60 | 2011 | 16.0 | 23227.0 | 1.0 | 2011-03-27 | 22555 |
61 | 2011 | 16.0 | 23227.0 | 1.0 | 2011-03-27 | 22574 |
... | ... | ... | ... | ... | ... | ... |
681497 | 2017 | 27.0 | 24173.0 | 2.0 | 2017-09-03 | 25332 |
681565 | 2017 | 25.0 | 31008.0 | 1.0 | 2017-09-03 | 25344 |
681593 | 2017 | 20.0 | 25212.0 | 1.0 | 2017-09-03 | 25346 |
681612 | 2017 | 16.0 | 24432.0 | 1.0 | 2017-09-03 | 25335 |
681624 | 2017 | 17.0 | 26131.0 | 1.0 | 2017-09-03 | 25341 |
213549 rows × 6 columns
del population_train_pd["driverId"]
del population_train_pd["raceId"]
del population_train_pd["year"]
del population_train_pd["qualifyId"]
del population_test_pd["driverId"]
del population_test_pd["raceId"]
del population_test_pd["year"]
del population_test_pd["qualifyId"]
def add_index(df):
df.insert(0, "index", range(len(df)))
population_pd_logical_types = {
"id": ww.logical_types.Integer,
"win": ww.logical_types.Integer,
"date": ww.logical_types.Datetime,
"position": ww.logical_types.IntegerNullable,
"number": ww.logical_types.IntegerNullable,
"nationality": ww.logical_types.Categorical,
"driverRef": ww.logical_types.Categorical,
}
population_train_pd.ww.init(logical_types=population_pd_logical_types, index="id", name="population")
population_test_pd.ww.init(logical_types=population_pd_logical_types, index="id", name="population")
add_index(driver_standings_train_pd)
add_index(driver_standings_test_pd)
driver_standings_pd_logical_types = {
"index": ww.logical_types.Integer,
"year": ww.logical_types.Integer,
"position": ww.logical_types.IntegerNullable,
"win": ww.logical_types.Integer,
"date": ww.logical_types.Datetime,
"id": ww.logical_types.Integer,
}
driver_standings_train_pd.ww.init(logical_types=driver_standings_pd_logical_types, index="index", name="driver_standings")
driver_standings_test_pd.ww.init(logical_types=driver_standings_pd_logical_types, index="index", name="driver_standings")
add_index(lap_times_train_pd)
add_index(lap_times_test_pd)
lap_times_pd_logical_types = {
"index": ww.logical_types.Integer,
"year": ww.logical_types.Integer,
"lap": ww.logical_types.Integer,
"milliseconds": ww.logical_types.Integer,
"position": ww.logical_types.IntegerNullable,
"date": ww.logical_types.Datetime,
"id": ww.logical_types.Integer,
}
lap_times_train_pd.ww.init(logical_types=lap_times_pd_logical_types, index="index", name="lap_times")
lap_times_test_pd.ww.init(logical_types=lap_times_pd_logical_types, index="index", name="lap_times")
add_index(pit_stops_train_pd)
add_index(pit_stops_test_pd)
pit_stops_pd_logical_types = {
"index": ww.logical_types.Integer,
"year": ww.logical_types.Integer,
"lap": ww.logical_types.Integer,
"milliseconds": ww.logical_types.Integer,
"stop": ww.logical_types.Categorical,
"date": ww.logical_types.Datetime,
"id": ww.logical_types.Integer,
}
pit_stops_train_pd.ww.init(logical_types=pit_stops_pd_logical_types, index="index", name="pit_stops")
pit_stops_test_pd.ww.init(logical_types=pit_stops_pd_logical_types, index="index", name="pit_stops")
dataframes_train = {
"population" : (population_train_pd, ),
"driver_standings": (driver_standings_train_pd, ),
"lap_times" : (lap_times_train_pd, ),
"pit_stops" : (pit_stops_train_pd, ),
}
dataframes_test = {
"population" : (population_test_pd, ),
"driver_standings": (driver_standings_test_pd, ),
"lap_times" : (lap_times_test_pd, ),
"pit_stops" : (pit_stops_test_pd, )
}
relationships = [
("population", "id", "driver_standings", "id"),
("population", "id", "lap_times", "id"),
("population", "id", "pit_stops", "id")
]
featuretools_train_pd = featuretools.dfs(
dataframes=dataframes_train,
relationships=relationships,
target_dataframe_name="population")[0]
featuretools_test_pd = featuretools.dfs(
dataframes=dataframes_test,
relationships=relationships,
target_dataframe_name="population")[0]
featuretools_train = getml.data.DataFrame.from_pandas(featuretools_train_pd, "featuretools_train")
featuretools_test = getml.data.DataFrame.from_pandas(featuretools_test_pd, "featuretools_test")
featuretools_train.set_role("win", getml.data.roles.target)
featuretools_train.set_role(featuretools_train.roles.unused_float, getml.data.roles.numerical)
featuretools_train.set_role(featuretools_train.roles.unused_string, getml.data.roles.categorical)
featuretools_train
name | win | position | number | nationality | driverRef | COUNT(driver_standings) | COUNT(lap_times) | COUNT(pit_stops) | MODE(pit_stops.stop) | NUM_UNIQUE(pit_stops.stop) | DAY(date) | MONTH(date) | WEEKDAY(date) | YEAR(date) | MODE(driver_standings.DAY(date)) | MODE(driver_standings.MONTH(date)) | MODE(driver_standings.WEEKDAY(date)) | MODE(driver_standings.YEAR(date)) | NUM_UNIQUE(driver_standings.DAY(date)) | NUM_UNIQUE(driver_standings.MONTH(date)) | NUM_UNIQUE(driver_standings.WEEKDAY(date)) | NUM_UNIQUE(driver_standings.YEAR(date)) | MODE(lap_times.DAY(date)) | MODE(lap_times.MONTH(date)) | MODE(lap_times.WEEKDAY(date)) | MODE(lap_times.YEAR(date)) | NUM_UNIQUE(lap_times.DAY(date)) | NUM_UNIQUE(lap_times.MONTH(date)) | NUM_UNIQUE(lap_times.WEEKDAY(date)) | NUM_UNIQUE(lap_times.YEAR(date)) | MODE(pit_stops.DAY(date)) | MODE(pit_stops.MONTH(date)) | MODE(pit_stops.WEEKDAY(date)) | MODE(pit_stops.YEAR(date)) | NUM_UNIQUE(pit_stops.DAY(date)) | NUM_UNIQUE(pit_stops.MONTH(date)) | NUM_UNIQUE(pit_stops.WEEKDAY(date)) | NUM_UNIQUE(pit_stops.YEAR(date)) | MAX(driver_standings.position) | MAX(driver_standings.win) | MAX(driver_standings.year) | MEAN(driver_standings.position) | MEAN(driver_standings.win) | MEAN(driver_standings.year) | MIN(driver_standings.position) | MIN(driver_standings.win) | MIN(driver_standings.year) | SKEW(driver_standings.position) | SKEW(driver_standings.win) | SKEW(driver_standings.year) | STD(driver_standings.position) | STD(driver_standings.win) | STD(driver_standings.year) | SUM(driver_standings.position) | SUM(driver_standings.win) | SUM(driver_standings.year) | MAX(lap_times.lap) | MAX(lap_times.milliseconds) | MAX(lap_times.position) | MAX(lap_times.year) | MEAN(lap_times.lap) | MEAN(lap_times.milliseconds) | MEAN(lap_times.position) | MEAN(lap_times.year) | MIN(lap_times.lap) | MIN(lap_times.milliseconds) | MIN(lap_times.position) | MIN(lap_times.year) | SKEW(lap_times.lap) | SKEW(lap_times.milliseconds) | SKEW(lap_times.position) | SKEW(lap_times.year) | STD(lap_times.lap) | STD(lap_times.milliseconds) | STD(lap_times.position) | STD(lap_times.year) | SUM(lap_times.lap) | SUM(lap_times.milliseconds) | SUM(lap_times.position) | SUM(lap_times.year) | MAX(pit_stops.lap) | MAX(pit_stops.milliseconds) | MAX(pit_stops.year) | MEAN(pit_stops.lap) | MEAN(pit_stops.milliseconds) | MEAN(pit_stops.year) | MIN(pit_stops.lap) | MIN(pit_stops.milliseconds) | MIN(pit_stops.year) | SKEW(pit_stops.lap) | SKEW(pit_stops.milliseconds) | SKEW(pit_stops.year) | STD(pit_stops.lap) | STD(pit_stops.milliseconds) | STD(pit_stops.year) | SUM(pit_stops.lap) | SUM(pit_stops.milliseconds) | SUM(pit_stops.year) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | target | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical |
0 | 1 | 1 | 22 | British | hamilton | 17 | 1037 | 0 | NULL | NULL | 16 | 3 | 6 | 2008 | 8 | 7 | 6 | 2007 | 16 | 8 | 1 | 1 | 8 | 7 | 6 | 2007 | 16 | 8 | 1 | 1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 3 | 1 | 2007 | 1.4706 | 0.2353 | 2007 | 1 | 0 | 2007 | 1.3538 | 1.3723 | 0 | 0.7998 | 0.4372 | 0 | 25 | 4 | 34119 | 78 | 1453884 | 19 | 2007 | 32.0559 | 90600.8602 | 3.2575 | 2007 | 1 | 72506 | 1 | 2007 | 0.1882 | 26.093 | 2.4493 | 0 | 19.1156 | 45564.9471 | 3.4488 | 0 | 33242 | 93953092 | 3378 | 2081259 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 |
1 | 0 | 5 | 3 | German | heidfeld | 134 | 6863 | 0 | NULL | NULL | 16 | 3 | 6 | 2008 | 1 | 7 | 6 | 2005 | 31 | 8 | 1 | 8 | 1 | 7 | 6 | 2007 | 31 | 8 | 1 | 8 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 21 | 0 | 2007 | 10.8582 | 0 | 2003.5448 | 4 | 0 | 2000 | 0.4681 | 0 | -0.04499 | 5.136 | 0 | 2.3255 | 1455 | 0 | 268475 | 78 | 1285986 | 22 | 2007 | 30.4682 | 89072.7102 | 9.6799 | 2003.7322 | 1 | 70516 | 1 | 2000 | 0.2623 | 30.5344 | 0.3339 | -0.05998 | 18.7041 | 20544.1746 | 4.199 | 2.258 | 209103 | 611306010 | 66433 | 13751614 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 |
2 | 0 | 7 | 7 | German | rosberg | 35 | 1627 | 0 | NULL | NULL | 16 | 3 | 6 | 2008 | 8 | 7 | 6 | 2006 | 25 | 8 | 1 | 2 | 8 | 5 | 6 | 2007 | 25 | 8 | 1 | 2 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 17 | 0 | 2007 | 11.9714 | 0 | 2006.4857 | 7 | 0 | 2006 | 0.1328 | 0 | 0.05976 | 3.1761 | 0 | 0.5071 | 419 | 0 | 70227 | 77 | 191891 | 22 | 2007 | 30.4124 | 89820.1174 | 9.9809 | 2006.5931 | 1 | 73159 | 3 | 2006 | 0.2225 | 1.7858 | 0.5491 | -0.3794 | 18.6109 | 14212.8848 | 3.733 | 0.4914 | 49481 | 146137331 | 16239 | 3264727 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 |
3 | 0 | 3 | 23 | Finnish | kovalainen | 17 | 1023 | 0 | NULL | NULL | 16 | 3 | 6 | 2008 | 8 | 7 | 6 | 2007 | 16 | 8 | 1 | 1 | 8 | 7 | 6 | 2007 | 16 | 8 | 1 | 1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 11 | 0 | 2007 | 8.3529 | 0 | 2007 | 7 | 0 | 2007 | 0.7931 | 0 | 0 | 1.4116 | 0 | 0 | 142 | 0 | 34119 | 76 | 1394884 | 22 | 2007 | 31.4526 | 92458.8299 | 9.4018 | 2007 | 1 | 73998 | 1 | 2007 | 0.1906 | 26.2493 | 0.6941 | 0 | 18.6943 | 43578.4946 | 4.1724 | 0 | 32176 | 94585383 | 9618 | 2053161 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 |
4 | 0 | 14 | 8 | Japanese | nakajima | 1 | 70 | 0 | NULL | NULL | 16 | 3 | 6 | 2008 | 21 | 10 | 6 | 2007 | 1 | 1 | 1 | 1 | 21 | 10 | 6 | 2007 | 1 | 1 | 1 | 1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 22 | 0 | 2007 | 22 | 0 | 2007 | 22 | 0 | 2007 | nan | nan | nan | nan | nan | nan | 22 | 0 | 2007 | 70 | 104396 | 16 | 2007 | 35.5 | 76228.3571 | 10.9143 | 2007 | 1 | 73116 | 9 | 2007 | 0 | 4.4222 | 1.2783 | 0 | 20.3511 | 4785.2529 | 1.8552 | 0 | 2485 | 5335985 | 764 | 140490 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
25344 | 0 | 11 | 30 | British | jolyon_palmer | 34 | 1517 | 56 | 1 | 4 | 17 | 9 | 6 | 2017 | 3 | 7 | 6 | 2016 | 25 | 9 | 1 | 2 | 9 | 7 | 6 | 2016 | 25 | 9 | 1 | 2 | 9 | 7 | 6 | 2016 | 21 | 9 | 1 | 2 | 20 | 0 | 2017 | 18.0588 | 0 | 2016.3824 | 11 | 0 | 2016 | -1.7626 | 0 | 0.5068 | 1.8413 | 0 | 0.4933 | 614 | 0 | 68557 | 78 | 1223356 | 22 | 2017 | 29.4199 | 98411.6038 | 14.5175 | 2016.3665 | 1 | 68652 | 7 | 2016 | 0.2921 | 21.0237 | 0.3751 | 0.5546 | 18.5854 | 43393.1519 | 3.0483 | 0.482 | 44630 | 149290403 | 22023 | 3058828 | 50 | 1088294 | 2017 | 20.9286 | 61015.1429 | 2016.2857 | 1 | 16851 | 2016 | 0.3035 | 5.1507 | 0.975 | 13.473 | 191214.0556 | 0.4558 | 1172 | 3416848 | 112912 |
25345 | 0 | 15 | 8 | French | grosjean | 118 | 5518 | 206 | 1 | 5 | 17 | 9 | 6 | 2017 | 27 | 7 | 6 | 2016 | 31 | 9 | 1 | 7 | 9 | 7 | 6 | 2016 | 31 | 9 | 1 | 7 | 9 | 7 | 6 | 2014 | 31 | 9 | 1 | 6 | 24 | 0 | 2017 | 11.5254 | 0 | 2014.0508 | 5 | 0 | 2009 | 1.3306 | 0 | -0.5931 | 4.1709 | 0 | 2.0542 | 1360 | 0 | 237658 | 78 | 3670560 | 24 | 2017 | 29.4511 | 99489.4235 | 10.7356 | 2014.1414 | 1 | 68590 | 1 | 2009 | 0.2565 | 39.9299 | 0.06907 | -0.6145 | 18.0521 | 63108.3058 | 4.8371 | 1.9904 | 162511 | 548982639 | 59239 | 11114032 | 62 | 1313665 | 2017 | 22.9369 | 46260.6214 | 2014.4272 | 1 | 15073 | 2012 | 0.3075 | 7.1215 | 0.04707 | 13.7483 | 154388.9496 | 1.6024 | 4725 | 9529688 | 414972 |
25346 | 0 | 19 | 94 | German | wehrlein | 32 | 1636 | 62 | 1 | 5 | 17 | 9 | 6 | 2017 | 3 | 7 | 6 | 2016 | 24 | 9 | 1 | 2 | 3 | 7 | 6 | 2016 | 24 | 9 | 1 | 2 | 3 | 7 | 6 | 2016 | 24 | 8 | 1 | 2 | 22 | 0 | 2017 | 17.25 | 0 | 2016.3438 | 13 | 0 | 2016 | 0.3213 | 0 | 0.6908 | 2.2718 | 0 | 0.4826 | 552 | 0 | 64523 | 76 | 2118880 | 22 | 2017 | 30.3839 | 99952.8594 | 15.8319 | 2016.3594 | 1 | 69241 | 4 | 2016 | 0.2027 | 20.0905 | -0.8509 | 0.5865 | 18.2369 | 77850.4087 | 3.2136 | 0.48 | 49708 | 163522878 | 25901 | 3298764 | 52 | 2008464 | 2017 | 21.6129 | 119925.5968 | 2016.3065 | 1 | 18064 | 2016 | 0.1765 | 3.9986 | 0.8606 | 12.8896 | 368710.9372 | 0.4648 | 1340 | 7435387 | 125011 |
25347 | 0 | NULL | NULL | British | button | 309 | 16272 | 259 | 1 | 6 | 17 | 9 | 6 | 2017 | 24 | 7 | 6 | 2016 | 31 | 9 | 1 | 18 | 24 | 7 | 6 | 2013 | 31 | 9 | 1 | 18 | 25 | 7 | 6 | 2011 | 31 | 9 | 1 | 7 | 22 | 1 | 2017 | 9.6828 | 0.04854 | 2008.6181 | 1 | 0 | 2000 | 0.3055 | 4.2218 | -0.099 | 5.8899 | 0.2153 | 5.0436 | 2992 | 15 | 620663 | 78 | 7503775 | 22 | 2017 | 30.2853 | 95662.1676 | 8.4123 | 2008.4672 | 1 | 69828 | 1 | 2000 | 0.2244 | 64.7495 | 0.4521 | -0.1278 | 18.3544 | 79242.6165 | 4.859 | 4.8663 | 492803 | 1556614792 | 136885 | 32681778 | 64 | 2010062 | 2017 | 24.9344 | 45562.5985 | 2013.3977 | 1 | 14501 | 2011 | 0.3229 | 8.7373 | 0.09889 | 13.5509 | 179696.8608 | 1.785 | 6458 | 11800713 | 521470 |
25348 | 0 | NULL | NULL | British | resta | 61 | 3284 | 127 | 1 | 4 | 17 | 9 | 6 | 2017 | 27 | 7 | 6 | 2012 | 27 | 9 | 1 | 4 | 27 | 7 | 6 | 2012 | 27 | 9 | 1 | 4 | 25 | 5 | 6 | 2011 | 27 | 9 | 1 | 4 | 23 | 0 | 2017 | 12.6066 | 0 | 2012.2459 | 8 | 0 | 2011 | 1.244 | 0 | 2.0563 | 3.4558 | 0 | 1.3498 | 769 | 0 | 122747 | 78 | 7503980 | 24 | 2017 | 30.3097 | 103049.4318 | 10.7728 | 2012.0539 | 1 | 76855 | 1 | 2011 | 0.1989 | 44.5439 | 0.1348 | 1.8215 | 18.0742 | 144158.4761 | 3.5107 | 1.0439 | 99537 | 338414334 | 35378 | 6607585 | 57 | 123124 | 2017 | 25.1496 | 24084.2598 | 2011.9528 | 1 | 14538 | 2011 | 0.3166 | 8.1337 | 1.2541 | 13.8456 | 9973.774 | 0.9416 | 3194 | 3058701 | 255518 |
25349 rows x 98 columns
memory usage: 16.12 MB
name: featuretools_train
type: getml.DataFrame
featuretools_test.set_role("win", getml.data.roles.target)
featuretools_test.set_role(featuretools_test.roles.unused_float, getml.data.roles.numerical)
featuretools_test.set_role(featuretools_test.roles.unused_string, getml.data.roles.categorical)
featuretools_test
name | win | position | number | nationality | driverRef | COUNT(driver_standings) | COUNT(lap_times) | COUNT(pit_stops) | MODE(pit_stops.stop) | NUM_UNIQUE(pit_stops.stop) | DAY(date) | MONTH(date) | WEEKDAY(date) | YEAR(date) | MODE(driver_standings.DAY(date)) | MODE(driver_standings.MONTH(date)) | MODE(driver_standings.WEEKDAY(date)) | MODE(driver_standings.YEAR(date)) | NUM_UNIQUE(driver_standings.DAY(date)) | NUM_UNIQUE(driver_standings.MONTH(date)) | NUM_UNIQUE(driver_standings.WEEKDAY(date)) | NUM_UNIQUE(driver_standings.YEAR(date)) | MODE(lap_times.DAY(date)) | MODE(lap_times.MONTH(date)) | MODE(lap_times.WEEKDAY(date)) | MODE(lap_times.YEAR(date)) | NUM_UNIQUE(lap_times.DAY(date)) | NUM_UNIQUE(lap_times.MONTH(date)) | NUM_UNIQUE(lap_times.WEEKDAY(date)) | NUM_UNIQUE(lap_times.YEAR(date)) | MODE(pit_stops.DAY(date)) | MODE(pit_stops.MONTH(date)) | MODE(pit_stops.WEEKDAY(date)) | MODE(pit_stops.YEAR(date)) | NUM_UNIQUE(pit_stops.DAY(date)) | NUM_UNIQUE(pit_stops.MONTH(date)) | NUM_UNIQUE(pit_stops.WEEKDAY(date)) | NUM_UNIQUE(pit_stops.YEAR(date)) | MAX(driver_standings.position) | MAX(driver_standings.win) | MAX(driver_standings.year) | MEAN(driver_standings.position) | MEAN(driver_standings.win) | MEAN(driver_standings.year) | MIN(driver_standings.position) | MIN(driver_standings.win) | MIN(driver_standings.year) | SKEW(driver_standings.position) | SKEW(driver_standings.win) | SKEW(driver_standings.year) | STD(driver_standings.position) | STD(driver_standings.win) | STD(driver_standings.year) | SUM(driver_standings.position) | SUM(driver_standings.win) | SUM(driver_standings.year) | MAX(lap_times.lap) | MAX(lap_times.milliseconds) | MAX(lap_times.position) | MAX(lap_times.year) | MEAN(lap_times.lap) | MEAN(lap_times.milliseconds) | MEAN(lap_times.position) | MEAN(lap_times.year) | MIN(lap_times.lap) | MIN(lap_times.milliseconds) | MIN(lap_times.position) | MIN(lap_times.year) | SKEW(lap_times.lap) | SKEW(lap_times.milliseconds) | SKEW(lap_times.position) | SKEW(lap_times.year) | STD(lap_times.lap) | STD(lap_times.milliseconds) | STD(lap_times.position) | STD(lap_times.year) | SUM(lap_times.lap) | SUM(lap_times.milliseconds) | SUM(lap_times.position) | SUM(lap_times.year) | MAX(pit_stops.lap) | MAX(pit_stops.milliseconds) | MAX(pit_stops.year) | MEAN(pit_stops.lap) | MEAN(pit_stops.milliseconds) | MEAN(pit_stops.year) | MIN(pit_stops.lap) | MIN(pit_stops.milliseconds) | MIN(pit_stops.year) | SKEW(pit_stops.lap) | SKEW(pit_stops.milliseconds) | SKEW(pit_stops.year) | STD(pit_stops.lap) | STD(pit_stops.milliseconds) | STD(pit_stops.year) | SUM(pit_stops.lap) | SUM(pit_stops.milliseconds) | SUM(pit_stops.year) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | target | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical |
0 | 0 | 12 | 5 | Spanish | alonso | 105 | 5769 | 0 | NULL | NULL | 16 | 3 | 6 | 2008 | 1 | 7 | 6 | 2005 | 31 | 8 | 1 | 6 | 24 | 7 | 6 | 2006 | 31 | 8 | 1 | 6 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 23 | 1 | 2007 | 5.6667 | 0.181 | 2004.3714 | 1 | 0 | 2001 | 1.7319 | 1.6816 | -0.3952 | 7.1598 | 0.3868 | 1.9575 | 595 | 19 | 210459 | 78 | 1406432 | 21 | 2007 | 30.436 | 88439.8289 | 5.2479 | 2004.5883 | 1 | 70526 | 1 | 2001 | 0.2301 | 34.9402 | 1.3932 | -0.4971 | 18.3774 | 22633.3599 | 4.9145 | 1.8675 | 175585 | 510209373 | 30275 | 11564470 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 |
1 | 0 | 18 | 14 | French | bourdais | 0 | 0 | 0 | NULL | NULL | 16 | 3 | 6 | 2008 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 |
2 | 0 | 16 | 7 | German | rosberg | 36 | 1685 | 0 | NULL | NULL | 23 | 3 | 6 | 2008 | 8 | 7 | 6 | 2006 | 25 | 8 | 1 | 3 | 16 | 5 | 6 | 2007 | 25 | 8 | 1 | 3 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 17 | 0 | 2008 | 11.7222 | 0 | 2006.5278 | 3 | 0 | 2006 | -0.1958 | 0 | 0.4024 | 3.4691 | 0 | 0.5599 | 422 | 0 | 72235 | 77 | 191891 | 22 | 2008 | 30.381 | 90110.451 | 9.7822 | 2006.6415 | 1 | 73159 | 2 | 2006 | 0.2179 | 1.8323 | 0.5148 | 0.05473 | 18.5504 | 14476.8352 | 3.8237 | 0.5468 | 51192 | 151836110 | 16483 | 3381191 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 |
3 | 0 | 20 | 18 | Japanese | sato | 84 | 4309 | 0 | NULL | NULL | 23 | 3 | 6 | 2008 | 12 | 7 | 6 | 2004 | 31 | 8 | 1 | 6 | 25 | 7 | 6 | 2007 | 31 | 8 | 1 | 7 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 23 | 0 | 2007 | 16.9405 | 0 | 2004.8095 | 7 | 0 | 2002 | -0.493 | 0 | -0.3829 | 5.5479 | 0 | 1.7319 | 1423 | 0 | 168404 | 76 | 1921214 | 22 | 2008 | 29.1091 | 91287.2075 | 12.4342 | 2004.8867 | 1 | 70727 | 1 | 2002 | 0.2832 | 32.4719 | -0.2034 | -0.3841 | 18.0839 | 42719.4804 | 4.861 | 1.7193 | 125431 | 393356577 | 53579 | 8639057 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 |
4 | 0 | 15 | 14 | French | bourdais | 2 | 55 | 0 | NULL | NULL | 6 | 4 | 6 | 2008 | 16 | 3 | 6 | 2008 | 2 | 1 | 1 | 1 | 16 | 3 | 6 | 2008 | 1 | 1 | 1 | 1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 10 | 0 | 2008 | 8.5 | 0 | 2008 | 7 | 0 | 2008 | nan | nan | nan | 2.1213 | 0 | 0 | 17 | 0 | 4016 | 55 | 156156 | 16 | 2008 | 28 | 98864.3818 | 9.7818 | 2008 | 1 | 89534 | 4 | 2008 | 0 | 2.2604 | -0.06348 | 0 | 16.0208 | 17491.8075 | 4.3149 | 0 | 1540 | 5437541 | 538 | 110440 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
6224 | 0 | 17 | 30 | British | jolyon_palmer | 33 | 1488 | 55 | 1 | 4 | 3 | 9 | 6 | 2017 | 9 | 7 | 6 | 2016 | 25 | 9 | 1 | 2 | 9 | 7 | 6 | 2016 | 25 | 9 | 1 | 2 | 9 | 7 | 6 | 2016 | 21 | 9 | 1 | 2 | 20 | 0 | 2017 | 18.0303 | 0 | 2016.3636 | 11 | 0 | 2016 | -1.7159 | 0 | 0.5943 | 1.8622 | 0 | 0.4885 | 595 | 0 | 66540 | 78 | 1223356 | 22 | 2017 | 29.7009 | 98587.2056 | 14.5195 | 2016.3542 | 1 | 68652 | 7 | 2016 | 0.2691 | 20.8455 | 0.3776 | 0.6105 | 18.6186 | 43789.6099 | 3.0523 | 0.4784 | 44195 | 146697762 | 21605 | 3000335 | 50 | 1088294 | 2017 | 20.8545 | 61560.7273 | 2016.2727 | 1 | 16851 | 2016 | 0.3181 | 5.1018 | 1.0495 | 13.5857 | 192932.4401 | 0.4495 | 1147 | 3385840 | 110895 |
6225 | 0 | NULL | NULL | British | resta | 60 | 3284 | 127 | 1 | 4 | 3 | 9 | 6 | 2017 | 27 | 7 | 6 | 2012 | 27 | 9 | 1 | 4 | 27 | 7 | 6 | 2012 | 27 | 9 | 1 | 4 | 25 | 5 | 6 | 2011 | 27 | 9 | 1 | 4 | 23 | 0 | 2017 | 12.4333 | 0 | 2012.1667 | 8 | 0 | 2011 | 1.155 | 0 | 2.0461 | 3.2068 | 0 | 1.2097 | 746 | 0 | 120730 | 78 | 7503980 | 24 | 2017 | 30.3097 | 103049.4318 | 10.7728 | 2012.0539 | 1 | 76855 | 1 | 2011 | 0.1989 | 44.5439 | 0.1348 | 1.8215 | 18.0742 | 144158.4761 | 3.5107 | 1.0439 | 99537 | 338414334 | 35378 | 6607585 | 57 | 123124 | 2017 | 25.1496 | 24084.2598 | 2011.9528 | 1 | 14538 | 2011 | 0.3166 | 8.1337 | 1.2541 | 13.8456 | 9973.774 | 0.9416 | 3194 | 3058701 | 255518 |
6226 | 0 | 6 | 77 | Finnish | bottas | 91 | 5153 | 183 | 1 | 5 | 17 | 9 | 6 | 2017 | 9 | 7 | 6 | 2016 | 31 | 9 | 1 | 5 | 9 | 7 | 6 | 2016 | 31 | 9 | 1 | 5 | 9 | 7 | 6 | 2016 | 31 | 9 | 1 | 5 | 18 | 1 | 2017 | 8.1868 | 0.02198 | 2014.8901 | 3 | 0 | 2013 | 1.0124 | 6.6307 | 0.04113 | 4.9976 | 0.1474 | 1.3618 | 745 | 2 | 183355 | 78 | 3689718 | 22 | 2017 | 30.4306 | 99678.869 | 8.0584 | 2014.8894 | 1 | 67847 | 1 | 2013 | 0.2271 | 32.1149 | 0.5698 | 0.05111 | 18.18 | 74208.5105 | 4.885 | 1.3584 | 156809 | 513645212 | 41525 | 10382725 | 59 | 2008595 | 2017 | 24.1093 | 68069.5082 | 2014.8087 | 1 | 14978 | 2013 | 0.3284 | 5.9016 | 0.07557 | 13.4151 | 242520.4324 | 1.3349 | 4412 | 12456720 | 368710 |
6227 | 0 | 10 | 55 | Spanish | sainz | 53 | 2690 | 97 | 1 | 4 | 17 | 9 | 6 | 2017 | 3 | 7 | 6 | 2016 | 29 | 9 | 1 | 3 | 30 | 5 | 6 | 2016 | 29 | 9 | 1 | 3 | 9 | 7 | 6 | 2016 | 29 | 9 | 1 | 3 | 16 | 0 | 2017 | 11.8679 | 0 | 2015.8868 | 7 | 0 | 2015 | 0.06274 | 0 | 0.2015 | 2.481 | 0 | 0.7761 | 629 | 0 | 106842 | 78 | 2122068 | 21 | 2017 | 29.7457 | 99861.4349 | 10.655 | 2015.8587 | 1 | 69150 | 3 | 2015 | 0.3078 | 24.5072 | 0.3527 | 0.2326 | 18.3133 | 61998.6099 | 3.5636 | 0.7425 | 80016 | 268627260 | 28662 | 5422660 | 51 | 2008828 | 2017 | 22.6289 | 84961.6907 | 2015.8247 | 1 | 14966 | 2015 | 0.3388 | 5.1188 | 0.2777 | 12.7919 | 297744.5008 | 0.7218 | 2195 | 8241284 | 195535 |
6228 | 0 | 13 | 26 | Russian | kvyat | 72 | 3585 | 145 | 1 | 5 | 17 | 9 | 6 | 2017 | 9 | 7 | 6 | 2016 | 31 | 9 | 1 | 4 | 30 | 7 | 6 | 2015 | 31 | 9 | 1 | 4 | 9 | 7 | 6 | 2016 | 31 | 9 | 1 | 4 | 22 | 0 | 2017 | 12.6111 | 0 | 2015.3889 | 7 | 0 | 2014 | -0.3133 | 0 | 0.08466 | 3.4214 | 0 | 1.0688 | 908 | 0 | 145108 | 78 | 3690660 | 22 | 2017 | 29.2739 | 100722.5623 | 11.2388 | 2015.3819 | 1 | 68061 | 2 | 2014 | 0.2624 | 33.405 | -0.07759 | 0.126 | 18.0505 | 79720.0053 | 4.0792 | 1.0596 | 104947 | 361090386 | 40291 | 7225144 | 59 | 2011266 | 2017 | 23.6207 | 64171.0069 | 2015.3586 | 1 | 16139 | 2014 | 0.2769 | 6.5459 | 0.07908 | 15.0675 | 236423.1748 | 1.0651 | 3425 | 9304796 | 292227 |
6229 rows x 98 columns
memory usage: 3.96 MB
name: featuretools_test
type: getml.DataFrame
We train an untuned XGBoostRegressor on top of featuretools' features, just like we have done for getML's features.
Since some of featuretools features are categorical, we allow the pipeline to include these features as well. Other features contain NaN values, which is why we also apply getML's Imputation preprocessor.
imputation = getml.preprocessors.Imputation()
predictor = getml.predictors.XGBoostClassifier(n_jobs=1)
pipe2 = getml.pipeline.Pipeline(
tags=['featuretools'],
preprocessors=[imputation],
predictors=[predictor],
include_categorical=True,
)
pipe2
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=True, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostClassifier'], preprocessors=['Imputation'], share_selected_features=0.5, tags=['featuretools'])
pipe2.fit(featuretools_train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 1 issues labeled INFO and 16 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:14
Trained pipeline.
Time taken: 0:00:14.526013.
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=True, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostClassifier'], preprocessors=['Imputation'], share_selected_features=0.5, tags=['featuretools'])
featuretools_score = pipe2.score(featuretools_test)
featuretools_score
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 12:54:50 | featuretools_train | win | 0.9717 | 0.9389 | 0.08364 |
1 | 2024-09-12 12:54:50 | featuretools_test | win | 0.9722 | 0.9153 | 0.08797 |
2.6 Studying features¶
We would like to understand why getML outperforms featuretools. In particular, getML's FastProp is based on an approach that is very similar to featuretools.
To investigate this matter, we first take a look at the importance of the features FastProp has learned:
names, importances = pipe1.features.importances(target_num=0)
plt.subplots(figsize=(20, 10))
plt.bar(names[:30], importances[:30])
plt.title("feature importances")
plt.grid(True)
plt.xlabel("column")
plt.ylabel("importance")
plt.xticks(rotation='vertical')
plt.show()
We take a look at the most important features, to get an idea where the predictive power comes from:
pipe1.features.to_sql().find(names[0])[0]
DROP TABLE IF EXISTS "T4__DRIVERREF__MAPPING_TARGET_1_AVG";
CREATE TABLE "T4__DRIVERREF__MAPPING_TARGET_1_AVG"("key" TEXT, "value" REAL);
INSERT INTO "T4__DRIVERREF__MAPPING_TARGET_1_AVG" ("key", "value")
VALUES('fangio', 0.3653846153846154),
('ascari', 0.3055555555555556),
('michael_schumacher', 0.3032786885245902),
('hamilton', 0.2981366459627329),
('stewart', 0.2948717948717949),
('prost', 0.264367816091954),
('vettel', 0.2467532467532468),
('clark', 0.2394366197183098),
('senna', 0.2323943661971831),
('damon_hill', 0.2159090909090909),
('moss', 0.1866666666666667),
('mansell', 0.1569767441860465),
('lauda', 0.1390728476821192),
('alonso', 0.1282051282051282),
('hakkinen', 0.1278195488721804),
('rosberg', 0.10625),
('hunt', 0.1058823529411765),
('jack_brabham', 0.1052631578947368),
('piquet', 0.09714285714285714),
('farina', 0.09523809523809523),
('jones', 0.09259259259259259),
('brooks', 0.09090909090909091),
('scheckter', 0.09),
('montoya', 0.08536585365853659),
('emerson_fittipaldi', 0.08403361344537816),
('rindt', 0.08333333333333333),
('hulme', 0.08247422680412371),
('peterson', 0.08247422680412371),
('reutemann', 0.07936507936507936),
('mario_andretti', 0.07142857142857142),
('raikkonen', 0.0684931506849315),
('villeneuve', 0.06153846153846154),
('surtees', 0.0576923076923077),
('hill', 0.05194805194805195),
('gilles_villeneuve', 0.05),
('berger', 0.05),
('keke_rosberg', 0.04901960784313725),
('hawthorn', 0.04878048780487805),
('ickx', 0.04878048780487805),
('gurney', 0.04819277108433735),
('collins', 0.04761904761904762),
('arnoux', 0.04697986577181208),
('massa', 0.04587155963302753),
('button', 0.04511278195488722),
('pironi', 0.04347826086956522),
('coulthard', 0.0427807486631016),
('revson', 0.0425531914893617),
('ricciardo', 0.0425531914893617),
('webber', 0.04216867469879518),
('ralf_schumacher', 0.03846153846153846),
('barrichello', 0.036),
('phil_hill', 0.03448275862068965),
('rodriguez', 0.03333333333333333),
('scarfiotti', 0.03333333333333333),
('laffite', 0.03311258278145696),
('regazzoni', 0.03225806451612903),
('jabouille', 0.03076923076923077),
('baghetti', 0.02941176470588235),
('flaherty', 0.02941176470588235),
('watson', 0.02898550724637681),
('mclaren', 0.02803738317757009),
('gethin', 0.02777777777777778),
('musso', 0.02702702702702703),
('cevert', 0.02631578947368421),
('ruttman', 0.025),
('irvine', 0.025),
('max_verstappen', 0.02380952380952381),
('hanks', 0.02325581395348837),
('alboreto', 0.02298850574712644),
('boutsen', 0.02290076335877863),
('herbert', 0.02222222222222222),
('parsons', 0.02083333333333333),
('bryan', 0.02040816326530612),
('trintignant', 0.02),
('kubica', 0.0196078431372549),
('patrese', 0.01951219512195122),
('rathmann', 0.01923076923076923),
('bandini', 0.01851851851851852),
('tambay', 0.01834862385321101),
('ginther', 0.01639344262295082),
('fisichella', 0.01621621621621622),
('ward', 0.01612903225806452),
('frentzen', 0.01574803149606299),
('brambilla', 0.01492537313432836),
('bottas', 0.01470588235294118),
('pace', 0.0136986301369863),
('maldonado', 0.01298701298701299),
('beltoise', 0.01282051282051282),
('siffert', 0.01265822784810127),
('depailler', 0.0108695652173913),
('angelis', 0.0108695652173913),
('bonnier', 0.009009009009009009),
('mass', 0.008771929824561403),
('panis', 0.007692307692307693),
('darter', 0),
('pretorius', 0),
('adamich', 0),
('trevor_taylor', 0),
('love', 0),
('anderson', 0),
('bianchi', 0),
('courage', 0),
('tingle', 0),
('attwood', 0),
('spence', 0),
('resta', 0),
('chiron', 0),
('reece', 0),
('linden', 0),
('agabashian', 0),
('manzon', 0),
('rosier', 0),
('villoresi', 0),
('graffenried', 0),
('hulkenberg', 0),
('petrov', 0),
('bruno_senna', 0),
('daywalt', 0),
('perez', 0),
('vergne', 0),
('pic', 0),
('chilton', 0),
('gutierrez', 0),
('jules_bianchi', 0),
('kevin_magnussen', 0),
('kvyat', 0),
('ericsson', 0),
('nasr', 0),
('sainz', 0),
('schell', 0),
('maggs', 0),
('gregory', 0),
('andre_pilette', 0),
('beaufort', 0),
('burgess', 0),
('salvadori', 0),
('trips', 0),
('herrmann', 0),
('scarlatti', 0),
('menditeguy', 0),
('gonzalez', 0),
('ireland', 0),
('thomson', 0),
('johnson', 0),
('ertl', 0),
('hartley', 0),
('stevenson', 0),
('freeland', 0),
('bettenhausen', 0),
('boyd', 0),
('gould', 0),
('behra', 0),
('paul_russo', 0),
('martini', 0),
('larini', 0),
('katayama', 0),
('morbidelli', 0),
('lamy', 0),
('brundle', 0),
('montermini', 0),
('blundell', 0),
('suzuki', 0),
('moreno', 0),
('wendlinger', 0),
('gachot', 0),
('magnussen', 0),
('tarquini', 0),
('comas', 0),
('bernard', 0),
('fittipaldi', 0),
('lehto', 0),
('cesaris', 0),
('alliot', 0),
('dalmas', 0),
('warwick', 0),
('capelli', 0),
('gugelmin', 0),
('rosa', 0),
('heidfeld', 0),
('kovalainen', 0),
('glock', 0),
('sato', 0),
('trulli', 0),
('sutil', 0),
('liuzzi', 0),
('wurz', 0),
('speed', 0),
('albers', 0),
('klien', 0),
('grouillard', 0),
('karthikeyan', 0),
('zonta', 0),
('gene', 0),
('verstappen', 0),
('alesi', 0),
('salo', 0),
('diniz', 0),
('buemi', 0),
('badoer', 0),
('zanardi', 0),
('hoffmann', 0),
('henton', 0),
('daly', 0),
('villota', 0),
('keegan', 0),
('rebaque', 0),
('merzario', 0),
('stuck', 0),
('lunger', 0),
('stommelen', 0),
('ian_scheckter', 0),
('pryce', 0),
('jarier', 0),
('oliver', 0),
('amon', 0),
('pescarolo', 0),
('wilson_fittipaldi', 0),
('keizan', 0),
('charlton', 0),
('hailwood', 0),
('ganley', 0),
('redman', 0),
('schenken', 0),
('palmer', 0),
('modena', 0),
('caffi', 0),
('lammers', 0),
('satoru_nakajima', 0),
('johansson', 0),
('nannini', 0),
('schneider', 0),
('giacomelli', 0),
('alguersuari', 0),
('grosjean', 0),
('kobayashi', 0),
('wisell', 0),
('danner', 0),
('cheever', 0),
('ghinzani', 0),
('streiff', 0),
('fabi', 0),
('surer', 0),
('manfred_winkelhock', 0),
('baldi', 0),
('serra', 0),
('salazar', 0);
ALTER TABLE "POPULATION__STAGING_TABLE_1" ADD COLUMN "t4__driverref__mapping_target_1_avg" REAL;
UPDATE "POPULATION__STAGING_TABLE_1" SET "t4__driverref__mapping_target_1_avg" = 0.0;
UPDATE "POPULATION__STAGING_TABLE_1"
SET "t4__driverref__mapping_target_1_avg" = "T4__DRIVERREF__MAPPING_TARGET_1_AVG"."value"
FROM "T4__DRIVERREF__MAPPING_TARGET_1_AVG"
WHERE "POPULATION__STAGING_TABLE_1"."t4__driverref" = "T4__DRIVERREF__MAPPING_TARGET_1_AVG"."key";
DROP TABLE IF EXISTS "T4__DRIVERREF__MAPPING_TARGET_1_AVG";
pipe1.features.to_sql()[names[1]]
DROP TABLE IF EXISTS "FEATURE_1_75";
CREATE TABLE "FEATURE_1_75" AS
SELECT EWMA_90D( t2."win", t1."date" - t2."date__1_000000_days" ) AS "feature_1_75",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "DRIVER_STANDINGS__STAGING_TABLE_2" t2
ON t1."driverid" = t2."driverid"
WHERE t2."date__1_000000_days" <= t1."date"
GROUP BY t1.rowid;
pipe1.features.to_sql()[names[2]]
DROP TABLE IF EXISTS "FEATURE_1_78";
CREATE TABLE "FEATURE_1_78" AS
SELECT EWMA_365D( t2."win", t1."date" - t2."date__1_000000_days" ) AS "feature_1_78",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "DRIVER_STANDINGS__STAGING_TABLE_2" t2
ON t1."driverid" = t2."driverid"
WHERE t2."date__1_000000_days" <= t1."date"
GROUP BY t1.rowid;
What we can learn from these features is the following: Knowing which driver we are talking aboyt and who won the most recent races is the best predictor for whether a driver will win this race.
names, importances = pipe2.features.importances()
plt.subplots(figsize=(20, 10))
plt.bar(names[:30], importances[:30])
plt.title("feature importances")
plt.grid(True)
plt.xlabel("column")
plt.ylabel("importance")
plt.xticks(rotation='vertical')
plt.show()
As we can see, featuretools generally builds features that are similar to the features identified by getML's FastProp.
2.7 Productionization¶
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3
and spark
modules.
# Creates a folder named formula1_pipeline containing
# the SQL code.
pipe1.features.to_sql().save("formula1_pipeline")
pipe1.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("formula1_spark")
2.8 Discussion¶
For a more convenient overview, we summarize our results into a table.
scores = [fastprop_score, featuretools_score]
pd.DataFrame(data={ 'Name': ['getML: FastProp', 'featuretools'],
'Accuracy': [f'{score.accuracy:.2%}' for score in scores],
'AUC': [f'{score.auc:,.2%}' for score in scores],
'Cross entropy': [f'{score.cross_entropy:,.5f}' for score in scores]
})
Name | Accuracy | AUC | Cross entropy | |
---|---|---|---|---|
0 | getML: FastProp | 97.24% | 92.35% | 0.08533 |
1 | featuretools | 97.22% | 91.53% | 0.08797 |
As we can see, getML's FastProp outperforms featuretools according to all three measures.
getml.engine.shutdown()
3. Conclusion¶
We have benchmarked getML against featuretools on dataset related to Formula 1 races. We have found that getML's FastProp outperforms featuretools.
References¶
Motl, Jan, and Oliver Schulte. "The CTU prague relational learning repository." arXiv preprint arXiv:1511.03086 (2015).