Propositionalization: Traffic near Dodgers' stadium

In this notebook, we compare getML's FastProp against the well-known feature engineering libraries featuretools and tsfresh.

Summary:

Prediction type: Regression model
Domain: Transportation
Prediction target: traffic volume
Source data: Univariate time series
Population size: 47497

Background

A common approach to feature engineering is to generate attribute-value representations from relational data by applying a fixed set of aggregations to columns of interest and then performing feature selection on the (possibly large) set of generated features. In academia, this approach is called propositionalization.
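
To make the idea concrete, here is a minimal pandas sketch of propositionalization on made-up toy data. The table names, the column names and the choice of aggregations are assumptions for illustration only; this is not how FastProp is implemented.

```python
import pandas as pd

# Toy relational data: one population row per customer, many peripheral rows.
population = pd.DataFrame({"customer_id": [1, 2]})
peripheral = pd.DataFrame(
    {
        "customer_id": [1, 1, 1, 2, 2],
        "amount": [10.0, 20.0, 30.0, 5.0, 15.0],
    }
)

# Apply a fixed set of aggregations to the column of interest.
aggregations = ["mean", "min", "max", "count"]
features = peripheral.groupby("customer_id")["amount"].agg(aggregations)
features.columns = [f"amount_{name}" for name in aggregations]

# Join the aggregates back: every population row now has a flat feature vector.
flat = population.join(features, on="customer_id")
print(flat)
```

A feature selection step would then keep only the most useful of the generated columns.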

getML's FastProp is an implementation of this propositionalization approach that has been optimized for speed and memory efficiency. In this notebook, we want to demonstrate how – well – fast FastProp is. To this end, we will benchmark FastProp against the popular feature engineering libraries featuretools and tsfresh. Both of these libraries use propositionalization approaches for feature engineering.

In this notebook, we use traffic data that was collected for the Glendale on-ramp to the 101 North freeway in Los Angeles. For further details about the data set, refer to the full notebook.

Let's get started with the analysis and set up the session:

In [1]:

%pip install -q "getml==1.5.0" "featuretools==1.31.0" "tsfresh==0.20.3"

Note: you may need to restart the kernel to use updated packages.

In [2]:

import os
import sys

os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
from pathlib import Path
from urllib import request
import getml
import pandas as pd

print(f"getML API version: {getml.__version__}\n")

getML API version: 1.5.0

In [3]:

# If we are in Colab, we need to fetch the utils folder from the repository
if os.getenv("COLAB_RELEASE_TAG"):
    !curl -L https://api.github.com/repos/getml/getml-demo/tarball/master | tar --wildcards --strip-components=1 -xz '*utils*'

In [4]:

parent = Path(os.getcwd()).parent.as_posix()

if parent not in sys.path:
    sys.path.append(parent)

from utils import Benchmark, FTTimeSeriesBuilder, TSFreshBuilder

In [5]:

getml.engine.launch(allow_remote_ips=True, token="token")
getml.engine.set_project("dodgers")

getML Engine is already running.

Connected to project 'dodgers'.

1. Loading data

1.1 Download from source

We begin by downloading the data from the UC Irvine Machine Learning repository:

In [6]:

fname = "Dodgers.data"

if not os.path.exists(fname):
    fname, res = request.urlretrieve(
        "https://archive.ics.uci.edu/ml/machine-learning-databases/event-detection/"
        + fname,
        fname,
    )

data_full_pandas = pd.read_csv(fname, header=None)
data_full_pandas.columns = ["ds", "y"]

In [7]:

data_full_pandas["ds"] = pd.to_datetime(data_full_pandas["ds"], format="%m/%d/%Y %H:%M")

In [8]:

data_full_pandas

Out[8]:

	ds	y
0	2005-04-10 00:00:00	-1
1	2005-04-10 00:05:00	-1
2	2005-04-10 00:10:00	-1
3	2005-04-10 00:15:00	-1
4	2005-04-10 00:20:00	-1
...	...	...
50395	2005-10-01 23:35:00	-1
50396	2005-10-01 23:40:00	-1
50397	2005-10-01 23:45:00	-1
50398	2005-10-01 23:50:00	-1
50399	2005-10-01 23:55:00	-1

50400 rows × 2 columns

1.2 Prepare data for getML

In [9]:

data_full = getml.data.DataFrame.from_pandas(data_full_pandas, "data_full")

In [10]:

data_full.set_role("y", getml.data.roles.target)
data_full.set_role("ds", getml.data.roles.time_stamp)

In [11]:

data_full

Out[11]:

name	ds	y
role	time_stamp	target
unit	time stamp, comparison only
0	2005-04-10	-1
1	2005-04-10 00:05:00	-1
2	2005-04-10 00:10:00	-1
3	2005-04-10 00:15:00	-1
4	2005-04-10 00:20:00	-1
	...	...
50395	2005-10-01 23:35:00	-1
50396	2005-10-01 23:40:00	-1
50397	2005-10-01 23:45:00	-1
50398	2005-10-01 23:50:00	-1
50399	2005-10-01 23:55:00	-1

50400 rows x 2 columns
memory usage: 0.81 MB
name: data_full
type: getml.DataFrame

In [12]:

split = getml.data.split.time(
    population=data_full, time_stamp="ds", test=getml.data.time.datetime(2005, 8, 20)
)
split

Out[12]:


0	train
1	train
2	train
3	train
4	train
	...

50400 rows
type: StringColumnView
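
For intuition, the time-based split can be imitated in plain pandas. The sketch below uses a made-up three-row frame and assumes that rows at or after the cutoff land in the test set; the boundary handling is an assumption and is not taken from the getML output above.

```python
import pandas as pd

# Toy frame with the same column names as the notebook ("ds", "y").
toy = pd.DataFrame(
    {
        "ds": pd.to_datetime(
            ["2005-08-19 23:55", "2005-08-20 00:00", "2005-08-20 00:05"]
        ),
        "y": [12, 7, 9],
    }
)

cutoff = pd.Timestamp(2005, 8, 20)

# Assumption: rows at or after the cutoff belong to the test set.
toy["split"] = (toy["ds"] >= cutoff).map({True: "test", False: "train"})
print(toy)
```

Splitting by time rather than at random keeps all test rows strictly later than the training rows, which is what a forecasting setup needs.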

1.3 Define relational model

To start with relational learning, we need to specify the data model. We manually replicate the appropriate time series structure by setting time-series-related join conditions (horizon, memory and lagged_targets). This is done abstractly using placeholders.

The data model consists of two tables:

Population table traffic_{test/train}: holds the target and the contemporaneously available time-based components
Peripheral table traffic: the same table as the population table
The join between both placeholders specifies a horizon, which prevents leaks, and a memory, which keeps the computations feasible
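
The effect of horizon and memory on a self-join can be sketched in pandas. The helper `visible_rows` is hypothetical, the series is made up, and the exact inclusivity of the window boundaries is an assumption; the sketch only illustrates which past rows a feature may aggregate over for a given prediction time.

```python
import pandas as pd

horizon = pd.Timedelta(hours=1)
memory = pd.Timedelta(hours=2)

# Toy series at 5-minute intervals, like the traffic data.
series = pd.DataFrame(
    {"ds": pd.date_range("2005-04-10 00:00", periods=60, freq="5min")}
)
series["y"] = range(len(series))


def visible_rows(frame, t):
    """Return the rows a feature for prediction time t may aggregate over."""
    newest = t - horizon      # the horizon hides anything more recent (no leaks)
    oldest = newest - memory  # the memory drops anything older (bounded cost)
    mask = (frame["ds"] > oldest) & (frame["ds"] <= newest)
    return frame.loc[mask]


t = pd.Timestamp("2005-04-10 04:00")
window = visible_rows(series, t)

# With lagged targets allowed, past values of y inside the window may be aggregated.
lagged_mean = window["y"].mean()
print(len(window), window["ds"].min(), window["ds"].max(), lagged_mean)
```

With a memory of 2 hours and 5-minute sampling, each prediction sees at most 24 past rows, which is what keeps the join small.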

In [13]:

# 1. The horizon is 1 hour (we predict the traffic volume in one hour).
# 2. The memory is 2 hours, so we allow the algorithm to
#    use information from up to 2 hours ago.
# 3. We allow lagged targets. Thus, the algorithm can
#    identify autoregressive processes.

time_series = getml.data.TimeSeries(
    population=data_full,
    alias="population",
    split=split,
    time_stamps="ds",
    horizon=getml.data.time.hours(1),
    memory=getml.data.time.hours(2),
    lagged_targets=True,
)

time_series

Out[13]:

data model

diagram

staging

	data frames	staging table
0	population	POPULATION__STAGING_TABLE_1
1	data_full	DATA_FULL__STAGING_TABLE_2

container

population

	subset	name	rows	type
0	test	data_full	unknown	View
1	train	data_full	unknown	View

peripheral

	name	rows	type
0	data_full	50400	DataFrame

2. Predictive modeling

We loaded the data, defined the roles, units and the abstract data model. Next, we create a getML pipeline for relational learning.

2.1 Propositionalization with getML's FastProp

In [14]:

seasonal = getml.preprocessors.Seasonal()

fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    num_threads=1,
)
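
The Seasonal preprocessor derives calendar components from time stamps. A rough pandas analogue is sketched below, assuming components such as hour, minute, weekday and month. The column names are hypothetical and the engine's own naming and selection of components may differ.

```python
import pandas as pd

stamps = pd.DataFrame(
    {"ds": pd.to_datetime(["2005-04-10 00:05", "2005-08-20 17:30"])}
)

# Hypothetical column names, chosen for this sketch only.
stamps["hour"] = stamps["ds"].dt.hour
stamps["minute"] = stamps["ds"].dt.minute
stamps["weekday"] = stamps["ds"].dt.dayofweek  # Monday = 0, Sunday = 6
stamps["month"] = stamps["ds"].dt.month
print(stamps)
```

Components like these let a model pick up daily and weekly traffic patterns without any lagged values.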

Build the pipeline

In [15]:

pipe_fp_fl = getml.pipeline.Pipeline(
    preprocessors=[seasonal],
    feature_learners=[fast_prop],
    data_model=time_series.data_model,
    tags=["feature learning", "fastprop"],
)

pipe_fp_fl

Out[15]:

Pipeline(data_model='population',
         feature_learners=['FastProp'],
         feature_selectors=[],
         include_categorical=False,
         loss_function='SquareLoss',
         peripheral=['data_full'],
         predictors=[],
         preprocessors=['Seasonal'],
         share_selected_features=0.5,
         tags=['feature learning', 'fastprop'])

In [16]:

pipe_fp_fl.check(time_series.train)

Checking data model...

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01

OK.

In [17]:

benchmark = Benchmark()
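
The Benchmark helper comes from the repository's utils folder and is not shown in this notebook. Judging only from how it is used here (called with a name inside a with statement, with a runtimes mapping read later), a stand-in could look like the sketch below. `SimpleBenchmark` is a hypothetical name and this is not the utils implementation.

```python
import time
from contextlib import contextmanager
from datetime import timedelta


class SimpleBenchmark:
    """Hypothetical stand-in that records wall-clock runtimes by name."""

    def __init__(self):
        self.runtimes = {}

    @contextmanager
    def __call__(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Store the elapsed time even if the timed block raises.
            elapsed = time.perf_counter() - start
            self.runtimes[name] = timedelta(seconds=elapsed)


bench = SimpleBenchmark()
with bench("sleep"):
    time.sleep(0.01)

print(bench.runtimes["sleep"])
```

Storing timedelta objects matches the later use of total_seconds() in the comparison cell.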

In [18]:

with benchmark("fastprop"):
    pipe_fp_fl.fit(time_series.train)
    fastprop_train = pipe_fp_fl.transform(time_series.train, df_name="fastprop_train")

Checking data model...

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

OK.

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  FastProp: Trying 526 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:08

Trained pipeline.

Time taken: 0:00:08.574815.

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:03

In [19]:

fastprop_test = pipe_fp_fl.transform(time_series.test, df_name="fastprop_test")

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01

In [20]:

predictor = getml.predictors.XGBoostRegressor()

pipe_fp_pr = getml.pipeline.Pipeline(
    tags=["prediction", "fastprop"], predictors=[predictor]
)

In [21]:

pipe_fp_pr.fit(fastprop_train)

Checking data model...

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

OK.

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:09

Trained pipeline.

Time taken: 0:00:09.859138.

Out[21]:

Pipeline(data_model='population',
         feature_learners=[],
         feature_selectors=[],
         include_categorical=False,
         loss_function='SquareLoss',
         peripheral=[],
         predictors=['XGBoostRegressor'],
         preprocessors=[],
         share_selected_features=0.5,
         tags=['prediction', 'fastprop'])

In [22]:

pipe_fp_pr.score(fastprop_test)

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

Out[22]:

	date time	set used	target	mae	rmse	rsquared
0	2024-09-13 13:03:14	fastprop_train	y	5.4188	7.5347	0.699
1	2024-09-13 13:03:15	fastprop_test	y	5.6151	7.8243	0.6747
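
The score table reports mae, rmse and rsquared. As a rough sketch of what these metrics measure, the block below computes them on made-up arrays. It assumes that rsquared is the squared Pearson correlation between predictions and targets; that definition is an assumption here, and the engine computes the actual values shown above.

```python
import numpy as np

# Made-up targets and predictions, for illustration only.
y_true = np.array([10.0, 12.0, 9.0, 15.0])
y_pred = np.array([11.0, 11.0, 10.0, 13.0])

errors = y_pred - y_true

mae = np.mean(np.abs(errors))        # mean absolute error
rmse = np.sqrt(np.mean(errors**2))   # root mean squared error

# Assumption: R-squared taken as the squared Pearson correlation.
rsquared = np.corrcoef(y_true, y_pred)[0, 1] ** 2

print(mae, rmse, rsquared)
```

Lower mae and rmse are better, while a higher rsquared is better.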

2.2 Propositionalization with featuretools

In [23]:

data_train = time_series.train.population.to_df("data_train")
data_test = time_series.test.population.to_df("data_test")

In [24]:

dfs_pandas = {}

for df in getml.project.data_frames:
    dfs_pandas[df.name] = df.to_pandas()
    dfs_pandas[df.name]["id"] = 1

In [25]:

ft_builder = FTTimeSeriesBuilder(
    num_features=200,
    horizon=pd.Timedelta(hours=1),
    memory=pd.Timedelta(hours=2),
    column_id="id",
    time_stamp="ds",
    target="y",
    allow_lagged_targets=True,
)

In [26]:

with benchmark("featuretools"):
    featuretools_train = ft_builder.fit(dfs_pandas["data_train"])

featuretools_test = ft_builder.transform(dfs_pandas["data_test"])

featuretools: Trying features...
Selecting the best out of 118 features...
Time taken: 0h:8m:51.202135

In [27]:

df_featuretools_train = getml.data.DataFrame.from_pandas(
    featuretools_train, name="featuretools_train", roles=data_train.roles
)

df_featuretools_test = getml.data.DataFrame.from_pandas(
    featuretools_test, name="featuretools_test", roles=data_train.roles
)

In [28]:

df_featuretools_train.set_role(
    df_featuretools_train.roles.unused, getml.data.roles.numerical
)

df_featuretools_test.set_role(
    df_featuretools_test.roles.unused, getml.data.roles.numerical
)

In [29]:

predictor = getml.predictors.XGBoostRegressor()

pipe_ft_pr = getml.pipeline.Pipeline(
    tags=["prediction", "featuretools"], predictors=[predictor]
)

pipe_ft_pr

Out[29]:

Pipeline(data_model='population',
         feature_learners=[],
         feature_selectors=[],
         include_categorical=False,
         loss_function='SquareLoss',
         peripheral=[],
         predictors=['XGBoostRegressor'],
         preprocessors=[],
         share_selected_features=0.5,
         tags=['prediction', 'featuretools'])

In [30]:

pipe_ft_pr.check(df_featuretools_train)

Checking data model...

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

The pipeline check generated 0 issues labeled INFO and 1 issues labeled WARNING.

Out[30]:

	type	label	message
0	WARNING	COLUMN SHOULD BE UNUSED	All non-NULL entries in column 'id' in POPULATION__STAGING_TABLE_1 are equal to each other. You should consider setting its role to unused_float or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only').

In [31]:

pipe_ft_pr.fit(df_featuretools_train)

Checking data model...

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

The pipeline check generated 0 issues labeled INFO and 1 issues labeled WARNING.

To see the issues in full, run .check() on the pipeline.

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:03

Trained pipeline.

Time taken: 0:00:03.563567.

Out[31]:

Pipeline(data_model='population',
         feature_learners=[],
         feature_selectors=[],
         include_categorical=False,
         loss_function='SquareLoss',
         peripheral=[],
         predictors=['XGBoostRegressor'],
         preprocessors=[],
         share_selected_features=0.5,
         tags=['prediction', 'featuretools'])

In [32]:

pipe_ft_pr.score(df_featuretools_test)

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

Out[32]:

	date time	set used	target	mae	rmse	rsquared
0	2024-09-13 13:15:05	featuretools_train	y	5.4482	7.568	0.6962
1	2024-09-13 13:15:06	featuretools_test	y	6.0863	8.5009	0.6498

2.3 Propositionalization with tsfresh

In [33]:

tsfresh_builder = TSFreshBuilder(
    num_features=200,
    horizon=20,
    memory=60,
    column_id="id",
    time_stamp="ds",
    target="y",
    allow_lagged_targets=True,
)

In [34]:

with benchmark("tsfresh"):
    tsfresh_train = tsfresh_builder.fit(dfs_pandas["data_train"])

tsfresh_test = tsfresh_builder.transform(dfs_pandas["data_test"])

Rolling: 100%|██████████| 40/40 [00:12<00:00,  3.18it/s]
Feature Extraction: 100%|██████████| 40/40 [00:06<00:00,  6.26it/s]
Feature Extraction: 100%|██████████| 40/40 [00:06<00:00,  6.23it/s]

Selecting the best out of 13 features...
Time taken: 0h:0m:31.919565

Rolling: 100%|██████████| 40/40 [00:03<00:00, 11.31it/s]
Feature Extraction: 100%|██████████| 40/40 [00:01<00:00, 20.24it/s]
Feature Extraction: 100%|██████████| 40/40 [00:02<00:00, 19.32it/s]

In [35]:

df_tsfresh_train = getml.data.DataFrame.from_pandas(
    tsfresh_train, name="tsfresh_train", roles=data_train.roles
)

df_tsfresh_test = getml.data.DataFrame.from_pandas(
    tsfresh_test, name="tsfresh_test", roles=data_train.roles
)

In [36]:

df_tsfresh_train.set_role(df_tsfresh_train.roles.unused, getml.data.roles.numerical)

df_tsfresh_test.set_role(df_tsfresh_test.roles.unused, getml.data.roles.numerical)

In [37]:

pipe_tsf_pr = getml.pipeline.Pipeline(
    tags=["predicition", "tsfresh"], predictors=[predictor]
)

pipe_tsf_pr

Out[37]:

Pipeline(data_model='population',
         feature_learners=[],
         feature_selectors=[],
         include_categorical=False,
         loss_function='SquareLoss',
         peripheral=[],
         predictors=['XGBoostRegressor'],
         preprocessors=[],
         share_selected_features=0.5,
         tags=['predicition', 'tsfresh'])

In [38]:

pipe_tsf_pr.fit(df_tsfresh_train)

Checking data model...

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

The pipeline check generated 0 issues labeled INFO and 1 issues labeled WARNING.

To see the issues in full, run .check() on the pipeline.

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01

Trained pipeline.

Time taken: 0:00:01.448143.

Out[38]:

Pipeline(data_model='population',
         feature_learners=[],
         feature_selectors=[],
         include_categorical=False,
         loss_function='SquareLoss',
         peripheral=[],
         predictors=['XGBoostRegressor'],
         preprocessors=[],
         share_selected_features=0.5,
         tags=['predicition', 'tsfresh'])

In [39]:

pipe_tsf_pr.score(df_tsfresh_test)

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

Out[39]:

	date time	set used	target	mae	rmse	rsquared
0	2024-09-13 13:15:49	tsfresh_train	y	6.3146	8.2348	0.6418
1	2024-09-13 13:15:49	tsfresh_test	y	6.7886	8.9134	0.5778

3. Comparison

In [40]:

num_features = dict(
    fastprop=526,
    featuretools=59,
    tsfresh=12,
)

runtime_per_feature = [
    benchmark.runtimes["fastprop"] / num_features["fastprop"],
    benchmark.runtimes["featuretools"] / num_features["featuretools"],
    benchmark.runtimes["tsfresh"] / num_features["tsfresh"],
]

features_per_second = [1.0 / r.total_seconds() for r in runtime_per_feature]

normalized_runtime_per_feature = [
    r / runtime_per_feature[0] for r in runtime_per_feature
]

comparison = pd.DataFrame(
    dict(
        runtime=[
            benchmark.runtimes["fastprop"],
            benchmark.runtimes["featuretools"],
            benchmark.runtimes["tsfresh"],
        ],
        num_features=num_features.values(),
        features_per_second=features_per_second,
        normalized_runtime=[
            1,
            benchmark.runtimes["featuretools"] / benchmark.runtimes["fastprop"],
            benchmark.runtimes["tsfresh"] / benchmark.runtimes["fastprop"],
        ],
        normalized_runtime_per_feature=normalized_runtime_per_feature,
        rsquared=[pipe_fp_pr.rsquared, pipe_ft_pr.rsquared, pipe_tsf_pr.rsquared],
        rmse=[pipe_fp_pr.rmse, pipe_ft_pr.rmse, pipe_tsf_pr.rmse],
        mae=[pipe_fp_pr.mae, pipe_ft_pr.mae, pipe_tsf_pr.mae],
    )
)

comparison.index = ["getML: FastProp", "featuretools", "tsfresh"]

In [41]:

comparison

Out[41]:

	runtime	num_features	features_per_second	normalized_runtime	normalized_runtime_per_feature	rsquared	rmse	mae
getML: FastProp	0 days 00:00:12.106112	526	43.449924	1.000000	1.000000	0.674740	7.824265	5.615102
featuretools	0 days 00:08:51.202688	59	0.111069	43.878884	391.198566	0.649768	8.500887	6.086277
tsfresh	0 days 00:00:31.919755	12	0.375943	2.636664	115.575929	0.577811	8.913408	6.788610
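
The derived columns can be re-computed by hand from the runtime and num_features columns of the table above. The sketch below uses only the standard library and the printed values. Small differences in the last digits relative to the table are expected because of rounding.

```python
from datetime import timedelta

# Values copied from the runtime and num_features columns of the table.
runtimes = {
    "fastprop": timedelta(seconds=12.106112),
    "featuretools": timedelta(minutes=8, seconds=51.202688),
    "tsfresh": timedelta(seconds=31.919755),
}
num_features = {"fastprop": 526, "featuretools": 59, "tsfresh": 12}

seconds_per_feature = {
    name: runtimes[name].total_seconds() / num_features[name] for name in runtimes
}
features_per_second = {name: 1.0 / s for name, s in seconds_per_feature.items()}

# Runtime per feature relative to FastProp (FastProp itself is 1.0).
relative_cost = {
    name: s / seconds_per_feature["fastprop"]
    for name, s in seconds_per_feature.items()
}

for name in runtimes:
    print(
        f"{name:>12}: {features_per_second[name]:8.3f} features/s, "
        f"{relative_cost[name]:7.1f}x runtime per feature"
    )
```

Normalizing by the number of features matters here because the three libraries generate very different numbers of features in their respective runtimes.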

In [42]:

# export for further use; create the target folder first if it does not exist
Path("comparisons").mkdir(exist_ok=True)
comparison.to_csv("comparisons/dodgers.csv")

In [ ]:

getml.engine.shutdown()