Propositionalization: Interstate 94¶
In this notebook, we compare getML's FastProp against the well-known feature engineering libraries featuretools and tsfresh.
Summary:
- Prediction type: Regression model
- Domain: Transportation
- Prediction target: Hourly traffic volume
- Source data: Multivariate time series, 5 components
- Population size: 24096
Background¶
A common approach to feature engineering is to generate attribute-value representations from relational data by applying a fixed set of aggregations to columns of interest and then performing feature selection on the (possibly large) set of generated features. In academia, this approach is called propositionalization.
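As a toy illustration of this idea, the following pandas sketch (hypothetical data and aggregations, not how any of the libraries benchmarked here are implemented) collapses a peripheral table into the population table by applying a fixed set of aggregations per join key:
import pandas as pd

# Hypothetical relational data: one population row per customer,
# many peripheral rows (e.g. transactions) per customer.
population = pd.DataFrame({"customer_id": [1, 2], "target": [0, 1]})
peripheral = pd.DataFrame(
    {"customer_id": [1, 1, 2, 2, 2], "amount": [10.0, 20.0, 5.0, 7.0, 3.0]}
)

# Apply a fixed set of aggregations per join key ...
aggregated = (
    peripheral.groupby("customer_id")["amount"]
    .agg(["mean", "sum", "max", "count"])
    .reset_index()
)

# ... and merge the resulting attribute-value representation onto the population table.
flat_table = population.merge(aggregated, on="customer_id", how="left")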
getML's FastProp is an implementation of this propositionalization approach that has been optimized for speed and memory efficiency. In this notebook, we want to demonstrate how – well – fast FastProp is. To this end, we will benchmark FastProp against the popular feature engineering libraries featuretools and tsfresh. Both of these libraries use propositionalization approaches for feature engineering.
In this notebook, we predict the hourly traffic volume on I-94 westbound from Minneapolis-St Paul. The analysis is built on top of a dataset provided by the MN Department of Transportation, with some data preparation done by John Hogue. For further details about the dataset, refer to the full notebook.
Analysis¶
Let's get started with the analysis and set up the session:
%pip install -q "getml==1.5.0" "featuretools==1.31.0"
Note: you may need to restart the kernel to use updated packages.
import os
import sys
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
from pathlib import Path
import pandas as pd
import getml
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
getml.engine.launch(allow_remote_ips=True, token="token")
getml.engine.set_project("interstate94")
getML Engine is already running.
Connected to project 'interstate94'.
# If we are in Colab, we need to fetch the utils folder from the repository
if os.getenv("COLAB_RELEASE_TAG"):
    !curl -L https://api.github.com/repos/getml/getml-demo/tarball/master | tar --wildcards --strip-components=1 -xz '*utils*'

parent = Path(os.getcwd()).parent.as_posix()

if parent not in sys.path:
    sys.path.append(parent)
from utils import Benchmark, FTTimeSeriesBuilder
1. Loading data¶
1.1 Download from source¶
We begin by downloading the data from the UC Irvine Machine Learning Repository:
traffic = getml.datasets.load_interstate94(roles=True, units=True)
Downloading traffic... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 1.2/1.2 MB • 00:00
traffic.set_role(traffic.roles.categorical, getml.data.roles.unused_string)
traffic
name | ds | traffic_volume | holiday | day | month | weekday | hour | year |
---|---|---|---|---|---|---|---|---|
role | time_stamp | target | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
unit | time stamp, comparison only | | | day | month | weekday | hour | year |
0 | 2016-01-01 | 1513 | New Years Day | 1 | 1 | 4 | 0 | 2016 |
1 | 2016-01-01 01:00:00 | 1550 | New Years Day | 1 | 1 | 4 | 1 | 2016 |
2 | 2016-01-01 02:00:00 | 993 | New Years Day | 1 | 1 | 4 | 2 | 2016 |
3 | 2016-01-01 03:00:00 | 719 | New Years Day | 1 | 1 | 4 | 3 | 2016 |
4 | 2016-01-01 04:00:00 | 533 | New Years Day | 1 | 1 | 4 | 4 | 2016 |
... | ... | ... | ... | ... | ... | ... | ... | |
24091 | 2018-09-30 19:00:00 | 3543 | No holiday | 30 | 9 | 6 | 19 | 2018 |
24092 | 2018-09-30 20:00:00 | 2781 | No holiday | 30 | 9 | 6 | 20 | 2018 |
24093 | 2018-09-30 21:00:00 | 2159 | No holiday | 30 | 9 | 6 | 21 | 2018 |
24094 | 2018-09-30 22:00:00 | 1450 | No holiday | 30 | 9 | 6 | 22 | 2018 |
24095 | 2018-09-30 23:00:00 | 954 | No holiday | 30 | 9 | 6 | 23 | 2018 |
24096 rows x 8 columns
memory usage: 2.16 MB
name: traffic
type: getml.DataFrame
1.2 Define relational model¶
# Train on everything before March 15, 2018; test on the remainder.
split = getml.data.split.time(traffic, "ds", test=getml.data.time.datetime(2018, 3, 15))

time_series = getml.data.TimeSeries(
    population=traffic,
    split=split,
    alias="traffic",
    time_stamps="ds",
    horizon=getml.data.time.hours(1),   # predict one hour ahead
    memory=getml.data.time.hours(24),   # aggregate over a window of up to 24 hours
    lagged_targets=True,                # past traffic_volume values may be used as features
)
time_series
data frames | staging table | |
---|---|---|
0 | traffic | TRAFFIC__STAGING_TABLE_1 |
1 | traffic | TRAFFIC__STAGING_TABLE_2 |
subset | name | rows | type | |
---|---|---|---|---|
0 | test | traffic | unknown | View |
1 | train | traffic | unknown | View |
name | rows | type | |
---|---|---|---|
0 | traffic | 24096 | DataFrame |
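Roughly speaking, the TimeSeries container sets up a self-join of the traffic table onto itself: for each prediction time, only rows that lie at least horizon (1 hour) and at most horizon + memory (25 hours) in the past may be aggregated, and because lagged_targets=True these rows include past values of traffic_volume. The following pandas sketch of that window for a single prediction time is for illustration only and is not getML's implementation:
import pandas as pd

# Hypothetical hourly series standing in for the traffic table.
ts = pd.DataFrame(
    {"ds": pd.date_range("2018-01-01", periods=48, freq="h"), "traffic_volume": range(48)}
)

t = pd.Timestamp("2018-01-02 12:00:00")  # prediction time
horizon = pd.Timedelta(hours=1)          # predict one hour ahead
memory = pd.Timedelta(hours=24)          # aggregate over at most 24 hours beyond the horizon

# Rows eligible for aggregation lie in the window (t - horizon - memory, t - horizon].
window = ts[(ts["ds"] > t - horizon - memory) & (ts["ds"] <= t - horizon)]

# With lagged_targets=True, past target values themselves may be aggregated,
# e.g. the mean traffic volume over that window.
mean_last_24h = window["traffic_volume"].mean()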
2. Predictive modeling¶
We have loaded the data and defined the roles, units, and the abstract data model. Next, we create a getML pipeline for relational learning.
2.1 Propositionalization with getML's FastProp¶
# The Seasonal preprocessor extracts calendar components (such as hour,
# weekday, and month) from time stamp columns.
seasonal = getml.preprocessors.Seasonal()

fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    num_threads=1,
)
Build the pipeline
pipe_fp_fl = getml.pipeline.Pipeline(
preprocessors=[seasonal],
feature_learners=[fast_prop],
data_model=time_series.data_model,
tags=["feature learning", "fastprop"],
)
pipe_fp_fl
Pipeline(data_model='traffic', feature_learners=['FastProp'], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=['traffic'], predictors=[], preprocessors=['Seasonal'], share_selected_features=0.5, tags=['feature learning', 'fastprop'])
pipe_fp_fl.check(time_series.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
benchmark = Benchmark()
with benchmark("fastprop"):
pipe_fp_fl.fit(time_series.train)
fastprop_train = pipe_fp_fl.transform(time_series.train, df_name="fastprop_train")
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Trying 365 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:03
Trained pipeline.
Time taken: 0:00:03.058378. Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
fastprop_test = pipe_fp_fl.transform(time_series.test, df_name="fastprop_test")
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
predictor = getml.predictors.XGBoostRegressor()
pipe_fp_pr = getml.pipeline.Pipeline(
tags=["prediction", "fastprop"], predictors=[predictor]
)
pipe_fp_pr.fit(fastprop_train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:05
Trained pipeline.
Time taken: 0:00:05.192145.
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['prediction', 'fastprop'])
pipe_fp_pr.score(fastprop_test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | mae | rmse | rsquared | |
---|---|---|---|---|---|---|
0 | 2024-09-13 13:17:10 | fastprop_train | traffic_volume | 198.9482 | 292.2493 | 0.9779 |
1 | 2024-09-13 13:17:10 | fastprop_test | traffic_volume | 180.4867 | 261.9389 | 0.9827 |
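Beyond the aggregate scores, the fitted prediction pipeline can also return the hour-by-hour forecasts, for example for plotting. A minimal sketch (the variable names are ours, not part of the benchmark):
# Hour-by-hour forecasts on the test set.
predictions_fp = pipe_fp_pr.predict(fastprop_test)

# The observed values for comparison; the transformed frame keeps the target column.
actuals_fp = fastprop_test.to_pandas()["traffic_volume"]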
2.2 Propositionalization with featuretools¶
traffic_train = time_series.train.population
traffic_test = time_series.test.population
dfs_pandas = {}
for df in [traffic_train, traffic_test, traffic]:
    dfs_pandas[df.name] = df.drop(df.roles.unused).to_pandas()
    dfs_pandas[df.name]["join_key"] = 1
ft_builder = FTTimeSeriesBuilder(
num_features=200,
horizon=pd.Timedelta(hours=1),
memory=pd.Timedelta(hours=24),
column_id="join_key",
time_stamp="ds",
target="traffic_volume",
allow_lagged_targets=True,
)
with benchmark("featuretools"):
featuretools_train = ft_builder.fit(dfs_pandas["train"])
featuretools_test = ft_builder.transform(dfs_pandas["test"])
featuretools: Trying features... Selecting the best out of 118 features... Time taken: 0h:4m:27.008254
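As a quick sanity check (not part of the benchmark), the features built by featuretools can be inspected directly, since the builder returns a plain pandas DataFrame:
# Peek at the first few generated feature names.
featuretools_train.columns.tolist()[:10]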
roles = {
getml.data.roles.join_key: ["join_key"],
getml.data.roles.target: ["traffic_volume"],
getml.data.roles.time_stamp: ["ds"],
}
df_featuretools_train = getml.data.DataFrame.from_pandas(
featuretools_train, name="featuretools_train", roles=roles
)
df_featuretools_test = getml.data.DataFrame.from_pandas(
featuretools_test, name="featuretools_test", roles=roles
)
df_featuretools_train.set_role(
df_featuretools_train.roles.unused, getml.data.roles.numerical
)
df_featuretools_test.set_role(
df_featuretools_test.roles.unused, getml.data.roles.numerical
)
predictor = getml.predictors.XGBoostRegressor()
pipe_ft_pr = getml.pipeline.Pipeline(
tags=["prediction", "featuretools"], predictors=[predictor]
)
pipe_ft_pr
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['prediction', 'featuretools'])
pipe_ft_pr.check(df_featuretools_train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
pipe_ft_pr.fit(df_featuretools_train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
Trained pipeline.
Time taken: 0:00:01.955919.
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['prediction', 'featuretools'])
pipe_ft_pr.score(df_featuretools_test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | mae | rmse | rsquared | |
---|---|---|---|---|---|---|
0 | 2024-09-13 13:22:48 | featuretools_train | traffic_volume | 220.4023 | 321.1657 | 0.9734 |
1 | 2024-09-13 13:22:48 | featuretools_test | traffic_volume | 210.1988 | 317.52 | 0.9746 |
2.3 Propositionalization with tsfresh¶
tsfresh failed to run due to an apparent bug in the tsfresh library and is therefore excluded from this analysis.
3. Comparison¶
# Number of features generated by each approach
num_features = dict(
    fastprop=461,
    featuretools=59,
)
runtime_per_feature = [
benchmark.runtimes["fastprop"] / num_features["fastprop"],
benchmark.runtimes["featuretools"] / num_features["featuretools"],
]
features_per_second = [1.0 / r.total_seconds() for r in runtime_per_feature]
normalized_runtime_per_feature = [
r / runtime_per_feature[0] for r in runtime_per_feature
]
comparison = pd.DataFrame(
dict(
runtime=[benchmark.runtimes["fastprop"], benchmark.runtimes["featuretools"]],
num_features=num_features.values(),
features_per_second=features_per_second,
normalized_runtime=[
1,
benchmark.runtimes["featuretools"] / benchmark.runtimes["fastprop"],
],
normalized_runtime_per_feature=normalized_runtime_per_feature,
rsquared=[pipe_fp_pr.rsquared, pipe_ft_pr.rsquared],
rmse=[pipe_fp_pr.rmse, pipe_ft_pr.rmse],
mae=[pipe_fp_pr.mae, pipe_ft_pr.mae],
)
)
comparison.index = ["getML: FastProp", "featuretools"]
comparison
runtime | num_features | features_per_second | normalized_runtime | normalized_runtime_per_feature | rsquared | rmse | mae | |
---|---|---|---|---|---|---|---|---|
getML: FastProp | 0 days 00:00:04.806504 | 461 | 95.914061 | 1.000000 | 1.000000 | 0.982678 | 261.938873 | 180.486734 |
featuretools | 0 days 00:04:27.009351 | 59 | 0.220966 | 55.551676 | 434.066948 | 0.974582 | 317.519976 | 210.198793 |
comparison.to_csv("comparisons/interstate94.csv")
getml.engine.shutdown()