Propositionalization: Interstate 94¶
In this notebook, we compare getML's FastProp against the well-known feature engineering libraries featuretools and tsfresh.
Summary:
- Prediction type: Regression model
- Domain: Transportation
- Prediction target: Hourly traffic volume
- Source data: Multivariate time series, 5 components
- Population size: 24096
Background¶
A common approach to feature engineering is to generate attribute-value representations from relational data by applying a fixed set of aggregations to columns of interest and then performing feature selection on the (possibly large) set of generated features. In academia, this approach is called propositionalization.
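As a toy illustration of this idea, the following pandas sketch (hypothetical data and aggregations, not how any of the libraries benchmarked here are implemented) collapses a peripheral table into the population table by applying a fixed set of aggregations per join key:
import pandas as pd

# Hypothetical relational data: one population row per customer,
# many peripheral rows (e.g. transactions) per customer.
population = pd.DataFrame({"customer_id": [1, 2], "target": [0, 1]})
peripheral = pd.DataFrame(
    {"customer_id": [1, 1, 2, 2, 2], "amount": [10.0, 20.0, 5.0, 7.0, 3.0]}
)

# Apply a fixed set of aggregations per join key ...
aggregated = (
    peripheral.groupby("customer_id")["amount"]
    .agg(["mean", "sum", "max", "count"])
    .reset_index()
)

# ... and merge the resulting attribute-value representation onto the population table.
flat_table = population.merge(aggregated, on="customer_id", how="left")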
getML's FastProp is an implementation of this propositionalization approach that has been optimized for speed and memory efficiency. In this notebook, we want to demonstrate how – well – fast FastProp is. To this end, we will benchmark FastProp against the popular feature engineering libraries featuretools and tsfresh. Both of these libraries use propositionalization approaches for feature engineering.
In this notebook, we predict the hourly traffic volume on I-94 westbound from Minneapolis-St Paul. The analysis is built on top of a dataset provided by the MN Department of Transportation, with some data preparation done by John Hogue. For further details about the dataset, refer to the full notebook.
Analysis¶
Let's get started with the analysis and set up the session:
%pip install -q "getml==1.5.0" "featuretools==1.31.0"
Note: you may need to restart the kernel to use updated packages.
import os
import sys
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
from pathlib import Path
import pandas as pd
import getml
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
getml.engine.launch(allow_remote_ips=True, token="token")
getml.engine.set_project("interstate94")
getML Engine is already running.
Connected to project 'interstate94'.
# If we are in Colab, we need to fetch the utils folder from the repository
if os.getenv("COLAB_RELEASE_TAG"):
    !curl -L https://api.github.com/repos/getml/getml-demo/tarball/master | tar --wildcards --strip-components=1 -xz '*utils*'

parent = Path(os.getcwd()).parent.as_posix()

if parent not in sys.path:
    sys.path.append(parent)
from utils import Benchmark, FTTimeSeriesBuilder
1. Loading data¶
1.1 Download from source¶
We begin by downloading the data from the UC Irvine Machine Learning Repository:
traffic = getml.datasets.load_interstate94(roles=True, units=True)
Downloading traffic... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 1.2/1.2 MB • 00:00
traffic.set_role(traffic.roles.categorical, getml.data.roles.unused_string)
traffic
name | ds | traffic_volume | holiday | day | month | weekday | hour | year |
---|---|---|---|---|---|---|---|---|
role | time_stamp | target | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
unit | time stamp, comparison only | | | day | month | weekday | hour | year |
0 | 2016-01-01 | 1513 | New Years Day | 1 | 1 | 4 | 0 | 2016 |
1 | 2016-01-01 01:00:00 | 1550 | New Years Day | 1 | 1 | 4 | 1 | 2016 |
2 | 2016-01-01 02:00:00 | 993 | New Years Day | 1 | 1 | 4 | 2 | 2016 |
3 | 2016-01-01 03:00:00 | 719 | New Years Day | 1 | 1 | 4 | 3 | 2016 |
4 | 2016-01-01 04:00:00 | 533 | New Years Day | 1 | 1 | 4 | 4 | 2016 |
... | ... | ... | ... | ... | ... | ... | ... | |
24091 | 2018-09-30 19:00:00 | 3543 | No holiday | 30 | 9 | 6 | 19 | 2018 |
24092 | 2018-09-30 20:00:00 | 2781 | No holiday | 30 | 9 | 6 | 20 | 2018 |
24093 | 2018-09-30 21:00:00 | 2159 | No holiday | 30 | 9 | 6 | 21 | 2018 |
24094 | 2018-09-30 22:00:00 | 1450 | No holiday | 30 | 9 | 6 | 22 | 2018 |
24095 | 2018-09-30 23:00:00 | 954 | No holiday | 30 | 9 | 6 | 23 | 2018 |
24096 rows x 8 columns
memory usage: 2.16 MB
name: traffic
type: getml.DataFrame
1.2 Define relational model¶
# Train on everything before March 15, 2018; test on the remainder.
split = getml.data.split.time(traffic, "ds", test=getml.data.time.datetime(2018, 3, 15))

time_series = getml.data.TimeSeries(
    population=traffic,
    split=split,
    alias="traffic",
    time_stamps="ds",
    horizon=getml.data.time.hours(1),   # predict one hour ahead
    memory=getml.data.time.hours(24),   # aggregate over a window of up to 24 hours
    lagged_targets=True,                # past traffic_volume values may be used as features
)
time_series
data frames | staging table | |
---|---|---|
0 | traffic | TRAFFIC__STAGING_TABLE_1 |
1 | traffic | TRAFFIC__STAGING_TABLE_2 |
subset | name | rows | type | |
---|---|---|---|---|
0 | test | traffic | unknown | View |
1 | train | traffic | unknown | View |
name | rows | type | |
---|---|---|---|
0 | traffic | 24096 | DataFrame |
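Roughly speaking, the TimeSeries container sets up a self-join of the traffic table onto itself: for each prediction time, only rows that lie at least horizon (1 hour) and at most horizon + memory (25 hours) in the past may be aggregated, and because lagged_targets=True these rows include past values of traffic_volume. The following pandas sketch of that window for a single prediction time is for illustration only and is not getML's implementation:
import pandas as pd

# Hypothetical hourly series standing in for the traffic table.
ts = pd.DataFrame(
    {"ds": pd.date_range("2018-01-01", periods=48, freq="h"), "traffic_volume": range(48)}
)

t = pd.Timestamp("2018-01-02 12:00:00")  # prediction time
horizon = pd.Timedelta(hours=1)          # predict one hour ahead
memory = pd.Timedelta(hours=24)          # aggregate over at most 24 hours beyond the horizon

# Rows eligible for aggregation lie in the window (t - horizon - memory, t - horizon].
window = ts[(ts["ds"] > t - horizon - memory) & (ts["ds"] <= t - horizon)]

# With lagged_targets=True, past target values themselves may be aggregated,
# e.g. the mean traffic volume over that window.
mean_last_24h = window["traffic_volume"].mean()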
2. Predictive modeling¶
We have loaded the data and defined the roles, units, and the abstract data model. Next, we create a getML pipeline for relational learning.
2.1 Propositionalization with getML's FastProp¶
# The Seasonal preprocessor extracts calendar components (such as hour,
# weekday, and month) from time stamp columns.
seasonal = getml.preprocessors.Seasonal()

fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    num_threads=1,
)
Build the pipeline
pipe_fp_fl = getml.pipeline.Pipeline(
preprocessors=[seasonal],
feature_learners=[fast_prop],
data_model=time_series.data_model,
tags=["feature learning", "fastprop"],
)
pipe_fp_fl
Pipeline(data_model='traffic', feature_learners=['FastProp'], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=['traffic'], predictors=[], preprocessors=['Seasonal'], share_selected_features=0.5, tags=['feature learning', 'fastprop'])
pipe_fp_fl.check(time_series.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
benchmark = Benchmark()
with benchmark("fastprop"):
pipe_fp_fl.fit(time_series.train)
fastprop_train = pipe_fp_fl.transform(time_series.train, df_name="fastprop_train")
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Trying 365 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:03
Trained pipeline.
Time taken: 0:00:03.058378. Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
fastprop_test = pipe_fp_fl.transform(time_series.test, df_name="fastprop_test")
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
predictor = getml.predictors.XGBoostRegressor()
pipe_fp_pr = getml.pipeline.Pipeline(
tags=["prediction", "fastprop"], predictors=[predictor]
)
pipe_fp_pr.fit(fastprop_train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:05
Trained pipeline.
Time taken: 0:00:05.192145.
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['prediction', 'fastprop'])
pipe_fp_pr.score(fastprop_test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | mae | rmse | rsquared | |
---|---|---|---|---|---|---|
0 | 2024-09-13 13:17:10 | fastprop_train | traffic_volume | 198.9482 | 292.2493 | 0.9779 |
1 | 2024-09-13 13:17:10 | fastprop_test | traffic_volume | 180.4867 | 261.9389 | 0.9827 |
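Beyond the aggregate scores, the fitted prediction pipeline can also return the hour-by-hour forecasts, for example for plotting. A minimal sketch (the variable names are ours, not part of the benchmark):
# Hour-by-hour forecasts on the test set.
predictions_fp = pipe_fp_pr.predict(fastprop_test)

# The observed values for comparison; the transformed frame keeps the target column.
actuals_fp = fastprop_test.to_pandas()["traffic_volume"]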
2.2 Propositionalization with featuretools¶
traffic_train = time_series.train.population
traffic_test = time_series.test.population
dfs_pandas = {}
for df in [traffic_train, traffic_test, traffic]:
    dfs_pandas[df.name] = df.drop(df.roles.unused).to_pandas()
    dfs_pandas[df.name]["join_key"] = 1
ft_builder = FTTimeSeriesBuilder(
num_features=200,
horizon=pd.Timedelta(hours=1),
memory=pd.Timedelta(hours=24),
column_id="join_key",
time_stamp="ds",
target="traffic_volume",
allow_lagged_targets=True,
)
with benchmark("featuretools"):
featuretools_train = ft_builder.fit(dfs_pandas["train"])
featuretools_test = ft_builder.transform(dfs_pandas["test"])
featuretools: Trying features... Selecting the best out of 118 features... Time taken: 0h:4m:27.008254
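As a quick sanity check (not part of the benchmark), the features built by featuretools can be inspected directly, since the builder returns a plain pandas DataFrame:
# Peek at the first few generated feature names.
featuretools_train.columns.tolist()[:10]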
roles = {
getml.data.roles.join_key: ["join_key"],
getml.data.roles.target: ["traffic_volume"],
getml.data.roles.time_stamp: ["ds"],
}
df_featuretools_train = getml.data.DataFrame.from_pandas(
featuretools_train, name="featuretools_train", roles=roles
)
df_featuretools_test = getml.data.DataFrame.from_pandas(
featuretools_test, name="featuretools_test", roles=roles
)
df_featuretools_train.set_role(
df_featuretools_train.roles.unused, getml.data.roles.numerical
)
df_featuretools_test.set_role(
df_featuretools_test.roles.unused, getml.data.roles.numerical
)
predictor = getml.predictors.XGBoostRegressor()
pipe_ft_pr = getml.pipeline.Pipeline(
tags=["prediction", "featuretools"], predictors=[predictor]
)
pipe_ft_pr
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['prediction', 'featuretools'])
pipe_ft_pr.check(df_featuretools_train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
pipe_ft_pr.fit(df_featuretools_train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
Trained pipeline.
Time taken: 0:00:01.955919.
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['prediction', 'featuretools'])
pipe_ft_pr.score(df_featuretools_test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | mae | rmse | rsquared | |
---|---|---|---|---|---|---|
0 | 2024-09-13 13:22:48 | featuretools_train | traffic_volume | 220.4023 | 321.1657 | 0.9734 |
1 | 2024-09-13 13:22:48 | featuretools_test | traffic_volume | 210.1988 | 317.52 | 0.9746 |
2.3 Propositionalization with tsfresh¶
tsfresh failed to run due to an apparent bug in the tsfresh library and is therefore excluded from this analysis.
3. Comparison¶
# Number of features generated by each approach
num_features = dict(
    fastprop=461,
    featuretools=59,
)
runtime_per_feature = [
benchmark.runtimes["fastprop"] / num_features["fastprop"],
benchmark.runtimes["featuretools"] / num_features["featuretools"],
]
features_per_second = [1.0 / r.total_seconds() for r in runtime_per_feature]
normalized_runtime_per_feature = [
r / runtime_per_feature[0] for r in runtime_per_feature
]
comparison = pd.DataFrame(
dict(
runtime=[benchmark.runtimes["fastprop"], benchmark.runtimes["featuretools"]],
num_features=num_features.values(),
features_per_second=features_per_second,
normalized_runtime=[
1,
benchmark.runtimes["featuretools"] / benchmark.runtimes["fastprop"],
],
normalized_runtime_per_feature=normalized_runtime_per_feature,
rsquared=[pipe_fp_pr.rsquared, pipe_ft_pr.rsquared],
rmse=[pipe_fp_pr.rmse, pipe_ft_pr.rmse],
mae=[pipe_fp_pr.mae, pipe_ft_pr.mae],
)
)
comparison.index = ["getML: FastProp", "featuretools"]
comparison
runtime | num_features | features_per_second | normalized_runtime | normalized_runtime_per_feature | rsquared | rmse | mae | |
---|---|---|---|---|---|---|---|---|
getML: FastProp | 0 days 00:00:04.806504 | 461 | 95.914061 | 1.000000 | 1.000000 | 0.982678 | 261.938873 | 180.486734 |
featuretools | 0 days 00:04:27.009351 | 59 | 0.220966 | 55.551676 | 434.066948 | 0.974582 | 317.519976 | 210.198793 |
comparison.to_csv("comparisons/interstate94.csv")
getml.engine.shutdown()