Dodgers - Traffic near Dodgers' stadium¶
In this notebook, we compare getML's FastProp against the well-known feature engineering libraries featuretools and tsfresh.
Summary:
- Prediction type: Regression model
- Domain: Transportation
- Prediction target: traffic volume
- Source data: Univariate time series
- Population size: 47497
Background¶
A common approach to feature engineering is to generate attribute-value representations from relational data by applying a fixed set of aggregations to columns of interest and then performing feature selection on the (possibly large) set of generated features. In academia, this approach is called propositionalization.
getML's FastProp is an implementation of this propositionalization approach that has been optimized for speed and memory efficiency. In this notebook, we want to demonstrate how – well – fast FastProp is. To this end, we will benchmark FastProp against the popular feature engineering libraries featuretools and tsfresh. Both of these libraries use propositionalization approaches for feature engineering.
In this notebook, we use traffic data that was collected at the Glendale on-ramp of the 101 North freeway in Los Angeles. For further details about the data set, refer to the full notebook.
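The propositionalization approach described above can be illustrated with a minimal, standard-library sketch. The toy tables, the column name `amount` and the helper `propositionalize` are hypothetical and only serve as an illustration; this is not how FastProp, featuretools or tsfresh are implemented.

```python
from statistics import mean

# Hypothetical toy data: a population table and a peripheral table
# that are linked by the join key "id".
population = [{"id": 1}, {"id": 2}]
peripheral = [
    {"id": 1, "amount": 10.0},
    {"id": 1, "amount": 30.0},
    {"id": 2, "amount": 5.0},
]

# The fixed set of aggregations applied to every column of interest.
AGGREGATIONS = {"count": len, "sum": sum, "mean": mean, "min": min, "max": max}


def propositionalize(population, peripheral, key, column):
    """Return one flat (attribute-value) feature row per population row."""
    rows = []
    for pop_row in population:
        # Collect all peripheral values that join to this population row.
        values = [p[column] for p in peripheral if p[key] == pop_row[key]]
        features = {key: pop_row[key]}
        for name, func in AGGREGATIONS.items():
            # Rows without any match get None instead of an aggregate.
            features[f"{name}_{column}"] = func(values) if values else None
        rows.append(features)
    return rows


features = propositionalize(population, peripheral, key="id", column="amount")
for row in features:
    print(row)
```

In a real setting, a feature selection step would follow, keeping only the generated columns that are most useful for predicting the target.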
Analysis¶
Let's get started with the analysis and set up the session:
import datetime
import gc
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
from pathlib import Path
import sys
import time
from urllib import request
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import scipy
from IPython.display import Image
from scipy.stats import pearsonr
%matplotlib inline
parent = Path(os.getcwd()).parent.as_posix()
if parent not in sys.path:
    sys.path.append(parent)
from utils import Benchmark, FTTimeSeriesBuilder, TSFreshBuilder
import getml
getml.engine.launch(home_directory=Path.home(), allow_remote_ips=True, token='token')
getml.engine.set_project("dodgers")
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/getml --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/getml/.getML/getml-1.4.0-x64-linux... Launched the getML engine. The log output will be stored in /home/getml/.getML/logs/20240221161224.log. Loading pipelines... 100% |██████████| [elapsed: 00:01, remaining: 00:00] Connected to project 'dodgers'
1. Loading data¶
1.1 Download from source¶
We begin by downloading the data from the UC Irvine Machine Learning repository:
fname = "Dodgers.data"
if not os.path.exists(fname):
    fname, res = request.urlretrieve(
        "https://archive.ics.uci.edu/ml/machine-learning-databases/event-detection/"
        + fname,
        fname,
    )
data_full_pandas = pd.read_csv(fname, header=None)
data_full_pandas.columns = ["ds", "y"]
data_full_pandas["ds"] = [
    datetime.datetime.strptime(dt, "%m/%d/%Y %H:%M") for dt in data_full_pandas["ds"]
]
data_full_pandas
| | ds | y |
---|---|---|
0 | 2005-04-10 00:00:00 | -1 |
1 | 2005-04-10 00:05:00 | -1 |
2 | 2005-04-10 00:10:00 | -1 |
3 | 2005-04-10 00:15:00 | -1 |
4 | 2005-04-10 00:20:00 | -1 |
... | ... | ... |
50395 | 2005-10-01 23:35:00 | -1 |
50396 | 2005-10-01 23:40:00 | -1 |
50397 | 2005-10-01 23:45:00 | -1 |
50398 | 2005-10-01 23:50:00 | -1 |
50399 | 2005-10-01 23:55:00 | -1 |
50400 rows × 2 columns
1.2 Prepare data for getML¶
data_full = getml.data.DataFrame.from_pandas(data_full_pandas, "data_full")
data_full.set_role("y", getml.data.roles.target)
data_full.set_role("ds", getml.data.roles.time_stamp)
data_full
name | ds | y |
---|---|---|
role | time_stamp | target |
unit | time stamp, comparison only | |
0 | 2005-04-10 | -1 |
1 | 2005-04-10 00:05:00 | -1 |
2 | 2005-04-10 00:10:00 | -1 |
3 | 2005-04-10 00:15:00 | -1 |
4 | 2005-04-10 00:20:00 | -1 |
... | ... | |
50395 | 2005-10-01 23:35:00 | -1 |
50396 | 2005-10-01 23:40:00 | -1 |
50397 | 2005-10-01 23:45:00 | -1 |
50398 | 2005-10-01 23:50:00 | -1 |
50399 | 2005-10-01 23:55:00 | -1 |
50400 rows x 2 columns
memory usage: 0.81 MB
name: data_full
type: getml.DataFrame
split = getml.data.split.time(
    population=data_full, time_stamp="ds", test=getml.data.time.datetime(2005, 8, 20)
)
split
0 | train |
---|---|
1 | train |
2 | train |
3 | train |
4 | train |
... |
50400 rows
type: StringColumnView
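To make the effect of the time-based split more concrete, here is a minimal sketch using only the standard library. It rebuilds the 5-minute time stamps of the data set and assigns every row before the cutoff to the training set. That rows at or after the cutoff go to the test set is an assumption; it is consistent with the subset sizes of 38016 and 12384 rows reported for the data model in this notebook.

```python
from collections import Counter
from datetime import datetime, timedelta

# Rebuild the 5-minute grid covered by the data set
# (2005-04-10 00:00 through 2005-10-01 23:55).
start = datetime(2005, 4, 10)
end = datetime(2005, 10, 1, 23, 55)
step = timedelta(minutes=5)
num_rows = (end - start) // step + 1
time_stamps = [start + i * step for i in range(num_rows)]

# Assumption: rows strictly before the cutoff are "train", all others "test".
cutoff = datetime(2005, 8, 20)
subsets = ["train" if ts < cutoff else "test" for ts in time_stamps]

counts = Counter(subsets)
print(num_rows, counts["train"], counts["test"])
```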
1.3 Define relational model¶
To start with relational learning, we need to specify the data model. We manually replicate the appropriate time series structure by setting time series related join conditions (horizon, memory and allow_lagged_targets). This is done abstractly using Placeholders.
The data model consists of two tables:
- Population table traffic_{test/train}: holds the target and the contemporarily available time-based components
- Peripheral table traffic: same table as the population table
- The join between both placeholders specifies the horizon, which prevents leaks, and the memory, which keeps the computations feasible
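The roles of horizon and memory can be sketched in plain Python. For a prediction made for time stamp `now`, only lagged values inside a window that ends `horizon` before `now` and spans `memory` are visible. The exact boundary handling (half-open interval) and the helper name `window` are assumptions made for this illustration, not a description of getML's internals.

```python
from datetime import datetime, timedelta

HORIZON = timedelta(hours=1)   # we predict one hour ahead
MEMORY = timedelta(hours=2)    # we look back at most two further hours
STEP = timedelta(minutes=5)

# Hypothetical series: one reading every 5 minutes with values 0, 1, 2, ...
start = datetime(2005, 4, 11)
series = [(start + i * STEP, float(i)) for i in range(60)]


def window(series, now, horizon=HORIZON, memory=MEMORY):
    """Return the lagged values available for a prediction made for `now`.

    Assumed boundaries: now - horizon - memory < ts <= now - horizon.
    """
    newest = now - horizon
    oldest = newest - memory
    return [y for ts, y in series if oldest < ts <= newest]


now = series[-1][0]                        # last time stamp (04:55)
lagged = window(series, now)
avg_lagged = sum(lagged) / len(lagged)     # one possible aggregation feature
print(len(lagged), avg_lagged)
```

With 5-minute data, a memory of two hours corresponds to 24 lagged rows per prediction. Rows near the start of the series see fewer lagged values, or none at all.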
# 1. The horizon is 1 hour (we predict the traffic volume in one hour).
# 2. The memory is 2 hours, so we allow the algorithm to
# use information from up to 2 hours ago.
# 3. We allow lagged targets. Thus, the algorithm can
# identify autoregressive processes.
time_series = getml.data.TimeSeries(
    population=data_full,
    alias="population",
    split=split,
    time_stamps="ds",
    horizon=getml.data.time.hours(1),
    memory=getml.data.time.hours(2),
    lagged_targets=True,
)
time_series
| | data frames | staging table |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | data_full | DATA_FULL__STAGING_TABLE_2 |
| | subset | name | rows | type |
---|---|---|---|---|
0 | test | data_full | 12384 | View |
1 | train | data_full | 38016 | View |
| | name | rows | type |
---|---|---|---|
0 | data_full | 50400 | DataFrame |
2. Predictive modeling¶
We loaded the data, defined the roles, units and the abstract data model. Next, we create a getML pipeline for relational learning.
2.1 Propositionalization with getML's FastProp¶
seasonal = getml.preprocessors.Seasonal()
fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    num_threads=1,
)
Build the pipeline:
pipe_fp_fl = getml.pipeline.Pipeline(
    preprocessors=[seasonal],
    feature_learners=[fast_prop],
    data_model=time_series.data_model,
    tags=["feature learning", "fastprop"],
)
pipe_fp_fl
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=['data_full'], predictors=[], preprocessors=['Seasonal'], share_selected_features=0.5, tags=['feature learning', 'fastprop'])
pipe_fp_fl.check(time_series.train)
Checking data model... Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Checking... 100% |██████████| [elapsed: 00:02, remaining: 00:00] OK.
benchmark = Benchmark()
with benchmark("fastprop"):
    pipe_fp_fl.fit(time_series.train)
    fastprop_train = pipe_fp_fl.transform(time_series.train, df_name="fastprop_train")
Checking data model... Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00] OK. Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00] FastProp: Trying 526 features... 100% |██████████| [elapsed: 00:06, remaining: 00:00] Trained pipeline. Time taken: 0h:0m:6.317863 Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00] FastProp: Building features... 100% |██████████| [elapsed: 00:03, remaining: 00:00]
fastprop_test = pipe_fp_fl.transform(time_series.test, df_name="fastprop_test")
Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00] FastProp: Building features... 100% |██████████| [elapsed: 00:01, remaining: 00:00]
predictor = getml.predictors.XGBoostRegressor()
pipe_fp_pr = getml.pipeline.Pipeline(
    tags=["prediction", "fastprop"], predictors=[predictor]
)
pipe_fp_pr.fit(fastprop_train)
Checking data model... Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Checking... 100% |██████████| [elapsed: 00:00, remaining: 00:00] OK. Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] XGBoost: Training as predictor... 100% |██████████| [elapsed: 00:09, remaining: 00:00] Trained pipeline. Time taken: 0h:0m:9.613381
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['prediction', 'fastprop'])
pipe_fp_pr.score(fastprop_test)
Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
| | date time | set used | target | mae | rmse | rsquared |
---|---|---|---|---|---|---|
0 | 2024-02-21 16:12:55 | fastprop_train | y | 5.4188 | 7.5347 | 0.699 |
1 | 2024-02-21 16:12:55 | fastprop_test | y | 5.6151 | 7.8243 | 0.6747 |
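For readers unfamiliar with the three reported scores, the following sketch shows how they could be computed by hand. MAE and RMSE follow their standard definitions. For rsquared, the sketch assumes the squared Pearson correlation between predictions and targets; treat that as an assumption about the definition rather than a confirmed detail of the scoring code. The sample values are made up.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def rsquared(y_true, y_pred):
    """Squared Pearson correlation (assumed definition of rsquared)."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)


# Made-up targets and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.5]
scores = (mae(y_true, y_pred), rmse(y_true, y_pred), rsquared(y_true, y_pred))
print(scores)
```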
2.2 Propositionalization with featuretools¶
data_train = time_series.train.population.to_df("data_train")
data_test = time_series.test.population.to_df("data_test")
dfs_pandas = {}
for df in getml.project.data_frames:
    dfs_pandas[df.name] = df.to_pandas()
    dfs_pandas[df.name]["id"] = 1
ft_builder = FTTimeSeriesBuilder(
    num_features=200,
    horizon=pd.Timedelta(hours=1),
    memory=pd.Timedelta(hours=2),
    column_id="id",
    time_stamp="ds",
    target="y",
    allow_lagged_targets=True,
)
with benchmark("featuretools"):
    featuretools_train = ft_builder.fit(dfs_pandas["data_train"])
featuretools_test = ft_builder.transform(dfs_pandas["data_test"])
featuretools: Trying features... Selecting the best out of 118 features... Time taken: 0h:9m:19.75259
df_featuretools_train = getml.data.DataFrame.from_pandas(
    featuretools_train, name="featuretools_train", roles=data_train.roles
)
df_featuretools_test = getml.data.DataFrame.from_pandas(
    featuretools_test, name="featuretools_test", roles=data_train.roles
)
df_featuretools_train.set_role(
    df_featuretools_train.roles.unused, getml.data.roles.numerical
)
df_featuretools_test.set_role(
    df_featuretools_test.roles.unused, getml.data.roles.numerical
)
predictor = getml.predictors.XGBoostRegressor()
pipe_ft_pr = getml.pipeline.Pipeline(
    tags=["prediction", "featuretools"], predictors=[predictor]
)
pipe_ft_pr
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['prediction', 'featuretools'])
pipe_ft_pr.check(df_featuretools_train)
Checking data model... Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Checking... 100% |██████████| [elapsed: 00:00, remaining: 00:00] The pipeline check generated 0 issues labeled INFO and 1 issues labeled WARNING.
| | type | label | message |
---|---|---|---|
0 | WARNING | COLUMN SHOULD BE UNUSED | All non-NULL entries in column 'id' in POPULATION__STAGING_TABLE_1 are equal to each other. You should consider setting its role to unused_float or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only'). |
pipe_ft_pr.fit(df_featuretools_train)
Checking data model... Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] The pipeline check generated 0 issues labeled INFO and 1 issues labeled WARNING. To see the issues in full, run .check() on the pipeline. Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] XGBoost: Training as predictor... 100% |██████████| [elapsed: 00:04, remaining: 00:00] Trained pipeline. Time taken: 0h:0m:4.092266
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['prediction', 'featuretools'])
pipe_ft_pr.score(df_featuretools_test)
Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
| | date time | set used | target | mae | rmse | rsquared |
---|---|---|---|---|---|---|
0 | 2024-02-21 16:25:31 | featuretools_train | y | 5.4482 | 7.568 | 0.6962 |
1 | 2024-02-21 16:25:31 | featuretools_test | y | 6.0863 | 8.5009 | 0.6498 |
2.3 Propositionalization with tsfresh¶
tsfresh_builder = TSFreshBuilder(
    num_features=200,
    horizon=20,
    memory=60,
    column_id="id",
    time_stamp="ds",
    target="y",
    allow_lagged_targets=True,
)
with benchmark("tsfresh"):
    tsfresh_train = tsfresh_builder.fit(dfs_pandas["data_train"])
tsfresh_test = tsfresh_builder.transform(dfs_pandas["data_test"])
Rolling: 100%|██████████| 40/40 [00:19<00:00, 2.06it/s] Feature Extraction: 100%|██████████| 40/40 [00:08<00:00, 4.69it/s] Feature Extraction: 100%|██████████| 40/40 [00:08<00:00, 4.71it/s]
Selecting the best out of 13 features... Time taken: 0h:0m:46.114942
Rolling: 100%|██████████| 40/40 [00:05<00:00, 7.69it/s] Feature Extraction: 100%|██████████| 40/40 [00:02<00:00, 13.49it/s] Feature Extraction: 100%|██████████| 40/40 [00:03<00:00, 12.04it/s]
df_tsfresh_train = getml.data.DataFrame.from_pandas(
    tsfresh_train, name="tsfresh_train", roles=data_train.roles
)
df_tsfresh_test = getml.data.DataFrame.from_pandas(
    tsfresh_test, name="tsfresh_test", roles=data_train.roles
)
df_tsfresh_train.set_role(df_tsfresh_train.roles.unused, getml.data.roles.numerical)
df_tsfresh_test.set_role(df_tsfresh_test.roles.unused, getml.data.roles.numerical)
pipe_tsf_pr = getml.pipeline.Pipeline(
    tags=["prediction", "tsfresh"], predictors=[predictor]
)
pipe_tsf_pr
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['prediction', 'tsfresh'])
pipe_tsf_pr.fit(df_tsfresh_train)
Checking data model... Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Checking... 100% |██████████| [elapsed: 00:00, remaining: 00:00] The pipeline check generated 0 issues labeled INFO and 1 issues labeled WARNING. To see the issues in full, run .check() on the pipeline. Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] XGBoost: Training as predictor... 100% |██████████| [elapsed: 00:02, remaining: 00:00] Trained pipeline. Time taken: 0h:0m:1.790984
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['prediction', 'tsfresh'])
pipe_tsf_pr.score(df_tsfresh_test)
Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
| | date time | set used | target | mae | rmse | rsquared |
---|---|---|---|---|---|---|
0 | 2024-02-21 16:26:34 | tsfresh_train | y | 6.3146 | 8.2348 | 0.6418 |
1 | 2024-02-21 16:26:34 | tsfresh_test | y | 6.7886 | 8.9134 | 0.5778 |
3. Comparison¶
num_features = dict(
    fastprop=526,
    featuretools=59,
    tsfresh=12,
)
runtime_per_feature = [
    benchmark.runtimes["fastprop"] / num_features["fastprop"],
    benchmark.runtimes["featuretools"] / num_features["featuretools"],
    benchmark.runtimes["tsfresh"] / num_features["tsfresh"],
]
features_per_second = [1.0 / r.total_seconds() for r in runtime_per_feature]
normalized_runtime_per_feature = [
    r / runtime_per_feature[0] for r in runtime_per_feature
]
comparison = pd.DataFrame(
    dict(
        runtime=[
            benchmark.runtimes["fastprop"],
            benchmark.runtimes["featuretools"],
            benchmark.runtimes["tsfresh"],
        ],
        num_features=num_features.values(),
        features_per_second=features_per_second,
        normalized_runtime=[
            1,
            benchmark.runtimes["featuretools"] / benchmark.runtimes["fastprop"],
            benchmark.runtimes["tsfresh"] / benchmark.runtimes["fastprop"],
        ],
        normalized_runtime_per_feature=normalized_runtime_per_feature,
        rsquared=[pipe_fp_pr.rsquared, pipe_ft_pr.rsquared, pipe_tsf_pr.rsquared],
        rmse=[pipe_fp_pr.rmse, pipe_ft_pr.rmse, pipe_tsf_pr.rmse],
        mae=[pipe_fp_pr.mae, pipe_ft_pr.mae, pipe_tsf_pr.mae],
    )
)
comparison.index = ["getML: FastProp", "featuretools", "tsfresh"]
comparison
| | runtime | num_features | features_per_second | normalized_runtime | normalized_runtime_per_feature | rsquared | rmse | mae |
---|---|---|---|---|---|---|---|---|
getML: FastProp | 0 days 00:00:09.406415 | 526 | 55.919029 | 1.000000 | 1.000000 | 0.674740 | 7.824273 | 5.615138 |
featuretools | 0 days 00:09:19.754041 | 59 | 0.105403 | 59.507691 | 530.523794 | 0.649768 | 8.500887 | 6.086277 |
tsfresh | 0 days 00:00:46.115063 | 12 | 0.260219 | 4.902512 | 214.892468 | 0.577811 | 8.913408 | 6.788610 |
# export for further use
comparison.to_csv("comparisons/dodgers.csv")
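The derived columns of the comparison table follow directly from the raw runtimes and feature counts. As a cross-check, the sketch below recomputes them with the standard library only, using values copied from the table. The dictionary names are illustrative and not part of the notebook's utils module. Note that dividing a `datetime.timedelta` by an integer rounds to whole microseconds, so the last digits can differ slightly from an exact division.

```python
from datetime import timedelta

# Runtimes and feature counts copied from the comparison table.
runtimes = {
    "fastprop": timedelta(seconds=9, microseconds=406415),
    "featuretools": timedelta(minutes=9, seconds=19, microseconds=754041),
    "tsfresh": timedelta(seconds=46, microseconds=115063),
}
num_features = {"fastprop": 526, "featuretools": 59, "tsfresh": 12}

# FastProp serves as the baseline for both normalized columns.
baseline_runtime = runtimes["fastprop"]
baseline_per_feature = baseline_runtime / num_features["fastprop"]

derived = {}
for name, runtime in runtimes.items():
    per_feature = runtime / num_features[name]  # timedelta, rounded to microseconds
    derived[name] = {
        "features_per_second": 1.0 / per_feature.total_seconds(),
        "normalized_runtime": runtime / baseline_runtime,
        "normalized_runtime_per_feature": per_feature / baseline_per_feature,
    }

for name, values in derived.items():
    print(name, values)
```

Read this way, featuretools needs roughly 60 times the total runtime of FastProp and roughly 530 times the runtime per feature, because it produces far fewer features in that time.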