Dodgers - Traffic volume prediction¶
In this tutorial, we demonstrate a univariate time series prediction application with getML.
We benchmark our results against Facebook's Prophet and tsfresh.
getML's relational learning algorithms outperform Prophet's classical time series approach by ~14% and tsfresh's brute force approaches to feature engineering by ~26% (measured in terms of the predictive R-squared).
Summary:
- Prediction type: Regression model
- Domain: Transportation
- Prediction target: traffic volume
- Source data: Univariate time series
- Population size: 47497
Background¶
The data set features some particularly interesting characteristics common for time series, which classical models may struggle to deal with. Such characteristics are:
- High frequency (every five minutes)
- Dependence on irregular events (holidays, Dodgers games)
- Strong and overlapping cycles (daily, weekly)
- Anomalies
- Multiple seasonalities
To quote the maintainers of the data set:
"This loop sensor data was collected for the Glendale on ramp for the 101 North freeway in Los Angeles. It is close enough to the stadium to see unusual traffic after a Dodgers game, but not so close and heavily used by game traffic so that the signal for the extra traffic is overly obvious."
The dataset was originally collected for this paper:
"Adaptive event detection with time-varying Poisson processes" A. Ihler, J. Hutchins, and P. Smyth Proceedings of the 12th ACM SIGKDD Conference (KDD-06), August 2006.
It is maintained by the UCI Machine Learning Repository:
Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.
Analysis¶
%pip install -q "getml==1.5.0" "matplotlib==3.9.2" "scipy==1.14.1" "ipywidgets==8.1.5"
Let's get started with the analysis and set-up your session:
import os
import gc
import time
import warnings
from datetime import datetime
from urllib import request
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy
from scipy.stats import pearsonr
import getml
warnings.simplefilter(action='ignore', category=UserWarning)
%matplotlib inline
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
For various technical reasons, we want to pre-store the features for prophet and tsfresh. However, you are very welcome to try this at home and fully reproduce our results. You can just set the two constants to "True".
RUN_PROPHET = False
RUN_TSFRESH = False
if RUN_PROPHET:
%pip install -q "cmdstanpy==1.2.4" "Prophet==1.1.5"
import logging
import cmdstanpy
logger = logging.getLogger('cmdstanpy')
logger.addHandler(logging.NullHandler())
logger.propagate = False
logger.setLevel(logging.CRITICAL)
from prophet import Prophet
if RUN_TSFRESH:
%pip install -q "tsfresh==0.20.3"
import tsfresh
from tsfresh.utilities.dataframe_functions import roll_time_series
getml.engine.launch(allow_remote_ips=True, token='token')
getml.engine.set_project('dodgers')
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912220337.log. Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Connected to project 'dodgers'.
1. Loading data¶
1.1 Download from source¶
We begin by downloading the data from the UC Irvine Machine Learning repository:
fname = "Dodgers.data"
if not os.path.exists(fname):
fname, res = request.urlretrieve(
"https://archive.ics.uci.edu/ml/machine-learning-databases/event-detection/" + fname,
fname
)
data_full_pandas = pd.read_csv(fname, header=None)
If we use the pre-stored features, we have to download them as well:
PROPHET_FILES = [
"predictions_prophet_train.csv",
"predictions_prophet_test.csv",
"combined_train_pandas.csv",
"combined_test_pandas.csv"
]
if not RUN_PROPHET:
for fname in PROPHET_FILES:
if not os.path.exists(fname):
fname, res = request.urlretrieve(
"https://static.getml.com/datasets/dodgers/" + fname,
fname
)
TSFRESH_FILES = [
"tsfresh_train_pandas.csv",
"tsfresh_test_pandas.csv"
]
if not RUN_TSFRESH:
for fname in TSFRESH_FILES:
if not os.path.exists(fname):
fname, res = request.urlretrieve(
"https://static.getml.com/datasets/dodgers/" + fname,
fname
)
Prophet is pretty strict about how the columns should be named, so we adapt to these restriction:
data_full_pandas.columns = ["ds", "y"]
data_full_pandas = data_full_pandas[data_full_pandas["y"] >= 0]
data_full_pandas = data_full_pandas.reset_index()
del data_full_pandas["index"]
data_full_pandas["ds"] = [
datetime.strptime(dt, "%m/%d/%Y %H:%M") for dt in data_full_pandas["ds"]
]
data_full_pandas
ds | y | |
---|---|---|
0 | 2005-04-11 07:35:00 | 23 |
1 | 2005-04-11 07:40:00 | 42 |
2 | 2005-04-11 07:45:00 | 37 |
3 | 2005-04-11 07:50:00 | 24 |
4 | 2005-04-11 07:55:00 | 39 |
... | ... | ... |
47492 | 2005-09-30 23:45:00 | 14 |
47493 | 2005-09-30 23:50:00 | 12 |
47494 | 2005-09-30 23:55:00 | 8 |
47495 | 2005-10-01 00:00:00 | 13 |
47496 | 2005-10-01 00:05:00 | 13 |
47497 rows × 2 columns
1.2 Prepare data for getML¶
data_full = getml.data.DataFrame.from_pandas(data_full_pandas, "data_full")
data_full.set_role("y", getml.data.roles.target)
data_full.set_role("ds", getml.data.roles.time_stamp)
data_full
name | ds | y |
---|---|---|
role | time_stamp | target |
unit | time stamp, comparison only | |
0 | 2005-04-11 07:35:00 | 23 |
1 | 2005-04-11 07:40:00 | 42 |
2 | 2005-04-11 07:45:00 | 37 |
3 | 2005-04-11 07:50:00 | 24 |
4 | 2005-04-11 07:55:00 | 39 |
... | ... | |
47492 | 2005-09-30 23:45:00 | 14 |
47493 | 2005-09-30 23:50:00 | 12 |
47494 | 2005-09-30 23:55:00 | 8 |
47495 | 2005-10-01 | 13 |
47496 | 2005-10-01 00:05:00 | 13 |
47497 rows x 2 columns
memory usage: 0.76 MB
name: data_full
type: getml.DataFrame
split = getml.data.split.time(population=data_full, time_stamp="ds", test=getml.data.time.datetime(2005, 8, 20))
split
0 | train |
---|---|
1 | train |
2 | train |
3 | train |
4 | train |
... |
47497 rows
type: StringColumnView
Traffic: population table
To allow the algorithm to capture seasonal information, we include time components (such as the day of the week) as categorical variables.
1.3 Define relational model¶
To start with relational learning, we need to specify the data model. We manually replicate the appropriate time series structure by setting time series related join conditions (horizon
, memory
and allow_lagged_targets
). This is done abstractly using Placeholders
The data model consists of two tables:
- Population table
traffic_{test/train}
: holds target and the contemporarily available time-based components - Peripheral table
traffic
: same table as the population table - Join between both placeholders specifies (
horizon
) to prevent leaks and (memory
) that keeps the computations feasible
# 1. The horizon is 1 hour (we predict the traffic volume in one hour).
# 2. The memory is 2 hours, so we allow the algorithm to
# use information from up to 2 hours ago.
# 3. We allow lagged targets. Thus, the algorithm can
# identify autoregressive processes.
time_series = getml.data.TimeSeries(
population=data_full,
split=split,
time_stamps='ds',
horizon=getml.data.time.hours(1),
memory=getml.data.time.hours(2),
lagged_targets=True
)
time_series
data frames | staging table | |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | data_full | DATA_FULL__STAGING_TABLE_2 |
subset | name | rows | type | |
---|---|---|---|---|
0 | test | data_full | 11625 | View |
1 | train | data_full | 35872 | View |
name | rows | type | |
---|---|---|---|
0 | data_full | 47497 | DataFrame |
2. Predictive modeling¶
We loaded the data, defined the roles, units and the abstract data model. Next, we create a getML pipeline for relational learning.
2.1 getML Pipeline¶
Set-up of feature learners, selectors & predictor
mapping = getml.preprocessors.Mapping()
seasonal = getml.preprocessors.Seasonal()
fast_prop = getml.feature_learning.FastProp(
loss_function=getml.feature_learning.loss_functions.SquareLoss,
num_threads=1,
aggregation=getml.feature_learning.FastProp.agg_sets.All,
)
relboost = getml.feature_learning.Relboost(
num_features=10,
loss_function=getml.feature_learning.loss_functions.SquareLoss,
seed=4367,
num_threads=1
)
predictor = getml.predictors.XGBoostRegressor()
Build the pipeline
pipe = getml.pipeline.Pipeline(
tags=['memory: 2h', 'horizon: 1h', 'fast_prop', 'relboost'],
data_model=time_series.data_model,
preprocessors=[seasonal, mapping],
feature_learners=[fast_prop, relboost],
predictors=[predictor]
)
2.2 Model training¶
pipe.check(time_series.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:12 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
OK.
pipe.fit(time_series.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Trying 2866 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:47 Relboost: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:11 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:03 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:05 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:12
Trained pipeline.
Time taken: 0:01:20.843726.
Pipeline(data_model='population', feature_learners=['FastProp', 'Relboost'], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=['data_full'], predictors=['XGBoostRegressor'], preprocessors=['Seasonal', 'Mapping'], share_selected_features=0.5, tags=['memory: 2h', 'horizon: 1h', 'fast_prop', 'relboost', 'container-qLmkYe'])
2.3 Model evaluation¶
getml_score = pipe.score(time_series.test)
getml_score
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02
date time | set used | target | mae | rmse | rsquared | |
---|---|---|---|---|---|---|
0 | 2024-09-12 21:56:38 | train | y | 4.4978 | 6.1956 | 0.7778 |
1 | 2024-09-12 21:56:42 | test | y | 4.643 | 6.3837 | 0.7628 |
time_series
data frames | staging table | |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | data_full | DATA_FULL__STAGING_TABLE_2 |
subset | name | rows | type | |
---|---|---|---|---|
0 | test | data_full | 11625 | View |
1 | train | data_full | 35872 | View |
name | rows | type | |
---|---|---|---|
0 | data_full | 47497 | DataFrame |
predictions_getml_test = pipe.predict(time_series.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
2.4 Benchmark against Prophet¶
Prophet is a library for generating predictions on univariate time series that contain strong seasonal components. We would therefore expect it to do well on this particular time series.
data_train_pandas = time_series.train.population.to_pandas()
data_test_pandas = time_series.test.population[1:].to_pandas()
data_train_pandas
y | ds | |
---|---|---|
0 | 23.0 | 2005-04-11 07:35:00 |
1 | 42.0 | 2005-04-11 07:40:00 |
2 | 37.0 | 2005-04-11 07:45:00 |
3 | 24.0 | 2005-04-11 07:50:00 |
4 | 39.0 | 2005-04-11 07:55:00 |
... | ... | ... |
35867 | 6.0 | 2005-08-19 23:35:00 |
35868 | 16.0 | 2005-08-19 23:40:00 |
35869 | 11.0 | 2005-08-19 23:45:00 |
35870 | 10.0 | 2005-08-19 23:50:00 |
35871 | 11.0 | 2005-08-19 23:55:00 |
35872 rows × 2 columns
if RUN_PROPHET:
model_prophet = Prophet()
model_prophet = model_prophet.fit(data_train_pandas)
predictions_prophet_train = model_prophet.predict(data_train_pandas)["yhat"]
predictions_prophet_test = model_prophet.predict(data_test_pandas)["yhat"]
else:
predictions_prophet_train = pd.read_csv("predictions_prophet_train.csv")["yhat"]
predictions_prophet_test = pd.read_csv("predictions_prophet_test.csv")["yhat"]
Since we are not using the getML engine for Prophet, we have to implement the metrics ourselves. Luckily, that is not very hard.
in_sample = dict()
out_of_sample = dict()
predictions_prophet_test
0 11.182024 1 10.809442 2 10.439468 3 10.072405 4 9.708546 ... 11619 8.201863 11620 7.822110 11621 7.443635 11622 7.066790 11623 6.691914 Name: yhat, Length: 11624, dtype: float64
data_train_pandas
y | ds | |
---|---|---|
0 | 23.0 | 2005-04-11 07:35:00 |
1 | 42.0 | 2005-04-11 07:40:00 |
2 | 37.0 | 2005-04-11 07:45:00 |
3 | 24.0 | 2005-04-11 07:50:00 |
4 | 39.0 | 2005-04-11 07:55:00 |
... | ... | ... |
35867 | 6.0 | 2005-08-19 23:35:00 |
35868 | 16.0 | 2005-08-19 23:40:00 |
35869 | 11.0 | 2005-08-19 23:45:00 |
35870 | 10.0 | 2005-08-19 23:50:00 |
35871 | 11.0 | 2005-08-19 23:55:00 |
35872 rows × 2 columns
def r_squared(yhat, y):
yhat = np.asarray(yhat)
y = np.asarray(y)
r = scipy.stats.pearsonr(yhat, y)[0]
return r * r
in_sample["rsquared"] = r_squared(predictions_prophet_train, data_train_pandas["y"])
out_of_sample["rsquared"] = r_squared(predictions_prophet_test, data_test_pandas["y"])
def rmse(yhat, y):
yhat = np.asarray(yhat)
y = np.asarray(y)
return np.sqrt(
((y - yhat)*(y - yhat)).sum()/len(y)
)
in_sample["rmse"] = rmse(predictions_prophet_train, data_train_pandas["y"])
out_of_sample["rmse"] = rmse(predictions_prophet_test, data_test_pandas["y"])
def mae(yhat, y):
yhat = np.asarray(yhat)
y = np.asarray(y)
return np.abs(y - yhat).sum()/len(y)
in_sample["mae"] = mae(predictions_prophet_train, data_train_pandas["y"])
out_of_sample["mae"] = mae(predictions_prophet_test, data_test_pandas["y"])
print("""
In sample mae: {:.4f}
In sample rmse: {:.4f}
In sample rsquared: {:.4f}\n
Out of sample mae: {:.4f}
Out of sample rmse: {:.4f}
Out of sample rsquared: {:.4f}
""".format(
in_sample['mae'],
in_sample['rmse'],
in_sample['rsquared'],
out_of_sample['mae'],
out_of_sample['rmse'],
out_of_sample['rsquared'])
)
In sample mae: 5.6729 In sample rmse: 7.5989 In sample rsquared: 0.6655 Out of sample mae: 6.2245 Out of sample rmse: 8.3225 Out of sample rsquared: 0.6306
Let's take a closer look at the predictions to get a better understanding why getML does better than Prophet.
length = 4000
plt.subplots(figsize=(20, 10))
plt.plot(np.asarray(data_test_pandas["y"])[:length], label="ground truth")
plt.plot(predictions_getml_test[:length], label="getml")
plt.plot(predictions_prophet_test[:length], label="prophet")
plt.legend(loc="upper right")
<matplotlib.legend.Legend at 0x7896453b4bd0>
As this plot indicates, getML does better than Prophet, because it can integrate autoregressive processes in addition to seasonal data.
2.5 Benchmark against tsfresh¶
tsfresh is a library for generating features on time series. It uses a brute-force approach: It generates a large number of hard-coded features and then does a feature selection.
For convenience, we have built a wrapper around tsfresh.
As we have discussed in a different notebook, tsfresh consumes a lot of memory. To limit the memory consumption to a feasible level, we only use tsfresh's MinimalFCParameters and IndexBasedFCParameters, which are a superset of the TimeBasedFCParameters.
class TSFreshBuilder():
def __init__(self, num_features, memory, column_id, time_stamp, target):
"""
Scikit-learn style feature builder based on TSFresh.
Args:
num_features: The (maximum) number of features to build.
memory: How much back in time you want to go until the
feature builder starts "forgetting" data.
column_id: The name of the column containing the ids.
time_stamp: The name of the column containing the time stamps.
target: The name of the target column.
"""
self.num_features = num_features
self.memory = memory
self.column_id = column_id
self.time_stamp = time_stamp
self.target = target
self.selected_features = []
def _add_original_columns(self, original_df, df_selected):
for colname in original_df.columns:
df_selected[colname] = np.asarray(
original_df[colname])
return df_selected
def _extract_features(self, df):
df_rolled = roll_time_series(
df,
column_id=self.column_id,
column_sort=self.time_stamp,
max_timeshift=self.memory
)
extracted_minimal = tsfresh.extract_features(
df_rolled,
column_id=self.column_id,
column_sort=self.time_stamp,
default_fc_parameters=tsfresh.feature_extraction.MinimalFCParameters()
)
extracted_index_based = tsfresh.extract_features(
df_rolled,
column_id=self.column_id,
column_sort=self.time_stamp,
default_fc_parameters=tsfresh.feature_extraction.settings.IndexBasedFCParameters()
)
extracted_features = pd.concat(
[extracted_minimal, extracted_index_based], axis=1
)
del extracted_minimal
del extracted_index_based
gc.collect()
extracted_features[
extracted_features != extracted_features] = 0.0
extracted_features[
np.isinf(extracted_features)] = 0.0
return extracted_features
def _print_time_taken(self, begin, end):
seconds = end - begin
hours = int(seconds / 3600)
seconds -= float(hours * 3600)
minutes = int(seconds / 60)
seconds -= float(minutes * 60)
seconds = round(seconds, 6)
print(
"Time taken: " + str(hours) + "h:" +
str(minutes) + "m:" + str(seconds)
)
print("")
def _remove_target_column(self, df):
colnames = np.asarray(df.columns)
if self.target not in colnames:
return df
colnames = colnames[colnames != self.target]
return df[colnames]
def _select_features(self, df, target):
df_selected = tsfresh.select_features(
df,
target
)
colnames = np.asarray(df_selected.columns)
correlations = np.asarray([
np.abs(pearsonr(target, df_selected[col]))[0] for col in colnames
])
# [::-1] is somewhat unintuitive syntax,
# but it reverses the entire column.
self.selected_features = colnames[
np.argsort(correlations)
][::-1][:self.num_features]
return df_selected[self.selected_features]
def fit(self, df):
"""
Fits the features.
"""
begin = time.time()
target = np.asarray(df[self.target])
df_without_target = self._remove_target_column(df)
df_extracted = self._extract_features(
df_without_target)
df_selected = self._select_features(
df_extracted, target)
del df_extracted
gc.collect()
df_selected = self._add_original_columns(df, df_selected)
end = time.time()
self._print_time_taken(begin, end)
return df_selected
def transform(self, df):
"""
Transforms the raw data into a set of features.
"""
df_extracted = self._extract_features(df)
df_selected = df_extracted[self.selected_features]
del df_extracted
gc.collect()
df_selected = self._add_original_columns(df, df_selected)
return df_selected
We need to lag our target variable, so we can input as a feature to tsfresh.
y_lagged = np.asarray(data_full_pandas["y"][:-12])
data_full_pandas.loc[12:, "y_lagged"] = y_lagged
data_full_pandas.loc[12:, "id"] = 1
separation = datetime(2005, 8, 20, 0, 0)
data_full_tsfresh = data_full_pandas[12:]
data_train_tsfresh = data_full_tsfresh[data_full_tsfresh["ds"] < separation]
data_test_tsfresh = data_full_tsfresh[data_full_tsfresh["ds"] >= separation]
data_train_tsfresh
ds | y | y_lagged | id | |
---|---|---|---|---|
12 | 2005-04-11 08:35:00 | 42 | 23.0 | 1.0 |
13 | 2005-04-11 08:40:00 | 41 | 42.0 | 1.0 |
14 | 2005-04-11 08:45:00 | 38 | 37.0 | 1.0 |
15 | 2005-04-11 08:50:00 | 38 | 24.0 | 1.0 |
16 | 2005-04-11 08:55:00 | 40 | 39.0 | 1.0 |
... | ... | ... | ... | ... |
35867 | 2005-08-19 23:35:00 | 6 | 18.0 | 1.0 |
35868 | 2005-08-19 23:40:00 | 16 | 25.0 | 1.0 |
35869 | 2005-08-19 23:45:00 | 11 | 25.0 | 1.0 |
35870 | 2005-08-19 23:50:00 | 10 | 18.0 | 1.0 |
35871 | 2005-08-19 23:55:00 | 11 | 16.0 | 1.0 |
35860 rows × 4 columns
data_test_tsfresh
ds | y | y_lagged | id | |
---|---|---|---|---|
35872 | 2005-08-20 00:00:00 | 13 | 9.0 | 1.0 |
35873 | 2005-08-20 00:05:00 | 7 | 24.0 | 1.0 |
35874 | 2005-08-20 00:10:00 | 10 | 8.0 | 1.0 |
35875 | 2005-08-20 00:15:00 | 6 | 13.0 | 1.0 |
35876 | 2005-08-20 00:20:00 | 12 | 14.0 | 1.0 |
... | ... | ... | ... | ... |
47492 | 2005-09-30 23:45:00 | 14 | 28.0 | 1.0 |
47493 | 2005-09-30 23:50:00 | 12 | 18.0 | 1.0 |
47494 | 2005-09-30 23:55:00 | 8 | 36.0 | 1.0 |
47495 | 2005-10-01 00:00:00 | 13 | 21.0 | 1.0 |
47496 | 2005-10-01 00:05:00 | 13 | 27.0 | 1.0 |
11625 rows × 4 columns
We build 20 features, just like we have with getML.
tsfresh_builder = TSFreshBuilder(
num_features=20,
memory=24,
column_id="id",
time_stamp="ds",
target="y"
)
if RUN_TSFRESH:
tsfresh_train_pandas = tsfresh_builder.fit(data_train_tsfresh)
tsfresh_test_pandas = tsfresh_builder.transform(data_test_tsfresh)
else:
tsfresh_train_pandas = pd.read_csv("tsfresh_train_pandas.csv")
tsfresh_test_pandas = pd.read_csv("tsfresh_test_pandas.csv")
Because tsfresh does not come with built-in predictors, we upload the generated features into a getML pipeline.
tsfresh_train = getml.data.DataFrame.from_pandas(tsfresh_train_pandas, "tsfresh_train")
tsfresh_test = getml.data.DataFrame.from_pandas(tsfresh_test_pandas, "tsfresh_test")
for df in [tsfresh_train, tsfresh_test]:
df.set_role("y", getml.data.roles.target)
df.set_role("ds", getml.data.roles.time_stamp)
df.set_role(df.roles.unused_float, getml.data.roles.numerical)
df.set_role(["y_lagged", "id"], getml.data.roles.unused_float)
tsfresh_train
name | ds | y | y_lagged__mean | y_lagged__sum_values | y_lagged__median | y_lagged__maximum | y_lagged__minimum | y_lagged__percentage_of_reoccurring_values_to_all_values | y_lagged__standard_deviation | y_lagged__skewness | y_lagged__variance | y_lagged__kurtosis | y_lagged__length | y_lagged | id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | target | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | unused_float | unused_float |
unit | time stamp, comparison only | ||||||||||||||
0 | 2005-04-11 08:35:00 | 42 | 23 | 23 | 23 | 23 | 23 | 0 | 0 | 0 | 0 | 0 | 1 | 23 | 1 |
1 | 2005-04-11 08:40:00 | 41 | 32.5 | 65 | 32.5 | 42 | 23 | 0 | 9.5 | 0 | 90.25 | 0 | 2 | 42 | 1 |
2 | 2005-04-11 08:45:00 | 38 | 34 | 102 | 37 | 42 | 23 | 0 | 8.0416 | -1.2435 | 64.6667 | 0 | 3 | 37 | 1 |
3 | 2005-04-11 08:50:00 | 38 | 31.5 | 126 | 30.5 | 42 | 23 | 0 | 8.2006 | 0.2261 | 67.25 | -4.6053 | 4 | 24 | 1 |
4 | 2005-04-11 08:55:00 | 40 | 33 | 165 | 37 | 42 | 23 | 0 | 7.9246 | -0.4313 | 62.8 | -2.9949 | 5 | 39 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
35855 | 2005-08-19 23:35:00 | 6 | 21.16 | 529 | 21 | 37 | 10 | 0.84 | 5.0333 | 0.7576 | 25.3344 | 3.1125 | 25 | 18 | 1 |
35856 | 2005-08-19 23:40:00 | 16 | 21.32 | 533 | 21 | 37 | 10 | 0.88 | 5.089 | 0.6509 | 25.8976 | 2.7456 | 25 | 25 | 1 |
35857 | 2005-08-19 23:45:00 | 11 | 21.48 | 537 | 22 | 37 | 10 | 0.84 | 5.139 | 0.5482 | 26.4096 | 2.4358 | 25 | 25 | 1 |
35858 | 2005-08-19 23:50:00 | 10 | 21.32 | 533 | 21 | 37 | 10 | 0.8 | 5.1824 | 0.6203 | 26.8576 | 2.3339 | 25 | 18 | 1 |
35859 | 2005-08-19 23:55:00 | 11 | 21.56 | 539 | 21 | 37 | 14 | 0.84 | 4.7756 | 1.1404 | 22.8064 | 2.7922 | 25 | 16 | 1 |
35860 rows x 15 columns
memory usage: 4.30 MB
name: tsfresh_train
type: getml.DataFrame
We use an untuned XGBoostRegressor to generate predictions from our tsfresh features, just like we have for getML.
predictor = getml.predictors.XGBoostRegressor()
pipe_tsfresh = getml.pipeline.Pipeline(
tags=['tsfresh'],
predictors=[predictor]
)
pipe_tsfresh.fit(tsfresh_train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Trained pipeline.
Time taken: 0:00:01.000419.
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['tsfresh'])
tsfresh_score = pipe_tsfresh.score(tsfresh_test)
tsfresh_score
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | mae | rmse | rsquared | |
---|---|---|---|---|---|---|
0 | 2024-09-12 21:56:47 | tsfresh_train | y | 6.8096 | 8.7569 | 0.5565 |
1 | 2024-09-12 21:56:48 | tsfresh_test | y | 7.1877 | 9.3024 | 0.4943 |
predictions_tsfresh_test = pipe_tsfresh.predict(tsfresh_test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Let's take a closer look at the predictions to get a better understanding why getML does better than tsfresh.
length = 4000
plt.subplots(figsize=(20, 10))
plt.plot(np.asarray(data_test_pandas["y"])[:length], label="ground truth")
plt.plot(predictions_getml_test[:length], label="getml")
plt.plot(predictions_tsfresh_test[:length], label="tsfresh")
plt.legend(loc="upper right")
<matplotlib.legend.Legend at 0x7896435f8450>
As we can see, tsfresh struggles with the strong seasonal components of this data set and therefore cannot separate signal from noise to the same extent that getML can.
2.6 Combining Prophet and tsfresh¶
Prophet is good at extracting seasonal features. tsfresh is good at extracting autoregressive features. So what if we tried to combine them? How well would that perform compared to getML?
Let's give it a try. We begin by extracting all of the seasonal features from Prophet and combining them with the tsfresh features:
def combine(dfs):
combined = pd.DataFrame()
for df in dfs:
df = df.copy()
if "id" in df.columns:
del df["id"]
df = df.reset_index()
for col in df.columns:
combined[col] = df[col]
return combined
if RUN_PROPHET:
prophet_train_pandas = model_prophet.predict(data_train_tsfresh)
prophet_test_pandas = model_prophet.predict(data_test_tsfresh)
combined_train_pandas = combine([tsfresh_train_pandas, prophet_train_pandas])
combined_test_pandas = combine([tsfresh_test_pandas, prophet_test_pandas])
else:
combined_train_pandas = pd.read_csv("combined_train_pandas.csv")
combined_test_pandas = pd.read_csv("combined_test_pandas.csv")
We upload the data to getML:
combined_train = getml.data.DataFrame.from_pandas(combined_train_pandas, "combined_train")
combined_test = getml.data.DataFrame.from_pandas(combined_test_pandas, "combined_test")
The multiplicative terms are all zero, so we set them to unused to avoid an ugly warning message we would get from getML.
for df in [combined_train, combined_test]:
df.set_role("y", getml.data.roles.target)
df.set_role("ds", getml.data.roles.time_stamp)
df.set_role(df.roles.unused_float, getml.data.roles.numerical)
df.set_role(["multiplicative_terms", "multiplicative_terms_lower", "multiplicative_terms_upper", "y_lagged"], getml.data.roles.unused_float)
Once again, we train an untuned XGBoostRegressor on top of these features.
predictor = getml.predictors.XGBoostRegressor()
pipe_combined = getml.pipeline.Pipeline(
tags=['prophet + tsfresh'],
predictors=[predictor]
)
pipe_combined.fit(combined_train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02
Trained pipeline.
Time taken: 0:00:02.480467.
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['prophet + tsfresh'])
combined_score = pipe_combined.score(combined_test)
combined_score
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | mae | rmse | rsquared | |
---|---|---|---|---|---|---|
0 | 2024-09-12 21:56:51 | combined_train | y | 4.6572 | 6.4011 | 0.7633 |
1 | 2024-09-12 21:56:51 | combined_test | y | 6.1772 | 8.4081 | 0.6661 |
As we can see, combining tsfresh and Prophet generates better predictions than any single one of them, but it is still considerably worse than getML.
predictions_combined_test = pipe_combined.predict(combined_test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
length = 4000
plt.subplots(figsize=(20, 10))
plt.plot(np.asarray(data_test_pandas["y"])[:length], label="ground truth")
plt.plot(predictions_getml_test[:length], label="getml")
plt.plot(predictions_combined_test[:length], label="tsfresh + prophet")
plt.legend(loc="upper right")
<matplotlib.legend.Legend at 0x7896425fcbd0>
2.7 Features¶
The most important feature is the following:
pipe.features.to_sql()[pipe.features.sort(by="importances")[0].name]
DROP TABLE IF EXISTS "FEATURE_2_1";
CREATE TABLE "FEATURE_2_1" AS
SELECT AVG(
CASE
WHEN ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" > 16.860223 ) AND ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" > 28.870791 ) AND ( t2."y" > 9.832317 ) THEN 3.009878351054582
WHEN ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" > 16.860223 ) AND ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" > 28.870791 ) AND ( t2."y" <= 9.832317 OR t2."y" IS NULL ) THEN -29.25530262742197
WHEN ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" > 16.860223 ) AND ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" <= 28.870791 OR t2."hour__strftime__h__ds__mapping_1_target_1_avg" IS NULL ) AND ( t1."weekday__strftime__w__ds" IN ( '0' ) ) THEN -12.75401111588412
WHEN ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" > 16.860223 ) AND ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" <= 28.870791 OR t2."hour__strftime__h__ds__mapping_1_target_1_avg" IS NULL ) AND ( t1."weekday__strftime__w__ds" NOT IN ( '0' ) OR t1."weekday__strftime__w__ds" IS NULL ) THEN -5.674777947347508
WHEN ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" <= 16.860223 OR t2."hour__strftime__h__ds__mapping_1_target_1_avg" IS NULL ) AND ( t2."y" > 17.943925 ) AND ( t2."weekday__strftime__w__ds" IN ( '1', '2', '3', '4' ) ) THEN -23.4772449371152
WHEN ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" <= 16.860223 OR t2."hour__strftime__h__ds__mapping_1_target_1_avg" IS NULL ) AND ( t2."y" > 17.943925 ) AND ( t2."weekday__strftime__w__ds" NOT IN ( '1', '2', '3', '4' ) OR t2."weekday__strftime__w__ds" IS NULL ) THEN -15.06410432423447
WHEN ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" <= 16.860223 OR t2."hour__strftime__h__ds__mapping_1_target_1_avg" IS NULL ) AND ( t2."y" <= 17.943925 OR t2."y" IS NULL ) AND ( t1."hour__strftime__h__ds__mapping_target_1_avg" > 16.903299 ) THEN -37.20384319625197
WHEN ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" <= 16.860223 OR t2."hour__strftime__h__ds__mapping_1_target_1_avg" IS NULL ) AND ( t2."y" <= 17.943925 OR t2."y" IS NULL ) AND ( t1."hour__strftime__h__ds__mapping_target_1_avg" <= 16.903299 OR t1."hour__strftime__h__ds__mapping_target_1_avg" IS NULL ) THEN -28.10249283121772
ELSE NULL
END
) AS "feature_2_1",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "DATA_FULL__STAGING_TABLE_2" t2
ON 1 = 1
WHERE t2."ds__1_000000_hours" <= t1."ds"
AND ( t2."ds__3_000000_hours" > t1."ds" OR t2."ds__3_000000_hours" IS NULL )
GROUP BY t1.rowid;
2.8 Productionization¶
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3
and spark
modules.
pipe.features.to_sql().save("dodgers_pipeline", remove=True)
pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("dodgers_spark", remove=True)
2.9 Summary of results¶
For a more convenient overview, we summarize these results into a table.
from collections import namedtuple
prophet_score = namedtuple('Score', ['mae', 'rmse', 'rsquared'])(out_of_sample['mae'], out_of_sample['rmse'], out_of_sample['rsquared'])
scores = [getml_score, prophet_score, tsfresh_score, combined_score]
pd.DataFrame(data={
'Name': ['getML', 'Prophet', 'tsfresh', 'Prophet + tsfresh'],
'R-squared': [f'{score.rsquared:.2%}' for score in scores],
'RMSE': [f'{score.rmse:,.2f}' for score in scores],
'MAE': [f'{score.mae:,.2f}' for score in scores]
})
Name | R-squared | RMSE | MAE | |
---|---|---|---|---|
0 | getML | 76.28% | 6.38 | 4.64 |
1 | Prophet | 63.06% | 8.32 | 6.22 |
2 | tsfresh | 49.43% | 9.30 | 7.19 |
3 | Prophet + tsfresh | 66.61% | 8.41 | 6.18 |
As we can see, getML outperforms both Prophet and tsfresh by all three measures.
getml.engine.shutdown()
3. Conclusion¶
We have compared getML's feature learning algorithms to Prophet and tsfresh on a data set related to traffic on LA's 101 North freeway. We found that getML significantly outperforms both Prophet and tsfresh. These results are consistent with the view that relational learning is a powerful tool for time series analysis.
You are encouraged to reproduce these results. You will need getML Enterprise to do so. You can download the trial version for free.