Dodgers - Traffic volume prediction¶

In this tutorial, we demonstrate a univariate time series prediction application with getML.

We benchmark our results against Facebook's Prophet and tsfresh.

getML's relational learning algorithms outperform Prophet's classical time series approach by ~14% and tsfresh's brute force approaches to feature engineering by ~26% (measured in terms of the predictive R-squared).

Summary:

Prediction type: Regression model
Domain: Transportation
Prediction target: traffic volume
Source data: Univariate time series
Population size: 47497

Background¶

The data set features some particularly interesting characteristics common for time series, which classical models may struggle to deal with. Such characteristics are:

High frequency (every five minutes)
Dependence on irregular events (holidays, Dodgers games)
Strong and overlapping cycles (daily, weekly)
Anomalies
Multiple seasonalities

To quote the maintainers of the data set:

"This loop sensor data was collected for the Glendale on ramp for the 101 North freeway in Los Angeles. It is close enough to the stadium to see unusual traffic after a Dodgers game, but not so close and heavily used by game traffic so that the signal for the extra traffic is overly obvious."

The dataset was originally collected for this paper:

"Adaptive event detection with time-varying Poisson processes" A. Ihler, J. Hutchins, and P. Smyth Proceedings of the 12th ACM SIGKDD Conference (KDD-06), August 2006.

It is maintained by the UCI Machine Learning Repository:

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

Analysis¶

In [59]:

  Copied!     
 
%pip install -q "getml==1.5.0" "matplotlib==3.9.2" "scipy==1.14.1" "ipywidgets==8.1.5"
%pip install -q "getml==1.5.0" "matplotlib==3.9.2" "scipy==1.14.1" "ipywidgets==8.1.5"

Let's get started with the analysis and set-up your session:

In [2]:

  Copied!     
 
import os
import gc
import time
import warnings

from datetime import datetime
from urllib import request

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import scipy
from scipy.stats import pearsonr
import getml


warnings.simplefilter(action='ignore', category=UserWarning)
%matplotlib inline

print(f"getML API version: {getml.__version__}\n")
import os import gc import time import warnings from datetime import datetime from urllib import request import numpy as np import pandas as pd import matplotlib.pyplot as plt import scipy from scipy.stats import pearsonr import getml warnings.simplefilter(action='ignore', category=UserWarning) %matplotlib inline print(f"getML API version: {getml.__version__}\n") 

getML API version: 1.5.0

For various technical reasons, we want to pre-store the features for prophet and tsfresh. However, you are very welcome to try this at home and fully reproduce our results. You can just set the two constants to "True".

In [3]:

  Copied!     
 
RUN_PROPHET = False
RUN_TSFRESH = False

if RUN_PROPHET:
    %pip install -q "cmdstanpy==1.2.4" "Prophet==1.1.5"
    import logging
    import cmdstanpy
    logger = logging.getLogger('cmdstanpy')
    logger.addHandler(logging.NullHandler())
    logger.propagate = False
    logger.setLevel(logging.CRITICAL)

    from prophet import Prophet

if RUN_TSFRESH:
    %pip install -q "tsfresh==0.20.3"
    import tsfresh
    from tsfresh.utilities.dataframe_functions import roll_time_series
RUN_PROPHET = False RUN_TSFRESH = False if RUN_PROPHET: %pip install -q "cmdstanpy==1.2.4" "Prophet==1.1.5" import logging import cmdstanpy logger = logging.getLogger('cmdstanpy') logger.addHandler(logging.NullHandler()) logger.propagate = False logger.setLevel(logging.CRITICAL) from prophet import Prophet if RUN_TSFRESH: %pip install -q "tsfresh==0.20.3" import tsfresh from tsfresh.utilities.dataframe_functions import roll_time_series

In [61]:

  Copied!     
 
getml.engine.launch(allow_remote_ips=True, token='token')
getml.engine.set_project('dodgers')
getml.engine.launch(allow_remote_ips=True, token='token') getml.engine.set_project('dodgers')

Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux...
Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912220337.log.
  Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

Connected to project 'dodgers'.

1. Loading data¶

1.1 Download from source¶

We begin by downloading the data from the UC Irvine Machine Learning repository:

In [5]:

  Copied!     
 
fname = "Dodgers.data"

if not os.path.exists(fname):
    fname, res = request.urlretrieve(
        "https://archive.ics.uci.edu/ml/machine-learning-databases/event-detection/" + fname, 
        fname
    )
    
data_full_pandas = pd.read_csv(fname, header=None)
fname = "Dodgers.data" if not os.path.exists(fname): fname, res = request.urlretrieve( "https://archive.ics.uci.edu/ml/machine-learning-databases/event-detection/" + fname, fname ) data_full_pandas = pd.read_csv(fname, header=None)

If we use the pre-stored features, we have to download them as well:

In [6]:

  Copied!     
 
PROPHET_FILES = [
    "predictions_prophet_train.csv",
    "predictions_prophet_test.csv",
    "combined_train_pandas.csv",
    "combined_test_pandas.csv"
]

if not RUN_PROPHET:
    for fname in PROPHET_FILES:
        if not os.path.exists(fname):
            fname, res = request.urlretrieve(
                "https://static.getml.com/datasets/dodgers/" + fname, 
                fname
            )
PROPHET_FILES = [ "predictions_prophet_train.csv", "predictions_prophet_test.csv", "combined_train_pandas.csv", "combined_test_pandas.csv" ] if not RUN_PROPHET: for fname in PROPHET_FILES: if not os.path.exists(fname): fname, res = request.urlretrieve( "https://static.getml.com/datasets/dodgers/" + fname, fname )

In [7]:

  Copied!     
 
TSFRESH_FILES = [
    "tsfresh_train_pandas.csv",
    "tsfresh_test_pandas.csv"
]

if not RUN_TSFRESH:
    for fname in TSFRESH_FILES:
        if not os.path.exists(fname):
            fname, res = request.urlretrieve(
                "https://static.getml.com/datasets/dodgers/" + fname, 
                fname
            )
TSFRESH_FILES = [ "tsfresh_train_pandas.csv", "tsfresh_test_pandas.csv" ] if not RUN_TSFRESH: for fname in TSFRESH_FILES: if not os.path.exists(fname): fname, res = request.urlretrieve( "https://static.getml.com/datasets/dodgers/" + fname, fname )

Prophet is pretty strict about how the columns should be named, so we adapt to these restriction:

In [8]:

  Copied!     
 
data_full_pandas.columns = ["ds", "y"]

data_full_pandas = data_full_pandas[data_full_pandas["y"] >= 0]

data_full_pandas = data_full_pandas.reset_index()

del data_full_pandas["index"]

data_full_pandas["ds"] = [
    datetime.strptime(dt, "%m/%d/%Y %H:%M") for dt in data_full_pandas["ds"]
]
data_full_pandas.columns = ["ds", "y"] data_full_pandas = data_full_pandas[data_full_pandas["y"] >= 0] data_full_pandas = data_full_pandas.reset_index() del data_full_pandas["index"] data_full_pandas["ds"] = [ datetime.strptime(dt, "%m/%d/%Y %H:%M") for dt in data_full_pandas["ds"] ]

In [9]:

  Copied!     
 
data_full_pandas
data_full_pandas

Out[9]:

	ds	y
0	2005-04-11 07:35:00	23
1	2005-04-11 07:40:00	42
2	2005-04-11 07:45:00	37
3	2005-04-11 07:50:00	24
4	2005-04-11 07:55:00	39
...	...	...
47492	2005-09-30 23:45:00	14
47493	2005-09-30 23:50:00	12
47494	2005-09-30 23:55:00	8
47495	2005-10-01 00:00:00	13
47496	2005-10-01 00:05:00	13

47497 rows × 2 columns

1.2 Prepare data for getML¶

In [10]:

  Copied!     
 
data_full = getml.data.DataFrame.from_pandas(data_full_pandas, "data_full")
data_full = getml.data.DataFrame.from_pandas(data_full_pandas, "data_full")

In [11]:

  Copied!     
 
data_full.set_role("y", getml.data.roles.target)
data_full.set_role("ds", getml.data.roles.time_stamp)
data_full.set_role("y", getml.data.roles.target) data_full.set_role("ds", getml.data.roles.time_stamp)

In [12]:

  Copied!     
 
data_full
data_full

Out[12]:

name	ds	y
role	time_stamp	target
unit	time stamp, comparison only
0	2005-04-11 07:35:00	23
1	2005-04-11 07:40:00	42
2	2005-04-11 07:45:00	37
3	2005-04-11 07:50:00	24
4	2005-04-11 07:55:00	39
	...	...
47492	2005-09-30 23:45:00	14
47493	2005-09-30 23:50:00	12
47494	2005-09-30 23:55:00	8
47495	2005-10-01	13
47496	2005-10-01 00:05:00	13

47497 rows x 2 columns
memory usage: 0.76 MB
name: data_full
type: getml.DataFrame

In [13]:

  Copied!     
 
split = getml.data.split.time(population=data_full, time_stamp="ds", test=getml.data.time.datetime(2005, 8, 20))
split
split = getml.data.split.time(population=data_full, time_stamp="ds", test=getml.data.time.datetime(2005, 8, 20)) split

Out[13]:


0	train
1	train
2	train
3	train
4	train
	...

47497 rows
type: StringColumnView

Traffic: population table

To allow the algorithm to capture seasonal information, we include time components (such as the day of the week) as categorical variables.

1.3 Define relational model¶

To start with relational learning, we need to specify the data model. We manually replicate the appropriate time series structure by setting time series related join conditions (horizon, memory and allow_lagged_targets). This is done abstractly using Placeholders

The data model consists of two tables:

Population table traffic_{test/train}: holds target and the contemporarily available time-based components
Peripheral table traffic: same table as the population table
Join between both placeholders specifies (horizon) to prevent leaks and (memory) that keeps the computations feasible

In [14]:

  Copied!     
 
# 1. The horizon is 1 hour (we predict the traffic volume in one hour). 
# 2. The memory is 2 hours, so we allow the algorithm to 
#    use information from up to 2 hours ago. 
# 3. We allow lagged targets. Thus, the algorithm can 
#    identify autoregressive processes.

time_series = getml.data.TimeSeries(
    population=data_full,
    split=split,
    time_stamps='ds',
    horizon=getml.data.time.hours(1),
    memory=getml.data.time.hours(2),
    lagged_targets=True
)

time_series
# 1. The horizon is 1 hour (we predict the traffic volume in one hour). # 2. The memory is 2 hours, so we allow the algorithm to # use information from up to 2 hours ago. # 3. We allow lagged targets. Thus, the algorithm can # identify autoregressive processes. time_series = getml.data.TimeSeries( population=data_full, split=split, time_stamps='ds', horizon=getml.data.time.hours(1), memory=getml.data.time.hours(2), lagged_targets=True ) time_series

Out[14]:

data model

diagram

staging

	data frames	staging table
0	population	POPULATION__STAGING_TABLE_1
1	data_full	DATA_FULL__STAGING_TABLE_2

container

population

	subset	name	rows	type
0	test	data_full	11625	View
1	train	data_full	35872	View

peripheral

	name	rows	type
0	data_full	47497	DataFrame

2. Predictive modeling¶

We loaded the data, defined the roles, units and the abstract data model. Next, we create a getML pipeline for relational learning.

2.1 getML Pipeline¶

Set-up of feature learners, selectors & predictor

In [15]:

  Copied!     
 
mapping = getml.preprocessors.Mapping()

seasonal = getml.preprocessors.Seasonal()

fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    num_threads=1,
    aggregation=getml.feature_learning.FastProp.agg_sets.All,
)

relboost = getml.feature_learning.Relboost(
    num_features=10,
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    seed=4367,
    num_threads=1
)

predictor = getml.predictors.XGBoostRegressor()
mapping = getml.preprocessors.Mapping() seasonal = getml.preprocessors.Seasonal() fast_prop = getml.feature_learning.FastProp( loss_function=getml.feature_learning.loss_functions.SquareLoss, num_threads=1, aggregation=getml.feature_learning.FastProp.agg_sets.All, ) relboost = getml.feature_learning.Relboost( num_features=10, loss_function=getml.feature_learning.loss_functions.SquareLoss, seed=4367, num_threads=1 ) predictor = getml.predictors.XGBoostRegressor()

Build the pipeline

In [16]:

  Copied!     
 
pipe = getml.pipeline.Pipeline(
    tags=['memory: 2h', 'horizon: 1h', 'fast_prop', 'relboost'],
    data_model=time_series.data_model,
    preprocessors=[seasonal, mapping],
    feature_learners=[fast_prop, relboost],
    predictors=[predictor]
)
pipe = getml.pipeline.Pipeline( tags=['memory: 2h', 'horizon: 1h', 'fast_prop', 'relboost'], data_model=time_series.data_model, preprocessors=[seasonal, mapping], feature_learners=[fast_prop, relboost], predictors=[predictor] )

2.2 Model training¶

In [17]:

  Copied!     
 
pipe.check(time_series.train)
pipe.check(time_series.train)

Checking data model...

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:12
  Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01

OK.

In [18]:

  Copied!     
 
pipe.fit(time_series.train)
pipe.fit(time_series.train)

Checking data model...

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

OK.

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  FastProp: Trying 2866 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:47
  Relboost: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:11
  FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:03
  Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:05
  XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:12

Trained pipeline.

Time taken: 0:01:20.843726.

Out[18]:

Pipeline(data_model='population',
         feature_learners=['FastProp', 'Relboost'],
         feature_selectors=[],
         include_categorical=False,
         loss_function='SquareLoss',
         peripheral=['data_full'],
         predictors=['XGBoostRegressor'],
         preprocessors=['Seasonal', 'Mapping'],
         share_selected_features=0.5,
         tags=['memory: 2h', 'horizon: 1h', 'fast_prop', 'relboost', 'container-qLmkYe'])

2.3 Model evaluation¶

In [19]:

  Copied!     
 
getml_score = pipe.score(time_series.test)
getml_score
getml_score = pipe.score(time_series.test) getml_score

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
  Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02

Out[19]:

	date time	set used	target	mae	rmse	rsquared
0	2024-09-12 21:56:38	train	y	4.4978	6.1956	0.7778
1	2024-09-12 21:56:42	test	y	4.643	6.3837	0.7628

In [20]:

  Copied!     
 
time_series
time_series

Out[20]:

data model

diagram

staging

	data frames	staging table
0	population	POPULATION__STAGING_TABLE_1
1	data_full	DATA_FULL__STAGING_TABLE_2

container

population

	subset	name	rows	type
0	test	data_full	11625	View
1	train	data_full	35872	View

peripheral

	name	rows	type
0	data_full	47497	DataFrame

In [21]:

  Copied!     
 
predictions_getml_test = pipe.predict(time_series.test)
predictions_getml_test = pipe.predict(time_series.test)

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
  Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01

2.4 Benchmark against Prophet¶

Prophet is a library for generating predictions on univariate time series that contain strong seasonal components. We would therefore expect it to do well on this particular time series.

In [22]:

  Copied!     
 
data_train_pandas = time_series.train.population.to_pandas()
data_test_pandas = time_series.test.population[1:].to_pandas()
data_train_pandas
data_train_pandas = time_series.train.population.to_pandas() data_test_pandas = time_series.test.population[1:].to_pandas() data_train_pandas

Out[22]:

	y	ds
0	23.0	2005-04-11 07:35:00
1	42.0	2005-04-11 07:40:00
2	37.0	2005-04-11 07:45:00
3	24.0	2005-04-11 07:50:00
4	39.0	2005-04-11 07:55:00
...	...	...
35867	6.0	2005-08-19 23:35:00
35868	16.0	2005-08-19 23:40:00
35869	11.0	2005-08-19 23:45:00
35870	10.0	2005-08-19 23:50:00
35871	11.0	2005-08-19 23:55:00

35872 rows × 2 columns

In [23]:

  Copied!     
 
if RUN_PROPHET:
    model_prophet = Prophet()

    model_prophet = model_prophet.fit(data_train_pandas)

    predictions_prophet_train = model_prophet.predict(data_train_pandas)["yhat"]
    predictions_prophet_test = model_prophet.predict(data_test_pandas)["yhat"]
else:
    predictions_prophet_train = pd.read_csv("predictions_prophet_train.csv")["yhat"]
    predictions_prophet_test = pd.read_csv("predictions_prophet_test.csv")["yhat"]
if RUN_PROPHET: model_prophet = Prophet() model_prophet = model_prophet.fit(data_train_pandas) predictions_prophet_train = model_prophet.predict(data_train_pandas)["yhat"] predictions_prophet_test = model_prophet.predict(data_test_pandas)["yhat"] else: predictions_prophet_train = pd.read_csv("predictions_prophet_train.csv")["yhat"] predictions_prophet_test = pd.read_csv("predictions_prophet_test.csv")["yhat"]

Since we are not using the getML engine for Prophet, we have to implement the metrics ourselves. Luckily, that is not very hard.

In [24]:

  Copied!     
 
in_sample = dict()
out_of_sample = dict()
in_sample = dict() out_of_sample = dict()

In [25]:

  Copied!     
 
predictions_prophet_test
predictions_prophet_test

Out[25]:

0        11.182024
1        10.809442
2        10.439468
3        10.072405
4         9.708546
           ...    
11619     8.201863
11620     7.822110
11621     7.443635
11622     7.066790
11623     6.691914
Name: yhat, Length: 11624, dtype: float64

In [26]:

  Copied!     
 
data_train_pandas
data_train_pandas

Out[26]:

	y	ds
0	23.0	2005-04-11 07:35:00
1	42.0	2005-04-11 07:40:00
2	37.0	2005-04-11 07:45:00
3	24.0	2005-04-11 07:50:00
4	39.0	2005-04-11 07:55:00
...	...	...
35867	6.0	2005-08-19 23:35:00
35868	16.0	2005-08-19 23:40:00
35869	11.0	2005-08-19 23:45:00
35870	10.0	2005-08-19 23:50:00
35871	11.0	2005-08-19 23:55:00

35872 rows × 2 columns

In [27]:

  Copied!     
 
def r_squared(yhat, y):
    yhat = np.asarray(yhat)
    y = np.asarray(y)
    r = scipy.stats.pearsonr(yhat, y)[0]
    return r * r
    
in_sample["rsquared"] = r_squared(predictions_prophet_train, data_train_pandas["y"])
out_of_sample["rsquared"] = r_squared(predictions_prophet_test, data_test_pandas["y"])
def r_squared(yhat, y): yhat = np.asarray(yhat) y = np.asarray(y) r = scipy.stats.pearsonr(yhat, y)[0] return r * r in_sample["rsquared"] = r_squared(predictions_prophet_train, data_train_pandas["y"]) out_of_sample["rsquared"] = r_squared(predictions_prophet_test, data_test_pandas["y"])

In [28]:

  Copied!     
 
def rmse(yhat, y):
    yhat = np.asarray(yhat)
    y = np.asarray(y)
    return np.sqrt(
        ((y - yhat)*(y - yhat)).sum()/len(y)
    )

in_sample["rmse"] = rmse(predictions_prophet_train, data_train_pandas["y"])
out_of_sample["rmse"] = rmse(predictions_prophet_test, data_test_pandas["y"])
def rmse(yhat, y): yhat = np.asarray(yhat) y = np.asarray(y) return np.sqrt( ((y - yhat)*(y - yhat)).sum()/len(y) ) in_sample["rmse"] = rmse(predictions_prophet_train, data_train_pandas["y"]) out_of_sample["rmse"] = rmse(predictions_prophet_test, data_test_pandas["y"])

In [29]:

  Copied!     
 
def mae(yhat, y):
    yhat = np.asarray(yhat)
    y = np.asarray(y)
    return np.abs(y - yhat).sum()/len(y)
    
in_sample["mae"] = mae(predictions_prophet_train, data_train_pandas["y"])
out_of_sample["mae"] = mae(predictions_prophet_test, data_test_pandas["y"])
def mae(yhat, y): yhat = np.asarray(yhat) y = np.asarray(y) return np.abs(y - yhat).sum()/len(y) in_sample["mae"] = mae(predictions_prophet_train, data_train_pandas["y"]) out_of_sample["mae"] = mae(predictions_prophet_test, data_test_pandas["y"])

In [30]:

  Copied!     
 
print("""
In sample mae: {:.4f}
In sample rmse: {:.4f}
In sample rsquared: {:.4f}\n
Out of sample mae: {:.4f}
Out of sample rmse: {:.4f}
Out of sample rsquared: {:.4f}
""".format(
    in_sample['mae'], 
    in_sample['rmse'],
    in_sample['rsquared'],
    out_of_sample['mae'], 
    out_of_sample['rmse'],
    out_of_sample['rsquared'])
)
print(""" In sample mae: {:.4f} In sample rmse: {:.4f} In sample rsquared: {:.4f}\n Out of sample mae: {:.4f} Out of sample rmse: {:.4f} Out of sample rsquared: {:.4f} """.format( in_sample['mae'], in_sample['rmse'], in_sample['rsquared'], out_of_sample['mae'], out_of_sample['rmse'], out_of_sample['rsquared']) )

In sample mae: 5.6729
In sample rmse: 7.5989
In sample rsquared: 0.6655

Out of sample mae: 6.2245
Out of sample rmse: 8.3225
Out of sample rsquared: 0.6306

Let's take a closer look at the predictions to get a better understanding why getML does better than Prophet.

In [31]:

  Copied!     
 
length = 4000

plt.subplots(figsize=(20, 10))

plt.plot(np.asarray(data_test_pandas["y"])[:length], label="ground truth")
plt.plot(predictions_getml_test[:length], label="getml")
plt.plot(predictions_prophet_test[:length], label="prophet")
plt.legend(loc="upper right")
length = 4000 plt.subplots(figsize=(20, 10)) plt.plot(np.asarray(data_test_pandas["y"])[:length], label="ground truth") plt.plot(predictions_getml_test[:length], label="getml") plt.plot(predictions_prophet_test[:length], label="prophet") plt.legend(loc="upper right")

Out[31]:

<matplotlib.legend.Legend at 0x7896453b4bd0>

No description has been provided for this image

As this plot indicates, getML does better than Prophet, because it can integrate autoregressive processes in addition to seasonal data.

2.5 Benchmark against tsfresh¶

tsfresh is a library for generating features on time series. It uses a brute-force approach: It generates a large number of hard-coded features and then does a feature selection.

For convenience, we have built a wrapper around tsfresh.

As we have discussed in a different notebook, tsfresh consumes a lot of memory. To limit the memory consumption to a feasible level, we only use tsfresh's MinimalFCParameters and IndexBasedFCParameters, which are a superset of the TimeBasedFCParameters.

In [32]:

  Copied!     
 
class TSFreshBuilder():
    
    def __init__(self, num_features, memory, column_id, time_stamp, target):
        """
        Scikit-learn style feature builder based on TSFresh.
        
        Args:
            
            num_features: The (maximum) number of features to build.
            
            memory: How much back in time you want to go until the
                    feature builder starts "forgetting" data.
                    
            column_id: The name of the column containing the ids.
            
            time_stamp: The name of the column containing the time stamps.
            
            target: The name of the target column.
        """
        self.num_features = num_features
        self.memory = memory
        self.column_id = column_id
        self.time_stamp = time_stamp
        self.target = target
        
        self.selected_features = []
        
    def _add_original_columns(self, original_df, df_selected):
        for colname in original_df.columns:
            df_selected[colname] = np.asarray(
                original_df[colname])
                    
        return df_selected

    def _extract_features(self, df):
        df_rolled = roll_time_series(
            df, 
            column_id=self.column_id, 
            column_sort=self.time_stamp,
            max_timeshift=self.memory
        )
        
        extracted_minimal = tsfresh.extract_features(
            df_rolled,
            column_id=self.column_id, 
            column_sort=self.time_stamp,
            default_fc_parameters=tsfresh.feature_extraction.MinimalFCParameters()
        )
        
        extracted_index_based = tsfresh.extract_features(
            df_rolled,
            column_id=self.column_id, 
            column_sort=self.time_stamp,
            default_fc_parameters=tsfresh.feature_extraction.settings.IndexBasedFCParameters()
        )
        
        extracted_features = pd.concat(
            [extracted_minimal, extracted_index_based], axis=1
        )
        del extracted_minimal
        del extracted_index_based
        
        gc.collect()
        
        extracted_features[
            extracted_features != extracted_features] = 0.0  
        
        extracted_features[
            np.isinf(extracted_features)] = 0.0 
        
        return extracted_features
        
    def _print_time_taken(self, begin, end):

        seconds = end - begin

        hours = int(seconds / 3600)
        seconds -= float(hours * 3600)

        minutes = int(seconds / 60)
        seconds -= float(minutes * 60)

        seconds = round(seconds, 6)

        print(
            "Time taken: " + str(hours) + "h:" +
            str(minutes) + "m:" + str(seconds)
        )

        print("")
        
    def _remove_target_column(self, df):
        colnames = np.asarray(df.columns)
        
        if self.target not in colnames:
            return df
        
        colnames = colnames[colnames != self.target]
        
        return df[colnames]
        
    def _select_features(self, df, target):
        df_selected = tsfresh.select_features(
            df, 
            target
        )
        
        colnames = np.asarray(df_selected.columns)

        correlations = np.asarray([
            np.abs(pearsonr(target, df_selected[col]))[0] for col in colnames
        ])
        
        # [::-1] is somewhat unintuitive syntax,
        # but it reverses the entire column.
        self.selected_features = colnames[
            np.argsort(correlations)
        ][::-1][:self.num_features]

        return df_selected[self.selected_features]
        
    def fit(self, df):
        """
        Fits the features.
        """
        begin = time.time()

        target = np.asarray(df[self.target])
        
        df_without_target = self._remove_target_column(df)
        
        df_extracted = self._extract_features(
            df_without_target)
        
        df_selected = self._select_features(
            df_extracted, target)
                
        del df_extracted
        gc.collect()
        
        df_selected = self._add_original_columns(df, df_selected)

        end = time.time()
        
        self._print_time_taken(begin, end)
        
        return df_selected
    
    def transform(self, df):
        """
        Transforms the raw data into a set of features.
        """
        df_extracted = self._extract_features(df)
        
        df_selected = df_extracted[self.selected_features]
        
        del df_extracted
        gc.collect()
        
        df_selected = self._add_original_columns(df, df_selected)
                                         
        return df_selected
class TSFreshBuilder(): def __init__(self, num_features, memory, column_id, time_stamp, target): """ Scikit-learn style feature builder based on TSFresh. Args: num_features: The (maximum) number of features to build. memory: How much back in time you want to go until the feature builder starts "forgetting" data. column_id: The name of the column containing the ids. time_stamp: The name of the column containing the time stamps. target: The name of the target column. """ self.num_features = num_features self.memory = memory self.column_id = column_id self.time_stamp = time_stamp self.target = target self.selected_features = [] def _add_original_columns(self, original_df, df_selected): for colname in original_df.columns: df_selected[colname] = np.asarray( original_df[colname]) return df_selected def _extract_features(self, df): df_rolled = roll_time_series( df, column_id=self.column_id, column_sort=self.time_stamp, max_timeshift=self.memory ) extracted_minimal = tsfresh.extract_features( df_rolled, column_id=self.column_id, column_sort=self.time_stamp, default_fc_parameters=tsfresh.feature_extraction.MinimalFCParameters() ) extracted_index_based = tsfresh.extract_features( df_rolled, column_id=self.column_id, column_sort=self.time_stamp, default_fc_parameters=tsfresh.feature_extraction.settings.IndexBasedFCParameters() ) extracted_features = pd.concat( [extracted_minimal, extracted_index_based], axis=1 ) del extracted_minimal del extracted_index_based gc.collect() extracted_features[ extracted_features != extracted_features] = 0.0 extracted_features[ np.isinf(extracted_features)] = 0.0 return extracted_features def _print_time_taken(self, begin, end): seconds = end - begin hours = int(seconds / 3600) seconds -= float(hours * 3600) minutes = int(seconds / 60) seconds -= float(minutes * 60) seconds = round(seconds, 6) print( "Time taken: " + str(hours) + "h:" + str(minutes) + "m:" + str(seconds) ) print("") def _remove_target_column(self, df): colnames = np.asarray(df.columns) if self.target not in colnames: return df colnames = colnames[colnames != self.target] return df[colnames] def _select_features(self, df, target): df_selected = tsfresh.select_features( df, target ) colnames = np.asarray(df_selected.columns) correlations = np.asarray([ np.abs(pearsonr(target, df_selected[col]))[0] for col in colnames ]) # [::-1] is somewhat unintuitive syntax, # but it reverses the entire column. self.selected_features = colnames[ np.argsort(correlations) ][::-1][:self.num_features] return df_selected[self.selected_features] def fit(self, df): """ Fits the features. """ begin = time.time() target = np.asarray(df[self.target]) df_without_target = self._remove_target_column(df) df_extracted = self._extract_features( df_without_target) df_selected = self._select_features( df_extracted, target) del df_extracted gc.collect() df_selected = self._add_original_columns(df, df_selected) end = time.time() self._print_time_taken(begin, end) return df_selected def transform(self, df): """ Transforms the raw data into a set of features. """ df_extracted = self._extract_features(df) df_selected = df_extracted[self.selected_features] del df_extracted gc.collect() df_selected = self._add_original_columns(df, df_selected) return df_selected

We need to lag our target variable, so we can input as a feature to tsfresh.

In [33]:

  Copied!     
 
y_lagged = np.asarray(data_full_pandas["y"][:-12])
data_full_pandas.loc[12:, "y_lagged"] = y_lagged
data_full_pandas.loc[12:, "id"] = 1
y_lagged = np.asarray(data_full_pandas["y"][:-12]) data_full_pandas.loc[12:, "y_lagged"] = y_lagged data_full_pandas.loc[12:, "id"] = 1

In [34]:

  Copied!     
 
separation = datetime(2005, 8, 20, 0, 0)
data_full_tsfresh = data_full_pandas[12:]
data_train_tsfresh = data_full_tsfresh[data_full_tsfresh["ds"] < separation]
data_test_tsfresh = data_full_tsfresh[data_full_tsfresh["ds"] >= separation]
separation = datetime(2005, 8, 20, 0, 0) data_full_tsfresh = data_full_pandas[12:] data_train_tsfresh = data_full_tsfresh[data_full_tsfresh["ds"] < separation] data_test_tsfresh = data_full_tsfresh[data_full_tsfresh["ds"] >= separation]

In [35]:

  Copied!     
 
data_train_tsfresh
data_train_tsfresh

Out[35]:

	ds	y	y_lagged	id
12	2005-04-11 08:35:00	42	23.0	1.0
13	2005-04-11 08:40:00	41	42.0	1.0
14	2005-04-11 08:45:00	38	37.0	1.0
15	2005-04-11 08:50:00	38	24.0	1.0
16	2005-04-11 08:55:00	40	39.0	1.0
...	...	...	...	...
35867	2005-08-19 23:35:00	6	18.0	1.0
35868	2005-08-19 23:40:00	16	25.0	1.0
35869	2005-08-19 23:45:00	11	25.0	1.0
35870	2005-08-19 23:50:00	10	18.0	1.0
35871	2005-08-19 23:55:00	11	16.0	1.0

35860 rows × 4 columns

In [36]:

  Copied!     
 
data_test_tsfresh
data_test_tsfresh

Out[36]:

	ds	y	y_lagged	id
35872	2005-08-20 00:00:00	13	9.0	1.0
35873	2005-08-20 00:05:00	7	24.0	1.0
35874	2005-08-20 00:10:00	10	8.0	1.0
35875	2005-08-20 00:15:00	6	13.0	1.0
35876	2005-08-20 00:20:00	12	14.0	1.0
...	...	...	...	...
47492	2005-09-30 23:45:00	14	28.0	1.0
47493	2005-09-30 23:50:00	12	18.0	1.0
47494	2005-09-30 23:55:00	8	36.0	1.0
47495	2005-10-01 00:00:00	13	21.0	1.0
47496	2005-10-01 00:05:00	13	27.0	1.0

11625 rows × 4 columns

We build 20 features, just like we have with getML.

In [37]:

  Copied!     
 
tsfresh_builder = TSFreshBuilder(
    num_features=20,
    memory=24,
    column_id="id",
    time_stamp="ds",
    target="y"
)
tsfresh_builder = TSFreshBuilder( num_features=20, memory=24, column_id="id", time_stamp="ds", target="y" )

In [38]:

  Copied!     
 
if RUN_TSFRESH:
    tsfresh_train_pandas = tsfresh_builder.fit(data_train_tsfresh)
    tsfresh_test_pandas = tsfresh_builder.transform(data_test_tsfresh)
else:
    tsfresh_train_pandas = pd.read_csv("tsfresh_train_pandas.csv")
    tsfresh_test_pandas = pd.read_csv("tsfresh_test_pandas.csv")
if RUN_TSFRESH: tsfresh_train_pandas = tsfresh_builder.fit(data_train_tsfresh) tsfresh_test_pandas = tsfresh_builder.transform(data_test_tsfresh) else: tsfresh_train_pandas = pd.read_csv("tsfresh_train_pandas.csv") tsfresh_test_pandas = pd.read_csv("tsfresh_test_pandas.csv")

Because tsfresh does not come with built-in predictors, we upload the generated features into a getML pipeline.

In [39]:

  Copied!     
 
tsfresh_train = getml.data.DataFrame.from_pandas(tsfresh_train_pandas, "tsfresh_train")
tsfresh_test = getml.data.DataFrame.from_pandas(tsfresh_test_pandas, "tsfresh_test")
tsfresh_train = getml.data.DataFrame.from_pandas(tsfresh_train_pandas, "tsfresh_train") tsfresh_test = getml.data.DataFrame.from_pandas(tsfresh_test_pandas, "tsfresh_test")

In [40]:

  Copied!     
 
for df in [tsfresh_train, tsfresh_test]:
    df.set_role("y", getml.data.roles.target)
    df.set_role("ds", getml.data.roles.time_stamp)
    df.set_role(df.roles.unused_float, getml.data.roles.numerical)
    df.set_role(["y_lagged", "id"], getml.data.roles.unused_float)
for df in [tsfresh_train, tsfresh_test]: df.set_role("y", getml.data.roles.target) df.set_role("ds", getml.data.roles.time_stamp) df.set_role(df.roles.unused_float, getml.data.roles.numerical) df.set_role(["y_lagged", "id"], getml.data.roles.unused_float)

In [41]:

  Copied!     
 
tsfresh_train
tsfresh_train

Out[41]:

name	ds	y	y_lagged__mean	y_lagged__sum_values	y_lagged__median	y_lagged__maximum	y_lagged__minimum	y_lagged__percentage_of_reoccurring_values_to_all_values	y_lagged__standard_deviation	y_lagged__skewness	y_lagged__variance	y_lagged__kurtosis	y_lagged__length	y_lagged	id
role	time_stamp	target	numerical	numerical	numerical	numerical	numerical	numerical	numerical	numerical	numerical	numerical	numerical	unused_float	unused_float
unit	time stamp, comparison only
0	2005-04-11 08:35:00	42	23	23	23	23	23	0	0	0	0	0	1	23	1
1	2005-04-11 08:40:00	41	32.5	65	32.5	42	23	0	9.5	0	90.25	0	2	42	1
2	2005-04-11 08:45:00	38	34	102	37	42	23	0	8.0416	-1.2435	64.6667	0	3	37	1
3	2005-04-11 08:50:00	38	31.5	126	30.5	42	23	0	8.2006	0.2261	67.25	-4.6053	4	24	1
4	2005-04-11 08:55:00	40	33	165	37	42	23	0	7.9246	-0.4313	62.8	-2.9949	5	39	1
	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
35855	2005-08-19 23:35:00	6	21.16	529	21	37	10	0.84	5.0333	0.7576	25.3344	3.1125	25	18	1
35856	2005-08-19 23:40:00	16	21.32	533	21	37	10	0.88	5.089	0.6509	25.8976	2.7456	25	25	1
35857	2005-08-19 23:45:00	11	21.48	537	22	37	10	0.84	5.139	0.5482	26.4096	2.4358	25	25	1
35858	2005-08-19 23:50:00	10	21.32	533	21	37	10	0.8	5.1824	0.6203	26.8576	2.3339	25	18	1
35859	2005-08-19 23:55:00	11	21.56	539	21	37	14	0.84	4.7756	1.1404	22.8064	2.7922	25	16	1

35860 rows x 15 columns
memory usage: 4.30 MB
name: tsfresh_train
type: getml.DataFrame

We use an untuned XGBoostRegressor to generate predictions from our tsfresh features, just like we have for getML.

In [42]:

  Copied!     
 
predictor = getml.predictors.XGBoostRegressor()

pipe_tsfresh = getml.pipeline.Pipeline(
    tags=['tsfresh'],
    predictors=[predictor]
)
predictor = getml.predictors.XGBoostRegressor() pipe_tsfresh = getml.pipeline.Pipeline( tags=['tsfresh'], predictors=[predictor] )

In [43]:

  Copied!     
 
pipe_tsfresh.fit(tsfresh_train)
pipe_tsfresh.fit(tsfresh_train)

Checking data model...

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

OK.

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

Trained pipeline.

Time taken: 0:00:01.000419.

Out[43]:

Pipeline(data_model='population',
         feature_learners=[],
         feature_selectors=[],
         include_categorical=False,
         loss_function='SquareLoss',
         peripheral=[],
         predictors=['XGBoostRegressor'],
         preprocessors=[],
         share_selected_features=0.5,
         tags=['tsfresh'])

In [44]:

  Copied!     
 
tsfresh_score = pipe_tsfresh.score(tsfresh_test)
tsfresh_score
tsfresh_score = pipe_tsfresh.score(tsfresh_test) tsfresh_score

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

Out[44]:

	date time	set used	target	mae	rmse	rsquared
0	2024-09-12 21:56:47	tsfresh_train	y	6.8096	8.7569	0.5565
1	2024-09-12 21:56:48	tsfresh_test	y	7.1877	9.3024	0.4943

In [45]:

  Copied!     
 
predictions_tsfresh_test = pipe_tsfresh.predict(tsfresh_test)
predictions_tsfresh_test = pipe_tsfresh.predict(tsfresh_test)

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

Let's take a closer look at the predictions to get a better understanding why getML does better than tsfresh.

In [46]:

  Copied!     
 
length = 4000

plt.subplots(figsize=(20, 10))

plt.plot(np.asarray(data_test_pandas["y"])[:length], label="ground truth")
plt.plot(predictions_getml_test[:length], label="getml")
plt.plot(predictions_tsfresh_test[:length], label="tsfresh")
plt.legend(loc="upper right")
length = 4000 plt.subplots(figsize=(20, 10)) plt.plot(np.asarray(data_test_pandas["y"])[:length], label="ground truth") plt.plot(predictions_getml_test[:length], label="getml") plt.plot(predictions_tsfresh_test[:length], label="tsfresh") plt.legend(loc="upper right")

Out[46]:

<matplotlib.legend.Legend at 0x7896435f8450>

As we can see, tsfresh struggles with the strong seasonal components of this data set and therefore cannot separate signal from noise to the same extent that getML can.

2.6 Combining Prophet and tsfresh¶

Prophet is good at extracting seasonal features. tsfresh is good at extracting autoregressive features. So what if we tried to combine them? How well would that perform compared to getML?

Let's give it a try. We begin by extracting all of the seasonal features from Prophet and combining them with the tsfresh features:

In [47]:

  Copied!     
 
def combine(dfs):
    combined = pd.DataFrame()
    for df in dfs:
        df = df.copy()
        if "id" in df.columns:
            del df["id"]
        df = df.reset_index()
        for col in df.columns:
            combined[col] = df[col]
    return combined

if RUN_PROPHET:
    prophet_train_pandas = model_prophet.predict(data_train_tsfresh)
    prophet_test_pandas = model_prophet.predict(data_test_tsfresh)

    combined_train_pandas = combine([tsfresh_train_pandas, prophet_train_pandas])
    combined_test_pandas = combine([tsfresh_test_pandas, prophet_test_pandas])
else:
    combined_train_pandas = pd.read_csv("combined_train_pandas.csv")
    combined_test_pandas = pd.read_csv("combined_test_pandas.csv")
def combine(dfs): combined = pd.DataFrame() for df in dfs: df = df.copy() if "id" in df.columns: del df["id"] df = df.reset_index() for col in df.columns: combined[col] = df[col] return combined if RUN_PROPHET: prophet_train_pandas = model_prophet.predict(data_train_tsfresh) prophet_test_pandas = model_prophet.predict(data_test_tsfresh) combined_train_pandas = combine([tsfresh_train_pandas, prophet_train_pandas]) combined_test_pandas = combine([tsfresh_test_pandas, prophet_test_pandas]) else: combined_train_pandas = pd.read_csv("combined_train_pandas.csv") combined_test_pandas = pd.read_csv("combined_test_pandas.csv")

We upload the data to getML:

In [48]:

  Copied!     
 
combined_train = getml.data.DataFrame.from_pandas(combined_train_pandas, "combined_train")
combined_test = getml.data.DataFrame.from_pandas(combined_test_pandas, "combined_test")
combined_train = getml.data.DataFrame.from_pandas(combined_train_pandas, "combined_train") combined_test = getml.data.DataFrame.from_pandas(combined_test_pandas, "combined_test")

The multiplicative terms are all zero, so we set them to unused to avoid an ugly warning message we would get from getML.

In [49]:

  Copied!     
 
for df in [combined_train, combined_test]:
    df.set_role("y", getml.data.roles.target)
    df.set_role("ds", getml.data.roles.time_stamp)
    df.set_role(df.roles.unused_float, getml.data.roles.numerical)
    df.set_role(["multiplicative_terms", "multiplicative_terms_lower", "multiplicative_terms_upper", "y_lagged"], getml.data.roles.unused_float)
for df in [combined_train, combined_test]: df.set_role("y", getml.data.roles.target) df.set_role("ds", getml.data.roles.time_stamp) df.set_role(df.roles.unused_float, getml.data.roles.numerical) df.set_role(["multiplicative_terms", "multiplicative_terms_lower", "multiplicative_terms_upper", "y_lagged"], getml.data.roles.unused_float)

Once again, we train an untuned XGBoostRegressor on top of these features.

In [50]:

  Copied!     
 
predictor = getml.predictors.XGBoostRegressor()

pipe_combined = getml.pipeline.Pipeline(
    tags=['prophet + tsfresh'],
    predictors=[predictor]
)
predictor = getml.predictors.XGBoostRegressor() pipe_combined = getml.pipeline.Pipeline( tags=['prophet + tsfresh'], predictors=[predictor] )

In [51]:

  Copied!     
 
pipe_combined.fit(combined_train)
pipe_combined.fit(combined_train)

Checking data model...

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

OK.

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02

Trained pipeline.

Time taken: 0:00:02.480467.

Out[51]:

Pipeline(data_model='population',
         feature_learners=[],
         feature_selectors=[],
         include_categorical=False,
         loss_function='SquareLoss',
         peripheral=[],
         predictors=['XGBoostRegressor'],
         preprocessors=[],
         share_selected_features=0.5,
         tags=['prophet + tsfresh'])

In [52]:

  Copied!     
 
combined_score = pipe_combined.score(combined_test)
combined_score
combined_score = pipe_combined.score(combined_test) combined_score

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

Out[52]:

	date time	set used	target	mae	rmse	rsquared
0	2024-09-12 21:56:51	combined_train	y	4.6572	6.4011	0.7633
1	2024-09-12 21:56:51	combined_test	y	6.1772	8.4081	0.6661

As we can see, combining tsfresh and Prophet generates better predictions than any single one of them, but it is still considerably worse than getML.

In [53]:

  Copied!     
 
predictions_combined_test = pipe_combined.predict(combined_test)
predictions_combined_test = pipe_combined.predict(combined_test)

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

In [54]:

  Copied!     
 
length = 4000

plt.subplots(figsize=(20, 10))

plt.plot(np.asarray(data_test_pandas["y"])[:length], label="ground truth")
plt.plot(predictions_getml_test[:length], label="getml")
plt.plot(predictions_combined_test[:length], label="tsfresh + prophet")
plt.legend(loc="upper right")
length = 4000 plt.subplots(figsize=(20, 10)) plt.plot(np.asarray(data_test_pandas["y"])[:length], label="ground truth") plt.plot(predictions_getml_test[:length], label="getml") plt.plot(predictions_combined_test[:length], label="tsfresh + prophet") plt.legend(loc="upper right")

Out[54]:

<matplotlib.legend.Legend at 0x7896425fcbd0>

2.7 Features¶

The most important feature is the following:

In [55]:

  Copied!     
 
pipe.features.to_sql()[pipe.features.sort(by="importances")[0].name]
pipe.features.to_sql()[pipe.features.sort(by="importances")[0].name]

Out[55]:

DROP TABLE IF EXISTS "FEATURE_2_1";

CREATE TABLE "FEATURE_2_1" AS
SELECT AVG( 
    CASE
        WHEN ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" > 16.860223 ) AND ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" > 28.870791 ) AND ( t2."y" > 9.832317 ) THEN 3.009878351054582
        WHEN ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" > 16.860223 ) AND ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" > 28.870791 ) AND ( t2."y" <= 9.832317 OR t2."y" IS NULL ) THEN -29.25530262742197
        WHEN ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" > 16.860223 ) AND ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" <= 28.870791 OR t2."hour__strftime__h__ds__mapping_1_target_1_avg" IS NULL ) AND ( t1."weekday__strftime__w__ds" IN ( '0' ) ) THEN -12.75401111588412
        WHEN ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" > 16.860223 ) AND ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" <= 28.870791 OR t2."hour__strftime__h__ds__mapping_1_target_1_avg" IS NULL ) AND ( t1."weekday__strftime__w__ds" NOT IN ( '0' ) OR t1."weekday__strftime__w__ds" IS NULL ) THEN -5.674777947347508
        WHEN ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" <= 16.860223 OR t2."hour__strftime__h__ds__mapping_1_target_1_avg" IS NULL ) AND ( t2."y" > 17.943925 ) AND ( t2."weekday__strftime__w__ds" IN ( '1', '2', '3', '4' ) ) THEN -23.4772449371152
        WHEN ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" <= 16.860223 OR t2."hour__strftime__h__ds__mapping_1_target_1_avg" IS NULL ) AND ( t2."y" > 17.943925 ) AND ( t2."weekday__strftime__w__ds" NOT IN ( '1', '2', '3', '4' ) OR t2."weekday__strftime__w__ds" IS NULL ) THEN -15.06410432423447
        WHEN ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" <= 16.860223 OR t2."hour__strftime__h__ds__mapping_1_target_1_avg" IS NULL ) AND ( t2."y" <= 17.943925 OR t2."y" IS NULL ) AND ( t1."hour__strftime__h__ds__mapping_target_1_avg" > 16.903299 ) THEN -37.20384319625197
        WHEN ( t2."hour__strftime__h__ds__mapping_1_target_1_avg" <= 16.860223 OR t2."hour__strftime__h__ds__mapping_1_target_1_avg" IS NULL ) AND ( t2."y" <= 17.943925 OR t2."y" IS NULL ) AND ( t1."hour__strftime__h__ds__mapping_target_1_avg" <= 16.903299 OR t1."hour__strftime__h__ds__mapping_target_1_avg" IS NULL ) THEN -28.10249283121772
        ELSE NULL
    END
) AS "feature_2_1",
       t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "DATA_FULL__STAGING_TABLE_2" t2
ON 1 = 1
WHERE t2."ds__1_000000_hours" <= t1."ds"
AND ( t2."ds__3_000000_hours" > t1."ds" OR t2."ds__3_000000_hours" IS NULL )
GROUP BY t1.rowid;

2.8 Productionization¶

It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3 and spark modules.

In [56]:

  Copied!     
 
pipe.features.to_sql().save("dodgers_pipeline", remove=True)
pipe.features.to_sql().save("dodgers_pipeline", remove=True)

In [57]:

  Copied!     
 
pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("dodgers_spark", remove=True)
pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("dodgers_spark", remove=True)

2.9 Summary of results¶

For a more convenient overview, we summarize these results into a table.

In [58]:

  Copied!     
 
from collections import namedtuple

prophet_score = namedtuple('Score', ['mae', 'rmse', 'rsquared'])(out_of_sample['mae'], out_of_sample['rmse'], out_of_sample['rsquared'])
scores = [getml_score, prophet_score, tsfresh_score, combined_score]

pd.DataFrame(data={
    'Name': ['getML', 'Prophet', 'tsfresh', 'Prophet + tsfresh'],
    'R-squared': [f'{score.rsquared:.2%}' for score in scores],
    'RMSE': [f'{score.rmse:,.2f}' for score in scores],
    'MAE': [f'{score.mae:,.2f}' for score in scores]
})
from collections import namedtuple prophet_score = namedtuple('Score', ['mae', 'rmse', 'rsquared'])(out_of_sample['mae'], out_of_sample['rmse'], out_of_sample['rsquared']) scores = [getml_score, prophet_score, tsfresh_score, combined_score] pd.DataFrame(data={ 'Name': ['getML', 'Prophet', 'tsfresh', 'Prophet + tsfresh'], 'R-squared': [f'{score.rsquared:.2%}' for score in scores], 'RMSE': [f'{score.rmse:,.2f}' for score in scores], 'MAE': [f'{score.mae:,.2f}' for score in scores] })

Out[58]:

	Name	R-squared	RMSE	MAE
0	getML	76.28%	6.38	4.64
1	Prophet	63.06%	8.32	6.22
2	tsfresh	49.43%	9.30	7.19
3	Prophet + tsfresh	66.61%	8.41	6.18

As we can see, getML outperforms both Prophet and tsfresh by all three measures.

In [ ]:

  Copied!     
 
getml.engine.shutdown()
getml.engine.shutdown()

3. Conclusion¶

We have compared getML's feature learning algorithms to Prophet and tsfresh on a data set related to traffic on LA's 101 North freeway. We found that getML significantly outperforms both Prophet and tsfresh. These results are consistent with the view that relational learning is a powerful tool for time series analysis.

You are encouraged to reproduce these results. You will need getML Enterprise to do so. You can download the trial version for free.