Interstate 94 - Multivariate time series prediction¶
In this tutorial, we demonstrate a time series application of getML. We predict the hourly traffic volume on I-94 westbound from Minneapolis-St Paul.
Summary:
- Prediction type: Regression model
- Domain: Transportation
- Prediction target: Hourly traffic volume
- Source data: Multivariate time series, 5 components
- Population size: 24096
Author: Sören Nikolaus
Background¶
The dataset features some particularly interesting characteristics common to time series, which classical models may struggle to handle appropriately. Such characteristics are:
- High frequency (hourly)
- Dependence on irregular events (holidays)
- Strong and overlapping cycles (daily, weekly)
- Anomalies
- Multiple seasonalities
The analysis is built on top of a dataset provided by the MN Department of Transportation, with some data preparation done by John Hogue.
Analysis¶
Let's get started with the analysis and set up your session:
%pip install -q "getml==1.4.0" "numpy<2.0.0" "matplotlib~=3.9"
import matplotlib.pyplot as plt
%matplotlib inline
import getml
print(f"getML API version: {getml.__version__}\n")
getml.engine.launch()
getml.engine.set_project("interstate94")
Note: you may need to restart the kernel to use updated packages.
getML API version: 1.4.0

Launching ./getML --allow-push-notifications=true --allow-remote-ips=false --home-directory=/home/alex/.local/lib/python3.10/site-packages/getml --in-memory=true --install=false --launch-browser=true --log=false in /home/alex/.local/lib/python3.10/site-packages/getml/.getML/getml-1.4.0-x64-community-edition-linux...
Launched the getML engine. The log output will be stored in /home/alex/.getML/logs/20240807163727.log.
Connected to project 'interstate94'
1. Loading data¶
1.1 Download from source¶
Downloading the raw data and converting it into a prediction-ready format takes time. To get to the getML model building as fast as possible, we have prepared the data for you and excluded the preparation code from this notebook; it is available in the example notebook featuring the full analysis. We only include data from 2016 onward and introduce a fixed train/test split at 80% of the available data.
traffic = getml.datasets.load_interstate94(roles=False, units=False)
Loading traffic... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
The dataset comes with its own seasonal components. However, we choose not to use them because we want to demonstrate getML's Seasonal preprocessor.
traffic.set_role("ds", getml.data.roles.time_stamp)
traffic.set_role("holiday", getml.data.roles.categorical)
traffic.set_role("traffic_volume", getml.data.roles.target)
traffic
name | ds | traffic_volume | holiday | hour | weekday | day | month | year |
---|---|---|---|---|---|---|---|---|
role | time_stamp | target | categorical | unused_float | unused_float | unused_float | unused_float | unused_float |
unit | time stamp, comparison only | |||||||
0 | 2016-01-01 | 1513 | New Years Day | 0 | 4 | 1 | 1 | 2016 |
1 | 2016-01-01 01:00:00 | 1550 | New Years Day | 1 | 4 | 1 | 1 | 2016 |
2 | 2016-01-01 02:00:00 | 993 | New Years Day | 2 | 4 | 1 | 1 | 2016 |
3 | 2016-01-01 03:00:00 | 719 | New Years Day | 3 | 4 | 1 | 1 | 2016 |
4 | 2016-01-01 04:00:00 | 533 | New Years Day | 4 | 4 | 1 | 1 | 2016 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
24091 | 2018-09-30 19:00:00 | 3543 | No holiday | 19 | 6 | 30 | 9 | 2018 |
24092 | 2018-09-30 20:00:00 | 2781 | No holiday | 20 | 6 | 30 | 9 | 2018 |
24093 | 2018-09-30 21:00:00 | 2159 | No holiday | 21 | 6 | 30 | 9 | 2018 |
24094 | 2018-09-30 22:00:00 | 1450 | No holiday | 22 | 6 | 30 | 9 | 2018 |
24095 | 2018-09-30 23:00:00 | 954 | No holiday | 23 | 6 | 30 | 9 | 2018 |
24096 rows x 8 columns
memory usage: 1.45 MB
name: traffic
type: getml.DataFrame
1.2 Prepare data for getML¶
The getml.datasets.load_interstate94 method took care of the entire data preparation:
- Downloads the CSVs from our servers into Python
- Converts the CSVs to getML DataFrames
- Sets roles & units on the columns inside the getML DataFrames (skipped here via roles=False and units=False; we assigned the roles manually above)
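For reference, a manual version of these steps might look like the following sketch. The file name is a hypothetical stand-in, since the actual CSVs live on getML's servers; the role assignments are exactly those shown above.
import pandas as pd

# Hypothetical local file; the actual data is downloaded from getML's servers.
raw = pd.read_csv("interstate94.csv", parse_dates=["ds"])

# Hand the pandas DataFrame over to the getML engine.
traffic = getml.data.DataFrame.from_pandas(raw, name="traffic")

# Assign the roles, exactly as in section 1.1.
traffic.set_role("ds", getml.data.roles.time_stamp)
traffic.set_role("holiday", getml.data.roles.categorical)
traffic.set_role("traffic_volume", getml.data.roles.target)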
Data visualization
The first week of the original traffic time series is plotted below.
col_data = "black"
col_getml = "darkviolet"
fig, ax = plt.subplots(figsize=(20, 10))
# 2016/01/01 was a Friday; we want the visualization to start on a Monday
start = 72
end = 72 + 168
fig.suptitle(
"Traffic volume for first full week of the training set",
fontsize=14,
fontweight="bold",
)
ax.plot(
traffic["ds"].to_numpy()[start:end],
traffic["traffic_volume"].to_numpy()[start:end],
color=col_data,
)
Traffic: population table
To allow the algorithm to capture seasonal information, we rely on getML's Seasonal preprocessor (getml.preprocessors.Seasonal()), which extracts calendar components (such as the hour or the day of the week) from the time stamp. The precomputed components that ship with the dataset (hour, weekday, day, month, year) therefore remain unused.
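If you preferred to use the precomputed components instead, you could assign them the categorical role directly. A sketch of that alternative (not run in this notebook, since we want the Seasonal preprocessor to do the work):
# Alternative, not used here: let the feature learners consume the
# precomputed seasonal components by marking them as categorical.
for component in ["hour", "weekday", "month"]:
    traffic.set_role(component, getml.data.roles.categorical)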
Train/test split
We use getML's split functionality to retrieve a lazily evaluated split column that we can supply to the time series API below.
split = getml.data.split.time(traffic, "ds", test=getml.data.time.datetime(2018, 3, 15))
Split columns are plain columns of strings that can be used to subset the data by forming boolean conditions over them:
traffic[split == "test"]
name | ds | traffic_volume | holiday | hour | weekday | day | month | year |
---|---|---|---|---|---|---|---|---|
role | time_stamp | target | categorical | unused_float | unused_float | unused_float | unused_float | unused_float |
unit | time stamp, comparison only | |||||||
0 | 2018-03-15 | 577 | No holiday | 0 | 3 | 15 | 3 | 2018 |
1 | 2018-03-15 01:00:00 | 354 | No holiday | 1 | 3 | 15 | 3 | 2018 |
2 | 2018-03-15 02:00:00 | 259 | No holiday | 2 | 3 | 15 | 3 | 2018 |
3 | 2018-03-15 03:00:00 | 360 | No holiday | 3 | 3 | 15 | 3 | 2018 |
4 | 2018-03-15 04:00:00 | 910 | No holiday | 4 | 3 | 15 | 3 | 2018 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
unknown number of rows
type: getml.data.View
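Because views are evaluated lazily, the row count reads "unknown number of rows" until the view is materialized. Converting the view to pandas forces the evaluation, so we can verify the size of the test set (4800 rows, matching the overview below):
# Materialize the view to count its rows.
n_test = traffic[split == "test"].to_pandas().shape[0]
print(f"Rows in the test set: {n_test}")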
1.3 Define relational model¶
To start with relational learning, we need to specify the data model. We manually replicate the appropriate time series structure by setting the time-series-related join conditions (horizon, memory, and lagged_targets). We use the high-level time series API for this.
Under the hood, the time series API abstracts away a self cross join of the population table (traffic) that allows getML's feature learning algorithms to learn patterns from past observations.
time_series = getml.data.TimeSeries(
population=traffic,
split=split,
time_stamps="ds",
horizon=getml.data.time.hours(1),
memory=getml.data.time.days(7),
lagged_targets=True,
)
time_series
data frames | staging table | |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | traffic | TRAFFIC__STAGING_TABLE_2 |
subset | name | rows | type | |
---|---|---|---|---|
0 | test | traffic | 4800 | View |
1 | train | traffic | 19296 | View |
name | rows | type | |
---|---|---|---|
0 | traffic | 24096 | DataFrame |
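For reference, the TimeSeries abstraction is roughly equivalent to the following long form built from getml.data.Container and getml.data.DataModel. Treat this as an illustrative sketch following the getML 1.x documentation, not a drop-in replacement:
# The container holds the actual data; the split column assigns each row
# to the train or test subset.
container = getml.data.Container(population=traffic, split=split)
container.add(traffic=traffic)

# The data model describes the abstract structure: a self join of the
# traffic table onto itself over the time stamp, with a one-hour horizon,
# seven days of memory, and lagged targets allowed.
dm = getml.data.DataModel(traffic.to_placeholder("population"))
dm.add(traffic.to_placeholder("traffic"))
dm.population.join(
    dm.traffic,
    time_stamps="ds",
    horizon=getml.data.time.hours(1),
    memory=getml.data.time.days(7),
    lagged_targets=True,
)
The pipeline below would then receive data_model=dm and be fitted on container.train instead of time_series.train.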
2. Predictive modeling¶
We have loaded the data, defined the roles and units, and specified the abstract data model. Next, we create a getML pipeline for relational learning.
2.1 getML Pipeline¶
Set-up of the feature learner & predictor
# The Seasonal preprocessor extracts seasonal
# components from the time stamps.
seasonal = getml.preprocessors.Seasonal()
fast_prop = getml.feature_learning.FastProp(
loss_function=getml.feature_learning.loss_functions.SquareLoss,
num_threads=1,
num_features=20,
)
predictor = getml.predictors.XGBoostRegressor()
Build the pipeline
pipe = getml.pipeline.Pipeline(
tags=["memory: 7d", "horizon: 1h", "fast_prop"],
data_model=time_series.data_model,
preprocessors=[seasonal],
feature_learners=[fast_prop],
predictors=[predictor],
)
2.2 Model training¶
pipe.fit(time_series.train)
Checking data model...
Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
Checking... 100% |██████████| [elapsed: 00:01, remaining: 00:00]
OK.
Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
FastProp: Trying 373 features... 100% |██████████| [elapsed: 00:10, remaining: 00:00]
FastProp: Building features... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
XGBoost: Training as predictor... 100% |██████████| [elapsed: 00:01, remaining: 00:00]
Trained pipeline. Time taken: 0h:0m:11.394631
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=['traffic'], predictors=['XGBoostRegressor'], preprocessors=['Seasonal'], share_selected_features=0.5, tags=['memory: 7d', 'horizon: 1h', 'fast_prop', 'container-DNncpJ'])
2.3 Model evaluation¶
pipe.score(time_series.test)
Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00] FastProp: Building features... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
| date time | set used | target | mae | rmse | rsquared |
---|---|---|---|---|---|---|
0 | 2024-08-07 16:37:41 | train | traffic_volume | 200.4302 | 299.2045 | 0.9768 |
1 | 2024-08-07 16:37:41 | test | traffic_volume | 179.9515 | 269.631 | 0.9816 |
2.4 Studying features¶
Feature correlations
Correlations of the calculated features with the target
names, correlations = pipe.features.correlations()
plt.subplots(figsize=(20, 10))
plt.bar(names, correlations, color=col_getml)
plt.title("Feature Correlations")
plt.xlabel("Features")
plt.ylabel("Correlations")
plt.xticks(rotation="vertical")
plt.show()
Feature importances
names, importances = pipe.features.importances()
plt.subplots(figsize=(20, 10))
plt.bar(names, importances, color=col_getml)
plt.title("Feature Importances")
plt.xlabel("Features")
plt.ylabel("Importances")
plt.xticks(rotation="vertical")
plt.show()
Visualizing the learned features
We can also transpile the features into SQL code. Here, we show the most important feature.
by_importance = pipe.features.sort(by="importances")
by_importance[0].sql
DROP TABLE IF EXISTS "FEATURE_1_4";
CREATE TABLE "FEATURE_1_4" AS
SELECT FIRST( t2."traffic_volume" ORDER BY t2."ds, '+1.000000 hours'" ) AS "feature_1_4",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "TRAFFIC__STAGING_TABLE_2" t2
ON 1 = 1
WHERE t2."ds, '+1.000000 hours'" <= t1."ds"
AND ( t2."ds, '+7.041667 days'" > t1."ds" OR t2."ds, '+7.041667 days'" IS NULL )
AND t1."hour( ds )" = t2."hour( ds )"
GROUP BY t1.rowid;
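To build some intuition: FIRST returns the value belonging to the smallest time stamp in the window, so this feature is essentially the traffic volume observed at the same hour of day roughly seven days earlier. Assuming a complete hourly grid without gaps, a rough pandas approximation could look like this (an illustrative sketch, not what the engine actually executes):
import pandas as pd

df = traffic.to_pandas().sort_values("ds")
df["hour_of_day"] = pd.to_datetime(df["ds"]).dt.hour

# Oldest same-hour observation within the 7-day memory window: with one
# observation per hour, that is the value seven same-hour rows in the past.
df["feature_1_4_approx"] = df.groupby("hour_of_day")["traffic_volume"].shift(7)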
Plot predictions & traffic volume vs. time
We now plot the predictions against the observed values of the target for the first 7 days of the testing set. You can see that the predictions closely follow the original series. FastProp was able to identify certain patterns in the series, including:
- Day and night separation
- The daily commuting peaks (on weekdays)
- The decline on weekends
predictions = pipe.predict(time_series.test)
Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00] Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00] FastProp: Building features... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
fig, ax = plt.subplots(figsize=(20, 10))
# the test set starts at 2018/03/15, a Thursday; we introduce an offset to, once again, start on a Monday
start = 96
end = 96 + 168
actual = time_series.test.population[start:end].to_pandas()
predicted = predictions[start:end]
ax.plot(actual["ds"], actual["traffic_volume"], color=col_data, label="Actual")
ax.plot(actual["ds"], predicted, color=col_getml, label="Predicted")
fig.suptitle(
"Predicted vs. actual traffic volume for first full week of testing set",
fontsize=14,
fontweight="bold",
)
fig.legend()
getml.engine.shutdown()
3. Conclusion¶
In this notebook, we demonstrated a comprehensive approach to predicting hourly traffic volume on Interstate 94 westbound from Minneapolis-St Paul using the getML library. We covered the following steps:
Background and Data Preparation:
- Introduced the dataset and its characteristics.
- Loaded and prepared the data using getml.datasets.load_interstate94.
Data Visualization:
- Visualized the first week of traffic volume data to understand the patterns and trends.
Data Modeling:
- Defined roles and units for the dataset columns.
- Created a time series model using getml.data.TimeSeries to capture temporal dependencies.
Predictive Modeling:
- Built a getML pipeline for relational learning to predict traffic volume.
By leveraging getML's capabilities, we efficiently handled the complexities of time series data, including high frequency, irregular events, and multiple seasonalities. This approach can be extended to other time series prediction tasks in various domains.