Occupancy - A multivariate time series example
In this tutorial, you will learn how to apply getML to multivariate time series. It also demonstrates how to use getML's high-level interface for hyperparameter tuning.
Summary:
- Prediction type: Binary classification
- Domain: Energy
- Prediction target: Room occupancy
- Source data: 1 table, ~21k rows
- Population size: ~21k
Background
Our use case is a public domain data set for predicting room occupancy from sensor data. The results achieved using getML outperform all published results on this data set. Note that this is not only a neat use case for machine learning algorithms, but a real-world application with tangible consequences: if room occupancy is known with sufficient certainty, that information can be fed into a building's control systems. Such a system can reduce energy consumption by up to 50%.
Introduction to occupancy prediction
getML is usually considered a tool for feature engineering and machine learning on relational data sets. How can we apply it to (multivariate) time series?
The key is a self-join. Instead of creating features by merging and aggregating peripheral tables in a relational data model, we perform the same operations on the population table itself. This results in features like these:
- Aggregations over time, such as the average value of some column over the last 3 days.
- Seasonal effects, such as: today is a Wednesday, so let's take the average value over the last four Wednesdays.
- Lag variables, such as the value of some column from two hours ago.
Using getML's algorithms for relational learning, we can extract all of these features automatically. Having created a flat table of such features, we can then apply state-of-the-art machine learning algorithms, like XGBoost. As you will see in this example, this approach performs better than traditional time series analysis.
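To make this concrete, here is a minimal hand-written sketch of two such features using pandas. This is not part of the notebook, and the toy table and its values are made up for illustration; getML learns features like these, and far less obvious ones, automatically.
import pandas as pd
# A toy sensor table with one reading per minute (hypothetical values).
df = pd.DataFrame(
    {"CO2": [712.0, 714.0, 713.5, 708.25, 704.5]},
    index=pd.date_range("2015-02-04 17:51", periods=5, freq="min"),
)
# Aggregation over time: average CO2 over the last 15 minutes.
df["co2_avg_15min"] = df["CO2"].rolling("15min").mean()
# Lag variable: the CO2 reading from two minutes ago.
df["co2_lag_2min"] = df["CO2"].shift(2)
print(df)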
The present analysis is based on a public domain time series data set available in the UC Irvine Machine Learning Repository. The challenge is straightforward: we want to predict whether an office room is occupied at a given moment in time, using sensor data. The data is measured about once a minute. Ground-truth occupancy was obtained from time-stamped pictures. The available columns are:
- Date, year-month-day hour:minute:second
- Temperature, in Celsius
- Relative Humidity, %
- Light, in Lux
- CO2, in ppm
- Humidity Ratio, derived from temperature and relative humidity, in kg water vapor / kg air
- Occupancy, 0 for not occupied, 1 for occupied
As a reference and benchmark, we use this paper:
Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models. Luis M. Candanedo, Veronique Feldheim. Energy and Buildings. Volume 112, 15 January 2016, Pages 28-39.
The authors applied various artificial neural network algorithms to the data set at hand and achieved accuracies between 80.324% (batch back-propagation algorithm) and 99.061% (limited-memory quasi-Newton algorithm).
Analysis
Let's get started with the analysis and set up your session:
%pip install -q "getml==1.5.0" "matplotlib==3.9.2" "ipywidgets==8.1.5"
import matplotlib.pyplot as plt
import getml
%matplotlib inline
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
getml.engine.launch(allow_remote_ips=True, token='token')
getml.engine.set_project("occupancy")
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912154725.log.
Connected to project 'occupancy'.
1. Loading data
The data set can be downloaded directly from GitHub. It is conveniently separated into a training, a validation and a testing set. This allows us to directly benchmark our results against the results of the original paper later.
(
population_train,
population_test,
population_validation,
) = getml.datasets.load_occupancy()
population, split = getml.data.split.concat(
"population",
train=population_train,
validation=population_validation,
test=population_test,
)
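The concat routine stacks the three subsets into a single population table and returns a split column labeling each row. If you want to inspect the split yourself, here is a quick optional check (a sketch, not part of the original notebook):
# `split` is a string column, aligned row by row with `population`,
# whose values are "train", "validation" or "test".
print(split)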
The combined population table looks like this:
population
name | date | Occupancy | Temperature | Humidity | Light | CO2 | HumidityRatio |
---|---|---|---|---|---|---|---|
role | time_stamp | target | numerical | numerical | numerical | numerical | numerical |
unit | time stamp | ||||||
0 | 2015-02-04 17:51:00 | 1 | 23.18 | 27.272 | 426 | 721.25 | 0.004793 |
1 | 2015-02-04 17:51:59 | 1 | 23.15 | 27.2675 | 429.5 | 714 | 0.004783 |
2 | 2015-02-04 17:53:00 | 1 | 23.15 | 27.245 | 426 | 713.5 | 0.004779 |
3 | 2015-02-04 17:54:00 | 1 | 23.15 | 27.2 | 426 | 708.25 | 0.004772 |
4 | 2015-02-04 17:55:00 | 1 | 23.1 | 27.2 | 426 | 704.5 | 0.004757 |
... | ... | ... | ... | ... | ... | ... | |
20555 | 2015-02-18 09:15:00 | 1 | 20.815 | 27.7175 | 429.75 | 1505.25 | 0.004213 |
20556 | 2015-02-18 09:16:00 | 1 | 20.865 | 27.745 | 423.5 | 1514.5 | 0.00423 |
20557 | 2015-02-18 09:16:59 | 1 | 20.89 | 27.745 | 423.5 | 1521.5 | 0.004237 |
20558 | 2015-02-18 09:17:59 | 1 | 20.89 | 28.0225 | 418.75 | 1632 | 0.004279 |
20559 | 2015-02-18 09:19:00 | 1 | 21 | 28.1 | 409 | 1864 | 0.004321 |
20560 rows x 7 columns
memory usage: 1.15 MB
name: population
type: getml.DataFrame
We also assign roles to each column. To learn more about what roles do and why we need them, check out the official documentation.
population.set_role(["Occupancy"], getml.data.roles.target)
population.set_role(
["Temperature", "Humidity", "Light", "CO2", "HumidityRatio"],
getml.data.roles.numerical,
)
population.set_role(["date"], getml.data.roles.time_stamp)
population
name | date | Occupancy | Temperature | Humidity | Light | CO2 | HumidityRatio |
---|---|---|---|---|---|---|---|
role | time_stamp | target | numerical | numerical | numerical | numerical | numerical |
unit | time stamp | ||||||
0 | 2015-02-04 17:51:00 | 1 | 23.18 | 27.272 | 426 | 721.25 | 0.004793 |
1 | 2015-02-04 17:51:59 | 1 | 23.15 | 27.2675 | 429.5 | 714 | 0.004783 |
2 | 2015-02-04 17:53:00 | 1 | 23.15 | 27.245 | 426 | 713.5 | 0.004779 |
3 | 2015-02-04 17:54:00 | 1 | 23.15 | 27.2 | 426 | 708.25 | 0.004772 |
4 | 2015-02-04 17:55:00 | 1 | 23.1 | 27.2 | 426 | 704.5 | 0.004757 |
... | ... | ... | ... | ... | ... | ... | |
20555 | 2015-02-18 09:15:00 | 1 | 20.815 | 27.7175 | 429.75 | 1505.25 | 0.004213 |
20556 | 2015-02-18 09:16:00 | 1 | 20.865 | 27.745 | 423.5 | 1514.5 | 0.00423 |
20557 | 2015-02-18 09:16:59 | 1 | 20.89 | 27.745 | 423.5 | 1521.5 | 0.004237 |
20558 | 2015-02-18 09:17:59 | 1 | 20.89 | 28.0225 | 418.75 | 1632 | 0.004279 |
20559 | 2015-02-18 09:19:00 | 1 | 21 | 28.1 | 409 | 1864 | 0.004321 |
20560 rows x 7 columns
memory usage: 1.15 MB
name: population
type: getml.DataFrame
2. Predictive modeling
We have loaded the data and assigned roles to the columns. Next, we define the data model and create a getML pipeline for relational learning.
2.1 getML Pipeline
We use Multirel for generating the features and a simple logistic regression for prediction.
We do not spend much effort on the hyperparameters and largely go with the default values. The only exception is the regularization of the logistic regression, which we tune in section 2.2.
We choose to consider only data from the last 15 minutes when creating our features.
# Our forecast horizon is 0.
# We do not predict the future, instead we infer
# the present state from current and past sensor data.
horizon = 0.0
# We do not allow the time series features
# to use target values from the past.
# (Otherwise, we would need the horizon to
# be greater than 0.0).
allow_lagged_targets = False
# We want our time series features to only use
# data from the last 15 minutes
memory = getml.data.time.minutes(15)
time_series = getml.data.TimeSeries(
population=population,
split=split,
time_stamps="date",
horizon=horizon,
memory=memory,
lagged_targets=allow_lagged_targets,
)
time_series
data frames | staging table | |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | population | POPULATION__STAGING_TABLE_2 |
subset | name | rows | type | |
---|---|---|---|---|
0 | test | population | 9751 | View |
1 | train | population | 8144 | View |
2 | validation | population | 2665 | View |
name | rows | type | |
---|---|---|---|
0 | population | 20560 | DataFrame |
feature_learner = getml.feature_learning.Multirel(
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
)
predictor = getml.predictors.LogisticRegression()
pipe = getml.pipeline.Pipeline(
data_model=time_series.data_model,
tags=["memory=15", "logistic regression"],
feature_learners=[feature_learner],
predictors=[predictor]
)
pipe
Pipeline(data_model='population', feature_learners=['Multirel'], feature_selectors=[], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['population'], predictors=['LogisticRegression'], preprocessors=[], share_selected_features=0.5, tags=['memory=15', 'logistic regression'])
.check(...) will be called automatically by .fit(...), but it is always a good idea to call .check(...) separately before fitting, so we still have time for some last-minute fixes.
pipe.check(time_series.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
2.2 Model training
We use a routine for automatic hyperparameter optimization to find the best parameters for the predictor:
pipe = getml.hyperopt.tune_predictors(pipe, time_series)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Tuning logistic regression... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 03:06 Time taken: 0:03:06.269772.
Building final pipeline...
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Multirel: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:09 Multirel: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02 Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Multirel: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 LogisticRegression: Training as predictor... ━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Trained pipeline.
Time taken: 0:00:13.435908. Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Multirel: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
The logistic regression has one hyperparameter that can be optimized: its L2 regularization parameter. We fine-tune this parameter on the validation set. Note that the validation set is different from the testing set, on which the final outcome is evaluated.
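If you would rather skip the tuning and set the regularization by hand, the pipeline would look roughly like this. This is a sketch: it assumes reg_lambda is the L2 term of getml.predictors.LogisticRegression, and the value below is purely illustrative, not a tuned result.
# Same pipeline as above, but with a manually chosen L2 regularization.
manual_predictor = getml.predictors.LogisticRegression(reg_lambda=0.001)
manual_pipe = getml.pipeline.Pipeline(
    data_model=time_series.data_model,
    tags=["memory=15", "manual reg_lambda"],
    feature_learners=[feature_learner],
    predictors=[manual_predictor],
)
# manual_pipe.fit(time_series.train)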
2.3 Model evaluation
Let's see how well we did by scoring the model on the testing set.
pipe.score(time_series.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Multirel: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 15:44:58 | train | Occupancy | 0.9886 | 0.9948 | 0.3937 |
1 | 2024-09-12 15:45:00 | validation | Occupancy | 0.9786 | 0.9937 | 0.3594 |
2 | 2024-09-12 15:45:03 | test | Occupancy | 0.9933 | 0.9983 | 0.3911 |
In the original paper, the authors tried several approaches. The best out-of-sample values among them are the following:
- Accuracy (testing): 0.99061
- AUC (testing): 0.99574
Our results outperform the best approach from the original paper, both in terms of accuracy and AUC. This demonstrates that getML's relational learning approach is also a powerful tool for time series.
# Refers to the data from the last time
# we called .score(...).
fpr, tpr = pipe.plots.roc_curve()
plt.subplots(figsize=(20, 10))
plt.plot(fpr, tpr)
plt.title("receiver operating characteristic (ROC)")
plt.grid(True)
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.show()
# Refers to the data from the last time
# we called .score(...).
recall, precision = pipe.plots.precision_recall_curve()
plt.subplots(figsize=(20, 10))
plt.plot(recall, precision)
plt.title("precision-recall curve")
plt.grid(True)
plt.xlabel("recall (true positive rate)")
plt.ylabel("precision")
plt.show()
# Refers to the data from the last time
# we called .score(...).
proportion, lift = pipe.plots.lift_curve()
plt.subplots(figsize=(20, 10))
plt.plot(proportion, lift)
plt.title("lift curve")
plt.grid(True)
plt.xlabel("proportion")
plt.ylabel("lift")
plt.show()
2.4 Studying the features
It is always a good idea to study the features the relational learning algorithm has extracted. We can do so in the feature view of the getML monitor. Open the monitor and select the models tab in the sidebar. You will see an overview of the trained pipelines. Select a pipeline to see the most essential summary plots.
If you want to document them inside your notebook, here is how you can do that:
pipe.features
target | name | correlation | importance | |
---|---|---|---|---|
0 | Occupancy | feature_1_1 | 0.979 | 0.009909 |
1 | Occupancy | feature_1_2 | 0.9784 | 0.009903 |
2 | Occupancy | feature_1_3 | 0.9739 | 0.009595 |
3 | Occupancy | feature_1_4 | -0.9776 | 0.00991 |
4 | Occupancy | feature_1_5 | 0.9795 | 0.009937 |
... | ... | ... | ... | |
100 | Occupancy | temperature | 0.5217 | 0.005354 |
101 | Occupancy | humidity | -0.0878 | 0.001188 |
102 | Occupancy | light | 0.9145 | 0.009281 |
103 | Occupancy | co2 | 0.2618 | 0.007209 |
104 | Occupancy | humidityratio | 0.19 | 0.002845 |
names, correlations = pipe.features.correlations()
plt.subplots(figsize=(20, 10))
plt.bar(names, correlations)
plt.title("feature correlations")
plt.grid(True)
plt.xlabel("features")
plt.ylabel("correlations")
plt.xticks(rotation="vertical")
plt.show()
names, importances = pipe.features.importances()
plt.subplots(figsize=(20, 10))
plt.bar(names, importances)
plt.title("feature importances")
plt.grid(True)
plt.xlabel("features")
plt.ylabel("importance")
plt.xticks(rotation="vertical")
plt.show()
The feature importances are calculated by the predictor, based on how much each feature improves the optimization criterion, and are normalized so that they sum to one (100%).
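As a quick, optional sanity check of that normalization (a sketch):
# The importances returned by the pipeline are normalized,
# so their sum should be approximately 1.0.
names, importances = pipe.features.importances()
print(sum(importances))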
We first look at the most important feature. The names returned by .importances() are already sorted by importance, so we can simply index them, like this:
pipe.features.to_sql()[names[0]]
DROP TABLE IF EXISTS "FEATURE_1_90";
CREATE TABLE "FEATURE_1_90" AS
SELECT MAX( t1."date" - t2."date" ) AS "feature_1_90",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "POPULATION__STAGING_TABLE_2" t2
ON 1 = 1
WHERE ( t2."date" <= t1."date"
AND ( t2."date__15_000000_minutes" > t1."date" OR t2."date__15_000000_minutes" IS NULL )
) AND (
( ( t1."light" > 360.256870 OR t1."light" IS NULL ) AND ( t2."light" > 609.161616 ) )
OR ( ( t1."light" <= 360.256870 ) )
)
GROUP BY t1.rowid;
Let's check out the second most important feature.
pipe.features.to_sql()[names[1]]
DROP TABLE IF EXISTS "FEATURE_1_52";
CREATE TABLE "FEATURE_1_52" AS
SELECT SUM( t2."light" ) AS "feature_1_52",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "POPULATION__STAGING_TABLE_2" t2
ON 1 = 1
WHERE ( t2."date" <= t1."date"
AND ( t2."date__15_000000_minutes" > t1."date" OR t2."date__15_000000_minutes" IS NULL )
) AND (
( ( t1."light" > 360.256870 ) AND ( t2."light" <= 566.595420 ) )
)
GROUP BY t1.rowid;
These two features demonstrate the power of getML's feature learning algorithms. You would be very unlikely to find such features using manual, trial-and-error-based feature engineering. Features with a broadly similar structure might emerge, but only with much more effort and worse results.
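To get a sense of what that effort would look like, here is a rough pandas approximation of FEATURE_1_52. This is a sketch, not part of the notebook: it assumes a DataFrame df with a monotonic DatetimeIndex and a Light column, and it glosses over NULL handling.
import pandas as pd
def feature_1_52_approx(df: pd.DataFrame) -> pd.Series:
    # Only light readings below the learned threshold enter the sum ...
    light_below = df["Light"].where(df["Light"] <= 566.595420, 0.0)
    # ... aggregated over a trailing 15-minute window (current row included).
    windowed_sum = light_below.rolling("15min").sum()
    # The feature only applies where the current reading is above the other
    # learned threshold; elsewhere it is left undefined (NaN).
    return windowed_sum.where(df["Light"] > 360.256870)
And this reproduces just one of the 100 learned features, with thresholds we could only copy because getML had already found them.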
When browsing through the remaining features, you will notice that some are columns taken directly from the original table, such as Light and CO2. But these columns are less correlated with the target and less important than the features generated from the relational model based on the self-join and upper time stamps.
Because getML uses a feature learning approach, the concept of feature importances can also be carried over to the individual columns.
names, importances = pipe.columns.importances()
plt.subplots(figsize=(20, 10))
plt.bar(names, importances)
plt.title("column importances")
plt.grid(True)
plt.xlabel("columns")
plt.ylabel("importance")
plt.xticks(rotation="vertical")
plt.show()
pipe.columns
name | marker | table | importance | target | |
---|---|---|---|---|---|
0 | CO2 | [PERIPHERAL] | population | 0.082947 | Occupancy |
1 | Humidity | [PERIPHERAL] | population | 0.002195 | Occupancy |
2 | HumidityRatio | [PERIPHERAL] | population | 0.012781 | Occupancy |
3 | Light | [PERIPHERAL] | population | 0.151959 | Occupancy |
4 | Temperature | [PERIPHERAL] | population | 0.036588 | Occupancy |
... | ... | ... | ... | ... | |
7 | Humidity | [POPULATION] | population | 0.001188 | Occupancy |
8 | HumidityRatio | [POPULATION] | population | 0.003191 | Occupancy |
9 | Light | [POPULATION] | population | 0.665215 | Occupancy |
10 | Temperature | [POPULATION] | population | 0.005463 | Occupancy |
11 | date | [POPULATION] | population | 0.015112 | Occupancy |
As we can see, about 80% of the predictive power comes from the Light column. The second most important column is CO2, which contributes about 8%.
2.5 Productionization
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3 and spark modules.
# Creates a folder named occupancy_pipeline containing
# the SQL code.
pipe.features.to_sql().save("occupancy_pipeline")
pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("occupancy_spark")
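If you want to execute the transpiled code outside of getML, Python's standard sqlite3 module is enough to run the saved scripts. This is a sketch under assumptions: that the saved folder contains the transpiled .sql scripts, and that the staging tables referenced by the code (such as POPULATION__STAGING_TABLE_1) have already been materialized in occupancy.db, for instance via getML's sqlite3 module.
import pathlib
import sqlite3
# Run every transpiled script against a database that already
# holds the staging tables.
conn = sqlite3.connect("occupancy.db")
for sql_file in sorted(pathlib.Path("occupancy_pipeline").glob("*.sql")):
    conn.executescript(sql_file.read_text())
conn.commit()
conn.close()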
getml.engine.shutdown()
3. Conclusion
This tutorial has demonstrated that relational learning is a powerful tool for time series. With relatively little effort, we were able to outperform the benchmarks from a scientific paper on a simple public domain time series data set.
If you want to learn more about getML, check out the official documentation.