Occupancy - A multivariate time series example
In this tutorial, you will learn how to apply getML to multivariate time series. It also demonstrates how to use getML's high-level interface for hyperparameter tuning.
Summary:
- Prediction type: Binary classification
- Domain: Energy
- Prediction target: Room occupancy
- Source data: 1 table, ~21k rows
- Population size: ~21k
Background
Our use case is a public domain data set for predicting room occupancy from sensor data. The results achieved using getML outperform all published results on this data set. Note that this is not only a neat use case for machine learning algorithms, but a real-world application with tangible consequences: if room occupancy is known with sufficient certainty, that information can be fed into a building's control systems. Such a system can reduce energy consumption by up to 50%.
Introduction to occupancy prediction
getML is usually considered a tool for feature engineering and machine learning on relational data sets. How can we apply it to (multivariate) time series?
The key is a self-join. Instead of creating features by merging and aggregating peripheral tables in a relational data model, we perform the same operations on the population table itself. This results in features like these:
- Aggregations over time, such as the average value of some column over the last 3 days.
- Seasonal effects, such as: today is a Wednesday, so let's take the average value over the last four Wednesdays.
- Lag variables, such as the value of some column from two hours ago.
Using getML's algorithms for relational learning, we can extract all of these features automatically. Having created a flat table of such features, we can then apply state-of-the-art machine learning algorithms, like XGBoost. As you will see in this example, this approach performs better than traditional time series analysis.
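To make this concrete, here is a minimal hand-written sketch of two such features using pandas. This is not part of the notebook, and the toy table and its values are made up for illustration; getML learns features like these, and far less obvious ones, automatically.
import pandas as pd
# A toy sensor table with one reading per minute (hypothetical values).
df = pd.DataFrame(
    {"CO2": [712.0, 714.0, 713.5, 708.25, 704.5]},
    index=pd.date_range("2015-02-04 17:51", periods=5, freq="min"),
)
# Aggregation over time: average CO2 over the last 15 minutes.
df["co2_avg_15min"] = df["CO2"].rolling("15min").mean()
# Lag variable: the CO2 reading from two minutes ago.
df["co2_lag_2min"] = df["CO2"].shift(2)
print(df)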
The present analysis is based on a public domain time series data set available in the UC Irvine Machine Learning Repository. The challenge is straightforward: we want to predict whether an office room is occupied at a given moment in time, using sensor data. The data is measured about once a minute. Ground-truth occupancy was obtained from time-stamped pictures. The available columns are:
- Date, year-month-day hour:minute:second
- Temperature, in Celsius
- Relative Humidity, %
- Light, in Lux
- CO2, in ppm
- Humidity Ratio, derived from temperature and relative humidity, in kg water vapor / kg air
- Occupancy, 0 for not occupied, 1 for occupied
As a reference and benchmark, we use this paper:
Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models. Luis M. Candanedo, Veronique Feldheim. Energy and Buildings. Volume 112, 15 January 2016, Pages 28-39.
The authors applied various artificial neural network algorithms to the data set at hand and achieved accuracies between 80.324% (batch back-propagation algorithm) and 99.061% (limited-memory quasi-Newton algorithm).
Analysis
Let's get started with the analysis and set up your session:
%pip install -q "getml==1.5.0" "matplotlib==3.9.2" "ipywidgets==8.1.5"
import matplotlib.pyplot as plt
import getml
%matplotlib inline
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
getml.engine.launch(allow_remote_ips=True, token='token')
getml.engine.set_project("occupancy")
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912154725.log.
Connected to project 'occupancy'.
1. Loading data
The data set can be downloaded directly from GitHub. It is conveniently separated into a training, a validation and a testing set. This allows us to directly benchmark our results against the results of the original paper later.
(
population_train,
population_test,
population_validation,
) = getml.datasets.load_occupancy()
population, split = getml.data.split.concat(
"population",
train=population_train,
validation=population_validation,
test=population_test,
)
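The concat routine stacks the three subsets into a single population table and returns a split column labeling each row. If you want to inspect the split yourself, here is a quick optional check (a sketch, not part of the original notebook):
# `split` is a string column, aligned row by row with `population`,
# whose values are "train", "validation" or "test".
print(split)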
The combined population table looks like this:
population
name | date | Occupancy | Temperature | Humidity | Light | CO2 | HumidityRatio |
---|---|---|---|---|---|---|---|
role | time_stamp | target | numerical | numerical | numerical | numerical | numerical |
unit | time stamp | ||||||
0 | 2015-02-04 17:51:00 | 1 | 23.18 | 27.272 | 426 | 721.25 | 0.004793 |
1 | 2015-02-04 17:51:59 | 1 | 23.15 | 27.2675 | 429.5 | 714 | 0.004783 |
2 | 2015-02-04 17:53:00 | 1 | 23.15 | 27.245 | 426 | 713.5 | 0.004779 |
3 | 2015-02-04 17:54:00 | 1 | 23.15 | 27.2 | 426 | 708.25 | 0.004772 |
4 | 2015-02-04 17:55:00 | 1 | 23.1 | 27.2 | 426 | 704.5 | 0.004757 |
... | ... | ... | ... | ... | ... | ... | |
20555 | 2015-02-18 09:15:00 | 1 | 20.815 | 27.7175 | 429.75 | 1505.25 | 0.004213 |
20556 | 2015-02-18 09:16:00 | 1 | 20.865 | 27.745 | 423.5 | 1514.5 | 0.00423 |
20557 | 2015-02-18 09:16:59 | 1 | 20.89 | 27.745 | 423.5 | 1521.5 | 0.004237 |
20558 | 2015-02-18 09:17:59 | 1 | 20.89 | 28.0225 | 418.75 | 1632 | 0.004279 |
20559 | 2015-02-18 09:19:00 | 1 | 21 | 28.1 | 409 | 1864 | 0.004321 |
20560 rows x 7 columns
memory usage: 1.15 MB
name: population
type: getml.DataFrame
We also assign roles to each column. To learn more about what roles do and why we need them, check out the official documentation.
population.set_role(["Occupancy"], getml.data.roles.target)
population.set_role(
["Temperature", "Humidity", "Light", "CO2", "HumidityRatio"],
getml.data.roles.numerical,
)
population.set_role(["date"], getml.data.roles.time_stamp)
population
name | date | Occupancy | Temperature | Humidity | Light | CO2 | HumidityRatio |
---|---|---|---|---|---|---|---|
role | time_stamp | target | numerical | numerical | numerical | numerical | numerical |
unit | time stamp | ||||||
0 | 2015-02-04 17:51:00 | 1 | 23.18 | 27.272 | 426 | 721.25 | 0.004793 |
1 | 2015-02-04 17:51:59 | 1 | 23.15 | 27.2675 | 429.5 | 714 | 0.004783 |
2 | 2015-02-04 17:53:00 | 1 | 23.15 | 27.245 | 426 | 713.5 | 0.004779 |
3 | 2015-02-04 17:54:00 | 1 | 23.15 | 27.2 | 426 | 708.25 | 0.004772 |
4 | 2015-02-04 17:55:00 | 1 | 23.1 | 27.2 | 426 | 704.5 | 0.004757 |
... | ... | ... | ... | ... | ... | ... | |
20555 | 2015-02-18 09:15:00 | 1 | 20.815 | 27.7175 | 429.75 | 1505.25 | 0.004213 |
20556 | 2015-02-18 09:16:00 | 1 | 20.865 | 27.745 | 423.5 | 1514.5 | 0.00423 |
20557 | 2015-02-18 09:16:59 | 1 | 20.89 | 27.745 | 423.5 | 1521.5 | 0.004237 |
20558 | 2015-02-18 09:17:59 | 1 | 20.89 | 28.0225 | 418.75 | 1632 | 0.004279 |
20559 | 2015-02-18 09:19:00 | 1 | 21 | 28.1 | 409 | 1864 | 0.004321 |
20560 rows x 7 columns
memory usage: 1.15 MB
name: population
type: getml.DataFrame
2. Predictive modeling
We have loaded the data and assigned roles to the columns. Next, we define the data model and create a getML pipeline for relational learning.
2.1 getML Pipeline
We use Multirel for generating the features and a simple logistic regression for prediction.
We do not spend much effort on the hyperparameters and largely go with the default values. The only exception is the regularization of the logistic regression, which we tune in section 2.2.
We choose to consider only data from the last 15 minutes when creating our features.
# Our forecast horizon is 0.
# We do not predict the future, instead we infer
# the present state from current and past sensor data.
horizon = 0.0
# We do not allow the time series features
# to use target values from the past.
# (Otherwise, we would need the horizon to
# be greater than 0.0).
allow_lagged_targets = False
# We want our time series features to only use
# data from the last 15 minutes
memory = getml.data.time.minutes(15)
time_series = getml.data.TimeSeries(
population=population,
split=split,
time_stamps="date",
horizon=horizon,
memory=memory,
lagged_targets=allow_lagged_targets,
)
time_series
data frames | staging table | |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | population | POPULATION__STAGING_TABLE_2 |
subset | name | rows | type | |
---|---|---|---|---|
0 | test | population | 9751 | View |
1 | train | population | 8144 | View |
2 | validation | population | 2665 | View |
name | rows | type | |
---|---|---|---|
0 | population | 20560 | DataFrame |
feature_learner = getml.feature_learning.Multirel(
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
)
predictor = getml.predictors.LogisticRegression()
pipe = getml.pipeline.Pipeline(
data_model=time_series.data_model,
tags=["memory=15", "logistic regression"],
feature_learners=[feature_learner],
predictors=[predictor]
)
pipe
Pipeline(data_model='population', feature_learners=['Multirel'], feature_selectors=[], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['population'], predictors=['LogisticRegression'], preprocessors=[], share_selected_features=0.5, tags=['memory=15', 'logistic regression'])
.check(...) will be called automatically by .fit(...), but it is always a good idea to call .check(...) separately before fitting, so we still have time for some last-minute fixes.
pipe.check(time_series.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
2.2 Model training
We use a routine for automatic hyperparameter optimization to find the best parameters for the predictor:
pipe = getml.hyperopt.tune_predictors(pipe, time_series)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Tuning logistic regression... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 03:06 Time taken: 0:03:06.269772.
Building final pipeline...
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Multirel: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:09 Multirel: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02 Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Multirel: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 LogisticRegression: Training as predictor... ━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Trained pipeline.
Time taken: 0:00:13.435908. Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Multirel: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
The logistic regression has one hyperparameter that can be optimized: its L2 regularization parameter. We fine-tune this parameter on the validation set. Note that the validation set is different from the testing set, on which the final outcome is evaluated.
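If you would rather skip the tuning and set the regularization by hand, the pipeline would look roughly like this. This is a sketch: it assumes reg_lambda is the L2 term of getml.predictors.LogisticRegression, and the value below is purely illustrative, not a tuned result.
# Same pipeline as above, but with a manually chosen L2 regularization.
manual_predictor = getml.predictors.LogisticRegression(reg_lambda=0.001)
manual_pipe = getml.pipeline.Pipeline(
    data_model=time_series.data_model,
    tags=["memory=15", "manual reg_lambda"],
    feature_learners=[feature_learner],
    predictors=[manual_predictor],
)
# manual_pipe.fit(time_series.train)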
2.3 Model evaluation
Let's see how well we did by scoring the model on the testing set.
pipe.score(time_series.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Multirel: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 15:44:58 | train | Occupancy | 0.9886 | 0.9948 | 0.3937 |
1 | 2024-09-12 15:45:00 | validation | Occupancy | 0.9786 | 0.9937 | 0.3594 |
2 | 2024-09-12 15:45:03 | test | Occupancy | 0.9933 | 0.9983 | 0.3911 |
In the original paper, the authors tried several approaches. The best out-of-sample values among them are the following:
- Accuracy (testing): 0.99061
- AUC (testing): 0.99574
Our results outperform the best approach from the original paper, both in terms of accuracy and AUC. This demonstrates that getML's relational learning approach is also a powerful tool for time series.
# Refers to the data from the last time
# we called .score(...).
fpr, tpr = pipe.plots.roc_curve()
plt.subplots(figsize=(20, 10))
plt.plot(fpr, tpr)
plt.title("receiver operating characteristic (ROC)")
plt.grid(True)
plt.xlabel("false positive rate")
plt.ylabel("true positive rate")
plt.show()
# Refers to the data from the last time
# we called .score(...).
recall, precision = pipe.plots.precision_recall_curve()
plt.subplots(figsize=(20, 10))
plt.plot(recall, precision)
plt.title("precision-recall curve")
plt.grid(True)
plt.xlabel("recall (true positive rate)")
plt.ylabel("precision")
plt.show()
# Refers to the data from the last time
# we called .score(...).
proportion, lift = pipe.plots.lift_curve()
plt.subplots(figsize=(20, 10))
plt.plot(proportion, lift)
plt.title("lift curve")
plt.grid(True)
plt.xlabel("proportion")
plt.ylabel("lift")
plt.show()
2.4 Studying the features
It is always a good idea to study the features the relational learning algorithm has extracted. We can do so in the feature view of the getML monitor. Open the monitor and select the models tab in the sidebar. You will see an overview of the trained pipelines. Select a pipeline to see the most essential summary plots.
If you want to document them inside your notebook, here is how you can do that:
pipe.features
target | name | correlation | importance | |
---|---|---|---|---|
0 | Occupancy | feature_1_1 | 0.979 | 0.009909 |
1 | Occupancy | feature_1_2 | 0.9784 | 0.009903 |
2 | Occupancy | feature_1_3 | 0.9739 | 0.009595 |
3 | Occupancy | feature_1_4 | -0.9776 | 0.00991 |
4 | Occupancy | feature_1_5 | 0.9795 | 0.009937 |
... | ... | ... | ... | |
100 | Occupancy | temperature | 0.5217 | 0.005354 |
101 | Occupancy | humidity | -0.0878 | 0.001188 |
102 | Occupancy | light | 0.9145 | 0.009281 |
103 | Occupancy | co2 | 0.2618 | 0.007209 |
104 | Occupancy | humidityratio | 0.19 | 0.002845 |
names, correlations = pipe.features.correlations()
plt.subplots(figsize=(20, 10))
plt.bar(names, correlations)
plt.title("feature correlations")
plt.grid(True)
plt.xlabel("features")
plt.ylabel("correlations")
plt.xticks(rotation="vertical")
plt.show()
names, importances = pipe.features.importances()
plt.subplots(figsize=(20, 10))
plt.bar(names, importances)
plt.title("feature importances")
plt.grid(True)
plt.xlabel("features")
plt.ylabel("importance")
plt.xticks(rotation="vertical")
plt.show()
The feature importances are calculated by the predictor, based on how much each feature improves the optimization criterion, and are normalized so that they sum to one (100%).
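As a quick, optional sanity check of that normalization (a sketch):
# The importances returned by the pipeline are normalized,
# so their sum should be approximately 1.0.
names, importances = pipe.features.importances()
print(sum(importances))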
We first look at the most important feature. The names returned by .importances() are already sorted by importance, so we can simply index them, like this:
pipe.features.to_sql()[names[0]]
DROP TABLE IF EXISTS "FEATURE_1_90";
CREATE TABLE "FEATURE_1_90" AS
SELECT MAX( t1."date" - t2."date" ) AS "feature_1_90",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "POPULATION__STAGING_TABLE_2" t2
ON 1 = 1
WHERE ( t2."date" <= t1."date"
AND ( t2."date__15_000000_minutes" > t1."date" OR t2."date__15_000000_minutes" IS NULL )
) AND (
( ( t1."light" > 360.256870 OR t1."light" IS NULL ) AND ( t2."light" > 609.161616 ) )
OR ( ( t1."light" <= 360.256870 ) )
)
GROUP BY t1.rowid;
Let's check out the second most important feature.
pipe.features.to_sql()[names[1]]
DROP TABLE IF EXISTS "FEATURE_1_52";
CREATE TABLE "FEATURE_1_52" AS
SELECT SUM( t2."light" ) AS "feature_1_52",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "POPULATION__STAGING_TABLE_2" t2
ON 1 = 1
WHERE ( t2."date" <= t1."date"
AND ( t2."date__15_000000_minutes" > t1."date" OR t2."date__15_000000_minutes" IS NULL )
) AND (
( ( t1."light" > 360.256870 ) AND ( t2."light" <= 566.595420 ) )
)
GROUP BY t1.rowid;
These two features demonstrate the power of getML's feature learning algorithms. You would be very unlikely to find such features using manual, trial-and-error-based feature engineering. Features with a broadly similar structure might emerge, but only with much more effort and worse results.
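To get a sense of what that effort would look like, here is a rough pandas approximation of FEATURE_1_52. This is a sketch, not part of the notebook: it assumes a DataFrame df with a monotonic DatetimeIndex and a Light column, and it glosses over NULL handling.
import pandas as pd
def feature_1_52_approx(df: pd.DataFrame) -> pd.Series:
    # Only light readings below the learned threshold enter the sum ...
    light_below = df["Light"].where(df["Light"] <= 566.595420, 0.0)
    # ... aggregated over a trailing 15-minute window (current row included).
    windowed_sum = light_below.rolling("15min").sum()
    # The feature only applies where the current reading is above the other
    # learned threshold; elsewhere it is left undefined (NaN).
    return windowed_sum.where(df["Light"] > 360.256870)
And this reproduces just one of the 100 learned features, with thresholds we could only copy because getML had already found them.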
When browsing through the remaining features, you will notice that some are columns taken directly from the original table, such as Light and CO2. But these columns are less correlated with the target and less important than the features generated from the relational model based on the self-join and upper time stamps.
Because getML uses a feature learning approach, the concept of feature importances can also be carried over to the individual columns.
names, importances = pipe.columns.importances()
plt.subplots(figsize=(20, 10))
plt.bar(names, importances)
plt.title("column importances")
plt.grid(True)
plt.xlabel("columns")
plt.ylabel("importance")
plt.xticks(rotation="vertical")
plt.show()
pipe.columns
name | marker | table | importance | target | |
---|---|---|---|---|---|
0 | CO2 | [PERIPHERAL] | population | 0.082947 | Occupancy |
1 | Humidity | [PERIPHERAL] | population | 0.002195 | Occupancy |
2 | HumidityRatio | [PERIPHERAL] | population | 0.012781 | Occupancy |
3 | Light | [PERIPHERAL] | population | 0.151959 | Occupancy |
4 | Temperature | [PERIPHERAL] | population | 0.036588 | Occupancy |
... | ... | ... | ... | ... | |
7 | Humidity | [POPULATION] | population | 0.001188 | Occupancy |
8 | HumidityRatio | [POPULATION] | population | 0.003191 | Occupancy |
9 | Light | [POPULATION] | population | 0.665215 | Occupancy |
10 | Temperature | [POPULATION] | population | 0.005463 | Occupancy |
11 | date | [POPULATION] | population | 0.015112 | Occupancy |
As we can see, about 80% of the predictive power comes from the Light column. The second most important column is CO2, which contributes about 8%.
2.5 Productionization
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3 and spark modules.
# Creates a folder named occupancy_pipeline containing
# the SQL code.
pipe.features.to_sql().save("occupancy_pipeline")
pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("occupancy_spark")
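If you want to execute the transpiled code outside of getML, Python's standard sqlite3 module is enough to run the saved scripts. This is a sketch under assumptions: that the saved folder contains the transpiled .sql scripts, and that the staging tables referenced by the code (such as POPULATION__STAGING_TABLE_1) have already been materialized in occupancy.db, for instance via getML's sqlite3 module.
import pathlib
import sqlite3
# Run every transpiled script against a database that already
# holds the staging tables.
conn = sqlite3.connect("occupancy.db")
for sql_file in sorted(pathlib.Path("occupancy_pipeline").glob("*.sql")):
    conn.executescript(sql_file.read_text())
conn.commit()
conn.close()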
getml.engine.shutdown()
3. Conclusion
This tutorial has demonstrated that relational learning is a powerful tool for time series. With relatively little effort, we were able to outperform the benchmarks from a scientific paper on a simple public domain time series data set.
If you want to learn more about getML, check out the official documentation.