Occupancy detection

A multivariate time series example

In this tutorial, you will learn how to apply getML to multivariate time series. It also demonstrates how to use getML's high-level interface for hyperparameter tuning.

Summary:

  • Prediction type: Binary classification
  • Domain: Energy
  • Prediction target: Room occupancy
  • Source data: 1 table, 32k rows
  • Population size: 32k

Author: Dr. Johannes King

Our use case is a public domain data set for predicting room occupancy from sensor data. The results achieved using getML outperform all published results on this data set. Note that this is not only a neat use case for machine learning algorithms, but also a real-world application with tangible consequences: if room occupancy is known with sufficient certainty, it can be fed into the control systems of a building. Such a system can reduce energy consumption by up to 50%.

Background

Introduction to occupancy prediction

Usually, getML is considered to be a tool for feature engineering and machine learning on relational data sets. How can we apply it to (multivariate) time series?

The key is a self-join. Instead of creating features by merging and aggregating peripheral tables in a relational data model, for a time series we perform the same operations on the population table itself. This results in features like these:

  • Aggregations over time, such as the average value of some column over the last 3 days.
  • Seasonal effects, such as the average value over the last four Wednesdays when today is a Wednesday.
  • Lag variables, such as the value of some column from two hours ago.

Using getML's algorithms for relational learning, we can extract all of these features automatically. Having created a flat table of such features, we can then apply state-of-the-art machine learning algorithms, such as XGBoost. As you will see in this example, this performs better than traditional time series analysis.
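To make the self-join idea concrete, here is a minimal hand-written sketch in plain Python. It does not use getML's API and is not how getML implements its feature learning; the toy series, the column name co2, and the helper names make_series and self_join_features are assumptions made purely for illustration. For every row, it looks up earlier rows of the same table and derives one aggregation over time and one lag variable.

```python
from datetime import datetime, timedelta
from statistics import mean


def make_series(n=12, start=datetime(2015, 2, 2, 14, 0)):
    """Build a toy one-reading-per-minute series (hypothetical values)."""
    return [
        {"date": start + timedelta(minutes=i), "co2": 400.0 + 10.0 * i}
        for i in range(n)
    ]


def self_join_features(rows, window=timedelta(minutes=5), lag=timedelta(minutes=2)):
    """For each row, derive features from earlier rows of the same table."""
    features = []
    for row in rows:
        t = row["date"]
        # Self-join condition: rows of the same table that fall into the
        # window directly before t (strictly earlier than t).
        past = [r["co2"] for r in rows if t - window <= r["date"] < t]
        # Lag variable: the reading taken exactly `lag` before t, if any.
        lagged = [r["co2"] for r in rows if r["date"] == t - lag]
        features.append(
            {
                "date": t,
                "co2_avg_window": mean(past) if past else None,
                "co2_lag": lagged[0] if lagged else None,
            }
        )
    return features


rows = make_series()
features = self_join_features(rows)
print(features[5])
```

The nested scan is quadratic in the number of rows and only meant to show the join condition. A seasonal feature would follow the same pattern with a different condition, for example matching rows that share the weekday of the current row.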

The present analysis is based on a public domain time series data set, which is available in the UC Irvine Machine Learning Repository. The challenge is straightforward: we want to predict whether an office room is occupied at a given moment in time using sensor data. The data is measured about once a minute. Ground-truth occupancy was obtained from time-stamped pictures. The available columns are:

  • Date, year-month-day hour:minute:second
  • Temperature, in Celsius
  • Relative Humidity, %
  • Light, in Lux
  • CO2, in ppm
  • Humidity Ratio, derived quantity from temperature and relative humidity, in kg water vapor/kg air
  • Occupancy, 0 or 1, 0 for not occupied, 1 for occupied status
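As a rough sketch of how rows with these columns could be read into typed records, the snippet below parses a small in-memory CSV using only the Python standard library. The header names, the two sample rows, and the helper parse_rows are assumptions for illustration; the actual file layout of the data set may differ.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample in the column order listed above (values are made up).
SAMPLE = """date,Temperature,Humidity,Light,CO2,HumidityRatio,Occupancy
2015-02-04 17:51:00,23.18,27.27,426.0,721.25,0.00479,1
2015-02-04 17:52:00,23.15,27.27,0.0,714.0,0.00478,0
"""


def parse_rows(text):
    """Convert CSV text into a list of dicts with typed values."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append(
            {
                # Date format: year-month-day hour:minute:second
                "date": datetime.strptime(rec["date"], "%Y-%m-%d %H:%M:%S"),
                "temperature": float(rec["Temperature"]),  # Celsius
                "humidity": float(rec["Humidity"]),  # relative humidity, %
                "light": float(rec["Light"]),  # Lux
                "co2": float(rec["CO2"]),  # ppm
                "humidity_ratio": float(rec["HumidityRatio"]),
                "occupancy": int(rec["Occupancy"]),  # 0 or 1
            }
        )
    return rows


rows = parse_rows(SAMPLE)
print(rows[0]["date"], rows[0]["occupancy"])
```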

As a reference and benchmark, we use this paper:

Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models. Luis M. Candanedo, Veronique Feldheim. Energy and Buildings. Volume 112, 15 January 2016, Pages 28-39.

The authors apply various artificial neural network algorithms to the data set at hand and achieve accuracies between 80.324% (batch back algorithm) and 99.061% (limited memory quasi-Newton algorithm).


Propositionalization: Occupancy detection

In this notebook, we compare getML's FastProp against the well-known feature engineering libraries featuretools and tsfresh.

Summary:

  • Prediction type: Binary classification
  • Domain: Energy
  • Prediction target: Room occupancy
  • Source data: 1 table, 32k rows
  • Population size: 32k

Author: Dr. Johannes King

Background

A common approach to feature engineering is to generate attribute-value representations from relational data by applying a fixed set of aggregations to columns of interest and then performing a feature selection on the (possibly large) set of generated features. In academia, this approach is called propositionalization.
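The two steps described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation used by FastProp, featuretools, or tsfresh; the toy readings, the window of two rows, the choice of aggregations, and the correlation-based selection are all assumptions made for the example.

```python
from itertools import product
from statistics import mean

# Fixed set of aggregations applied to every column of interest.
AGGREGATIONS = {"avg": mean, "min": min, "max": max}


def propositionalize(rows, columns, window):
    """Apply every aggregation to every column over the last `window` rows."""
    table = []
    for i in range(len(rows)):
        past = rows[max(0, i - window + 1): i + 1]  # includes the current row
        feats = {}
        for (name, agg), col in product(AGGREGATIONS.items(), columns):
            feats[f"{name}_{col}"] = agg([r[col] for r in past])
        table.append(feats)
    return table


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / (vx * vy) ** 0.5


def select_features(table, target, k):
    """Keep the k features with the highest absolute correlation to the target."""
    scores = {
        name: abs(pearson([row[name] for row in table], target))
        for name in table[0]
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy sensor readings and occupancy labels (hypothetical values).
light = [0.0, 0.0, 0.0, 400.0, 450.0, 420.0, 0.0, 0.0]
co2 = [400.0, 405.0, 410.0, 500.0, 600.0, 650.0, 600.0, 500.0]
occupancy = [0, 0, 0, 1, 1, 1, 0, 0]
rows = [{"light": l, "co2": c} for l, c in zip(light, co2)]

table = propositionalize(rows, ["light", "co2"], window=2)
selected = select_features(table, occupancy, k=2)
print(selected)
```

Three aggregations over two columns already yield six candidate features, so the number of candidates grows quickly with more columns, aggregations, and window sizes. This is why the speed and memory efficiency of the generation step matter.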

getML's FastProp is an implementation of this propositionalization approach that has been optimized for speed and memory efficiency. In this notebook, we want to demonstrate how – well – fast FastProp is. To this end, we will benchmark FastProp against the popular feature engineering libraries featuretools and tsfresh. Both of these libraries use propositionalization approaches for feature engineering.

Our use case here is a public domain data set for predicting room occupancy from sensor data. For further details about the data set, refer to the full notebook.

Related code example

Initial Notebook:
Open in nbviewer
Open in mybinder

Propositionalization:
Open in nbviewer
Open in mybinder