EXCERPT (BLOG ITEM + OG, TWITTER)
In this tutorial, you will learn how to apply getML to multivariate time series. It also demonstrates how to use getML's high-level interface for hyperparameter tuning.
Summary:
Author: Dr. Johannes King
Our use case is a public domain data set for predicting room occupancy from sensor data. The results achieved using getML outperform all published results on this data set. Note that this is not only a neat use case for machine learning algorithms, but a real-world application with tangible consequences: If room occupancy is known with sufficient certainty, it can be applied to the control systems of a building. Such as system can reduce the energy consumption by up to 50 %.
Usually, getML is considered to be a tool for feature engineering and machine learning on relational data sets. How can we apply it to (multivariate) time series?
The key is a self-join. Instead of creating features by merging and aggregating peripheral tables in a relational data model, for a time-series, we perform the same operations on the population table itself. This results in features like these:
Using getML's algorithms for relational learning, we can extract all of these features automatically. Having created a flat table of such features, we can then apply state-of-the-art machine learning algorithms, like xgboost. As you will see in this example, this performs better than traditional time series analysis.
The present analysis is based on a public domain time series dataset. It is available in the UC Irvine Machine Learning Repository. The challenge is straightforward: We want to predict whether an office room is occupied at a given moment in time using sensor data. The data is measured about once a minute. Ground-truth occupancy was obtained from time-stamped pictures. The available columns are
As a reference and benchmark, we use this paper:
Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models. Luis M. Candanedo, Veronique Feldheim. Energy and Buildings. Volume 112, 15 January 2016, Pages 28-39.
The authors apply various artifical neural networks algorithms to the data set at hand and achieved accuracies between 80.324% (batch back algorithm) and 99.061% (limited memory quasi-Newton algorithm).
In this notebbok, we compare getML's FastProp against well-known feature engineering libraries featuretools and tsfresh.
Summary:
Author: Dr. Johannes King
A common approach to feature engineering is to generate attribute-value representations from relational data by applying a fixed set of aggregations to columns of interest and perform a feature selection on the (possibly large) set of generated features afterwards. In academia, this approach is called propositionalization.
getML's FastProp is an implementation of this propositionalization approach that has been optimized for speed and memory efficiency. In this notebook, we want to demonstrate how – well – fast FastProp is. To this end, we will benchmark FastProp against the popular feature engineering libraries featuretools and tsfresh. Both of these libraries use propositionalization approaches for feature engineering.
Our use case here is a public domain data set for predicting room occupancy from sensor data. For further details about the data set refer to the full notebook.
Initial Notebook:
Open in nbviewer
Open in mybinder
Propositionalization:
Open in nbviewer
Open in mybinder