project

Air pollution (nb intro)

EXCERPT (BLOG ITEM + OG, TWITTER)

Why feature learning is better than simple propositionalization

NOTE: Due to the memory requirements of featuretools and tsfresh, this notebook will not run on MyBinder when RUN_FEATURETOOLS=True and RUN_TSFRESH=True.

In this notebook we will compare getML to featuretools and tsfresh, both of which are open-source libraries for feature engineering. We find that the advanced algorithms featured in getML yield significantly better predictions on this dataset. We then discuss why that is.

Summary:

  • Prediction type: Regression model
  • Domain: Air pollution
  • Prediction target: PM2.5 concentration
  • Source data: Multivariate time series
  • Population size: 41757

Author: Dr. Patrick Urbanke

Background

Many data scientists and AutoML tools use propositionalization methods for feature engineering. These propositionalization methods usually work as follows:

  • Generate a large number of hard-coded features
  • Use feature selection to pick a percentage of these features
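The two steps above can be sketched in a few lines of plain Python. This is only an illustration of the general idea, not the implementation used by featuretools, tsfresh or getML: the readings, the window sizes, the set of aggregations and the correlation-based selection rule are all assumptions made for this example.

```python
from statistics import mean, pstdev

# Hypothetical hourly readings; the values are made up for illustration.
pm25 = [12.0, 15.0, 30.0, 45.0, 40.0, 22.0, 18.0, 25.0, 50.0, 65.0, 60.0, 35.0]

# Step 1: a fixed, hard-coded set of aggregations and window sizes (assumed).
AGGREGATIONS = {
    "mean": mean,
    "min": min,
    "max": max,
    "std": pstdev,
    "last": lambda window: window[-1],
}
WINDOW_SIZES = (2, 3)


def generate_features(series, t):
    """Apply every aggregation to every trailing window that ends just before index t."""
    features = {}
    for size in WINDOW_SIZES:
        window = series[t - size:t]
        for name, func in AGGREGATIONS.items():
            features[f"{name}_{size}"] = func(window)
    return features


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def select_features(rows, targets, share=0.5):
    """Step 2: keep a fixed share of the features, ranked by |correlation| with the target."""
    names = list(rows[0])
    scores = {n: abs(pearson([r[n] for r in rows], targets)) for n in names}
    keep = max(1, int(len(names) * share))
    return sorted(names, key=lambda n: scores[n], reverse=True)[:keep]


start = max(WINDOW_SIZES)
rows = [generate_features(pm25, t) for t in range(start, len(pm25))]
targets = pm25[start:]
selected = select_features(rows, targets)
print(len(rows[0]), "features generated,", len(selected), "kept:", selected)
```

Note that the candidate features are fixed before any data is seen; only the selection step looks at the target.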

By contrast, getML (https://getml.com/product) contains approaches for feature learning: Feature learning adapts machine learning approaches such as decision trees or gradient boosting to the problem of extracting features from relational data and time series.

In this notebook, we will benchmark getML (https://getml.com/product) against featuretools (https://www.featuretools.com/) and tsfresh (https://tsfresh.readthedocs.io/en/latest/). Both of these libraries use propositionalization approaches for feature engineering.

As our example dataset, we use a publicly available dataset on air pollution in Beijing, China (https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data). The dataset was originally used in the following study:

Liang, X., Zou, T., Guo, B., Li, S., Zhang, H., Zhang, S., Huang, H. and Chen, S. X. (2015). Assessing Beijing's PM2.5 pollution: severity, weather impact, APEC and winter heating. Proceedings of the Royal Society A, 471, 20150257.

We find that getML significantly outperforms featuretools and tsfresh in terms of predictive accuracy (an R-squared of 62.3% versus 50.4%).
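For readers unfamiliar with the metric, the sketch below computes R-squared as the coefficient of determination, 1 - SS_res / SS_tot. This is one common definition and is an assumption here: the notebook may use a different one (for example the squared Pearson correlation), and the numbers below are made up and unrelated to the results quoted above.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical targets and predictions, for illustration only.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]
score = r_squared(y_true, y_pred)
print(f"R-squared: {score:.1%}")
```

Under this definition, a model that always predicts the mean of the targets scores 0%, and a perfect model scores 100%.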

Our findings indicate that getML's feature learning algorithms adapt better to the data set at hand and are also more scalable due to their lower memory requirements.


Propositionalization: Predicting air pollution in Beijing

NOTE: Due to featuretools's and tsfresh's memory requirements, this notebook will not run on MyBinder.

In this notebook we will compare getML to featuretools and tsfresh, both of which are open-source libraries for feature engineering. We find that the advanced algorithms featured in getML yield significantly better predictions on this dataset. We then discuss why that is.

Summary:

  • Prediction type: Regression model
  • Domain: Air pollution
  • Prediction target: PM2.5 concentration
  • Source data: Multivariate time series
  • Population size: 41757

Author: Dr. Patrick Urbanke

Background

A common approach to feature engineering is to generate attribute-value representations from relational data by applying a fixed set of aggregations to the columns of interest and then performing feature selection on the (possibly large) set of generated features. In academia, this approach is called propositionalization.
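To make "attribute-value representation" concrete, the following sketch flattens a small relational example into one row per population entry. It is a minimal illustration and not FastProp's implementation: the table layout, the column names (station, time, temp, wind) and the aggregation set are all hypothetical.

```python
from statistics import mean

# Hypothetical tables; names and values are made up for illustration.
population = [
    {"station": "A", "time": 3},
    {"station": "B", "time": 3},
]
peripheral = [
    {"station": "A", "time": 1, "temp": 4.0, "wind": 1.5},
    {"station": "A", "time": 2, "temp": 6.0, "wind": 2.5},
    {"station": "A", "time": 4, "temp": 9.0, "wind": 0.5},  # later than the population row, so ignored
    {"station": "B", "time": 1, "temp": -2.0, "wind": 3.0},
    {"station": "B", "time": 2, "temp": 0.0, "wind": 5.0},
]

# The fixed set of aggregations and the columns of interest (assumed).
COLUMNS = ("temp", "wind")
AGGREGATIONS = {"avg": mean, "min": min, "max": max, "count": len}


def propositionalize(population, peripheral):
    """Return one flat row per population entry, with one column per (aggregation, column) pair."""
    table = []
    for row in population:
        # Join on the key and use only records strictly before the row's timestamp.
        matches = [
            p for p in peripheral
            if p["station"] == row["station"] and p["time"] < row["time"]
        ]
        flat = dict(row)
        for col in COLUMNS:
            values = [m[col] for m in matches]
            for name, func in AGGREGATIONS.items():
                # None marks a missing value when no peripheral rows match.
                flat[f"{name}_{col}"] = func(values) if values else None
        table.append(flat)
    return table


flat_table = propositionalize(population, peripheral)
for flat_row in flat_table:
    print(flat_row)
```

The number of generated columns grows with the product of aggregations and columns, which is why a feature selection step usually follows.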

getML's FastProp is an implementation of this propositionalization approach that has been optimized for speed and memory efficiency. In this notebook, we want to demonstrate how – well – fast FastProp is. To this end, we will benchmark FastProp against the popular feature engineering libraries featuretools and tsfresh. Both of these libraries use propositionalization approaches for feature engineering.

As our example dataset, we use a publicly available dataset on air pollution in Beijing, China (https://archive.ics.uci.edu/ml/datasets/Beijing+PM2.5+Data). For further details about the dataset, refer to the full notebook.

Related code example

Initial Notebook:
Open in nbviewer
Open in mybinder

Propositionalization:
Open in nbviewer
Open in mybinder