
Feature engineering on sensor data - how to overcome feature explosion

This notebook illustrates how to overcome the feature explosion problem, using an example data set involving robot sensor data.

Summary:

  • Prediction type: Regression
  • Domain: Robotics
  • Prediction target: The force vector on the robot's arm
  • Population size: 15001

Author: Dr. Patrick Urbanke

Feature explosion

The problem

The feature explosion problem is one of the most important issues in automated feature engineering. In fact, it is probably the main reason why automated feature engineering is not already the norm in data science projects involving business data.

To illustrate the problem, consider how data scientists write features for a simple time series problem:

SELECT SOME_AGGREGATION(t2.some_column)
FROM some_table t1
LEFT JOIN some_table t2
ON t1.join_key = t2.join_key
WHERE t2.some_other_column >= some_value
AND t2.rowid <= t1.rowid
AND t2.rowid + some_other_value > t1.rowid
GROUP BY t1.rowid;

Think about that for a second.

Every column that we have can either be aggregated (some_column) or used in a condition (some_other_column). That means if we have $n$ columns to aggregate, we can potentially build conditions on $n$ other columns for each of them. In other words, the size of the search space grows as $n^2$ in the number of columns.

Note that this problem occurs regardless of whether you automate feature engineering or do it by hand. The size of the search space is $n^2$ in the number of columns in either case, unless you can rule something out a priori.
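The quadratic growth is easy to see by enumerating the candidate (aggregation column, condition column) pairs directly. This is a toy illustration of the counting argument, not getML code; the column names are made up:

```python
from itertools import product

def candidate_features(columns):
    """Enumerate all (aggregation column, condition column) pairs.

    Every column can serve as the aggregated column and every other
    column as the condition column, so the number of candidates grows
    quadratically with the number of columns: n * (n - 1).
    """
    return [(agg, cond) for agg, cond in product(columns, repeat=2) if agg != cond]

cols_10 = [f"col_{i}" for i in range(10)]
cols_100 = [f"col_{i}" for i in range(100)]

print(len(candidate_features(cols_10)))   # 10 * 9 = 90
print(len(candidate_features(cols_100)))  # 100 * 99 = 9900
```

And this only counts a single condition per feature; every additional condition multiplies the search space by another factor of roughly $n$.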

This problem is known as feature explosion.

The solution

So when we have relational data or time series with many columns, what do we do? The answer is to write different features. Specifically, suppose we had features like this:

SELECT SOME_AGGREGATION(
    CASE 
         WHEN t2.some_column > some_value THEN weight1
         WHEN t2.some_column <= some_value THEN weight2
    END
)
FROM some_table t1
LEFT JOIN some_table t2
ON t1.join_key = t2.join_key
WHERE t2.rowid <= t1.rowid
AND t2.rowid + some_other_value > t1.rowid
GROUP BY t1.rowid;

weight1 and weight2 are learnable weights. An algorithm that generates features like this only ever uses columns in conditions; it is not allowed to aggregate columns directly – and it doesn't need to.
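A minimal sketch of what such a feature computes, using NumPy. The threshold and weight values are illustrative, and fitting the weights is out of scope here – this only shows the aggregation itself, not getML's implementation:

```python
import numpy as np

def weighted_case_feature(values, threshold, weight1, weight2):
    """SUM over a CASE WHEN expression: each row in the aggregation
    window contributes weight1 if it exceeds the threshold and
    weight2 otherwise. The weights are the learnable parameters;
    the column itself is never aggregated directly."""
    contributions = np.where(values > threshold, weight1, weight2)
    return contributions.sum()

# Rows of some_column that fall into the aggregation window
some_column = np.array([0.5, 2.0, 3.5, 1.0])

# With threshold 1.5: two rows above (weight1), two at or below (weight2),
# so the feature value is 2 * 0.3 + 2 * (-0.1) = 0.4
print(weighted_case_feature(some_column, 1.5, weight1=0.3, weight2=-0.1))
```

Because the column only appears inside the condition, adding another column adds one more candidate condition rather than $n$ more (column, condition) combinations.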

That means the computational complexity is linear instead of quadratic. For data sets with a large number of columns, this can make all the difference in the world. For instance, with 100 columns the search space of the second approach is only 1% the size of the first one's (on the order of 100 candidates instead of 10,000).

getML features an algorithm called Relboost, which generates features according to this principle and is therefore well suited to data sets with many columns.

The data set

To illustrate the problem, we use a data set from robotics. When robots interact with humans, the most important thing is that they do not hurt people. To prevent such accidents, the force vector on the robot's arm is measured. However, measuring the force vector is expensive.

Therefore, we consider an alternative approach: we would like to predict the force vector based on other sensor data that are less costly to measure. To do so, we use machine learning.

However, the data set contains measurements from almost 100 different sensors, and we do not know which of them, or how many, are relevant for predicting the force vector.
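To get a feel for this situation, here is a toy sketch on synthetic data, not the actual robot data set: 100 sensor channels, of which only three actually drive a scalar target. A simple correlation screening already hints at which channels carry signal; the channel indices and coefficients are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 100 independent sensor channels, only three of
# which (columns 4, 17 and 42) actually drive the target reading.
X = rng.normal(size=(2000, 100))
force = (2.0 * X[:, 4] - 1.5 * X[:, 17] + 0.8 * X[:, 42]
         + rng.normal(scale=0.1, size=2000))

# First screening step: rank sensors by absolute correlation with
# the target to see which channels carry any signal at all.
corr = np.array([abs(np.corrcoef(X[:, j], force)[0, 1]) for j in range(X.shape[1])])
top3 = sorted(np.argsort(corr)[-3:].tolist())
print(top3)  # recovers the three informative channels
```

Real sensor data is rarely this clean – channels are correlated and effects are nonlinear – which is exactly why learned features, rather than manual screening, are attractive here.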

The data set has been generously provided by Erik Berger who originally collected it for his dissertation:

Berger, E. (2018). Behavior-Specific Proprioception Models for Robotic Force Estimation: A Machine Learning Approach. Freiberg, Germany: Technische Universitaet Bergakademie Freiberg.


Propositionalization: Robot sensor data

In this notebook, we compare getML's FastProp against the well-known feature engineering libraries featuretools and tsfresh.

Summary:

  • Prediction type: Regression
  • Domain: Robotics
  • Prediction target: The force vector on the robot's arm
  • Population size: 15001

Author: Dr. Patrick Urbanke

Background

A common approach to feature engineering is to generate attribute-value representations from relational data by applying a fixed set of aggregations to columns of interest and then performing a feature selection on the (possibly large) set of generated features. In academia, this approach is called propositionalization.
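A bare-bones version of this idea can be sketched in pandas: apply a fixed set of aggregations to every column of a peripheral table, yielding one attribute-value row per join key. The table, column names, and the particular aggregation set are illustrative, not taken from the robot data:

```python
import pandas as pd

# Toy peripheral table: several measurements per join key
measurements = pd.DataFrame({
    "join_key": [1, 1, 1, 2, 2],
    "sensor_a": [0.1, 0.4, 0.3, 0.9, 0.7],
    "sensor_b": [10.0, 12.0, 11.0, 9.0, 8.0],
})

# Propositionalization: the same fixed aggregations over every column,
# producing one flat feature row per join key.
features = measurements.groupby("join_key").agg(["mean", "min", "max", "sum"])
features.columns = ["_".join(col) for col in features.columns]
print(features)
```

With many columns and many aggregations, this generated feature table grows quickly, which is why speed and memory efficiency matter for a propositionalization engine like FastProp.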

getML's FastProp is an implementation of this propositionalization approach that has been optimized for speed and memory efficiency. In this notebook, we want to demonstrate how – well – fast FastProp is. To this end, we will benchmark FastProp against the popular feature engineering libraries featuretools and tsfresh. Both of these libraries use propositionalization approaches for feature engineering.

The data set has been generously provided by Erik Berger who originally collected it for his dissertation:

Berger, E. (2018). Behavior-Specific Proprioception Models for Robotic Force Estimation: A Machine Learning Approach. Freiberg, Germany: Technische Universitaet Bergakademie Freiberg.

Related code examples

Initial notebook: available via nbviewer and mybinder
Propositionalization notebook: available via nbviewer and mybinder