Propositionalization: Predicting air pollution in Beijing¶
In this notebook we compare getML to featuretools and tsfresh, both of which are open-source libraries for feature engineering. We find that getML's FastProp generates features of comparable predictive quality in a small fraction of the runtime. We then discuss why that is.
Summary:
- Prediction type: Regression model
- Domain: Air pollution
- Prediction target: pm2.5 concentration
- Source data: Multivariate time series
- Population size: 41757
Background¶
A common approach to feature engineering is to generate attribute-value representations from relational data by applying a fixed set of aggregations to columns of interest and performing feature selection on the (possibly large) set of generated features afterwards. In academia, this approach is called propositionalization.
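To make the idea concrete, here is a minimal sketch in plain pandas; the column, window size, and aggregations are chosen purely for illustration:

import pandas as pd

# Propositionalization in miniature: turn a look-back window over a time
# series into flat feature columns by applying a fixed set of aggregations.
readings = pd.DataFrame(
    {"pm2.5": [129.0, 148.0, 159.0, 181.0, 138.0]},
    index=pd.date_range("2010-01-02", periods=5, freq="h"),
)
window = readings["pm2.5"].rolling("3h")
features = pd.DataFrame(
    {
        "pm2.5_mean_3h": window.mean(),
        "pm2.5_max_3h": window.max(),
        "pm2.5_min_3h": window.min(),
    }
)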
getML's FastProp is an implementation of this propositionalization approach that has been optimized for speed and memory efficiency. In this notebook, we want to demonstrate how – well – fast FastProp is. To this end, we will benchmark FastProp against the popular feature engineering libraries featuretools and tsfresh. Both of these libraries use propositionalization approaches for feature engineering.
As our example dataset, we use a publicly available dataset on air pollution in Beijing, China (https://archive.ics.uci.edu/dataset/381/beijing+pm2+5+data). For further details about the dataset, refer to the full notebook.
Analysis¶
%pip install -q "getml==1.5.0" "featuretools==1.31.0" "tsfresh==0.20.3"
Note: you may need to restart the kernel to use updated packages.
import os
import sys
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
from pathlib import Path
from urllib import request
import getml
import pandas as pd
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
# If we are in Colab, we need to fetch the utils folder from the repository
if os.getenv("COLAB_RELEASE_TAG"):
!curl -L https://api.github.com/repos/getml/getml-demo/tarball/master | tar --wildcards --strip-components=1 -xz '*utils*'
parent = Path(os.getcwd()).parent.as_posix()
if parent not in sys.path:
sys.path.append(parent)
from utils import Benchmark, FTTimeSeriesBuilder, TSFreshBuilder
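Benchmark records wall-clock runtimes under a name; FTTimeSeriesBuilder and TSFreshBuilder are the wrappers discussed below. For readers without access to the utils module, here is a minimal sketch of what such a benchmark helper might look like (MiniBenchmark is illustrative, not the actual implementation):

from contextlib import contextmanager
from datetime import datetime

class MiniBenchmark:
    # Records wall-clock runtimes by name: `with bench("name"): ...`
    def __init__(self):
        self.runtimes = {}

    @contextmanager
    def __call__(self, name):
        begin = datetime.now()
        yield
        self.runtimes[name] = datetime.now() - begin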
1. Loading data¶
1.1 Download from source¶
We begin by downloading the data from the UCI Machine Learning repository.
FEATURETOOLS_FILES = ["featuretools_training.csv", "featuretools_test.csv"]
for fname in FEATURETOOLS_FILES:
if not os.path.exists(fname):
fname, res = request.urlretrieve(
"https://static.getml.com/datasets/air_pollution/featuretools/" + fname,
fname,
)
TSFRESH_FILES = ["tsfresh_training.csv", "tsfresh_test.csv"]
for fname in TSFRESH_FILES:
if not os.path.exists(fname):
fname, res = request.urlretrieve(
"https://static.getml.com/datasets/air_pollution/tsfresh/" + fname, fname
)
fname = "PRSA_data_2010.1.1-2014.12.31.csv"
if not os.path.exists(fname):
fname, res = request.urlretrieve(
"https://archive.ics.uci.edu/ml/machine-learning-databases/00381/" + fname,
fname,
)
1.2 Prepare data for tsfresh and getML¶
Our goal is to predict the pm2.5 concentration from factors such as weather or time of day. However, there are some missing entries for pm2.5, so we get rid of them.
data_full_pandas = pd.read_csv(fname)
# Keep only the rows in which the target is present.
data_full_pandas = data_full_pandas[data_full_pandas["pm2.5"].notna()]
tsfresh requires a date column, so we build one.
def add_leading_zero(val):
    # Zero-pad single-digit values so that, e.g., 1 becomes "01".
    return str(val).zfill(2)
data_full_pandas["month"] = [add_leading_zero(val) for val in data_full_pandas["month"]]
data_full_pandas["day"] = [add_leading_zero(val) for val in data_full_pandas["day"]]
data_full_pandas["hour"] = [add_leading_zero(val) for val in data_full_pandas["hour"]]
def make_date(year, month, day, hour):
    # Assemble a timestamp string such as "2010-01-02 13:00:00".
    return f"{year}-{month}-{day} {hour}:00:00"
data_full_pandas["date"] = [
make_date(str(year), month, day, hour)
for year, month, day, hour in zip(
data_full_pandas["year"],
data_full_pandas["month"],
data_full_pandas["day"],
data_full_pandas["hour"],
)
]
tsfresh also requires the time series to have ids. Since there is only a single time series, every row gets the same id.
data_full_pandas["id"] = 1
The dataset now contains many columns that we do not need or that tsfresh cannot process. For instance, cbwd might actually contain useful information, but it is a categorical variable, which is difficult for tsfresh to handle, so we remove it.
We also want to split our data into a training and testing set.
data_train_pandas = data_full_pandas[data_full_pandas["year"] < 2014].copy()
data_test_pandas = data_full_pandas[data_full_pandas["year"] == 2014].copy()
def remove_unwanted_columns(df):
    # Drop the row number "No", the raw date parts, and the categorical
    # wind direction "cbwd", which tsfresh cannot easily process.
    return df.drop(columns=["cbwd", "year", "month", "day", "hour", "No"])
data_full_pandas = remove_unwanted_columns(data_full_pandas)
data_train_pandas = remove_unwanted_columns(data_train_pandas)
data_test_pandas = remove_unwanted_columns(data_test_pandas)
data_full_pandas
 | pm2.5 | DEWP | TEMP | PRES | Iws | Is | Ir | date | id |
---|---|---|---|---|---|---|---|---|---|
24 | 129.0 | -16 | -4.0 | 1020.0 | 1.79 | 0 | 0 | 2010-01-02 00:00:00 | 1 |
25 | 148.0 | -15 | -4.0 | 1020.0 | 2.68 | 0 | 0 | 2010-01-02 01:00:00 | 1 |
26 | 159.0 | -11 | -5.0 | 1021.0 | 3.57 | 0 | 0 | 2010-01-02 02:00:00 | 1 |
27 | 181.0 | -7 | -5.0 | 1022.0 | 5.36 | 1 | 0 | 2010-01-02 03:00:00 | 1 |
28 | 138.0 | -7 | -5.0 | 1022.0 | 6.25 | 2 | 0 | 2010-01-02 04:00:00 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
43819 | 8.0 | -23 | -2.0 | 1034.0 | 231.97 | 0 | 0 | 2014-12-31 19:00:00 | 1 |
43820 | 10.0 | -22 | -3.0 | 1034.0 | 237.78 | 0 | 0 | 2014-12-31 20:00:00 | 1 |
43821 | 10.0 | -22 | -3.0 | 1034.0 | 242.70 | 0 | 0 | 2014-12-31 21:00:00 | 1 |
43822 | 8.0 | -22 | -4.0 | 1034.0 | 246.72 | 0 | 0 | 2014-12-31 22:00:00 | 1 |
43823 | 12.0 | -21 | -3.0 | 1034.0 | 249.85 | 0 | 0 | 2014-12-31 23:00:00 | 1 |
41757 rows × 9 columns
We then load the data into the getML engine. We begin by launching the engine and setting a project.
getml.engine.launch(allow_remote_ips=True, token="token")
getml.engine.set_project("air_pollution")
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240913121328.log.
Connected to project 'air_pollution'.
df_full = getml.data.DataFrame.from_pandas(data_full_pandas, name="full")
df_full["date"] = df_full["date"].as_ts()
We need to assign roles to the columns, such as defining the target column.
df_full.set_role(["date"], getml.data.roles.time_stamp)
df_full.set_role(["pm2.5"], getml.data.roles.target)
df_full.set_role(
["DEWP", "TEMP", "PRES", "Iws", "Is", "Ir"], getml.data.roles.numerical
)
df_full
name | date | pm2.5 | DEWP | TEMP | PRES | Iws | Is | Ir | id |
---|---|---|---|---|---|---|---|---|---|
role | time_stamp | target | numerical | numerical | numerical | numerical | numerical | numerical | unused_float |
unit | time stamp, comparison only | | | | | | | | |
0 | 2010-01-02 | 129 | -16 | -4 | 1020 | 1.79 | 0 | 0 | 1 |
1 | 2010-01-02 01:00:00 | 148 | -15 | -4 | 1020 | 2.68 | 0 | 0 | 1 |
2 | 2010-01-02 02:00:00 | 159 | -11 | -5 | 1021 | 3.57 | 0 | 0 | 1 |
3 | 2010-01-02 03:00:00 | 181 | -7 | -5 | 1022 | 5.36 | 1 | 0 | 1 |
4 | 2010-01-02 04:00:00 | 138 | -7 | -5 | 1022 | 6.25 | 2 | 0 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | |
41752 | 2014-12-31 19:00:00 | 8 | -23 | -2 | 1034 | 231.97 | 0 | 0 | 1 |
41753 | 2014-12-31 20:00:00 | 10 | -22 | -3 | 1034 | 237.78 | 0 | 0 | 1 |
41754 | 2014-12-31 21:00:00 | 10 | -22 | -3 | 1034 | 242.7 | 0 | 0 | 1 |
41755 | 2014-12-31 22:00:00 | 8 | -22 | -4 | 1034 | 246.72 | 0 | 0 | 1 |
41756 | 2014-12-31 23:00:00 | 12 | -21 | -3 | 1034 | 249.85 | 0 | 0 | 1 |
41757 rows x 9 columns
memory usage: 3.01 MB
name: full
type: getml.DataFrame
split = getml.data.split.time(
population=df_full, time_stamp="date", test=getml.data.time.datetime(2014, 1, 1)
)
split
0 | train |
---|---|
1 | train |
2 | train |
3 | train |
4 | train |
... |
41757 rows
type: StringColumnView
2. Predictive modeling¶
2.1 Propositionalization with getML's FastProp¶
First, we wrap the data in a TimeSeries container. This sets up a self-join on the data: for every row, features may be built by aggregating over all measurements from the preceding day (memory of one day).
time_series = getml.data.TimeSeries(
population=df_full,
alias="population",
split=split,
time_stamps="date",
memory=getml.data.time.days(1),
)
time_series
 | data frames | staging table |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | full | FULL__STAGING_TABLE_2 |
 | subset | name | rows | type |
---|---|---|---|---|
0 | test | full | unknown | View |
1 | train | full | unknown | View |
 | name | rows | type |
---|---|---|---|
0 | full | 41757 | DataFrame |
fast_prop = getml.feature_learning.FastProp(
loss_function=getml.feature_learning.loss_functions.SquareLoss,
num_threads=1,
aggregation=getml.feature_learning.FastProp.agg_sets.All,
)
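The aggregation argument determines which aggregation functions FastProp tries; agg_sets.All is the most exhaustive choice. If runtime were a concern, one of getML's smaller predefined aggregation sets could be used instead; to our knowledge, Default and Minimal are available and can be inspected directly:

# Inspect the predefined aggregation sets (All is used above; the smaller
# sets trade feature variety for speed).
print(getml.feature_learning.FastProp.agg_sets.All)
print(getml.feature_learning.FastProp.agg_sets.Minimal)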
pipe_fp_fl = getml.pipeline.Pipeline(
tags=["memory: 1d", "simple features"],
data_model=time_series.data_model,
feature_learners=[fast_prop],
)
pipe_fp_fl
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=['full'], predictors=[], preprocessors=[], share_selected_features=0.5, tags=['memory: 1d', 'simple features'])
pipe_fp_fl.check(time_series.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02
OK.
benchmark = Benchmark()
with benchmark("fastprop"):
pipe_fp_fl.fit(time_series.train)
fastprop_train = pipe_fp_fl.transform(time_series.train, df_name="fastprop_train")
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
FastProp: Trying 331 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:08
Trained pipeline.
Time taken: 0:00:08.079324.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:04
fastprop_test = pipe_fp_fl.transform(time_series.test, df_name="fastprop_test")
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
predictor = getml.predictors.XGBoostRegressor()
pipe_fp_pr = getml.pipeline.Pipeline(
tags=["prediction", "fastprop"], predictors=[predictor]
)
pipe_fp_pr.fit(fastprop_train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:28
Trained pipeline.
Time taken: 0:00:28.352024.
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['prediction', 'fastprop'])
pipe_fp_pr.score(fastprop_test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
 | date time | set used | target | mae | rmse | rsquared |
---|---|---|---|---|---|---|
0 | 2024-09-13 12:14:20 | fastprop_train | pm2.5 | 38.3028 | 55.2472 | 0.6438 |
1 | 2024-09-13 12:14:20 | fastprop_test | pm2.5 | 44.2502 | 63.4168 | 0.5462 |
2.2 Using featuretools¶
To make things a bit easier, we have written a high-level wrapper around featuretools, which we placed in a separate module (utils).
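For readers curious what such a wrapper does under the hood, the following is a rough, illustrative sketch of a typical featuretools setup for a single time series. It is not the actual FTTimeSeriesBuilder code; the dataframe names, primitives, and helper columns are our own choices:

import featuretools as ft
import pandas as pd

# featuretools needs a real datetime column, so parse the string dates.
data = data_train_pandas.assign(date=pd.to_datetime(data_train_pandas["date"]))

es = ft.EntitySet(id="air_pollution")
es = es.add_dataframe(
    dataframe_name="readings",
    dataframe=data,
    index="row_id",
    make_index=True,
    time_index="date",
)
# A parent dataframe of series ids, so that dfs can aggregate the readings.
es = es.normalize_dataframe(
    base_dataframe_name="readings",
    new_dataframe_name="series",
    index="id",
)
# One cutoff row per prediction point; training_window limits the look-back
# to one day, mirroring the memory of 1 day used throughout this notebook.
cutoff_times = data[["id", "date"]].rename(columns={"date": "time"})
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="series",
    cutoff_time=cutoff_times,
    training_window="1 day",
    agg_primitives=["mean", "max", "min", "std"],
    trans_primitives=[],
    include_cutoff_time=False,  # exclude the prediction row itself (no leakage)
)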
ft_builder = FTTimeSeriesBuilder(
num_features=200,
horizon=pd.Timedelta(days=0),
memory=pd.Timedelta(days=1),
column_id="id",
time_stamp="date",
target="pm2.5",
)
with benchmark("featuretools"):
featuretools_training = ft_builder.fit(data_train_pandas)
featuretools_test = ft_builder.transform(data_test_pandas)
featuretools: Trying features...
Selecting the best out of 298 features... Time taken: 0h:35m:42.348872
df_featuretools_training = getml.data.DataFrame.from_pandas(
featuretools_training, name="featuretools_training"
)
df_featuretools_test = getml.data.DataFrame.from_pandas(
featuretools_test, name="featuretools_test"
)
def set_roles_featuretools(df):
df["date"] = df["date"].as_ts()
df.set_role(["pm2.5"], getml.data.roles.target)
df.set_role(["date"], getml.data.roles.time_stamp)
df.set_role(df.roles.unused, getml.data.roles.numerical)
df.set_role(["id"], getml.data.roles.unused_float)
return df
df_featuretools_training = set_roles_featuretools(df_featuretools_training)
df_featuretools_test = set_roles_featuretools(df_featuretools_test)
predictor = getml.predictors.XGBoostRegressor()
pipe_ft_pr = getml.pipeline.Pipeline(
tags=["featuretools", "memory: 1d"], predictors=[predictor]
)
pipe_ft_pr
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['featuretools', 'memory: 1d'])
pipe_ft_pr.check(df_featuretools_training)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
pipe_ft_pr.fit(df_featuretools_training)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:10
Trained pipeline.
Time taken: 0:00:10.205240.
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['featuretools', 'memory: 1d'])
pipe_ft_pr.score(df_featuretools_test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
 | date time | set used | target | mae | rmse | rsquared |
---|---|---|---|---|---|---|
0 | 2024-09-13 12:58:20 | featuretools_training | pm2.5 | 38.277 | 54.8781 | 0.6506 |
1 | 2024-09-13 12:58:20 | featuretools_test | pm2.5 | 43.9151 | 62.5672 | 0.5594 |
2.3 Using tsfresh¶
tsfresh is a rather low-level library. To make things a bit easier, we have written a high-level wrapper, which we placed in a separate module (utils).
To limit the memory consumption, we undertake the following steps:
- We limit ourselves to a memory of 1 day from any point in time. This is necessary because tsfresh duplicates records for every time stamp: looking back 7 days instead of 1 would make the memory consumption seven times as high.
- We extract only tsfresh's MinimalFCParameters and IndexBasedFCParameters (the latter is a superset of TimeBasedFCParameters).
In order to make sure that tsfresh's features can be compared to getML's features, we also do the following (a sketch of these steps appears after the list):
- We apply tsfresh's built-in feature selection algorithm.
- Of the remaining features, we only keep the 40 features most correlated with the target (in terms of the absolute value of the correlation).
- We add the original columns as additional features.
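The following is a simplified, illustrative sketch of how these steps map onto tsfresh's API; it is not the actual TSFreshBuilder code, and the alignment of the target with the rolled windows is simplified here:

from tsfresh import extract_features, select_features
from tsfresh.feature_extraction import MinimalFCParameters
from tsfresh.utilities.dataframe_functions import roll_time_series

# Roll the single series into overlapping windows of up to 24 rows
# (one day of hourly data), dropping the target to avoid leakage.
rolled = roll_time_series(
    data_train_pandas.drop(columns=["pm2.5"]),
    column_id="id",
    column_sort="date",
    max_timeshift=24,
    min_timeshift=0,
)
X = extract_features(
    rolled,
    column_id="id",
    column_sort="date",
    default_fc_parameters=MinimalFCParameters(),
)
# tsfresh's built-in feature selection, then keep the 40 features most
# correlated (in absolute value) with the target.
y = data_train_pandas["pm2.5"].reset_index(drop=True)
X_selected = select_features(X.reset_index(drop=True), y)
top40 = X_selected.corrwith(y).abs().nlargest(40).index
X_final = X_selected[top40]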
data_train_pandas
 | pm2.5 | DEWP | TEMP | PRES | Iws | Is | Ir | date | id |
---|---|---|---|---|---|---|---|---|---|
24 | 129.0 | -16 | -4.0 | 1020.0 | 1.79 | 0 | 0 | 2010-01-02 00:00:00 | 1 |
25 | 148.0 | -15 | -4.0 | 1020.0 | 2.68 | 0 | 0 | 2010-01-02 01:00:00 | 1 |
26 | 159.0 | -11 | -5.0 | 1021.0 | 3.57 | 0 | 0 | 2010-01-02 02:00:00 | 1 |
27 | 181.0 | -7 | -5.0 | 1022.0 | 5.36 | 1 | 0 | 2010-01-02 03:00:00 | 1 |
28 | 138.0 | -7 | -5.0 | 1022.0 | 6.25 | 2 | 0 | 2010-01-02 04:00:00 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
35059 | 22.0 | -19 | 7.0 | 1013.0 | 114.87 | 0 | 0 | 2013-12-31 19:00:00 | 1 |
35060 | 18.0 | -21 | 7.0 | 1014.0 | 119.79 | 0 | 0 | 2013-12-31 20:00:00 | 1 |
35061 | 23.0 | -21 | 7.0 | 1014.0 | 125.60 | 0 | 0 | 2013-12-31 21:00:00 | 1 |
35062 | 20.0 | -21 | 6.0 | 1014.0 | 130.52 | 0 | 0 | 2013-12-31 22:00:00 | 1 |
35063 | 23.0 | -20 | 7.0 | 1014.0 | 137.67 | 0 | 0 | 2013-12-31 23:00:00 | 1 |
33096 rows × 9 columns
tsfresh_builder = TSFreshBuilder(
num_features=200, memory=24, column_id="id", time_stamp="date", target="pm2.5"
)
with benchmark("tsfresh"):
tsfresh_training = tsfresh_builder.fit(data_train_pandas)
tsfresh_test = tsfresh_builder.transform(data_test_pandas)
Rolling: 100%|██████████| 40/40 [00:25<00:00, 1.55it/s]
Feature Extraction: 100%|██████████| 40/40 [00:23<00:00, 1.67it/s]
Selecting the best out of 78 features... Time taken: 0h:1m:21.479083
Rolling: 100%|██████████| 40/40 [00:03<00:00, 13.11it/s]
Feature Extraction: 100%|██████████| 40/40 [00:06<00:00, 6.33it/s]
tsfresh does not contain built-in machine learning algorithms. In order to ensure a fair comparison, we use the exact same machine learning algorithm that we used for getML: an XGBoost regressor with all hyperparameters set to their default values.
In order to do so, we load the tsfresh features into the getML engine.
df_tsfresh_training = getml.data.DataFrame.from_pandas(
tsfresh_training, name="tsfresh_training"
)
df_tsfresh_test = getml.data.DataFrame.from_pandas(tsfresh_test, name="tsfresh_test")
As usual, we need to set roles:
def set_roles_tsfresh(df):
df["date"] = df["date"].as_ts()
df.set_role(["pm2.5"], getml.data.roles.target)
df.set_role(["date"], getml.data.roles.time_stamp)
df.set_role(df.roles.unused, getml.data.roles.numerical)
df.set_role(["id"], getml.data.roles.unused_float)
return df
df_tsfresh_training = set_roles_tsfresh(df_tsfresh_training)
df_tsfresh_test = set_roles_tsfresh(df_tsfresh_test)
In this case, our pipeline is very simple: it consists only of a single XGBoostRegressor.
predictor = getml.predictors.XGBoostRegressor()
pipe_tsf_pr = getml.pipeline.Pipeline(
tags=["tsfresh", "memory: 1d"], predictors=[predictor]
)
pipe_tsf_pr
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['tsfresh', 'memory: 1d'])
pipe_tsf_pr.check(df_tsfresh_training)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
pipe_tsf_pr.fit(df_tsfresh_training)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:05
Trained pipeline.
Time taken: 0:00:05.645952.
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['tsfresh', 'memory: 1d'])
pipe_tsf_pr.score(df_tsfresh_test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
 | date time | set used | target | mae | rmse | rsquared |
---|---|---|---|---|---|---|
0 | 2024-09-13 13:00:08 | tsfresh_training | pm2.5 | 40.8917 | 57.9517 | 0.6099 |
1 | 2024-09-13 13:00:08 | tsfresh_test | pm2.5 | 47.1106 | 66.6 | 0.5015 |
3. Comparison¶
# Number of features generated by each approach.
num_features = dict(
fastprop=289,
featuretools=114,
tsfresh=72,
)
runtime_per_feature = [
benchmark.runtimes["fastprop"] / num_features["fastprop"],
benchmark.runtimes["featuretools"] / num_features["featuretools"],
benchmark.runtimes["tsfresh"] / num_features["tsfresh"],
]
features_per_second = [1.0 / r.total_seconds() for r in runtime_per_feature]
normalized_runtime_per_feature = [
r / runtime_per_feature[0] for r in runtime_per_feature
]
comparison = pd.DataFrame(
dict(
runtime=[
benchmark.runtimes["fastprop"],
benchmark.runtimes["featuretools"],
benchmark.runtimes["tsfresh"],
],
num_features=num_features.values(),
features_per_second=features_per_second,
normalized_runtime=[
1,
benchmark.runtimes["featuretools"] / benchmark.runtimes["fastprop"],
benchmark.runtimes["tsfresh"] / benchmark.runtimes["fastprop"],
],
normalized_runtime_per_feature=normalized_runtime_per_feature,
mae=[pipe_fp_pr.mae, pipe_ft_pr.mae, pipe_tsf_pr.mae],
rmse=[pipe_fp_pr.rmse, pipe_ft_pr.rmse, pipe_tsf_pr.rmse],
rsquared=[pipe_fp_pr.rsquared, pipe_ft_pr.rsquared, pipe_tsf_pr.rsquared],
)
)
comparison.index = ["getML: FastProp", "featuretools", "tsfresh"]
comparison
 | runtime | num_features | features_per_second | normalized_runtime | normalized_runtime_per_feature | mae | rmse | rsquared |
---|---|---|---|---|---|---|---|---|
getML: FastProp | 0 days 00:00:13.433118 | 289 | 21.514167 | 1.000000 | 1.000000 | 44.250202 | 63.416780 | 0.546191 |
featuretools | 0 days 00:35:42.350566 | 114 | 0.053213 | 159.482747 | 404.306039 | 43.915071 | 62.567175 | 0.559369 |
tsfresh | 0 days 00:01:21.479423 | 72 | 0.883658 | 6.065563 | 24.346701 | 47.110594 | 66.599982 | 0.501524 |
# export for further use
comparison.to_csv("comparisons/air_pollution.csv")
getml.engine.shutdown()
4. Conclusion¶
We have compared getML's FastProp to the propositionalization approaches of featuretools and tsfresh on a dataset related to air pollution in Beijing, China. All three tools produced features of comparable predictive quality, but FastProp generated them orders of magnitude faster: on this dataset, roughly 160 times faster than featuretools and about 6 times faster than tsfresh.
However, there are other datasets on which simple propositionalization approaches are not enough. Our suggestion is therefore to think of algorithms like FastProp and RelMT as tools in a toolbox: if a simple tool like FastProp gets the job done, use it. But when you need more advanced approaches, like RelMT, you should have them at your disposal as well.
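For illustration, reaching for the more advanced tool only means swapping the feature learner. A minimal sketch, assuming the engine is still running and reusing the time_series container from section 2.1 (hyperparameters left at their defaults):

# Sketch only: swap FastProp for RelMT; the rest of the pipeline is unchanged.
relmt = getml.feature_learning.RelMT(
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    num_threads=1,
)
pipe_relmt = getml.pipeline.Pipeline(
    tags=["memory: 1d", "relmt"],
    data_model=time_series.data_model,
    feature_learners=[relmt],
)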
You are encouraged to reproduce these results.