Propositionalization: Occupancy detection¶
In this notebook, we compare getML's FastProp against the well-known feature engineering libraries featuretools and tsfresh.
Summary:
- Prediction type: Binary classification
- Domain: Energy
- Prediction target: Room occupancy
- Source data: 1 table, 32k rows
- Population size: 32k
Background¶
A common approach to feature engineering is to generate attribute-value representations from relational data by applying a fixed set of aggregations to columns of interest and then performing feature selection on the (possibly large) set of generated features. In academia, this approach is called propositionalization.
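As a minimal illustration of the idea (with made-up data, unrelated to the occupancy use case): given a population table and a peripheral table linked by a key, propositionalization applies a fixed set of aggregations to each column of interest in the peripheral table and joins the resulting attribute-value representation back onto the population table.

import pandas as pd

# Hypothetical relational data: one row per customer in the population table,
# many rows per customer in a peripheral table of transactions.
population = pd.DataFrame({"customer_id": [1, 2], "churn": [0, 1]})
transactions = pd.DataFrame(
    {"customer_id": [1, 1, 2, 2, 2], "amount": [10.0, 25.0, 5.0, 7.5, 3.0]}
)

# Apply a fixed set of aggregations to the column of interest ...
aggregated = (
    transactions.groupby("customer_id")["amount"]
    .agg(["mean", "min", "max", "count"])
    .reset_index()
)

# ... and join the generated features onto the population table. A feature
# selection step would then prune this (possibly large) set of columns.
propositionalized = population.merge(aggregated, on="customer_id", how="left")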
getML's FastProp is an implementation of this propositionalization approach that has been optimized for speed and memory efficiency. In this notebook, we want to demonstrate how – well – fast FastProp is. To this end, we will benchmark FastProp against the popular feature engineering libraries featuretools and tsfresh. Both of these libraries use propositionalization approaches for feature engineering.
Our use case here is a public domain data set for predicting room occupancy from sensor data. For further details about the data set refer to the full notebook.
Analysis¶
Let's get started with the analysis and set up the session:
%pip install -q "getml==1.5.0" "featuretools==1.31.0" "tsfresh==0.20.3"
Note: you may need to restart the kernel to use updated packages.
import os
import sys
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
from pathlib import Path
import pandas as pd
import getml
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
getml.engine.launch(allow_remote_ips=True, token="token")
getml.engine.set_project("occupancy")
getML Engine is already running.
Connected to project 'occupancy'.
# If we are in Colab, we need to fetch the utils folder from the repository
if os.getenv("COLAB_RELEASE_TAG"):
    !curl -L https://api.github.com/repos/getml/getml-demo/tarball/master | tar --wildcards --strip-components=1 -xz '*utils*'

parent = Path(os.getcwd()).parent.as_posix()
if parent not in sys.path:
    sys.path.append(parent)
from utils import Benchmark, FTTimeSeriesBuilder, TSFreshBuilder
1. Loading data¶
The data set can be downloaded directly from GitHub. It is conveniently separated into a training, a validation, and a test set. This allows us to benchmark our results directly against those of the original paper later on.
data_test, data_train, data_validate = getml.datasets.load_occupancy(roles=True)
Downloading population_train... ━━━━━━━━━━━━━━━━ 100% • 554.6/554.6 kB • 00:00
Downloading population_test... ━━━━━━━━━━━━━━━━━ 100% • 668.3/668.3 kB • 00:00
Downloading population_validation... ━━━━━━━━━━━━ 100% • 186.5/186.5 kB • 00:00
data_all, split = getml.data.split.concat(
    "data_all",
    train=data_train,
    validation=data_validate,
    test=data_test,
)
The train set looks like this:
data_train
name | date | Occupancy | Temperature | Humidity | Light | CO2 | HumidityRatio |
---|---|---|---|---|---|---|---|
role | time_stamp | target | numerical | numerical | numerical | numerical | numerical |
unit | time stamp | ||||||
0 | 2015-02-11 14:48:00 | 1 | 21.76 | 31.1333 | 437.3333 | 1029.6667 | 0.005021 |
1 | 2015-02-11 14:49:00 | 1 | 21.79 | 31 | 437.3333 | 1000 | 0.005009 |
2 | 2015-02-11 14:50:00 | 1 | 21.7675 | 31.1225 | 434 | 1003.75 | 0.005022 |
3 | 2015-02-11 14:51:00 | 1 | 21.7675 | 31.1225 | 439 | 1009.5 | 0.005022 |
4 | 2015-02-11 14:51:59 | 1 | 21.79 | 31.1333 | 437.3333 | 1005.6667 | 0.00503 |
... | ... | ... | ... | ... | ... | ... | |
9747 | 2015-02-18 09:15:00 | 1 | 20.815 | 27.7175 | 429.75 | 1505.25 | 0.004213 |
9748 | 2015-02-18 09:16:00 | 1 | 20.865 | 27.745 | 423.5 | 1514.5 | 0.00423 |
9749 | 2015-02-18 09:16:59 | 1 | 20.89 | 27.745 | 423.5 | 1521.5 | 0.004237 |
9750 | 2015-02-18 09:17:59 | 1 | 20.89 | 28.0225 | 418.75 | 1632 | 0.004279 |
9751 | 2015-02-18 09:19:00 | 1 | 21 | 28.1 | 409 | 1864 | 0.004321 |
9752 rows x 7 columns
memory usage: 0.55 MB
name: population_test
type: getml.DataFrame
2. Predictive modeling¶
We have loaded the data and defined the roles and units. Next, we define the abstract data model and create getML pipelines for relational learning.
2.1 Propositionalization with getML's FastProp¶
We use all possible aggregations. Because tsfresh and featuretools are single-threaded, we limit our FastProp algorithm to one thread as well, to ensure a fair comparison.
# Our forecast horizon is 0.
# We do not predict the future, instead we infer
# the present state from current and past sensor data.
horizon = 0.0
# We do not allow the time series features
# to use target values from the past.
# (Otherwise, we would need the horizon to
# be greater than 0.0).
allow_lagged_targets = False
# We want our time series features to only use
# data from the last 15 minutes
memory = getml.data.time.minutes(15)
time_series = getml.data.TimeSeries(
    population=data_all,
    split=split,
    time_stamps="date",
    horizon=horizon,
    memory=memory,
    lagged_targets=allow_lagged_targets,
)
time_series
data frames | staging table | |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | data_all | DATA_ALL__STAGING_TABLE_2 |
subset | name | rows | type | |
---|---|---|---|---|
0 | test | data_all | unknown | View |
1 | train | data_all | unknown | View |
2 | validation | data_all | unknown | View |
name | rows | type | |
---|---|---|---|
0 | data_all | 20560 | DataFrame |
feature_learner = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    aggregation=getml.feature_learning.FastProp.agg_sets.All,
    num_threads=1,
)
Next, we create the pipelines. In contrast to our usual approach, we create two pipelines in this notebook: one for feature learning (suffix _fl) and one for prediction (suffix _pr). This allows for a fair comparison of the runtimes.
pipe_fp_fl = getml.pipeline.Pipeline(
    feature_learners=[feature_learner],
    data_model=time_series.data_model,
    tags=["feature learning", "fastprop"],
)
pipe_fp_fl.check(time_series.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
The wrappers around featuretools and tsfresh fit on the training set and then return the training features. We therefore measure the time it takes getML's FastProp algorithm to fit on the training set and create the training features.
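Benchmark comes from the demo's utils package; it records the wall-clock runtime of each named with block so that we can read it back from benchmark.runtimes in section 3. A minimal sketch of such a timing helper (the actual class may differ) could look like this:

import datetime
from contextlib import contextmanager

class SimpleBenchmark:
    """Stores the wall-clock time spent inside each named `with` block."""

    def __init__(self):
        self.runtimes = {}  # maps name -> datetime.timedelta

    @contextmanager
    def __call__(self, name):
        begin = datetime.datetime.now()
        yield
        self.runtimes[name] = datetime.datetime.now() - begin

The stored values behave like timedeltas, which is why the comparison below can call .total_seconds() on them.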
benchmark = Benchmark()
with benchmark("fastprop"):
pipe_fp_fl.fit(time_series.train)
fastprop_train = pipe_fp_fl.transform(time_series.train, df_name="fastprop_train")
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Trying 331 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
Trained pipeline.
Time taken: 0:00:01.031077. Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
fastprop_test = pipe_fp_fl.transform(time_series.test, df_name="fastprop_test")
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Now, we create a dedicated prediction pipeline and provide the FastProp features (contained in fastprop_train and fastprop_test).
predictor = getml.predictors.XGBoostClassifier()
pipe_fp_pr = getml.pipeline.Pipeline(
    tags=["prediction", "fastprop"], predictors=[predictor]
)
pipe_fp_pr.check(fastprop_train)
pipe_fp_pr.fit(fastprop_train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:04
Trained pipeline.
Time taken: 0:00:04.947655.
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostClassifier'], preprocessors=[], share_selected_features=0.5, tags=['prediction', 'fastprop'])
pipe_fp_pr.score(fastprop_test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-13 13:38:49 | fastprop_train | Occupancy | 0.9995 | 1. | 0.004464 |
1 | 2024-09-13 13:38:49 | fastprop_test | Occupancy | 0.9888 | 0.9982 | 0.044213 |
2.2 Propositionalization with featuretools¶
data_train = time_series.train.population.to_df("train")
data_test = time_series.test.population.to_df("test")
dfs_pandas = {}
for df in getml.project.data_frames:
    dfs_pandas[df.name] = df.to_pandas()
    dfs_pandas[df.name]["id"] = 1
ft_builder = FTTimeSeriesBuilder(
    num_features=200,
    horizon=pd.Timedelta(minutes=0),
    memory=pd.Timedelta(minutes=15),
    column_id="id",
    time_stamp="date",
    target="Occupancy",
)
The FTTimeSeriesBuilder provides a fit method that is designed to be equivalent to the fit method of the predictor-free getML pipeline (pipe_fp_fl) above.
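Under the hood, such a wrapper can be built from featuretools' cutoff times and training windows. The sketch below only illustrates that idea under assumed names; it is not the actual FTTimeSeriesBuilder from utils, which additionally selects the best features and limits their number to num_features.

import featuretools as ft
import pandas as pd

def rolling_dfs_features(data: pd.DataFrame) -> pd.DataFrame:
    """For every row, aggregate the sensor readings of the preceding 15 minutes."""
    data = data.copy()
    data["row_id"] = range(len(data))

    es = ft.EntitySet(id="occupancy")

    # Child dataframe: the raw minute-level sensor readings (without the target,
    # mirroring allow_lagged_targets=False above).
    es = es.add_dataframe(
        dataframe_name="readings",
        dataframe=data.drop(columns=["Occupancy"]),
        index="row_id",
        time_index="date",
    )

    # Parent dataframe: one row per series id (here a single constant id).
    es = es.normalize_dataframe(
        base_dataframe_name="readings",
        new_dataframe_name="series",
        index="id",
    )

    # One cutoff time per observation: only readings that fall into the
    # 15-minute training window before each cutoff may be aggregated.
    cutoff_times = pd.DataFrame({"id": data["id"], "time": data["date"]})

    features, _ = ft.dfs(
        entityset=es,
        target_dataframe_name="series",
        cutoff_time=cutoff_times,
        cutoff_time_in_index=True,
        training_window="15 minutes",
        agg_primitives=["mean", "min", "max", "std", "last"],
        trans_primitives=[],
    )
    return features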
with benchmark("featuretools"):
featuretools_train = ft_builder.fit(dfs_pandas["train"])
featuretools_test = ft_builder.transform(dfs_pandas["test"])
df_featuretools_train = getml.data.DataFrame.from_pandas(
    featuretools_train, name="featuretools_train", roles=data_train.roles
)
df_featuretools_test = getml.data.DataFrame.from_pandas(
    featuretools_test, name="featuretools_test", roles=data_train.roles
)
df_featuretools_train.set_role(
    df_featuretools_train.roles.unused, getml.data.roles.numerical
)
df_featuretools_test.set_role(
    df_featuretools_test.roles.unused, getml.data.roles.numerical
)
featuretools: Trying features... Selecting the best out of 262 features... Time taken: 0h:7m:20.109537
predictor = getml.predictors.XGBoostClassifier()
pipe_ft_pr = getml.pipeline.Pipeline(
    tags=["prediction", "featuretools"], predictors=[predictor]
)
pipe_ft_pr
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostClassifier'], preprocessors=[], share_selected_features=0.5, tags=['prediction', 'featuretools'])
pipe_ft_pr.check(df_featuretools_train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 0 issues labeled INFO and 1 issues labeled WARNING.
type | label | message | |
---|---|---|---|
0 | WARNING | COLUMN SHOULD BE UNUSED | All non-NULL entries in column 'id' in POPULATION__STAGING_TABLE_1 are equal to each other. You should consider setting its role to unused_float or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only'). |
pipe_ft_pr.fit(df_featuretools_train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 0 issues labeled INFO and 1 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:03
Trained pipeline.
Time taken: 0:00:03.236058.
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostClassifier'], preprocessors=[], share_selected_features=0.5, tags=['prediction', 'featuretools'])
pipe_ft_pr.score(df_featuretools_test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-13 13:52:31 | featuretools_train | Occupancy | 0.9995 | 1. | 0.005065 |
1 | 2024-09-13 13:52:31 | featuretools_test | Occupancy | 0.9885 | 0.9972 | 0.049236 |
2.3 Propositionalization with tsfresh¶
tsfresh_builder = TSFreshBuilder(
    num_features=200, memory=15, column_id="id", time_stamp="date", target="Occupancy"
)
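The TSFreshBuilder (also from utils) exposes the same fit/transform interface; its memory argument again restricts the features to roughly the last 15 minutes of sensor data. The following is only a rough sketch of the underlying idea based on tsfresh's rolling utilities, not the actual implementation (which additionally selects the best num_features features):

import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters
from tsfresh.utilities.dataframe_functions import roll_time_series

def tsfresh_window_features(data: pd.DataFrame) -> pd.DataFrame:
    # Build rolling windows of at most 15 past rows; with one reading per
    # minute this corresponds to roughly the last 15 minutes.
    rolled = roll_time_series(
        data.drop(columns=["Occupancy"]),  # drop the target to avoid leakage
        column_id="id",
        column_sort="date",
        max_timeshift=15,
    )
    # Aggregate every window into a single row of features.
    return extract_features(
        rolled,
        column_id="id",
        column_sort="date",
        default_fc_parameters=MinimalFCParameters(),
        n_jobs=1,  # single-threaded, in line with the benchmark setup above
    )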
with benchmark("tsfresh"):
tsfresh_train = tsfresh_builder.fit(dfs_pandas["train"])
tsfresh_test = tsfresh_builder.transform(dfs_pandas["test"])
df_tsfresh_train = getml.data.DataFrame.from_pandas(
    tsfresh_train, name="tsfresh_train", roles=data_train.roles
)
df_tsfresh_test = getml.data.DataFrame.from_pandas(
    tsfresh_test, name="tsfresh_test", roles=data_train.roles
)
df_tsfresh_train.set_role(df_tsfresh_train.roles.unused, getml.data.roles.numerical)
df_tsfresh_test.set_role(df_tsfresh_test.roles.unused, getml.data.roles.numerical)
Rolling: 100%|██████████| 40/40 [00:02<00:00, 16.43it/s] Feature Extraction: 100%|██████████| 40/40 [00:05<00:00, 7.81it/s] Feature Extraction: 100%|██████████| 40/40 [00:04<00:00, 8.43it/s]
Selecting the best out of 65 features... Time taken: 0h:0m:14.295165
Rolling: 100%|██████████| 40/40 [00:02<00:00, 19.85it/s] Feature Extraction: 100%|██████████| 40/40 [00:04<00:00, 9.71it/s] Feature Extraction: 100%|██████████| 40/40 [00:04<00:00, 9.69it/s]
pipe_tsf_pr = getml.pipeline.Pipeline(
    tags=["prediction", "tsfresh"], predictors=[predictor]
)
pipe_tsf_pr
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostClassifier'], preprocessors=[], share_selected_features=0.5, tags=['prediction', 'tsfresh'])
pipe_tsf_pr.check(df_tsfresh_train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 0 issues labeled INFO and 1 issues labeled WARNING.
type | label | message | |
---|---|---|---|
0 | WARNING | COLUMN SHOULD BE UNUSED | All non-NULL entries in column 'id' in POPULATION__STAGING_TABLE_1 are equal to each other. You should consider setting its role to unused_float or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only'). |
pipe_tsf_pr.fit(df_tsfresh_train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 0 issues labeled INFO and 1 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
Trained pipeline.
Time taken: 0:00:01.669099.
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=False, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostClassifier'], preprocessors=[], share_selected_features=0.5, tags=['prediction', 'tsfresh'])
pipe_tsf_pr.score(df_tsfresh_test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-13 13:53:00 | tsfresh_train | Occupancy | 0.9985 | 1. | 0.006898 |
1 | 2024-09-13 13:53:00 | tsfresh_test | Occupancy | 0.9877 | 0.9979 | 0.049359 |
3. Comparison¶
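Finally, we compare the three approaches. Since each library generates a different number of features, we report not only the total runtime of the feature engineering step but also the (normalized) runtime per generated feature, alongside the predictive accuracy, AUC, and cross entropy on the test set.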
num_features = dict(
    fastprop=289,
    featuretools=103,
    tsfresh=60,
)

runtime_per_feature = [
    benchmark.runtimes["fastprop"] / num_features["fastprop"],
    benchmark.runtimes["featuretools"] / num_features["featuretools"],
    benchmark.runtimes["tsfresh"] / num_features["tsfresh"],
]

features_per_second = [1.0 / r.total_seconds() for r in runtime_per_feature]

normalized_runtime_per_feature = [
    r / runtime_per_feature[0] for r in runtime_per_feature
]

comparison = pd.DataFrame(
    dict(
        runtime=[
            benchmark.runtimes["fastprop"],
            benchmark.runtimes["featuretools"],
            benchmark.runtimes["tsfresh"],
        ],
        num_features=num_features.values(),
        features_per_second=features_per_second,
        normalized_runtime=[
            1,
            benchmark.runtimes["featuretools"] / benchmark.runtimes["fastprop"],
            benchmark.runtimes["tsfresh"] / benchmark.runtimes["fastprop"],
        ],
        normalized_runtime_per_feature=normalized_runtime_per_feature,
        accuracy=[pipe_fp_pr.accuracy, pipe_ft_pr.accuracy, pipe_tsf_pr.accuracy],
        auc=[pipe_fp_pr.auc, pipe_ft_pr.auc, pipe_tsf_pr.auc],
        cross_entropy=[
            pipe_fp_pr.cross_entropy,
            pipe_ft_pr.cross_entropy,
            pipe_tsf_pr.cross_entropy,
        ],
    )
)
comparison.index = ["getML: FastProp", "featuretools", "tsfresh"]
comparison
runtime | num_features | features_per_second | normalized_runtime | normalized_runtime_per_feature | accuracy | auc | cross_entropy | |
---|---|---|---|---|---|---|---|---|
getML: FastProp | 0 days 00:00:01.825967 | 289 | 158.277936 | 1.000000 | 1.000000 | 0.988823 | 0.998166 | 0.044213 |
featuretools | 0 days 00:07:20.110459 | 103 | 0.234032 | 241.028704 | 676.308484 | 0.988455 | 0.997207 | 0.049236 |
tsfresh | 0 days 00:00:14.295312 | 60 | 4.197184 | 7.828899 | 37.710510 | 0.987718 | 0.997861 | 0.049359 |
# export for further use
comparison.to_csv("comparisons/occupancy.csv")
getml.engine.shutdown()