Seznam - Predicting transaction volume¶
Seznam is a Czech company whose scope is similar to Google's. The purpose of this notebook is to analyze data from Seznam's wallet and to predict the transaction volume.
Summary:
- Prediction type: Regression model
- Domain: E-commerce
- Prediction target: Transaction volume
- Population size: 1,462,078
Background¶
Since the dataset is in Czech, we will quickly translate the meaning of the main tables:
- dobito: contains data on prepayments into a wallet
- probehnuto: contains data on charges from a wallet
- probehnuto_mimo_penezenku: contains data on charges from sources other than a wallet
The dataset has been downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015), which now resides at relational-data.org.
We will benchmark getML's feature learning algorithms against featuretools, an open-source implementation of the propositionalization algorithm, similar to getML's FastProp.
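Before diving in, here is a minimal sketch of what propositionalization means: each feature for a population row is an aggregation over the matching rows of a peripheral table. The table and column names below are illustrative toy data, not part of the Seznam schema.

```python
import pandas as pd

# Toy population and peripheral tables, linked by a join key.
population = pd.DataFrame({"client_id": [1, 2]})
peripheral = pd.DataFrame({
    "client_id": [1, 1, 2],
    "amount": [10.0, 30.0, 5.0],
})

# One feature per (column, aggregation) pair, the core idea behind
# FastProp and featuretools alike.
features = (
    peripheral.groupby("client_id")["amount"]
    .agg(["count", "sum", "mean", "max"])
    .add_prefix("amount_")
    .reset_index()
)
flat = population.merge(features, on="client_id", how="left")
```

The feature learners explored below automate exactly this pattern across many columns, aggregations, and time-based conditions.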
Analysis¶
Let's get started with the analysis and set up our session:
%pip install -q "getml==1.5.0" "featuretools==1.31.0" "ipywidgets==8.1.5"
import os
import warnings
import pandas as pd
import featuretools
import woodwork as ww
import getml
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
warnings.simplefilter(action='ignore', category=FutureWarning)
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
getml.engine.launch(allow_remote_ips=True, token='token')
getml.set_project('seznam')
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912150434.log. Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Connected to project 'seznam'.
1. Loading data¶
1.1 Download from source¶
We begin by downloading the data:
conn = getml.database.connect_mysql(
    host="db.relational-data.org",
    dbname="Seznam",
    port=3306,
    user="guest",
    password="relational"
)
conn
Connection(dbname='Seznam', dialect='mysql', host='db.relational-data.org', port=3306)
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if not getml.data.exists(name):
        data_frame = getml.DataFrame.from_db(
            name=name,
            table_name=name,
            conn=conn
        )
        data_frame.save()
    else:
        data_frame = getml.data.load_data_frame(name)
    return data_frame
dobito = load_if_needed("dobito")
probehnuto = load_if_needed("probehnuto")
probehnuto_mimo_penezenku = load_if_needed("probehnuto_mimo_penezenku")
dobito
name | client_id | month_year_datum_transakce | sluzba | kc_dobito |
---|---|---|---|---|
role | unused_float | unused_string | unused_string | unused_string |
0 | 7157857 | 2012-10-01 | c | 1045.62 |
1 | 109700 | 2015-10-01 | c | 5187.28 |
2 | 51508 | 2015-08-01 | c | 408.20 |
3 | 9573550 | 2012-10-01 | c | 521.24 |
4 | 9774621 | 2014-11-01 | c | 386.22 |
... | ... | ... | ... | |
554341 | 65283 | 2012-09-01 | g | 7850.00 |
554342 | 6091446 | 2012-08-01 | g | 31400.00 |
554343 | 1264806 | 2013-08-01 | g | -8220.52 |
554344 | 101103 | 2012-08-01 | g | 3140.00 |
554345 | 8674551 | 2012-08-01 | g | 6280.00 |
554346 rows x 4 columns
memory usage: 29.59 MB
name: dobito
type: getml.DataFrame
probehnuto
name | client_id | month_year_datum_transakce | sluzba | kc_proklikano |
---|---|---|---|---|
role | unused_float | unused_string | unused_string | unused_string |
0 | 109145 | 2013-06-01 | c | -31.40 |
1 | 9804394 | 2015-10-01 | h | 37.68 |
2 | 9803353 | 2015-10-01 | h | 725.34 |
3 | 9801753 | 2015-10-01 | h | 194.68 |
4 | 9800425 | 2015-10-01 | h | 1042.48 |
... | ... | ... | ... | |
1462073 | 98857 | 2015-08-01 | NULL | 153.86 |
1462074 | 95776 | 2015-09-01 | NULL | 153.86 |
1462075 | 98857 | 2015-09-01 | NULL | 153.86 |
1462076 | 90001 | 2015-10-01 | NULL | 310.86 |
1462077 | 946957 | 2015-10-01 | NULL | 153.86 |
1462078 rows x 4 columns
memory usage: 77.07 MB
name: probehnuto
type: getml.DataFrame
probehnuto_mimo_penezenku
name | client_id | Month/Year | probehla_inzerce_mimo_penezenku |
---|---|---|---|
role | unused_float | unused_string | unused_string |
0 | 3901 | 2012-08-01 | ANO |
1 | 3901 | 2012-09-01 | ANO |
2 | 3901 | 2012-10-01 | ANO |
3 | 3901 | 2012-11-01 | ANO |
4 | 3901 | 2012-12-01 | ANO |
... | ... | ... | |
599381 | 9804086 | 2015-10-01 | ANO |
599382 | 9804238 | 2015-10-01 | ANO |
599383 | 9804782 | 2015-10-01 | ANO |
599384 | 9804810 | 2015-10-01 | ANO |
599385 | 9805032 | 2015-10-01 | ANO |
599386 rows x 3 columns
memory usage: 23.38 MB
name: probehnuto_mimo_penezenku
type: getml.DataFrame
1.2 Prepare data for getML¶
getML requires that we assign a role to each column.
dobito.set_role("client_id", getml.data.roles.join_key)
dobito.set_role("month_year_datum_transakce", getml.data.roles.time_stamp)
dobito.set_role("sluzba", getml.data.roles.categorical)
dobito.set_role("kc_dobito", getml.data.roles.numerical)
dobito.set_unit("sluzba", "service")
dobito
name | month_year_datum_transakce | client_id | sluzba | kc_dobito |
---|---|---|---|---|
role | time_stamp | join_key | categorical | numerical |
unit | time stamp, comparison only | service | ||
0 | 2012-10-01 | 7157857 | c | 1045.62 |
1 | 2015-10-01 | 109700 | c | 5187.28 |
2 | 2015-08-01 | 51508 | c | 408.2 |
3 | 2012-10-01 | 9573550 | c | 521.24 |
4 | 2014-11-01 | 9774621 | c | 386.22 |
... | ... | ... | ... | |
554341 | 2012-09-01 | 65283 | g | 7850 |
554342 | 2012-08-01 | 6091446 | g | 31400 |
554343 | 2013-08-01 | 1264806 | g | -8220.52 |
554344 | 2012-08-01 | 101103 | g | 3140 |
554345 | 2012-08-01 | 8674551 | g | 6280 |
554346 rows x 4 columns
memory usage: 13.30 MB
name: dobito
type: getml.DataFrame
probehnuto.set_role("client_id", getml.data.roles.join_key)
probehnuto.set_role("month_year_datum_transakce", getml.data.roles.time_stamp)
probehnuto.set_role("sluzba", getml.data.roles.categorical)
probehnuto.set_role("kc_proklikano", getml.data.roles.target)
probehnuto.set_unit("sluzba", "service")
probehnuto
name | month_year_datum_transakce | client_id | kc_proklikano | sluzba |
---|---|---|---|---|
role | time_stamp | join_key | target | categorical |
unit | time stamp, comparison only | service | ||
0 | 2013-06-01 | 109145 | -31.4 | c |
1 | 2015-10-01 | 9804394 | 37.68 | h |
2 | 2015-10-01 | 9803353 | 725.34 | h |
3 | 2015-10-01 | 9801753 | 194.68 | h |
4 | 2015-10-01 | 9800425 | 1042.48 | h |
... | ... | ... | ... | |
1462073 | 2015-08-01 | 98857 | 153.86 | NULL |
1462074 | 2015-09-01 | 95776 | 153.86 | NULL |
1462075 | 2015-09-01 | 98857 | 153.86 | NULL |
1462076 | 2015-10-01 | 90001 | 310.86 | NULL |
1462077 | 2015-10-01 | 946957 | 153.86 | NULL |
1462078 rows x 4 columns
memory usage: 35.09 MB
name: probehnuto
type: getml.DataFrame
probehnuto_mimo_penezenku.set_role("client_id", getml.data.roles.join_key)
probehnuto_mimo_penezenku.set_role("Month/Year", getml.data.roles.time_stamp)
probehnuto_mimo_penezenku
name | Month/Year | client_id | probehla_inzerce_mimo_penezenku |
---|---|---|---|
role | time_stamp | join_key | unused_string |
unit | time stamp, comparison only | ||
0 | 2012-08-01 | 3901 | ANO |
1 | 2012-09-01 | 3901 | ANO |
2 | 2012-10-01 | 3901 | ANO |
3 | 2012-11-01 | 3901 | ANO |
4 | 2012-12-01 | 3901 | ANO |
... | ... | ... | |
599381 | 2015-10-01 | 9804086 | ANO |
599382 | 2015-10-01 | 9804238 | ANO |
599383 | 2015-10-01 | 9804782 | ANO |
599384 | 2015-10-01 | 9804810 | ANO |
599385 | 2015-10-01 | 9805032 | ANO |
599386 rows x 3 columns
memory usage: 14.39 MB
name: probehnuto_mimo_penezenku
type: getml.DataFrame
split = getml.data.split.random(train=0.8, test=0.2)
split
0 | train |
---|---|
1 | train |
2 | train |
3 | test |
4 | train |
... |
infinite number of rows
type: StringColumnView
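The split is a lazily evaluated column view: each row drawn from it is tagged "train" with probability 0.8 and "test" with probability 0.2. A hedged NumPy illustration of the same idea (not getML's actual implementation):

```python
import numpy as np

# Tag a million rows "train"/"test" with an 80/20 random split,
# mirroring getml.data.split.random(train=0.8, test=0.2).
rng = np.random.default_rng(0)
labels = np.where(rng.random(1_000_000) < 0.8, "train", "test")
share_train = (labels == "train").mean()  # close to 0.8
```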
2. Predictive modeling¶
We loaded the data and defined the roles and units. Next, we create a getML pipeline for relational learning.
2.1 Define relational model¶
star_schema = getml.data.StarSchema(population=probehnuto, alias="population", split=split)
star_schema.join(
    probehnuto,
    on="client_id",
    time_stamps="month_year_datum_transakce",
    lagged_targets=True,
    horizon=getml.data.time.days(1),
)
star_schema.join(
    dobito,
    on="client_id",
    time_stamps="month_year_datum_transakce",
)
star_schema.join(
    probehnuto_mimo_penezenku,
    on="client_id",
    time_stamps=("month_year_datum_transakce", "Month/Year"),
)
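To make the time-stamp conditions concrete, here is a hedged pandas sketch of what such a join lets a feature see: only peripheral rows whose time stamp, shifted by the horizon, does not exceed the population row's time stamp. With the one-day horizon on the self-join, the target value from the population row's own month is excluded, which prevents target leakage. Column names are illustrative.

```python
import pandas as pd

population = pd.DataFrame({
    "client_id": [1],
    "ts": pd.to_datetime(["2015-10-01"]),
})
peripheral = pd.DataFrame({
    "client_id": [1, 1, 1],
    "ts": pd.to_datetime(["2015-08-01", "2015-10-01", "2015-12-01"]),
    "kc_proklikano": [100.0, 200.0, 300.0],
})

horizon = pd.Timedelta(days=1)
joined = population.merge(peripheral, on="client_id", suffixes=("", "_peri"))
# Only strictly earlier months survive: the 2015-10-01 row (the target's
# own month) and the future 2015-12-01 row are both filtered out.
visible = joined[joined["ts_peri"] + horizon <= joined["ts"]]
```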
star_schema
data frames | staging table | |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | dobito | DOBITO__STAGING_TABLE_2 |
2 | probehnuto | PROBEHNUTO__STAGING_TABLE_3 |
3 | probehnuto_mimo_penezenku | PROBEHNUTO_MIMO_PENEZENKU__STAGING_TABLE_4 |
subset | name | rows | type | |
---|---|---|---|---|
0 | test | probehnuto | 292833 | View |
1 | train | probehnuto | 1169245 | View |
name | rows | type | |
---|---|---|---|
0 | probehnuto | 1462078 | DataFrame |
1 | dobito | 554346 | DataFrame |
2 | probehnuto_mimo_penezenku | 599386 | DataFrame |
2.2 getML pipeline¶
Set up the feature learner & predictor
mapping = getml.preprocessors.Mapping()
fast_prop = getml.feature_learning.FastProp(
    aggregation=getml.feature_learning.FastProp.agg_sets.All,
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    num_threads=1,
    sampling_factor=0.1,
)
feature_selector = getml.predictors.XGBoostRegressor(n_jobs=1, external_memory=True)
predictor = getml.predictors.XGBoostRegressor(n_jobs=1)
Build the pipeline
pipe1 = getml.Pipeline(
    tags=['fast_prop'],
    data_model=star_schema.data_model,
    preprocessors=[mapping],
    feature_learners=[fast_prop],
    feature_selectors=[feature_selector],
    predictors=[predictor],
    include_categorical=True,
)
pipe1
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=['XGBoostRegressor'], include_categorical=True, loss_function='SquareLoss', peripheral=['dobito', 'probehnuto', 'probehnuto_mimo_penezenku'], predictors=['XGBoostRegressor'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['fast_prop'])
2.3 Model training¶
pipe1.check(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:20 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02
The pipeline check generated 2 issues labeled INFO and 0 issues labeled WARNING.
type | label | message | |
---|---|---|---|
0 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and DOBITO__STAGING_TABLE_2 over 'client_id' and 'client_id', there are no corresponding entries for 2.228789% of entries in 'client_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
1 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and PROBEHNUTO_MIMO_PENEZENKU__STAGING_TABLE_4 over 'client_id' and 'client_id', there are no corresponding entries for 26.543966% of entries in 'client_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
pipe1.fit(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 2 issues labeled INFO and 0 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Trying 909 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 01:13 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 02:31 XGBoost: Training as feature selector... ━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 20:02 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 09:24
Trained pipeline.
Time taken: 0:33:12.643341.
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=['XGBoostRegressor'], include_categorical=True, loss_function='SquareLoss', peripheral=['dobito', 'probehnuto', 'probehnuto_mimo_penezenku'], predictors=['XGBoostRegressor'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['fast_prop', 'container-AeQKVm'])
2.4 Model evaluation¶
fastprop_score = pipe1.score(star_schema.test)
fastprop_score
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:19
date time | set used | target | mae | rmse | rsquared | |
---|---|---|---|---|---|---|
0 | 2024-09-12 12:26:41 | train | kc_proklikano | 2940.4502 | 14384.5507 | 0.9423 |
1 | 2024-09-12 12:27:01 | test | kc_proklikano | 2998.9588 | 18673.8813 | 0.8751 |
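For reference, the three scores in the table can be reproduced from predictions as follows. This is a generic NumPy sketch on toy numbers, with rsquared taken as the coefficient of determination; getML's exact definition may differ slightly.

```python
import numpy as np

# Toy targets and predictions, standing in for kc_proklikano.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 310.0])

mae = np.mean(np.abs(y_true - y_pred))            # mean absolute error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))   # root mean squared error
rsquared = 1.0 - (
    np.sum((y_true - y_pred) ** 2)
    / np.sum((y_true - y_true.mean()) ** 2)
)
```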
2.5 featuretools¶
To keep featuretools' runtime manageable, we benchmark it on a random 25% subsample of the training set; the test set is used in full.
include = (getml.data.random() < 0.25)
include
0 | true |
---|---|
1 | false |
2 | true |
3 | false |
4 | false |
... |
infinite number of rows
type: BooleanColumnView
population_train_pd = star_schema.train.population[include].to_pandas()
population_test_pd = star_schema.test.population.to_pandas()
population_train_pd["id"] = population_train_pd.index
population_test_pd["id"] = population_test_pd.index
probehnuto_pd = probehnuto.drop(probehnuto.roles.unused).to_pandas()
dobito_pd = dobito.drop(dobito.roles.unused).to_pandas()
probehnuto_mimo_penezenku_pd = probehnuto_mimo_penezenku.drop(probehnuto_mimo_penezenku.roles.unused).to_pandas()
def prepare_peripheral(peripheral_pd, train_or_test):
    """
    Helper function that imitates the behavior of
    the data model defined above.
    """
    peripheral_new = peripheral_pd.merge(
        train_or_test[["id", "client_id", "month_year_datum_transakce"]],
        on="client_id"
    )
    # Keep only peripheral rows that lie strictly before the population
    # row's time stamp. The peripheral time stamp keeps pandas' "_x"
    # suffix; the woodwork schemas below refer to it by that name.
    peripheral_new = peripheral_new[
        peripheral_new["month_year_datum_transakce_x"] < peripheral_new["month_year_datum_transakce_y"]
    ]
    del peripheral_new["month_year_datum_transakce_y"]
    del peripheral_new["client_id"]
    return peripheral_new
def prepare_probehnuto_mimo_penezenku(peripheral_pd, train_or_test):
    """
    Helper function that imitates the behavior of
    the data model defined above.
    """
    peripheral_new = peripheral_pd.merge(
        train_or_test[["id", "client_id", "month_year_datum_transakce"]],
        on="client_id"
    )
    peripheral_new = peripheral_new[
        peripheral_new["Month/Year"] < peripheral_new["month_year_datum_transakce"]
    ]
    del peripheral_new["month_year_datum_transakce"]
    del peripheral_new["client_id"]
    return peripheral_new
dobito_train_pd = prepare_peripheral(dobito_pd, population_train_pd)
dobito_test_pd = prepare_peripheral(dobito_pd, population_test_pd)
dobito_train_pd
sluzba | kc_dobito | month_year_datum_transakce_x | id | |
---|---|---|---|---|
0 | c | 1045.62 | 2012-10-01 | 2127 |
1 | c | 1045.62 | 2012-10-01 | 17709 |
2 | c | 1045.62 | 2012-10-01 | 50363 |
14 | c | 408.20 | 2015-08-01 | 152319 |
15 | c | 521.24 | 2012-10-01 | 153913 |
... | ... | ... | ... | ... |
4462027 | g | 6280.00 | 2012-08-01 | 92370 |
4462028 | g | 6280.00 | 2012-08-01 | 140842 |
4462029 | g | 6280.00 | 2012-08-01 | 146070 |
4462030 | g | 6280.00 | 2012-08-01 | 175024 |
4462031 | g | 6280.00 | 2012-08-01 | 253772 |
2240543 rows × 4 columns
probehnuto_train_pd = prepare_peripheral(probehnuto_pd, population_train_pd)
probehnuto_test_pd = prepare_peripheral(probehnuto_pd, population_test_pd)
probehnuto_train_pd
sluzba | kc_proklikano | month_year_datum_transakce_x | id | |
---|---|---|---|---|
1 | c | -31.40 | 2013-06-01 | 281262 |
4 | c | -31.40 | 2013-06-01 | 288356 |
6 | c | -31.40 | 2013-06-01 | 289265 |
7 | c | -31.40 | 2013-06-01 | 289267 |
10 | c | -31.40 | 2013-06-01 | 290759 |
... | ... | ... | ... | ... |
11186627 | None | 13545.96 | 2015-06-01 | 175888 |
11186634 | None | 13545.96 | 2015-06-01 | 272451 |
11186644 | None | 13545.96 | 2015-06-01 | 284406 |
11186660 | None | 153.86 | 2015-07-01 | 286198 |
11186663 | None | 153.86 | 2015-07-01 | 284454 |
5388870 rows × 4 columns
probehnuto_mimo_penezenku_train_pd = prepare_probehnuto_mimo_penezenku(probehnuto_mimo_penezenku_pd, population_train_pd)
probehnuto_mimo_penezenku_test_pd = prepare_probehnuto_mimo_penezenku(probehnuto_mimo_penezenku_pd, population_test_pd)
probehnuto_mimo_penezenku_train_pd
Month/Year | id | |
---|---|---|
0 | 2012-08-01 | 269301 |
8 | 2012-08-01 | 9204 |
9 | 2012-08-01 | 23838 |
10 | 2012-08-01 | 24471 |
11 | 2012-08-01 | 45868 |
... | ... | ... |
3568048 | 2015-09-01 | 160015 |
3568050 | 2015-09-01 | 19 |
3568051 | 2015-09-01 | 1565 |
3568053 | 2015-09-01 | 151283 |
3568060 | 2015-09-01 | 158546 |
2832768 rows × 2 columns
del population_train_pd["client_id"]
del population_test_pd["client_id"]
population_train_pd
sluzba | kc_proklikano | month_year_datum_transakce | id | |
---|---|---|---|---|
0 | c | -31.40 | 2013-06-01 | 0 |
1 | h | 725.34 | 2015-10-01 | 1 |
2 | h | 8550.22 | 2015-10-01 | 2 |
3 | h | 2408.38 | 2015-10-01 | 3 |
4 | h | 1893.42 | 2015-10-01 | 4 |
... | ... | ... | ... | ... |
292153 | None | 153.86 | 2015-03-01 | 292153 |
292154 | None | 153.86 | 2015-05-01 | 292154 |
292155 | None | 13545.96 | 2015-06-01 | 292155 |
292156 | None | 153.86 | 2015-06-01 | 292156 |
292157 | None | 153.86 | 2015-08-01 | 292157 |
292158 rows × 4 columns
def add_index(df):
    df.insert(0, "index", range(len(df)))
population_pd_logical_types = {
    'id': ww.logical_types.Integer,
    'sluzba': ww.logical_types.Categorical,
    'kc_proklikano': ww.logical_types.Double,
    'month_year_datum_transakce': ww.logical_types.Datetime
}
population_train_pd.ww.init(logical_types=population_pd_logical_types, index='id', name='population')
population_test_pd.ww.init(logical_types=population_pd_logical_types, index='id', name='population')
add_index(dobito_train_pd)
add_index(dobito_test_pd)
dobito_pd_logical_types = {
    'index': ww.logical_types.Integer,
    'sluzba': ww.logical_types.Categorical,
    'kc_dobito': ww.logical_types.Double,
    'month_year_datum_transakce_x': ww.logical_types.Datetime,
    'id': ww.logical_types.Integer
}
dobito_train_pd.ww.init(logical_types=dobito_pd_logical_types, index='index', name='dobito')
dobito_test_pd.ww.init(logical_types=dobito_pd_logical_types, index='index', name='dobito')
add_index(probehnuto_train_pd)
add_index(probehnuto_test_pd)
probehnuto_pd_logical_types = {
    'index': ww.logical_types.Integer,
    'sluzba': ww.logical_types.Categorical,
    'kc_proklikano': ww.logical_types.Double,
    'month_year_datum_transakce_x': ww.logical_types.Datetime,
    'id': ww.logical_types.Integer
}
probehnuto_train_pd.ww.init(logical_types=probehnuto_pd_logical_types, index='index', name='probehnuto')
probehnuto_test_pd.ww.init(logical_types=probehnuto_pd_logical_types, index='index', name='probehnuto')
add_index(probehnuto_mimo_penezenku_train_pd)
add_index(probehnuto_mimo_penezenku_test_pd)
probehnuto_mimo_penezenku_pd_logical_types = {
    'index': ww.logical_types.Integer,
    'Month/Year': ww.logical_types.Datetime,
    'id': ww.logical_types.Integer
}
probehnuto_mimo_penezenku_train_pd.ww.init(logical_types=probehnuto_mimo_penezenku_pd_logical_types, index='index', name='probehnuto_mimo_penezenku')
probehnuto_mimo_penezenku_test_pd.ww.init(logical_types=probehnuto_mimo_penezenku_pd_logical_types, index='index', name='probehnuto_mimo_penezenku')
dataframes_train = {
    "population": (population_train_pd, ),
    "dobito": (dobito_train_pd, ),
    "probehnuto": (probehnuto_train_pd, ),
    "probehnuto_mimo_penezenku": (probehnuto_mimo_penezenku_train_pd, ),
}
dataframes_test = {
    "population": (population_test_pd, ),
    "dobito": (dobito_test_pd, ),
    "probehnuto": (probehnuto_test_pd, ),
    "probehnuto_mimo_penezenku": (probehnuto_mimo_penezenku_test_pd, ),
}
relationships = [
    ("population", "id", "dobito", "id"),
    ("population", "id", "probehnuto", "id"),
    ("population", "id", "probehnuto_mimo_penezenku", "id"),
]
featuretools_train_pd = featuretools.dfs(
    dataframes=dataframes_train,
    relationships=relationships,
    target_dataframe_name="population")[0]
featuretools_test_pd = featuretools.dfs(
    dataframes=dataframes_test,
    relationships=relationships,
    target_dataframe_name="population")[0]
featuretools_train = getml.data.DataFrame.from_pandas(featuretools_train_pd, "featuretools_train")
featuretools_test = getml.data.DataFrame.from_pandas(featuretools_test_pd, "featuretools_test")
featuretools_train.set_role("kc_proklikano", getml.data.roles.target)
featuretools_train.set_role(featuretools_train.roles.unused_float, getml.data.roles.numerical)
featuretools_train.set_role(featuretools_train.roles.unused_string, getml.data.roles.categorical)
featuretools_train
name | kc_proklikano | sluzba | COUNT(dobito) | MODE(dobito.sluzba) | NUM_UNIQUE(dobito.sluzba) | COUNT(probehnuto) | MODE(probehnuto.sluzba) | NUM_UNIQUE(probehnuto.sluzba) | COUNT(probehnuto_mimo_penezenku) | DAY(month_year_datum_transakce) | MONTH(month_year_datum_transakce) | WEEKDAY(month_year_datum_transakce) | YEAR(month_year_datum_transakce) | MODE(dobito.DAY(month_year_datum_transakce_x)) | MODE(dobito.MONTH(month_year_datum_transakce_x)) | MODE(dobito.WEEKDAY(month_year_datum_transakce_x)) | MODE(dobito.YEAR(month_year_datum_transakce_x)) | NUM_UNIQUE(dobito.DAY(month_year_datum_transakce_x)) | NUM_UNIQUE(dobito.MONTH(month_year_datum_transakce_x)) | NUM_UNIQUE(dobito.WEEKDAY(month_year_datum_transakce_x)) | NUM_UNIQUE(dobito.YEAR(month_year_datum_transakce_x)) | MODE(probehnuto.DAY(month_year_datum_transakce_x)) | MODE(probehnuto.MONTH(month_year_datum_transakce_x)) | MODE(probehnuto.WEEKDAY(month_year_datum_transakce_x)) | MODE(probehnuto.YEAR(month_year_datum_transakce_x)) | NUM_UNIQUE(probehnuto.DAY(month_year_datum_transakce_x)) | NUM_UNIQUE(probehnuto.MONTH(month_year_datum_transakce_x)) | NUM_UNIQUE(probehnuto.WEEKDAY(month_year_datum_transakce_x)) | NUM_UNIQUE(probehnuto.YEAR(month_year_datum_transakce_x)) | MODE(probehnuto_mimo_penezenku.DAY(Month/Year)) | MODE(probehnuto_mimo_penezenku.MONTH(Month/Year)) | MODE(probehnuto_mimo_penezenku.WEEKDAY(Month/Year)) | MODE(probehnuto_mimo_penezenku.YEAR(Month/Year)) | NUM_UNIQUE(probehnuto_mimo_penezenku.DAY(Month/Year)) | NUM_UNIQUE(probehnuto_mimo_penezenku.MONTH(Month/Year)) | NUM_UNIQUE(probehnuto_mimo_penezenku.WEEKDAY(Month/Year)) | NUM_UNIQUE(probehnuto_mimo_penezenku.YEAR(Month/Year)) | MAX(dobito.kc_dobito) | MEAN(dobito.kc_dobito) | MIN(dobito.kc_dobito) | SKEW(dobito.kc_dobito) | STD(dobito.kc_dobito) | SUM(dobito.kc_dobito) | MAX(probehnuto.kc_proklikano) | MEAN(probehnuto.kc_proklikano) | MIN(probehnuto.kc_proklikano) | SKEW(probehnuto.kc_proklikano) | STD(probehnuto.kc_proklikano) | SUM(probehnuto.kc_proklikano) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | target | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical |
0 | -31.4 | c | 1 | c | 1 | 13 | d | 1 | 0 | 1 | 6 | 5 | 2013 | 1 | 12 | 5 | 2012 | 1 | 1 | 1 | 1 | 1 | 8 | 0 | 2012 | 1 | 10 | 6 | 2 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1306.24 | 1306.24 | 1306.24 | nan | nan | 1306.24 | 351.68 | 155.7923 | 9.42 | 0.5817 | 79.3799 | 2025.3 |
1 | 725.34 | h | 4 | h | 1 | 5 | h | 1 | 0 | 1 | 10 | 3 | 2015 | 1 | 5 | 0 | 2015 | 1 | 4 | 4 | 1 | 1 | 5 | 0 | 2015 | 1 | 5 | 5 | 1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1036.2 | 614.655 | 257.48 | 0.5563 | 324.3624 | 2458.62 | 634.28 | 388.732 | 131.88 | 0.09478 | 205.8605 | 1943.66 |
2 | 8550.22 | h | 7 | h | 2 | 11 | h | 1 | 0 | 1 | 10 | 3 | 2015 | 1 | 8 | 2 | 2015 | 1 | 6 | 5 | 2 | 1 | 1 | 2 | 2015 | 1 | 11 | 7 | 2 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 62800 | 20907.9143 | 0 | 1.214 | 22375.0788 | 146355.4 | 39752.4 | 13280.7727 | 3215.36 | 1.4845 | 12240.1205 | 146088.5 |
3 | 2408.38 | h | 4 | h | 1 | 5 | h | 1 | 0 | 1 | 10 | 3 | 2015 | 1 | 5 | 1 | 2015 | 1 | 4 | 4 | 1 | 1 | 5 | 0 | 2015 | 1 | 5 | 5 | 1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1554.3 | 1361.19 | 1296.82 | 2 | 128.74 | 5444.76 | 1635.94 | 1092.72 | 15.7 | -1.0846 | 711.3296 | 5463.6 |
4 | 1893.42 | h | 12 | h | 4 | 22 | d | 3 | 0 | 1 | 10 | 3 | 2015 | 1 | 2 | 4 | 2015 | 1 | 5 | 6 | 2 | 1 | 2 | 2 | 2015 | 1 | 9 | 6 | 2 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2615.62 | 1078.3283 | 310.86 | 0.8429 | 835.3628 | 12939.94 | 1510.34 | 471 | 0 | 0.8615 | 531.1884 | 10362 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
292153 | 153.86 | NULL | 12 | c | 1 | 34 | d | 1 | 0 | 1 | 3 | 6 | 2015 | 1 | 1 | 0 | 2013 | 1 | 8 | 6 | 3 | 1 | 1 | 5 | 2013 | 1 | 12 | 7 | 4 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 518.1 | 337.8117 | 153.86 | 0.02976 | 99.6511 | 4053.74 | 188.4 | 141.9465 | 0 | -2.9319 | 44.2498 | 4826.18 |
292154 | 153.86 | NULL | 6 | f | 1 | 34 | f | 1 | 0 | 1 | 5 | 4 | 2015 | 1 | 2 | 1 | 2013 | 1 | 6 | 4 | 3 | 1 | 1 | 5 | 2013 | 1 | 12 | 7 | 4 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1296.82 | 913.74 | 31.4 | -2.1286 | 444.4448 | 5482.44 | 188.4 | 150.8124 | -31.4 | -5.2155 | 33.226 | 5127.62 |
292155 | 13545.96 | NULL | 214 | c | 8 | 283 | d | 8 | 0 | 1 | 6 | 0 | 2015 | 1 | 1 | 5 | 2014 | 1 | 12 | 7 | 4 | 1 | 1 | 5 | 2014 | 1 | 12 | 7 | 4 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 7131326 | 794524.2239 | -11422331 | -2.0115 | 1432748.8604 | 170028183.91 | 6965622.14 | 640600.4428 | -866.64 | 2.2138 | 1006527.2489 | 181289925.3 |
292156 | 153.86 | NULL | 14 | c | 1 | 0 | NULL | NULL | 0 | 1 | 6 | 0 | 2015 | 1 | 4 | 6 | 2014 | 1 | 12 | 7 | 2 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 518.1 | 179.8771 | 153.86 | 3.7417 | 97.3472 | 2518.28 | nan | nan | nan | nan | nan | 0 |
292157 | 153.86 | NULL | 8 | c | 1 | 36 | NULL | 0 | 0 | 1 | 8 | 5 | 2015 | 1 | 8 | 1 | 2013 | 1 | 7 | 5 | 4 | 1 | 1 | 0 | 2013 | 1 | 12 | 7 | 4 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 942 | 764.1975 | 518.1 | -1.1545 | 115.2358 | 6113.58 | 188.4 | 155.1683 | 153.86 | 5.7312 | 5.7838 | 5586.06 |
292158 rows x 49 columns
memory usage: 72.46 MB
name: featuretools_train
type: getml.DataFrame
featuretools_test.set_role("kc_proklikano", getml.data.roles.target)
featuretools_test.set_role(featuretools_test.roles.unused_float, getml.data.roles.numerical)
featuretools_test.set_role(featuretools_test.roles.unused_string, getml.data.roles.categorical)
featuretools_test
name | kc_proklikano | sluzba | COUNT(dobito) | MODE(dobito.sluzba) | NUM_UNIQUE(dobito.sluzba) | COUNT(probehnuto) | MODE(probehnuto.sluzba) | NUM_UNIQUE(probehnuto.sluzba) | COUNT(probehnuto_mimo_penezenku) | DAY(month_year_datum_transakce) | MONTH(month_year_datum_transakce) | WEEKDAY(month_year_datum_transakce) | YEAR(month_year_datum_transakce) | MODE(dobito.DAY(month_year_datum_transakce_x)) | MODE(dobito.MONTH(month_year_datum_transakce_x)) | MODE(dobito.WEEKDAY(month_year_datum_transakce_x)) | MODE(dobito.YEAR(month_year_datum_transakce_x)) | NUM_UNIQUE(dobito.DAY(month_year_datum_transakce_x)) | NUM_UNIQUE(dobito.MONTH(month_year_datum_transakce_x)) | NUM_UNIQUE(dobito.WEEKDAY(month_year_datum_transakce_x)) | NUM_UNIQUE(dobito.YEAR(month_year_datum_transakce_x)) | MODE(probehnuto.DAY(month_year_datum_transakce_x)) | MODE(probehnuto.MONTH(month_year_datum_transakce_x)) | MODE(probehnuto.WEEKDAY(month_year_datum_transakce_x)) | MODE(probehnuto.YEAR(month_year_datum_transakce_x)) | NUM_UNIQUE(probehnuto.DAY(month_year_datum_transakce_x)) | NUM_UNIQUE(probehnuto.MONTH(month_year_datum_transakce_x)) | NUM_UNIQUE(probehnuto.WEEKDAY(month_year_datum_transakce_x)) | NUM_UNIQUE(probehnuto.YEAR(month_year_datum_transakce_x)) | MODE(probehnuto_mimo_penezenku.DAY(Month/Year)) | MODE(probehnuto_mimo_penezenku.MONTH(Month/Year)) | MODE(probehnuto_mimo_penezenku.WEEKDAY(Month/Year)) | MODE(probehnuto_mimo_penezenku.YEAR(Month/Year)) | NUM_UNIQUE(probehnuto_mimo_penezenku.DAY(Month/Year)) | NUM_UNIQUE(probehnuto_mimo_penezenku.MONTH(Month/Year)) | NUM_UNIQUE(probehnuto_mimo_penezenku.WEEKDAY(Month/Year)) | NUM_UNIQUE(probehnuto_mimo_penezenku.YEAR(Month/Year)) | MAX(dobito.kc_dobito) | MEAN(dobito.kc_dobito) | MIN(dobito.kc_dobito) | SKEW(dobito.kc_dobito) | STD(dobito.kc_dobito) | SUM(dobito.kc_dobito) | MAX(probehnuto.kc_proklikano) | MEAN(probehnuto.kc_proklikano) | MIN(probehnuto.kc_proklikano) | SKEW(probehnuto.kc_proklikano) | STD(probehnuto.kc_proklikano) | SUM(probehnuto.kc_proklikano) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | target | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical |
0 | 194.68 | h | 2 | d | 2 | 2 | d | 2 | 0 | 1 | 10 | 3 | 2015 | 1 | 9 | 1 | 2015 | 1 | 1 | 1 | 1 | 1 | 9 | 1 | 2015 | 1 | 1 | 1 | 1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1296.82 | 777.15 | 257.48 | nan | 734.9244 | 1554.3 | 763.02 | 401.92 | 40.82 | nan | 510.6725 | 803.84 |
1 | 405.06 | h | 1 | h | 1 | 2 | h | 1 | 0 | 1 | 10 | 3 | 2015 | 1 | 8 | 5 | 2015 | 1 | 1 | 1 | 1 | 1 | 8 | 1 | 2015 | 1 | 2 | 2 | 1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1296.82 | 1296.82 | 1296.82 | nan | nan | 1296.82 | 565.2 | 452.16 | 339.12 | nan | 159.8627 | 904.32 |
2 | 580.9 | h | 4 | d | 2 | 5 | d | 2 | 0 | 1 | 10 | 3 | 2015 | 1 | 9 | 1 | 2015 | 1 | 3 | 3 | 1 | 1 | 9 | 1 | 2015 | 1 | 4 | 4 | 1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1296.82 | 1231.665 | 1036.2 | -2. | 130.31 | 4926.66 | 913.74 | 454.044 | 34.54 | 0.2893 | 328.7162 | 2270.22 |
3 | 106.76 | h | 0 | NULL | NULL | 0 | NULL | NULL | 0 | 1 | 10 | 3 | 2015 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 | nan | nan | nan | nan | nan | 0 |
4 | 1927.96 | h | 15 | d | 2 | 21 | d | 2 | 0 | 1 | 10 | 3 | 2015 | 1 | 9 | 0 | 2015 | 1 | 10 | 6 | 2 | 1 | 9 | 0 | 2015 | 1 | 12 | 7 | 2 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 7784.06 | 1850.088 | 257.48 | 2.4789 | 1898.9207 | 27751.32 | 5199.84 | 1148.1933 | 25.12 | 1.8651 | 1342.4638 | 24112.06 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
292828 | 153.86 | NULL | 5 | c | 1 | 36 | d | 2 | 0 | 1 | 4 | 2 | 2015 | 1 | 12 | 5 | 2013 | 1 | 4 | 4 | 3 | 1 | 8 | 5 | 2013 | 1 | 12 | 7 | 4 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1306.24 | 1045.62 | 31.4 | -2.2358 | 566.9809 | 5228.1 | 351.68 | 150.3711 | -31.4 | -0.2998 | 56.2491 | 5413.36 |
292829 | 153.86 | NULL | 3 | c | 1 | 35 | c | 1 | 0 | 1 | 6 | 0 | 2015 | 1 | 4 | 1 | 2012 | 1 | 3 | 3 | 3 | 1 | 1 | 5 | 2013 | 1 | 12 | 7 | 4 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2615.62 | 1757.3533 | 62.8 | -1.7316 | 1467.5674 | 5272.06 | 188.4 | 150.0023 | -62.8 | -5.4539 | 37.9032 | 5250.08 |
292830 | 153.86 | NULL | 6 | f | 1 | 35 | NULL | 0 | 0 | 1 | 7 | 2 | 2015 | 1 | 3 | 5 | 2014 | 1 | 4 | 3 | 4 | 1 | 1 | 0 | 2013 | 1 | 12 | 7 | 4 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1884 | 831.5767 | 518.1 | 2.1524 | 530.786 | 4989.46 | 188.4 | 155.2057 | 153.86 | 5.6511 | 5.8638 | 5432.2 |
292831 | 310.86 | NULL | 3 | c | 2 | 38 | NULL | 0 | 0 | 1 | 10 | 3 | 2015 | 1 | 8 | 2 | 2012 | 1 | 2 | 3 | 3 | 1 | 8 | 5 | 2013 | 1 | 12 | 7 | 4 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 5024 | 4442.0533 | 4151.08 | 1.7321 | 503.9806 | 13326.16 | 376.8 | 312.9258 | 310.86 | 6.0854 | 10.6864 | 11891.18 |
292832 | 153.86 | NULL | 4 | c | 1 | 35 | NULL | 0 | 0 | 1 | 10 | 3 | 2015 | 1 | 1 | 1 | 2013 | 1 | 4 | 3 | 3 | 1 | 1 | 5 | 2013 | 1 | 12 | 7 | 4 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1868.3 | 1415.355 | 628 | -0.9865 | 589.9991 | 5661.42 | 157 | 154.0394 | 153.86 | 3.9889 | 0.7395 | 5391.38 |
292833 rows x 49 columns
memory usage: 72.62 MB
name: featuretools_test
type: getml.DataFrame
We train an untuned XGBoostRegressor on top of featuretools' features, just like we did for getML's features.
Since some of featuretools' features are categorical, we allow the pipeline to include them as well. Other features contain NaN values, which is why we also apply getML's Imputation preprocessor.
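Conceptually, this kind of imputation replaces a missing value with the column mean and adds a dummy column marking which rows were imputed, so the model can still distinguish originally missing values. A minimal pandas sketch (the column name and values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Illustrative data: a numerical feature with missing values.
df = pd.DataFrame({"mean_amount": [1296.82, np.nan, 565.2, np.nan]})

# Record which rows were missing before filling them in,
# so the imputation itself remains visible to the model.
df["mean_amount_imputed"] = df["mean_amount"].isna().astype(int)
df["mean_amount"] = df["mean_amount"].fillna(df["mean_amount"].mean())

print(df)
```

This is only a sketch of the general technique; getML's Imputation preprocessor applies the equivalent transformation automatically to all numerical columns inside the pipeline.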
data_model = getml.data.DataModel("population")
imputation = getml.preprocessors.Imputation()
predictor = getml.predictors.XGBoostRegressor(n_jobs=1)
pipe2 = getml.Pipeline(
tags=['featuretools'],
data_model=data_model,
preprocessors=[imputation],
predictors=[predictor],
include_categorical=True,
)
pipe2
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=True, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=['Imputation'], share_selected_features=0.5, tags=['featuretools'])
pipe2.fit(featuretools_train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 0 issues labeled INFO and 7 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:32
Trained pipeline.
Time taken: 0:00:33.622797.
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=True, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=['Imputation'], share_selected_features=0.5, tags=['featuretools'])
featuretools_score = pipe2.score(featuretools_test)
featuretools_score
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02
|  | date time | set used | target | mae | rmse | rsquared |
|---|---|---|---|---|---|---|
| 0 | 2024-09-12 12:37:42 | featuretools_train | kc_proklikano | 5024.2643 | 23362.8008 | 0.8394 |
| 1 | 2024-09-12 12:37:45 | featuretools_test | kc_proklikano | 5183.7763 | 34050.186 | 0.5751 |
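For reference, the three metrics reported above can be computed from raw targets and predictions in a few lines of NumPy, using the usual coefficient-of-determination definition of R-squared (the arrays below are hypothetical toy values, not the actual predictions):

```python
import numpy as np

# Hypothetical targets and predictions for illustration only.
y_true = np.array([100.0, 250.0, 400.0, 175.0])
y_pred = np.array([110.0, 240.0, 380.0, 190.0])

# Mean absolute error and root mean squared error.
mae = np.mean(np.abs(y_true - y_pred))
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R-squared: 1 minus the ratio of residual to total sum of squares.
rsquared = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, R-squared: {rsquared:.4f}")
```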
2.6 Features¶
The most important feature looks as follows:
pipe1.features.to_sql()[pipe1.features.sort(by="importances")[0].name]
DROP TABLE IF EXISTS "FEATURE_1_64";
CREATE TABLE "FEATURE_1_64" AS
SELECT EWMA_TREND_1H( t2."kc_proklikano", t1."month_year_datum_transakce" - t2."month_year_datum_transakce__1_000000_days" ) AS "feature_1_64",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "PROBEHNUTO__STAGING_TABLE_3" t2
ON t1."client_id" = t2."client_id"
WHERE t2."month_year_datum_transakce__1_000000_days" <= t1."month_year_datum_transakce"
AND t1."sluzba" = t2."sluzba"
GROUP BY t1.rowid;
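Judging by its name, EWMA_TREND_1H fits a trend (slope) through past values, weighting recent observations more heavily via an exponential decay with roughly a one-hour half-life. A rough pure-Python sketch of such a weighted-trend aggregation (our reading of the transpiled name, not getML's actual implementation):

```python
def ewma_trend(values, time_diffs, half_life=1.0):
    """Weighted least-squares slope of value over time difference,
    with weights halving every `half_life` units of time."""
    weights = [0.5 ** (t / half_life) for t in time_diffs]
    w_sum = sum(weights)
    t_mean = sum(w * t for w, t in zip(weights, time_diffs)) / w_sum
    v_mean = sum(w * v for w, v in zip(weights, values)) / w_sum
    cov = sum(w * (t - t_mean) * (v - v_mean)
              for w, t, v in zip(weights, time_diffs, values))
    var = sum(w * (t - t_mean) ** 2 for w, t in zip(weights, time_diffs))
    return cov / var if var > 0 else 0.0
```

For instance, values that shrink as the time difference grows (i.e. that have been increasing towards the present) yield a negative slope, while a constant series yields zero.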
2.7 Productionization¶
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3 and spark modules.
# Creates a folder named seznam_pipeline containing
# the SQL code.
pipe1.features.to_sql().save("seznam_pipeline")
pipe1.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("seznam_spark")
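The saved folder contains one .sql file per transpiled feature. As a sketch of how the SQLite dialect could be executed against a plain database with Python's standard library (the directory layout is an assumption; the helper name is ours):

```python
import sqlite3
from pathlib import Path

def run_transpiled_features(sql_dir, db_path=":memory:"):
    """Execute every .sql file in a folder (as written by
    pipe1.features.to_sql().save(...)) against a SQLite database.
    Files are applied in sorted filename order; each file may
    contain several statements, hence executescript."""
    conn = sqlite3.connect(db_path)
    for script in sorted(Path(sql_dir).glob("*.sql")):
        conn.executescript(script.read_text())
    return conn
```

Note that getML ships a dedicated sqlite3 module that covers this end to end; the sketch above only illustrates the mechanics.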
2.8 Discussion¶
For a convenient overview, we summarize our results in a table.
scores = [fastprop_score, featuretools_score]
pd.DataFrame(data={
'Name': ['getML: FastProp', 'featuretools'],
'R-squared': [f'{score.rsquared:.2%}' for score in scores],
'RMSE': [f'{score.rmse:,.0f}' for score in scores],
'MAE': [f'{score.mae:,.0f}' for score in scores]
})
|  | Name | R-squared | RMSE | MAE |
|---|---|---|---|---|
| 0 | getML: FastProp | 87.51% | 18,674 | 2,999 |
| 1 | featuretools | 57.51% | 34,050 | 5,184 |
getml.engine.shutdown()
3. Conclusion¶
We have benchmarked getML against featuretools on a dataset of online transactions. getML's FastProp outperforms featuretools by a wide margin: it reaches an out-of-sample R-squared of 87.51% versus 57.51% for featuretools, along with considerably lower RMSE and MAE.
References¶
Motl, Jan, and Oliver Schulte. "The CTU Prague Relational Learning Repository." arXiv preprint arXiv:1511.03086 (2015).