Seznam - Predicting transaction volume¶
Seznam is a Czech company whose scope is similar to Google's. The purpose of this notebook is to analyze data from Seznam's wallet and to predict the transaction volume.
Summary:
- Prediction type: Regression model
- Domain: E-commerce
- Prediction target: Transaction volume
- Population size: 1,462,078
Background¶
Since the dataset is in Czech, we will quickly translate the meaning of the main tables:
- dobito: contains data on prepayments into a wallet
- probehnuto: contains data on charges from a wallet
- probehnuto_mimo_penezenku: contains data on charges from sources other than a wallet
The dataset has been downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015), which now resides at relational-data.org.
We will benchmark getML's feature learning algorithms against featuretools, an open-source implementation of the propositionalization algorithm, similar to getML's FastProp.
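Before diving in, here is a minimal sketch of what propositionalization means: each feature for a population row is an aggregation over the matching rows of a peripheral table. The table and column names below are illustrative toy data, not part of the Seznam schema.

```python
import pandas as pd

# Toy population and peripheral tables, linked by a join key.
population = pd.DataFrame({"client_id": [1, 2]})
peripheral = pd.DataFrame({
    "client_id": [1, 1, 2],
    "amount": [10.0, 30.0, 5.0],
})

# One feature per (column, aggregation) pair, the core idea behind
# FastProp and featuretools alike.
features = (
    peripheral.groupby("client_id")["amount"]
    .agg(["count", "sum", "mean", "max"])
    .add_prefix("amount_")
    .reset_index()
)
flat = population.merge(features, on="client_id", how="left")
```

The feature learners explored below automate exactly this pattern across many columns, aggregations, and time-based conditions.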
Analysis¶
Let's get started with the analysis and set up our session:
%pip install -q "getml==1.5.0" "featuretools==1.31.0" "ipywidgets==8.1.5"
import os
import warnings
import pandas as pd
import featuretools
import woodwork as ww
import getml
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
warnings.simplefilter(action='ignore', category=FutureWarning)
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
getml.engine.launch(allow_remote_ips=True, token='token')
getml.set_project('seznam')
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912150434.log. Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Connected to project 'seznam'.
1. Loading data¶
1.1 Download from source¶
We begin by downloading the data:
conn = getml.database.connect_mysql(
    host="db.relational-data.org",
    dbname="Seznam",
    port=3306,
    user="guest",
    password="relational"
)
conn
Connection(dbname='Seznam', dialect='mysql', host='db.relational-data.org', port=3306)
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if not getml.data.exists(name):
        data_frame = getml.DataFrame.from_db(
            name=name,
            table_name=name,
            conn=conn
        )
        data_frame.save()
    else:
        data_frame = getml.data.load_data_frame(name)
    return data_frame
dobito = load_if_needed("dobito")
probehnuto = load_if_needed("probehnuto")
probehnuto_mimo_penezenku = load_if_needed("probehnuto_mimo_penezenku")
dobito
name | client_id | month_year_datum_transakce | sluzba | kc_dobito |
---|---|---|---|---|
role | unused_float | unused_string | unused_string | unused_string |
0 | 7157857 | 2012-10-01 | c | 1045.62 |
1 | 109700 | 2015-10-01 | c | 5187.28 |
2 | 51508 | 2015-08-01 | c | 408.20 |
3 | 9573550 | 2012-10-01 | c | 521.24 |
4 | 9774621 | 2014-11-01 | c | 386.22 |
... | ... | ... | ... | |
554341 | 65283 | 2012-09-01 | g | 7850.00 |
554342 | 6091446 | 2012-08-01 | g | 31400.00 |
554343 | 1264806 | 2013-08-01 | g | -8220.52 |
554344 | 101103 | 2012-08-01 | g | 3140.00 |
554345 | 8674551 | 2012-08-01 | g | 6280.00 |
554346 rows x 4 columns
memory usage: 29.59 MB
name: dobito
type: getml.DataFrame
probehnuto
name | client_id | month_year_datum_transakce | sluzba | kc_proklikano |
---|---|---|---|---|
role | unused_float | unused_string | unused_string | unused_string |
0 | 109145 | 2013-06-01 | c | -31.40 |
1 | 9804394 | 2015-10-01 | h | 37.68 |
2 | 9803353 | 2015-10-01 | h | 725.34 |
3 | 9801753 | 2015-10-01 | h | 194.68 |
4 | 9800425 | 2015-10-01 | h | 1042.48 |
... | ... | ... | ... | |
1462073 | 98857 | 2015-08-01 | NULL | 153.86 |
1462074 | 95776 | 2015-09-01 | NULL | 153.86 |
1462075 | 98857 | 2015-09-01 | NULL | 153.86 |
1462076 | 90001 | 2015-10-01 | NULL | 310.86 |
1462077 | 946957 | 2015-10-01 | NULL | 153.86 |
1462078 rows x 4 columns
memory usage: 77.07 MB
name: probehnuto
type: getml.DataFrame
probehnuto_mimo_penezenku
name | client_id | Month/Year | probehla_inzerce_mimo_penezenku |
---|---|---|---|
role | unused_float | unused_string | unused_string |
0 | 3901 | 2012-08-01 | ANO |
1 | 3901 | 2012-09-01 | ANO |
2 | 3901 | 2012-10-01 | ANO |
3 | 3901 | 2012-11-01 | ANO |
4 | 3901 | 2012-12-01 | ANO |
... | ... | ... | |
599381 | 9804086 | 2015-10-01 | ANO |
599382 | 9804238 | 2015-10-01 | ANO |
599383 | 9804782 | 2015-10-01 | ANO |
599384 | 9804810 | 2015-10-01 | ANO |
599385 | 9805032 | 2015-10-01 | ANO |
599386 rows x 3 columns
memory usage: 23.38 MB
name: probehnuto_mimo_penezenku
type: getml.DataFrame
1.2 Prepare data for getML¶
getML requires that we assign a role to each column.
dobito.set_role("client_id", getml.data.roles.join_key)
dobito.set_role("month_year_datum_transakce", getml.data.roles.time_stamp)
dobito.set_role("sluzba", getml.data.roles.categorical)
dobito.set_role("kc_dobito", getml.data.roles.numerical)
dobito.set_unit("sluzba", "service")
dobito
name | month_year_datum_transakce | client_id | sluzba | kc_dobito |
---|---|---|---|---|
role | time_stamp | join_key | categorical | numerical |
unit | time stamp, comparison only | service | ||
0 | 2012-10-01 | 7157857 | c | 1045.62 |
1 | 2015-10-01 | 109700 | c | 5187.28 |
2 | 2015-08-01 | 51508 | c | 408.2 |
3 | 2012-10-01 | 9573550 | c | 521.24 |
4 | 2014-11-01 | 9774621 | c | 386.22 |
... | ... | ... | ... | |
554341 | 2012-09-01 | 65283 | g | 7850 |
554342 | 2012-08-01 | 6091446 | g | 31400 |
554343 | 2013-08-01 | 1264806 | g | -8220.52 |
554344 | 2012-08-01 | 101103 | g | 3140 |
554345 | 2012-08-01 | 8674551 | g | 6280 |
554346 rows x 4 columns
memory usage: 13.30 MB
name: dobito
type: getml.DataFrame
probehnuto.set_role("client_id", getml.data.roles.join_key)
probehnuto.set_role("month_year_datum_transakce", getml.data.roles.time_stamp)
probehnuto.set_role("sluzba", getml.data.roles.categorical)
probehnuto.set_role("kc_proklikano", getml.data.roles.target)
probehnuto.set_unit("sluzba", "service")
probehnuto
name | month_year_datum_transakce | client_id | kc_proklikano | sluzba |
---|---|---|---|---|
role | time_stamp | join_key | target | categorical |
unit | time stamp, comparison only | service | ||
0 | 2013-06-01 | 109145 | -31.4 | c |
1 | 2015-10-01 | 9804394 | 37.68 | h |
2 | 2015-10-01 | 9803353 | 725.34 | h |
3 | 2015-10-01 | 9801753 | 194.68 | h |
4 | 2015-10-01 | 9800425 | 1042.48 | h |
... | ... | ... | ... | |
1462073 | 2015-08-01 | 98857 | 153.86 | NULL |
1462074 | 2015-09-01 | 95776 | 153.86 | NULL |
1462075 | 2015-09-01 | 98857 | 153.86 | NULL |
1462076 | 2015-10-01 | 90001 | 310.86 | NULL |
1462077 | 2015-10-01 | 946957 | 153.86 | NULL |
1462078 rows x 4 columns
memory usage: 35.09 MB
name: probehnuto
type: getml.DataFrame
probehnuto_mimo_penezenku.set_role("client_id", getml.data.roles.join_key)
probehnuto_mimo_penezenku.set_role("Month/Year", getml.data.roles.time_stamp)
probehnuto_mimo_penezenku
name | Month/Year | client_id | probehla_inzerce_mimo_penezenku |
---|---|---|---|
role | time_stamp | join_key | unused_string |
unit | time stamp, comparison only | ||
0 | 2012-08-01 | 3901 | ANO |
1 | 2012-09-01 | 3901 | ANO |
2 | 2012-10-01 | 3901 | ANO |
3 | 2012-11-01 | 3901 | ANO |
4 | 2012-12-01 | 3901 | ANO |
... | ... | ... | |
599381 | 2015-10-01 | 9804086 | ANO |
599382 | 2015-10-01 | 9804238 | ANO |
599383 | 2015-10-01 | 9804782 | ANO |
599384 | 2015-10-01 | 9804810 | ANO |
599385 | 2015-10-01 | 9805032 | ANO |
599386 rows x 3 columns
memory usage: 14.39 MB
name: probehnuto_mimo_penezenku
type: getml.DataFrame
split = getml.data.split.random(train=0.8, test=0.2)
split
0 | train |
---|---|
1 | train |
2 | train |
3 | test |
4 | train |
... |
infinite number of rows
type: StringColumnView
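The split is a lazily evaluated column view: each row drawn from it is tagged "train" with probability 0.8 and "test" with probability 0.2. A hedged NumPy illustration of the same idea (not getML's actual implementation):

```python
import numpy as np

# Tag a million rows "train"/"test" with an 80/20 random split,
# mirroring getml.data.split.random(train=0.8, test=0.2).
rng = np.random.default_rng(0)
labels = np.where(rng.random(1_000_000) < 0.8, "train", "test")
share_train = (labels == "train").mean()  # close to 0.8
```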
2. Predictive modeling¶
We loaded the data and defined the roles and units. Next, we create a getML pipeline for relational learning.
2.1 Define relational model¶
star_schema = getml.data.StarSchema(population=probehnuto, alias="population", split=split)
star_schema.join(
    probehnuto,
    on="client_id",
    time_stamps="month_year_datum_transakce",
    lagged_targets=True,
    horizon=getml.data.time.days(1),
)
star_schema.join(
    dobito,
    on="client_id",
    time_stamps="month_year_datum_transakce",
)
star_schema.join(
    probehnuto_mimo_penezenku,
    on="client_id",
    time_stamps=("month_year_datum_transakce", "Month/Year"),
)
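To make the time-stamp conditions concrete, here is a hedged pandas sketch of what such a join lets a feature see: only peripheral rows whose time stamp, shifted by the horizon, does not exceed the population row's time stamp. With the one-day horizon on the self-join, the target value from the population row's own month is excluded, which prevents target leakage. Column names are illustrative.

```python
import pandas as pd

population = pd.DataFrame({
    "client_id": [1],
    "ts": pd.to_datetime(["2015-10-01"]),
})
peripheral = pd.DataFrame({
    "client_id": [1, 1, 1],
    "ts": pd.to_datetime(["2015-08-01", "2015-10-01", "2015-12-01"]),
    "kc_proklikano": [100.0, 200.0, 300.0],
})

horizon = pd.Timedelta(days=1)
joined = population.merge(peripheral, on="client_id", suffixes=("", "_peri"))
# Only strictly earlier months survive: the 2015-10-01 row (the target's
# own month) and the future 2015-12-01 row are both filtered out.
visible = joined[joined["ts_peri"] + horizon <= joined["ts"]]
```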
star_schema
data frames | staging table | |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | dobito | DOBITO__STAGING_TABLE_2 |
2 | probehnuto | PROBEHNUTO__STAGING_TABLE_3 |
3 | probehnuto_mimo_penezenku | PROBEHNUTO_MIMO_PENEZENKU__STAGING_TABLE_4 |
subset | name | rows | type | |
---|---|---|---|---|
0 | test | probehnuto | 292833 | View |
1 | train | probehnuto | 1169245 | View |
name | rows | type | |
---|---|---|---|
0 | probehnuto | 1462078 | DataFrame |
1 | dobito | 554346 | DataFrame |
2 | probehnuto_mimo_penezenku | 599386 | DataFrame |
2.2 getML pipeline¶
Set up the feature learner & predictor
mapping = getml.preprocessors.Mapping()
fast_prop = getml.feature_learning.FastProp(
    aggregation=getml.feature_learning.FastProp.agg_sets.All,
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    num_threads=1,
    sampling_factor=0.1,
)
feature_selector = getml.predictors.XGBoostRegressor(n_jobs=1, external_memory=True)
predictor = getml.predictors.XGBoostRegressor(n_jobs=1)
Build the pipeline
pipe1 = getml.Pipeline(
    tags=['fast_prop'],
    data_model=star_schema.data_model,
    preprocessors=[mapping],
    feature_learners=[fast_prop],
    feature_selectors=[feature_selector],
    predictors=[predictor],
    include_categorical=True,
)
pipe1
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=['XGBoostRegressor'], include_categorical=True, loss_function='SquareLoss', peripheral=['dobito', 'probehnuto', 'probehnuto_mimo_penezenku'], predictors=['XGBoostRegressor'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['fast_prop'])
2.3 Model training¶
pipe1.check(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:20 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02
The pipeline check generated 2 issues labeled INFO and 0 issues labeled WARNING.
type | label | message | |
---|---|---|---|
0 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and DOBITO__STAGING_TABLE_2 over 'client_id' and 'client_id', there are no corresponding entries for 2.228789% of entries in 'client_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
1 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and PROBEHNUTO_MIMO_PENEZENKU__STAGING_TABLE_4 over 'client_id' and 'client_id', there are no corresponding entries for 26.543966% of entries in 'client_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
pipe1.fit(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 2 issues labeled INFO and 0 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Trying 909 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 01:13 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 02:31 XGBoost: Training as feature selector... ━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 20:02 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 09:24
Trained pipeline.
Time taken: 0:33:12.643341.
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=['XGBoostRegressor'], include_categorical=True, loss_function='SquareLoss', peripheral=['dobito', 'probehnuto', 'probehnuto_mimo_penezenku'], predictors=['XGBoostRegressor'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['fast_prop', 'container-AeQKVm'])
2.4 Model evaluation¶
fastprop_score = pipe1.score(star_schema.test)
fastprop_score
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:19
date time | set used | target | mae | rmse | rsquared | |
---|---|---|---|---|---|---|
0 | 2024-09-12 12:26:41 | train | kc_proklikano | 2940.4502 | 14384.5507 | 0.9423 |
1 | 2024-09-12 12:27:01 | test | kc_proklikano | 2998.9588 | 18673.8813 | 0.8751 |
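For reference, the three scores in the table can be reproduced from predictions as follows. This is a generic NumPy sketch on toy numbers, with rsquared taken as the coefficient of determination; getML's exact definition may differ slightly.

```python
import numpy as np

# Toy targets and predictions, standing in for kc_proklikano.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 310.0])

mae = np.mean(np.abs(y_true - y_pred))            # mean absolute error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))   # root mean squared error
rsquared = 1.0 - (
    np.sum((y_true - y_pred) ** 2)
    / np.sum((y_true - y_true.mean()) ** 2)
)
```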
2.5 featuretools¶
To keep featuretools' runtime manageable, we benchmark it on a random 25% subsample of the training set; the test set is used in full.
include = (getml.data.random() < 0.25)
include
0 | true |
---|---|
1 | false |
2 | true |
3 | false |
4 | false |
... |
infinite number of rows
type: BooleanColumnView
population_train_pd = star_schema.train.population[include].to_pandas()
population_test_pd = star_schema.test.population.to_pandas()
population_train_pd["id"] = population_train_pd.index
population_test_pd["id"] = population_test_pd.index
probehnuto_pd = probehnuto.drop(probehnuto.roles.unused).to_pandas()
dobito_pd = dobito.drop(dobito.roles.unused).to_pandas()
probehnuto_mimo_penezenku_pd = probehnuto_mimo_penezenku.drop(probehnuto_mimo_penezenku.roles.unused).to_pandas()
def prepare_peripheral(peripheral_pd, train_or_test):
    """
    Helper function that imitates the behavior of
    the data model defined above.
    """
    peripheral_new = peripheral_pd.merge(
        train_or_test[["id", "client_id", "month_year_datum_transakce"]],
        on="client_id"
    )
    # Keep only peripheral rows that lie strictly before the population
    # row's time stamp. The peripheral time stamp keeps pandas' "_x"
    # suffix; the woodwork schemas below refer to it by that name.
    peripheral_new = peripheral_new[
        peripheral_new["month_year_datum_transakce_x"] < peripheral_new["month_year_datum_transakce_y"]
    ]
    del peripheral_new["month_year_datum_transakce_y"]
    del peripheral_new["client_id"]
    return peripheral_new
def prepare_probehnuto_mimo_penezenku(peripheral_pd, train_or_test):
    """
    Helper function that imitates the behavior of
    the data model defined above.
    """
    peripheral_new = peripheral_pd.merge(
        train_or_test[["id", "client_id", "month_year_datum_transakce"]],
        on="client_id"
    )
    peripheral_new = peripheral_new[
        peripheral_new["Month/Year"] < peripheral_new["month_year_datum_transakce"]
    ]
    del peripheral_new["month_year_datum_transakce"]
    del peripheral_new["client_id"]
    return peripheral_new
dobito_train_pd = prepare_peripheral(dobito_pd, population_train_pd)
dobito_test_pd = prepare_peripheral(dobito_pd, population_test_pd)
dobito_train_pd
sluzba | kc_dobito | month_year_datum_transakce_x | id | |
---|---|---|---|---|
0 | c | 1045.62 | 2012-10-01 | 2127 |
1 | c | 1045.62 | 2012-10-01 | 17709 |
2 | c | 1045.62 | 2012-10-01 | 50363 |
14 | c | 408.20 | 2015-08-01 | 152319 |
15 | c | 521.24 | 2012-10-01 | 153913 |
... | ... | ... | ... | ... |
4462027 | g | 6280.00 | 2012-08-01 | 92370 |
4462028 | g | 6280.00 | 2012-08-01 | 140842 |
4462029 | g | 6280.00 | 2012-08-01 | 146070 |
4462030 | g | 6280.00 | 2012-08-01 | 175024 |
4462031 | g | 6280.00 | 2012-08-01 | 253772 |
2240543 rows × 4 columns
probehnuto_train_pd = prepare_peripheral(probehnuto_pd, population_train_pd)
probehnuto_test_pd = prepare_peripheral(probehnuto_pd, population_test_pd)
probehnuto_train_pd
sluzba | kc_proklikano | month_year_datum_transakce_x | id | |
---|---|---|---|---|
1 | c | -31.40 | 2013-06-01 | 281262 |
4 | c | -31.40 | 2013-06-01 | 288356 |
6 | c | -31.40 | 2013-06-01 | 289265 |
7 | c | -31.40 | 2013-06-01 | 289267 |
10 | c | -31.40 | 2013-06-01 | 290759 |
... | ... | ... | ... | ... |
11186627 | None | 13545.96 | 2015-06-01 | 175888 |
11186634 | None | 13545.96 | 2015-06-01 | 272451 |
11186644 | None | 13545.96 | 2015-06-01 | 284406 |
11186660 | None | 153.86 | 2015-07-01 | 286198 |
11186663 | None | 153.86 | 2015-07-01 | 284454 |
5388870 rows × 4 columns
probehnuto_mimo_penezenku_train_pd = prepare_probehnuto_mimo_penezenku(probehnuto_mimo_penezenku_pd, population_train_pd)
probehnuto_mimo_penezenku_test_pd = prepare_probehnuto_mimo_penezenku(probehnuto_mimo_penezenku_pd, population_test_pd)
probehnuto_mimo_penezenku_train_pd
Month/Year | id | |
---|---|---|
0 | 2012-08-01 | 269301 |
8 | 2012-08-01 | 9204 |
9 | 2012-08-01 | 23838 |
10 | 2012-08-01 | 24471 |
11 | 2012-08-01 | 45868 |
... | ... | ... |
3568048 | 2015-09-01 | 160015 |
3568050 | 2015-09-01 | 19 |
3568051 | 2015-09-01 | 1565 |
3568053 | 2015-09-01 | 151283 |
3568060 | 2015-09-01 | 158546 |
2832768 rows × 2 columns
del population_train_pd["client_id"]
del population_test_pd["client_id"]
population_train_pd
sluzba | kc_proklikano | month_year_datum_transakce | id | |
---|---|---|---|---|
0 | c | -31.40 | 2013-06-01 | 0 |
1 | h | 725.34 | 2015-10-01 | 1 |
2 | h | 8550.22 | 2015-10-01 | 2 |
3 | h | 2408.38 | 2015-10-01 | 3 |
4 | h | 1893.42 | 2015-10-01 | 4 |
... | ... | ... | ... | ... |
292153 | None | 153.86 | 2015-03-01 | 292153 |
292154 | None | 153.86 | 2015-05-01 | 292154 |
292155 | None | 13545.96 | 2015-06-01 | 292155 |
292156 | None | 153.86 | 2015-06-01 | 292156 |
292157 | None | 153.86 | 2015-08-01 | 292157 |
292158 rows × 4 columns
def add_index(df):
    df.insert(0, "index", range(len(df)))
population_pd_logical_types = {
    'id': ww.logical_types.Integer,
    'sluzba': ww.logical_types.Categorical,
    'kc_proklikano': ww.logical_types.Double,
    'month_year_datum_transakce': ww.logical_types.Datetime
}
population_train_pd.ww.init(logical_types=population_pd_logical_types, index='id', name='population')
population_test_pd.ww.init(logical_types=population_pd_logical_types, index='id', name='population')
add_index(dobito_train_pd)
add_index(dobito_test_pd)
dobito_pd_logical_types = {
    'index': ww.logical_types.Integer,
    'sluzba': ww.logical_types.Categorical,
    'kc_dobito': ww.logical_types.Double,
    'month_year_datum_transakce_x': ww.logical_types.Datetime,
    'id': ww.logical_types.Integer
}
dobito_train_pd.ww.init(logical_types=dobito_pd_logical_types, index='index', name='dobito')
dobito_test_pd.ww.init(logical_types=dobito_pd_logical_types, index='index', name='dobito')
add_index(probehnuto_train_pd)
add_index(probehnuto_test_pd)
probehnuto_pd_logical_types = {
    'index': ww.logical_types.Integer,
    'sluzba': ww.logical_types.Categorical,
    'kc_proklikano': ww.logical_types.Double,
    'month_year_datum_transakce_x': ww.logical_types.Datetime,
    'id': ww.logical_types.Integer
}
probehnuto_train_pd.ww.init(logical_types=probehnuto_pd_logical_types, index='index', name='probehnuto')
probehnuto_test_pd.ww.init(logical_types=probehnuto_pd_logical_types, index='index', name='probehnuto')
add_index(probehnuto_mimo_penezenku_train_pd)
add_index(probehnuto_mimo_penezenku_test_pd)
probehnuto_mimo_penezenku_pd_logical_types = {
    'index': ww.logical_types.Integer,
    'Month/Year': ww.logical_types.Datetime,
    'id': ww.logical_types.Integer
}
probehnuto_mimo_penezenku_train_pd.ww.init(logical_types=probehnuto_mimo_penezenku_pd_logical_types, index='index', name='probehnuto_mimo_penezenku')
probehnuto_mimo_penezenku_test_pd.ww.init(logical_types=probehnuto_mimo_penezenku_pd_logical_types, index='index', name='probehnuto_mimo_penezenku')
dataframes_train = {
    "population": (population_train_pd, ),
    "dobito": (dobito_train_pd, ),
    "probehnuto": (probehnuto_train_pd, ),
    "probehnuto_mimo_penezenku": (probehnuto_mimo_penezenku_train_pd, ),
}
dataframes_test = {
    "population": (population_test_pd, ),
    "dobito": (dobito_test_pd, ),
    "probehnuto": (probehnuto_test_pd, ),
    "probehnuto_mimo_penezenku": (probehnuto_mimo_penezenku_test_pd, ),
}
relationships = [
    ("population", "id", "dobito", "id"),
    ("population", "id", "probehnuto", "id"),
    ("population", "id", "probehnuto_mimo_penezenku", "id"),
]
featuretools_train_pd = featuretools.dfs(
    dataframes=dataframes_train,
    relationships=relationships,
    target_dataframe_name="population")[0]
featuretools_test_pd = featuretools.dfs(
    dataframes=dataframes_test,
    relationships=relationships,
    target_dataframe_name="population")[0]
featuretools_train = getml.data.DataFrame.from_pandas(featuretools_train_pd, "featuretools_train")
featuretools_test = getml.data.DataFrame.from_pandas(featuretools_test_pd, "featuretools_test")
featuretools_train.set_role("kc_proklikano", getml.data.roles.target)
featuretools_train.set_role(featuretools_train.roles.unused_float, getml.data.roles.numerical)
featuretools_train.set_role(featuretools_train.roles.unused_string, getml.data.roles.categorical)
featuretools_train
name | kc_proklikano | sluzba | COUNT(dobito) | MODE(dobito.sluzba) | NUM_UNIQUE(dobito.sluzba) | COUNT(probehnuto) | MODE(probehnuto.sluzba) | NUM_UNIQUE(probehnuto.sluzba) | COUNT(probehnuto_mimo_penezenku) | DAY(month_year_datum_transakce) | MONTH(month_year_datum_transakce) | WEEKDAY(month_year_datum_transakce) | YEAR(month_year_datum_transakce) | MODE(dobito.DAY(month_year_datum_transakce_x)) | MODE(dobito.MONTH(month_year_datum_transakce_x)) | MODE(dobito.WEEKDAY(month_year_datum_transakce_x)) | MODE(dobito.YEAR(month_year_datum_transakce_x)) | NUM_UNIQUE(dobito.DAY(month_year_datum_transakce_x)) | NUM_UNIQUE(dobito.MONTH(month_year_datum_transakce_x)) | NUM_UNIQUE(dobito.WEEKDAY(month_year_datum_transakce_x)) | NUM_UNIQUE(dobito.YEAR(month_year_datum_transakce_x)) | MODE(probehnuto.DAY(month_year_datum_transakce_x)) | MODE(probehnuto.MONTH(month_year_datum_transakce_x)) | MODE(probehnuto.WEEKDAY(month_year_datum_transakce_x)) | MODE(probehnuto.YEAR(month_year_datum_transakce_x)) | NUM_UNIQUE(probehnuto.DAY(month_year_datum_transakce_x)) | NUM_UNIQUE(probehnuto.MONTH(month_year_datum_transakce_x)) | NUM_UNIQUE(probehnuto.WEEKDAY(month_year_datum_transakce_x)) | NUM_UNIQUE(probehnuto.YEAR(month_year_datum_transakce_x)) | MODE(probehnuto_mimo_penezenku.DAY(Month/Year)) | MODE(probehnuto_mimo_penezenku.MONTH(Month/Year)) | MODE(probehnuto_mimo_penezenku.WEEKDAY(Month/Year)) | MODE(probehnuto_mimo_penezenku.YEAR(Month/Year)) | NUM_UNIQUE(probehnuto_mimo_penezenku.DAY(Month/Year)) | NUM_UNIQUE(probehnuto_mimo_penezenku.MONTH(Month/Year)) | NUM_UNIQUE(probehnuto_mimo_penezenku.WEEKDAY(Month/Year)) | NUM_UNIQUE(probehnuto_mimo_penezenku.YEAR(Month/Year)) | MAX(dobito.kc_dobito) | MEAN(dobito.kc_dobito) | MIN(dobito.kc_dobito) | SKEW(dobito.kc_dobito) | STD(dobito.kc_dobito) | SUM(dobito.kc_dobito) | MAX(probehnuto.kc_proklikano) | MEAN(probehnuto.kc_proklikano) | MIN(probehnuto.kc_proklikano) | SKEW(probehnuto.kc_proklikano) | STD(probehnuto.kc_proklikano) | SUM(probehnuto.kc_proklikano) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | target | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical |
0 | -31.4 | c | 1 | c | 1 | 13 | d | 1 | 0 | 1 | 6 | 5 | 2013 | 1 | 12 | 5 | 2012 | 1 | 1 | 1 | 1 | 1 | 8 | 0 | 2012 | 1 | 10 | 6 | 2 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1306.24 | 1306.24 | 1306.24 | nan | nan | 1306.24 | 351.68 | 155.7923 | 9.42 | 0.5817 | 79.3799 | 2025.3 |
1 | 725.34 | h | 4 | h | 1 | 5 | h | 1 | 0 | 1 | 10 | 3 | 2015 | 1 | 5 | 0 | 2015 | 1 | 4 | 4 | 1 | 1 | 5 | 0 | 2015 | 1 | 5 | 5 | 1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1036.2 | 614.655 | 257.48 | 0.5563 | 324.3624 | 2458.62 | 634.28 | 388.732 | 131.88 | 0.09478 | 205.8605 | 1943.66 |
2 | 8550.22 | h | 7 | h | 2 | 11 | h | 1 | 0 | 1 | 10 | 3 | 2015 | 1 | 8 | 2 | 2015 | 1 | 6 | 5 | 2 | 1 | 1 | 2 | 2015 | 1 | 11 | 7 | 2 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 62800 | 20907.9143 | 0 | 1.214 | 22375.0788 | 146355.4 | 39752.4 | 13280.7727 | 3215.36 | 1.4845 | 12240.1205 | 146088.5 |
3 | 2408.38 | h | 4 | h | 1 | 5 | h | 1 | 0 | 1 | 10 | 3 | 2015 | 1 | 5 | 1 | 2015 | 1 | 4 | 4 | 1 | 1 | 5 | 0 | 2015 | 1 | 5 | 5 | 1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1554.3 | 1361.19 | 1296.82 | 2 | 128.74 | 5444.76 | 1635.94 | 1092.72 | 15.7 | -1.0846 | 711.3296 | 5463.6 |
4 | 1893.42 | h | 12 | h | 4 | 22 | d | 3 | 0 | 1 | 10 | 3 | 2015 | 1 | 2 | 4 | 2015 | 1 | 5 | 6 | 2 | 1 | 2 | 2 | 2015 | 1 | 9 | 6 | 2 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2615.62 | 1078.3283 | 310.86 | 0.8429 | 835.3628 | 12939.94 | 1510.34 | 471 | 0 | 0.8615 | 531.1884 | 10362 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
292153 | 153.86 | NULL | 12 | c | 1 | 34 | d | 1 | 0 | 1 | 3 | 6 | 2015 | 1 | 1 | 0 | 2013 | 1 | 8 | 6 | 3 | 1 | 1 | 5 | 2013 | 1 | 12 | 7 | 4 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 518.1 | 337.8117 | 153.86 | 0.02976 | 99.6511 | 4053.74 | 188.4 | 141.9465 | 0 | -2.9319 | 44.2498 | 4826.18 |
292154 | 153.86 | NULL | 6 | f | 1 | 34 | f | 1 | 0 | 1 | 5 | 4 | 2015 | 1 | 2 | 1 | 2013 | 1 | 6 | 4 | 3 | 1 | 1 | 5 | 2013 | 1 | 12 | 7 | 4 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1296.82 | 913.74 | 31.4 | -2.1286 | 444.4448 | 5482.44 | 188.4 | 150.8124 | -31.4 | -5.2155 | 33.226 | 5127.62 |
292155 | 13545.96 | NULL | 214 | c | 8 | 283 | d | 8 | 0 | 1 | 6 | 0 | 2015 | 1 | 1 | 5 | 2014 | 1 | 12 | 7 | 4 | 1 | 1 | 5 | 2014 | 1 | 12 | 7 | 4 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 7131326 | 794524.2239 | -11422331 | -2.0115 | 1432748.8604 | 170028183.91 | 6965622.14 | 640600.4428 | -866.64 | 2.2138 | 1006527.2489 | 181289925.3 |
292156 | 153.86 | NULL | 14 | c | 1 | 0 | NULL | NULL | 0 | 1 | 6 | 0 | 2015 | 1 | 4 | 6 | 2014 | 1 | 12 | 7 | 2 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 518.1 | 179.8771 | 153.86 | 3.7417 | 97.3472 | 2518.28 | nan | nan | nan | nan | nan | 0 |
292157 | 153.86 | NULL | 8 | c | 1 | 36 | NULL | 0 | 0 | 1 | 8 | 5 | 2015 | 1 | 8 | 1 | 2013 | 1 | 7 | 5 | 4 | 1 | 1 | 0 | 2013 | 1 | 12 | 7 | 4 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 942 | 764.1975 | 518.1 | -1.1545 | 115.2358 | 6113.58 | 188.4 | 155.1683 | 153.86 | 5.7312 | 5.7838 | 5586.06 |
292158 rows x 49 columns
memory usage: 72.46 MB
name: featuretools_train
type: getml.DataFrame
featuretools_test.set_role("kc_proklikano", getml.data.roles.target)
featuretools_test.set_role(featuretools_test.roles.unused_float, getml.data.roles.numerical)
featuretools_test.set_role(featuretools_test.roles.unused_string, getml.data.roles.categorical)
featuretools_test
name | kc_proklikano | sluzba | COUNT(dobito) | MODE(dobito.sluzba) | NUM_UNIQUE(dobito.sluzba) | COUNT(probehnuto) | MODE(probehnuto.sluzba) | NUM_UNIQUE(probehnuto.sluzba) | COUNT(probehnuto_mimo_penezenku) | DAY(month_year_datum_transakce) | MONTH(month_year_datum_transakce) | WEEKDAY(month_year_datum_transakce) | YEAR(month_year_datum_transakce) | MODE(dobito.DAY(month_year_datum_transakce_x)) | MODE(dobito.MONTH(month_year_datum_transakce_x)) | MODE(dobito.WEEKDAY(month_year_datum_transakce_x)) | MODE(dobito.YEAR(month_year_datum_transakce_x)) | NUM_UNIQUE(dobito.DAY(month_year_datum_transakce_x)) | NUM_UNIQUE(dobito.MONTH(month_year_datum_transakce_x)) | NUM_UNIQUE(dobito.WEEKDAY(month_year_datum_transakce_x)) | NUM_UNIQUE(dobito.YEAR(month_year_datum_transakce_x)) | MODE(probehnuto.DAY(month_year_datum_transakce_x)) | MODE(probehnuto.MONTH(month_year_datum_transakce_x)) | MODE(probehnuto.WEEKDAY(month_year_datum_transakce_x)) | MODE(probehnuto.YEAR(month_year_datum_transakce_x)) | NUM_UNIQUE(probehnuto.DAY(month_year_datum_transakce_x)) | NUM_UNIQUE(probehnuto.MONTH(month_year_datum_transakce_x)) | NUM_UNIQUE(probehnuto.WEEKDAY(month_year_datum_transakce_x)) | NUM_UNIQUE(probehnuto.YEAR(month_year_datum_transakce_x)) | MODE(probehnuto_mimo_penezenku.DAY(Month/Year)) | MODE(probehnuto_mimo_penezenku.MONTH(Month/Year)) | MODE(probehnuto_mimo_penezenku.WEEKDAY(Month/Year)) | MODE(probehnuto_mimo_penezenku.YEAR(Month/Year)) | NUM_UNIQUE(probehnuto_mimo_penezenku.DAY(Month/Year)) | NUM_UNIQUE(probehnuto_mimo_penezenku.MONTH(Month/Year)) | NUM_UNIQUE(probehnuto_mimo_penezenku.WEEKDAY(Month/Year)) | NUM_UNIQUE(probehnuto_mimo_penezenku.YEAR(Month/Year)) | MAX(dobito.kc_dobito) | MEAN(dobito.kc_dobito) | MIN(dobito.kc_dobito) | SKEW(dobito.kc_dobito) | STD(dobito.kc_dobito) | SUM(dobito.kc_dobito) | MAX(probehnuto.kc_proklikano) | MEAN(probehnuto.kc_proklikano) | MIN(probehnuto.kc_proklikano) | SKEW(probehnuto.kc_proklikano) | STD(probehnuto.kc_proklikano) | SUM(probehnuto.kc_proklikano) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | target | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical |
0 | 194.68 | h | 2 | d | 2 | 2 | d | 2 | 0 | 1 | 10 | 3 | 2015 | 1 | 9 | 1 | 2015 | 1 | 1 | 1 | 1 | 1 | 9 | 1 | 2015 | 1 | 1 | 1 | 1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1296.82 | 777.15 | 257.48 | nan | 734.9244 | 1554.3 | 763.02 | 401.92 | 40.82 | nan | 510.6725 | 803.84 |
1 | 405.06 | h | 1 | h | 1 | 2 | h | 1 | 0 | 1 | 10 | 3 | 2015 | 1 | 8 | 5 | 2015 | 1 | 1 | 1 | 1 | 1 | 8 | 1 | 2015 | 1 | 2 | 2 | 1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1296.82 | 1296.82 | 1296.82 | nan | nan | 1296.82 | 565.2 | 452.16 | 339.12 | nan | 159.8627 | 904.32 |
2 | 580.9 | h | 4 | d | 2 | 5 | d | 2 | 0 | 1 | 10 | 3 | 2015 | 1 | 9 | 1 | 2015 | 1 | 3 | 3 | 1 | 1 | 9 | 1 | 2015 | 1 | 4 | 4 | 1 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1296.82 | 1231.665 | 1036.2 | -2. | 130.31 | 4926.66 | 913.74 | 454.044 | 34.54 | 0.2893 | 328.7162 | 2270.22 |
3 | 106.76 | h | 0 | NULL | NULL | 0 | NULL | NULL | 0 | 1 | 10 | 3 | 2015 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | nan | nan | nan | nan | nan | 0 | nan | nan | nan | nan | nan | 0 |
4 | 1927.96 | h | 15 | d | 2 | 21 | d | 2 | 0 | 1 | 10 | 3 | 2015 | 1 | 9 | 0 | 2015 | 1 | 10 | 6 | 2 | 1 | 9 | 0 | 2015 | 1 | 12 | 7 | 2 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 7784.06 | 1850.088 | 257.48 | 2.4789 | 1898.9207 | 27751.32 | 5199.84 | 1148.1933 | 25.12 | 1.8651 | 1342.4638 | 24112.06 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
292828 | 153.86 | NULL | 5 | c | 1 | 36 | d | 2 | 0 | 1 | 4 | 2 | 2015 | 1 | 12 | 5 | 2013 | 1 | 4 | 4 | 3 | 1 | 8 | 5 | 2013 | 1 | 12 | 7 | 4 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1306.24 | 1045.62 | 31.4 | -2.2358 | 566.9809 | 5228.1 | 351.68 | 150.3711 | -31.4 | -0.2998 | 56.2491 | 5413.36 |
292829 | 153.86 | NULL | 3 | c | 1 | 35 | c | 1 | 0 | 1 | 6 | 0 | 2015 | 1 | 4 | 1 | 2012 | 1 | 3 | 3 | 3 | 1 | 1 | 5 | 2013 | 1 | 12 | 7 | 4 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 2615.62 | 1757.3533 | 62.8 | -1.7316 | 1467.5674 | 5272.06 | 188.4 | 150.0023 | -62.8 | -5.4539 | 37.9032 | 5250.08 |
292830 | 153.86 | NULL | 6 | f | 1 | 35 | NULL | 0 | 0 | 1 | 7 | 2 | 2015 | 1 | 3 | 5 | 2014 | 1 | 4 | 3 | 4 | 1 | 1 | 0 | 2013 | 1 | 12 | 7 | 4 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1884 | 831.5767 | 518.1 | 2.1524 | 530.786 | 4989.46 | 188.4 | 155.2057 | 153.86 | 5.6511 | 5.8638 | 5432.2 |
292831 | 310.86 | NULL | 3 | c | 2 | 38 | NULL | 0 | 0 | 1 | 10 | 3 | 2015 | 1 | 8 | 2 | 2012 | 1 | 2 | 3 | 3 | 1 | 8 | 5 | 2013 | 1 | 12 | 7 | 4 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 5024 | 4442.0533 | 4151.08 | 1.7321 | 503.9806 | 13326.16 | 376.8 | 312.9258 | 310.86 | 6.0854 | 10.6864 | 11891.18 |
292832 | 153.86 | NULL | 4 | c | 1 | 35 | NULL | 0 | 0 | 1 | 10 | 3 | 2015 | 1 | 1 | 1 | 2013 | 1 | 4 | 3 | 3 | 1 | 1 | 5 | 2013 | 1 | 12 | 7 | 4 | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | 1868.3 | 1415.355 | 628 | -0.9865 | 589.9991 | 5661.42 | 157 | 154.0394 | 153.86 | 3.9889 | 0.7395 | 5391.38 |
292833 rows x 49 columns
memory usage: 72.62 MB
name: featuretools_test
type: getml.DataFrame
We train an untuned XGBoostRegressor on top of featuretools' features, just like we did for getML's features.
Since some of featuretools' features are categorical, we allow the pipeline to include them as well. Other features contain NaN values, which is why we also apply getML's Imputation preprocessor.
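Conceptually, this kind of imputation replaces a missing value with the column mean and adds a dummy column marking which rows were imputed, so the model can still distinguish originally missing values. A minimal pandas sketch (the column name and values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Illustrative data: a numerical feature with missing values.
df = pd.DataFrame({"mean_amount": [1296.82, np.nan, 565.2, np.nan]})

# Record which rows were missing before filling them in,
# so the imputation itself remains visible to the model.
df["mean_amount_imputed"] = df["mean_amount"].isna().astype(int)
df["mean_amount"] = df["mean_amount"].fillna(df["mean_amount"].mean())

print(df)
```

This is only a sketch of the general technique; getML's Imputation preprocessor applies the equivalent transformation automatically to all numerical columns inside the pipeline.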
data_model = getml.data.DataModel("population")
imputation = getml.preprocessors.Imputation()
predictor = getml.predictors.XGBoostRegressor(n_jobs=1)
pipe2 = getml.Pipeline(
tags=['featuretools'],
data_model=data_model,
preprocessors=[imputation],
predictors=[predictor],
include_categorical=True,
)
pipe2
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=True, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=['Imputation'], share_selected_features=0.5, tags=['featuretools'])
pipe2.fit(featuretools_train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 0 issues labeled INFO and 7 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:32
Trained pipeline.
Time taken: 0:00:33.622797.
Pipeline(data_model='population', feature_learners=[], feature_selectors=[], include_categorical=True, loss_function='SquareLoss', peripheral=[], predictors=['XGBoostRegressor'], preprocessors=['Imputation'], share_selected_features=0.5, tags=['featuretools'])
featuretools_score = pipe2.score(featuretools_test)
featuretools_score
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02
|  | date time | set used | target | mae | rmse | rsquared |
|---|---|---|---|---|---|---|
| 0 | 2024-09-12 12:37:42 | featuretools_train | kc_proklikano | 5024.2643 | 23362.8008 | 0.8394 |
| 1 | 2024-09-12 12:37:45 | featuretools_test | kc_proklikano | 5183.7763 | 34050.186 | 0.5751 |
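For reference, the three metrics reported above can be computed from raw targets and predictions in a few lines of NumPy, using the usual coefficient-of-determination definition of R-squared (the arrays below are hypothetical toy values, not the actual predictions):

```python
import numpy as np

# Hypothetical targets and predictions for illustration only.
y_true = np.array([100.0, 250.0, 400.0, 175.0])
y_pred = np.array([110.0, 240.0, 380.0, 190.0])

# Mean absolute error and root mean squared error.
mae = np.mean(np.abs(y_true - y_pred))
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R-squared: 1 minus the ratio of residual to total sum of squares.
rsquared = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, R-squared: {rsquared:.4f}")
```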
2.6 Features¶
The most important feature looks as follows:
pipe1.features.to_sql()[pipe1.features.sort(by="importances")[0].name]
DROP TABLE IF EXISTS "FEATURE_1_64";
CREATE TABLE "FEATURE_1_64" AS
SELECT EWMA_TREND_1H( t2."kc_proklikano", t1."month_year_datum_transakce" - t2."month_year_datum_transakce__1_000000_days" ) AS "feature_1_64",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "PROBEHNUTO__STAGING_TABLE_3" t2
ON t1."client_id" = t2."client_id"
WHERE t2."month_year_datum_transakce__1_000000_days" <= t1."month_year_datum_transakce"
AND t1."sluzba" = t2."sluzba"
GROUP BY t1.rowid;
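Judging by its name, EWMA_TREND_1H fits a trend (slope) through past values, weighting recent observations more heavily via an exponential decay with roughly a one-hour half-life. A rough pure-Python sketch of such a weighted-trend aggregation (our reading of the transpiled name, not getML's actual implementation):

```python
def ewma_trend(values, time_diffs, half_life=1.0):
    """Weighted least-squares slope of value over time difference,
    with weights halving every `half_life` units of time."""
    weights = [0.5 ** (t / half_life) for t in time_diffs]
    w_sum = sum(weights)
    t_mean = sum(w * t for w, t in zip(weights, time_diffs)) / w_sum
    v_mean = sum(w * v for w, v in zip(weights, values)) / w_sum
    cov = sum(w * (t - t_mean) * (v - v_mean)
              for w, t, v in zip(weights, time_diffs, values))
    var = sum(w * (t - t_mean) ** 2 for w, t in zip(weights, time_diffs))
    return cov / var if var > 0 else 0.0
```

For instance, values that shrink as the time difference grows (i.e. that have been increasing towards the present) yield a negative slope, while a constant series yields zero.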
2.7 Productionization¶
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3 and spark modules.
# Creates a folder named seznam_pipeline containing
# the SQL code.
pipe1.features.to_sql().save("seznam_pipeline")
pipe1.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("seznam_spark")
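The saved folder contains one .sql file per transpiled feature. As a sketch of how the SQLite dialect could be executed against a plain database with Python's standard library (the directory layout is an assumption; the helper name is ours):

```python
import sqlite3
from pathlib import Path

def run_transpiled_features(sql_dir, db_path=":memory:"):
    """Execute every .sql file in a folder (as written by
    pipe1.features.to_sql().save(...)) against a SQLite database.
    Files are applied in sorted filename order; each file may
    contain several statements, hence executescript."""
    conn = sqlite3.connect(db_path)
    for script in sorted(Path(sql_dir).glob("*.sql")):
        conn.executescript(script.read_text())
    return conn
```

Note that getML ships a dedicated sqlite3 module that covers this end to end; the sketch above only illustrates the mechanics.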
2.8 Discussion¶
For a convenient overview, we summarize our results in a table.
scores = [fastprop_score, featuretools_score]
pd.DataFrame(data={
'Name': ['getML: FastProp', 'featuretools'],
'R-squared': [f'{score.rsquared:.2%}' for score in scores],
'RMSE': [f'{score.rmse:,.0f}' for score in scores],
'MAE': [f'{score.mae:,.0f}' for score in scores]
})
|  | Name | R-squared | RMSE | MAE |
|---|---|---|---|---|
| 0 | getML: FastProp | 87.51% | 18,674 | 2,999 |
| 1 | featuretools | 57.51% | 34,050 | 5,184 |
getml.engine.shutdown()
3. Conclusion¶
We have benchmarked getML against featuretools on a dataset of online transactions. getML's FastProp outperforms featuretools by a wide margin: it reaches an out-of-sample R-squared of 87.51% versus 57.51% for featuretools, along with considerably lower RMSE and MAE.
References¶
Motl, Jan, and Oliver Schulte. "The CTU Prague Relational Learning Repository." arXiv preprint arXiv:1511.03086 (2015).