Seznam - Predicting transaction volume¶

Seznam is a Czech company with a scope similar to Google. The purpose of this notebook is to analyze data from Seznam's wallet, predicting the transaction volume.

Summary:

Prediction type: Regression model
Domain: E-commerce
Prediction target: Transaction volume
Population size: 1,462,078

Author: Dr. Patrick Urbanke

Background¶

Seznam is a Czech company with a scope similar to Google. The purpose of this notebook is to analyze data from Seznam's wallet, predicting the transaction volume.

Since the dataset is in Czech, we will quickly translate the meaning of the main tables:

dobito: contains data on prepayments into a wallet
probehnuto: contains data on charges from a wallet
probehnuto_mimo_penezenku: contains data on charges, from sources other than a wallet

The dataset has been downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015).

Analysis¶

Let's get started with the analysis and set up your session:

In [1]:

  Copied!     
 
%pip install -q "getml==1.5.0"

import getml

getml.engine.launch()
getml.set_project('seznam')
%pip install -q "getml==1.5.0" import getml getml.engine.launch() getml.set_project('seznam')

Note: you may need to restart the kernel to use updated packages.
Launching ./getML --allow-push-notifications=true --allow-remote-ips=false --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false in /home/user/.getML/getml-1.5.0-x64-community-edition-linux...
Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912123743.log.

Connected to project 'seznam'.

1. Loading data¶

1.1 Download from source¶

We begin by downloading the data:

In [2]:

  Copied!     
 
conn = getml.database.connect_mariadb(
    host="relational.fel.cvut.cz",
    dbname="Seznam",
    port=3306,
    user="guest",
    password="relational"
)

conn
conn = getml.database.connect_mariadb( host="relational.fel.cvut.cz", dbname="Seznam", port=3306, user="guest", password="relational" ) conn

Out[2]:

Connection(dbname='Seznam', dialect='mysql', host='relational.fel.cvut.cz', port=3306)

In [3]:

  Copied!     
 
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if not getml.data.exists(name):
        data_frame = getml.DataFrame.from_db(
            name=name,
            table_name=name,
            conn=conn
        )
        data_frame.save()
    else:
        data_frame = getml.data.load_data_frame(name)
    return data_frame
def load_if_needed(name): """ Loads the data from the relational learning repository, if the data frame has not already been loaded. """ if not getml.data.exists(name): data_frame = getml.DataFrame.from_db( name=name, table_name=name, conn=conn ) data_frame.save() else: data_frame = getml.data.load_data_frame(name) return data_frame

In [4]:

  Copied!     
 
dobito = load_if_needed("dobito")
probehnuto = load_if_needed("probehnuto")
probehnuto_mimo_penezenku = load_if_needed("probehnuto_mimo_penezenku")
dobito = load_if_needed("dobito") probehnuto = load_if_needed("probehnuto") probehnuto_mimo_penezenku = load_if_needed("probehnuto_mimo_penezenku")

In [5]:

  Copied!     
 
dobito
dobito

Out[5]:

name	client_id	month_year_datum_transakce	sluzba	kc_dobito
role	unused_float	unused_string	unused_string	unused_string
0	7157857	2012-10-01	c	1045.62
1	109700	2015-10-01	c	5187.28
2	51508	2015-08-01	c	408.20
3	9573550	2012-10-01	c	521.24
4	9774621	2014-11-01	c	386.22
	...	...	...	...
554341	65283	2012-09-01	g	7850.00
554342	6091446	2012-08-01	g	31400.00
554343	1264806	2013-08-01	g	-8220.52
554344	101103	2012-08-01	g	3140.00
554345	8674551	2012-08-01	g	6280.00

554346 rows x 4 columns
memory usage: 29.59 MB
name: dobito
type: getml.DataFrame

In [6]:

  Copied!     
 
probehnuto
probehnuto

Out[6]:

name	client_id	month_year_datum_transakce	sluzba	kc_proklikano
role	unused_float	unused_string	unused_string	unused_string
0	109145	2013-06-01	c	-31.40
1	9804394	2015-10-01	h	37.68
2	9803353	2015-10-01	h	725.34
3	9801753	2015-10-01	h	194.68
4	9800425	2015-10-01	h	1042.48
	...	...	...	...
1462073	98857	2015-08-01	NULL	153.86
1462074	95776	2015-09-01	NULL	153.86
1462075	98857	2015-09-01	NULL	153.86
1462076	90001	2015-10-01	NULL	310.86
1462077	946957	2015-10-01	NULL	153.86

1462078 rows x 4 columns
memory usage: 77.07 MB
name: probehnuto
type: getml.DataFrame

In [7]:

  Copied!     
 
probehnuto_mimo_penezenku
probehnuto_mimo_penezenku

Out[7]:

name	client_id	Month/Year	probehla_inzerce_mimo_penezenku
role	unused_float	unused_string	unused_string
0	3901	2012-08-01	ANO
1	3901	2012-09-01	ANO
2	3901	2012-10-01	ANO
3	3901	2012-11-01	ANO
4	3901	2012-12-01	ANO
	...	...	...
599381	9804086	2015-10-01	ANO
599382	9804238	2015-10-01	ANO
599383	9804782	2015-10-01	ANO
599384	9804810	2015-10-01	ANO
599385	9805032	2015-10-01	ANO

599386 rows x 3 columns
memory usage: 23.38 MB
name: probehnuto_mimo_penezenku
type: getml.DataFrame

1.2 Prepare data for getML¶

getML requires that we define roles for each of the columns.

In [8]:

  Copied!     
 
dobito.set_role("client_id", getml.data.roles.join_key)
dobito.set_role("month_year_datum_transakce", getml.data.roles.time_stamp)
dobito.set_role("sluzba", getml.data.roles.categorical)
dobito.set_role("kc_dobito", getml.data.roles.numerical)

dobito.set_unit("sluzba", "service")

dobito
dobito.set_role("client_id", getml.data.roles.join_key) dobito.set_role("month_year_datum_transakce", getml.data.roles.time_stamp) dobito.set_role("sluzba", getml.data.roles.categorical) dobito.set_role("kc_dobito", getml.data.roles.numerical) dobito.set_unit("sluzba", "service") dobito

Out[8]:

name	month_year_datum_transakce	client_id	sluzba	kc_dobito
role	time_stamp	join_key	categorical	numerical
unit	time stamp, comparison only		service
0	2012-10-01	7157857	c	1045.62
1	2015-10-01	109700	c	5187.28
2	2015-08-01	51508	c	408.2
3	2012-10-01	9573550	c	521.24
4	2014-11-01	9774621	c	386.22
	...	...	...	...
554341	2012-09-01	65283	g	7850
554342	2012-08-01	6091446	g	31400
554343	2013-08-01	1264806	g	-8220.52
554344	2012-08-01	101103	g	3140
554345	2012-08-01	8674551	g	6280

554346 rows x 4 columns
memory usage: 13.30 MB
name: dobito
type: getml.DataFrame

In [9]:

  Copied!     
 
probehnuto.set_role("client_id", getml.data.roles.join_key)
probehnuto.set_role("month_year_datum_transakce", getml.data.roles.time_stamp)
probehnuto.set_role("sluzba", getml.data.roles.categorical)
probehnuto.set_role("kc_proklikano", getml.data.roles.target)

probehnuto.set_unit("sluzba", "service")

probehnuto
probehnuto.set_role("client_id", getml.data.roles.join_key) probehnuto.set_role("month_year_datum_transakce", getml.data.roles.time_stamp) probehnuto.set_role("sluzba", getml.data.roles.categorical) probehnuto.set_role("kc_proklikano", getml.data.roles.target) probehnuto.set_unit("sluzba", "service") probehnuto

Out[9]:

name	month_year_datum_transakce	client_id	kc_proklikano	sluzba
role	time_stamp	join_key	target	categorical
unit	time stamp, comparison only			service
0	2013-06-01	109145	-31.4	c
1	2015-10-01	9804394	37.68	h
2	2015-10-01	9803353	725.34	h
3	2015-10-01	9801753	194.68	h
4	2015-10-01	9800425	1042.48	h
	...	...	...	...
1462073	2015-08-01	98857	153.86	NULL
1462074	2015-09-01	95776	153.86	NULL
1462075	2015-09-01	98857	153.86	NULL
1462076	2015-10-01	90001	310.86	NULL
1462077	2015-10-01	946957	153.86	NULL

1462078 rows x 4 columns
memory usage: 35.09 MB
name: probehnuto
type: getml.DataFrame

In [10]:

  Copied!     
 
probehnuto_mimo_penezenku.set_role("client_id", getml.data.roles.join_key)
probehnuto_mimo_penezenku.set_role("Month/Year", getml.data.roles.time_stamp)

probehnuto_mimo_penezenku
probehnuto_mimo_penezenku.set_role("client_id", getml.data.roles.join_key) probehnuto_mimo_penezenku.set_role("Month/Year", getml.data.roles.time_stamp) probehnuto_mimo_penezenku

Out[10]:

name	Month/Year	client_id	probehla_inzerce_mimo_penezenku
role	time_stamp	join_key	unused_string
unit	time stamp, comparison only
0	2012-08-01	3901	ANO
1	2012-09-01	3901	ANO
2	2012-10-01	3901	ANO
3	2012-11-01	3901	ANO
4	2012-12-01	3901	ANO
	...	...	...
599381	2015-10-01	9804086	ANO
599382	2015-10-01	9804238	ANO
599383	2015-10-01	9804782	ANO
599384	2015-10-01	9804810	ANO
599385	2015-10-01	9805032	ANO

599386 rows x 3 columns
memory usage: 14.39 MB
name: probehnuto_mimo_penezenku
type: getml.DataFrame

In [11]:

  Copied!     
 
split = getml.data.split.random(train=0.8, test=0.2)
split
split = getml.data.split.random(train=0.8, test=0.2) split

Out[11]:


0	train
1	train
2	train
3	test
4	train
	...

infinite number of rows
type: StringColumnView

2. Predictive modeling¶

We loaded the data and defined the roles and units. Next, we create a getML pipeline for relational learning.

2.1 Define relational model¶

In [12]:

  Copied!     
 
star_schema = getml.data.StarSchema(population=probehnuto, alias="population", split=split)

star_schema.join(
    probehnuto,
    on="client_id",
    time_stamps="month_year_datum_transakce",
    lagged_targets=True,
    horizon=getml.data.time.days(1),
)

star_schema.join(
    dobito,
    on="client_id",
    time_stamps="month_year_datum_transakce",
)

star_schema.join(
    probehnuto_mimo_penezenku,
    on="client_id", 
    time_stamps=("month_year_datum_transakce",  "Month/Year"),
)

star_schema
star_schema = getml.data.StarSchema(population=probehnuto, alias="population", split=split) star_schema.join( probehnuto, on="client_id", time_stamps="month_year_datum_transakce", lagged_targets=True, horizon=getml.data.time.days(1), ) star_schema.join( dobito, on="client_id", time_stamps="month_year_datum_transakce", ) star_schema.join( probehnuto_mimo_penezenku, on="client_id", time_stamps=("month_year_datum_transakce", "Month/Year"), ) star_schema

Out[12]:

data model

diagram

staging

	data frames	staging table
0	population	POPULATION__STAGING_TABLE_1
1	dobito	DOBITO__STAGING_TABLE_2
2	probehnuto	PROBEHNUTO__STAGING_TABLE_3
3	probehnuto_mimo_penezenku	PROBEHNUTO_MIMO_PENEZENKU__STAGING_TABLE_4

container

population

	subset	name	rows	type
0	test	probehnuto	unknown	View
1	train	probehnuto	unknown	View

peripheral

	name	rows	type
0	probehnuto	1462078	DataFrame
1	dobito	554346	DataFrame
2	probehnuto_mimo_penezenku	599386	DataFrame

2.2 getML pipeline¶

Set-up the feature learner & predictor

In [13]:

  Copied!     
 
fast_prop = getml.feature_learning.FastProp(
    aggregation=getml.feature_learning.FastProp.agg_sets.All,
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    num_threads=1,    
    sampling_factor=0.1,
)

feature_selector = getml.predictors.XGBoostRegressor(n_jobs=1, external_memory=True)

predictor = getml.predictors.XGBoostRegressor(n_jobs=1)
fast_prop = getml.feature_learning.FastProp( aggregation=getml.feature_learning.FastProp.agg_sets.All, loss_function=getml.feature_learning.loss_functions.SquareLoss, num_threads=1, sampling_factor=0.1, ) feature_selector = getml.predictors.XGBoostRegressor(n_jobs=1, external_memory=True) predictor = getml.predictors.XGBoostRegressor(n_jobs=1)

Build the pipeline

In [14]:

  Copied!     
 
pipe1 = getml.Pipeline(
    tags=['fast_prop'],
    data_model=star_schema.data_model,
    feature_learners=[fast_prop],
    feature_selectors=[feature_selector],
    predictors=[predictor],
    include_categorical=True,
)

pipe1
pipe1 = getml.Pipeline( tags=['fast_prop'], data_model=star_schema.data_model, feature_learners=[fast_prop], feature_selectors=[feature_selector], predictors=[predictor], include_categorical=True, ) pipe1

Out[14]:

Pipeline(data_model='population',
         feature_learners=['FastProp'],
         feature_selectors=['XGBoostRegressor'],
         include_categorical=True,
         loss_function='SquareLoss',
         peripheral=['dobito', 'probehnuto', 'probehnuto_mimo_penezenku'],
         predictors=['XGBoostRegressor'],
         preprocessors=[],
         share_selected_features=0.5,
         tags=['fast_prop'])

2.3 Model training¶

In [15]:

  Copied!     
 
pipe1.check(star_schema.train)
pipe1.check(star_schema.train)

Checking data model...

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01

The pipeline check generated 2 issues labeled INFO and 0 issues labeled WARNING.

Out[15]:

	type	label	message
0	INFO	FOREIGN KEYS NOT FOUND	When joining POPULATION__STAGING_TABLE_1 and DOBITO__STAGING_TABLE_2 over 'client_id' and 'client_id', there are no corresponding entries for 2.228789% of entries in 'client_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
1	INFO	FOREIGN KEYS NOT FOUND	When joining POPULATION__STAGING_TABLE_1 and PROBEHNUTO_MIMO_PENEZENKU__STAGING_TABLE_4 over 'client_id' and 'client_id', there are no corresponding entries for 26.543966% of entries in 'client_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.

In [16]:

  Copied!     
 
pipe1.fit(star_schema.train)
pipe1.fit(star_schema.train)

Checking data model...

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00

The pipeline check generated 2 issues labeled INFO and 0 issues labeled WARNING.

To see the issues in full, run .check() on the pipeline.

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  FastProp: Trying 721 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:39
  FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 01:28
  XGBoost: Training as feature selector... ━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 16:04
  XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 09:06

Trained pipeline.

Time taken: 0:27:20.028950.

Out[16]:

Pipeline(data_model='population',
         feature_learners=['FastProp'],
         feature_selectors=['XGBoostRegressor'],
         include_categorical=True,
         loss_function='SquareLoss',
         peripheral=['dobito', 'probehnuto', 'probehnuto_mimo_penezenku'],
         predictors=['XGBoostRegressor'],
         preprocessors=[],
         share_selected_features=0.5,
         tags=['fast_prop', 'container-mlm7a1'])

2.4 Model evaluation¶

In [17]:

  Copied!     
 
pipe1.score(star_schema.test)
pipe1.score(star_schema.test)

  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
  FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:20

Out[17]:

	date time	set used	target	mae	rmse	rsquared
0	2024-09-12 13:05:20	train	kc_proklikano	2922.93	14339.2964	0.9427
1	2024-09-12 13:05:41	test	kc_proklikano	2981.8108	18545.8816	0.8766

2.5 Features¶

The most important feature looks as follows:

In [18]:

  Copied!     
 
pipe1.features.to_sql()[pipe1.features.sort(by="importances")[0].name]
pipe1.features.to_sql()[pipe1.features.sort(by="importances")[0].name]

Out[18]:

DROP TABLE IF EXISTS "FEATURE_1_38";

CREATE TABLE "FEATURE_1_38" AS
SELECT EWMA_TREND_1H( t2."kc_proklikano" ORDER BY t1."month_year_datum_transakce" - t2."month_year_datum_transakce, '+1.000000 days'" ) AS "feature_1_38",
       t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "PROBEHNUTO__STAGING_TABLE_3" t2
ON t1."client_id" = t2."client_id"
WHERE t2."month_year_datum_transakce, '+1.000000 days'" <= t1."month_year_datum_transakce"
AND t1."sluzba" = t2."sluzba"
GROUP BY t1.rowid;

In [19]:

  Copied!     
 
getml.engine.shutdown()
getml.engine.shutdown()

3. Conclusion¶

In this notebook, we successfully demonstrated the process of predicting transaction volume for Seznam, a Czech company similar to Google, using the getML library. The key steps we covered include:

Background and Data Preparation:
- Introduced Seznam and the purpose of the analysis.
- Loaded and prepared the data from Seznam's wallet, including prepayments and charges.
Data Visualization and Preparation:
- Translated and understood the main tables in the dataset.
- Defined roles and units for the dataset columns to prepare them for modeling.
Predictive Modeling:
- Created a relational model using getML's StarSchema to capture the relationships between different tables.
- Built a getML pipeline using FastProp for feature learning and XGBoostRegressor for prediction.
Model Training and Evaluation:
- Trained the model on the prepared data.
- Evaluated the model's performance, achieving an R-squared of 87.01%, RMSE of 14,783, and MAE of 2,974.

By leveraging getML's capabilities, we efficiently handled the complexities of relational data and built a robust regression model to predict transaction volume. This approach can be extended to other e-commerce datasets and prediction tasks, providing valuable insights and accurate forecasts.

References¶

Motl, Jan, and Oliver Schulte. "The CTU prague relational learning repository." arXiv preprint arXiv:1511.03086 (2015).