Seznam - Predicting transaction volume¶
Seznam is a Czech company with a scope similar to Google's. The purpose of this notebook is to analyze data from Seznam's wallet, predicting the transaction volume.
Summary:
- Prediction type: Regression model
- Domain: E-commerce
- Prediction target: Transaction volume
- Population size: 1,462,078
Author: Dr. Patrick Urbanke
Background¶
Seznam is a Czech company with a scope similar to Google's. The purpose of this notebook is to analyze data from Seznam's wallet, predicting the transaction volume.
Since the dataset is in Czech, we will quickly translate the meaning of the main tables:
- dobito: contains data on prepayments into a wallet
- probehnuto: contains data on charges from a wallet
- probehnuto_mimo_penezenku: contains data on charges from sources other than the wallet
The dataset has been downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015).
Analysis¶
Let's get started with the analysis and set up the session:
%pip install -q "getml==1.5.0"
import getml
getml.engine.launch()
getml.set_project('seznam')
Note: you may need to restart the kernel to use updated packages. Launching ./getML --allow-push-notifications=true --allow-remote-ips=false --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false in /home/user/.getML/getml-1.5.0-x64-community-edition-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912123743.log.
Connected to project 'seznam'.
1. Loading data¶
1.1 Download from source¶
We begin by downloading the data:
conn = getml.database.connect_mariadb(
host="db.relational-data.org",
dbname="Seznam",
port=3306,
user="guest",
password="relational"
)
conn
Connection(dbname='Seznam', dialect='mysql', host='db.relational-data.org', port=3306)
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if not getml.data.exists(name):
        data_frame = getml.DataFrame.from_db(
            name=name,
            table_name=name,
            conn=conn
        )
        data_frame.save()
    else:
        data_frame = getml.data.load_data_frame(name)
    return data_frame
dobito = load_if_needed("dobito")
probehnuto = load_if_needed("probehnuto")
probehnuto_mimo_penezenku = load_if_needed("probehnuto_mimo_penezenku")
dobito
name | client_id | month_year_datum_transakce | sluzba | kc_dobito |
---|---|---|---|---|
role | unused_float | unused_string | unused_string | unused_string |
0 | 7157857 | 2012-10-01 | c | 1045.62 |
1 | 109700 | 2015-10-01 | c | 5187.28 |
2 | 51508 | 2015-08-01 | c | 408.20 |
3 | 9573550 | 2012-10-01 | c | 521.24 |
4 | 9774621 | 2014-11-01 | c | 386.22 |
... | ... | ... | ... | |
554341 | 65283 | 2012-09-01 | g | 7850.00 |
554342 | 6091446 | 2012-08-01 | g | 31400.00 |
554343 | 1264806 | 2013-08-01 | g | -8220.52 |
554344 | 101103 | 2012-08-01 | g | 3140.00 |
554345 | 8674551 | 2012-08-01 | g | 6280.00 |
554346 rows x 4 columns
memory usage: 29.59 MB
name: dobito
type: getml.DataFrame
probehnuto
name | client_id | month_year_datum_transakce | sluzba | kc_proklikano |
---|---|---|---|---|
role | unused_float | unused_string | unused_string | unused_string |
0 | 109145 | 2013-06-01 | c | -31.40 |
1 | 9804394 | 2015-10-01 | h | 37.68 |
2 | 9803353 | 2015-10-01 | h | 725.34 |
3 | 9801753 | 2015-10-01 | h | 194.68 |
4 | 9800425 | 2015-10-01 | h | 1042.48 |
... | ... | ... | ... | |
1462073 | 98857 | 2015-08-01 | NULL | 153.86 |
1462074 | 95776 | 2015-09-01 | NULL | 153.86 |
1462075 | 98857 | 2015-09-01 | NULL | 153.86 |
1462076 | 90001 | 2015-10-01 | NULL | 310.86 |
1462077 | 946957 | 2015-10-01 | NULL | 153.86 |
1462078 rows x 4 columns
memory usage: 77.07 MB
name: probehnuto
type: getml.DataFrame
probehnuto_mimo_penezenku
name | client_id | Month/Year | probehla_inzerce_mimo_penezenku |
---|---|---|---|
role | unused_float | unused_string | unused_string |
0 | 3901 | 2012-08-01 | ANO |
1 | 3901 | 2012-09-01 | ANO |
2 | 3901 | 2012-10-01 | ANO |
3 | 3901 | 2012-11-01 | ANO |
4 | 3901 | 2012-12-01 | ANO |
... | ... | ... | |
599381 | 9804086 | 2015-10-01 | ANO |
599382 | 9804238 | 2015-10-01 | ANO |
599383 | 9804782 | 2015-10-01 | ANO |
599384 | 9804810 | 2015-10-01 | ANO |
599385 | 9805032 | 2015-10-01 | ANO |
599386 rows x 3 columns
memory usage: 23.38 MB
name: probehnuto_mimo_penezenku
type: getml.DataFrame
1.2 Prepare data for getML¶
getML requires that we assign a role to each column. Join keys link the tables, time stamps enforce temporal order in joins and thereby prevent data leakage, categorical and numerical columns serve as inputs for feature learning, and the target is the quantity we want to predict.
dobito.set_role("client_id", getml.data.roles.join_key)
dobito.set_role("month_year_datum_transakce", getml.data.roles.time_stamp)
dobito.set_role("sluzba", getml.data.roles.categorical)
dobito.set_role("kc_dobito", getml.data.roles.numerical)
dobito.set_unit("sluzba", "service")
dobito
name | month_year_datum_transakce | client_id | sluzba | kc_dobito |
---|---|---|---|---|
role | time_stamp | join_key | categorical | numerical |
unit | time stamp, comparison only | | service | |
0 | 2012-10-01 | 7157857 | c | 1045.62 |
1 | 2015-10-01 | 109700 | c | 5187.28 |
2 | 2015-08-01 | 51508 | c | 408.2 |
3 | 2012-10-01 | 9573550 | c | 521.24 |
4 | 2014-11-01 | 9774621 | c | 386.22 |
... | ... | ... | ... | |
554341 | 2012-09-01 | 65283 | g | 7850 |
554342 | 2012-08-01 | 6091446 | g | 31400 |
554343 | 2013-08-01 | 1264806 | g | -8220.52 |
554344 | 2012-08-01 | 101103 | g | 3140 |
554345 | 2012-08-01 | 8674551 | g | 6280 |
554346 rows x 4 columns
memory usage: 13.30 MB
name: dobito
type: getml.DataFrame
probehnuto.set_role("client_id", getml.data.roles.join_key)
probehnuto.set_role("month_year_datum_transakce", getml.data.roles.time_stamp)
probehnuto.set_role("sluzba", getml.data.roles.categorical)
probehnuto.set_role("kc_proklikano", getml.data.roles.target)
probehnuto.set_unit("sluzba", "service")
probehnuto
name | month_year_datum_transakce | client_id | kc_proklikano | sluzba |
---|---|---|---|---|
role | time_stamp | join_key | target | categorical |
unit | time stamp, comparison only | | | service |
0 | 2013-06-01 | 109145 | -31.4 | c |
1 | 2015-10-01 | 9804394 | 37.68 | h |
2 | 2015-10-01 | 9803353 | 725.34 | h |
3 | 2015-10-01 | 9801753 | 194.68 | h |
4 | 2015-10-01 | 9800425 | 1042.48 | h |
... | ... | ... | ... | |
1462073 | 2015-08-01 | 98857 | 153.86 | NULL |
1462074 | 2015-09-01 | 95776 | 153.86 | NULL |
1462075 | 2015-09-01 | 98857 | 153.86 | NULL |
1462076 | 2015-10-01 | 90001 | 310.86 | NULL |
1462077 | 2015-10-01 | 946957 | 153.86 | NULL |
1462078 rows x 4 columns
memory usage: 35.09 MB
name: probehnuto
type: getml.DataFrame
probehnuto_mimo_penezenku.set_role("client_id", getml.data.roles.join_key)
probehnuto_mimo_penezenku.set_role("Month/Year", getml.data.roles.time_stamp)
probehnuto_mimo_penezenku
name | Month/Year | client_id | probehla_inzerce_mimo_penezenku |
---|---|---|---|
role | time_stamp | join_key | unused_string |
unit | time stamp, comparison only | | |
0 | 2012-08-01 | 3901 | ANO |
1 | 2012-09-01 | 3901 | ANO |
2 | 2012-10-01 | 3901 | ANO |
3 | 2012-11-01 | 3901 | ANO |
4 | 2012-12-01 | 3901 | ANO |
... | ... | ... | |
599381 | 2015-10-01 | 9804086 | ANO |
599382 | 2015-10-01 | 9804238 | ANO |
599383 | 2015-10-01 | 9804782 | ANO |
599384 | 2015-10-01 | 9804810 | ANO |
599385 | 2015-10-01 | 9805032 | ANO |
599386 rows x 3 columns
memory usage: 14.39 MB
name: probehnuto_mimo_penezenku
type: getml.DataFrame
We randomly split the population into a training set (80%) and a testing set (20%):
split = getml.data.split.random(train=0.8, test=0.2)
split
0 | train |
---|---|
1 | train |
2 | train |
3 | test |
4 | train |
... |
infinite number of rows
type: StringColumnView
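Note that the split is a lazily evaluated column view, which is why it reports an infinite number of rows; the assignment only takes effect once the column is paired with the population table in the StarSchema below. If reproducibility across sessions matters, the seed can be pinned explicitly. A minimal sketch, assuming split.random accepts a seed parameter (5849 being its documented default):
# Hedged sketch: pin the seed so that the train/test assignment
# is reproducible across sessions (seed parameter assumed).
split = getml.data.split.random(seed=5849, train=0.8, test=0.2)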
2. Predictive modeling¶
We loaded the data and defined the roles and units. Next, we create a getML pipeline for relational learning.
2.1 Define relational model¶
We arrange the tables in a star schema with probehnuto as the population table. Note the self-join of probehnuto: because of lagged_targets=True and a horizon of one day, past values of the target kc_proklikano become available as features, while the horizon guarantees that the current value is never leaked into the feature set.
star_schema = getml.data.StarSchema(population=probehnuto, alias="population", split=split)
star_schema.join(
probehnuto,
on="client_id",
time_stamps="month_year_datum_transakce",
lagged_targets=True,
horizon=getml.data.time.days(1),
)
star_schema.join(
dobito,
on="client_id",
time_stamps="month_year_datum_transakce",
)
star_schema.join(
probehnuto_mimo_penezenku,
on="client_id",
time_stamps=("month_year_datum_transakce", "Month/Year"),
)
star_schema
data frames | staging table | |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | dobito | DOBITO__STAGING_TABLE_2 |
2 | probehnuto | PROBEHNUTO__STAGING_TABLE_3 |
3 | probehnuto_mimo_penezenku | PROBEHNUTO_MIMO_PENEZENKU__STAGING_TABLE_4 |
subset | name | rows | type | |
---|---|---|---|---|
0 | test | probehnuto | unknown | View |
1 | train | probehnuto | unknown | View |
name | rows | type | |
---|---|---|---|
0 | probehnuto | 1462078 | DataFrame |
1 | dobito | 554346 | DataFrame |
2 | probehnuto_mimo_penezenku | 599386 | DataFrame |
2.2 getML pipeline¶
Set up the feature learner & predictor
fast_prop = getml.feature_learning.FastProp(
aggregation=getml.feature_learning.FastProp.agg_sets.All,
loss_function=getml.feature_learning.loss_functions.SquareLoss,
num_threads=1,
sampling_factor=0.1,
)
feature_selector = getml.predictors.XGBoostRegressor(n_jobs=1, external_memory=True)
predictor = getml.predictors.XGBoostRegressor(n_jobs=1)
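FastProp (short for fast propositionalization) generates features by applying a set of aggregations over the joined peripheral tables; XGBoost is then used twice, once to select the most informative features and once as the final predictor. To see which aggregations agg_sets.All contains, we can simply print it:
# agg_sets.All is a predefined collection of aggregation names.
print(getml.feature_learning.FastProp.agg_sets.All)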
Build the pipeline
pipe1 = getml.Pipeline(
tags=['fast_prop'],
data_model=star_schema.data_model,
feature_learners=[fast_prop],
feature_selectors=[feature_selector],
predictors=[predictor],
include_categorical=True,
)
pipe1
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=['XGBoostRegressor'], include_categorical=True, loss_function='SquareLoss', peripheral=['dobito', 'probehnuto', 'probehnuto_mimo_penezenku'], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['fast_prop'])
2.3 Model training¶
pipe1.check(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
The pipeline check generated 2 issues labeled INFO and 0 issues labeled WARNING.
type | label | message | |
---|---|---|---|
0 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and DOBITO__STAGING_TABLE_2 over 'client_id' and 'client_id', there are no corresponding entries for 2.228789% of entries in 'client_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
1 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and PROBEHNUTO_MIMO_PENEZENKU__STAGING_TABLE_4 over 'client_id' and 'client_id', there are no corresponding entries for 26.543966% of entries in 'client_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
pipe1.fit(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 2 issues labeled INFO and 0 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Trying 721 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:39 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 01:28 XGBoost: Training as feature selector... ━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 16:04 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 09:06
Trained pipeline.
Time taken: 0:27:20.028950.
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=['XGBoostRegressor'], include_categorical=True, loss_function='SquareLoss', peripheral=['dobito', 'probehnuto', 'probehnuto_mimo_penezenku'], predictors=['XGBoostRegressor'], preprocessors=[], share_selected_features=0.5, tags=['fast_prop', 'container-mlm7a1'])
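Fitted pipelines are persisted inside the getML project, so they can be reloaded in a later session without retraining. A minimal sketch, assuming the pipeline's id attribute and getml.pipeline.load:
# Hedged sketch: reload the fitted pipeline by its id.
pipe_reloaded = getml.pipeline.load(pipe1.id)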
2.4 Model evaluation¶
pipe1.score(star_schema.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:20
date time | set used | target | mae | rmse | rsquared | |
---|---|---|---|---|---|---|
0 | 2024-09-12 13:05:20 | train | kc_proklikano | 2922.93 | 14339.2964 | 0.9427 |
1 | 2024-09-12 13:05:41 | test | kc_proklikano | 2981.8108 | 18545.8816 | 0.8766 |
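Beyond the aggregate scores, we can generate raw predictions, for instance to inspect individual residuals. A minimal sketch using the pipeline's predict method:
# predict returns a numpy array with one prediction per row
# of the test population table.
predictions = pipe1.predict(star_schema.test)
print(predictions[:5])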
2.5 Features¶
The most important feature looks as follows:
pipe1.features.to_sql()[pipe1.features.sort(by="importances")[0].name]
DROP TABLE IF EXISTS "FEATURE_1_38";
CREATE TABLE "FEATURE_1_38" AS
SELECT EWMA_TREND_1H( t2."kc_proklikano" ORDER BY t1."month_year_datum_transakce" - t2."month_year_datum_transakce, '+1.000000 days'" ) AS "feature_1_38",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "PROBEHNUTO__STAGING_TABLE_3" t2
ON t1."client_id" = t2."client_id"
WHERE t2."month_year_datum_transakce, '+1.000000 days'" <= t1."month_year_datum_transakce"
AND t1."sluzba" = t2."sluzba"
GROUP BY t1.rowid;
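Because every feature is transpiled to SQL, the learned features can in principle be deployed outside the getML engine. A minimal sketch that writes the complete feature code to disk, assuming that converting the object returned by to_sql() to a string yields the full SQL:
# Hedged sketch: export the transpiled SQL for all features
# (the str() conversion is an assumption about the returned object).
with open("seznam_features.sql", "w") as f:
    f.write(str(pipe1.features.to_sql()))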
getml.engine.shutdown()
3. Conclusion¶
In this notebook, we demonstrated how to predict the transaction volume for Seznam using the getML library. The key steps we covered include:
Background and Data Preparation:
- Introduced Seznam and the purpose of the analysis.
- Loaded and prepared the data from Seznam's wallet, including prepayments and charges.
Data Understanding and Preparation:
- Translated and understood the main tables in the dataset.
- Defined roles and units for the dataset columns to prepare them for modeling.
Predictive Modeling:
- Created a relational model using getML's StarSchema to capture the relationships between the tables.
- Built a getML pipeline using FastProp for feature learning and XGBoostRegressor for prediction.
Model Training and Evaluation:
- Trained the model on the prepared data.
- Evaluated the model's performance on the test set, achieving an R-squared of 87.66%, an RMSE of 18,546, and an MAE of 2,982.
By leveraging getML's capabilities, we efficiently handled the complexities of relational data and built a robust regression model to predict transaction volume. This approach can be extended to other e-commerce datasets and prediction tasks.
References¶
Motl, Jan, and Oliver Schulte. "The CTU Prague Relational Learning Repository." arXiv preprint arXiv:1511.03086 (2015).