Consumer expenditure - Why relational learning matters¶
This example demonstrates how powerful a real relational learning algorithm can be. Based on a public-domain dataset on consumer behavior, we use a propostionalization algorithm to predict whether purchases were made as a gift. We show that with relational learning, we can get an AUC of over 90%. The generated features would have been impossible to build by hand or by using brute-force approaches.
Summary:
- Prediction type: Classification model
- Domain: Retail
- Prediction target: If a purchase is a gift
- Source data: Relational data set, 4 tables
- Population size: 2.020.634
Background¶
Relational learning is one of the most underappreciated fields of machine learning. Even though relational learning is very relevant to many real world data science projects, many data scientists don't even know what relational learning is.
There are many subdomains of relational learning, but the most important one is extracting features from relational data: Most business data is relational, meaning that it is spread out over several relational tables. However, most machine learning algorithms require that the data be presented in the form of a single flat table. So we need to extract features from our relational data. Some people also call this data wrangling.
Most data scientists we know extract features from relational data manually or by using crude, brute-force approaches (randomly generate thousands of features and then do a feature selection). This is very time-consuming and does not produce good features.
The challenge¶
The Consumer Expenditure Data Set is a public domain data set provided by the American Bureau of Labor Statistics (https://www.bls.gov/cex/pumd.htm). It includes the diary entries, where American consumers are asked to keep diaries of the products they have purchased each month.
These consumer goods are categorized using a six-digit classification system the UCC. This system is hierarchical, meaning that every digit represents an increasingly granular category.
For instance, all UCC codes beginning with ‘200’ represent beverages. UCC codes beginning with ‘20011’ represents beer and ‘200111’ represents ‘beer and ale’ and ‘200112’ represents ‘nonalcoholic beer’ (https://www.bls.gov/cex/pumd/ce_pumd_interview_diary_dictionary.xlsx).
The diaries also contain a flag that indicates whether the product was purchased as a gift. The challenge is to predict that flag using other information in the diary entries.
This can be done based on the following considerations:
Some items are less likely to be purchased as gifts than others (for instance, it is unlikely that toilet paper is ever purchased as a gift).
Items that diverge from the usual consumption patterns are more likely to be gifts.
In total, there are three tables which we find interesting:
EXPD, which contains information on the consumer expenditures, including the target variable GIFT.
FMLD, which contains socio-demographic information on the households.
MEMD, which contains socio-demographic information on each member of the households.
Analysis¶
%pip install -q "getml==1.5.0" "matplotlib==3.9.2" "ipywidgets==8.1.5"
import matplotlib.pyplot as plt
import getml
%matplotlib inline
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
getml.engine.launch(in_memory=False, allow_remote_ips=True, token='token')
getml.engine.set_project("consumer_expenditures")
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=false --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240918134630.log. Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Connected to project 'consumer_expenditures'.
1. Loading data¶
1.1 Download from source¶
conn = getml.database.connect_mysql(
host="relational.fel.cvut.cz",
dbname="ConsumerExpenditures",
port=3306,
user="guest",
password="ctu-relational"
)
conn
Connection(dbname='ConsumerExpenditures',
dialect='mysql',
host='relational.fel.cvut.cz',
port=3306) def load_if_needed(name):
"""
Loads the data from the relational learning
repository, if the data frame has not already
been loaded.
"""
if getml.data.exists(name):
return getml.data.load_data_frame(name)
data_frame = getml.data.DataFrame.from_db(
name=name,
table_name=name,
conn=conn
)
data_frame.save()
return data_frame
households = load_if_needed("HOUSEHOLDS")
household_members = load_if_needed("HOUSEHOLD_MEMBERS")
expenditures = load_if_needed("EXPENDITURES")
households
| name | YEAR | INCOME_RANK | INCOME_RANK_1 | INCOME_RANK_2 | INCOME_RANK_3 | INCOME_RANK_4 | INCOME_RANK_5 | INCOME_RANK_MEAN | AGE_REF | HOUSEHOLD_ID |
|---|---|---|---|---|---|---|---|---|---|---|
| role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string |
| 0 | 2015 | 0.3044 | 0.1448 | 0.1427 | 0.1432 | 0.1422 | 0.1382 | 0.127 | 66 | 03111041 |
| 1 | 2015 | 0.3063 | 0.1462 | 0.1444 | 0.1446 | 0.1435 | 0.1395 | 0.1283 | 66 | 03111042 |
| 2 | 2015 | 0.6931 | 0.6222 | 0.6204 | 0.623 | 0.6131 | 0.6123 | 0.6207 | 48 | 03111051 |
| 3 | 2015 | 0.6926 | 0.6216 | 0.6198 | 0.6224 | 0.6125 | 0.6117 | 0.6201 | 48 | 03111052 |
| 4 | 2015 | 0.2817 | 0.113 | 0.1128 | 0.1098 | 0.1116 | 0.1092 | 0.0951 | 37 | 03111061 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
| 56807 | 2019 | 0.4828 | 0.4106 | 0.3603 | 0.3958 | 0.377 | 0.3984 | 0.3769 | 67 | 04362582 |
| 56808 | 2019 | 0.6644 | 0.5975 | 0.6026 | 0.5949 | 0.596 | 0.6002 | 0.6 | 52 | 04362661 |
| 56809 | 2019 | 0.6639 | 0.597 | 0.6021 | 0.5944 | 0.5955 | 0.5997 | 0.5995 | 52 | 04362662 |
| 56810 | 2019 | 0.162 | 0.05217 | 0.03955 | 0.04507 | 0.04607 | 0.02436 | 0.03558 | 72 | 04362671 |
| 56811 | 2019 | 0.1616 | 0.03925 | 0.05741 | 0.04595 | 0.03789 | 0.05746 | 0.03931 | 72 | 04362672 |
56812 rows x 10 columns
memory usage: 5.06 MB
name: HOUSEHOLDS
type: getml.DataFrame
household_members
| name | YEAR | AGE | HOUSEHOLD_ID | MARITAL | SEX | WORK_STATUS |
|---|---|---|---|---|---|---|
| role | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string |
| 0 | 2015 | 66 | 03111041 | 1 | 1 | NULL |
| 1 | 2015 | 66 | 03111042 | 1 | 1 | NULL |
| 2 | 2015 | 56 | 03111091 | 1 | 1 | NULL |
| 3 | 2015 | 56 | 03111092 | 1 | 1 | NULL |
| 4 | 2015 | 50 | 03111111 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | |
| 137350 | 2019 | 22 | 04362422 | 5 | 2 | NULL |
| 137351 | 2019 | 11 | 04362431 | 5 | 2 | NULL |
| 137352 | 2019 | 11 | 04362432 | 5 | 2 | NULL |
| 137353 | 2019 | 72 | 04362671 | 5 | 2 | NULL |
| 137354 | 2019 | 72 | 04362672 | 5 | 2 | NULL |
137355 rows x 6 columns
memory usage: 8.59 MB
name: HOUSEHOLD_MEMBERS
type: getml.DataFrame
expenditures
| name | YEAR | MONTH | COST | GIFT | IS_TRAINING | EXPENDITURE_ID | HOUSEHOLD_ID | PRODUCT_CODE |
|---|---|---|---|---|---|---|---|---|
| role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string |
| 0 | 2015 | 1 | 3.89 | 0 | 1 | 1 | 03111041 | 010210 |
| 1 | 2015 | 1 | 4.66 | 0 | 1 | 10 | 03111041 | 120310 |
| 2 | 2015 | 2 | 9.79 | 0 | 1 | 100 | 03111051 | 190211 |
| 3 | 2015 | 2 | 2.95 | 0 | 1 | 1000 | 03111402 | 040510 |
| 4 | 2015 | 1 | 2.12 | 0 | 1 | 10000 | 03114161 | 190321 |
| ... | ... | ... | ... | ... | ... | ... | ... | |
| 2020629 | 2017 | 6 | 1.99 | 0 | 1 | 999995 | 03708582 | 150110 |
| 2020630 | 2017 | 6 | 3.619 | 0 | 1 | 999996 | 03708582 | 150110 |
| 2020631 | 2017 | 6 | 5.2727 | 0 | 1 | 999997 | 03708582 | 150211 |
| 2020632 | 2017 | 6 | 4.6894 | 0 | 1 | 999998 | 03708582 | 150310 |
| 2020633 | 2017 | 6 | 5.7177 | 0 | 1 | 999999 | 03708582 | 160310 |
2020634 rows x 8 columns
memory usage: 176.70 MB
name: EXPENDITURES
type: getml.DataFrame
1.2 Prepare data for getML¶
We now have to assign roles to the columns.
households.set_role("HOUSEHOLD_ID", getml.data.roles.join_key)
households.set_role(households.roles.unused_float, getml.data.roles.numerical)
households
| name | HOUSEHOLD_ID | YEAR | INCOME_RANK | INCOME_RANK_1 | INCOME_RANK_2 | INCOME_RANK_3 | INCOME_RANK_4 | INCOME_RANK_5 | INCOME_RANK_MEAN | AGE_REF |
|---|---|---|---|---|---|---|---|---|---|---|
| role | join_key | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical |
| 0 | 03111041 | 2015 | 0.3044 | 0.1448 | 0.1427 | 0.1432 | 0.1422 | 0.1382 | 0.127 | 66 |
| 1 | 03111042 | 2015 | 0.3063 | 0.1462 | 0.1444 | 0.1446 | 0.1435 | 0.1395 | 0.1283 | 66 |
| 2 | 03111051 | 2015 | 0.6931 | 0.6222 | 0.6204 | 0.623 | 0.6131 | 0.6123 | 0.6207 | 48 |
| 3 | 03111052 | 2015 | 0.6926 | 0.6216 | 0.6198 | 0.6224 | 0.6125 | 0.6117 | 0.6201 | 48 |
| 4 | 03111061 | 2015 | 0.2817 | 0.113 | 0.1128 | 0.1098 | 0.1116 | 0.1092 | 0.0951 | 37 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
| 56807 | 04362582 | 2019 | 0.4828 | 0.4106 | 0.3603 | 0.3958 | 0.377 | 0.3984 | 0.3769 | 67 |
| 56808 | 04362661 | 2019 | 0.6644 | 0.5975 | 0.6026 | 0.5949 | 0.596 | 0.6002 | 0.6 | 52 |
| 56809 | 04362662 | 2019 | 0.6639 | 0.597 | 0.6021 | 0.5944 | 0.5955 | 0.5997 | 0.5995 | 52 |
| 56810 | 04362671 | 2019 | 0.162 | 0.05217 | 0.03955 | 0.04507 | 0.04607 | 0.02436 | 0.03558 | 72 |
| 56811 | 04362672 | 2019 | 0.1616 | 0.03925 | 0.05741 | 0.04595 | 0.03789 | 0.05746 | 0.03931 | 72 |
56812 rows x 10 columns
memory usage: 4.32 MB
name: HOUSEHOLDS
type: getml.DataFrame
household_members.set_role("HOUSEHOLD_ID", getml.data.roles.join_key)
household_members.set_role(household_members.roles.unused_float, getml.data.roles.numerical)
household_members.set_role(household_members.roles.unused_string, getml.data.roles.categorical)
household_members
| name | HOUSEHOLD_ID | MARITAL | SEX | WORK_STATUS | YEAR | AGE |
|---|---|---|---|---|---|---|
| role | join_key | categorical | categorical | categorical | numerical | numerical |
| 0 | 03111041 | 1 | 1 | NULL | 2015 | 66 |
| 1 | 03111042 | 1 | 1 | NULL | 2015 | 66 |
| 2 | 03111091 | 1 | 1 | NULL | 2015 | 56 |
| 3 | 03111092 | 1 | 1 | NULL | 2015 | 56 |
| 4 | 03111111 | 1 | 1 | 1 | 2015 | 50 |
| ... | ... | ... | ... | ... | ... | |
| 137350 | 04362422 | 5 | 2 | NULL | 2019 | 22 |
| 137351 | 04362431 | 5 | 2 | NULL | 2019 | 11 |
| 137352 | 04362432 | 5 | 2 | NULL | 2019 | 11 |
| 137353 | 04362671 | 5 | 2 | NULL | 2019 | 72 |
| 137354 | 04362672 | 5 | 2 | NULL | 2019 | 72 |
137355 rows x 6 columns
memory usage: 4.40 MB
name: HOUSEHOLD_MEMBERS
type: getml.DataFrame
year = expenditures["YEAR"]
month = expenditures["MONTH"]
ts_strings = year + "/" + month
expenditures["TIME_STAMP"] = ts_strings.as_ts(["%Y/%n"])
expenditures.set_role("HOUSEHOLD_ID", getml.data.roles.join_key)
expenditures.set_role("GIFT", getml.data.roles.target)
expenditures.set_role("COST", getml.data.roles.numerical)
expenditures.set_role(["PRODUCT_CODE", "MONTH", "YEAR"], getml.data.roles.categorical)
expenditures.set_role("TIME_STAMP", getml.data.roles.time_stamp)
expenditures.set_subroles("PRODUCT_CODE", getml.data.subroles.include.substring)
expenditures
| name | TIME_STAMP | HOUSEHOLD_ID | GIFT | MONTH | YEAR | PRODUCT_CODE | COST | IS_TRAINING | EXPENDITURE_ID |
|---|---|---|---|---|---|---|---|---|---|
| role | time_stamp | join_key | target | categorical | categorical | categorical | numerical | unused_float | unused_string |
| unit | time stamp, comparison only | ||||||||
| subroles: | |||||||||
| - include | substring | ||||||||
| 0 | 2015-01-01 | 03111041 | 0 | 1 | 2015 | 010210 | 3.89 | 1 | 1 |
| 1 | 2015-01-01 | 03111041 | 0 | 1 | 2015 | 120310 | 4.66 | 1 | 10 |
| 2 | 2015-02-01 | 03111051 | 0 | 2 | 2015 | 190211 | 9.79 | 1 | 100 |
| 3 | 2015-02-01 | 03111402 | 0 | 2 | 2015 | 040510 | 2.95 | 1 | 1000 |
| 4 | 2015-01-01 | 03114161 | 0 | 1 | 2015 | 190321 | 2.12 | 1 | 10000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | |
| 2020629 | 2017-06-01 | 03708582 | 0 | 6 | 2017 | 150110 | 1.99 | 1 | 999995 |
| 2020630 | 2017-06-01 | 03708582 | 0 | 6 | 2017 | 150110 | 3.619 | 1 | 999996 |
| 2020631 | 2017-06-01 | 03708582 | 0 | 6 | 2017 | 150211 | 5.2727 | 1 | 999997 |
| 2020632 | 2017-06-01 | 03708582 | 0 | 6 | 2017 | 150310 | 4.6894 | 1 | 999998 |
| 2020633 | 2017-06-01 | 03708582 | 0 | 6 | 2017 | 160310 | 5.7177 | 1 | 999999 |
2020634 rows x 9 columns
memory usage: 128.21 MB
name: EXPENDITURES
type: getml.DataFrame
split = expenditures.rowid.as_str().update(expenditures.IS_TRAINING == 1, "train").update(expenditures.IS_TRAINING == 0, "test")
split
| 0 | train |
|---|---|
| 1 | train |
| 2 | train |
| 3 | train |
| 4 | train |
| ... |
2020634 rows
type: StringColumnView
2. Predictive modeling¶
Enough with the data preparation. Let's get to the fun part: Extracting the features.
2.1 Defining the data model¶
First, we define the data model.
What we want to do is the following:
We want to compare every expenditure made to all expenditures by the same household (EXPD).
We want to check out whether certain kinds of items have been purchased as a gift in the past (EXPD).
We want to aggregate all available information on the individual members of the household (MEMD).
star_schema = getml.data.StarSchema(alias="POPULATION", population=expenditures, split=split)
star_schema.join(
expenditures,
on="HOUSEHOLD_ID",
time_stamps="TIME_STAMP"
)
star_schema.join(
households,
on="HOUSEHOLD_ID",
relationship=getml.data.relationship.many_to_one,
)
star_schema.join(
household_members,
on="HOUSEHOLD_ID",
)
star_schema
| data frames | staging table | |
|---|---|---|
| 0 | POPULATION, HOUSEHOLDS | POPULATION__STAGING_TABLE_1 |
| 1 | EXPENDITURES | EXPENDITURES__STAGING_TABLE_2 |
| 2 | HOUSEHOLD_MEMBERS | HOUSEHOLD_MEMBERS__STAGING_TABLE_3 |
| subset | name | rows | type | |
|---|---|---|---|---|
| 0 | test | EXPENDITURES | unknown | View |
| 1 | train | EXPENDITURES | unknown | View |
| name | rows | type | |
|---|---|---|---|
| 0 | EXPENDITURES | 2020634 | DataFrame |
| 1 | HOUSEHOLDS | 56812 | DataFrame |
| 2 | HOUSEHOLD_MEMBERS | 137355 | DataFrame |
2.2 Setting the hyperparameters¶
We use XGBoost as our predictor and FastProp (short for fast propsitionalization) to generate our features. You are free to play with the hyperparameters.
ucc1 = getml.preprocessors.Substring(0, 1)
ucc2 = getml.preprocessors.Substring(0, 2)
ucc3 = getml.preprocessors.Substring(0, 3)
ucc4 = getml.preprocessors.Substring(0, 4)
ucc5 = getml.preprocessors.Substring(0, 5)
mapping = getml.preprocessors.Mapping(multithreading=False)
fast_prop = getml.feature_learning.FastProp(
aggregation=getml.feature_learning.FastProp.agg_sets.All,
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
num_threads=1,
sampling_factor=0.1,
num_features=100,
)
relboost = getml.feature_learning.Relboost(
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
num_threads=1,
num_features=20,
)
feature_selector = getml.predictors.XGBoostClassifier()
predictor = getml.predictors.XGBoostClassifier(
booster="gbtree",
n_estimators=100,
max_depth=7,
reg_lambda=0.0,
n_jobs=1
)
pipe1 = getml.pipeline.Pipeline(
tags=["FastProp"],
data_model=star_schema.data_model,
share_selected_features=0.4,
preprocessors=[mapping],
feature_learners=fast_prop,
feature_selectors=feature_selector,
predictors=predictor
)
pipe1
Pipeline(data_model='POPULATION',
feature_learners=['FastProp'],
feature_selectors=['XGBoostClassifier'],
include_categorical=False,
loss_function='CrossEntropyLoss',
peripheral=['EXPENDITURES', 'HOUSEHOLDS', 'HOUSEHOLD_MEMBERS'],
predictors=['XGBoostClassifier'],
preprocessors=['Mapping'],
share_selected_features=0.4,
tags=['FastProp']) pipe2 = getml.pipeline.Pipeline(
tags=["Relboost"],
data_model=star_schema.data_model,
share_selected_features=0.9,
preprocessors=[ucc1, ucc2, ucc3, ucc4, ucc5, mapping],
feature_learners=relboost,
feature_selectors=feature_selector,
predictors=predictor
)
pipe2
Pipeline(data_model='POPULATION',
feature_learners=['Relboost'],
feature_selectors=['XGBoostClassifier'],
include_categorical=False,
loss_function='CrossEntropyLoss',
peripheral=['EXPENDITURES', 'HOUSEHOLDS', 'HOUSEHOLD_MEMBERS'],
predictors=['XGBoostClassifier'],
preprocessors=['Substring', 'Substring', 'Substring', 'Substring', 'Substring',
'Mapping'],
share_selected_features=0.9,
tags=['Relboost']) pipe3 = getml.pipeline.Pipeline(
tags=["FastProp", "Relboost"],
data_model=star_schema.data_model,
share_selected_features=0.2,
preprocessors=[mapping],
feature_learners=[fast_prop, relboost],
feature_selectors=feature_selector,
predictors=predictor
)
pipe3
Pipeline(data_model='POPULATION',
feature_learners=['FastProp', 'Relboost'],
feature_selectors=['XGBoostClassifier'],
include_categorical=False,
loss_function='CrossEntropyLoss',
peripheral=['EXPENDITURES', 'HOUSEHOLDS', 'HOUSEHOLD_MEMBERS'],
predictors=['XGBoostClassifier'],
preprocessors=['Mapping'],
share_selected_features=0.2,
tags=['FastProp', 'Relboost']) .fit(...) will automatically call .check(...), but it is always a good idea to call .check(...) separately, so we still have time for some last-minute fixes.
2.3 Training the pipeline¶
OK, let's go:
pipe1.check(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:23 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
pipe1.fit(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Trying 418 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:33 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 01:09 XGBoost: Training as feature selector... ━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 05:27 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 06:20
Trained pipeline.
Time taken: 0:13:34.027354.
Pipeline(data_model='POPULATION',
feature_learners=['FastProp'],
feature_selectors=['XGBoostClassifier'],
include_categorical=False,
loss_function='CrossEntropyLoss',
peripheral=['EXPENDITURES', 'HOUSEHOLDS', 'HOUSEHOLD_MEMBERS'],
predictors=['XGBoostClassifier'],
preprocessors=['Mapping'],
share_selected_features=0.4,
tags=['FastProp', 'container-xVgT7b']) pipe2.check(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 02:16 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
OK.
pipe2.fit(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:10
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:10 Relboost: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 21:26 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 18:45 XGBoost: Training as feature selector... ━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 04:11 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 06:29
Trained pipeline.
Time taken: 0:51:09.055720.
Pipeline(data_model='POPULATION',
feature_learners=['Relboost'],
feature_selectors=['XGBoostClassifier'],
include_categorical=False,
loss_function='CrossEntropyLoss',
peripheral=['EXPENDITURES', 'HOUSEHOLDS', 'HOUSEHOLD_MEMBERS'],
predictors=['XGBoostClassifier'],
preprocessors=['Substring', 'Substring', 'Substring', 'Substring', 'Substring',
'Mapping'],
share_selected_features=0.9,
tags=['Relboost', 'container-xVgT7b']) pipe3.check(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
pipe3.fit(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Retrieving features from cache... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Relboost: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 05:54 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 01:08 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 04:50 XGBoost: Training as feature selector... ━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 06:31 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 04:03
Trained pipeline.
Time taken: 0:22:31.520442.
Pipeline(data_model='POPULATION',
feature_learners=['FastProp', 'Relboost'],
feature_selectors=['XGBoostClassifier'],
include_categorical=False,
loss_function='CrossEntropyLoss',
peripheral=['EXPENDITURES', 'HOUSEHOLDS', 'HOUSEHOLD_MEMBERS'],
predictors=['XGBoostClassifier'],
preprocessors=['Mapping'],
share_selected_features=0.2,
tags=['FastProp', 'Relboost', 'container-xVgT7b']) 2.4 Evaluating the pipeline¶
We want to know how well we did. We will to an in-sample and an out-of-sample evaluation:
pipe1.score(star_schema.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:08
| date time | set used | target | accuracy | auc | cross entropy | |
|---|---|---|---|---|---|---|
| 0 | 2024-09-18 12:08:28 | train | GIFT | 0.9826 | 0.9368 | 0.05986 |
| 1 | 2024-09-18 13:43:03 | test | GIFT | 0.9804 | 0.8649 | 0.07713 |
pipe2.score(star_schema.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:03 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 02:13
| date time | set used | target | accuracy | auc | cross entropy | |
|---|---|---|---|---|---|---|
| 0 | 2024-09-18 13:02:15 | train | GIFT | 0.9822 | 0.9231 | 0.06321 |
| 1 | 2024-09-18 13:45:21 | test | GIFT | 0.9805 | 0.863 | 0.07709 |
pipe3.score(star_schema.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:03 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:08
| date time | set used | target | accuracy | auc | cross entropy | |
|---|---|---|---|---|---|---|
| 0 | 2024-09-18 13:42:53 | train | GIFT | 0.9824 | 0.9331 | 0.06092 |
| 1 | 2024-09-18 13:45:35 | test | GIFT | 0.9805 | 0.8676 | 0.07667 |
2.5 Studying the features¶
It is very important that we get an idea about the features that the propositionalization algorithm has produced.
LENGTH=50
names, correlations = pipe1.features.correlations()
plt.subplots(figsize=(20, 10))
plt.bar(names[:LENGTH], correlations[:LENGTH])
plt.title("feature correlations")
plt.grid(True)
plt.xlabel("features")
plt.ylabel("correlations")
plt.xticks(rotation='vertical')
plt.show()
LENGTH=50
names, correlations = pipe2.features.correlations()
plt.subplots(figsize=(20, 10))
plt.bar(names[:LENGTH], correlations[:LENGTH])
plt.title("feature correlations")
plt.grid(True)
plt.xlabel("features")
plt.ylabel("correlations")
plt.xticks(rotation='vertical')
plt.show()
We can express the features in SQLite3:
Because getML uses a feature learning approach, the concept of feature importances can also be carried over to the individual columns.
names, importances = pipe1.columns.importances()
plt.subplots(figsize=(20, 10))
plt.bar(names, importances)
plt.title("column importances")
plt.grid(True)
plt.xlabel("columns")
plt.ylabel("importance")
plt.xticks(rotation='vertical')
plt.show()
names, importances = pipe2.columns.importances()
plt.subplots(figsize=(20, 10))
plt.bar(names, importances)
plt.title("column importances")
plt.grid(True)
plt.xlabel("columns")
plt.ylabel("importance")
plt.xticks(rotation='vertical')
plt.show()
The most important feature of pipe1 looks as follows:
print(pipe1.features.to_sql()[pipe1.features.sort(by="importances")[0].name])
DROP TABLE IF EXISTS "PRODUCT_CODE__MAPPING_TARGET_1_AVG";
CREATE TABLE "PRODUCT_CODE__MAPPING_TARGET_1_AVG"("key" TEXT, "value" REAL);
INSERT INTO "PRODUCT_CODE__MAPPING_TARGET_1_AVG" ("key", "value")
VALUES('410901', 0.5265553869499241),
('410140', 0.5248618784530387),
('004190', 0.5073846153846154),
('410120', 0.5013123359580053),
('410110', 0.4444444444444444),
('004100', 0.3336306868867083),
('390110', 0.3132530120481928),
('390120', 0.3067484662576687),
('410130', 0.2967448902346707),
('370110', 0.2948717948717949),
('370212', 0.2944444444444445),
('370220', 0.2920353982300885),
('680140', 0.288135593220339),
('390322', 0.2795918367346939),
('390321', 0.2764227642276423),
('370901', 0.271948608137045),
('390210', 0.2579837194740138),
('370125', 0.2519157088122606),
('390310', 0.2443181818181818),
('390223', 0.2344706911636046),
('390230', 0.2238442822384428),
('370211', 0.2185714285714286),
('370314', 0.2182952182952183),
('400220', 0.2164179104477612),
('610110', 0.2162868883078072),
('360320', 0.2151898734177215),
('590220', 0.2075471698113208),
('370213', 0.2015968063872255),
('400210', 0.1944764096662831),
('430120', 0.194006309148265),
('320130', 0.1899441340782123),
('390901', 0.1797752808988764),
('330410', 0.1751831107281344),
('380410', 0.1386392811296534),
('590230', 0.13469068128426),
('360350', 0.1321279554937413),
('360210', 0.1305555555555556),
('290420', 0.1282051282051282),
('280220', 0.1231884057971015),
('320903', 0.1229724632214259),
('360420', 0.1222091656874266),
('005000', 0.1219512195121951),
('660900', 0.1205479452054795),
('320345', 0.1176205497972059),
('610902', 0.1162790697674419),
('660110', 0.111731843575419),
('600900', 0.1111111111111111),
('670110', 0.1111111111111111),
('320233', 0.1108969866853539),
('610230', 0.11),
('660210', 0.1097922848664688),
('610901', 0.1097560975609756),
('380510', 0.1081081081081081),
('290310', 0.1044776119402985),
('280120', 0.1030640668523677),
('380901', 0.1010141987829615),
('320521', 0.1009174311926606),
('360330', 0.1004366812227074),
('360311', 0.09981167608286252),
('430110', 0.09863945578231292),
('300320', 0.0975609756097561),
('360312', 0.09716599190283401),
('660000', 0.09413886384129846),
('600430', 0.09302325581395349),
('380110', 0.09302325581395349),
('310231', 0.09090909090909091),
('004000', 0.08723998758149643),
('600410', 0.08408408408408409),
('310210', 0.08333333333333333),
('340120', 0.08333333333333333),
('430130', 0.08226221079691516),
('600210', 0.08190476190476191),
('380315', 0.08014981273408239),
('610120', 0.07865168539325842),
('620610', 0.07755102040816327),
('360513', 0.07722969606377678),
('280140', 0.07646356033452807),
('320380', 0.07645788336933046),
('620213', 0.07375643224699828),
('620510', 0.07370393504059962),
('380430', 0.07358390682901006),
('310316', 0.07329842931937172),
('310232', 0.0726950354609929),
('200210', 0.07258064516129033),
('530110', 0.07235621521335807),
('380320', 0.07142857142857142),
('640420', 0.06923076923076923),
('620214', 0.0689900426742532),
('610130', 0.06882591093117409),
('290410', 0.06748466257668712),
('380420', 0.06734816596512327),
('400310', 0.0672059738643435),
('690117', 0.06666666666666667),
('610903', 0.06578947368421052),
('310220', 0.06555863342566944),
('320330', 0.06554307116104868),
('400110', 0.06538692261547691),
('640120', 0.06442953020134229),
('690230', 0.0641025641025641),
('620330', 0.06329113924050633),
('420115', 0.06281407035175879),
('380311', 0.0625),
('310340', 0.06231454005934718),
('320370', 0.06196746707978312),
('380340', 0.06157635467980296),
('380210', 0.06014492753623189),
('620112', 0.05970149253731343),
('340110', 0.05929824561403509),
('320901', 0.05747126436781609),
('280110', 0.05726600985221675),
('290120', 0.05673758865248227),
('320150', 0.05652173913043478),
('240220', 0.05647840531561462),
('340907', 0.05555555555555555),
('600310', 0.05521472392638037),
('320221', 0.05381727158948686),
('320522', 0.05371900826446281),
('620913', 0.05333333333333334),
('340510', 0.052734375),
('640130', 0.05263157894736842),
('310332', 0.05128205128205128),
('320232', 0.05029013539651837),
('380333', 0.0501577287066246),
('690118', 0.05),
('670903', 0.04895104895104895),
('320905', 0.04766031195840555),
('320627', 0.04761904761904762),
('320902', 0.04666666666666667),
('690110', 0.04666666666666667),
('150110', 0.04635643740546312),
('620221', 0.04615384615384616),
('670901', 0.04597701149425287),
('001000', 0.04587155963302753),
('670310', 0.04553734061930783),
('340610', 0.04444444444444445),
('200410', 0.04397394136807817),
('300900', 0.04375),
('610320', 0.04300578034682081),
('300110', 0.0425531914893617),
('002000', 0.0418848167539267),
('680220', 0.04184704184704185),
('570901', 0.04081632653061224),
('280210', 0.04081632653061224),
('600420', 0.04044489383215369),
('320420', 0.0400890868596882),
('290440', 0.038860103626943),
('200310', 0.03872966692486444),
('310900', 0.0380952380952381),
('520550', 0.03773584905660377),
('690116', 0.03773584905660377),
('020410', 0.03773262762506403),
('440130', 0.03759398496240601),
('380902', 0.03571428571428571),
('550320', 0.03547297297297297),
('290110', 0.03539823008849557),
('590210', 0.03476151980598222),
('320904', 0.03454231433506045),
('490311', 0.03448275862068965),
('620310', 0.03422053231939164),
('220000', 0.03418803418803419),
('320120', 0.03355704697986577),
('240310', 0.03343949044585987),
('310351', 0.03333333333333333),
('640310', 0.03329679364209372),
('670902', 0.03174603174603174),
('680903', 0.03137789904502047),
('310140', 0.0308641975308642),
('620420', 0.03061224489795918),
('630220', 0.03052325581395349),
('330610', 0.03022860381636123),
('330510', 0.02971188475390156),
('180620', 0.02942668696093353),
('240900', 0.02941176470588235),
('550330', 0.02935420743639922),
('320610', 0.02929427430093209),
('620710', 0.02877697841726619),
('290320', 0.02877697841726619),
('200111', 0.02867072111207646),
('240320', 0.02842928216062544),
('310352', 0.02838427947598253),
('320410', 0.02791625124626122),
('300218', 0.02777777777777778),
('320110', 0.02768166089965398),
('620121', 0.02765208647561589),
('340210', 0.02722323049001815),
('240210', 0.02707581227436823),
('440150', 0.02702702702702703),
('320140', 0.02697022767075306),
('640220', 0.02683461117196057),
('640410', 0.026232741617357),
('310335', 0.02593659942363112),
('490315', 0.02564102564102564),
('340901', 0.02542372881355932),
('610310', 0.02461584365209608),
('680110', 0.02362204724409449),
('340903', 0.0234375),
('480213', 0.0231811697574893),
('320430', 0.02272727272727273),
('230000', 0.02272727272727273),
('640210', 0.02267002518891688),
('550310', 0.02246796559592768),
('490110', 0.02173913043478261),
('620410', 0.02165087956698241),
('340913', 0.02127659574468085),
('340906', 0.02127659574468085),
('590110', 0.0209366391184573),
('620810', 0.02090592334494774),
('020710', 0.02085600290170475),
('620926', 0.02076875387476751),
('480212', 0.02055622732769045),
('020510', 0.0202097074243193),
('650210', 0.02016868353502017),
('530510', 0.02005730659025788),
('520310', 0.02),
('480110', 0.01970443349753695),
('550110', 0.0194300518134715),
('650110', 0.0190424374319913),
('320511', 0.01829268292682927),
('240120', 0.01818181818181818),
('040610', 0.01785714285714286),
('170531', 0.0177293934681182),
('550210', 0.01761658031088083),
('290430', 0.01748251748251748),
('002100', 0.01715481171548117),
('150310', 0.01708217913204063),
('560310', 0.01682692307692308),
('640110', 0.01674500587544066),
('640430', 0.01648351648351648),
('570000', 0.01633393829401089),
('240110', 0.0162052667116813),
('690119', 0.01618122977346278),
('630110', 0.0158344666796192),
('330310', 0.01570146818923328),
('020820', 0.01567783584383646),
('130320', 0.0156165858912224),
('630210', 0.0155902004454343),
('020610', 0.01553829078801332),
('010120', 0.01547231270358306),
('180310', 0.01535880227155395),
('550410', 0.01529571719918423),
('360110', 0.01515151515151515),
('620114', 0.01492537313432836),
('440210', 0.01488095238095238),
('470220', 0.01478743068391867),
('620111', 0.01471389645776567),
('330210', 0.01441871961769795),
('140320', 0.01423487544483986),
('340520', 0.01411100658513641),
('560210', 0.01355661881977671),
('530311', 0.01341184167484462),
('330110', 0.01330895052321447),
('050900', 0.0131578947368421),
('250900', 0.01309707241910632),
('690120', 0.01305483028720627),
('490300', 0.01298701298701299),
('180320', 0.01298701298701299),
('170533', 0.01296982530439386),
('540000', 0.01271259233808624),
('170510', 0.01269971323228185),
('620930', 0.01252609603340292),
('340410', 0.01241642788920726),
('270000', 0.01241039905852145),
('520110', 0.01237964236588721),
('560400', 0.01210898082744702),
('180612', 0.01201452919810003),
('620320', 0.01185770750988142),
('470211', 0.01179941002949852),
('180520', 0.01179574732267577),
('100410', 0.01164329187615771),
('310331', 0.01162790697674419),
('530412', 0.01158504476040021),
('020810', 0.01154575219713941),
('530210', 0.01152737752161383),
('220110', 0.01149425287356322),
('320630', 0.01142857142857143),
('520531', 0.01112484548825711),
('180710', 0.01103708190322364),
('030810', 0.01092896174863388),
('130310', 0.01086556169429098),
('170210', 0.01082262080178853),
('340620', 0.01075268817204301),
('999900', 0.01062416998671979),
('030210', 0.01055662188099808),
('030510', 0.01044277360066834),
('170110', 0.01034780109226789),
('220210', 0.01027397260273973),
('680902', 0.01025641025641026),
('020310', 0.01021667580910587),
('130212', 0.009969657563935847),
('030710', 0.009891435464414958),
('140420', 0.009844993715961458),
('560330', 0.009771986970684038),
('270210', 0.009420631182289214),
('140220', 0.009351432880844645),
('160320', 0.00933609958506224),
('560110', 0.009322560596643879),
('170520', 0.009291360421578144),
('230110', 0.009202453987730062),
('170310', 0.009154113557358054),
('180110', 0.009134615384615385),
('140210', 0.009130282102305981),
('160212', 0.009098914000587027),
('050410', 0.008833922261484099),
('100210', 0.008741319144525446),
('170532', 0.008554705087798289),
('620912', 0.008553654743390357),
('090210', 0.008506616257088847),
('490000', 0.008489564909798374),
('170410', 0.008431932544539644),
('210210', 0.00823045267489712),
('020620', 0.008152173913043478),
('340310', 0.008032128514056224),
('110410', 0.007990834884720034),
('490312', 0.007977207977207978),
('210110', 0.007972665148063782),
('180420', 0.007866728366496992),
('180220', 0.007703887363853715),
('010210', 0.007637017070979336),
('180510', 0.007588713125267208),
('470111', 0.007556238768484639),
('060310', 0.007518796992481203),
('050310', 0.007514761137949544),
('030610', 0.007317073170731708),
('180611', 0.007287611061195967),
('010320', 0.007257694074414332),
('500110', 0.007106598984771574),
('040510', 0.006984459577440196),
('110310', 0.006973269134982567),
('250220', 0.006944444444444444),
('580000', 0.006857142857142857),
('020210', 0.006824146981627296),
('180210', 0.006806282722513089),
('040410', 0.006790744466800805),
('050110', 0.00675990675990676),
('010110', 0.006644518272425249),
('180410', 0.006634078212290503),
('140230', 0.00663265306122449),
('050210', 0.00662133142448103),
('160310', 0.006574892130675981),
('020110', 0.006501360749924402),
('070110', 0.006377551020408163),
('030310', 0.00625),
('120310', 0.006177540831006178),
('100510', 0.006119326874043855),
('030410', 0.006116207951070336),
('690114', 0.006105834464043419),
('110510', 0.005989518342899925),
('160211', 0.005981308411214953),
('150211', 0.005960568546538285),
('130211', 0.005947955390334572),
('520541', 0.005911778080945885),
('120210', 0.005798018131983976),
('040110', 0.005780346820809248),
('260110', 0.005772763054316453),
('070240', 0.005749668288367979),
('090110', 0.005704227647576519),
('110210', 0.005692403229145104),
('030110', 0.005622410731899783),
('260210', 0.0055542698449433),
('080110', 0.005548549810844893),
('120110', 0.005436931593515224),
('040310', 0.005404077622205846),
('250210', 0.005342831700801425),
('010310', 0.005331627212625293),
('440120', 0.005319148936170213),
('100110', 0.005308219178082192),
('470112', 0.005277044854881266),
('110110', 0.005152378864284149),
('160110', 0.005109489051094891),
('270410', 0.00496031746031746),
('060110', 0.004922542348342262),
('520516', 0.004901960784313725),
('270310', 0.004885574697865775),
('120410', 0.004865350089766607),
('220120', 0.004815409309791332),
('040210', 0.004786324786324786),
('070230', 0.004725554343874954),
('130110', 0.004694835680751174),
('140110', 0.004555336991406978),
('340530', 0.004530011325028313),
('060210', 0.00400114318376679),
('230900', 0.003992015968063872),
('520410', 0.003937007874015748),
('140340', 0.003897369275738876),
('490313', 0.003875968992248062),
('009000', 0.002952029520295203),
('350110', 0.002881844380403458),
('140330', 0.002380952380952381),
('130122', 0.002169197396963124),
('150212', 0.001451378809869376),
('130121', 0.001373626373626374),
('190323', 0.0009389671361502347),
('190311', 0.0008796003096193089),
('200532', 0.0005934718100890207),
('190312', 0.0005761198329252485),
('190314', 0.0004549590536851683),
('190324', 0.0004541326067211626),
('200522', 0.0004464285714285714),
('190212', 0.0004089793692629283),
('190114', 0.0003787878787878788),
('190112', 0.0003610760064993681),
('190322', 0.0002765869174388052),
('190211', 0.0002144925463840132),
('190111', 0.0002058036633052068),
('200512', 0.0001853911753800519),
('190321', 7.427213309566251e-05),
('440140', 0),
('200112', 0),
('620925', 0),
('250110', 0),
('200531', 0),
('310242', 0),
('600130', 0),
('580901', 0),
('200521', 0),
('490316', 0),
('200523', 0),
('190113', 0),
('310241', 0),
('550340', 0),
('450350', 0),
('190214', 0),
('300410', 0),
('530903', 0),
('200513', 0),
('140410', 0),
('002200', 0),
('630900', 0),
('680210', 0),
('290210', 0),
('140310', 0),
('200533', 0),
('440110', 0),
('190313', 0),
('190213', 0),
('270311', 0),
('270900', 0),
('200511', 0);
ALTER TABLE "POPULATION__STAGING_TABLE_1" ADD COLUMN "product_code__mapping_target_1_avg" REAL;
UPDATE "POPULATION__STAGING_TABLE_1" SET "product_code__mapping_target_1_avg" = 0.0;
UPDATE "POPULATION__STAGING_TABLE_1"
SET "product_code__mapping_target_1_avg" = "PRODUCT_CODE__MAPPING_TARGET_1_AVG"."value"
FROM "PRODUCT_CODE__MAPPING_TARGET_1_AVG"
WHERE "POPULATION__STAGING_TABLE_1"."product_code" = "PRODUCT_CODE__MAPPING_TARGET_1_AVG"."key";
DROP TABLE IF EXISTS "PRODUCT_CODE__MAPPING_TARGET_1_AVG";
Similarly their SQL statements can be produced for pipe2 and pipe3 with:
print(pipe2.features.to_sql()[pipe2.features.sort(by="importances")[0].name])
DROP TABLE IF EXISTS "PRODUCT_CODE__MAPPING_TARGET_1_AVG";
CREATE TABLE "PRODUCT_CODE__MAPPING_TARGET_1_AVG"("key" TEXT, "value" REAL);
INSERT INTO "PRODUCT_CODE__MAPPING_TARGET_1_AVG" ("key", "value")
VALUES('410901', 0.5265553869499241),
('410140', 0.5248618784530387),
('004190', 0.5073846153846154),
('410120', 0.5013123359580053),
('410110', 0.4444444444444444),
('004100', 0.3336306868867083),
('390110', 0.3132530120481928),
('390120', 0.3067484662576687),
('410130', 0.2967448902346707),
('370110', 0.2948717948717949),
('370212', 0.2944444444444445),
('370220', 0.2920353982300885),
('680140', 0.288135593220339),
('390322', 0.2795918367346939),
('390321', 0.2764227642276423),
('370901', 0.271948608137045),
('390210', 0.2579837194740138),
('370125', 0.2519157088122606),
('390310', 0.2443181818181818),
('390223', 0.2344706911636046),
('390230', 0.2238442822384428),
('370211', 0.2185714285714286),
('370314', 0.2182952182952183),
('400220', 0.2164179104477612),
('610110', 0.2162868883078072),
('360320', 0.2151898734177215),
('590220', 0.2075471698113208),
('370213', 0.2015968063872255),
('400210', 0.1944764096662831),
('430120', 0.194006309148265),
('320130', 0.1899441340782123),
('390901', 0.1797752808988764),
('330410', 0.1751831107281344),
('380410', 0.1386392811296534),
('590230', 0.13469068128426),
('360350', 0.1321279554937413),
('360210', 0.1305555555555556),
('290420', 0.1282051282051282),
('280220', 0.1231884057971015),
('320903', 0.1229724632214259),
('360420', 0.1222091656874266),
('005000', 0.1219512195121951),
('660900', 0.1205479452054795),
('320345', 0.1176205497972059),
('610902', 0.1162790697674419),
('660110', 0.111731843575419),
('600900', 0.1111111111111111),
('670110', 0.1111111111111111),
('320233', 0.1108969866853539),
('610230', 0.11),
('660210', 0.1097922848664688),
('610901', 0.1097560975609756),
('380510', 0.1081081081081081),
('290310', 0.1044776119402985),
('280120', 0.1030640668523677),
('380901', 0.1010141987829615),
('320521', 0.1009174311926606),
('360330', 0.1004366812227074),
('360311', 0.09981167608286252),
('430110', 0.09863945578231292),
('300320', 0.0975609756097561),
('360312', 0.09716599190283401),
('660000', 0.09413886384129846),
('600430', 0.09302325581395349),
('380110', 0.09302325581395349),
('310231', 0.09090909090909091),
('004000', 0.08723998758149643),
('600410', 0.08408408408408409),
('310210', 0.08333333333333333),
('340120', 0.08333333333333333),
('430130', 0.08226221079691516),
('600210', 0.08190476190476191),
('380315', 0.08014981273408239),
('610120', 0.07865168539325842),
('620610', 0.07755102040816327),
('360513', 0.07722969606377678),
('280140', 0.07646356033452807),
('320380', 0.07645788336933046),
('620213', 0.07375643224699828),
('620510', 0.07370393504059962),
('380430', 0.07358390682901006),
('310316', 0.07329842931937172),
('310232', 0.0726950354609929),
('200210', 0.07258064516129033),
('530110', 0.07235621521335807),
('380320', 0.07142857142857142),
('640420', 0.06923076923076923),
('620214', 0.0689900426742532),
('610130', 0.06882591093117409),
('290410', 0.06748466257668712),
('380420', 0.06734816596512327),
('400310', 0.0672059738643435),
('690117', 0.06666666666666667),
('610903', 0.06578947368421052),
('310220', 0.06555863342566944),
('320330', 0.06554307116104868),
('400110', 0.06538692261547691),
('640120', 0.06442953020134229),
('690230', 0.0641025641025641),
('620330', 0.06329113924050633),
('420115', 0.06281407035175879),
('380311', 0.0625),
('310340', 0.06231454005934718),
('320370', 0.06196746707978312),
('380340', 0.06157635467980296),
('380210', 0.06014492753623189),
('620112', 0.05970149253731343),
('340110', 0.05929824561403509),
('320901', 0.05747126436781609),
('280110', 0.05726600985221675),
('290120', 0.05673758865248227),
('320150', 0.05652173913043478),
('240220', 0.05647840531561462),
('340907', 0.05555555555555555),
('600310', 0.05521472392638037),
('320221', 0.05381727158948686),
('320522', 0.05371900826446281),
('620913', 0.05333333333333334),
('340510', 0.052734375),
('640130', 0.05263157894736842),
('310332', 0.05128205128205128),
('320232', 0.05029013539651837),
('380333', 0.0501577287066246),
('690118', 0.05),
('670903', 0.04895104895104895),
('320905', 0.04766031195840555),
('320627', 0.04761904761904762),
('320902', 0.04666666666666667),
('690110', 0.04666666666666667),
('150110', 0.04635643740546312),
('620221', 0.04615384615384616),
('670901', 0.04597701149425287),
('001000', 0.04587155963302753),
('670310', 0.04553734061930783),
('340610', 0.04444444444444445),
('200410', 0.04397394136807817),
('300900', 0.04375),
('610320', 0.04300578034682081),
('300110', 0.0425531914893617),
('002000', 0.0418848167539267),
('680220', 0.04184704184704185),
('570901', 0.04081632653061224),
('280210', 0.04081632653061224),
('600420', 0.04044489383215369),
('320420', 0.0400890868596882),
('290440', 0.038860103626943),
('200310', 0.03872966692486444),
('310900', 0.0380952380952381),
('520550', 0.03773584905660377),
('690116', 0.03773584905660377),
('020410', 0.03773262762506403),
('440130', 0.03759398496240601),
('380902', 0.03571428571428571),
('550320', 0.03547297297297297),
('290110', 0.03539823008849557),
('590210', 0.03476151980598222),
('320904', 0.03454231433506045),
('490311', 0.03448275862068965),
('620310', 0.03422053231939164),
('220000', 0.03418803418803419),
('320120', 0.03355704697986577),
('240310', 0.03343949044585987),
('310351', 0.03333333333333333),
('640310', 0.03329679364209372),
('670902', 0.03174603174603174),
('680903', 0.03137789904502047),
('310140', 0.0308641975308642),
('620420', 0.03061224489795918),
('630220', 0.03052325581395349),
('330610', 0.03022860381636123),
('330510', 0.02971188475390156),
('180620', 0.02942668696093353),
('240900', 0.02941176470588235),
('550330', 0.02935420743639922),
('320610', 0.02929427430093209),
('620710', 0.02877697841726619),
('290320', 0.02877697841726619),
('200111', 0.02867072111207646),
('240320', 0.02842928216062544),
('310352', 0.02838427947598253),
('320410', 0.02791625124626122),
('300218', 0.02777777777777778),
('320110', 0.02768166089965398),
('620121', 0.02765208647561589),
('340210', 0.02722323049001815),
('240210', 0.02707581227436823),
('440150', 0.02702702702702703),
('320140', 0.02697022767075306),
('640220', 0.02683461117196057),
('640410', 0.026232741617357),
('310335', 0.02593659942363112),
('490315', 0.02564102564102564),
('340901', 0.02542372881355932),
('610310', 0.02461584365209608),
('680110', 0.02362204724409449),
('340903', 0.0234375),
('480213', 0.0231811697574893),
('320430', 0.02272727272727273),
('230000', 0.02272727272727273),
('640210', 0.02267002518891688),
('550310', 0.02246796559592768),
('490110', 0.02173913043478261),
('620410', 0.02165087956698241),
('340913', 0.02127659574468085),
('340906', 0.02127659574468085),
('590110', 0.0209366391184573),
('620810', 0.02090592334494774),
('020710', 0.02085600290170475),
('620926', 0.02076875387476751),
('480212', 0.02055622732769045),
('020510', 0.0202097074243193),
('650210', 0.02016868353502017),
('530510', 0.02005730659025788),
('520310', 0.02),
('480110', 0.01970443349753695),
('550110', 0.0194300518134715),
('650110', 0.0190424374319913),
('320511', 0.01829268292682927),
('240120', 0.01818181818181818),
('040610', 0.01785714285714286),
('170531', 0.0177293934681182),
('550210', 0.01761658031088083),
('290430', 0.01748251748251748),
('002100', 0.01715481171548117),
('150310', 0.01708217913204063),
('560310', 0.01682692307692308),
('640110', 0.01674500587544066),
('640430', 0.01648351648351648),
('570000', 0.01633393829401089),
('240110', 0.0162052667116813),
('690119', 0.01618122977346278),
('630110', 0.0158344666796192),
('330310', 0.01570146818923328),
('020820', 0.01567783584383646),
('130320', 0.0156165858912224),
('630210', 0.0155902004454343),
('020610', 0.01553829078801332),
('010120', 0.01547231270358306),
('180310', 0.01535880227155395),
('550410', 0.01529571719918423),
('360110', 0.01515151515151515),
('620114', 0.01492537313432836),
('440210', 0.01488095238095238),
('470220', 0.01478743068391867),
('620111', 0.01471389645776567),
('330210', 0.01441871961769795),
('140320', 0.01423487544483986),
('340520', 0.01411100658513641),
('560210', 0.01355661881977671),
('530311', 0.01341184167484462),
('330110', 0.01330895052321447),
('050900', 0.0131578947368421),
('250900', 0.01309707241910632),
('690120', 0.01305483028720627),
('490300', 0.01298701298701299),
('180320', 0.01298701298701299),
('170533', 0.01296982530439386),
('540000', 0.01271259233808624),
('170510', 0.01269971323228185),
('620930', 0.01252609603340292),
('340410', 0.01241642788920726),
('270000', 0.01241039905852145),
('520110', 0.01237964236588721),
('560400', 0.01210898082744702),
('180612', 0.01201452919810003),
('620320', 0.01185770750988142),
('470211', 0.01179941002949852),
('180520', 0.01179574732267577),
('100410', 0.01164329187615771),
('310331', 0.01162790697674419),
('530412', 0.01158504476040021),
('020810', 0.01154575219713941),
('530210', 0.01152737752161383),
('220110', 0.01149425287356322),
('320630', 0.01142857142857143),
('520531', 0.01112484548825711),
('180710', 0.01103708190322364),
('030810', 0.01092896174863388),
('130310', 0.01086556169429098),
('170210', 0.01082262080178853),
('340620', 0.01075268817204301),
('999900', 0.01062416998671979),
('030210', 0.01055662188099808),
('030510', 0.01044277360066834),
('170110', 0.01034780109226789),
('220210', 0.01027397260273973),
('680902', 0.01025641025641026),
('020310', 0.01021667580910587),
('130212', 0.009969657563935847),
('030710', 0.009891435464414958),
('140420', 0.009844993715961458),
('560330', 0.009771986970684038),
('270210', 0.009420631182289214),
('140220', 0.009351432880844645),
('160320', 0.00933609958506224),
('560110', 0.009322560596643879),
('170520', 0.009291360421578144),
('230110', 0.009202453987730062),
('170310', 0.009154113557358054),
('180110', 0.009134615384615385),
('140210', 0.009130282102305981),
('160212', 0.009098914000587027),
('050410', 0.008833922261484099),
('100210', 0.008741319144525446),
('170532', 0.008554705087798289),
('620912', 0.008553654743390357),
('090210', 0.008506616257088847),
('490000', 0.008489564909798374),
('170410', 0.008431932544539644),
('210210', 0.00823045267489712),
('020620', 0.008152173913043478),
('340310', 0.008032128514056224),
('110410', 0.007990834884720034),
('490312', 0.007977207977207978),
('210110', 0.007972665148063782),
('180420', 0.007866728366496992),
('180220', 0.007703887363853715),
('010210', 0.007637017070979336),
('180510', 0.007588713125267208),
('470111', 0.007556238768484639),
('060310', 0.007518796992481203),
('050310', 0.007514761137949544),
('030610', 0.007317073170731708),
('180611', 0.007287611061195967),
('010320', 0.007257694074414332),
('500110', 0.007106598984771574),
('040510', 0.006984459577440196),
('110310', 0.006973269134982567),
('250220', 0.006944444444444444),
('580000', 0.006857142857142857),
('020210', 0.006824146981627296),
('180210', 0.006806282722513089),
('040410', 0.006790744466800805),
('050110', 0.00675990675990676),
('010110', 0.006644518272425249),
('180410', 0.006634078212290503),
('140230', 0.00663265306122449),
('050210', 0.00662133142448103),
('160310', 0.006574892130675981),
('020110', 0.006501360749924402),
('070110', 0.006377551020408163),
('030310', 0.00625),
('120310', 0.006177540831006178),
('100510', 0.006119326874043855),
('030410', 0.006116207951070336),
('690114', 0.006105834464043419),
('110510', 0.005989518342899925),
('160211', 0.005981308411214953),
('150211', 0.005960568546538285),
('130211', 0.005947955390334572),
('520541', 0.005911778080945885),
('120210', 0.005798018131983976),
('040110', 0.005780346820809248),
('260110', 0.005772763054316453),
('070240', 0.005749668288367979),
('090110', 0.005704227647576519),
('110210', 0.005692403229145104),
('030110', 0.005622410731899783),
('260210', 0.0055542698449433),
('080110', 0.005548549810844893),
('120110', 0.005436931593515224),
('040310', 0.005404077622205846),
('250210', 0.005342831700801425),
('010310', 0.005331627212625293),
('440120', 0.005319148936170213),
('100110', 0.005308219178082192),
('470112', 0.005277044854881266),
('110110', 0.005152378864284149),
('160110', 0.005109489051094891),
('270410', 0.00496031746031746),
('060110', 0.004922542348342262),
('520516', 0.004901960784313725),
('270310', 0.004885574697865775),
('120410', 0.004865350089766607),
('220120', 0.004815409309791332),
('040210', 0.004786324786324786),
('070230', 0.004725554343874954),
('130110', 0.004694835680751174),
('140110', 0.004555336991406978),
('340530', 0.004530011325028313),
('060210', 0.00400114318376679),
('230900', 0.003992015968063872),
('520410', 0.003937007874015748),
('140340', 0.003897369275738876),
('490313', 0.003875968992248062),
('009000', 0.002952029520295203),
('350110', 0.002881844380403458),
('140330', 0.002380952380952381),
('130122', 0.002169197396963124),
('150212', 0.001451378809869376),
('130121', 0.001373626373626374),
('190323', 0.0009389671361502347),
('190311', 0.0008796003096193089),
('200532', 0.0005934718100890207),
('190312', 0.0005761198329252485),
('190314', 0.0004549590536851683),
('190324', 0.0004541326067211626),
('200522', 0.0004464285714285714),
('190212', 0.0004089793692629283),
('190114', 0.0003787878787878788),
('190112', 0.0003610760064993681),
('190322', 0.0002765869174388052),
('190211', 0.0002144925463840132),
('190111', 0.0002058036633052068),
('200512', 0.0001853911753800519),
('190321', 7.427213309566251e-05),
('440140', 0),
('200112', 0),
('620925', 0),
('250110', 0),
('200531', 0),
('310242', 0),
('600130', 0),
('580901', 0),
('200521', 0),
('490316', 0),
('200523', 0),
('190113', 0),
('310241', 0),
('550340', 0),
('450350', 0),
('190214', 0),
('300410', 0),
('530903', 0),
('200513', 0),
('140410', 0),
('002200', 0),
('630900', 0),
('680210', 0),
('290210', 0),
('140310', 0),
('200533', 0),
('440110', 0),
('190313', 0),
('190213', 0),
('270311', 0),
('270900', 0),
('200511', 0);
ALTER TABLE "POPULATION__STAGING_TABLE_1" ADD COLUMN "product_code__mapping_target_1_avg" REAL;
UPDATE "POPULATION__STAGING_TABLE_1" SET "product_code__mapping_target_1_avg" = 0.0;
UPDATE "POPULATION__STAGING_TABLE_1"
SET "product_code__mapping_target_1_avg" = "PRODUCT_CODE__MAPPING_TARGET_1_AVG"."value"
FROM "PRODUCT_CODE__MAPPING_TARGET_1_AVG"
WHERE "POPULATION__STAGING_TABLE_1"."product_code" = "PRODUCT_CODE__MAPPING_TARGET_1_AVG"."key";
DROP TABLE IF EXISTS "PRODUCT_CODE__MAPPING_TARGET_1_AVG";
print(pipe3.features.to_sql()[pipe2.features.sort(by="importances")[0].name])
DROP TABLE IF EXISTS "PRODUCT_CODE__MAPPING_TARGET_1_AVG";
CREATE TABLE "PRODUCT_CODE__MAPPING_TARGET_1_AVG"("key" TEXT, "value" REAL);
INSERT INTO "PRODUCT_CODE__MAPPING_TARGET_1_AVG" ("key", "value")
VALUES('410901', 0.5265553869499241),
('410140', 0.5248618784530387),
('004190', 0.5073846153846154),
('410120', 0.5013123359580053),
('410110', 0.4444444444444444),
('004100', 0.3336306868867083),
('390110', 0.3132530120481928),
('390120', 0.3067484662576687),
('410130', 0.2967448902346707),
('370110', 0.2948717948717949),
('370212', 0.2944444444444445),
('370220', 0.2920353982300885),
('680140', 0.288135593220339),
('390322', 0.2795918367346939),
('390321', 0.2764227642276423),
('370901', 0.271948608137045),
('390210', 0.2579837194740138),
('370125', 0.2519157088122606),
('390310', 0.2443181818181818),
('390223', 0.2344706911636046),
('390230', 0.2238442822384428),
('370211', 0.2185714285714286),
('370314', 0.2182952182952183),
('400220', 0.2164179104477612),
('610110', 0.2162868883078072),
('360320', 0.2151898734177215),
('590220', 0.2075471698113208),
('370213', 0.2015968063872255),
('400210', 0.1944764096662831),
('430120', 0.194006309148265),
('320130', 0.1899441340782123),
('390901', 0.1797752808988764),
('330410', 0.1751831107281344),
('380410', 0.1386392811296534),
('590230', 0.13469068128426),
('360350', 0.1321279554937413),
('360210', 0.1305555555555556),
('290420', 0.1282051282051282),
('280220', 0.1231884057971015),
('320903', 0.1229724632214259),
('360420', 0.1222091656874266),
('005000', 0.1219512195121951),
('660900', 0.1205479452054795),
('320345', 0.1176205497972059),
('610902', 0.1162790697674419),
('660110', 0.111731843575419),
('600900', 0.1111111111111111),
('670110', 0.1111111111111111),
('320233', 0.1108969866853539),
('610230', 0.11),
('660210', 0.1097922848664688),
('610901', 0.1097560975609756),
('380510', 0.1081081081081081),
('290310', 0.1044776119402985),
('280120', 0.1030640668523677),
('380901', 0.1010141987829615),
('320521', 0.1009174311926606),
('360330', 0.1004366812227074),
('360311', 0.09981167608286252),
('430110', 0.09863945578231292),
('300320', 0.0975609756097561),
('360312', 0.09716599190283401),
('660000', 0.09413886384129846),
('600430', 0.09302325581395349),
('380110', 0.09302325581395349),
('310231', 0.09090909090909091),
('004000', 0.08723998758149643),
('600410', 0.08408408408408409),
('310210', 0.08333333333333333),
('340120', 0.08333333333333333),
('430130', 0.08226221079691516),
('600210', 0.08190476190476191),
('380315', 0.08014981273408239),
('610120', 0.07865168539325842),
('620610', 0.07755102040816327),
('360513', 0.07722969606377678),
('280140', 0.07646356033452807),
('320380', 0.07645788336933046),
('620213', 0.07375643224699828),
('620510', 0.07370393504059962),
('380430', 0.07358390682901006),
('310316', 0.07329842931937172),
('310232', 0.0726950354609929),
('200210', 0.07258064516129033),
('530110', 0.07235621521335807),
('380320', 0.07142857142857142),
('640420', 0.06923076923076923),
('620214', 0.0689900426742532),
('610130', 0.06882591093117409),
('290410', 0.06748466257668712),
('380420', 0.06734816596512327),
('400310', 0.0672059738643435),
('690117', 0.06666666666666667),
('610903', 0.06578947368421052),
('310220', 0.06555863342566944),
('320330', 0.06554307116104868),
('400110', 0.06538692261547691),
('640120', 0.06442953020134229),
('690230', 0.0641025641025641),
('620330', 0.06329113924050633),
('420115', 0.06281407035175879),
('380311', 0.0625),
('310340', 0.06231454005934718),
('320370', 0.06196746707978312),
('380340', 0.06157635467980296),
('380210', 0.06014492753623189),
('620112', 0.05970149253731343),
('340110', 0.05929824561403509),
('320901', 0.05747126436781609),
('280110', 0.05726600985221675),
('290120', 0.05673758865248227),
('320150', 0.05652173913043478),
('240220', 0.05647840531561462),
('340907', 0.05555555555555555),
('600310', 0.05521472392638037),
('320221', 0.05381727158948686),
('320522', 0.05371900826446281),
('620913', 0.05333333333333334),
('340510', 0.052734375),
('640130', 0.05263157894736842),
('310332', 0.05128205128205128),
('320232', 0.05029013539651837),
('380333', 0.0501577287066246),
('690118', 0.05),
('670903', 0.04895104895104895),
('320905', 0.04766031195840555),
('320627', 0.04761904761904762),
('320902', 0.04666666666666667),
('690110', 0.04666666666666667),
('150110', 0.04635643740546312),
('620221', 0.04615384615384616),
('670901', 0.04597701149425287),
('001000', 0.04587155963302753),
('670310', 0.04553734061930783),
('340610', 0.04444444444444445),
('200410', 0.04397394136807817),
('300900', 0.04375),
('610320', 0.04300578034682081),
('300110', 0.0425531914893617),
('002000', 0.0418848167539267),
('680220', 0.04184704184704185),
('570901', 0.04081632653061224),
('280210', 0.04081632653061224),
('600420', 0.04044489383215369),
('320420', 0.0400890868596882),
('290440', 0.038860103626943),
('200310', 0.03872966692486444),
('310900', 0.0380952380952381),
('520550', 0.03773584905660377),
('690116', 0.03773584905660377),
('020410', 0.03773262762506403),
('440130', 0.03759398496240601),
('380902', 0.03571428571428571),
('550320', 0.03547297297297297),
('290110', 0.03539823008849557),
('590210', 0.03476151980598222),
('320904', 0.03454231433506045),
('490311', 0.03448275862068965),
('620310', 0.03422053231939164),
('220000', 0.03418803418803419),
('320120', 0.03355704697986577),
('240310', 0.03343949044585987),
('310351', 0.03333333333333333),
('640310', 0.03329679364209372),
('670902', 0.03174603174603174),
('680903', 0.03137789904502047),
('310140', 0.0308641975308642),
('620420', 0.03061224489795918),
('630220', 0.03052325581395349),
('330610', 0.03022860381636123),
('330510', 0.02971188475390156),
('180620', 0.02942668696093353),
('240900', 0.02941176470588235),
('550330', 0.02935420743639922),
('320610', 0.02929427430093209),
('620710', 0.02877697841726619),
('290320', 0.02877697841726619),
('200111', 0.02867072111207646),
('240320', 0.02842928216062544),
('310352', 0.02838427947598253),
('320410', 0.02791625124626122),
('300218', 0.02777777777777778),
('320110', 0.02768166089965398),
('620121', 0.02765208647561589),
('340210', 0.02722323049001815),
('240210', 0.02707581227436823),
('440150', 0.02702702702702703),
('320140', 0.02697022767075306),
('640220', 0.02683461117196057),
('640410', 0.026232741617357),
('310335', 0.02593659942363112),
('490315', 0.02564102564102564),
('340901', 0.02542372881355932),
('610310', 0.02461584365209608),
('680110', 0.02362204724409449),
('340903', 0.0234375),
('480213', 0.0231811697574893),
('320430', 0.02272727272727273),
('230000', 0.02272727272727273),
('640210', 0.02267002518891688),
('550310', 0.02246796559592768),
('490110', 0.02173913043478261),
('620410', 0.02165087956698241),
('340913', 0.02127659574468085),
('340906', 0.02127659574468085),
('590110', 0.0209366391184573),
('620810', 0.02090592334494774),
('020710', 0.02085600290170475),
('620926', 0.02076875387476751),
('480212', 0.02055622732769045),
('020510', 0.0202097074243193),
('650210', 0.02016868353502017),
('530510', 0.02005730659025788),
('520310', 0.02),
('480110', 0.01970443349753695),
('550110', 0.0194300518134715),
('650110', 0.0190424374319913),
('320511', 0.01829268292682927),
('240120', 0.01818181818181818),
('040610', 0.01785714285714286),
('170531', 0.0177293934681182),
('550210', 0.01761658031088083),
('290430', 0.01748251748251748),
('002100', 0.01715481171548117),
('150310', 0.01708217913204063),
('560310', 0.01682692307692308),
('640110', 0.01674500587544066),
('640430', 0.01648351648351648),
('570000', 0.01633393829401089),
('240110', 0.0162052667116813),
('690119', 0.01618122977346278),
('630110', 0.0158344666796192),
('330310', 0.01570146818923328),
('020820', 0.01567783584383646),
('130320', 0.0156165858912224),
('630210', 0.0155902004454343),
('020610', 0.01553829078801332),
('010120', 0.01547231270358306),
('180310', 0.01535880227155395),
('550410', 0.01529571719918423),
('360110', 0.01515151515151515),
('620114', 0.01492537313432836),
('440210', 0.01488095238095238),
('470220', 0.01478743068391867),
('620111', 0.01471389645776567),
('330210', 0.01441871961769795),
('140320', 0.01423487544483986),
('340520', 0.01411100658513641),
('560210', 0.01355661881977671),
('530311', 0.01341184167484462),
('330110', 0.01330895052321447),
('050900', 0.0131578947368421),
('250900', 0.01309707241910632),
('690120', 0.01305483028720627),
('490300', 0.01298701298701299),
('180320', 0.01298701298701299),
('170533', 0.01296982530439386),
('540000', 0.01271259233808624),
('170510', 0.01269971323228185),
('620930', 0.01252609603340292),
('340410', 0.01241642788920726),
('270000', 0.01241039905852145),
('520110', 0.01237964236588721),
('560400', 0.01210898082744702),
('180612', 0.01201452919810003),
('620320', 0.01185770750988142),
('470211', 0.01179941002949852),
('180520', 0.01179574732267577),
('100410', 0.01164329187615771),
('310331', 0.01162790697674419),
('530412', 0.01158504476040021),
('020810', 0.01154575219713941),
('530210', 0.01152737752161383),
('220110', 0.01149425287356322),
('320630', 0.01142857142857143),
('520531', 0.01112484548825711),
('180710', 0.01103708190322364),
('030810', 0.01092896174863388),
('130310', 0.01086556169429098),
('170210', 0.01082262080178853),
('340620', 0.01075268817204301),
('999900', 0.01062416998671979),
('030210', 0.01055662188099808),
('030510', 0.01044277360066834),
('170110', 0.01034780109226789),
('220210', 0.01027397260273973),
('680902', 0.01025641025641026),
('020310', 0.01021667580910587),
('130212', 0.009969657563935847),
('030710', 0.009891435464414958),
('140420', 0.009844993715961458),
('560330', 0.009771986970684038),
('270210', 0.009420631182289214),
('140220', 0.009351432880844645),
('160320', 0.00933609958506224),
('560110', 0.009322560596643879),
('170520', 0.009291360421578144),
('230110', 0.009202453987730062),
('170310', 0.009154113557358054),
('180110', 0.009134615384615385),
('140210', 0.009130282102305981),
('160212', 0.009098914000587027),
('050410', 0.008833922261484099),
('100210', 0.008741319144525446),
('170532', 0.008554705087798289),
('620912', 0.008553654743390357),
('090210', 0.008506616257088847),
('490000', 0.008489564909798374),
('170410', 0.008431932544539644),
('210210', 0.00823045267489712),
('020620', 0.008152173913043478),
('340310', 0.008032128514056224),
('110410', 0.007990834884720034),
('490312', 0.007977207977207978),
('210110', 0.007972665148063782),
('180420', 0.007866728366496992),
('180220', 0.007703887363853715),
('010210', 0.007637017070979336),
('180510', 0.007588713125267208),
('470111', 0.007556238768484639),
('060310', 0.007518796992481203),
('050310', 0.007514761137949544),
('030610', 0.007317073170731708),
('180611', 0.007287611061195967),
('010320', 0.007257694074414332),
('500110', 0.007106598984771574),
('040510', 0.006984459577440196),
('110310', 0.006973269134982567),
('250220', 0.006944444444444444),
('580000', 0.006857142857142857),
('020210', 0.006824146981627296),
('180210', 0.006806282722513089),
('040410', 0.006790744466800805),
('050110', 0.00675990675990676),
('010110', 0.006644518272425249),
('180410', 0.006634078212290503),
('140230', 0.00663265306122449),
('050210', 0.00662133142448103),
('160310', 0.006574892130675981),
('020110', 0.006501360749924402),
('070110', 0.006377551020408163),
('030310', 0.00625),
('120310', 0.006177540831006178),
('100510', 0.006119326874043855),
('030410', 0.006116207951070336),
('690114', 0.006105834464043419),
('110510', 0.005989518342899925),
('160211', 0.005981308411214953),
('150211', 0.005960568546538285),
('130211', 0.005947955390334572),
('520541', 0.005911778080945885),
('120210', 0.005798018131983976),
('040110', 0.005780346820809248),
('260110', 0.005772763054316453),
('070240', 0.005749668288367979),
('090110', 0.005704227647576519),
('110210', 0.005692403229145104),
('030110', 0.005622410731899783),
('260210', 0.0055542698449433),
('080110', 0.005548549810844893),
('120110', 0.005436931593515224),
('040310', 0.005404077622205846),
('250210', 0.005342831700801425),
('010310', 0.005331627212625293),
('440120', 0.005319148936170213),
('100110', 0.005308219178082192),
('470112', 0.005277044854881266),
('110110', 0.005152378864284149),
('160110', 0.005109489051094891),
('270410', 0.00496031746031746),
('060110', 0.004922542348342262),
('520516', 0.004901960784313725),
('270310', 0.004885574697865775),
('120410', 0.004865350089766607),
('220120', 0.004815409309791332),
('040210', 0.004786324786324786),
('070230', 0.004725554343874954),
('130110', 0.004694835680751174),
('140110', 0.004555336991406978),
('340530', 0.004530011325028313),
('060210', 0.00400114318376679),
('230900', 0.003992015968063872),
('520410', 0.003937007874015748),
('140340', 0.003897369275738876),
('490313', 0.003875968992248062),
('009000', 0.002952029520295203),
('350110', 0.002881844380403458),
('140330', 0.002380952380952381),
('130122', 0.002169197396963124),
('150212', 0.001451378809869376),
('130121', 0.001373626373626374),
('190323', 0.0009389671361502347),
('190311', 0.0008796003096193089),
('200532', 0.0005934718100890207),
('190312', 0.0005761198329252485),
('190314', 0.0004549590536851683),
('190324', 0.0004541326067211626),
('200522', 0.0004464285714285714),
('190212', 0.0004089793692629283),
('190114', 0.0003787878787878788),
('190112', 0.0003610760064993681),
('190322', 0.0002765869174388052),
('190211', 0.0002144925463840132),
('190111', 0.0002058036633052068),
('200512', 0.0001853911753800519),
('190321', 7.427213309566251e-05),
('440140', 0),
('200112', 0),
('620925', 0),
('250110', 0),
('200531', 0),
('310242', 0),
('600130', 0),
('580901', 0),
('200521', 0),
('490316', 0),
('200523', 0),
('190113', 0),
('310241', 0),
('550340', 0),
('450350', 0),
('190214', 0),
('300410', 0),
('530903', 0),
('200513', 0),
('140410', 0),
('002200', 0),
('630900', 0),
('680210', 0),
('290210', 0),
('140310', 0),
('200533', 0),
('440110', 0),
('190313', 0),
('190213', 0),
('270311', 0),
('270900', 0),
('200511', 0);
ALTER TABLE "POPULATION__STAGING_TABLE_1" ADD COLUMN "product_code__mapping_target_1_avg" REAL;
UPDATE "POPULATION__STAGING_TABLE_1" SET "product_code__mapping_target_1_avg" = 0.0;
UPDATE "POPULATION__STAGING_TABLE_1"
SET "product_code__mapping_target_1_avg" = "PRODUCT_CODE__MAPPING_TARGET_1_AVG"."value"
FROM "PRODUCT_CODE__MAPPING_TARGET_1_AVG"
WHERE "POPULATION__STAGING_TABLE_1"."product_code" = "PRODUCT_CODE__MAPPING_TARGET_1_AVG"."key";
DROP TABLE IF EXISTS "PRODUCT_CODE__MAPPING_TARGET_1_AVG";
2.6 Productionization¶
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3 module.
# Creates a folder named containing the SQL code.
pipe3.features.to_sql().save("consumer_expenditures_pipeline")
pipe3.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("consumer_expenditures_spark")
getml.engine.shutdown()
3. Conclusion¶
In this notebook, we have shown how you can use relational learning to predict whether items were purchased as a gift. We did this to highlight the importance of relational learning. Relational learning can be used in many real-world data science applications, but unfortunately most data scientists don't even know what relation learning is.