Loans - Predicting loan default risk of Czech bank customers¶
This notebook demonstrates the application of our relational learning algorithm to predict whether a bank customer will default on their loan. We train the predictor on customer metadata, transaction history, and other successful and unsuccessful loans.
Summary:
- Prediction type: Binary classification
- Domain: Finance
- Prediction target: Loan default
- Source data: 8 tables, 78.8 MB
- Population size: 682
Background¶
This notebook features a textbook example of predictive analytics applied to the financial sector. A loan is the lending of money to companies or individuals. Banks grant loans in exchange for the promise of repayment. Loan default is defined as the failure to meet this legal obligation, for example, when a home buyer fails to make a mortgage payment. A bank needs to estimate the risk it carries when granting loans to potentially non-performing customers.
The analysis is based on the financial dataset from the CTU Prague Relational Learning Repository (Motl and Schulte, 2015), which now resides at relational-data.org.
Analysis¶
Let's get started with the analysis and set up the session:
%pip install -q "getml==1.5.0" "matplotlib==3.9.2"
import matplotlib.pyplot as plt
from IPython.display import Image
import getml
%matplotlib inline
getml.engine.launch(allow_remote_ips=True, token="token")
getml.engine.set_project("loans")
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912214744.log. Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Connected to project 'loans'.
1. Loading data¶
1.1 Download from source¶
Downloading the raw data from the CTU Prague Relational Learning Repository and bringing it into a prediction-ready format takes time. To get to the getML model building as fast as possible, we prepared the data for you and excluded the preparation code from this notebook. It will be made available in a future version.
population_train, population_test, order, trans, meta = getml.datasets.load_loans(
roles=True, units=True
)
1.2 Prepare data for getML¶
The getml.datasets.load_loans method took care of the heavy lifting:
- Downloading the CSVs from our servers
- Converting the CSVs into getML DataFrames
- Setting the roles of the columns inside the getML DataFrames
The only thing left is to assign units to the columns that the relational learning algorithm is allowed to compare with each other.
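Had we loaded the raw CSVs ourselves, assigning roles and units by hand would look roughly like the following sketch for trans (illustrative only; load_loans has already taken care of all of it):
# Sketch only: load_loans has already assigned these roles and units.
trans.set_role("date", getml.data.roles.time_stamp)
trans.set_role("account_id", getml.data.roles.join_key)
trans.set_role(["type", "k_symbol", "bank", "operation"], getml.data.roles.categorical)
trans.set_role(["amount", "balance"], getml.data.roles.numerical)
trans.set_unit("date", "time stamp", comparison_only=True)
trans.set_unit("amount", "money")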
Data visualization
To keep the notebook simple, the original data model (image below) is condensed into 4 tables by resolving the trivial one-to-one and many-to-one joins:
- A population table population_{train, test}, consisting of the loan and account tables
- Three peripheral tables: order, trans, and meta, where meta is made up of card, client, disp and district
Image("assets/loans-schema.png", width=500)
Population table
- Information on the loan itself (duration, amount, date, ...)
- Geo-information about the branch where the loan was granted (the A** columns)
- The column status contains the binary target: levels [A, C] mean the loan was paid back, levels [B, D] mean loan default; we recoded status into our binary target default
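For reference, here is a hedged sketch of how such a recoding could be expressed with getML's column API (illustrative only; load_loans has already performed it):
# Sketch only: recode status into a 0/1 target column.
population_train["default"] = (
    (population_train["status"] == "B") | (population_train["status"] == "D")
).as_num()
population_train.set_role("default", getml.data.roles.target)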
population_train.set_role("date_loan", "time_stamp")
population_test.set_role("date_loan", "time_stamp")
population_test
name | date_loan | account_id | default | frequency | duration | payments | amount | loan_id | district_id | date_account | status |
---|---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | join_key | target | categorical | numerical | numerical | numerical | unused_float | unused_float | unused_string | unused_string |
unit | time stamp, comparison only | money | |||||||||
0 | 1996-04-29 | 19 | 1 | POPLATEK MESICNE | 12 | 2523 | 30276 | 4961 | 21 | 1995-04-07 | B |
1 | 1998-10-14 | 37 | 1 | POPLATEK MESICNE | 60 | 5308 | 318480 | 4967 | 20 | 1997-08-18 | D |
2 | 1998-04-19 | 38 | 0 | POPLATEK TYDNE | 48 | 2307 | 110736 | 4968 | 19 | 1997-08-08 | C |
3 | 1997-08-10 | 97 | 0 | POPLATEK MESICNE | 12 | 8573 | 102876 | 4986 | 74 | 1996-05-05 | A |
4 | 1996-11-06 | 132 | 0 | POPLATEK PO OBRATU | 12 | 7370 | 88440 | 4996 | 40 | 1996-05-11 | A |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
218 | 1995-12-04 | 11042 | 0 | POPLATEK MESICNE | 36 | 6032 | 217152 | 7243 | 72 | 1995-01-29 | A |
219 | 1996-08-20 | 11054 | 0 | POPLATEK TYDNE | 60 | 2482 | 148920 | 7246 | 59 | 1996-02-01 | C |
220 | 1994-01-31 | 11111 | 0 | POPLATEK MESICNE | 36 | 3004 | 108144 | 7259 | 1 | 1993-05-20 | A |
221 | 1998-11-22 | 11317 | 0 | POPLATEK MESICNE | 60 | 5291 | 317460 | 7292 | 50 | 1997-07-11 | C |
222 | 1996-12-27 | 11362 | 0 | POPLATEK MESICNE | 24 | 5392 | 129408 | 7308 | 67 | 1995-10-14 | A |
223 rows x 11 columns
memory usage: 0.02 MB
name: population_test
type: getml.DataFrame
Peripheral tables
meta
- Meta info about the client (card_type, gender, ...)
- Geo-information about the client
order
- Permanent orders related to a loan (amount, balance, ...)
trans
- Transactions related to a given loan (amount, ...)
While the contents of meta and order are omitted for brevity, here are the contents of trans:
trans
name | date | account_id | type | k_symbol | bank | operation | amount | balance | trans_id | account |
---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | join_key | categorical | categorical | categorical | categorical | numerical | numerical | unused_float | unused_float |
unit | time stamp, comparison only | money | ||||||||
0 | 1995-03-24 | 1 | PRIJEM | NULL | NULL | VKLAD | 1000 | 1000 | 1 | nan |
1 | 1995-04-13 | 1 | PRIJEM | NULL | AB | PREVOD Z UCTU | 3679 | 4679 | 5 | 41403269 |
2 | 1995-05-13 | 1 | PRIJEM | NULL | AB | PREVOD Z UCTU | 3679 | 20977 | 6 | 41403269 |
3 | 1995-06-13 | 1 | PRIJEM | NULL | AB | PREVOD Z UCTU | 3679 | 26835 | 7 | 41403269 |
4 | 1995-07-13 | 1 | PRIJEM | NULL | AB | PREVOD Z UCTU | 3679 | 30415 | 8 | 41403269 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
1056315 | 1998-08-31 | 10451 | PRIJEM | UROK | NULL | NULL | 62 | 17300 | 3682983 | nan |
1056316 | 1998-09-30 | 10451 | PRIJEM | UROK | NULL | NULL | 49 | 13442 | 3682984 | nan |
1056317 | 1998-10-31 | 10451 | PRIJEM | UROK | NULL | NULL | 34 | 10118 | 3682985 | nan |
1056318 | 1998-11-30 | 10451 | PRIJEM | UROK | NULL | NULL | 26 | 8398 | 3682986 | nan |
1056319 | 1998-12-31 | 10451 | PRIJEM | UROK | NULL | NULL | 42 | 13695 | 3682987 | nan |
1056320 rows x 10 columns
memory usage: 63.38 MB
name: trans
type: getml.DataFrame
1.3 Define relational model¶
To start with relational learning, we need to specify an abstract data model. Here, we use the high-level star schema API that allows us to define the abstract data model and construct a container with the concrete data at one-go. While a simple StarSchema
indeed works in many cases, it is not sufficient for more complex data models like schoflake schemas, where you would have to define the data model and construct the container in separate steps, by utilzing getML's full-fledged data model and container APIs respectively.
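For reference, a sketch of what that two-step construction could look like with the lower-level APIs (the joins mirror those below; treat the exact signatures as assumptions to be verified against the documentation):
# Sketch only: the lower-level equivalent of the StarSchema used below.
dm = getml.data.DataModel(population_train.to_placeholder("population"))
dm.add(getml.data.to_placeholder(meta=meta, order=order, trans=trans))
dm.population.join(dm.trans, on="account_id", time_stamps=("date_loan", "date"))
dm.population.join(dm.order, on="account_id")
dm.population.join(dm.meta, on="account_id")
container = getml.data.Container(train=population_train, test=population_test)
container.add(meta=meta, order=order, trans=trans)
# dm would then be passed to the pipeline as data_model, and container.train to fit().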
star_schema = getml.data.StarSchema(
train=population_train, test=population_test, alias="population"
)
star_schema.join(
trans,
on="account_id",
time_stamps=("date_loan", "date"),
)
star_schema.join(
order,
on="account_id",
)
star_schema.join(
meta,
on="account_id",
)
star_schema
data frames | staging table | |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | meta | META__STAGING_TABLE_2 |
2 | order | ORDER__STAGING_TABLE_3 |
3 | trans | TRANS__STAGING_TABLE_4 |
subset | name | rows | type | |
---|---|---|---|---|
0 | train | population_train | 459 | DataFrame |
1 | test | population_test | 223 | DataFrame |
name | rows | type | |
---|---|---|---|
0 | trans | 1056320 | DataFrame |
1 | order | 6471 | DataFrame |
2 | meta | 5369 | DataFrame |
meta
name | account_id | type_disp | type_card | gender | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | A16 | disp_id | client_id | card_id | district_id | issued | birth_date | A2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | join_key | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string |
0 | 1 | OWNER | NULL | F | south Bohemia | 70699 | 60 | 13 | 2 | 1 | 4 | 65.3 | 8968 | 2.8 | 3.35 | 131 | 1740 | 1910 | 1 | 1 | nan | 18 | NULL | 1970-12-13 | Pisek |
1 | 2 | OWNER | NULL | M | Prague | 1204953 | 0 | 0 | 0 | 1 | 1 | 100 | 12541 | 0.2 | 0.43 | 167 | 85677 | 99107 | 2 | 2 | nan | 1 | NULL | 1945-02-04 | Hl.m. Praha |
2 | 2 | DISPONENT | NULL | F | Prague | 1204953 | 0 | 0 | 0 | 1 | 1 | 100 | 12541 | 0.2 | 0.43 | 167 | 85677 | 99107 | 3 | 3 | nan | 1 | NULL | 1940-10-09 | Hl.m. Praha |
3 | 3 | OWNER | NULL | M | central Bohemia | 95616 | 65 | 30 | 4 | 1 | 6 | 51.4 | 9307 | 3.8 | 4.43 | 118 | 2616 | 3040 | 4 | 4 | nan | 5 | NULL | 1956-12-01 | Kolin |
4 | 3 | DISPONENT | NULL | F | central Bohemia | 95616 | 65 | 30 | 4 | 1 | 6 | 51.4 | 9307 | 3.8 | 4.43 | 118 | 2616 | 3040 | 5 | 5 | nan | 5 | NULL | 1960-07-03 | Kolin |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
5364 | 11349 | OWNER | NULL | F | Prague | 1204953 | 0 | 0 | 0 | 1 | 1 | 100 | 12541 | 0.2 | 0.43 | 167 | 85677 | 99107 | 13647 | 13955 | nan | 1 | NULL | 1945-10-30 | Hl.m. Praha |
5365 | 11349 | DISPONENT | NULL | M | Prague | 1204953 | 0 | 0 | 0 | 1 | 1 | 100 | 12541 | 0.2 | 0.43 | 167 | 85677 | 99107 | 13648 | 13956 | nan | 1 | NULL | 1943-04-06 | Hl.m. Praha |
5366 | 11359 | OWNER | classic | M | south Moravia | 117897 | 139 | 28 | 5 | 1 | 6 | 53.8 | 8814 | 4.7 | 5.74 | 107 | 2112 | 2059 | 13660 | 13968 | 1247 | 61 | 1995-06-13 | 1968-04-13 | Trebic |
5367 | 11362 | OWNER | NULL | F | north Moravia | 106054 | 38 | 25 | 6 | 2 | 6 | 63.1 | 8110 | 5.7 | 6.55 | 109 | 3244 | 3079 | 13663 | 13971 | nan | 67 | NULL | 1962-10-19 | Bruntal |
5368 | 11382 | OWNER | NULL | F | north Moravia | 323870 | 0 | 0 | 0 | 1 | 1 | 100 | 10673 | 4.7 | 5.44 | 100 | 18782 | 18347 | 13690 | 13998 | nan | 74 | NULL | 1953-08-12 | Ostrava - mesto |
5369 rows x 25 columns
memory usage: 1.10 MB
name: meta
type: getml.DataFrame
2. Predictive modeling¶
We have loaded the data and defined the roles, units, and the abstract data model. Next, we create a getML pipeline for relational learning.
2.1 getML Pipeline¶
Set-up of feature learners, selectors & predictor
mapping = getml.preprocessors.Mapping(min_freq=100)
fast_prop = getml.feature_learning.FastProp(
aggregation=getml.feature_learning.FastProp.agg_sets.All,
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
num_threads=1,
)
feature_selector = getml.predictors.XGBoostClassifier(n_jobs=1)
# the population is really small, so we set gamma to mitigate overfitting
predictor = getml.predictors.XGBoostClassifier(
gamma=2,
n_jobs=1,
)
Build the pipeline
pipe = getml.pipeline.Pipeline(
data_model=star_schema.data_model,
preprocessors=[mapping],
feature_learners=[fast_prop],
feature_selectors=[feature_selector],
predictors=predictor,
)
2.2 Model training¶
pipe.fit(star_schema.train)
Checking data model...
⠧ Staging... 0% • --:--
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Trying 808 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as feature selector... ━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Trained pipeline.
Time taken: 0:00:01.070034.
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=['XGBoostClassifier'], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['meta', 'order', 'trans'], predictors=['XGBoostClassifier'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['container-CE93xg'])
2.3 Model evaluation¶
pipe.score(star_schema.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 21:48:14 | train | default | 0.9978 | 1. | 0.0671 |
1 | 2024-09-12 21:48:15 | test | default | 0.9686 | 0.9465 | 0.13814 |
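Beyond the aggregate scores, it can help to inspect the predicted probabilities directly. Here is a minimal sketch (assuming pipe.predict returns one probability column for the binary target):
probabilities = pipe.predict(star_schema.test).flatten()
actual = population_test["default"].to_numpy()
fig, ax = plt.subplots()
ax.hist(probabilities[actual == 0], bins=20, alpha=0.5, label="no default")
ax.hist(probabilities[actual == 1], bins=20, alpha=0.5, label="default")
ax.set_title("predicted default probability by actual outcome")
ax.set_xlabel("predicted probability")
ax.set_ylabel("count")
ax.legend()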
2.4 Studying features¶
Visualizing the learned features
The feature with the highest importance aggregates a low quantile (Q1) of the account balance over all transactions preceding the date the loan was granted:
by_importances = pipe.features.sort(by="importances")
by_importances[0].sql
DROP TABLE IF EXISTS "FEATURE_1_21";
CREATE TABLE "FEATURE_1_21" AS
SELECT Q1( t2."balance" ) AS "feature_1_21",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "TRANS__STAGING_TABLE_4" t2
ON t1."account_id" = t2."account_id"
WHERE t2."date" <= t1."date_loan"
GROUP BY t1.rowid;
Feature correlations
We want to analyze how the features are correlated with the target variable.
names, correlations = pipe.features[:50].correlations()
fig, ax = plt.subplots(figsize=(20, 10))
ax.bar(names, correlations)
ax.set_title("feature correlations")
ax.set_xlabel("feature")
ax.set_ylabel("correlation")
ax.tick_params(axis="x", rotation=90)
Feature importances
Feature importances are calculated by analyzing the improvement in predictive accuracy on each node of the trees in the XGBoost predictor. They are then normalized, so that all importances add up to 100%.
names, importances = pipe.features[:50].importances()
fig, ax = plt.subplots(figsize=(20, 10))
ax.bar(names, importances)
ax.set_title("feature importances")
ax.set_xlabel("feature")
ax.set_ylabel("importance")
ax.tick_params(axis="x", rotation=90)
Column importances
Because getML uses relational learning, we can apply the principles we used to calculate the feature importances to individual columns as well.
As we can see, a lot of the predictive power stems from the account balance. This is unsurprising: People with less money on their bank accounts are more likely to default on their loans.
names, importances = pipe.columns.importances()
fig, ax = plt.subplots(figsize=(20, 10))
ax.bar(names, importances)
ax.set_title("column importances")
ax.set_xlabel("column")
ax.set_ylabel("importance")
ax.tick_params(axis="x", rotation=90)
2.5 Productionization¶
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3
and spark
modules.
# Creates a folder named loans_pipeline containing
# the SQL code.
pipe.features.to_sql().save("loans_pipeline", remove=True)
# Creates a folder named loans_pipeline_spark containing
# the Spark SQL code.
pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save(
"loans_pipeline_spark", remove=True
)
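As a hedged sketch of the SQLite route (assuming the source tables have already been loaded into the database, for instance via getml.sqlite3.read_csv, and that getml.sqlite3.execute accepts a folder of scripts; check the module documentation for the exact signatures):
# Sketch only: run the transpiled feature scripts inside SQLite.
conn = getml.sqlite3.connect("loans.db")
getml.sqlite3.execute(conn, "loans_pipeline")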
getml.engine.shutdown()
3. Conclusion¶
By applying getML to the PKDD'99 Financial dataset, we demonstrated the power and relevance of relational learning on a real-world data set. With a training time below 1 minute, we outperformed almost all approaches based on manually generated features. This makes getML a prime choice when dealing with complex relational data schemas. The result holds independent of the problem domain, since no financial-sector expertise went into this analysis.
The present analysis could be improved in two directions. On the one hand, an extensive hyperparameter optimization could further improve the out-of-sample AUC. On the other hand, the hyperparameters could be tuned to produce less complex features that give up some AUC but are more interpretable to humans.
References¶
Motl, Jan, and Oliver Schulte. "The CTU Prague relational learning repository." arXiv preprint arXiv:1511.03086 (2015).
Schulte, Oliver, et al. "A hierarchy of independence assumptions for multi-relational Bayes net classifiers." 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM). IEEE, 2013.