Loans - Predicting loan default risk¶
This notebook demonstrates the application of our relational learning algorithm to predict whether a bank customer will default on their loan. We train the predictor on customer metadata, transaction history, and other successful and unsuccessful loans.
Summary:
- Prediction type: Binary classification
- Domain: Finance
- Prediction target: Loan default
- Source data: 8 tables, 78.8 MB
- Population size: 682
Author: Dr. Johannes King, Dr. Patrick Urbanke
Background¶
This notebook features a textbook example of predictive analytics applied to the financial sector. A loan is the lending of money to companies or individuals. Banks grant loans in exchange for the promise of repayment. Loan default is defined as the failure to meet this legal obligation, for example, when a home buyer fails to make a mortgage payment. A bank needs to estimate the risk it carries when granting loans to potentially non-performing customers.
The analysis is based on the financial dataset from the CTU Prague Relational Learning Repository (Motl and Schulte, 2015).
Analysis¶
Let's get started with the analysis and set up the session:
%pip install -q "getml==1.5.0" "matplotlib~=3.9"
import matplotlib.pyplot as plt
%matplotlib inline
import getml
getml.engine.launch()
getml.engine.set_project("loans")
Note: you may need to restart the kernel to use updated packages. Launching ./getML --allow-push-notifications=true --allow-remote-ips=false --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false in /home/user/.getML/getml-1.5.0-x64-community-edition-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912123619.log. Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Connected to project 'loans'.
1. Loading data¶
1.1 Download from source¶
Downloading the raw data from the CTU Prague Relational Learning Repository and bringing it into a prediction-ready format takes time. To get you to the getML model building as fast as possible, we prepared the data for you and excluded the preparation code from this notebook. It will be made available in a future version.
population_train, population_test, order, trans, meta = getml.datasets.load_loans(roles=True, units=True)
Downloading population_train... ━━━━━━━━━━━━━━━━━━ 100% • 36.0/36.0 kB • 00:00 Downloading population_test... ━━━━━━━━━━━━━━━━━━━ 100% • 17.5/17.5 kB • 00:00 Downloading order... ━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 239.1/239.1 kB • 00:00 Downloading trans... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 69.3/69.3 MB • 00:02 Downloading meta... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 651.0/651.0 kB • 00:00
1.2 Prepare data for getML¶
The getml.datasets.load_loans method took care of the heavy lifting:
- It downloads the CSVs from our servers into Python,
- converts the CSVs into getML DataFrames, and
- sets the roles of the columns inside the getML DataFrames.
The only thing left is to assign units to those columns that the relational learning algorithm is allowed to compare with each other.
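Since load_loans was called with units=True, these units are already in place. Purely for illustration, a manual assignment might look like the following sketch (it uses getML's DataFrame.set_unit; the exact column choices here are assumptions):
# Sketch only: load_loans(units=True) has already done this for us.
population_train.set_unit(["amount"], "money")
population_train.set_unit(["date_loan"], "time stamp", comparison_only=True)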
Population table¶
- Information on the loan itself (duration, amount, date, ...)
- Geo-information about the branch where the loan was granted (columns A**)
- The column status contains the binary target: levels [A, C] mean the loan was paid back, while levels [B, D] mean the loan defaulted. We recoded status into our binary target column default (a sketch of this recoding follows below).
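The loaded DataFrames already contain the recoded default column. For illustration only, the recoding could be sketched as follows (this assumes getML's column comparison operators and the as_num() cast from boolean to 0/1):
# Sketch only: "default" already exists in the loaded data.
is_default = (population_train["status"] == "B") | (population_train["status"] == "D")
population_train["default"] = is_default.as_num()
population_train.set_role("default", "target")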
population_train.set_role("date_loan", "time_stamp")
population_test.set_role("date_loan", "time_stamp")
population_test
name | date_loan | account_id | default | frequency | duration | payments | amount | loan_id | district_id | date_account | status |
---|---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | join_key | target | categorical | numerical | numerical | numerical | unused_float | unused_float | unused_string | unused_string |
unit | time stamp, comparison only | money | |||||||||
0 | 1996-04-29 | 19 | 1 | POPLATEK MESICNE | 12 | 2523 | 30276 | 4961 | 21 | 1995-04-07 | B |
1 | 1998-10-14 | 37 | 1 | POPLATEK MESICNE | 60 | 5308 | 318480 | 4967 | 20 | 1997-08-18 | D |
2 | 1998-04-19 | 38 | 0 | POPLATEK TYDNE | 48 | 2307 | 110736 | 4968 | 19 | 1997-08-08 | C |
3 | 1997-08-10 | 97 | 0 | POPLATEK MESICNE | 12 | 8573 | 102876 | 4986 | 74 | 1996-05-05 | A |
4 | 1996-11-06 | 132 | 0 | POPLATEK PO OBRATU | 12 | 7370 | 88440 | 4996 | 40 | 1996-05-11 | A |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
218 | 1995-12-04 | 11042 | 0 | POPLATEK MESICNE | 36 | 6032 | 217152 | 7243 | 72 | 1995-01-29 | A |
219 | 1996-08-20 | 11054 | 0 | POPLATEK TYDNE | 60 | 2482 | 148920 | 7246 | 59 | 1996-02-01 | C |
220 | 1994-01-31 | 11111 | 0 | POPLATEK MESICNE | 36 | 3004 | 108144 | 7259 | 1 | 1993-05-20 | A |
221 | 1998-11-22 | 11317 | 0 | POPLATEK MESICNE | 60 | 5291 | 317460 | 7292 | 50 | 1997-07-11 | C |
222 | 1996-12-27 | 11362 | 0 | POPLATEK MESICNE | 24 | 5392 | 129408 | 7308 | 67 | 1995-10-14 | A |
223 rows x 11 columns
memory usage: 0.02 MB
name: population_test
type: getml.DataFrame
Peripheral tables¶
meta
- Meta info about the client (card_type, gender, ...)
- Geo-information about the client
order
- Permanent orders related to a loan (amount, balance, ...)
trans
- Transactions related to a given loan (amount, ...)
The contents of order, trans, and meta are shown below:
order
name | account_id | bank_to | k_symbol | amount | account_to | order_id |
---|---|---|---|---|---|---|
role | join_key | categorical | categorical | numerical | unused_float | unused_float |
unit | money | |||||
0 | 1 | YZ | SIPO | 2452 | 87144583 | 29401 |
1 | 2 | ST | UVER | 3372.7 | 89597016 | 29402 |
2 | 2 | QR | SIPO | 7266 | 13943797 | 29403 |
3 | 3 | WX | SIPO | 1135 | 83084338 | 29404 |
4 | 3 | CD | NULL | 327 | 24485939 | 29405 |
... | ... | ... | ... | ... | ... | |
6466 | 11362 | YZ | SIPO | 4780 | 70641225 | 46334 |
6467 | 11362 | MN | NULL | 56 | 78507822 | 46335 |
6468 | 11362 | ST | POJISTNE | 330 | 40799850 | 46336 |
6469 | 11362 | KL | NULL | 129 | 20009470 | 46337 |
6470 | 11362 | MN | UVER | 5392 | 61540514 | 46338 |
6471 rows x 6 columns
memory usage: 0.23 MB
name: order
type: getml.DataFrame
trans
name | date | account_id | type | k_symbol | bank | operation | amount | balance | trans_id | account |
---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | join_key | categorical | categorical | categorical | categorical | numerical | numerical | unused_float | unused_float |
unit | time stamp, comparison only | money | ||||||||
0 | 1995-03-24 | 1 | PRIJEM | NULL | NULL | VKLAD | 1000 | 1000 | 1 | nan |
1 | 1995-04-13 | 1 | PRIJEM | NULL | AB | PREVOD Z UCTU | 3679 | 4679 | 5 | 41403269 |
2 | 1995-05-13 | 1 | PRIJEM | NULL | AB | PREVOD Z UCTU | 3679 | 20977 | 6 | 41403269 |
3 | 1995-06-13 | 1 | PRIJEM | NULL | AB | PREVOD Z UCTU | 3679 | 26835 | 7 | 41403269 |
4 | 1995-07-13 | 1 | PRIJEM | NULL | AB | PREVOD Z UCTU | 3679 | 30415 | 8 | 41403269 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
1056315 | 1998-08-31 | 10451 | PRIJEM | UROK | NULL | NULL | 62 | 17300 | 3682983 | nan |
1056316 | 1998-09-30 | 10451 | PRIJEM | UROK | NULL | NULL | 49 | 13442 | 3682984 | nan |
1056317 | 1998-10-31 | 10451 | PRIJEM | UROK | NULL | NULL | 34 | 10118 | 3682985 | nan |
1056318 | 1998-11-30 | 10451 | PRIJEM | UROK | NULL | NULL | 26 | 8398 | 3682986 | nan |
1056319 | 1998-12-31 | 10451 | PRIJEM | UROK | NULL | NULL | 42 | 13695 | 3682987 | nan |
1056320 rows x 10 columns
memory usage: 63.38 MB
name: trans
type: getml.DataFrame
meta
name | account_id | type_disp | type_card | gender | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | A16 | disp_id | client_id | card_id | district_id | issued | birth_date | A2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | join_key | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string |
0 | 1 | OWNER | NULL | F | south Bohemia | 70699 | 60 | 13 | 2 | 1 | 4 | 65.3 | 8968 | 2.8 | 3.35 | 131 | 1740 | 1910 | 1 | 1 | nan | 18 | NULL | 1970-12-13 | Pisek |
1 | 2 | OWNER | NULL | M | Prague | 1204953 | 0 | 0 | 0 | 1 | 1 | 100 | 12541 | 0.2 | 0.43 | 167 | 85677 | 99107 | 2 | 2 | nan | 1 | NULL | 1945-02-04 | Hl.m. Praha |
2 | 2 | DISPONENT | NULL | F | Prague | 1204953 | 0 | 0 | 0 | 1 | 1 | 100 | 12541 | 0.2 | 0.43 | 167 | 85677 | 99107 | 3 | 3 | nan | 1 | NULL | 1940-10-09 | Hl.m. Praha |
3 | 3 | OWNER | NULL | M | central Bohemia | 95616 | 65 | 30 | 4 | 1 | 6 | 51.4 | 9307 | 3.8 | 4.43 | 118 | 2616 | 3040 | 4 | 4 | nan | 5 | NULL | 1956-12-01 | Kolin |
4 | 3 | DISPONENT | NULL | F | central Bohemia | 95616 | 65 | 30 | 4 | 1 | 6 | 51.4 | 9307 | 3.8 | 4.43 | 118 | 2616 | 3040 | 5 | 5 | nan | 5 | NULL | 1960-07-03 | Kolin |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
5364 | 11349 | OWNER | NULL | F | Prague | 1204953 | 0 | 0 | 0 | 1 | 1 | 100 | 12541 | 0.2 | 0.43 | 167 | 85677 | 99107 | 13647 | 13955 | nan | 1 | NULL | 1945-10-30 | Hl.m. Praha |
5365 | 11349 | DISPONENT | NULL | M | Prague | 1204953 | 0 | 0 | 0 | 1 | 1 | 100 | 12541 | 0.2 | 0.43 | 167 | 85677 | 99107 | 13648 | 13956 | nan | 1 | NULL | 1943-04-06 | Hl.m. Praha |
5366 | 11359 | OWNER | classic | M | south Moravia | 117897 | 139 | 28 | 5 | 1 | 6 | 53.8 | 8814 | 4.7 | 5.74 | 107 | 2112 | 2059 | 13660 | 13968 | 1247 | 61 | 1995-06-13 | 1968-04-13 | Trebic |
5367 | 11362 | OWNER | NULL | F | north Moravia | 106054 | 38 | 25 | 6 | 2 | 6 | 63.1 | 8110 | 5.7 | 6.55 | 109 | 3244 | 3079 | 13663 | 13971 | nan | 67 | NULL | 1962-10-19 | Bruntal |
5368 | 11382 | OWNER | NULL | F | north Moravia | 323870 | 0 | 0 | 0 | 1 | 1 | 100 | 10673 | 4.7 | 5.44 | 100 | 18782 | 18347 | 13690 | 13998 | nan | 74 | NULL | 1953-08-12 | Ostrava - mesto |
5369 rows x 25 columns
memory usage: 1.10 MB
name: meta
type: getml.DataFrame
1.3 Define relational model¶
To start with relational learning, we need to specify an abstract data model. Here, we use the high-level star schema API, which allows us to define the abstract data model and construct a container with the concrete data in one go. While a simple StarSchema indeed works in many cases, it is not sufficient for more complex data models such as snowflake schemas, where you have to define the data model and construct the container in separate steps, utilizing getML's full-fledged data model and container APIs, respectively.
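For orientation, those separate steps might look roughly like the following sketch (an assumption that getML's DataModel, placeholder joins, and Container accept the same join arguments as StarSchema; this notebook does not execute it):
# Sketch only: the StarSchema call below achieves the same in one go.
dm = getml.data.DataModel(population_train.to_placeholder("population"))
dm.add(getml.data.to_placeholder(meta=meta, order=order, trans=trans))
dm.population.join(dm.trans, on="account_id", time_stamps=("date_loan", "date"))
dm.population.join(dm.order, on="account_id")
dm.population.join(dm.meta, on="account_id")
container = getml.data.Container(train=population_train, test=population_test)
container.add(meta=meta, order=order, trans=trans)
For our purposes, the one-go star schema API suffices: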
star_schema = getml.data.StarSchema(
train=population_train, test=population_test, alias="population"
)
star_schema.join(
    trans,
    on="account_id",
    # only transactions booked before the loan date may enter a feature
    time_stamps=("date_loan", "date"),
)
star_schema.join(
order,
on="account_id",
)
star_schema.join(
meta,
on="account_id",
)
star_schema
data frames | staging table | |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | meta | META__STAGING_TABLE_2 |
2 | order | ORDER__STAGING_TABLE_3 |
3 | trans | TRANS__STAGING_TABLE_4 |
subset | name | rows | type | |
---|---|---|---|---|
0 | train | population_train | 459 | DataFrame |
1 | test | population_test | 223 | DataFrame |
name | rows | type | |
---|---|---|---|
0 | trans | 1056320 | DataFrame |
1 | order | 6471 | DataFrame |
2 | meta | 5369 | DataFrame |
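Note that the staging table names listed here (POPULATION__STAGING_TABLE_1, TRANS__STAGING_TABLE_4, ...) are exactly the names that reappear in the SQL transpilations of the learned features further below.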
2. Predictive modeling¶
We have loaded the data and defined the roles, the units, and the abstract data model. Next, we create a getML pipeline for relational learning.
2.1 getML Pipeline¶
Set-up of feature learners, selectors & predictor
fast_prop = getml.feature_learning.FastProp(
aggregation=getml.feature_learning.FastProp.agg_sets.All,
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
num_threads=1,
)
feature_selector = getml.predictors.XGBoostClassifier(n_jobs=1)
# the population is really small, so we set gamma to mitigate overfitting
predictor = getml.predictors.XGBoostClassifier(gamma=2, n_jobs=1)
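In XGBoost, gamma is the minimum loss reduction required to make a further partition on a leaf node, so larger values yield more conservative trees.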
Build the pipeline
pipe = getml.pipeline.Pipeline(
data_model=star_schema.data_model,
feature_learners=[fast_prop],
feature_selectors=[feature_selector],
predictors=predictor,
)
2.2 Model training¶
pipe.fit(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Trying 548 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as feature selector... ━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Trained pipeline.
Time taken: 0:00:01.068837.
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=['XGBoostClassifier'], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['meta', 'order', 'trans'], predictors=['XGBoostClassifier'], preprocessors=[], share_selected_features=0.5, tags=['container-yVbiNK'])
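Note the share_selected_features=0.5 in the repr above: the XGBoost feature selector passes only about the best half of the candidate features generated by FastProp on to the final predictor.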
2.3 Model evaluation¶
pipe.score(star_schema.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 12:36:27 | train | default | 1.0 | 1.0 | 0.06359
1 | 2024-09-12 12:36:28 | test | default | 0.9641 | 0.934 | 0.15445 |
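For a sanity check, the predicted probabilities can also be pulled out and scored by hand. A minimal sketch, assuming scikit-learn is available (it is not installed by this notebook):
# Sketch only: recompute the test AUC from the raw predictions.
from sklearn.metrics import roc_auc_score
probabilities = pipe.predict(star_schema.test)  # numpy array, one column per target
y_true = population_test["default"].to_numpy()
print("AUC:", roc_auc_score(y_true, probabilities[:, 0]))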
2.4 Studying features¶
Visualizing the learned features
The feature with the highest importance is:
by_importances = pipe.features.sort(by="importances")
by_importances[0].sql
DROP TABLE IF EXISTS "FEATURE_1_24";
CREATE TABLE "FEATURE_1_24" AS
SELECT Q1( t2."balance" ) AS "feature_1_24",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "TRANS__STAGING_TABLE_4" t2
ON t1."account_id" = t2."account_id"
WHERE t2."date" <= t1."date_loan"
GROUP BY t1.rowid;
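In words: for each loan, this feature takes all transactions on the associated account that were booked before the loan date (the WHERE clause keeps future information out) and computes the first quartile (Q1) of the account balance over those transactions.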
Feature correlations
We want to analyze how the features are correlated with the target variable.
names, correlations = pipe.features[:50].correlations()
fig, ax = plt.subplots(figsize=(20, 10))
ax.bar(names, correlations, color="#6829c2")
ax.set_title("feature correlations")
ax.set_xlabel("feature")
ax.set_ylabel("correlation")
ax.tick_params(axis="x", rotation=90)
Feature importances
Feature importances are calculated by analyzing the improvement in predictive accuracy on each node of the trees in the XGBoost predictor. They are then normalized, so that all importances add up to 100%.
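For example, if the splits on one feature account for 12% of the total gain accumulated across all trees, that feature receives an importance of 12%.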
names, importances = pipe.features[:50].importances()
fig, ax = plt.subplots(figsize=(20, 10))
ax.bar(names, importances, color='#6829c2')
ax.set_title("feature importances")
ax.set_xlabel("feature")
ax.set_ylabel("importance")
ax.tick_params(axis="x", rotation=90)
Column importances
Because getML uses relational learning, we can apply the principles we used to calculate the feature importances to individual columns as well.
As we can see, a lot of the predictive power stems from the account balance. This is unsurprising: People with less money on their bank accounts are more likely to default on their loans.
names, importances = pipe.columns.importances()
fig, ax = plt.subplots(figsize=(20, 10))
ax.bar(names, importances, color="#6829c2")
ax.set_title("column importances")
ax.set_xlabel("column")
ax.set_ylabel("importance")
ax.tick_params(axis="x", rotation=90)
getml.engine.shutdown()
3. Conclusion¶
By applying getML to the PKDD'99 Financial dataset, we were able to show the power and relevance of relational learning on a real-world data set. With a training time of under two seconds, we outperformed almost all approaches based on manually generated features. This makes getML the prime choice when dealing with complex relational data schemas. The result holds independently of the problem domain, since no financial-sector expertise went into this analysis.
The present analysis could be improved in two directions. By performing an extensive hyperparameter optimization, the out-of-sample AUC could be improved further. Alternatively, the hyperparameters could be tuned to produce less complex features that perform somewhat worse (in terms of AUC) but are easier for humans to interpret.
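As a starting point for the first direction, one could sweep individual hyperparameters by hand, reusing only the APIs demonstrated above (getML also ships a dedicated getml.hyperopt module for systematic searches). A minimal sketch; a proper search would score candidates on a separate validation split rather than on the test set, and the engine shut down above would need to be relaunched first:
# Sketch only: a manual sweep over XGBoost's gamma.
for gamma in [0.5, 1, 2, 4, 8]:
    candidate = getml.pipeline.Pipeline(
        data_model=star_schema.data_model,
        feature_learners=[fast_prop],
        feature_selectors=[feature_selector],
        predictors=getml.predictors.XGBoostClassifier(gamma=gamma, n_jobs=1),
    )
    candidate.fit(star_schema.train)
    # assumption: the Scores object returned by score() exposes .auc
    print(f"gamma={gamma}:", candidate.score(star_schema.test).auc)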
References¶
Motl, Jan, and Oliver Schulte. "The CTU Prague Relational Learning Repository." arXiv preprint arXiv:1511.03086 (2015).