Loans - Predicting loan default risk of Czech bank customers¶
This notebook demonstrates the application of our relational learning algorithm to predict whether a bank customer will default on their loan. We train the predictor on customer metadata, transaction history, and other successful and unsuccessful loans.
Summary:
- Prediction type: Binary classification
- Domain: Finance
- Prediction target: Loan default
- Source data: 8 tables, 78.8 MB
- Population size: 682
Background¶
This notebook features a textbook example of predictive analytics applied to the financial sector. A loan is the lending of money to companies or individuals. Banks grant loans in exchange for the promise of repayment. Loan default is defined as the failure to meet this legal obligation, for example, when a home buyer fails to make a mortgage payment. A bank needs to estimate the risk it carries when granting loans to potentially non-performing customers.
The analysis is based on the financial dataset from the CTU Prague Relational Learning Repository (Motl and Schulte, 2015), which now resides at relational-data.org.
Analysis¶
Let's get started with the analysis and set up the session:
%pip install -q "getml==1.5.0" "matplotlib==3.9.2"
import matplotlib.pyplot as plt
from IPython.display import Image
import getml
%matplotlib inline
getml.engine.launch(allow_remote_ips=True, token="token")
getml.engine.set_project("loans")
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912214744.log. Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Connected to project 'loans'.
1. Loading data¶
1.1 Download from source¶
Downloading the raw data from the CTU Prague Relational Learning Repository and bringing it into a prediction-ready format takes time. To get to the getML model building as fast as possible, we prepared the data for you and excluded the preparation code from this notebook. It will be made available in a future version.
population_train, population_test, order, trans, meta = getml.datasets.load_loans(
roles=True, units=True
)
1.2 Prepare data for getML¶
The getml.datasets.load_loans method took care of the heavy lifting:
- Downloading the CSVs from our servers
- Converting the CSVs into getML DataFrames
- Setting the roles of the columns inside the getML DataFrames
The only thing left is to assign units to the columns that the relational learning algorithm is allowed to compare with each other.
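Had we loaded the raw CSVs ourselves, assigning roles and units by hand would look roughly like the following sketch for trans (illustrative only; load_loans has already taken care of all of it):
# Sketch only: load_loans has already assigned these roles and units.
trans.set_role("date", getml.data.roles.time_stamp)
trans.set_role("account_id", getml.data.roles.join_key)
trans.set_role(["type", "k_symbol", "bank", "operation"], getml.data.roles.categorical)
trans.set_role(["amount", "balance"], getml.data.roles.numerical)
trans.set_unit("date", "time stamp", comparison_only=True)
trans.set_unit("amount", "money")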
Data visualization
To keep the notebook simple, the original data model (image below) is condensed into 4 tables by resolving the trivial one-to-one and many-to-one joins:
- A population table population_{train, test}, consisting of the loan and account tables
- Three peripheral tables: order, trans, and meta, where meta is made up of card, client, disp and district
Image("assets/loans-schema.png", width=500)
Population table
- Information on the loan itself (duration, amount, date, ...)
- Geo-information about the branch where the loan was granted (the A** columns)
- The column status contains the binary target: levels [A, C] mean the loan was paid back, levels [B, D] mean loan default; we recoded status into our binary target default
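For reference, here is a hedged sketch of how such a recoding could be expressed with getML's column API (illustrative only; load_loans has already performed it):
# Sketch only: recode status into a 0/1 target column.
population_train["default"] = (
    (population_train["status"] == "B") | (population_train["status"] == "D")
).as_num()
population_train.set_role("default", getml.data.roles.target)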
population_train.set_role("date_loan", "time_stamp")
population_test.set_role("date_loan", "time_stamp")
population_test
name | date_loan | account_id | default | frequency | duration | payments | amount | loan_id | district_id | date_account | status |
---|---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | join_key | target | categorical | numerical | numerical | numerical | unused_float | unused_float | unused_string | unused_string |
unit | time stamp, comparison only | money | |||||||||
0 | 1996-04-29 | 19 | 1 | POPLATEK MESICNE | 12 | 2523 | 30276 | 4961 | 21 | 1995-04-07 | B |
1 | 1998-10-14 | 37 | 1 | POPLATEK MESICNE | 60 | 5308 | 318480 | 4967 | 20 | 1997-08-18 | D |
2 | 1998-04-19 | 38 | 0 | POPLATEK TYDNE | 48 | 2307 | 110736 | 4968 | 19 | 1997-08-08 | C |
3 | 1997-08-10 | 97 | 0 | POPLATEK MESICNE | 12 | 8573 | 102876 | 4986 | 74 | 1996-05-05 | A |
4 | 1996-11-06 | 132 | 0 | POPLATEK PO OBRATU | 12 | 7370 | 88440 | 4996 | 40 | 1996-05-11 | A |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
218 | 1995-12-04 | 11042 | 0 | POPLATEK MESICNE | 36 | 6032 | 217152 | 7243 | 72 | 1995-01-29 | A |
219 | 1996-08-20 | 11054 | 0 | POPLATEK TYDNE | 60 | 2482 | 148920 | 7246 | 59 | 1996-02-01 | C |
220 | 1994-01-31 | 11111 | 0 | POPLATEK MESICNE | 36 | 3004 | 108144 | 7259 | 1 | 1993-05-20 | A |
221 | 1998-11-22 | 11317 | 0 | POPLATEK MESICNE | 60 | 5291 | 317460 | 7292 | 50 | 1997-07-11 | C |
222 | 1996-12-27 | 11362 | 0 | POPLATEK MESICNE | 24 | 5392 | 129408 | 7308 | 67 | 1995-10-14 | A |
223 rows x 11 columns
memory usage: 0.02 MB
name: population_test
type: getml.DataFrame
Peripheral tables
meta
- Meta info about the client (card_type, gender, ...)
- Geo-information about the client
order
- Permanent orders related to a loan (amount, balance, ...)
trans
- Transactions related to a given loan (amount, ...)
While the contents of meta and order are omitted for brevity, here are the contents of trans:
trans
name | date | account_id | type | k_symbol | bank | operation | amount | balance | trans_id | account |
---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | join_key | categorical | categorical | categorical | categorical | numerical | numerical | unused_float | unused_float |
unit | time stamp, comparison only | money | ||||||||
0 | 1995-03-24 | 1 | PRIJEM | NULL | NULL | VKLAD | 1000 | 1000 | 1 | nan |
1 | 1995-04-13 | 1 | PRIJEM | NULL | AB | PREVOD Z UCTU | 3679 | 4679 | 5 | 41403269 |
2 | 1995-05-13 | 1 | PRIJEM | NULL | AB | PREVOD Z UCTU | 3679 | 20977 | 6 | 41403269 |
3 | 1995-06-13 | 1 | PRIJEM | NULL | AB | PREVOD Z UCTU | 3679 | 26835 | 7 | 41403269 |
4 | 1995-07-13 | 1 | PRIJEM | NULL | AB | PREVOD Z UCTU | 3679 | 30415 | 8 | 41403269 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
1056315 | 1998-08-31 | 10451 | PRIJEM | UROK | NULL | NULL | 62 | 17300 | 3682983 | nan |
1056316 | 1998-09-30 | 10451 | PRIJEM | UROK | NULL | NULL | 49 | 13442 | 3682984 | nan |
1056317 | 1998-10-31 | 10451 | PRIJEM | UROK | NULL | NULL | 34 | 10118 | 3682985 | nan |
1056318 | 1998-11-30 | 10451 | PRIJEM | UROK | NULL | NULL | 26 | 8398 | 3682986 | nan |
1056319 | 1998-12-31 | 10451 | PRIJEM | UROK | NULL | NULL | 42 | 13695 | 3682987 | nan |
1056320 rows x 10 columns
memory usage: 63.38 MB
name: trans
type: getml.DataFrame
1.3 Define relational model¶
To start with relational learning, we need to specify an abstract data model. Here, we use the high-level star schema API that allows us to define the abstract data model and construct a container with the concrete data at one-go. While a simple StarSchema
indeed works in many cases, it is not sufficient for more complex data models like schoflake schemas, where you would have to define the data model and construct the container in separate steps, by utilzing getML's full-fledged data model and container APIs respectively.
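For reference, a sketch of what that two-step construction could look like with the lower-level APIs (the joins mirror those below; treat the exact signatures as assumptions to be verified against the documentation):
# Sketch only: the lower-level equivalent of the StarSchema used below.
dm = getml.data.DataModel(population_train.to_placeholder("population"))
dm.add(getml.data.to_placeholder(meta=meta, order=order, trans=trans))
dm.population.join(dm.trans, on="account_id", time_stamps=("date_loan", "date"))
dm.population.join(dm.order, on="account_id")
dm.population.join(dm.meta, on="account_id")
container = getml.data.Container(train=population_train, test=population_test)
container.add(meta=meta, order=order, trans=trans)
# dm would then be passed to the pipeline as data_model, and container.train to fit().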
star_schema = getml.data.StarSchema(
train=population_train, test=population_test, alias="population"
)
star_schema.join(
trans,
on="account_id",
time_stamps=("date_loan", "date"),
)
star_schema.join(
order,
on="account_id",
)
star_schema.join(
meta,
on="account_id",
)
star_schema
data frames | staging table | |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | meta | META__STAGING_TABLE_2 |
2 | order | ORDER__STAGING_TABLE_3 |
3 | trans | TRANS__STAGING_TABLE_4 |
subset | name | rows | type | |
---|---|---|---|---|
0 | train | population_train | 459 | DataFrame |
1 | test | population_test | 223 | DataFrame |
name | rows | type | |
---|---|---|---|
0 | trans | 1056320 | DataFrame |
1 | order | 6471 | DataFrame |
2 | meta | 5369 | DataFrame |
meta
name | account_id | type_disp | type_card | gender | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | A16 | disp_id | client_id | card_id | district_id | issued | birth_date | A2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | join_key | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string |
0 | 1 | OWNER | NULL | F | south Bohemia | 70699 | 60 | 13 | 2 | 1 | 4 | 65.3 | 8968 | 2.8 | 3.35 | 131 | 1740 | 1910 | 1 | 1 | nan | 18 | NULL | 1970-12-13 | Pisek |
1 | 2 | OWNER | NULL | M | Prague | 1204953 | 0 | 0 | 0 | 1 | 1 | 100 | 12541 | 0.2 | 0.43 | 167 | 85677 | 99107 | 2 | 2 | nan | 1 | NULL | 1945-02-04 | Hl.m. Praha |
2 | 2 | DISPONENT | NULL | F | Prague | 1204953 | 0 | 0 | 0 | 1 | 1 | 100 | 12541 | 0.2 | 0.43 | 167 | 85677 | 99107 | 3 | 3 | nan | 1 | NULL | 1940-10-09 | Hl.m. Praha |
3 | 3 | OWNER | NULL | M | central Bohemia | 95616 | 65 | 30 | 4 | 1 | 6 | 51.4 | 9307 | 3.8 | 4.43 | 118 | 2616 | 3040 | 4 | 4 | nan | 5 | NULL | 1956-12-01 | Kolin |
4 | 3 | DISPONENT | NULL | F | central Bohemia | 95616 | 65 | 30 | 4 | 1 | 6 | 51.4 | 9307 | 3.8 | 4.43 | 118 | 2616 | 3040 | 5 | 5 | nan | 5 | NULL | 1960-07-03 | Kolin |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
5364 | 11349 | OWNER | NULL | F | Prague | 1204953 | 0 | 0 | 0 | 1 | 1 | 100 | 12541 | 0.2 | 0.43 | 167 | 85677 | 99107 | 13647 | 13955 | nan | 1 | NULL | 1945-10-30 | Hl.m. Praha |
5365 | 11349 | DISPONENT | NULL | M | Prague | 1204953 | 0 | 0 | 0 | 1 | 1 | 100 | 12541 | 0.2 | 0.43 | 167 | 85677 | 99107 | 13648 | 13956 | nan | 1 | NULL | 1943-04-06 | Hl.m. Praha |
5366 | 11359 | OWNER | classic | M | south Moravia | 117897 | 139 | 28 | 5 | 1 | 6 | 53.8 | 8814 | 4.7 | 5.74 | 107 | 2112 | 2059 | 13660 | 13968 | 1247 | 61 | 1995-06-13 | 1968-04-13 | Trebic |
5367 | 11362 | OWNER | NULL | F | north Moravia | 106054 | 38 | 25 | 6 | 2 | 6 | 63.1 | 8110 | 5.7 | 6.55 | 109 | 3244 | 3079 | 13663 | 13971 | nan | 67 | NULL | 1962-10-19 | Bruntal |
5368 | 11382 | OWNER | NULL | F | north Moravia | 323870 | 0 | 0 | 0 | 1 | 1 | 100 | 10673 | 4.7 | 5.44 | 100 | 18782 | 18347 | 13690 | 13998 | nan | 74 | NULL | 1953-08-12 | Ostrava - mesto |
5369 rows x 25 columns
memory usage: 1.10 MB
name: meta
type: getml.DataFrame
2. Predictive modeling¶
We have loaded the data and defined the roles, units, and the abstract data model. Next, we create a getML pipeline for relational learning.
2.1 getML Pipeline¶
Set-up of feature learners, selectors & predictor
mapping = getml.preprocessors.Mapping(min_freq=100)
fast_prop = getml.feature_learning.FastProp(
aggregation=getml.feature_learning.FastProp.agg_sets.All,
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
num_threads=1,
)
feature_selector = getml.predictors.XGBoostClassifier(n_jobs=1)
# the population is really small, so we set gamma to mitigate overfitting
predictor = getml.predictors.XGBoostClassifier(
gamma=2,
n_jobs=1,
)
Build the pipeline
pipe = getml.pipeline.Pipeline(
data_model=star_schema.data_model,
preprocessors=[mapping],
feature_learners=[fast_prop],
feature_selectors=[feature_selector],
predictors=predictor,
)
2.2 Model training¶
pipe.fit(star_schema.train)
Checking data model...
⠧ Staging... 0% • --:--
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
OK.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Trying 808 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as feature selector... ━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Trained pipeline.
Time taken: 0:00:01.070034.
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=['XGBoostClassifier'], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['meta', 'order', 'trans'], predictors=['XGBoostClassifier'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['container-CE93xg'])
2.3 Model evaluation¶
pipe.score(star_schema.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 21:48:14 | train | default | 0.9978 | 1. | 0.0671 |
1 | 2024-09-12 21:48:15 | test | default | 0.9686 | 0.9465 | 0.13814 |
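Beyond the aggregate scores, it can help to inspect the predicted probabilities directly. Here is a minimal sketch (assuming pipe.predict returns one probability column for the binary target):
probabilities = pipe.predict(star_schema.test).flatten()
actual = population_test["default"].to_numpy()
fig, ax = plt.subplots()
ax.hist(probabilities[actual == 0], bins=20, alpha=0.5, label="no default")
ax.hist(probabilities[actual == 1], bins=20, alpha=0.5, label="default")
ax.set_title("predicted default probability by actual outcome")
ax.set_xlabel("predicted probability")
ax.set_ylabel("count")
ax.legend()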
2.4 Studying features¶
Visualizing the learned features
The feature with the highest importance aggregates a low quantile (Q1) of the account balance over all transactions preceding the date the loan was granted:
by_importances = pipe.features.sort(by="importances")
by_importances[0].sql
DROP TABLE IF EXISTS "FEATURE_1_21";
CREATE TABLE "FEATURE_1_21" AS
SELECT Q1( t2."balance" ) AS "feature_1_21",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "TRANS__STAGING_TABLE_4" t2
ON t1."account_id" = t2."account_id"
WHERE t2."date" <= t1."date_loan"
GROUP BY t1.rowid;
Feature correlations
We want to analyze how the features are correlated with the target variable.
names, correlations = pipe.features[:50].correlations()
fig, ax = plt.subplots(figsize=(20, 10))
ax.bar(names, correlations)
ax.set_title("feature correlations")
ax.set_xlabel("feature")
ax.set_ylabel("correlation")
ax.tick_params(axis="x", rotation=90)
Feature importances
Feature importances are calculated by analyzing the improvement in predictive accuracy on each node of the trees in the XGBoost predictor. They are then normalized, so that all importances add up to 100%.
names, importances = pipe.features[:50].importances()
fig, ax = plt.subplots(figsize=(20, 10))
ax.bar(names, importances)
ax.set_title("feature importances")
ax.set_xlabel("feature")
ax.set_ylabel("importance")
ax.tick_params(axis="x", rotation=90)
Column importances
Because getML uses relational learning, we can apply the principles we used to calculate the feature importances to individual columns as well.
As we can see, a lot of the predictive power stems from the account balance. This is unsurprising: People with less money on their bank accounts are more likely to default on their loans.
names, importances = pipe.columns.importances()
fig, ax = plt.subplots(figsize=(20, 10))
ax.bar(names, importances)
ax.set_title("column importances")
ax.set_xlabel("column")
ax.set_ylabel("importance")
ax.tick_params(axis="x", rotation=90)
2.5 Productionization¶
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3
and spark
modules.
# Creates a folder named loans_pipeline containing
# the SQL code.
pipe.features.to_sql().save("loans_pipeline", remove=True)
# Creates a folder named loans_pipeline_spark containing
# the Spark SQL code.
pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save(
"loans_pipeline_spark", remove=True
)
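As a hedged sketch of the SQLite route (assuming the source tables have already been loaded into the database, for instance via getml.sqlite3.read_csv, and that getml.sqlite3.execute accepts a folder of scripts; check the module documentation for the exact signatures):
# Sketch only: run the transpiled feature scripts inside SQLite.
conn = getml.sqlite3.connect("loans.db")
getml.sqlite3.execute(conn, "loans_pipeline")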
getml.engine.shutdown()
3. Conclusion¶
By applying getML to the PKDD'99 Financial dataset, we demonstrated the power and relevance of relational learning on a real-world data set. With a training time below 1 minute, we outperformed almost all approaches based on manually generated features. This makes getML a prime choice when dealing with complex relational data schemas. The result holds independent of the problem domain, since no financial-sector expertise went into this analysis.
The present analysis could be improved in two directions. On the one hand, an extensive hyperparameter optimization could further improve the out-of-sample AUC. On the other hand, the hyperparameters could be tuned to produce less complex features that give up some AUC but are more interpretable to humans.
References¶
Motl, Jan, and Oliver Schulte. "The CTU Prague relational learning repository." arXiv preprint arXiv:1511.03086 (2015).
Schulte, Oliver, et al. "A hierarchy of independence assumptions for multi-relational Bayes net classifiers." 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM). IEEE, 2013.