article

Getting started with loan default prediction (nb article, loans.md)

Author: Johannes King

This example project walks you through the textbook data science problem of loan default prediction.

Loan default prediction

This example project walks you through the textbook data science problem of loan default prediction. It demonstrates how to automate the most tedious parts of the process - data wrangling. A production-ready solution is developed in just a few hours - still beating most other models.

This example is well-suited if you want to learn how to

  • approach a real-world dataset
  • train a basic getML pipeline
  • perform a hyperparameter optimization
  • interpret features generated using getML

Dataset

This tutorial features a use case from the financial sector. The analysis is based on the financial dataset from the the CTU Prague Relational Learning Repository. It contains information on 606 successful and 76 unsuccessful loans and consists of 8 tables:

Results

The loans data set is one of the most commonly used data sets in the relational learning literature. Nonetheless, getML outperforms practically all peer-reviewed academic papers based on the loans data set. The most important feature is found using Relboost. Transpiled to SQL it looks like this:

CREATE TABLE "FEATURE_2_4" AS
SELECT AVG( 
    CASE
        WHEN ( t2."balance" > 10430.000000 ) AND ( t2."bank" IN ( 'YZ', 'ST', 'QR', 'WX', 'CD', 'AB', 'UV', 'GH', 'IJ', 'KL', 'EF', 'MN', 'OP' ) ) AND ( datetime( t1."date" ) > datetime( t2."date", '+32.951613 days' ) ) THEN -19.024116
        WHEN ( t2."balance" > 10430.000000 ) AND ( t2."bank" IN ( 'YZ', 'ST', 'QR', 'WX', 'CD', 'AB', 'UV', 'GH', 'IJ', 'KL', 'EF', 'MN', 'OP' ) ) AND ( datetime( t1."date" ) <= datetime( t2."date", '+32.951613 days' ) OR t1."date" IS NULL OR t2."date" IS NULL ) THEN -32.262087
        WHEN ( t2."balance" > 10430.000000 ) AND ( t2."bank" NOT IN ( 'YZ', 'ST', 'QR', 'WX', 'CD', 'AB', 'UV', 'GH', 'IJ', 'KL', 'EF', 'MN', 'OP' ) ) AND ( t2."k_symbol" IN ( 'SIPO' ) ) THEN 32.559286
        WHEN ( t2."balance" > 10430.000000 ) AND ( t2."bank" NOT IN ( 'YZ', 'ST', 'QR', 'WX', 'CD', 'AB', 'UV', 'GH', 'IJ', 'KL', 'EF', 'MN', 'OP' ) ) AND ( t2."k_symbol" NOT IN ( 'SIPO' ) ) THEN -18.534438
        WHEN ( t2."balance" <= 10430.000000 OR t2."balance" IS NULL ) AND ( t2."balance" > 8451.000000 ) AND ( t2."bank" IN ( 'KL' ) ) THEN 394.075942
        WHEN ( t2."balance" <= 10430.000000 OR t2."balance" IS NULL ) AND ( t2."balance" > 8451.000000 ) AND ( t2."bank" NOT IN ( 'KL' ) ) THEN -32.813926
        WHEN ( t2."balance" <= 10430.000000 OR t2."balance" IS NULL ) AND ( t2."balance" <= 8451.000000 OR t2."balance" IS NULL ) AND ( datetime( t1."date" ) > datetime( t2."date", '+31.086957 days' ) ) THEN -0.130672
        WHEN ( t2."balance" <= 10430.000000 OR t2."balance" IS NULL ) AND ( t2."balance" <= 8451.000000 OR t2."balance" IS NULL ) AND ( datetime( t1."date" ) <= datetime( t2."date", '+31.086957 days' ) OR t1."date" IS NULL OR t2."date" IS NULL ) THEN 39.412646
        ELSE NULL
    END
) AS "feature_2_4",
     t1.rowid AS "rownum"
FROM "population" t1
LEFT JOIN "trans" t2
ON t1."account_id" = t2."account_id"
WHERE datetime( t2."date" ) <= datetime( t1."date" )
GROUP BY t1.rowid;