Atherosclerosis - Disease lethality prediction¶
With this notebook we give a brief introduction to feature engineering on relational data with many columns. We discuss why feature engineering on such data is particularly challenging and what we can do to overcome these problems.
Summary:
- Prediction type: Binary classification
- Domain: Health
- Prediction target: Mortality within one year
- Source data: 146 columns in 2 tables, 22 MB
- Population size: 28433
Feature engineering and the curse of dimensionality¶
The problem¶
To illustrate the point, we give a simplified example based on the real data used in the analysis below. When we engineer features from relational data, we usually write something like this:
SELECT AVG(t2.HDL)
FROM population_training t1
LEFT JOIN contr t2
ON t1.ICO = t2.ICO
WHERE t1.AGE >= 60 AND t1.ALKOHOL IN ('1', '2')
GROUP BY t1.ICO;
Think about that for a second. This feature aggregates high-density lipoprotein (HDL) cholesterol values recorded during check-up dates, conditional on age and alcohol consumption. We arbitrarily chose both the column to aggregate (HDL) and the columns to construct conditions on (AGE and ALKOHOL) from a larger set of 146 columns.
Every column we have can either be aggregated (here HDL) or used in a condition (here AGE and ALKOHOL). That means for each of the $n$ columns we might aggregate, we can potentially build conditions on $n$ other columns. In other words, the size of the search space grows as $n^2$ in the number of columns.
Note that this problem occurs regardless of whether you automate feature engineering or do it by hand. The size of the search space is $n^2$ in the number of columns in either case, unless you can rule out something a priori.
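To make the arithmetic concrete, here is a quick back-of-the-envelope count for this dataset (a sketch; the exact number of candidate features also depends on column roles and on the set of aggregations considered):
n_columns = 146
# Traditional features: pick one column to aggregate and one other column
# to build a condition on -- the search space grows quadratically.
quadratic = n_columns * (n_columns - 1)
print(quadratic)  # 21170 column pairs, before even choosing aggregations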
The solution¶
So when we have relational data sets with many columns, what do we do? The answer is to write different features. Specifically, suppose we had features like this:
SELECT AVG(
    CASE
        WHEN t1.AGE >= 60 THEN weight1
        WHEN t1.ALKOHOL IN ('1', '2') THEN weight2
    END
)
FROM population_training t1
LEFT JOIN contr t2
ON t1.ICO = t2.ICO
GROUP BY t1.ICO;
weight1 and weight2 are learnable weights. An algorithm that generates features like this only uses columns in conditions; it is not allowed to aggregate columns directly, and it doesn't need to.
That means the computational complexity is linear instead of quadratic. For data sets with a large number of columns, this can make all the difference in the world. For instance, if you have 100 columns, the search space of the second approach is only 1% the size of the first one's ($100$ vs. $100^2 = 10{,}000$).
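To see what such a feature computes outside of SQL, here is a minimal numpy sketch of the weighted CASE WHEN aggregation for a single patient. The thresholds and weights are hypothetical; in getML they would be learned from the data:
import numpy as np

# Joined rows for one patient; each row gets a learnable weight depending
# on which condition fires first, then the weights are averaged.
age = np.array([65.0, 45.0, 72.0, 58.0])
alkohol = np.array(["1", "3", "2", "1"])

weight1, weight2 = 0.8, -0.3  # hypothetical learned weights

case_values = np.where(
    age >= 60,
    weight1,  # WHEN AGE >= 60 THEN weight1
    np.where(np.isin(alkohol, ["1", "2"]), weight2, np.nan),  # WHEN ALKOHOL IN ('1', '2') THEN weight2
)

feature = np.nanmean(case_values)  # AVG(...) skips rows where no condition fires
print(feature)  # 0.4333...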
Background¶
To illustrate the problem of dimensionality in predictive analytics on relational data, we use the STULONG dataset. It is a longitudinal study of atherosclerosis patients.
One of its defining features is that it contains many columns, which makes it a good candidate to illustrate the problem discussed in this notebook.
There are some academic studies related to this dataset:
- https://www.researchgate.net/publication/228572841_Mining_episode_rules_in_STULONG_dataset
- https://citeseerx.ist.psu.edu/doc_view/pid/3a9cb05b77b631b6fcbe253eb93e053ba8c0719c
The way these studies handle the large number of columns in the data set is to divide the columns into subgroups and then handle each subgroup separately. Even though this is one way to overcome the curse of dimensionality, it is not a very satisfying approach: we would like to be able to handle a large number of columns at once.
The analysis is based on the STULONG dataset. It is publicly available and can be downloaded from the CTU Prague Relational Learning Repository (now residing at relational-data.org).
Analysis¶
Let's get started with the analysis and set up our session:
%pip install -q "getml==1.5.0" "featuretools==1.31.0" "tsfresh==0.20.3" "matplotlib==3.9.2" "ipywidgets==8.1.5"
import matplotlib.pyplot as plt
import getml
%matplotlib inline
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
getml.engine.launch(allow_remote_ips=True, token='token')
getml.engine.set_project('atherosclerosis')
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912133457.log. Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Connected to project 'atherosclerosis'.
1. Loading data¶
1.1 Download from source¶
Downloading the raw data and converting it into a prediction-ready format takes time. To get to the getML model building as fast as possible, we have prepared the data for you and excluded the preparation code from this notebook. It is available in the example notebook featuring the full analysis.
population, contr = getml.datasets.load_atherosclerosis()
1.2 Prepare data for getML¶
The getml.datasets.load_atherosclerosis method takes care of the entire data preparation:
- Downloads the CSVs from our servers into Python
- Converts the CSVs into getML DataFrames
- Sets the roles of the columns inside the getML DataFrames
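For reference, a hand-rolled version of these steps might look roughly like this. This is only a sketch: the CSV locations are placeholders and the role assignments are abbreviated.
# Sketch of the manual equivalent (file locations are placeholders).
population = getml.data.DataFrame.from_csv("population.csv", name="population")
contr = getml.data.DataFrame.from_csv("contr.csv", name="contr")

# Assign roles so the feature learners know how to treat each column.
population.set_role("ICO", getml.data.roles.join_key)
population.set_role(["REFERENCE_DATE", "ENTRY_DATE"], getml.data.roles.time_stamp)
population.set_role("TARGET", getml.data.roles.target)
population.set_role(["AGE", "CHLST", "TRIGL"], getml.data.roles.numerical)  # abbreviated

contr.set_role("ICO", getml.data.roles.join_key)
contr.set_role("CONTROL_DATE", getml.data.roles.time_stamp)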
The original data model is condensed into 2 tables:
- A population table population_{train/test/validate}, based on the death table
- A peripheral table: contr
Death: population table
- Reference date: period of time from 1976 to 1999
- Target: whether the patient dies within one year after each reference date (sketched schematically below)
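The preparation code is excluded from this notebook (see above), but schematically the target could be derived like this. This is a hypothetical pandas sketch, not the actual preparation code:
import pandas as pd

def make_population(patients: pd.DataFrame) -> pd.DataFrame:
    """One row per patient and reference year; TARGET is 1 if the
    patient died within one year of the reference date."""
    rows = []
    for patient in patients.itertuples():
        death = patient.DEATH_DATE  # NaT if the patient survived the study
        last_year = death.year if pd.notna(death) else 1999
        for year in range(patient.ENTRY_DATE.year + 1, last_year + 1):
            reference_date = pd.Timestamp(year=year, month=1, day=1)
            died_within_a_year = pd.notna(death) and (
                reference_date <= death < reference_date + pd.DateOffset(years=1)
            )
            rows.append({
                "ICO": patient.ICO,
                "REFERENCE_DATE": reference_date,
                "TARGET": int(died_within_a_year),
            })
    return pd.DataFrame(rows)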
population
name | REFERENCE_DATE | ENTRY_DATE | ICO | TARGET | KONSKUP | STAV | VZDELANI | ZODPOV | TELAKTZA | AKTPOZAM | DOPRAVA | DOPRATRV | ALKOHOL | BOLHR | BOLDK | DUSNOST | MOC | RARISK | OBEZRISK | KOURRISK | HTRISK | CHOLRISK | KOURENI | DOBAKOUR | BYVKURAK | PIVOMN | VINOMN | LIHMN | KAVA | CAJ | CUKR | VYSKA | VAHA | SYST1 | DIAST1 | SYST2 | DIAST2 | TRIC | SUBSC | CHLST | TRIGL | AGE | PARTICIPATION | ROKNAR | ROKVSTUP | MESVSTUP | PIVO7 | PIVO10 | PIVO12 | VINO | LIHOV | IM | IML | HT | HTD | HTL | ICT | DIABET | DIABD | DIABL | HYPLIP | HYPLD | HYPLL | IMTRV | HTTRV | DIABTRV | HYPLTRV | DENUMR | MESUMR | ROKUMR | PRICUMR | DUMMY | YEAR | ICTL | ICTTRV | DEATH_DATE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | time_stamp | join_key | target | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string |
unit | time stamp | time stamp | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
0 | 1977-01-01 | 1976-10-01 | 10001 | 0 | 4 | 1 | 3 | 1 | 1 | 2 | 3 | 6 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 4 | 10 | nan | 1 | 4 | 9 | 2 | 4 | 3 | 169 | 71 | 120 | 85 | 120 | 90 | 4 | 12 | 209 | 86 | 48 | -1899 | 29 | 1976 | 10 | nan | nan | nan | nan | 12 | 2 | nan | 2 | nan | nan | 2 | 2 | nan | nan | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 1977 | NULL | NULL | NULL |
1 | 1978-01-01 | 1976-10-01 | 10001 | 0 | 4 | 1 | 3 | 1 | 1 | 2 | 3 | 6 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 4 | 10 | nan | 1 | 4 | 9 | 2 | 4 | 3 | 169 | 71 | 120 | 85 | 120 | 90 | 4 | 12 | 209 | 86 | 49 | -1898 | 29 | 1976 | 10 | nan | nan | nan | nan | 12 | 2 | nan | 2 | nan | nan | 2 | 2 | nan | nan | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 1978 | NULL | NULL | NULL |
2 | 1979-01-01 | 1976-10-01 | 10001 | 0 | 4 | 1 | 3 | 1 | 1 | 2 | 3 | 6 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 4 | 10 | nan | 1 | 4 | 9 | 2 | 4 | 3 | 169 | 71 | 120 | 85 | 120 | 90 | 4 | 12 | 209 | 86 | 50 | -1897 | 29 | 1976 | 10 | nan | nan | nan | nan | 12 | 2 | nan | 2 | nan | nan | 2 | 2 | nan | nan | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 1979 | NULL | NULL | NULL |
3 | 1980-01-01 | 1976-10-01 | 10001 | 0 | 4 | 1 | 3 | 1 | 1 | 2 | 3 | 6 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 4 | 10 | nan | 1 | 4 | 9 | 2 | 4 | 3 | 169 | 71 | 120 | 85 | 120 | 90 | 4 | 12 | 209 | 86 | 51 | -1896 | 29 | 1976 | 10 | nan | nan | nan | nan | 12 | 2 | nan | 2 | nan | nan | 2 | 2 | nan | nan | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 1980 | NULL | NULL | NULL |
4 | 1981-01-01 | 1976-10-01 | 10001 | 0 | 4 | 1 | 3 | 1 | 1 | 2 | 3 | 6 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 4 | 10 | nan | 1 | 4 | 9 | 2 | 4 | 3 | 169 | 71 | 120 | 85 | 120 | 90 | 4 | 12 | 209 | 86 | 52 | -1895 | 29 | 1976 | 10 | nan | nan | nan | nan | 12 | 2 | nan | 2 | nan | nan | 2 | 2 | nan | nan | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 1981 | NULL | NULL | NULL |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
28428 | 1995-01-01 | 1977-03-01 | 30065 | 0 | 5 | 2 | 2 | 3 | 1 | 2 | 1 | 5 | 2 | 3 | 2 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 4 | 10 | nan | 3 | 4 | 7 | 1 | 5 | 2 | 179 | 69 | 110 | 80 | 120 | 80 | 10 | 18 | 235 | 733 | 67 | -1882 | 28 | 1977 | 3 | nan | 9 | nan | nan | nan | 2 | nan | 2 | nan | nan | 2 | 2 | nan | nan | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 1995 | NULL | NULL | NULL |
28429 | 1996-01-01 | 1977-03-01 | 30065 | 0 | 5 | 2 | 2 | 3 | 1 | 2 | 1 | 5 | 2 | 3 | 2 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 4 | 10 | nan | 3 | 4 | 7 | 1 | 5 | 2 | 179 | 69 | 110 | 80 | 120 | 80 | 10 | 18 | 235 | 733 | 68 | -1881 | 28 | 1977 | 3 | nan | 9 | nan | nan | nan | 2 | nan | 2 | nan | nan | 2 | 2 | nan | nan | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 1996 | NULL | NULL | NULL |
28430 | 1997-01-01 | 1977-03-01 | 30065 | 0 | 5 | 2 | 2 | 3 | 1 | 2 | 1 | 5 | 2 | 3 | 2 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 4 | 10 | nan | 3 | 4 | 7 | 1 | 5 | 2 | 179 | 69 | 110 | 80 | 120 | 80 | 10 | 18 | 235 | 733 | 69 | -1880 | 28 | 1977 | 3 | nan | 9 | nan | nan | nan | 2 | nan | 2 | nan | nan | 2 | 2 | nan | nan | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 1997 | NULL | NULL | NULL |
28431 | 1998-01-01 | 1977-03-01 | 30065 | 0 | 5 | 2 | 2 | 3 | 1 | 2 | 1 | 5 | 2 | 3 | 2 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 4 | 10 | nan | 3 | 4 | 7 | 1 | 5 | 2 | 179 | 69 | 110 | 80 | 120 | 80 | 10 | 18 | 235 | 733 | 70 | -1879 | 28 | 1977 | 3 | nan | 9 | nan | nan | nan | 2 | nan | 2 | nan | nan | 2 | 2 | nan | nan | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 1998 | NULL | NULL | NULL |
28432 | 1999-01-01 | 1977-03-01 | 30065 | 0 | 5 | 2 | 2 | 3 | 1 | 2 | 1 | 5 | 2 | 3 | 2 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 4 | 10 | nan | 3 | 4 | 7 | 1 | 5 | 2 | 179 | 69 | 110 | 80 | 120 | 80 | 10 | 18 | 235 | 733 | 71 | -1878 | 28 | 1977 | 3 | nan | 9 | nan | nan | nan | 2 | nan | 2 | nan | nan | 2 | 2 | nan | nan | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 1999 | NULL | NULL | NULL |
28433 rows x 76 columns
memory usage: 15.27 MB
name: population
type: getml.DataFrame
Contr: peripheral table
contr
name | CONTROL_DATE | ICO | ZMTELAKT | AKTPOZAM | ZMDIET | LEKTLAK | ZMKOUR | POCCIG | PRACNES | JINAONE | BOLHR | BOLDK | DUSN | HODNSK | HYPERD | HYPCHL | HYPTGL | HMOT | CHLST | HDL | HDLMG | LDL | ROKVYS | MESVYS | PORADK | ZMCHARZA | LEKCHOL | SRDCE | HYPERT | CEVMOZ | DIAB | HODN0 | ROK0 | HODN1 | ROK1 | HODN2 | ROK2 | HODN3 | ROK3 | HODN4 | ROK4 | HODN11 | ROK11 | HODN12 | ROK12 | HODN13 | ROK13 | HODN14 | ROK14 | HODN15 | ROK15 | HODN21 | ROK21 | HODN23 | ROK23 | SYST | DIAST | TRIC | SUBSC | HYPERSD | HYPERS | CHLSTMG | TRIGL | TRIGLMG | MOC | GLYKEMIE | KYSMOC |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | join_key | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | numerical | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float |
unit | time stamp | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
0 | 1977-09-01 | 10001 | 3 | 2 | 1 | 2 | 2 | 0 | 1 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 71 | 5.61 | nan | nan | nan | 1977 | 9 | 1 | 20 | nan | nan | nan | nan | nan | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 130 | 90 | 4 | 12 | 2 | 2 | 217 | 1.22 | 108 | 1 | nan | nan |
1 | 1979-01-01 | 10001 | 1 | 1 | 1 | 2 | 1 | 0 | 2 | NULL | 1 | 2 | 1 | 2 | 2 | 2 | 2 | 72 | 6 | nan | nan | nan | 1979 | 1 | 2 | 20 | nan | nan | nan | nan | nan | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 140 | 90 | 4 | 11 | 2 | 2 | 232 | 4.4 | 389 | 3 | nan | nan |
2 | 1980-04-01 | 10001 | 2 | 1 | 1 | 2 | 1 | 0 | 2 | NULL | 2 | 1 | 1 | 2 | 2 | 2 | 2 | 71 | 6.23 | nan | nan | nan | 1980 | 4 | 3 | 20 | nan | nan | nan | nan | nan | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 130 | 90 | 5 | 22 | 2 | 2 | 241 | 1.51 | 134 | 1 | nan | nan |
3 | 1982-01-01 | 10001 | 2 | 1 | 1 | 2 | 1 | 0 | 1 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 74 | 5.2 | nan | nan | nan | 1982 | 1 | 4 | 20 | nan | nan | nan | nan | nan | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 150 | 100 | 9 | 18 | 2 | 2 | 201 | 1.42 | 126 | 1 | nan | nan |
4 | 1983-02-01 | 10001 | 2 | 2 | 1 | 2 | 1 | 0 | 2 | NULL | 1 | 2 | 1 | 2 | 2 | 2 | 2 | 73 | 6.08 | 1.47 | 57 | 4.15 | 1983 | 2 | 5 | 20 | nan | nan | nan | nan | nan | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 165 | 105 | 7 | 15 | 2 | 2 | 235 | 0.99 | 88 | 1 | nan | nan |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
10567 | 1987-10-01 | 20358 | 2 | 2 | 1 | 2 | 1 | 0 | 2 | NULL | 4 | 1 | 1 | 3 | 2 | 2 | 2 | 85 | 4.89 | 0.98 | 38 | 3.11 | 1987 | 10 | 1 | 20 | nan | nan | nan | nan | nan | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 120 | 80 | nan | nan | 2 | 2 | 189 | 1.74 | 154 | 1 | nan | nan |
10568 | 1987-11-01 | 20359 | 2 | 2 | 1 | 2 | 1 | 14 | 2 | NULL | 1 | 1 | 1 | 3 | 2 | 2 | 2 | 80 | 4.5 | 0.98 | 38 | 3.16 | 1987 | 11 | 1 | 20 | nan | nan | nan | nan | nan | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 100 | 70 | nan | nan | 2 | 2 | 174 | 0.79 | 70 | 1 | nan | nan |
10569 | 1987-10-01 | 20360 | 2 | 2 | 5 | 2 | 1 | 0 | 2 | NULL | 2 | 1 | 2 | 3 | 2 | 2 | 2 | 95 | 5.51 | 1.29 | 50 | 3.68 | 1987 | 10 | 1 | 40 | nan | nan | nan | nan | nan | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 120 | 80 | nan | nan | 2 | 2 | 213 | 1.18 | 104 | 1 | nan | nan |
10570 | 1987-10-01 | 20362 | 1 | 2 | 1 | 2 | 1 | 0 | 2 | NULL | 6 | 4 | 1 | 3 | 2 | 2 | 2 | 56 | 4.89 | 3 | 116 | 1.51 | 1987 | 10 | 1 | 40 | nan | nan | nan | nan | nan | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 140 | 85 | nan | nan | 2 | 2 | 189 | 0.82 | 73 | 1 | nan | nan |
10571 | 1978-09-01 | 30037 | 2 | 1 | 1 | 2 | 3 | 15 | 2 | NULL | 4 | 1 | 2 | 4 | 2 | 2 | 1 | 72 | 5.61 | nan | nan | nan | 1978 | 9 | 1 | 20 | nan | nan | nan | nan | nan | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 125 | 90 | 9 | 17 | 2 | 2 | 217 | nan | nan | 1 | nan | nan |
10572 rows x 67 columns
memory usage: 4.99 MB
name: contr
type: getml.DataFrame
1.3 Define relational model¶
To start with relational learning, we need to specify an abstract data model. Here, we use the high-level star schema API, which allows us to define the abstract data model and construct a container with the concrete data in one go. While a simple StarSchema
indeed works in many cases, it is not sufficient for more complex data models like snowflake schemas, where you would have to define the data model and construct the container in separate steps, utilizing getML's full-fledged data model and container APIs, respectively.
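For illustration only, a two-step sketch of the same model using those lower-level APIs might look like this (we do not execute it here; the star schema below is all we need):
# Step 1: define the abstract data model on placeholders.
dm = getml.data.DataModel(population.to_placeholder("population"))
dm.add(contr.to_placeholder("contr"))
dm.population.join(
    dm.contr,
    on="ICO",
    time_stamps=("REFERENCE_DATE", "CONTROL_DATE"),
)

# Step 2: fill a container with the concrete data.
split = getml.data.split.random(train=0.7, test=0.3)  # same split as below
container = getml.data.Container(population=population, split=split)
container.add(contr=contr)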
split = getml.data.split.random(train=0.7, test=0.3)
star_schema = getml.data.StarSchema(population, split=split)
star_schema.join(
contr,
on="ICO",
time_stamps=("REFERENCE_DATE", "CONTROL_DATE")
)
star_schema
data frames | staging table | |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | contr | CONTR__STAGING_TABLE_2 |
subset | name | rows | type | |
---|---|---|---|---|
0 | test | population | 8557 | View |
1 | train | population | 19876 | View |
name | rows | type | |
---|---|---|---|
0 | contr | 10572 | DataFrame |
2. Predictive modeling¶
We have loaded the data and defined the roles, units, and the abstract data model. Next, we create a getML pipeline for relational learning.
2.1 Fitting the pipeline¶
To illustrate the problem of computational complexity, we fit two getML pipelines using different feature learning algorithms: the first pipeline uses RelMT
and the second uses Relboost
. At the end, we compare the runtimes and predictive accuracies of the two pipelines.
relmt = getml.feature_learning.RelMT(
num_features=30,
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
num_threads=1
)
relboost = getml.feature_learning.Relboost(
num_features=60,
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
min_num_samples=500,
max_depth=2,
num_threads=1
)
feature_selector = getml.predictors.XGBoostClassifier()
xgboost = getml.predictors.XGBoostClassifier(
max_depth=5,
reg_lambda=100.0,
learning_rate=0.1
)
pipe1 = getml.pipeline.Pipeline(
tags=["relmt"],
data_model=star_schema.data_model,
feature_learners=relmt,
feature_selectors=feature_selector,
share_selected_features=0.8,
predictors=xgboost,
include_categorical=True
)
pipe1
Pipeline(data_model='population', feature_learners=['RelMT'], feature_selectors=['XGBoostClassifier'], include_categorical=True, loss_function='CrossEntropyLoss', peripheral=['contr'], predictors=['XGBoostClassifier'], preprocessors=[], share_selected_features=0.8, tags=['relmt'])
pipe2 = getml.pipeline.Pipeline(
tags=["relboost"],
data_model=star_schema.data_model,
feature_learners=relboost,
feature_selectors=feature_selector,
share_selected_features=0.8,
predictors=xgboost,
include_categorical=True
)
pipe2
Pipeline(data_model='population', feature_learners=['Relboost'], feature_selectors=['XGBoostClassifier'], include_categorical=True, loss_function='CrossEntropyLoss', peripheral=['contr'], predictors=['XGBoostClassifier'], preprocessors=[], share_selected_features=0.8, tags=['relboost'])
We begin with RelMT. Features generated by RelMT suffer from quadratic complexity. Luckily, the number of columns is not too high, so the search space is still manageable.
It is always a good idea to check the pipeline before fitting. There is one minor warning: about 12% of the patients never appear in contr (presumably because they never showed up for their health check-ups).
pipe1.check(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 1 issues labeled INFO and 0 issues labeled WARNING.
type | label | message | |
---|---|---|---|
0 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and CONTR__STAGING_TABLE_2 over 'ICO' and 'ICO', there are no corresponding entries for 12.291205% of entries in 'ICO' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
pipe1.fit(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 1 issues labeled INFO and 0 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 RelMT: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 07:02 RelMT: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01 XGBoost: Training as feature selector... ━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:04
Trained pipeline.
Time taken: 0:07:11.228924.
Pipeline(data_model='population', feature_learners=['RelMT'], feature_selectors=['XGBoostClassifier'], include_categorical=True, loss_function='CrossEntropyLoss', peripheral=['contr'], predictors=['XGBoostClassifier'], preprocessors=[], share_selected_features=0.8, tags=['relmt', 'container-5Vksxn'])
Let's see how well Relboost does. This algorithm has linear complexity in the number of columns.
pipe2.fit(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 1 issues labeled INFO and 0 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Relboost: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:22 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:03 XGBoost: Training as feature selector... ━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:03
Trained pipeline.
Time taken: 0:00:31.813144.
Pipeline(data_model='population', feature_learners=['Relboost'], feature_selectors=['XGBoostClassifier'], include_categorical=True, loss_function='CrossEntropyLoss', peripheral=['contr'], predictors=['XGBoostClassifier'], preprocessors=[], share_selected_features=0.8, tags=['relboost', 'container-5Vksxn'])
Note that this runs through in under a minute. This demonstrates the practical impact of computational complexity. If we had more columns, the difference between these two algorithms would become even more noticeable.
2.2 Model evaluation¶
pipe1.score(star_schema.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 RelMT: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 13:09:44 | train | TARGET | 0.9891 | 0.8904 | 0.0507 |
1 | 2024-09-12 13:10:17 | test | TARGET | 0.9867 | 0.7143 | 0.06851 |
pipe2.score(star_schema.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 13:10:16 | train | TARGET | 0.9878 | 0.8912 | 0.0538 |
1 | 2024-09-12 13:10:19 | test | TARGET | 0.9867 | 0.7107 | 0.06748 |
2.3 Studying the features¶
It is always a good idea to study the features generated by the algorithms.
names, correlations = pipe2.features.correlations()
plt.subplots(figsize=(20, 10))
plt.bar(names, correlations)
plt.title("feature correlations")
plt.grid(True)
plt.xlabel("features")
plt.ylabel("correlations")
plt.xticks(rotation='vertical')
plt.show()
names, importances = pipe2.features.importances()
plt.subplots(figsize=(20, 10))
plt.bar(names, importances)
plt.title("feature importances")
plt.grid(True)
plt.xlabel("features")
plt.ylabel("importances")
plt.xticks(rotation='vertical')
plt.show()
As we can see from these figures, we need many features to get a good result: no single feature is highly correlated with the target, and the feature importance is not concentrated on a small number of features (as is often the case with other data sets).
This implies that the many columns in the data set are actually needed. We emphasize this because we sometimes see data sets with many columns where, upon analysis, only a handful of the columns turn out to be needed. This is not one of those cases.
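One quick way to quantify how spread out the importances are, reusing the values extracted above (a small sketch assuming numpy is available in the session):
import numpy as np

# Cumulative share of the total importance, most important feature first.
sorted_importances = np.sort(np.asarray(importances))[::-1]
cumulative_share = np.cumsum(sorted_importances) / sorted_importances.sum()

# How many features are needed to cover 80% of the total importance?
n_features_80 = int(np.searchsorted(cumulative_share, 0.8)) + 1
print(f"{n_features_80} features account for 80% of the total importance.")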
The most important features look like this:
pipe1.features.to_sql()[pipe1.features.sort(by="importances")[0].name]
DROP TABLE IF EXISTS "FEATURE_1_7";
CREATE TABLE "FEATURE_1_7" AS
SELECT SUM(
CASE
WHEN ( t1."trigl" > 447.000000 ) AND ( t1."age" > 57.000000 ) THEN COALESCE( t1."koureni" - 3.392245777122775, 0.0 ) * 36.71734233772222 + COALESCE( t1."dobakour" - 6.958281373993878, 0.0 ) * -36.03237447465777 + COALESCE( t1."byvkurak" - 1.664097041151797, 0.0 ) * -4.140363438871325 + COALESCE( t1."pivomn" - 1.800816233987076, 0.0 ) * 2.866334536889667 + COALESCE( t1."vinomn" - 4.290329894569776, 0.0 ) * -3.474866798512964 + COALESCE( t1."lihmn" - 6.945697766693119, 0.0 ) * 42.97484465396469 + COALESCE( t1."kava" - 1.912708309715452, 0.0 ) * 37.67135620384678 + COALESCE( t1."caj" - 4.734043759210974, 0.0 ) * 41.49696018116632 + COALESCE( t1."cukr" - 4.523523410044213, 0.0 ) * 1.728682170938833 + COALESCE( t1."vyska" - 174.5359936515134, 0.0 ) * 2.532893582104208 + COALESCE( t1."vaha" - 79.88028568189547, 0.0 ) * -0.6526932998438458 + COALESCE( t1."syst1" - 132.0999886634168, 0.0 ) * 2.189892170970478 + COALESCE( t1."diast1" - 83.57759891168801, 0.0 ) * 0.3401612628322462 + COALESCE( t1."syst2" - 129.4103843101689, 0.0 ) * -1.979165210528913 + COALESCE( t1."diast2" - 83.1808185013037, 0.0 ) * 1.57192359403996 + COALESCE( t1."tric" - 9.403355628613536, 0.0 ) * -6.18176175433044 + COALESCE( t1."subsc" - 18.14816914182066, 0.0 ) * 12.22343448214527 + COALESCE( t1."chlst" - 232.3951932887428, 0.0 ) * 0.2971207012537834 + COALESCE( t1."trigl" - 134.7867588708763, 0.0 ) * -0.09941197254629983 + COALESCE( t1."age" - 58.58485432490647, 0.0 ) * 10.85819048719916 + COALESCE( t1."participation" - -1887.385670558893, 0.0 ) * 81.75340957666666 + COALESCE( t1."reference_date" - 615908174.0392246, 0.0 ) * -2.926904989988034e-06 + COALESCE( t1."entry_date" - 230998436.6398368, 0.0 ) * 1.154718952316122e-06 + COALESCE( t2."hmot" - 81.23347313760735, 0.0 ) * 0.1333271764688622 + COALESCE( t2."hdlmg" - 35.52386658677851, 0.0 ) * -0.2306312557846993 + COALESCE( t2."chlst" - 5.894816257239805, 0.0 ) * -0.490138288302913 + COALESCE( t2."hdl" - 0.916622728180551, 0.0 ) * 9.056133763122926 + COALESCE( t2."ldl" - 2.466835430397438, 0.0 ) * -0.01190990485510771 + COALESCE( t2."control_date" - 533547919.9520671, 0.0 ) * -1.626971839941278e-08 + 4.4320403142502030e+01
WHEN ( t1."trigl" > 447.000000 ) AND ( t1."age" <= 57.000000 OR t1."age" IS NULL ) THEN COALESCE( t1."koureni" - 3.392245777122775, 0.0 ) * 68.9362363063014 + COALESCE( t1."dobakour" - 6.958281373993878, 0.0 ) * -49.72699167016215 + COALESCE( t1."byvkurak" - 1.664097041151797, 0.0 ) * -16.57345552532797 + COALESCE( t1."pivomn" - 1.800816233987076, 0.0 ) * 93.14809820598344 + COALESCE( t1."vinomn" - 4.290329894569776, 0.0 ) * 66.54704537730123 + COALESCE( t1."lihmn" - 6.945697766693119, 0.0 ) * 6.98997936976407 + COALESCE( t1."kava" - 1.912708309715452, 0.0 ) * -25.37270111467802 + COALESCE( t1."caj" - 4.734043759210974, 0.0 ) * 44.97434463952945 + COALESCE( t1."cukr" - 4.523523410044213, 0.0 ) * -11.97758628053244 + COALESCE( t1."vyska" - 174.5359936515134, 0.0 ) * -14.41189734039584 + COALESCE( t1."vaha" - 79.88028568189547, 0.0 ) * 9.201053407268812 + COALESCE( t1."syst1" - 132.0999886634168, 0.0 ) * 6.60642275226787 + COALESCE( t1."diast1" - 83.57759891168801, 0.0 ) * 0.2798793062603677 + COALESCE( t1."syst2" - 129.4103843101689, 0.0 ) * -8.618828438293468 + COALESCE( t1."diast2" - 83.1808185013037, 0.0 ) * 2.610294450622261 + COALESCE( t1."tric" - 9.403355628613536, 0.0 ) * 0.05980697404869816 + COALESCE( t1."subsc" - 18.14816914182066, 0.0 ) * -22.03705030242565 + COALESCE( t1."chlst" - 232.3951932887428, 0.0 ) * 0.08847429025390953 + COALESCE( t1."trigl" - 134.7867588708763, 0.0 ) * -0.01251193674433636 + COALESCE( t1."age" - 58.58485432490647, 0.0 ) * 4.86187301570697 + COALESCE( t1."participation" - -1887.385670558893, 0.0 ) * -137.3954135916799 + COALESCE( t1."reference_date" - 615908174.0392246, 0.0 ) * 4.20106983933724e-06 + COALESCE( t1."entry_date" - 230998436.6398368, 0.0 ) * -8.388379539806533e-06 + COALESCE( t2."hmot" - 81.23347313760735, 0.0 ) * 0.1237910720450438 + COALESCE( t2."hdlmg" - 35.52386658677851, 0.0 ) * 3.651474792793678 + COALESCE( t2."chlst" - 5.894816257239805, 0.0 ) * -0.3361774904051968 + COALESCE( t2."hdl" - 0.916622728180551, 0.0 ) * -141.5557739425682 + COALESCE( t2."ldl" - 2.466835430397438, 0.0 ) * -0.7515251196140998 + COALESCE( t2."control_date" - 533547919.9520671, 0.0 ) * 6.744845044264805e-09 + 8.4141958089744847e+01
WHEN ( t1."trigl" <= 447.000000 OR t1."trigl" IS NULL ) AND ( t2."ldl" > 6.589412 ) THEN COALESCE( t1."koureni" - 3.392245777122775, 0.0 ) * 4.324545000639816 + COALESCE( t1."dobakour" - 6.958281373993878, 0.0 ) * 6.394011893899877 + COALESCE( t1."byvkurak" - 1.664097041151797, 0.0 ) * -0.00772340630819322 + COALESCE( t1."pivomn" - 1.800816233987076, 0.0 ) * 13.07564436701492 + COALESCE( t1."vinomn" - 4.290329894569776, 0.0 ) * 14.71174562700868 + COALESCE( t1."lihmn" - 6.945697766693119, 0.0 ) * -13.02469603877145 + COALESCE( t1."kava" - 1.912708309715452, 0.0 ) * -20.75690206999771 + COALESCE( t1."caj" - 4.734043759210974, 0.0 ) * -42.59571735654135 + COALESCE( t1."cukr" - 4.523523410044213, 0.0 ) * 4.547061088931812 + COALESCE( t1."vyska" - 174.5359936515134, 0.0 ) * 1.552658882464771 + COALESCE( t1."vaha" - 79.88028568189547, 0.0 ) * 1.317608177040875 + COALESCE( t1."syst1" - 132.0999886634168, 0.0 ) * 1.785126658763984 + COALESCE( t1."diast1" - 83.57759891168801, 0.0 ) * -2.546946347046458 + COALESCE( t1."syst2" - 129.4103843101689, 0.0 ) * -0.470414693564651 + COALESCE( t1."diast2" - 83.1808185013037, 0.0 ) * 1.623116203255628 + COALESCE( t1."tric" - 9.403355628613536, 0.0 ) * -6.40858643038764 + COALESCE( t1."subsc" - 18.14816914182066, 0.0 ) * 4.130064218077451 + COALESCE( t1."chlst" - 232.3951932887428, 0.0 ) * -0.1989132193311207 + COALESCE( t1."trigl" - 134.7867588708763, 0.0 ) * 0.2263022952597863 + COALESCE( t1."age" - 58.58485432490647, 0.0 ) * 1.197285989581476 + COALESCE( t1."participation" - -1887.385670558893, 0.0 ) * 19.06271270018097 + COALESCE( t1."reference_date" - 615908174.0392246, 0.0 ) * -6.382150275518887e-07 + COALESCE( t1."entry_date" - 230998436.6398368, 0.0 ) * 2.775023256741417e-07 + COALESCE( t2."hmot" - 81.23347313760735, 0.0 ) * -1.286757301370443 + COALESCE( t2."hdlmg" - 35.52386658677851, 0.0 ) * -16.40518752403113 + COALESCE( t2."chlst" - 5.894816257239805, 0.0 ) * 2.904438249194144 + COALESCE( t2."hdl" - 0.916622728180551, 0.0 ) * 641.5020896438801 + COALESCE( t2."ldl" - 2.466835430397438, 0.0 ) * 0.1447675001171414 + COALESCE( t2."control_date" - 533547919.9520671, 0.0 ) * -7.562988930726452e-09 + -9.0812299642137955e+00
WHEN ( t1."trigl" <= 447.000000 OR t1."trigl" IS NULL ) AND ( t2."ldl" <= 6.589412 OR t2."ldl" IS NULL ) THEN COALESCE( t1."koureni" - 3.392245777122775, 0.0 ) * 6.861948102582569e-05 + COALESCE( t1."dobakour" - 6.958281373993878, 0.0 ) * 0.01745941706076835 + COALESCE( t1."byvkurak" - 1.664097041151797, 0.0 ) * -0.006563506280726456 + COALESCE( t1."pivomn" - 1.800816233987076, 0.0 ) * 0.002370132374868186 + COALESCE( t1."vinomn" - 4.290329894569776, 0.0 ) * -0.01446308909963534 + COALESCE( t1."lihmn" - 6.945697766693119, 0.0 ) * -0.01721180426547575 + COALESCE( t1."kava" - 1.912708309715452, 0.0 ) * 0.0202459283752464 + COALESCE( t1."caj" - 4.734043759210974, 0.0 ) * 0.009776762328480083 + COALESCE( t1."cukr" - 4.523523410044213, 0.0 ) * 0.001186532318472517 + COALESCE( t1."vyska" - 174.5359936515134, 0.0 ) * 0.001436166322912831 + COALESCE( t1."vaha" - 79.88028568189547, 0.0 ) * -0.00414449771250642 + COALESCE( t1."syst1" - 132.0999886634168, 0.0 ) * -0.001658868199165182 + COALESCE( t1."diast1" - 83.57759891168801, 0.0 ) * 0.002557569464948623 + COALESCE( t1."syst2" - 129.4103843101689, 0.0 ) * 0.002437793550397995 + COALESCE( t1."diast2" - 83.1808185013037, 0.0 ) * -6.174045506422857e-05 + COALESCE( t1."tric" - 9.403355628613536, 0.0 ) * -0.007499631525141079 + COALESCE( t1."subsc" - 18.14816914182066, 0.0 ) * 0.002113558511703661 + COALESCE( t1."chlst" - 232.3951932887428, 0.0 ) * 0.0001341535907906942 + COALESCE( t1."trigl" - 134.7867588708763, 0.0 ) * -1.630868859521257e-05 + COALESCE( t1."age" - 58.58485432490647, 0.0 ) * 0.002356506859856778 + COALESCE( t1."participation" - -1887.385670558893, 0.0 ) * -0.02444683750177154 + COALESCE( t1."reference_date" - 615908174.0392246, 0.0 ) * 2.117099967782411e-09 + COALESCE( t1."entry_date" - 230998436.6398368, 0.0 ) * -2.945219991878199e-10 + COALESCE( t2."hmot" - 81.23347313760735, 0.0 ) * 0.003574474413349186 + COALESCE( t2."hdlmg" - 35.52386658677851, 0.0 ) * -0.003244489062109498 + COALESCE( t2."chlst" - 5.894816257239805, 0.0 ) * -0.03710669507874064 + COALESCE( t2."hdl" - 0.916622728180551, 0.0 ) * 0.088384351075893 + COALESCE( t2."ldl" - 2.466835430397438, 0.0 ) * -0.00342765294998646 + COALESCE( t2."control_date" - 533547919.9520671, 0.0 ) * -2.444170025832969e-09 + -2.5882564116851187e-01
ELSE NULL
END
) AS "feature_1_7",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "CONTR__STAGING_TABLE_2" t2
ON t1."ico" = t2."ico"
WHERE t2."control_date" <= t1."reference_date"
GROUP BY t1.rowid;
pipe2.features.to_sql()[pipe2.features.sort(by="importances")[1].name]
DROP TABLE IF EXISTS "FEATURE_1_37";
CREATE TABLE "FEATURE_1_37" AS
SELECT AVG(
CASE
WHEN ( t1."reference_date" - t2."control_date" > 352382400.000000 ) AND ( t1."entry_date" > 231194988.679245 ) THEN 1.67596291761292
WHEN ( t1."reference_date" - t2."control_date" > 352382400.000000 ) AND ( t1."entry_date" <= 231194988.679245 OR t1."entry_date" IS NULL ) THEN -0.1053239490875762
WHEN ( t1."reference_date" - t2."control_date" <= 352382400.000000 OR t1."reference_date" IS NULL OR t2."control_date" IS NULL ) AND ( t2."poccig" IN ( '20' ) ) THEN 0.7547001150541075
WHEN ( t1."reference_date" - t2."control_date" <= 352382400.000000 OR t1."reference_date" IS NULL OR t2."control_date" IS NULL ) AND ( t2."poccig" NOT IN ( '20' ) OR t2."poccig" IS NULL ) THEN -0.4272755600104092
ELSE NULL
END
) AS "feature_1_37",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "CONTR__STAGING_TABLE_2" t2
ON t1."ico" = t2."ico"
WHERE t2."control_date" <= t1."reference_date"
GROUP BY t1.rowid;
2.4 Productionization¶
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3 and spark modules.
# Creates a folder named atherosclerosis_pipeline containing
# the SQL code.
pipe2.features.to_sql().save("atherosclerosis_pipeline", remove=True)
pipe2.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("atherosclerosis_pipeline_spark", remove=True)
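As a rough sketch of how the transpiled code could be consumed, getML's sqlite3 helper module can execute the generated scripts against a plain SQLite database. We assume here that the raw population and contr tables have already been loaded into that database:
# Sketch: execute the transpiled feature scripts on a SQLite database
# that already contains the raw population and contr tables.
conn = getml.sqlite3.connect("atherosclerosis.db")
getml.sqlite3.execute(conn, "atherosclerosis_pipeline")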
getml.engine.shutdown()
3. Conclusion¶
The runtime benchmark between RelMT and Relboost on the same dataset demonstrates the problem of computational complexity in practice. We find that RelMT takes more than ten times as long as Relboost (about 7 minutes versus about 30 seconds) while learning only half as many features (30 versus 60).
The purpose of this notebook has been to illustrate the problem of the curse of dimensionality when engineering features from datasets with many columns.
The most important thing to remember is that this problem exists regardless of whether you engineer your features manually or algorithmically. Whether you like it or not: if you write your features in the traditional way, your search space grows quadratically with the number of columns.