Atherosclerosis - Disease lethality prediction¶
With this notebook we give a brief introduction to feature engineering on relational data with many columns. We discuss why feature engineering on such data is particularly challenging and what we can do to overcome these problems.
Summary:
- Prediction type: Binary classification
- Domain: Health
- Prediction target: Mortality within one year
- Source data: 146 columns in 2 tables, 22 MB
- Population size: 28433
Feature engineering and the curse of dimensionality¶
The problem¶
To illustrate the point, we give a simplified example based on the real data used in the analysis below. When we engineer features from relational data, we usually write something like this:
SELECT AVG(t2.HDL)
FROM population_training t1
LEFT JOIN contr t2
ON t1.ICO = t2.ICO
WHERE t1.AGE >= 60 AND t1.ALKOHOL IN ('1', '2')
GROUP BY t1.ICO;
Think about that for a second. This feature aggregates high-density lipoprotein (HDL) cholesterol values recorded during check-up dates, conditional on age and alcohol consumption. We arbitrarily chose both the column to aggregate (HDL) and the columns to construct conditions on (AGE and ALKOHOL) from a larger set of 146 columns.
Every column we have can either be aggregated (here HDL) or used in a condition (here AGE and ALKOHOL). That means for each of the $n$ columns we might aggregate, we can potentially build conditions on $n$ other columns. In other words, the size of the search space grows as $n^2$ in the number of columns.
Note that this problem occurs regardless of whether you automate feature engineering or do it by hand. The size of the search space is $n^2$ in the number of columns in either case, unless you can rule out something a priori.
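To make the arithmetic concrete, here is a quick back-of-the-envelope count for this dataset (a sketch; the exact number of candidate features also depends on column roles and on the set of aggregations considered):
n_columns = 146
# Traditional features: pick one column to aggregate and one other column
# to build a condition on -- the search space grows quadratically.
quadratic = n_columns * (n_columns - 1)
print(quadratic)  # 21170 column pairs, before even choosing aggregations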
The solution¶
So when we have relational data sets with many columns, what do we do? The answer is to write different features. Specifically, suppose we had features like this:
SELECT AVG(
    CASE
        WHEN t1.AGE >= 60 THEN weight1
        WHEN t1.ALKOHOL IN ('1', '2') THEN weight2
    END
)
FROM population_training t1
LEFT JOIN contr t2
ON t1.ICO = t2.ICO
GROUP BY t1.ICO;
weight1 and weight2 are learnable weights. An algorithm that generates features like this only uses columns in conditions; it is not allowed to aggregate columns directly, and it doesn't need to.
That means the computational complexity is linear instead of quadratic. For data sets with a large number of columns, this can make all the difference in the world. For instance, if you have 100 columns, the search space of the second approach is only 1% the size of the first one's ($100$ vs. $100^2 = 10{,}000$).
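To see what such a feature computes outside of SQL, here is a minimal numpy sketch of the weighted CASE WHEN aggregation for a single patient. The thresholds and weights are hypothetical; in getML they would be learned from the data:
import numpy as np

# Joined rows for one patient; each row gets a learnable weight depending
# on which condition fires first, then the weights are averaged.
age = np.array([65.0, 45.0, 72.0, 58.0])
alkohol = np.array(["1", "3", "2", "1"])

weight1, weight2 = 0.8, -0.3  # hypothetical learned weights

case_values = np.where(
    age >= 60,
    weight1,  # WHEN AGE >= 60 THEN weight1
    np.where(np.isin(alkohol, ["1", "2"]), weight2, np.nan),  # WHEN ALKOHOL IN ('1', '2') THEN weight2
)

feature = np.nanmean(case_values)  # AVG(...) skips rows where no condition fires
print(feature)  # 0.4333...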
Background¶
To illustrate the problem of dimensionality in predictive analytics on relational data, we use the STULONG dataset. It is a longitudinal study of atherosclerosis patients.
One of its defining features is that it contains many columns, which makes it a good candidate to illustrate the problem discussed in this notebook.
There are some academic studies related to this dataset:
- https://www.researchgate.net/publication/228572841_Mining_episode_rules_in_STULONG_dataset
- https://citeseerx.ist.psu.edu/doc_view/pid/3a9cb05b77b631b6fcbe253eb93e053ba8c0719c
The way these studies handle the large number of columns in the data set is to divide the columns into subgroups and then handle each subgroup separately. Even though this is one way to overcome the curse of dimensionality, it is not a very satisfying approach: we would like to be able to handle a large number of columns at once.
The analysis is based on the STULONG dataset. It is publicly available and can be downloaded from the CTU Prague Relational Learning Repository (now residing at relational-data.org).
Analysis¶
Let's get started with the analysis and set up our session:
%pip install -q "getml==1.5.0" "featuretools==1.31.0" "tsfresh==0.20.3" "matplotlib==3.9.2" "ipywidgets==8.1.5"
import matplotlib.pyplot as plt
import getml
%matplotlib inline
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
getml.engine.launch(allow_remote_ips=True, token='token')
getml.engine.set_project('atherosclerosis')
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912133457.log. Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Connected to project 'atherosclerosis'.
1. Loading data¶
1.1 Download from source¶
Downloading the raw data and converting it into a prediction-ready format takes time. To get to the getML model building as fast as possible, we have prepared the data for you and excluded the preparation code from this notebook. It is available in the example notebook featuring the full analysis.
population, contr = getml.datasets.load_atherosclerosis()
1.2 Prepare data for getML¶
The getml.datasets.load_atherosclerosis method takes care of the entire data preparation:
- Downloads the CSVs from our servers into Python
- Converts the CSVs into getML DataFrames
- Sets the roles of the columns inside the getML DataFrames
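For reference, a hand-rolled version of these steps might look roughly like this. This is only a sketch: the CSV locations are placeholders and the role assignments are abbreviated.
# Sketch of the manual equivalent (file locations are placeholders).
population = getml.data.DataFrame.from_csv("population.csv", name="population")
contr = getml.data.DataFrame.from_csv("contr.csv", name="contr")

# Assign roles so the feature learners know how to treat each column.
population.set_role("ICO", getml.data.roles.join_key)
population.set_role(["REFERENCE_DATE", "ENTRY_DATE"], getml.data.roles.time_stamp)
population.set_role("TARGET", getml.data.roles.target)
population.set_role(["AGE", "CHLST", "TRIGL"], getml.data.roles.numerical)  # abbreviated

contr.set_role("ICO", getml.data.roles.join_key)
contr.set_role("CONTROL_DATE", getml.data.roles.time_stamp)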
The original data model is condensed into 2 tables:
- A population table population_{train/test/validate}, based on the death table
- A peripheral table: contr
Death: population table
- Reference date: period of time from 1976 to 1999
- Target: whether the patient dies within one year after each reference date (sketched schematically below)
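The preparation code is excluded from this notebook (see above), but schematically the target could be derived like this. This is a hypothetical pandas sketch, not the actual preparation code:
import pandas as pd

def make_population(patients: pd.DataFrame) -> pd.DataFrame:
    """One row per patient and reference year; TARGET is 1 if the
    patient died within one year of the reference date."""
    rows = []
    for patient in patients.itertuples():
        death = patient.DEATH_DATE  # NaT if the patient survived the study
        last_year = death.year if pd.notna(death) else 1999
        for year in range(patient.ENTRY_DATE.year + 1, last_year + 1):
            reference_date = pd.Timestamp(year=year, month=1, day=1)
            died_within_a_year = pd.notna(death) and (
                reference_date <= death < reference_date + pd.DateOffset(years=1)
            )
            rows.append({
                "ICO": patient.ICO,
                "REFERENCE_DATE": reference_date,
                "TARGET": int(died_within_a_year),
            })
    return pd.DataFrame(rows)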
population
name | REFERENCE_DATE | ENTRY_DATE | ICO | TARGET | KONSKUP | STAV | VZDELANI | ZODPOV | TELAKTZA | AKTPOZAM | DOPRAVA | DOPRATRV | ALKOHOL | BOLHR | BOLDK | DUSNOST | MOC | RARISK | OBEZRISK | KOURRISK | HTRISK | CHOLRISK | KOURENI | DOBAKOUR | BYVKURAK | PIVOMN | VINOMN | LIHMN | KAVA | CAJ | CUKR | VYSKA | VAHA | SYST1 | DIAST1 | SYST2 | DIAST2 | TRIC | SUBSC | CHLST | TRIGL | AGE | PARTICIPATION | ROKNAR | ROKVSTUP | MESVSTUP | PIVO7 | PIVO10 | PIVO12 | VINO | LIHOV | IM | IML | HT | HTD | HTL | ICT | DIABET | DIABD | DIABL | HYPLIP | HYPLD | HYPLL | IMTRV | HTTRV | DIABTRV | HYPLTRV | DENUMR | MESUMR | ROKUMR | PRICUMR | DUMMY | YEAR | ICTL | ICTTRV | DEATH_DATE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | time_stamp | join_key | target | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | numerical | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string |
unit | time stamp | time stamp | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
0 | 1977-01-01 | 1976-10-01 | 10001 | 0 | 4 | 1 | 3 | 1 | 1 | 2 | 3 | 6 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 4 | 10 | nan | 1 | 4 | 9 | 2 | 4 | 3 | 169 | 71 | 120 | 85 | 120 | 90 | 4 | 12 | 209 | 86 | 48 | -1899 | 29 | 1976 | 10 | nan | nan | nan | nan | 12 | 2 | nan | 2 | nan | nan | 2 | 2 | nan | nan | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 1977 | NULL | NULL | NULL |
1 | 1978-01-01 | 1976-10-01 | 10001 | 0 | 4 | 1 | 3 | 1 | 1 | 2 | 3 | 6 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 4 | 10 | nan | 1 | 4 | 9 | 2 | 4 | 3 | 169 | 71 | 120 | 85 | 120 | 90 | 4 | 12 | 209 | 86 | 49 | -1898 | 29 | 1976 | 10 | nan | nan | nan | nan | 12 | 2 | nan | 2 | nan | nan | 2 | 2 | nan | nan | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 1978 | NULL | NULL | NULL |
2 | 1979-01-01 | 1976-10-01 | 10001 | 0 | 4 | 1 | 3 | 1 | 1 | 2 | 3 | 6 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 4 | 10 | nan | 1 | 4 | 9 | 2 | 4 | 3 | 169 | 71 | 120 | 85 | 120 | 90 | 4 | 12 | 209 | 86 | 50 | -1897 | 29 | 1976 | 10 | nan | nan | nan | nan | 12 | 2 | nan | 2 | nan | nan | 2 | 2 | nan | nan | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 1979 | NULL | NULL | NULL |
3 | 1980-01-01 | 1976-10-01 | 10001 | 0 | 4 | 1 | 3 | 1 | 1 | 2 | 3 | 6 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 4 | 10 | nan | 1 | 4 | 9 | 2 | 4 | 3 | 169 | 71 | 120 | 85 | 120 | 90 | 4 | 12 | 209 | 86 | 51 | -1896 | 29 | 1976 | 10 | nan | nan | nan | nan | 12 | 2 | nan | 2 | nan | nan | 2 | 2 | nan | nan | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 1980 | NULL | NULL | NULL |
4 | 1981-01-01 | 1976-10-01 | 10001 | 0 | 4 | 1 | 3 | 1 | 1 | 2 | 3 | 6 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 4 | 10 | nan | 1 | 4 | 9 | 2 | 4 | 3 | 169 | 71 | 120 | 85 | 120 | 90 | 4 | 12 | 209 | 86 | 52 | -1895 | 29 | 1976 | 10 | nan | nan | nan | nan | 12 | 2 | nan | 2 | nan | nan | 2 | 2 | nan | nan | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 1981 | NULL | NULL | NULL |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
28428 | 1995-01-01 | 1977-03-01 | 30065 | 0 | 5 | 2 | 2 | 3 | 1 | 2 | 1 | 5 | 2 | 3 | 2 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 4 | 10 | nan | 3 | 4 | 7 | 1 | 5 | 2 | 179 | 69 | 110 | 80 | 120 | 80 | 10 | 18 | 235 | 733 | 67 | -1882 | 28 | 1977 | 3 | nan | 9 | nan | nan | nan | 2 | nan | 2 | nan | nan | 2 | 2 | nan | nan | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 1995 | NULL | NULL | NULL |
28429 | 1996-01-01 | 1977-03-01 | 30065 | 0 | 5 | 2 | 2 | 3 | 1 | 2 | 1 | 5 | 2 | 3 | 2 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 4 | 10 | nan | 3 | 4 | 7 | 1 | 5 | 2 | 179 | 69 | 110 | 80 | 120 | 80 | 10 | 18 | 235 | 733 | 68 | -1881 | 28 | 1977 | 3 | nan | 9 | nan | nan | nan | 2 | nan | 2 | nan | nan | 2 | 2 | nan | nan | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 1996 | NULL | NULL | NULL |
28430 | 1997-01-01 | 1977-03-01 | 30065 | 0 | 5 | 2 | 2 | 3 | 1 | 2 | 1 | 5 | 2 | 3 | 2 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 4 | 10 | nan | 3 | 4 | 7 | 1 | 5 | 2 | 179 | 69 | 110 | 80 | 120 | 80 | 10 | 18 | 235 | 733 | 69 | -1880 | 28 | 1977 | 3 | nan | 9 | nan | nan | nan | 2 | nan | 2 | nan | nan | 2 | 2 | nan | nan | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 1997 | NULL | NULL | NULL |
28431 | 1998-01-01 | 1977-03-01 | 30065 | 0 | 5 | 2 | 2 | 3 | 1 | 2 | 1 | 5 | 2 | 3 | 2 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 4 | 10 | nan | 3 | 4 | 7 | 1 | 5 | 2 | 179 | 69 | 110 | 80 | 120 | 80 | 10 | 18 | 235 | 733 | 70 | -1879 | 28 | 1977 | 3 | nan | 9 | nan | nan | nan | 2 | nan | 2 | nan | nan | 2 | 2 | nan | nan | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 1998 | NULL | NULL | NULL |
28432 | 1999-01-01 | 1977-03-01 | 30065 | 0 | 5 | 2 | 2 | 3 | 1 | 2 | 1 | 5 | 2 | 3 | 2 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 4 | 10 | nan | 3 | 4 | 7 | 1 | 5 | 2 | 179 | 69 | 110 | 80 | 120 | 80 | 10 | 18 | 235 | 733 | 71 | -1878 | 28 | 1977 | 3 | nan | 9 | nan | nan | nan | 2 | nan | 2 | nan | nan | 2 | 2 | nan | nan | 2 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 1999 | NULL | NULL | NULL |
28433 rows x 76 columns
memory usage: 15.27 MB
name: population
type: getml.DataFrame
Contr: peripheral table
contr
name | CONTROL_DATE | ICO | ZMTELAKT | AKTPOZAM | ZMDIET | LEKTLAK | ZMKOUR | POCCIG | PRACNES | JINAONE | BOLHR | BOLDK | DUSN | HODNSK | HYPERD | HYPCHL | HYPTGL | HMOT | CHLST | HDL | HDLMG | LDL | ROKVYS | MESVYS | PORADK | ZMCHARZA | LEKCHOL | SRDCE | HYPERT | CEVMOZ | DIAB | HODN0 | ROK0 | HODN1 | ROK1 | HODN2 | ROK2 | HODN3 | ROK3 | HODN4 | ROK4 | HODN11 | ROK11 | HODN12 | ROK12 | HODN13 | ROK13 | HODN14 | ROK14 | HODN15 | ROK15 | HODN21 | ROK21 | HODN23 | ROK23 | SYST | DIAST | TRIC | SUBSC | HYPERSD | HYPERS | CHLSTMG | TRIGL | TRIGLMG | MOC | GLYKEMIE | KYSMOC |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
role | time_stamp | join_key | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | categorical | numerical | numerical | numerical | numerical | numerical | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float |
unit | time stamp | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
0 | 1977-09-01 | 10001 | 3 | 2 | 1 | 2 | 2 | 0 | 1 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 71 | 5.61 | nan | nan | nan | 1977 | 9 | 1 | 20 | nan | nan | nan | nan | nan | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 130 | 90 | 4 | 12 | 2 | 2 | 217 | 1.22 | 108 | 1 | nan | nan |
1 | 1979-01-01 | 10001 | 1 | 1 | 1 | 2 | 1 | 0 | 2 | NULL | 1 | 2 | 1 | 2 | 2 | 2 | 2 | 72 | 6 | nan | nan | nan | 1979 | 1 | 2 | 20 | nan | nan | nan | nan | nan | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 140 | 90 | 4 | 11 | 2 | 2 | 232 | 4.4 | 389 | 3 | nan | nan |
2 | 1980-04-01 | 10001 | 2 | 1 | 1 | 2 | 1 | 0 | 2 | NULL | 2 | 1 | 1 | 2 | 2 | 2 | 2 | 71 | 6.23 | nan | nan | nan | 1980 | 4 | 3 | 20 | nan | nan | nan | nan | nan | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 130 | 90 | 5 | 22 | 2 | 2 | 241 | 1.51 | 134 | 1 | nan | nan |
3 | 1982-01-01 | 10001 | 2 | 1 | 1 | 2 | 1 | 0 | 1 | 2 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 74 | 5.2 | nan | nan | nan | 1982 | 1 | 4 | 20 | nan | nan | nan | nan | nan | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 150 | 100 | 9 | 18 | 2 | 2 | 201 | 1.42 | 126 | 1 | nan | nan |
4 | 1983-02-01 | 10001 | 2 | 2 | 1 | 2 | 1 | 0 | 2 | NULL | 1 | 2 | 1 | 2 | 2 | 2 | 2 | 73 | 6.08 | 1.47 | 57 | 4.15 | 1983 | 2 | 5 | 20 | nan | nan | nan | nan | nan | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 165 | 105 | 7 | 15 | 2 | 2 | 235 | 0.99 | 88 | 1 | nan | nan |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | |
10567 | 1987-10-01 | 20358 | 2 | 2 | 1 | 2 | 1 | 0 | 2 | NULL | 4 | 1 | 1 | 3 | 2 | 2 | 2 | 85 | 4.89 | 0.98 | 38 | 3.11 | 1987 | 10 | 1 | 20 | nan | nan | nan | nan | nan | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 120 | 80 | nan | nan | 2 | 2 | 189 | 1.74 | 154 | 1 | nan | nan |
10568 | 1987-11-01 | 20359 | 2 | 2 | 1 | 2 | 1 | 14 | 2 | NULL | 1 | 1 | 1 | 3 | 2 | 2 | 2 | 80 | 4.5 | 0.98 | 38 | 3.16 | 1987 | 11 | 1 | 20 | nan | nan | nan | nan | nan | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 100 | 70 | nan | nan | 2 | 2 | 174 | 0.79 | 70 | 1 | nan | nan |
10569 | 1987-10-01 | 20360 | 2 | 2 | 5 | 2 | 1 | 0 | 2 | NULL | 2 | 1 | 2 | 3 | 2 | 2 | 2 | 95 | 5.51 | 1.29 | 50 | 3.68 | 1987 | 10 | 1 | 40 | nan | nan | nan | nan | nan | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 120 | 80 | nan | nan | 2 | 2 | 213 | 1.18 | 104 | 1 | nan | nan |
10570 | 1987-10-01 | 20362 | 1 | 2 | 1 | 2 | 1 | 0 | 2 | NULL | 6 | 4 | 1 | 3 | 2 | 2 | 2 | 56 | 4.89 | 3 | 116 | 1.51 | 1987 | 10 | 1 | 40 | nan | nan | nan | nan | nan | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 140 | 85 | nan | nan | 2 | 2 | 189 | 0.82 | 73 | 1 | nan | nan |
10571 | 1978-09-01 | 30037 | 2 | 1 | 1 | 2 | 3 | 15 | 2 | NULL | 4 | 1 | 2 | 4 | 2 | 2 | 1 | 72 | 5.61 | nan | nan | nan | 1978 | 9 | 1 | 20 | nan | nan | nan | nan | nan | 0 | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 125 | 90 | 9 | 17 | 2 | 2 | 217 | nan | nan | 1 | nan | nan |
10572 rows x 67 columns
memory usage: 4.99 MB
name: contr
type: getml.DataFrame
1.3 Define relational model¶
To start with relational learning, we need to specify an abstract data model. Here, we use the high-level star schema API, which allows us to define the abstract data model and construct a container with the concrete data in one go. While a simple StarSchema
indeed works in many cases, it is not sufficient for more complex data models like snowflake schemas, where you would have to define the data model and construct the container in separate steps, utilizing getML's full-fledged data model and container APIs, respectively.
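For illustration only, a two-step sketch of the same model using those lower-level APIs might look like this (we do not execute it here; the star schema below is all we need):
# Step 1: define the abstract data model on placeholders.
dm = getml.data.DataModel(population.to_placeholder("population"))
dm.add(contr.to_placeholder("contr"))
dm.population.join(
    dm.contr,
    on="ICO",
    time_stamps=("REFERENCE_DATE", "CONTROL_DATE"),
)

# Step 2: fill a container with the concrete data.
split = getml.data.split.random(train=0.7, test=0.3)  # same split as below
container = getml.data.Container(population=population, split=split)
container.add(contr=contr)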
split = getml.data.split.random(train=0.7, test=0.3)
star_schema = getml.data.StarSchema(population, split=split)
star_schema.join(
contr,
on="ICO",
time_stamps=("REFERENCE_DATE", "CONTROL_DATE")
)
star_schema
data frames | staging table | |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | contr | CONTR__STAGING_TABLE_2 |
subset | name | rows | type | |
---|---|---|---|---|
0 | test | population | 8557 | View |
1 | train | population | 19876 | View |
name | rows | type | |
---|---|---|---|
0 | contr | 10572 | DataFrame |
2. Predictive modeling¶
We have loaded the data and defined the roles, units, and the abstract data model. Next, we create a getML pipeline for relational learning.
2.1 Fitting the pipeline¶
To illustrate the problem of computational complexity, we fit two getML pipelines using different feature learning algorithms: the first pipeline uses RelMT
and the second uses Relboost
. At the end, we compare the runtimes and predictive accuracies of the two pipelines.
relmt = getml.feature_learning.RelMT(
num_features=30,
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
num_threads=1
)
relboost = getml.feature_learning.Relboost(
num_features=60,
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
min_num_samples=500,
max_depth=2,
num_threads=1
)
feature_selector = getml.predictors.XGBoostClassifier()
xgboost = getml.predictors.XGBoostClassifier(
max_depth=5,
reg_lambda=100.0,
learning_rate=0.1
)
pipe1 = getml.pipeline.Pipeline(
tags=["relmt"],
data_model=star_schema.data_model,
feature_learners=relmt,
feature_selectors=feature_selector,
share_selected_features=0.8,
predictors=xgboost,
include_categorical=True
)
pipe1
Pipeline(data_model='population', feature_learners=['RelMT'], feature_selectors=['XGBoostClassifier'], include_categorical=True, loss_function='CrossEntropyLoss', peripheral=['contr'], predictors=['XGBoostClassifier'], preprocessors=[], share_selected_features=0.8, tags=['relmt'])
pipe2 = getml.pipeline.Pipeline(
tags=["relboost"],
data_model=star_schema.data_model,
feature_learners=relboost,
feature_selectors=feature_selector,
share_selected_features=0.8,
predictors=xgboost,
include_categorical=True
)
pipe2
Pipeline(data_model='population', feature_learners=['Relboost'], feature_selectors=['XGBoostClassifier'], include_categorical=True, loss_function='CrossEntropyLoss', peripheral=['contr'], predictors=['XGBoostClassifier'], preprocessors=[], share_selected_features=0.8, tags=['relboost'])
We begin with RelMT. Features generated by RelMT suffer from quadratic complexity. Luckily, the number of columns is not too high, so the search space is still manageable.
It is always a good idea to check the pipeline before fitting. There is one minor warning: about 12% of the patients never appear in contr (presumably because they never showed up for their health check-ups).
pipe1.check(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 1 issues labeled INFO and 0 issues labeled WARNING.
type | label | message | |
---|---|---|---|
0 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and CONTR__STAGING_TABLE_2 over 'ICO' and 'ICO', there are no corresponding entries for 12.291205% of entries in 'ICO' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
pipe1.fit(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 1 issues labeled INFO and 0 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 RelMT: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 07:02 RelMT: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01 XGBoost: Training as feature selector... ━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:04
Trained pipeline.
Time taken: 0:07:11.228924.
Pipeline(data_model='population', feature_learners=['RelMT'], feature_selectors=['XGBoostClassifier'], include_categorical=True, loss_function='CrossEntropyLoss', peripheral=['contr'], predictors=['XGBoostClassifier'], preprocessors=[], share_selected_features=0.8, tags=['relmt', 'container-5Vksxn'])
Let's see how well Relboost does. This algorithm has linear complexity in the number of columns.
pipe2.fit(star_schema.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 1 issues labeled INFO and 0 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Relboost: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:22 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:03 XGBoost: Training as feature selector... ━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:03
Trained pipeline.
Time taken: 0:00:31.813144.
Pipeline(data_model='population', feature_learners=['Relboost'], feature_selectors=['XGBoostClassifier'], include_categorical=True, loss_function='CrossEntropyLoss', peripheral=['contr'], predictors=['XGBoostClassifier'], preprocessors=[], share_selected_features=0.8, tags=['relboost', 'container-5Vksxn'])
Note that this runs through in under a minute. This demonstrates the practical impact of computational complexity. If we had more columns, the difference between these two algorithms would become even more noticeable.
2.2 Model evaluation¶
pipe1.score(star_schema.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 RelMT: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 13:09:44 | train | TARGET | 0.9891 | 0.8904 | 0.0507 |
1 | 2024-09-12 13:10:17 | test | TARGET | 0.9867 | 0.7143 | 0.06851 |
pipe2.score(star_schema.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 13:10:16 | train | TARGET | 0.9878 | 0.8912 | 0.0538 |
1 | 2024-09-12 13:10:19 | test | TARGET | 0.9867 | 0.7107 | 0.06748 |
2.3 Studying the features¶
It is always a good idea to study the features generated by the algorithms.
names, correlations = pipe2.features.correlations()
plt.subplots(figsize=(20, 10))
plt.bar(names, correlations)
plt.title("feature correlations")
plt.grid(True)
plt.xlabel("features")
plt.ylabel("correlations")
plt.xticks(rotation='vertical')
plt.show()
names, importances = pipe2.features.importances()
plt.subplots(figsize=(20, 10))
plt.bar(names, importances)
plt.title("feature importances")
plt.grid(True)
plt.xlabel("features")
plt.ylabel("importances")
plt.xticks(rotation='vertical')
plt.show()
As we can see from these figures, we need many features to get a good result: no single feature is highly correlated with the target, and the feature importance is not concentrated on a small number of features (as is often the case with other data sets).
This implies that the many columns in the data set are actually needed. We emphasize this because we sometimes see data sets with many columns where, upon analysis, only a handful of the columns turn out to be needed. This is not one of those cases.
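One quick way to quantify how spread out the importances are, reusing the values extracted above (a small sketch assuming numpy is available in the session):
import numpy as np

# Cumulative share of the total importance, most important feature first.
sorted_importances = np.sort(np.asarray(importances))[::-1]
cumulative_share = np.cumsum(sorted_importances) / sorted_importances.sum()

# How many features are needed to cover 80% of the total importance?
n_features_80 = int(np.searchsorted(cumulative_share, 0.8)) + 1
print(f"{n_features_80} features account for 80% of the total importance.")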
The most important features look like this:
pipe1.features.to_sql()[pipe1.features.sort(by="importances")[0].name]
DROP TABLE IF EXISTS "FEATURE_1_7";
CREATE TABLE "FEATURE_1_7" AS
SELECT SUM(
CASE
WHEN ( t1."trigl" > 447.000000 ) AND ( t1."age" > 57.000000 ) THEN COALESCE( t1."koureni" - 3.392245777122775, 0.0 ) * 36.71734233772222 + COALESCE( t1."dobakour" - 6.958281373993878, 0.0 ) * -36.03237447465777 + COALESCE( t1."byvkurak" - 1.664097041151797, 0.0 ) * -4.140363438871325 + COALESCE( t1."pivomn" - 1.800816233987076, 0.0 ) * 2.866334536889667 + COALESCE( t1."vinomn" - 4.290329894569776, 0.0 ) * -3.474866798512964 + COALESCE( t1."lihmn" - 6.945697766693119, 0.0 ) * 42.97484465396469 + COALESCE( t1."kava" - 1.912708309715452, 0.0 ) * 37.67135620384678 + COALESCE( t1."caj" - 4.734043759210974, 0.0 ) * 41.49696018116632 + COALESCE( t1."cukr" - 4.523523410044213, 0.0 ) * 1.728682170938833 + COALESCE( t1."vyska" - 174.5359936515134, 0.0 ) * 2.532893582104208 + COALESCE( t1."vaha" - 79.88028568189547, 0.0 ) * -0.6526932998438458 + COALESCE( t1."syst1" - 132.0999886634168, 0.0 ) * 2.189892170970478 + COALESCE( t1."diast1" - 83.57759891168801, 0.0 ) * 0.3401612628322462 + COALESCE( t1."syst2" - 129.4103843101689, 0.0 ) * -1.979165210528913 + COALESCE( t1."diast2" - 83.1808185013037, 0.0 ) * 1.57192359403996 + COALESCE( t1."tric" - 9.403355628613536, 0.0 ) * -6.18176175433044 + COALESCE( t1."subsc" - 18.14816914182066, 0.0 ) * 12.22343448214527 + COALESCE( t1."chlst" - 232.3951932887428, 0.0 ) * 0.2971207012537834 + COALESCE( t1."trigl" - 134.7867588708763, 0.0 ) * -0.09941197254629983 + COALESCE( t1."age" - 58.58485432490647, 0.0 ) * 10.85819048719916 + COALESCE( t1."participation" - -1887.385670558893, 0.0 ) * 81.75340957666666 + COALESCE( t1."reference_date" - 615908174.0392246, 0.0 ) * -2.926904989988034e-06 + COALESCE( t1."entry_date" - 230998436.6398368, 0.0 ) * 1.154718952316122e-06 + COALESCE( t2."hmot" - 81.23347313760735, 0.0 ) * 0.1333271764688622 + COALESCE( t2."hdlmg" - 35.52386658677851, 0.0 ) * -0.2306312557846993 + COALESCE( t2."chlst" - 5.894816257239805, 0.0 ) * -0.490138288302913 + COALESCE( t2."hdl" - 0.916622728180551, 0.0 ) * 9.056133763122926 + COALESCE( t2."ldl" - 2.466835430397438, 0.0 ) * -0.01190990485510771 + COALESCE( t2."control_date" - 533547919.9520671, 0.0 ) * -1.626971839941278e-08 + 4.4320403142502030e+01
WHEN ( t1."trigl" > 447.000000 ) AND ( t1."age" <= 57.000000 OR t1."age" IS NULL ) THEN COALESCE( t1."koureni" - 3.392245777122775, 0.0 ) * 68.9362363063014 + COALESCE( t1."dobakour" - 6.958281373993878, 0.0 ) * -49.72699167016215 + COALESCE( t1."byvkurak" - 1.664097041151797, 0.0 ) * -16.57345552532797 + COALESCE( t1."pivomn" - 1.800816233987076, 0.0 ) * 93.14809820598344 + COALESCE( t1."vinomn" - 4.290329894569776, 0.0 ) * 66.54704537730123 + COALESCE( t1."lihmn" - 6.945697766693119, 0.0 ) * 6.98997936976407 + COALESCE( t1."kava" - 1.912708309715452, 0.0 ) * -25.37270111467802 + COALESCE( t1."caj" - 4.734043759210974, 0.0 ) * 44.97434463952945 + COALESCE( t1."cukr" - 4.523523410044213, 0.0 ) * -11.97758628053244 + COALESCE( t1."vyska" - 174.5359936515134, 0.0 ) * -14.41189734039584 + COALESCE( t1."vaha" - 79.88028568189547, 0.0 ) * 9.201053407268812 + COALESCE( t1."syst1" - 132.0999886634168, 0.0 ) * 6.60642275226787 + COALESCE( t1."diast1" - 83.57759891168801, 0.0 ) * 0.2798793062603677 + COALESCE( t1."syst2" - 129.4103843101689, 0.0 ) * -8.618828438293468 + COALESCE( t1."diast2" - 83.1808185013037, 0.0 ) * 2.610294450622261 + COALESCE( t1."tric" - 9.403355628613536, 0.0 ) * 0.05980697404869816 + COALESCE( t1."subsc" - 18.14816914182066, 0.0 ) * -22.03705030242565 + COALESCE( t1."chlst" - 232.3951932887428, 0.0 ) * 0.08847429025390953 + COALESCE( t1."trigl" - 134.7867588708763, 0.0 ) * -0.01251193674433636 + COALESCE( t1."age" - 58.58485432490647, 0.0 ) * 4.86187301570697 + COALESCE( t1."participation" - -1887.385670558893, 0.0 ) * -137.3954135916799 + COALESCE( t1."reference_date" - 615908174.0392246, 0.0 ) * 4.20106983933724e-06 + COALESCE( t1."entry_date" - 230998436.6398368, 0.0 ) * -8.388379539806533e-06 + COALESCE( t2."hmot" - 81.23347313760735, 0.0 ) * 0.1237910720450438 + COALESCE( t2."hdlmg" - 35.52386658677851, 0.0 ) * 3.651474792793678 + COALESCE( t2."chlst" - 5.894816257239805, 0.0 ) * -0.3361774904051968 + COALESCE( t2."hdl" - 0.916622728180551, 0.0 ) * -141.5557739425682 + COALESCE( t2."ldl" - 2.466835430397438, 0.0 ) * -0.7515251196140998 + COALESCE( t2."control_date" - 533547919.9520671, 0.0 ) * 6.744845044264805e-09 + 8.4141958089744847e+01
WHEN ( t1."trigl" <= 447.000000 OR t1."trigl" IS NULL ) AND ( t2."ldl" > 6.589412 ) THEN COALESCE( t1."koureni" - 3.392245777122775, 0.0 ) * 4.324545000639816 + COALESCE( t1."dobakour" - 6.958281373993878, 0.0 ) * 6.394011893899877 + COALESCE( t1."byvkurak" - 1.664097041151797, 0.0 ) * -0.00772340630819322 + COALESCE( t1."pivomn" - 1.800816233987076, 0.0 ) * 13.07564436701492 + COALESCE( t1."vinomn" - 4.290329894569776, 0.0 ) * 14.71174562700868 + COALESCE( t1."lihmn" - 6.945697766693119, 0.0 ) * -13.02469603877145 + COALESCE( t1."kava" - 1.912708309715452, 0.0 ) * -20.75690206999771 + COALESCE( t1."caj" - 4.734043759210974, 0.0 ) * -42.59571735654135 + COALESCE( t1."cukr" - 4.523523410044213, 0.0 ) * 4.547061088931812 + COALESCE( t1."vyska" - 174.5359936515134, 0.0 ) * 1.552658882464771 + COALESCE( t1."vaha" - 79.88028568189547, 0.0 ) * 1.317608177040875 + COALESCE( t1."syst1" - 132.0999886634168, 0.0 ) * 1.785126658763984 + COALESCE( t1."diast1" - 83.57759891168801, 0.0 ) * -2.546946347046458 + COALESCE( t1."syst2" - 129.4103843101689, 0.0 ) * -0.470414693564651 + COALESCE( t1."diast2" - 83.1808185013037, 0.0 ) * 1.623116203255628 + COALESCE( t1."tric" - 9.403355628613536, 0.0 ) * -6.40858643038764 + COALESCE( t1."subsc" - 18.14816914182066, 0.0 ) * 4.130064218077451 + COALESCE( t1."chlst" - 232.3951932887428, 0.0 ) * -0.1989132193311207 + COALESCE( t1."trigl" - 134.7867588708763, 0.0 ) * 0.2263022952597863 + COALESCE( t1."age" - 58.58485432490647, 0.0 ) * 1.197285989581476 + COALESCE( t1."participation" - -1887.385670558893, 0.0 ) * 19.06271270018097 + COALESCE( t1."reference_date" - 615908174.0392246, 0.0 ) * -6.382150275518887e-07 + COALESCE( t1."entry_date" - 230998436.6398368, 0.0 ) * 2.775023256741417e-07 + COALESCE( t2."hmot" - 81.23347313760735, 0.0 ) * -1.286757301370443 + COALESCE( t2."hdlmg" - 35.52386658677851, 0.0 ) * -16.40518752403113 + COALESCE( t2."chlst" - 5.894816257239805, 0.0 ) * 2.904438249194144 + COALESCE( t2."hdl" - 0.916622728180551, 0.0 ) * 641.5020896438801 + COALESCE( t2."ldl" - 2.466835430397438, 0.0 ) * 0.1447675001171414 + COALESCE( t2."control_date" - 533547919.9520671, 0.0 ) * -7.562988930726452e-09 + -9.0812299642137955e+00
WHEN ( t1."trigl" <= 447.000000 OR t1."trigl" IS NULL ) AND ( t2."ldl" <= 6.589412 OR t2."ldl" IS NULL ) THEN COALESCE( t1."koureni" - 3.392245777122775, 0.0 ) * 6.861948102582569e-05 + COALESCE( t1."dobakour" - 6.958281373993878, 0.0 ) * 0.01745941706076835 + COALESCE( t1."byvkurak" - 1.664097041151797, 0.0 ) * -0.006563506280726456 + COALESCE( t1."pivomn" - 1.800816233987076, 0.0 ) * 0.002370132374868186 + COALESCE( t1."vinomn" - 4.290329894569776, 0.0 ) * -0.01446308909963534 + COALESCE( t1."lihmn" - 6.945697766693119, 0.0 ) * -0.01721180426547575 + COALESCE( t1."kava" - 1.912708309715452, 0.0 ) * 0.0202459283752464 + COALESCE( t1."caj" - 4.734043759210974, 0.0 ) * 0.009776762328480083 + COALESCE( t1."cukr" - 4.523523410044213, 0.0 ) * 0.001186532318472517 + COALESCE( t1."vyska" - 174.5359936515134, 0.0 ) * 0.001436166322912831 + COALESCE( t1."vaha" - 79.88028568189547, 0.0 ) * -0.00414449771250642 + COALESCE( t1."syst1" - 132.0999886634168, 0.0 ) * -0.001658868199165182 + COALESCE( t1."diast1" - 83.57759891168801, 0.0 ) * 0.002557569464948623 + COALESCE( t1."syst2" - 129.4103843101689, 0.0 ) * 0.002437793550397995 + COALESCE( t1."diast2" - 83.1808185013037, 0.0 ) * -6.174045506422857e-05 + COALESCE( t1."tric" - 9.403355628613536, 0.0 ) * -0.007499631525141079 + COALESCE( t1."subsc" - 18.14816914182066, 0.0 ) * 0.002113558511703661 + COALESCE( t1."chlst" - 232.3951932887428, 0.0 ) * 0.0001341535907906942 + COALESCE( t1."trigl" - 134.7867588708763, 0.0 ) * -1.630868859521257e-05 + COALESCE( t1."age" - 58.58485432490647, 0.0 ) * 0.002356506859856778 + COALESCE( t1."participation" - -1887.385670558893, 0.0 ) * -0.02444683750177154 + COALESCE( t1."reference_date" - 615908174.0392246, 0.0 ) * 2.117099967782411e-09 + COALESCE( t1."entry_date" - 230998436.6398368, 0.0 ) * -2.945219991878199e-10 + COALESCE( t2."hmot" - 81.23347313760735, 0.0 ) * 0.003574474413349186 + COALESCE( t2."hdlmg" - 35.52386658677851, 0.0 ) * -0.003244489062109498 + COALESCE( t2."chlst" - 5.894816257239805, 0.0 ) * -0.03710669507874064 + COALESCE( t2."hdl" - 0.916622728180551, 0.0 ) * 0.088384351075893 + COALESCE( t2."ldl" - 2.466835430397438, 0.0 ) * -0.00342765294998646 + COALESCE( t2."control_date" - 533547919.9520671, 0.0 ) * -2.444170025832969e-09 + -2.5882564116851187e-01
ELSE NULL
END
) AS "feature_1_7",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "CONTR__STAGING_TABLE_2" t2
ON t1."ico" = t2."ico"
WHERE t2."control_date" <= t1."reference_date"
GROUP BY t1.rowid;
pipe2.features.to_sql()[pipe2.features.sort(by="importances")[1].name]
DROP TABLE IF EXISTS "FEATURE_1_37";
CREATE TABLE "FEATURE_1_37" AS
SELECT AVG(
CASE
WHEN ( t1."reference_date" - t2."control_date" > 352382400.000000 ) AND ( t1."entry_date" > 231194988.679245 ) THEN 1.67596291761292
WHEN ( t1."reference_date" - t2."control_date" > 352382400.000000 ) AND ( t1."entry_date" <= 231194988.679245 OR t1."entry_date" IS NULL ) THEN -0.1053239490875762
WHEN ( t1."reference_date" - t2."control_date" <= 352382400.000000 OR t1."reference_date" IS NULL OR t2."control_date" IS NULL ) AND ( t2."poccig" IN ( '20' ) ) THEN 0.7547001150541075
WHEN ( t1."reference_date" - t2."control_date" <= 352382400.000000 OR t1."reference_date" IS NULL OR t2."control_date" IS NULL ) AND ( t2."poccig" NOT IN ( '20' ) OR t2."poccig" IS NULL ) THEN -0.4272755600104092
ELSE NULL
END
) AS "feature_1_37",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "CONTR__STAGING_TABLE_2" t2
ON t1."ico" = t2."ico"
WHERE t2."control_date" <= t1."reference_date"
GROUP BY t1.rowid;
2.4 Productionization¶
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3 and spark modules.
# Creates a folder named atherosclerosis_pipeline containing
# the SQL code.
pipe2.features.to_sql().save("atherosclerosis_pipeline", remove=True)
pipe2.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("atherosclerosis_pipeline_spark", remove=True)
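As a rough sketch of how the transpiled code could be consumed, getML's sqlite3 helper module can execute the generated scripts against a plain SQLite database. We assume here that the raw population and contr tables have already been loaded into that database:
# Sketch: execute the transpiled feature scripts on a SQLite database
# that already contains the raw population and contr tables.
conn = getml.sqlite3.connect("atherosclerosis.db")
getml.sqlite3.execute(conn, "atherosclerosis_pipeline")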
getml.engine.shutdown()
3. Conclusion¶
The runtime benchmark between RelMT and Relboost on the same dataset demonstrates the problem of computational complexity in practice. We find that RelMT takes more than ten times as long as Relboost (about 7 minutes versus about 30 seconds) while learning only half as many features (30 versus 60).
The purpose of this notebook has been to illustrate the problem of the curse of dimensionality when engineering features from datasets with many columns.
The most important thing to remember is that this problem exists regardless of whether you engineer your features manually or algorithmically. Whether you like it or not: if you write your features in the traditional way, your search space grows quadratically with the number of columns.