
Disease lethality prediction

Feature engineering and the curse of dimensionality

With this notebook we give a brief introduction to feature engineering on relational data with many columns. We discuss why feature engineering on such data is particularly challenging and what we can do to overcome these problems.

Summary:

  • Prediction type: Binary classification
  • Domain: Health
  • Prediction target: Mortality within one year
  • Source data: 146 columns in 2 tables, 22 MB
  • Population size: 28433

Author: Dr. Patrick Urbanke

The problem

To illustrate the point, we give a simplified example based on the real data used in the analysis below. When we engineer features from relational data, we usually write something like this:

SELECT AVG(t2.HDL)
FROM population_training t1
LEFT JOIN contr t2
ON t1.ICO = t2.ICO
WHERE t1.AGE >= 60 AND t1.ALKOHOL IN ('1', '2')
GROUP BY t1.ICO;

Think about that for a second. This feature aggregates high-density lipoprotein (HDL) cholesterol values recorded at control dates, conditional on age and alcohol consumption. We arbitrarily chose both the column to aggregate over (HDL) and the columns to construct conditions on (AGE and ALKOHOL) out of a greater set of 146 columns.

Every column that we have can either be aggregated (here HDL) or used in our conditions (here AGE and ALKOHOL). That means that for each of the $n$ columns we could aggregate, we can potentially build conditions on roughly $n$ other columns. In other words, the computational complexity is $n^2$ in the number of columns.

Note that this problem occurs regardless of whether you automate feature engineering or do it by hand. The size of the search space is $n^2$ in the number of columns in either case, unless you can rule something out a priori.
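
To get a feeling for how quickly this search space grows, here is a minimal back-of-the-envelope sketch in Python. The column names are hypothetical placeholders; the sketch simply enumerates every pairing of one column to aggregate with one column to build a condition on.

from itertools import product

# Hypothetical placeholder names standing in for the 146 columns of the real data set.
columns = [f"COL_{i}" for i in range(146)]

# Every candidate feature pairs one column to aggregate
# with one column to build a condition on.
candidate_features = list(product(columns, repeat=2))

print(len(columns))             # 146
print(len(candidate_features))  # 146 * 146 = 21316

Conditions that combine several columns, as in the example above, make the search space larger still.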

The solution

So when we have relational data sets with many columns, what do we do? The answer is to write different features. Specifically, suppose we had features like this:

SELECT AVG(
    CASE
        WHEN t1.AGE >= 60 THEN weight1
        WHEN t1.ALKOHOL IN ('1', '2') THEN weight2
    END
)
FROM population_training t1
LEFT JOIN contr t2
ON t1.ICO = t2.ICO
GROUP BY t1.ICO;

weight1 and weight2 are learnable weights. An algorithm that generates features like this only uses columns in its conditions; it is not allowed to aggregate columns, and it does not need to.

That means the computational complexity is linear instead of quadratic. For data sets with a large number of columns, this can make all the difference in the world. For instance, if you have 100 columns, the search space of the second approach (100 candidates) is only 1% of the size of the search space of the first one (100² = 10,000 candidates).
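
To make this more concrete, here is a minimal pandas sketch that evaluates such a feature on toy data. Everything in it is made up for illustration: the rows and the fixed values of weight1 and weight2 are hypothetical, and the precedence of the conditions simply follows the CASE WHEN query above. In practice, a feature learning algorithm fits the weights rather than setting them by hand.

import pandas as pd

# Toy stand-ins for the two tables; the column names follow the queries above.
population_training = pd.DataFrame({
    "ICO": [1, 2],
    "AGE": [65, 45],
    "ALKOHOL": ["1", "3"],
})
contr = pd.DataFrame({
    "ICO": [1, 1, 2, 2],
    "HDL": [1.2, 1.4, 1.1, 1.3],
})

# Example weights; a feature learning algorithm would fit these.
weight1, weight2 = 0.7, 0.3

# LEFT JOIN contr ON population_training.ICO = contr.ICO
joined = population_training.merge(contr, on="ICO", how="left")

# CASE WHEN AGE >= 60 THEN weight1 WHEN ALKOHOL IN ('1', '2') THEN weight2 END
case_when = pd.Series(float("nan"), index=joined.index)
case_when[joined["ALKOHOL"].isin(["1", "2"])] = weight2
case_when[joined["AGE"] >= 60] = weight1  # the first WHEN takes precedence

# AVG(...) GROUP BY ICO
feature = case_when.groupby(joined["ICO"]).mean()
print(feature)

Note that no column is aggregated here: the sketch only averages the weights assigned by the conditions, which is exactly why the search over candidate features stays linear in the number of columns.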

Background

To illustrate the problem of dimensionality in predictive analytics on relational data, we use the STULONG dataset. It is a longitudinal study of atherosclerosis patients.

One of its defining features is that it contains many columns, which makes it a good candidate to illustrate the problem discussed in this notebook.

There are several academic studies related to this dataset.

The way these studies handle the large number of columns in the data set is to divide the columns into subgroups and then handle each subgroup separately. Even though this is one way to overcome the curse of dimensionality, it is not a very satisfying approach. We would like to be able to handle a large number of columns at once.

The analysis is based on the STULONG dataset, which is publicly available and can be downloaded from the CTU Prague Relational Learning Repository.
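
For readers who want to get a feeling for the structure of the data, a minimal loading sketch might look as follows. The file names are hypothetical and assume the two tables have been exported to CSV; the join key ICO is taken from the queries above.

import pandas as pd

# Hypothetical CSV exports of the two STULONG tables.
population_training = pd.read_csv("population_training.csv")
contr = pd.read_csv("contr.csv")

# The tables are linked through the patient identifier ICO.
joined = population_training.merge(contr, on="ICO", how="left")

print(population_training.shape)  # one row per patient
print(contr.shape)                # one row per control examination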

Related code example

Notebook:
Open in nbviewer
Open in mybinder