New algorithms for relational learning: where deep learning falls short of expectations
How can businesses fully unlock the potential of relational data with machine learning? This heise online article, edited by Silke Hahn, explores the latest developments in relational learning algorithms.
Feature learning makes relational data usable for machine learning, unlocking a vast trove of data with business potential for companies.
The idea of storing data in relational structures dates back to the 1970s. Today, relational data forms the backbone of every modern business. Corporate data accumulates in databases, playing a pivotal role in bridging the AI gap identified by decision-makers. However, despite the enthusiasm for innovation, extracting value from relational data with machine learning (ML) is currently only possible through significant effort. This challenge limits even large companies' access to machine learning and business applications with artificial intelligence (AI).
The key lies in the research field of relational learning, which has had little practical application so far. A new class of algorithms promises to change this. It transfers the central concept of feature learning from deep learning to relational data structures, making enterprise data accessible for modern machine learning algorithms.
Data scientists often face a classic challenge in their daily projects: using relational source data from a database such as MySQL to develop a machine learning model for predictive analytics. The resulting models are used across industries for various applications, such as predicting customer churn for financial service providers, forecasting sales and demand in retail, or predictive maintenance in manufacturing.
The common problem when developing predictive models on relational data is that this data is not suitable as input for ML models. Data scientists must spend up to 90% of their time on manual tasks to transform the relational source data into a representation suitable for use in models like XGBoost.
Coming up with features is difficult, time-consuming, requires expert knowledge. "Applied machine learning" is basically feature engineering.
– Andrew Ng, Deep Learning
Feature learning aims to automate these steps, making relational data directly usable in machine learning. In the authors' data science experience, this method avoids error-prone and time-intensive manual processes and leads to better ML models.
What is Relational Learning?
To train a predictive model through supervised learning, data scientists need a dataset with a defined target variable. In many textbook examples or data science competitions, the dataset consists of a flat table. This table represents the statistical population of the model. Each row in the table corresponds to an independent observation, associated with a target variable (also called the output value) and a fixed number of measurable or observable attributes (see Fig. 1).
Data scientists refer to these attributes as features, which serve as the input values for the model. A real estate price prediction model illustrates the relationship between the target variable and features: the target variable is the property value, and a possible feature is the square footage available for each property.
During the training phase, the algorithm learns the model's parameters and generalizes a functional relationship between input and output data. Applied to new input data, the trained model can then predict unknown output values.
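To make this concrete, the following minimal sketch (not taken from the article; column names and values are invented for illustration) trains a model on a flat table for the real estate example: each row is one property, the square footage is the feature, and the property value is the target variable.

# Hedged toy example: supervised learning on a flat table.
import pandas as pd
from sklearn.linear_model import LinearRegression

flat_table = pd.DataFrame({
    "square_footage": [50.0, 75.0, 100.0, 120.0],             # feature (input values)
    "price":          [150_000, 210_000, 280_000, 330_000],   # target variable (output values)
})

model = LinearRegression()
model.fit(flat_table[["square_footage"]], flat_table["price"])  # training phase

# The trained model predicts unknown output values for new input data.
print(model.predict(pd.DataFrame({"square_footage": [90.0]})))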
However, in many use cases, input data is not solely available in the form of a flat table. Particularly in enterprise applications, it is more efficient to organize process-related information within relational data structures. Statistics confirm their widespread use: seven of the ten most popular databases are relational, according to the DB-Engines Ranking. When the input values of a predictive model are distributed across interconnected tables, the application falls within the subfield of statistical relational learning.
In addition to the population table, relational learning identifies another class of tables within the relational schema: peripheral tables, which contain observations of additional attributes in their rows. Peripheral tables can have many-to-many (m:n) relationships with the population table. Consequently, a row in the population table may have a one-to-many (1:n) relationship with rows in the peripheral table. In relational learning, it is common for each observation row in the population table to correspond to a varying number of rows in the peripheral tables (see Fig. 2).
Since a database often contains a large number of tables, the relationships between them add complexity. Visualized, this results in familiar star or snowflake schemas. An example is a customer churn prediction: the target variable encodes whether a customer places another order within a specific time frame. A peripheral table might include a varying number of additional observations for each customer ID, such as past purchases or digital customer activities.
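The structure can be illustrated with a deliberately small sketch (table and column names are invented for illustration, not taken from the article): a population table with the churn target variable and a peripheral table of purchases, where each customer corresponds to a varying number of rows.

# Hedged toy example of a relational structure for churn prediction.
import pandas as pd

population = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "churn":       [0, 1, 0],   # target variable
})

purchases = pd.DataFrame({      # peripheral table (1:n relationship to population)
    "customer_id": [1, 1, 1, 2, 3, 3],
    "amount": [19.9, 5.0, 42.0, 7.5, 12.0, 99.0],
    "days_before_cutoff": [3, 40, 200, 120, 10, 80],
})

# Customer 1 has three associated purchases, customer 2 only one: the number of
# input rows per observation varies, so the data cannot be fed directly into a
# model that expects a fixed number of input values.
print(purchases.groupby("customer_id").size())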
But how does the workflow of data scientists change when the training data for developing a predictive model comes with relational structures? What must they do to still develop a predictive model?
How Machine Learning with Relational Data Currently Works
In the absence of modern, self-learning algorithms capable of processing relational input data, data scientists have no choice but to first convert the relational data into a compatible representation. But why not simply discard the peripheral tables? This would be the least effective approach. These tables often contain information strongly related to the target variable, such as a customer’s historical transactions in the churn prediction example. If this information were unavailable to the machine learning model during training, the predictive quality of the resulting model would suffer.
If data scientists are aware of all the information relevant to the prediction, they can replace the relational relationship between the population table and the peripheral table with scalar feature values. These values can then be passed on to the predictive model. Resolving these 1:n relationships between an observation in the population table and the peripheral tables is known as feature engineering. For this, data scientists program aggregation functions along with accompanying conditions.
The challenge in feature engineering lies in identifying the relevant aggregations and conditions. For example, in the context of customer churn, a feature might be the COUNT aggregation of all purchases made by a customer in the last 90 days. Alternatively, the SUM aggregation of the total value of purchases made in the past 45 days could be used. These are two relatively simple features among nearly infinite combinations of aggregation functions and arbitrarily complex conditions. Which of these possible features provides prediction-relevant information is unclear in advance and must largely be determined manually through trial and error (see Fig. 3).
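The following sketch shows what such manual feature engineering could look like in code (it reuses the invented toy tables from the sketch above and is not taken from the article): the two hand-crafted features are built by conditionally aggregating the peripheral table and joining the resulting scalar values back to the population table.

# Hedged sketch of manual feature engineering on the toy churn tables.
import pandas as pd

population = pd.DataFrame({"customer_id": [1, 2, 3], "churn": [0, 1, 0]})
purchases = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 3, 3],
    "amount": [19.9, 5.0, 42.0, 7.5, 12.0, 99.0],
    "days_before_cutoff": [3, 40, 200, 120, 10, 80],
})

# Feature 1: COUNT of purchases in the last 90 days.
count_90d = (purchases[purchases["days_before_cutoff"] <= 90]
             .groupby("customer_id").size().rename("n_purchases_90d"))

# Feature 2: SUM of the purchase value in the last 45 days.
sum_45d = (purchases[purchases["days_before_cutoff"] <= 45]
           .groupby("customer_id")["amount"].sum().rename("value_45d"))

# Resolving the 1:n relationship into scalar feature values per observation.
features = (population
            .join(count_90d, on="customer_id")
            .join(sum_45d, on="customer_id")
            .fillna(0))
print(features)

Every choice in this sketch, which columns, which aggregation functions, which time windows, is exactly the kind of arbitrary decision the article describes.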
In practice, feature engineering proves to be a time-consuming process: it requires close collaboration between subject matter experts with domain knowledge and data scientists with methodological expertise to identify factors relevant to predictions, translate them into logic, and then extract features based on that logic. Good predictive models often require hundreds of features, each of which can involve up to several hundred lines of code. Feature engineering plays a critical role in machine learning projects, as the features used significantly determine the quality of the predictions.
Some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.
– Pedro Domingos, A Few Useful Things to Know About Machine Learning
Although the training of predictive models can largely be automated through AutoML applications, the prediction quality primarily depends on the arbitrary assumptions made by teams during the feature engineering process.
It is also often unavoidable that the relevance of individual features diminishes over time, or that new features become important. This phenomenon is known as feature drift. As a result, continuous monitoring and adjustment of the features, along with regular retraining of the model built on them, is necessary.
Transforming Relational Data into Training Data with Feature Learning
It is clear that, at present, the quality of a machine learning application on relational data depends on arbitrary decisions regarding the aggregation functions and conditions chosen during the feature engineering process (see Fig. 3).
Feature learning refers to an approach where a statistical algorithm learns the features based on the training data. Feature learning has already proven successful on non-relational data: since 2012, deep learning has demonstrated the automation of feature engineering in the field of computer vision across numerous applications.
For instance, in the ImageNet competition for object recognition in images, only manually designed features by experts were used until 2011. The practical availability of increasingly deep neural networks brought a dramatic change: Error rates of algorithms dropped from over 25% (support vector machines with manual features) to under 3% (neural networks with feature learning). Feature learning contributed significantly to this success. This becomes apparent when taking a closer look at the architecture of a neural network: over 99% of computations within modern convolutional neural networks (CNNs) are related to learning specific feature representations. These learned features form the foundation for achieving superhuman accuracy in object recognition tasks without the need for externally provided expert knowledge.
For various reasons, architectures proven effective in feature learning on unstructured data, such as images, cannot be directly applied to relational data structures. For example, attempts to use convolutional neural networks (CNNs) on relational data fail at the outset: CNNs require a constant number of input values (see Fig. 4).
A New ML Framework for Relational Learning
The company behind getML aims to make relational learning accessible to data scientists with its ML framework. To this end, getML offers four new feature-learning algorithms. These algorithms learn optimal features through a supervised search based on an abstract data model that maps the relational relationships between tables. After this preprocessing, the learned features can also serve as input values for predictive models. A key practical requirement was ensuring the algorithms deliver stable and high-quality features with minimal configuration effort.
After initial attempts, the development team realized that tree-based approaches were the most promising option. On structured data, decision trees often outperform neural networks and have shown impressive progress in recent years. In practice, tree-based models require less configuration effort due to their lower complexity, need less training data, and demand less computational time, all of which are critical factors in project workflows. All feature-learning approaches developed by getML are tree-based ensemble methods. They range from Multirel, an efficient implementation of multi-relational decision trees, to Relboost, a generalization of the gradient boosting approach to relational data; each method has unique qualities, such as scalability, interpretability, or suitability for specific data structures. Figure 5.1 provides an example of the Multirel algorithm.
The greatest challenge during the development phase of getML was the efficient implementation of feature-learning algorithms. Learning features potentially requires evaluating billions of combinations of aggregation functions and conditions. To handle this enormous volume of computations, getML's algorithms employ a clever trick: they incrementally update the loss function during the search process to avoid unnecessary computational operations.
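The effect of such incremental updates can be illustrated with a deliberately simplified sketch. This is not getML's actual implementation; it only shows why, when a search sweeps over candidate thresholds for a condition such as "amount > t", sorting the candidates once allows an aggregate to be updated in constant time per step instead of being recomputed from scratch.

# Conceptual sketch only, not getML's implementation: incremental update of a
# SUM aggregate while sweeping over candidate thresholds for "amount > t".
# Assumes distinct integer amounts to keep the bookkeeping simple.
amounts = [120, 35, 980, 4, 250]

# Naive approach: recompute the aggregate for every threshold, O(n) per step.
naive = {t: sum(a for a in amounts if a > t) for t in sorted(amounts)}

# Incremental approach: sweep thresholds in ascending order and subtract the
# row that drops out of the condition, O(1) per step after sorting.
incremental = {}
running_sum = sum(amounts)
for t in sorted(amounts):
    running_sum -= t  # the row with amount == t no longer satisfies "amount > t"
    incremental[t] = running_sum

assert naive == incremental

Applying this kind of bookkeeping to the loss function across billions of candidate features is what motivated the custom engine described next.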
However, this approach introduced a new problem: no existing database engine provided the data structures and caching strategies needed to perform such incremental updates at an acceptable speed. This led to the decision to develop the algorithms, along with a custom in-memory database tailored to their requirements, entirely from scratch in C++.
The approach used by getML considers the prediction problem holistically and can be understood as a two-step process (see Fig. 6):
- getML's feature-learning models learn a compatible representation from relational data (Feature-Learning Step).
- The learned representation serves as input for state-of-the-art machine learning models such as XGBoost, which are then used to generate the actual predictions (Prediction Step).
Unlike manual feature engineering, feature learning provides an automated, algorithmic solution even in the first stage. This impacts not only efficiency but also the quality of the resulting models. It can be demonstrated that feature learning outperforms manual feature engineering and previous relational learning approaches in many use cases.
Introduction to Relational Learning for Practitioners
A typical task for data scientists is to assist financial institutions in making decisions about granting loans to existing customers. To do this, they develop a predictive model that can forecast the probability of a loan default. A simplified dataset for making such lending decisions might look like the one in Figure 7.
The information about loan defaults is the categorical target variable. Recall that the table containing observations of the target variable constitutes the population of the predictive model (table population, Fig. 7.1). Since the loan default information is nominally scaled, this represents a classification problem.
The financial institution also has access to additional data related to the business relationship that is relevant for predicting loan defaults. This includes past transactions observed as part of the existing customer relationship (table trans, Fig. 7.2). To keep the example concise, it is limited here to the relationship between population and trans (Listing 1).
Project Setup and Data Annotations
import getml
getml.set_project("young_devs")
population_train, population_test, _, trans, _ = getml.datasets.load_loans()
Listing 1: Creating a project and loading data
As described, the challenge in relational learning lies in resolving 1:n relationships between the population table and peripheral tables. In the loan decision example, this means aggregating multiple transactions associated with a customer account (from trans) into a single value that can be used to predict the loan default probability.
From a technical perspective, aggregating multiple rows from peripheral tables involves applying aggregation functions, such as the average of all past account balances. These aggregation functions are usually applied conditionally rather than globally. Listing 2 shows an example of a manually created feature based on the average account balances 90 days before a loan application, typical of feature engineering processes.
SELECT AVG(trans.balance)
FROM population
INNER JOIN trans ON population.account_id = trans.account_id
WHERE trans.date <= population.date_loan
AND trans.date >= population.date_loan - 90
GROUP BY population.account_id;
Listing 2: A SQL query for a manually created feature
But how do data scientists know that the optimal explanatory power for a model comes from the past 90 days and not, for example, 167.23 days? This process is manual, and selecting the column, aggregation function, and conditions involves arbitrary decisions with uncertain outcomes. Feature learning enables algorithms to learn the relevant aggregations and conditions automatically.
To learn features from relational data structures, the algorithm needs information beyond mere data types. In the getML framework, the concept of roles addresses this need. A role is like a data type with additional metadata that defines how the feature-learning algorithm should use a column. For instance:
- The transaction volume (amount) is assigned the role numerical, allowing numerical aggregations like SUM or AVG.
- The transaction type (type) is assigned the role categorical, disallowing numerical aggregations but making it usable for conditions and certain aggregations like COUNT.
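A possible annotation could look like the following sketch. It assumes getML's set_role method and roles module as documented for the framework; exact calls may differ between getML versions, and the demo dataset from Listing 1 may already ship with these annotations.

# Hedged sketch: annotating columns with roles for the feature-learning algorithm.
import getml

getml.set_project("young_devs")                              # as in Listing 1
_, _, _, trans, _ = getml.datasets.load_loans()

trans.set_role("account_id", getml.data.roles.join_key)      # links trans to population
trans.set_role("date", getml.data.roles.time_stamp)          # usable in time conditions
trans.set_role("amount", getml.data.roles.numerical)         # allows SUM, AVG, ...
trans.set_role("type", getml.data.roles.categorical)         # conditions and COUNT only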
Defining the Relational Search Space
In addition to data and annotations, a feature-learning algorithm requires an abstract data model that describes the relationships between tables, defining the relational search space.
The Python API of getML offers several options. For this example, we use the StarSchema class, which allows modeling a star schema with just a few lines of code. The API also provides TimeSeries for time series modeling and a general DataModel for handling complex relational problems (like snowflake schemas). See Listing 3.
schema = getml.data.StarSchema(
train=population_train,
test=population_test,
alias="population",
)
schema.join(
trans,
on="account_id",
time_stamps=("date_loan", "date"),
)
Listing 3: Defining the abstract data model
StarSchema takes the population table as a parameter (population_train and population_test for the training and testing datasets) and an alias (population) for abstract referencing. Peripheral tables, the "spikes" of the star, are added to the schema via join, linking them to the population table at the star's center. A join key (on) is specified, and timestamps (time_stamps) can be added for additional restrictions to ensure model plausibility.
For this example, only transactions preceding the loan are included (date <= date_loan). Joins in the abstract data model follow the lazy evaluation principle, meaning no operations occur at this point. The joins materialize only when the feature-learning algorithms compute the feature representations.
Relational Learning in the Pipeline
multirel = getml.feature_learning.Multirel(
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
)
xgboost = getml.predictors.XGBoostClassifier()
pipe = getml.pipeline.Pipeline(
data_model=schema.data_model,
feature_learners=multirel,
predictors=xgboost,
)
Listing 4: Combining components into the pipeline
The next step is defining the models used in the multi-step learning problem: a feature-learning algorithm (Multirel) and a prediction algorithm (XGBoost). See Listing 4.
With these components, data scientists can construct an end-to-end pipeline, fully capturing the prediction problem. The pipeline accepts the data model (data_model), the feature-learning model (feature_learners), and the prediction model (predictors) (Listing 5).
pipe.fit(population_table=population_train, peripheral_tables={"trans": trans})
Listing 5: Training the pipeline
The fit method passes data to the pipeline and starts the model estimation process. During fit, the pipeline learns the feature logic based on the training data. The algorithms traverse paths in the abstract data model, progressively resolving joins. From the materialized data, they sample repeatedly to learn the feature logic; ensemble methods are used for further feature learning. The pipeline then transforms the learned logic into specific feature values, which serve as input for the prediction algorithm (XGBoost) in the second step.
A trained pipeline (fit) can make predictions on unseen data (predict), evaluate the model's prediction quality (score), or materialize feature values (transform). The latter enables integration with custom models, such as those from scikit-learn.
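The following sketch, continuing with the objects from Listings 1 through 5, illustrates these calls. It mirrors the keyword arguments from Listing 5 as an assumption; the exact signatures may differ between getML versions.

# Hedged sketch: using the trained pipeline on the test data.
predictions = pipe.predict(
    population_table=population_test, peripheral_tables={"trans": trans}
)
scores = pipe.score(
    population_table=population_test, peripheral_tables={"trans": trans}
)

# transform materializes the learned feature values, e.g. as input for a
# custom scikit-learn model instead of the built-in XGBoost predictor.
features_test = pipe.transform(
    population_table=population_test, peripheral_tables={"trans": trans}
)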
Achieving Breakthroughs with Feature Learning
Using the pipeline, the financial institution's data scientists can calculate loan default probabilities for potential borrowers based on transaction histories without manually transforming data, creating features, or writing SQL. They only need to pass raw data and available metadata (data model and annotations) to the API.
Code Demo
The example and over 20 others are available in getML's demo repository on GitHub. The getML engine is offered as a free trial on the getML website. All example notebooks can also be run interactively in a JupyterLab environment in the browser.
The getML framework can also transpile features or entire pipelines into various SQL dialects for execution, for example on a Spark cluster. This makes it possible to validate a feature's business logic. Listing 6 shows a feature learned by Multirel as a SQL query: the minimum balance, subject to conditions on specific transaction types and timestamps.
CREATE TABLE feature_1_61 AS
SELECT MIN(trans.balance) AS feature_1_61
FROM population
INNER JOIN trans ON population.account_id = trans.account_id
WHERE (trans.date <= population.date_loan)
AND (((trans.operation NOT IN ('VYBER')
OR trans.operation IS NULL)
AND (trans.amount > 18278.000000))
OR ((trans.operation IN ('VYBER'))
AND (population.date_loan - trans.date > 7557120.000000)
AND (trans.balance > 28786.000000))
OR ((trans.operation IN ('VYBER'))
AND (population.date_loan - trans.date > 7557120.000000)
AND (trans.balance <= 28786.000000
OR trans.balance IS NULL)
AND (trans.balance <= 1681.000000))
OR ((trans.operation IN ('VYBER'))
AND (population.date_loan - trans.date <= 7557120.000000)))
GROUP BY population.account_id;
Listing 6: A feature learned by Multirel as a SQL query – "VYBER" classifies cash withdrawals
This example illustrates a typical application of relational learning—a bank’s loan decision—in a simplified form. Data scientists can apply feature-learning algorithms to develop prediction models on relational data or time series, independent of industry or use case. Feature learning has the potential to play a central role in enterprise ML applications through end-to-end algorithmic automation.
URL of the original article on heise (German only):
https://www.heise.de/-6655369