At getML, we follow academia and classify techniques that use simple, merely unconditional transformations to construct flat (attribute-value) representations from relational data structures as propositionalization approaches.
Propositionalization is a pretty common approach when dealing with relational data. The construction of meaningful features is usually carried out based on a manually chosen set of aggregations that results from a costly process called feature engineering. In real-world projects, features are often constructed based on a static feature catalog that holds manually crafted recipes for thousands of (often redundant) features. The two main benefits of engineered features based on simple recipes are obvious: such features are easy to interpret, and their calculation is comparatively straightforward. Sometimes such simple features perform decently, and the two benefits outlined above are of particular value when prototyping models during the exploration phase of a data science project.
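To make the idea concrete, here is a minimal sketch of propositionalization in plain Python. It is not FastProp or any getML API: the `customers` and `orders` tables, the `propositionalize` helper, and the set of aggregations are all hypothetical and chosen purely for illustration. Every aggregation is applied unconditionally to all matching rows of the peripheral table, which yields one flat row per customer.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical population table: one row per customer.
customers = [
    {"customer_id": 1},
    {"customer_id": 2},
    {"customer_id": 3},
]

# Hypothetical peripheral table: zero or more orders per customer.
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.0},
    {"customer_id": 2, "amount": 10.0},
]

# A small, manually chosen set of aggregations (the "feature catalog").
AGGREGATIONS = {
    "count": len,
    "sum": sum,
    "avg": lambda values: mean(values) if values else None,
    "max": lambda values: max(values) if values else None,
}


def propositionalize(population, peripheral, key, column):
    """Flatten a one-to-many relation into one feature row per population row."""
    # Group the peripheral values by the join key.
    groups = defaultdict(list)
    for row in peripheral:
        groups[row[key]].append(row[column])

    flat = []
    for row in population:
        values = groups.get(row[key], [])
        # Apply every aggregation unconditionally to all matching rows.
        features = {
            f"{name}_{column}": aggregate(values)
            for name, aggregate in AGGREGATIONS.items()
        }
        flat.append({**row, **features})
    return flat


flat_table = propositionalize(customers, orders, "customer_id", "amount")
for feature_row in flat_table:
    print(feature_row)
```

Each entry in `AGGREGATIONS` plays the role of one recipe in a feature catalog. Adding more columns or more aggregations grows the flat table quickly, which is why the speed of feature generation matters in practice.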
Hence, we developed our own take on propositionalization: FastProp (short for fast propositionalization), which we have just released with getML 0.16.0. We have leveraged getML's strengths to make FastProp not only exceptionally fast but also powerful. FastProp is capable of building up to hundreds of features based on a single column and builds on some of getML's core concepts, such as units and mappings, to deliver an advanced set of propositionalized features. In no time.
In a recently published set of notebooks, we benchmark FastProp against some popular libraries based on the propositionalization approach.
In these benchmarks, FastProp is true to its name: it achieves similar or slightly better predictive performance than featuretools or tsfresh, but generates features between 34x and 179x faster than these implementations. In most cases, the predictive performance can be further improved by utilizing one of getML's advanced feature learning algorithms.
If you want to reproduce these results, please refer to the following notebooks:
|Data set|Speed-up of FastProp in feature generation|
|---|---|
|Air pollution|~51x faster than featuretools, ~39x faster than tsfresh|
|Dodgers|~44x faster than featuretools, ~81x faster than tsfresh|
|Interstate94|~83x faster than featuretools|
|Occupancy|~61x faster than featuretools, ~34x faster than tsfresh|
|Robot|~179x faster than featuretools, ~84x faster than tsfresh|
Note: These results are hardware-dependent and may be different on your machine.
FastProp can rely on getML's custom-built C++-native in-memory database engine, which is highly optimized for relational data structures. getML's novel algorithms take advantage of advanced caching strategies and functional design patterns, where all column-based operations are evaluated lazily. This means getML carries out operations only on rows that matter and avoids redundant operations as much as possible. When deciding on whether to include an observation, getML takes into account even complex conditions that might span multiple tables within the relational data model. This allows working with data sets of substantial size even without falling back to distributed computing models.
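The lazy evaluation idea can be sketched in a few lines of Python. This is only an illustration of the general pattern, not getML's engine (which is written in C++): the `LazyColumn` class, its `map` and `evaluate` methods, and the sample data are hypothetical. Operations on a column are recorded when they are declared and executed only for the rows that a condition selects.

```python
class LazyColumn:
    """A column whose element-wise operations are recorded, not executed."""

    def __init__(self, values, operations=()):
        self._values = values
        self._operations = tuple(operations)

    def map(self, function):
        # Record the operation and return a new lazy column; nothing runs yet.
        return LazyColumn(self._values, self._operations + (function,))

    def evaluate(self, row_indices):
        # Run the recorded operations, but only for the requested rows.
        results = []
        for index in row_indices:
            value = self._values[index]
            for function in self._operations:
                value = function(value)
            results.append(value)
        return results


call_log = []  # Tracks how often the costly operation actually runs.


def costly_square(value):
    call_log.append(value)
    return value * value


temperature = LazyColumn([1.0, 2.0, 3.0, 4.0, 5.0])
derived = temperature.map(costly_square).map(lambda value: value + 1.0)

# Declaring the pipeline has not triggered any computation so far.
calls_before_evaluation = len(call_log)

# A condition, possibly derived from another table, selects the rows that matter.
is_relevant = [False, True, False, True, False]
selected_rows = [index for index, keep in enumerate(is_relevant) if keep]

features = derived.evaluate(selected_rows)
print(features, "computed with", len(call_log), "calls")
```

In this sketch, the costly operation runs twice instead of five times, because the three rows excluded by the condition are never touched.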