Note that due to memory limitations, this notebook will not run on MyBinder.
In this tutorial, we demonstrate how getML can be applied to text fields. In relational databases, text fields are less structured and less standardized than categorical data, making it more difficult to extract useful information from them. Therefore, they are ignored in most data science projects on relational data. However, when using a relational learning tool such as getML, we can easily generate simple features from text fields and leverage the information contained therein.
The point of this exercise is not to compete with modern deep-learning-based NLP approaches. The point is to develop an approach by which we can leverage fields in relational databases that would otherwise be ignored.
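To make the idea of "simple features from text fields" concrete, here is a minimal sketch in plain Python. It is not getML's implementation. It only illustrates one common approach, which is to build a small vocabulary from a text column and count word occurrences per entity. The toy `roles` rows and the helper names `tokenize`, `build_vocabulary`, and `text_features` are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy rows standing in for a roles table: (actor_id, role text).
roles = [
    (1, "Police Officer"),
    (1, "Angry Husband"),
    (2, "Nurse"),
    (2, "Waitress at Diner"),
    (2, "Nurse"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(rows, size):
    """Pick the `size` most frequent words across all text fields."""
    counts = Counter(word for _, text in rows for word in tokenize(text))
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:size]]


def text_features(rows, vocabulary):
    """Count, per actor, how often each vocabulary word occurs in their roles."""
    per_actor = defaultdict(Counter)
    for actor_id, text in rows:
        per_actor[actor_id].update(tokenize(text))
    return {
        actor_id: [counts[word] for word in vocabulary]
        for actor_id, counts in per_actor.items()
    }


vocabulary = build_vocabulary(roles, size=3)
features = text_features(roles, vocabulary)
print(vocabulary)  # ['nurse', 'angry', 'at']
print(features)    # {1: [0, 1, 0], 2: [2, 0, 1]}
```

Each actor ends up with one fixed-length numeric vector, which is the kind of input a standard classifier can consume. A relational learning tool automates this aggregation across joined tables rather than requiring it to be written by hand.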
As an example data set, we use the Internet Movie Database, which has been used in previous studies in the relational learning literature. This allows us to benchmark our approach against state-of-the-art relational learning algorithms. We demonstrate that getML outperforms these algorithms.
Author: Dr. Patrick Urbanke
The data set contains about 800,000 actors. The goal is to predict the gender of these actors based on other information we have about them, such as the movies they have appeared in and the roles they have played in these movies.
The data set has been downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015).