IMDb - Predicting actors' gender using getML

Note that due to memory limitations, this notebook will not run on MyBinder.

In this tutorial, we demonstrate how getML can be applied to text fields. In relational databases, text fields are less structured and less standardized than categorical data, which makes it harder to extract useful information from them. As a result, most data science projects on relational data ignore them. With a relational learning tool such as getML, however, we can easily generate simple features from text fields and leverage the information they contain.
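To make the idea of "simple features from text fields" concrete, here is a minimal sketch in plain Python. It is not getML's implementation and does not use the getML API. The toy `ROLES` rows, the vocabulary size, and the helper names `tokenize`, `build_vocabulary`, and `text_features` are all assumptions made for illustration. The sketch picks the most frequent words in a text column and then counts, per actor, how many of that actor's rows contain each word.

```python
import re
from collections import Counter

# Hypothetical toy rows from a "roles" table: (actor_id, free-text role name).
ROLES = [
    (1, "Police Officer"),
    (1, "Officer Smith"),
    (2, "Waitress"),
    (2, "Nurse / Waitress"),
    (3, "Himself"),
]


def tokenize(text):
    """Lowercase a text field and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(rows, size):
    """Return the `size` most frequent tokens across all text fields."""
    counts = Counter(tok for _, text in rows for tok in tokenize(text))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:size]]


def text_features(rows, vocabulary):
    """Per actor, count how many of the actor's rows contain each vocabulary word."""
    features = {}
    for actor_id, text in rows:
        # Every actor gets a full feature row, even if no word matches.
        row = features.setdefault(actor_id, dict.fromkeys(vocabulary, 0))
        present = set(tokenize(text))
        for word in vocabulary:
            if word in present:
                row[word] += 1
    return features


vocab = build_vocabulary(ROLES, size=3)
features = text_features(ROLES, vocab)
```

Each actor ends up with one numeric column per vocabulary word. These columns could then be passed to any standard classifier alongside features derived from the other tables.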

The point of this exercise is not to compete with modern deep-learning-based NLP approaches. The point is to develop an approach by which we can leverage fields in relational databases that would otherwise be ignored.

As an example data set, we use the Internet Movie Database (IMDb), which has been used in previous studies in the relational learning literature. This allows us to benchmark our approach against state-of-the-art relational learning algorithms. We demonstrate that getML outperforms these algorithms.

Summary:

  • Prediction type: Classification model
  • Domain: Entertainment
  • Prediction target: The gender of an actor
  • Population size: 817,718

Author: Dr. Patrick Urbanke

Background

The data set contains about 800,000 actors. The goal is to predict the gender of each actor based on other information we have about them, such as the movies they have appeared in and the roles they have played in these movies.

The data set was downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015).

Related code example

Notebook:
Open in nbviewer
Open in mybinder