IMDb - Predicting actors' gender¶
In this tutorial, we demonstrate how getML can be applied to text fields. In relational databases, text fields are less structured and less standardized than categorical data, making it more difficult to extract useful information from them. Therefore, they are ignored in most data science projects on relational data. However, when using a relational learning tool such as getML, we can easily generate simple features from text fields and leverage the information contained therein.
The point of this exercise is not to compete with modern deep-learning-based NLP approaches. The point is to develop an approach by which we can leverage fields in relational databases that would otherwise be ignored.
As an example data set, we use the Internet Movie Database, which has been used in several previous studies in the relational learning literature. This allows us to benchmark our approach against state-of-the-art relational learning algorithms, and we demonstrate that getML outperforms them.
Summary:
- Prediction type: Classification model
- Domain: Entertainment
- Prediction target: The gender of an actor
- Population size: 817718
Background¶
The data set contains about 800,000 actors. The goal is to predict the gender of these actors based on the other information we have about them, such as the movies they have participated in and the roles they have played in those movies.
The data set was downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015), which now resides at relational-data.org.
Analysis¶
Let's get started with the analysis and set up our session:
%pip install -q "getml==1.5.0" "pyspark==3.5.2" "ipywidgets==8.1.5"
import os
from pyspark.sql import SparkSession
import getml
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
getml.engine.launch(allow_remote_ips=True, token='token')
getml.engine.set_project('imdb')
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912144137.log. Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Connected to project 'imdb'.
In the following, we set some flags that affect the execution of the notebook:
- We don't let the algorithms use the actors' first names (see below for an explanation).
- We don't execute the Spark cells in section 2.6 by default; set RUN_SPARK to True if you have a Spark environment available.
USE_FIRST_NAMES = False
RUN_SPARK = False
1. Loading data¶
1.1 Download from source¶
We begin by downloading the data from the source file:
conn = getml.database.connect_mysql(
host="db.relational-data.org",
dbname="imdb_ijs",
port=3306,
user="guest",
password="relational"
)
conn
Connection(dbname='imdb_ijs', dialect='mysql', host='db.relational-data.org', port=3306)
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if getml.data.exists(name):
        return getml.data.load_data_frame(name)
    data_frame = getml.data.DataFrame.from_db(
        name=name,
        table_name=name,
        conn=conn
    )
    data_frame.save()
    return data_frame
actors = load_if_needed("actors")
roles = load_if_needed("roles")
movies = load_if_needed("movies")
movies_genres = load_if_needed("movies_genres")
actors
name | id | first_name | last_name | gender |
---|---|---|---|---|
role | unused_float | unused_string | unused_string | unused_string |
0 | 2 | Michael | 'babeepower' Viera | M |
1 | 3 | Eloy | 'Chincheta' | M |
2 | 4 | Dieguito | 'El Cigala' | M |
3 | 5 | Antonio | 'El de Chipiona' | M |
4 | 6 | José | 'El Francés' | M |
... | ... | ... | ... | |
817713 | 845461 | Herdís | Þorvaldsdóttir | F |
817714 | 845462 | Katla Margrét | Þorvaldsdóttir | F |
817715 | 845463 | Lilja Nótt | Þórarinsdóttir | F |
817716 | 845464 | Hólmfríður | Þórhallsdóttir | F |
817717 | 845465 | Theódóra | Þórðardóttir | F |
817718 rows x 4 columns
memory usage: 40.22 MB
name: actors
type: getml.DataFrame
roles
name | actor_id | movie_id | role |
---|---|---|---|
role | unused_float | unused_float | unused_string |
0 | 2 | 280088 | Stevie |
1 | 2 | 396232 | Various/lyricist |
2 | 3 | 376687 | Gitano 1 |
3 | 4 | 336265 | El Cigala |
4 | 5 | 135644 | Himself |
... | ... | ... | |
3431961 | 845461 | 137097 | Kata |
3431962 | 845462 | 208838 | Magga |
3431963 | 845463 | 870 | Gunna |
3431964 | 845464 | 378123 | Gudrun |
3431965 | 845465 | 378123 | NULL |
3431966 rows x 3 columns
memory usage: 115.41 MB
name: roles
type: getml.DataFrame
movies
name | id | year | rank | name |
---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_string |
0 | 0 | 2002 | nan | #28 |
1 | 1 | 2000 | nan | #7 Train: An Immigrant Journey, ... |
2 | 2 | 1971 | 6.4 | $ |
3 | 3 | 1913 | nan | $1,000 Reward |
4 | 4 | 1915 | nan | $1,000 Reward |
... | ... | ... | ... | |
388264 | 412316 | 1991 | nan | "zem blch krlu" |
388265 | 412317 | 1995 | nan | "rgammk" |
388266 | 412318 | 2002 | nan | "zgnm Leyla" |
388267 | 412319 | 1983 | nan | " Istanbul" |
388268 | 412320 | 1958 | nan | "sterreich" |
388269 rows x 4 columns
memory usage: 19.92 MB
name: movies
type: getml.DataFrame
movies_genres
name | movie_id | genre |
---|---|---|
role | unused_float | unused_string |
0 | 1 | Documentary |
1 | 1 | Short |
2 | 2 | Comedy |
3 | 2 | Crime |
4 | 5 | Western |
... | ... | |
395114 | 378612 | Adventure |
395115 | 378612 | Drama |
395116 | 378613 | Comedy |
395117 | 378613 | Drama |
395118 | 378614 | Comedy |
395119 rows x 2 columns
memory usage: 9.24 MB
name: movies_genres
type: getml.DataFrame
1.2 Prepare data for getML¶
getML requires us to assign a role to each column.
actors["target"] = (actors.gender == 'F')
actors.set_role("id", getml.data.roles.join_key)
actors.set_role("target", getml.data.roles.target)
The benchmark studies do not state clearly whether it is fair game to use the actors' first names. Using the first names, we can easily increase the predictive accuracy to above 90%. However, doing so essentially turns the task into a first-name identification problem rather than a relational learning problem, which would undermine the point of this notebook: to showcase relational learning. We therefore assume that using the first names is not allowed. Feel free to set the flag above to True to see how well getML incorporates such straightforward information into its feature logic.
if USE_FIRST_NAMES:
    actors.set_role("first_name", getml.data.roles.text)
actors
name | id | target | first_name | last_name | gender |
---|---|---|---|---|---|
role | join_key | target | unused_string | unused_string | unused_string |
0 | 2 | 0 | Michael | 'babeepower' Viera | M |
1 | 3 | 0 | Eloy | 'Chincheta' | M |
2 | 4 | 0 | Dieguito | 'El Cigala' | M |
3 | 5 | 0 | Antonio | 'El de Chipiona' | M |
4 | 6 | 0 | José | 'El Francés' | M |
... | ... | ... | ... | ... | |
817713 | 845461 | 1 | Herdís | Þorvaldsdóttir | F |
817714 | 845462 | 1 | Katla Margrét | Þorvaldsdóttir | F |
817715 | 845463 | 1 | Lilja Nótt | Þórarinsdóttir | F |
817716 | 845464 | 1 | Hólmfríður | Þórhallsdóttir | F |
817717 | 845465 | 1 | Theódóra | Þórðardóttir | F |
817718 rows x 5 columns
memory usage: 43.49 MB
name: actors
type: getml.DataFrame
roles.set_role(["actor_id", "movie_id"], getml.data.roles.join_key)
roles.set_role("role", getml.data.roles.text)
roles
name | actor_id | movie_id | role |
---|---|---|---|
role | join_key | join_key | text |
0 | 2 | 280088 | Stevie |
1 | 2 | 396232 | Various/lyricist |
2 | 3 | 376687 | Gitano 1 |
3 | 4 | 336265 | El Cigala |
4 | 5 | 135644 | Himself |
... | ... | ... | |
3431961 | 845461 | 137097 | Kata |
3431962 | 845462 | 208838 | Magga |
3431963 | 845463 | 870 | Gunna |
3431964 | 845464 | 378123 | Gudrun |
3431965 | 845465 | 378123 | NULL |
3431966 rows x 3 columns
memory usage: 87.96 MB
name: roles
type: getml.DataFrame
movies.set_role("id", getml.data.roles.join_key)
movies.set_role(["year", "rank"], getml.data.roles.numerical)
movies
name | id | year | rank | name |
---|---|---|---|---|
role | join_key | numerical | numerical | unused_string |
0 | 0 | 2002 | nan | #28 |
1 | 1 | 2000 | nan | #7 Train: An Immigrant Journey, ... |
2 | 2 | 1971 | 6.4 | $ |
3 | 3 | 1913 | nan | $1,000 Reward |
4 | 4 | 1915 | nan | $1,000 Reward |
... | ... | ... | ... | |
388264 | 412316 | 1991 | nan | "zem blch krlu" |
388265 | 412317 | 1995 | nan | "rgammk" |
388266 | 412318 | 2002 | nan | "zgnm Leyla" |
388267 | 412319 | 1983 | nan | " Istanbul" |
388268 | 412320 | 1958 | nan | "sterreich" |
388269 rows x 4 columns
memory usage: 18.37 MB
name: movies
type: getml.DataFrame
movies_genres.set_role("movie_id", getml.data.roles.join_key)
movies_genres.set_role("genre", getml.data.roles.categorical)
movies_genres
name | movie_id | genre |
---|---|---|
role | join_key | categorical |
0 | 1 | Documentary |
1 | 1 | Short |
2 | 2 | Comedy |
3 | 2 | Crime |
4 | 5 | Western |
... | ... | |
395114 | 378612 | Adventure |
395115 | 378612 | Drama |
395116 | 378613 | Comedy |
395117 | 378613 | Drama |
395118 | 378614 | Comedy |
395119 rows x 2 columns
memory usage: 3.16 MB
name: movies_genres
type: getml.DataFrame
We need to separate our data set into training, validation, and testing sets:
split = getml.data.split.random(train=0.7, validation=0.15, test=0.15)
split
0 | train |
---|---|
1 | validation |
2 | train |
3 | validation |
4 | validation |
... |
infinite number of rows
type: StringColumnView
container = getml.data.Container(population=actors, split=split)
container.add(
roles=roles,
movies=movies,
movies_genres=movies_genres,
)
container
subset | name | rows | type | |
---|---|---|---|---|
0 | test | actors | 122794 | View |
1 | train | actors | 571807 | View |
2 | validation | actors | 123117 | View |
name | rows | type | |
---|---|---|---|
0 | roles | 3431966 | DataFrame |
1 | movies | 388269 | DataFrame |
2 | movies_genres | 395119 | DataFrame |
2. Predictive modelling¶
We have loaded the data and defined the roles. Next, we create a getML pipeline for relational learning.
2.1 Define relational model¶
To get started with relational learning, we need to specify the data model. Note that movies is joined to roles using a many-to-one relationship, which means the two tables are merged into a single staging table (ROLES__STAGING_TABLE_3 below) rather than aggregated separately.
dm = getml.data.DataModel("actors")
dm.add(getml.data.to_placeholder(
roles=roles,
movies=movies,
movies_genres=movies_genres,
))
dm.population.join(
dm.roles,
on=("id", "actor_id"),
)
dm.roles.join(
dm.movies,
on=("movie_id", "id"),
relationship=getml.data.relationship.many_to_one,
)
dm.movies.join(
dm.movies_genres,
on=("id", "movie_id"),
)
dm
data frames | staging table | |
---|---|---|
0 | actors | ACTORS__STAGING_TABLE_1 |
1 | movies_genres | MOVIES_GENRES__STAGING_TABLE_2 |
2 | roles, movies | ROLES__STAGING_TABLE_3 |
2.2 getML pipeline¶
Set up the feature learner & predictor
The TextFieldSplitter preprocessor splits text fields, such as the roles the actors have played, into individual words, and the Mapping preprocessor maps these words and the categorical values to numerical values so that FastProp can aggregate them into features. We can either use the feature learner's default parameters or a more fine-tuned configuration. Fine-tuning the parameters can increase our predictive accuracy to 85%, but the training time increases to over 4 hours. We therefore stick with the default parameters (a sketch of a fine-tuned set-up follows the next code cell).
text_field_splitter = getml.preprocessors.TextFieldSplitter()
mapping = getml.preprocessors.Mapping()
fast_prop = getml.feature_learning.FastProp(
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
)
feature_selector = getml.predictors.XGBoostClassifier()
predictor = getml.predictors.XGBoostClassifier()
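For illustration, a more fine-tuned set-up might look like the following sketch. The concrete values are assumptions chosen for demonstration purposes; they are not the exact settings behind the 85% figure quoted above.
# Hypothetical fine-tuned configuration: more candidate features and a
# larger, more strongly regularized XGBoost model. The values below are
# illustrative assumptions; expect training times of several hours.
fast_prop_tuned = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    num_features=400,
)
predictor_tuned = getml.predictors.XGBoostClassifier(
    n_estimators=500,
    max_depth=7,
    learning_rate=0.05,
    reg_lambda=1.0,
)
# These could replace `fast_prop` and `predictor` in the pipeline built below.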
Build the pipeline
pipe = getml.pipeline.Pipeline(
tags=['fast_prop'],
data_model=dm,
preprocessors=[text_field_splitter, mapping],
feature_learners=[fast_prop],
feature_selectors=[feature_selector],
predictors=[predictor],
share_selected_features=0.1,
)
2.3 Model training¶
pipe.check(container.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:28 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 1 issues labeled INFO and 0 issues labeled WARNING.
type | label | message | |
---|---|---|---|
0 | INFO | FOREIGN KEYS NOT FOUND | When joining ROLES__STAGING_TABLE_3 and MOVIES_GENRES__STAGING_TABLE_2 over 'id' and 'movie_id', there are no corresponding entries for 26.899421% of entries in 'id' in 'ROLES__STAGING_TABLE_3'. You might want to double-check your join keys. |
pipe.fit(container.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:09
The pipeline check generated 1 issues labeled INFO and 0 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:09 Indexing text fields... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:06 FastProp: Trying 226 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:52 FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:05 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 01:03 XGBoost: Training as feature selector... ━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 04:04 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:29
Trained pipeline.
Time taken: 0:06:52.549544.
Pipeline(data_model='actors', feature_learners=['FastProp'], feature_selectors=['XGBoostClassifier'], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['movies', 'movies_genres', 'roles'], predictors=['XGBoostClassifier'], preprocessors=['TextFieldSplitter', 'Mapping'], share_selected_features=0.1, tags=['fast_prop', 'container-i1xBdh'])
2.4 Model evaluation¶
pipe.score(container.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:08 FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 13:13:37 | train | target | 0.8417 | 0.9139 | 0.3217 |
1 | 2024-09-12 13:13:51 | test | target | 0.842 | 0.9138 | 0.323 |
2.5 Features¶
The most important feature looks as follows:
pipe.features.to_sql()[pipe.features.sort(by="importances")[0].name]
DROP TABLE IF EXISTS "FEATURE_1_164";
CREATE TABLE "FEATURE_1_164" AS
SELECT AVG( COALESCE( f_1_1_18."feature_1_1_18", 0.0 ) ) AS "feature_1_164",
t1.rowid AS rownum
FROM "ACTORS__STAGING_TABLE_1" t1
INNER JOIN "ROLES__STAGING_TABLE_3" t2
ON t1."id" = t2."actor_id"
LEFT JOIN "FEATURE_1_1_18" f_1_1_18
ON t2.rowid = f_1_1_18.rownum
GROUP BY t1.rowid;
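Beyond transpiling a single feature, we can also list the most important features programmatically. A minimal sketch, reusing the sort() call from above (printing the top five is an arbitrary choice):
# Print the names of the five most important features.
by_importance = pipe.features.sort(by="importances")
for i in range(5):
    print(by_importance[i].name)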
2.6 Productionization¶
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Here, we will demonstrate how the pipeline can be transpiled to Spark SQL and then executed on a Spark cluster.
pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("imdb_spark")
if RUN_SPARK:
    spark = SparkSession.builder.appName(
        "imdb"
    ).config(
        "spark.driver.maxResultSize", "10g"
    ).config(
        "spark.driver.memory", "10g"
    ).config(
        "spark.executor.memory", "20g"
    ).config(
        "spark.sql.execution.arrow.pyspark.enabled", "true"
    ).config(
        "spark.sql.session.timeZone", "UTC"
    ).enableHiveSupport().getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
if RUN_SPARK:
    population_spark = container.train.population.to_pyspark(spark, name="actors")
if RUN_SPARK:
    movies_genres_spark = container.movies_genres.to_pyspark(spark, name="movies_genres")
    roles_spark = container.roles.to_pyspark(spark, name="roles")
    movies_spark = container.movies.to_pyspark(spark, name="movies")
if RUN_SPARK:
    getml.spark.execute(spark, "imdb_spark")
if RUN_SPARK:
    spark.sql("SELECT * FROM `FEATURES` LIMIT 20").toPandas()
3. Conclusion¶
In this notebook, we have demonstrated how getML can be applied to text fields and shown that our approach outperforms state-of-the-art relational learning algorithms on the IMDb data set.
getml.engine.shutdown()
References¶
Motl, Jan, and Oliver Schulte. "The CTU Prague Relational Learning Repository." arXiv preprint arXiv:1511.03086 (2015).
Neville, Jennifer, and David Jensen. "Relational dependency networks." Journal of Machine Learning Research 8.Mar (2007): 653-692.
Neville, Jennifer, and David Jensen. "Collective classification with relational dependency networks." Workshop on Multi-Relational Data Mining (MRDM-2003). 2003.
Neville, Jennifer, et al. "Learning relational probability trees." Proceedings of the Ninth ACM SIGKDD international conference on Knowledge discovery and data mining. 2003.
Perovšek, Matic, et al. "Wordification: Propositionalization by unfolding relational data into bags of words." Expert Systems with Applications 42.17-18 (2015): 6442-6456.