IMDb - Predicting actors' gender¶
In this tutorial, we demonstrate how getML can be applied to text fields. In relational databases, text fields are less structured and less standardized than categorical data, making it more difficult to extract useful information from them. Therefore, they are ignored in most data science projects on relational data. However, when using a relational learning tool such as getML, we can easily generate simple features from text fields and leverage the information contained therein.
The point of this exercise is not to compete with modern deep-learning-based NLP approaches. The point is to develop an approach by which we can leverage fields in relational databases that would otherwise be ignored.
As an example data set, we use the Internet Movie Database, which has been used in several previous studies in the relational learning literature. This allows us to benchmark our approach against state-of-the-art relational learning algorithms, and we demonstrate that getML outperforms them.
Summary:
- Prediction type: Classification model
- Domain: Entertainment
- Prediction target: The gender of an actor
- Population size: 817718
Background¶
The data set contains about 800,000 actors. The goal is to predict the gender of these actors based on the other information we have about them, such as the movies they have participated in and the roles they have played in those movies.
The data set was downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015), which now resides at relational-data.org.
Analysis¶
Let's get started with the analysis and set up our session:
%pip install -q "getml==1.5.0" "pyspark==3.5.2" "ipywidgets==8.1.5"
import os
from pyspark.sql import SparkSession
import getml
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
getml.engine.launch(allow_remote_ips=True, token='token')
getml.engine.set_project('imdb')
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912144137.log. Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Connected to project 'imdb'.
In the following, we set some flags that affect the execution of the notebook:
- We don't let the algorithms use the actors' first names (see below for an explanation).
- We don't execute the Spark cells in section 2.6 by default; set RUN_SPARK to True if you have a Spark environment available.
USE_FIRST_NAMES = False
RUN_SPARK = False
1. Loading data¶
1.1 Download from source¶
We begin by downloading the data from the source file:
conn = getml.database.connect_mysql(
host="db.relational-data.org",
dbname="imdb_ijs",
port=3306,
user="guest",
password="relational"
)
conn
Connection(dbname='imdb_ijs', dialect='mysql', host='db.relational-data.org', port=3306)
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if getml.data.exists(name):
        return getml.data.load_data_frame(name)
    data_frame = getml.data.DataFrame.from_db(
        name=name,
        table_name=name,
        conn=conn
    )
    data_frame.save()
    return data_frame
actors = load_if_needed("actors")
roles = load_if_needed("roles")
movies = load_if_needed("movies")
movies_genres = load_if_needed("movies_genres")
actors
name | id | first_name | last_name | gender |
---|---|---|---|---|
role | unused_float | unused_string | unused_string | unused_string |
0 | 2 | Michael | 'babeepower' Viera | M |
1 | 3 | Eloy | 'Chincheta' | M |
2 | 4 | Dieguito | 'El Cigala' | M |
3 | 5 | Antonio | 'El de Chipiona' | M |
4 | 6 | José | 'El Francés' | M |
... | ... | ... | ... | |
817713 | 845461 | Herdís | Þorvaldsdóttir | F |
817714 | 845462 | Katla Margrét | Þorvaldsdóttir | F |
817715 | 845463 | Lilja Nótt | Þórarinsdóttir | F |
817716 | 845464 | Hólmfríður | Þórhallsdóttir | F |
817717 | 845465 | Theódóra | Þórðardóttir | F |
817718 rows x 4 columns
memory usage: 40.22 MB
name: actors
type: getml.DataFrame
roles
name | actor_id | movie_id | role |
---|---|---|---|
role | unused_float | unused_float | unused_string |
0 | 2 | 280088 | Stevie |
1 | 2 | 396232 | Various/lyricist |
2 | 3 | 376687 | Gitano 1 |
3 | 4 | 336265 | El Cigala |
4 | 5 | 135644 | Himself |
... | ... | ... | |
3431961 | 845461 | 137097 | Kata |
3431962 | 845462 | 208838 | Magga |
3431963 | 845463 | 870 | Gunna |
3431964 | 845464 | 378123 | Gudrun |
3431965 | 845465 | 378123 | NULL |
3431966 rows x 3 columns
memory usage: 115.41 MB
name: roles
type: getml.DataFrame
movies
name | id | year | rank | name |
---|---|---|---|---|
role | unused_float | unused_float | unused_float | unused_string |
0 | 0 | 2002 | nan | #28 |
1 | 1 | 2000 | nan | #7 Train: An Immigrant Journey, ... |
2 | 2 | 1971 | 6.4 | $ |
3 | 3 | 1913 | nan | $1,000 Reward |
4 | 4 | 1915 | nan | $1,000 Reward |
... | ... | ... | ... | |
388264 | 412316 | 1991 | nan | "zem blch krlu" |
388265 | 412317 | 1995 | nan | "rgammk" |
388266 | 412318 | 2002 | nan | "zgnm Leyla" |
388267 | 412319 | 1983 | nan | " Istanbul" |
388268 | 412320 | 1958 | nan | "sterreich" |
388269 rows x 4 columns
memory usage: 19.92 MB
name: movies
type: getml.DataFrame
movies_genres
name | movie_id | genre |
---|---|---|
role | unused_float | unused_string |
0 | 1 | Documentary |
1 | 1 | Short |
2 | 2 | Comedy |
3 | 2 | Crime |
4 | 5 | Western |
... | ... | |
395114 | 378612 | Adventure |
395115 | 378612 | Drama |
395116 | 378613 | Comedy |
395117 | 378613 | Drama |
395118 | 378614 | Comedy |
395119 rows x 2 columns
memory usage: 9.24 MB
name: movies_genres
type: getml.DataFrame
1.2 Prepare data for getML¶
getML requires us to assign a role to each column.
actors["target"] = (actors.gender == 'F')
actors.set_role("id", getml.data.roles.join_key)
actors.set_role("target", getml.data.roles.target)
The benchmark studies do not state clearly whether it is fair game to use the actors' first names. Using the first names, we can easily increase the predictive accuracy to above 90%. However, doing so essentially turns the task into a first-name identification problem rather than a relational learning problem, which would undermine the point of this notebook: to showcase relational learning. We therefore assume that using the first names is not allowed. Feel free to set the flag above to True to see how well getML incorporates such straightforward information into its feature logic.
if USE_FIRST_NAMES:
    actors.set_role("first_name", getml.data.roles.text)
actors
name | id | target | first_name | last_name | gender |
---|---|---|---|---|---|
role | join_key | target | unused_string | unused_string | unused_string |
0 | 2 | 0 | Michael | 'babeepower' Viera | M |
1 | 3 | 0 | Eloy | 'Chincheta' | M |
2 | 4 | 0 | Dieguito | 'El Cigala' | M |
3 | 5 | 0 | Antonio | 'El de Chipiona' | M |
4 | 6 | 0 | José | 'El Francés' | M |
... | ... | ... | ... | ... | |
817713 | 845461 | 1 | Herdís | Þorvaldsdóttir | F |
817714 | 845462 | 1 | Katla Margrét | Þorvaldsdóttir | F |
817715 | 845463 | 1 | Lilja Nótt | Þórarinsdóttir | F |
817716 | 845464 | 1 | Hólmfríður | Þórhallsdóttir | F |
817717 | 845465 | 1 | Theódóra | Þórðardóttir | F |
817718 rows x 5 columns
memory usage: 43.49 MB
name: actors
type: getml.DataFrame
roles.set_role(["actor_id", "movie_id"], getml.data.roles.join_key)
roles.set_role("role", getml.data.roles.text)
roles
name | actor_id | movie_id | role |
---|---|---|---|
role | join_key | join_key | text |
0 | 2 | 280088 | Stevie |
1 | 2 | 396232 | Various/lyricist |
2 | 3 | 376687 | Gitano 1 |
3 | 4 | 336265 | El Cigala |
4 | 5 | 135644 | Himself |
... | ... | ... | |
3431961 | 845461 | 137097 | Kata |
3431962 | 845462 | 208838 | Magga |
3431963 | 845463 | 870 | Gunna |
3431964 | 845464 | 378123 | Gudrun |
3431965 | 845465 | 378123 | NULL |
3431966 rows x 3 columns
memory usage: 87.96 MB
name: roles
type: getml.DataFrame
movies.set_role("id", getml.data.roles.join_key)
movies.set_role(["year", "rank"], getml.data.roles.numerical)
movies
name | id | year | rank | name |
---|---|---|---|---|
role | join_key | numerical | numerical | unused_string |
0 | 0 | 2002 | nan | #28 |
1 | 1 | 2000 | nan | #7 Train: An Immigrant Journey, ... |
2 | 2 | 1971 | 6.4 | $ |
3 | 3 | 1913 | nan | $1,000 Reward |
4 | 4 | 1915 | nan | $1,000 Reward |
... | ... | ... | ... | |
388264 | 412316 | 1991 | nan | "zem blch krlu" |
388265 | 412317 | 1995 | nan | "rgammk" |
388266 | 412318 | 2002 | nan | "zgnm Leyla" |
388267 | 412319 | 1983 | nan | " Istanbul" |
388268 | 412320 | 1958 | nan | "sterreich" |
388269 rows x 4 columns
memory usage: 18.37 MB
name: movies
type: getml.DataFrame
movies_genres.set_role("movie_id", getml.data.roles.join_key)
movies_genres.set_role("genre", getml.data.roles.categorical)
movies_genres
name | movie_id | genre |
---|---|---|
role | join_key | categorical |
0 | 1 | Documentary |
1 | 1 | Short |
2 | 2 | Comedy |
3 | 2 | Crime |
4 | 5 | Western |
... | ... | |
395114 | 378612 | Adventure |
395115 | 378612 | Drama |
395116 | 378613 | Comedy |
395117 | 378613 | Drama |
395118 | 378614 | Comedy |
395119 rows x 2 columns
memory usage: 3.16 MB
name: movies_genres
type: getml.DataFrame
We need to separate our data set into training, validation, and testing sets:
split = getml.data.split.random(train=0.7, validation=0.15, test=0.15)
split
0 | train |
---|---|
1 | validation |
2 | train |
3 | validation |
4 | validation |
... |
infinite number of rows
type: StringColumnView
container = getml.data.Container(population=actors, split=split)
container.add(
roles=roles,
movies=movies,
movies_genres=movies_genres,
)
container
subset | name | rows | type | |
---|---|---|---|---|
0 | test | actors | 122794 | View |
1 | train | actors | 571807 | View |
2 | validation | actors | 123117 | View |
name | rows | type | |
---|---|---|---|
0 | roles | 3431966 | DataFrame |
1 | movies | 388269 | DataFrame |
2 | movies_genres | 395119 | DataFrame |
2. Predictive modelling¶
We have loaded the data and defined the roles. Next, we create a getML pipeline for relational learning.
2.1 Define relational model¶
To get started with relational learning, we need to specify the data model. Note that movies is joined to roles using a many-to-one relationship, which means the two tables are merged into a single staging table (ROLES__STAGING_TABLE_3 below) rather than aggregated separately.
dm = getml.data.DataModel("actors")
dm.add(getml.data.to_placeholder(
roles=roles,
movies=movies,
movies_genres=movies_genres,
))
dm.population.join(
dm.roles,
on=("id", "actor_id"),
)
dm.roles.join(
dm.movies,
on=("movie_id", "id"),
relationship=getml.data.relationship.many_to_one,
)
dm.movies.join(
dm.movies_genres,
on=("id", "movie_id"),
)
dm
data frames | staging table | |
---|---|---|
0 | actors | ACTORS__STAGING_TABLE_1 |
1 | movies_genres | MOVIES_GENRES__STAGING_TABLE_2 |
2 | roles, movies | ROLES__STAGING_TABLE_3 |
2.2 getML pipeline¶
Set up the feature learner & predictor
The TextFieldSplitter preprocessor splits text fields, such as the roles the actors have played, into individual words, and the Mapping preprocessor maps these words and the categorical values to numerical values so that FastProp can aggregate them into features. We can either use the feature learner's default parameters or a more fine-tuned configuration. Fine-tuning the parameters can increase our predictive accuracy to 85%, but the training time increases to over 4 hours. We therefore stick with the default parameters (a sketch of a fine-tuned set-up follows the next code cell).
text_field_splitter = getml.preprocessors.TextFieldSplitter()
mapping = getml.preprocessors.Mapping()
fast_prop = getml.feature_learning.FastProp(
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
)
feature_selector = getml.predictors.XGBoostClassifier()
predictor = getml.predictors.XGBoostClassifier()
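For illustration, a more fine-tuned set-up might look like the following sketch. The concrete values are assumptions chosen for demonstration purposes; they are not the exact settings behind the 85% figure quoted above.
# Hypothetical fine-tuned configuration: more candidate features and a
# larger, more strongly regularized XGBoost model. The values below are
# illustrative assumptions; expect training times of several hours.
fast_prop_tuned = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    num_features=400,
)
predictor_tuned = getml.predictors.XGBoostClassifier(
    n_estimators=500,
    max_depth=7,
    learning_rate=0.05,
    reg_lambda=1.0,
)
# These could replace `fast_prop` and `predictor` in the pipeline built below.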
Build the pipeline
pipe = getml.pipeline.Pipeline(
tags=['fast_prop'],
data_model=dm,
preprocessors=[text_field_splitter, mapping],
feature_learners=[fast_prop],
feature_selectors=[feature_selector],
predictors=[predictor],
share_selected_features=0.1,
)
2.3 Model training¶
pipe.check(container.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:28 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 1 issues labeled INFO and 0 issues labeled WARNING.
type | label | message | |
---|---|---|---|
0 | INFO | FOREIGN KEYS NOT FOUND | When joining ROLES__STAGING_TABLE_3 and MOVIES_GENRES__STAGING_TABLE_2 over 'id' and 'movie_id', there are no corresponding entries for 26.899421% of entries in 'id' in 'ROLES__STAGING_TABLE_3'. You might want to double-check your join keys. |
pipe.fit(container.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:09
The pipeline check generated 1 issues labeled INFO and 0 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:09 Indexing text fields... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:06 FastProp: Trying 226 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:52 FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:05 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 01:03 XGBoost: Training as feature selector... ━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 04:04 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:29
Trained pipeline.
Time taken: 0:06:52.549544.
Pipeline(data_model='actors', feature_learners=['FastProp'], feature_selectors=['XGBoostClassifier'], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['movies', 'movies_genres', 'roles'], predictors=['XGBoostClassifier'], preprocessors=['TextFieldSplitter', 'Mapping'], share_selected_features=0.1, tags=['fast_prop', 'container-i1xBdh'])
2.4 Model evaluation¶
pipe.score(container.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:08 FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 13:13:37 | train | target | 0.8417 | 0.9139 | 0.3217 |
1 | 2024-09-12 13:13:51 | test | target | 0.842 | 0.9138 | 0.323 |
2.5 Features¶
The most important feature looks as follows:
pipe.features.to_sql()[pipe.features.sort(by="importances")[0].name]
DROP TABLE IF EXISTS "FEATURE_1_164";
CREATE TABLE "FEATURE_1_164" AS
SELECT AVG( COALESCE( f_1_1_18."feature_1_1_18", 0.0 ) ) AS "feature_1_164",
t1.rowid AS rownum
FROM "ACTORS__STAGING_TABLE_1" t1
INNER JOIN "ROLES__STAGING_TABLE_3" t2
ON t1."id" = t2."actor_id"
LEFT JOIN "FEATURE_1_1_18" f_1_1_18
ON t2.rowid = f_1_1_18.rownum
GROUP BY t1.rowid;
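Beyond transpiling a single feature, we can also list the most important features programmatically. A minimal sketch, reusing the sort() call from above (printing the top five is an arbitrary choice):
# Print the names of the five most important features.
by_importance = pipe.features.sort(by="importances")
for i in range(5):
    print(by_importance[i].name)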
2.6 Productionization¶
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Here, we will demonstrate how the pipeline can be transpiled to Spark SQL and then executed on a Spark cluster.
pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("imdb_spark")
if RUN_SPARK:
    spark = SparkSession.builder.appName(
        "imdb"
    ).config(
        "spark.driver.maxResultSize", "10g"
    ).config(
        "spark.driver.memory", "10g"
    ).config(
        "spark.executor.memory", "20g"
    ).config(
        "spark.sql.execution.arrow.pyspark.enabled", "true"
    ).config(
        "spark.sql.session.timeZone", "UTC"
    ).enableHiveSupport().getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
if RUN_SPARK:
    population_spark = container.train.population.to_pyspark(spark, name="actors")
if RUN_SPARK:
    movies_genres_spark = container.movies_genres.to_pyspark(spark, name="movies_genres")
    roles_spark = container.roles.to_pyspark(spark, name="roles")
    movies_spark = container.movies.to_pyspark(spark, name="movies")
if RUN_SPARK:
    getml.spark.execute(spark, "imdb_spark")
if RUN_SPARK:
    spark.sql("SELECT * FROM `FEATURES` LIMIT 20").toPandas()
3. Conclusion¶
In this notebook, we have demonstrated how getML can be applied to text fields and shown that our approach outperforms state-of-the-art relational learning algorithms on the IMDb data set.
getml.engine.shutdown()
References¶
Motl, Jan, and Oliver Schulte. "The CTU Prague Relational Learning Repository." arXiv preprint arXiv:1511.03086 (2015).
Neville, Jennifer, and David Jensen. "Relational dependency networks." Journal of Machine Learning Research 8.Mar (2007): 653-692.
Neville, Jennifer, and David Jensen. "Collective classification with relational dependency networks." Workshop on Multi-Relational Data Mining (MRDM-2003). 2003.
Neville, Jennifer, et al. "Learning relational probability trees." Proceedings of the Ninth ACM SIGKDD international conference on Knowledge discovery and data mining. 2003.
Perovšek, Matic, et al. "Wordification: Propositionalization by unfolding relational data into bags of words." Expert Systems with Applications 42.17-18 (2015): 6442-6456.