MovieLens - Predicting a user's gender based on the movies they have watched¶
In this notebook, we will apply getML to a dataset that is often used for benchmarking in the relational learning literature: The MovieLens dataset.
Summary:
- Prediction type: Classification model
- Domain: Entertainment
- Prediction target: The gender of a user
- Population size: 6039
Background¶
The MovieLens dataset is often used in the relational learning literature has a benchmark for newly developed algorithms. Following the tradition, we benchmark getML's own algorithms on this dataset as well. The task is to predict a user's gender based on the movies they have watched.
It has been downloaded from the CTU Prague relational learning repository (Motl and Schulte, 2015) (Now residing at relational-data.org.).
Analysis¶
Let's get started with the analysis and set up your session:
%pip install -q "getml==1.5.0" "matplotlib==3.9.2" "ipywidgets==8.1.5"
import os
import pandas as pd
import matplotlib.pyplot as plt
import getml
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
%matplotlib inline
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
getml.engine.launch(allow_remote_ips=True, token="token")
getml.engine.set_project("MovieLens")
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912151421.log. Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Connected to project 'MovieLens'.
1. Loading data¶
1.1 Download from source¶
We begin by downloading the data from the source file:
conn = getml.database.connect_mysql(
host="db.relational-data.org",
dbname="imdb_MovieLens",
port=3306,
user="guest",
password="relational",
)
conn
Connection(dbname='imdb_MovieLens', dialect='mysql', host='db.relational-data.org', port=3306)
def load_if_needed(name):
"""
Loads the data from the relational learning
repository, if the data frame has not already
been loaded.
"""
if not getml.data.exists(name):
data_frame = getml.data.DataFrame.from_db(name=name, table_name=name, conn=conn)
data_frame.save()
else:
data_frame = getml.data.load_data_frame(name)
return data_frame
users = load_if_needed("users")
u2base = load_if_needed("u2base")
movies = load_if_needed("movies")
movies2directors = load_if_needed("movies2directors")
directors = load_if_needed("directors")
movies2actors = load_if_needed("movies2actors")
actors = load_if_needed("actors")
1.2 Prepare data for getML¶
getML requires that we define roles for each of the columns.
users["target"] = users.u_gender == "F"
users.set_role("userid", getml.data.roles.join_key)
users.set_role("age", getml.data.roles.numerical)
users.set_role("occupation", getml.data.roles.categorical)
users.set_role("target", getml.data.roles.target)
users.save()
name | userid | target | occupation | age | u_gender |
---|---|---|---|---|---|
role | join_key | target | categorical | numerical | unused_string |
0 | 1 | 1 | 2 | 1 | F |
1 | 51 | 1 | 2 | 1 | F |
2 | 75 | 1 | 2 | 1 | F |
3 | 86 | 1 | 2 | 1 | F |
4 | 99 | 1 | 2 | 1 | F |
... | ... | ... | ... | ... | |
6034 | 5658 | 0 | 5 | 56 | M |
6035 | 5669 | 0 | 5 | 56 | M |
6036 | 5703 | 0 | 5 | 56 | M |
6037 | 5948 | 0 | 5 | 56 | M |
6038 | 5980 | 0 | 5 | 56 | M |
6039 rows x 5 columns
memory usage: 0.21 MB
name: users
type: getml.DataFrame
u2base.set_role(["userid", "movieid"], getml.data.roles.join_key)
u2base.set_role("rating", getml.data.roles.numerical)
u2base.save()
name | userid | movieid | rating |
---|---|---|---|
role | join_key | join_key | numerical |
0 | 2 | 1964242 | 1 |
1 | 2 | 2219779 | 1 |
2 | 3 | 1856939 | 1 |
3 | 4 | 2273044 | 1 |
4 | 5 | 1681655 | 1 |
... | ... | ... | |
996154 | 6040 | 2560616 | 5 |
996155 | 6040 | 2564194 | 5 |
996156 | 6040 | 2581228 | 5 |
996157 | 6040 | 2581428 | 5 |
996158 | 6040 | 2593112 | 5 |
996159 rows x 3 columns
memory usage: 15.94 MB
name: u2base
type: getml.DataFrame
movies.set_role("movieid", getml.data.roles.join_key)
movies.set_role(["year", "runningtime"], getml.data.roles.numerical)
movies.set_role(["isEnglish", "country"], getml.data.roles.categorical)
movies.save()
name | movieid | isEnglish | country | year | runningtime |
---|---|---|---|---|---|
role | join_key | categorical | categorical | numerical | numerical |
0 | 1672052 | T | other | 3 | 2 |
1 | 1672111 | T | other | 4 | 2 |
2 | 1672580 | T | USA | 4 | 3 |
3 | 1672716 | T | USA | 4 | 2 |
4 | 1672946 | T | USA | 4 | 0 |
... | ... | ... | ... | ... | |
3827 | 2591814 | T | other | 4 | 2 |
3828 | 2592334 | T | USA | 4 | 2 |
3829 | 2592963 | F | France | 2 | 2 |
3830 | 2593112 | T | USA | 4 | 1 |
3831 | 2593313 | F | other | 4 | 3 |
3832 rows x 5 columns
memory usage: 0.11 MB
name: movies
type: getml.DataFrame
movies2directors.set_role(["movieid", "directorid"], getml.data.roles.join_key)
movies2directors.set_role("genre", getml.data.roles.categorical)
movies2directors.save()
name | movieid | directorid | genre |
---|---|---|---|
role | join_key | join_key | categorical |
0 | 1672111 | 54934 | Action |
1 | 1672946 | 188940 | Action |
2 | 1679461 | 179783 | Action |
3 | 1691387 | 291700 | Action |
4 | 1693305 | 14663 | Action |
... | ... | ... | |
4136 | 2570825 | 265215 | Other |
4137 | 2572478 | 149311 | Other |
4138 | 2577062 | 304827 | Other |
4139 | 2590181 | 270707 | Other |
4140 | 2591814 | 57348 | Other |
4141 rows x 3 columns
memory usage: 0.05 MB
name: movies2directors
type: getml.DataFrame
directors.set_role("directorid", getml.data.roles.join_key)
directors.set_role(["d_quality", "avg_revenue"], getml.data.roles.numerical)
directors.save()
name | directorid | d_quality | avg_revenue |
---|---|---|---|
role | join_key | numerical | numerical |
0 | 67 | 4 | 1 |
1 | 92 | 2 | 3 |
2 | 284 | 4 | 0 |
3 | 708 | 4 | 1 |
4 | 746 | 4 | 4 |
... | ... | ... | |
2196 | 305962 | 4 | 4 |
2197 | 305978 | 4 | 2 |
2198 | 306168 | 3 | 2 |
2199 | 306343 | 4 | 1 |
2200 | 306351 | 4 | 1 |
2201 rows x 3 columns
memory usage: 0.04 MB
name: directors
type: getml.DataFrame
movies2actors.set_role(["movieid", "actorid"], getml.data.roles.join_key)
movies2actors.set_role("cast_num", getml.data.roles.numerical)
movies2actors.save()
name | movieid | actorid | cast_num |
---|---|---|---|
role | join_key | join_key | numerical |
0 | 1672580 | 981535 | 0 |
1 | 1672946 | 1094968 | 0 |
2 | 1673647 | 149985 | 0 |
3 | 1673647 | 261595 | 0 |
4 | 1673647 | 781357 | 0 |
... | ... | ... | |
138344 | 2593313 | 947005 | 3 |
138345 | 2593313 | 1090590 | 3 |
138346 | 2593313 | 1347419 | 3 |
138347 | 2593313 | 2099917 | 3 |
138348 | 2593313 | 2633550 | 3 |
138349 rows x 3 columns
memory usage: 2.21 MB
name: movies2actors
type: getml.DataFrame
We need to separate our data set into a training, testing and validation set:
actors.set_role("actorid", getml.data.roles.join_key)
actors.set_role("a_quality", getml.data.roles.numerical)
actors.set_role("a_gender", getml.data.roles.categorical)
actors.save()
name | actorid | a_gender | a_quality |
---|---|---|---|
role | join_key | categorical | numerical |
0 | 4 | M | 4 |
1 | 16 | M | 0 |
2 | 28 | M | 4 |
3 | 566 | M | 4 |
4 | 580 | M | 4 |
... | ... | ... | |
98685 | 2749162 | F | 3 |
98686 | 2749168 | F | 3 |
98687 | 2749204 | F | 3 |
98688 | 2749377 | F | 4 |
98689 | 2749386 | F | 4 |
98690 rows x 3 columns
memory usage: 1.58 MB
name: actors
type: getml.DataFrame
split = getml.data.split.random(train=0.75, test=0.25)
split
0 | train |
---|---|
1 | train |
2 | train |
3 | test |
4 | test |
... |
infinite number of rows
type: StringColumnView
container = getml.data.Container(population=users, split=split)
container.add(
u2base=u2base,
movies=movies,
movies2directors=movies2directors,
directors=directors,
movies2actors=movies2actors,
actors=actors,
)
container
subset | name | rows | type | |
---|---|---|---|---|
0 | test | users | 1511 | View |
1 | train | users | 4528 | View |
name | rows | type | |
---|---|---|---|
0 | u2base | 996159 | DataFrame |
1 | movies | 3832 | DataFrame |
2 | movies2directors | 4141 | DataFrame |
3 | directors | 2201 | DataFrame |
4 | movies2actors | 138349 | DataFrame |
5 | actors | 98690 | DataFrame |
2. Predictive modeling¶
We loaded the data and defined the roles and units. Next, we create a getML pipeline for relational learning.
2.1 Define relational model¶
To get started with relational learning, we need to specify the data model.
dm = getml.data.DataModel(users.to_placeholder())
dm.add(
getml.data.to_placeholder(
u2base=u2base,
movies=movies,
movies2directors=movies2directors,
directors=directors,
movies2actors=movies2actors,
actors=actors,
)
)
dm.population.join(dm.u2base, on="userid")
dm.u2base.join(
dm.movies, on="movieid", relationship=getml.data.relationship.many_to_one
)
dm.movies.join(
dm.movies2directors,
on="movieid",
relationship=getml.data.relationship.propositionalization,
)
dm.movies2directors.join(
dm.directors, on="directorid", relationship=getml.data.relationship.many_to_one
)
dm.movies.join(
dm.movies2actors,
on="movieid",
relationship=getml.data.relationship.propositionalization,
)
dm.movies2actors.join(
dm.actors, on="actorid", relationship=getml.data.relationship.many_to_one
)
dm
data frames | staging table | |
---|---|---|
0 | users | USERS__STAGING_TABLE_1 |
1 | movies2actors, actors | MOVIES2ACTORS__STAGING_TABLE_2 |
2 | movies2directors, directors | MOVIES2DIRECTORS__STAGING_TABLE_3 |
3 | u2base, movies | U2BASE__STAGING_TABLE_4 |
2.2 getML pipeline¶
Set-up the feature learner & predictor
We will set up two pipelines. One of them uses FastProp
, the other one uses Relboost
. Note that we have marked some of the joins in the data model with the propositionalization
tag. This means that FastProp
will be used for these relationships, even for the second pipeline. This can significantly speed up the training process.
mapping = getml.preprocessors.Mapping()
fast_prop = getml.feature_learning.FastProp(
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
num_threads=1,
)
relboost = getml.feature_learning.Relboost(
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
num_subfeatures=50,
num_threads=1,
)
predictor = getml.predictors.XGBoostClassifier(
max_depth=5,
n_jobs=1,
)
Build the pipeline
pipe1 = getml.pipeline.Pipeline(
tags=["fast_prop"],
data_model=dm,
preprocessors=[mapping],
feature_learners=[fast_prop],
predictors=[predictor],
)
pipe1
Pipeline(data_model='users', feature_learners=['FastProp'], feature_selectors=[], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['actors', 'directors', 'movies', 'movies2actors', 'movies2directors', 'u2base'], predictors=['XGBoostClassifier'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['fast_prop'])
pipe2 = getml.pipeline.Pipeline(
tags=["relboost"],
data_model=dm,
preprocessors=[mapping],
feature_learners=[relboost],
predictors=[predictor],
)
pipe2
Pipeline(data_model='users', feature_learners=['Relboost'], feature_selectors=[], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['actors', 'directors', 'movies', 'movies2actors', 'movies2directors', 'u2base'], predictors=['XGBoostClassifier'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['relboost'])
2.3 Model training¶
pipe1.check(container.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:12 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 2 issues labeled INFO and 0 issues labeled WARNING.
type | label | message | |
---|---|---|---|
0 | INFO | FOREIGN KEYS NOT FOUND | When joining U2BASE__STAGING_TABLE_4 and MOVIES2DIRECTORS__STAGING_TABLE_3 over 'movieid' and 'movieid', there are no corresponding entries for 0.159513% of entries in 'movieid' in 'U2BASE__STAGING_TABLE_4'. You might want to double-check your join keys. |
1 | INFO | FOREIGN KEYS NOT FOUND | When joining U2BASE__STAGING_TABLE_4 and MOVIES2ACTORS__STAGING_TABLE_2 over 'movieid' and 'movieid', there are no corresponding entries for 0.336492% of entries in 'movieid' in 'U2BASE__STAGING_TABLE_4'. You might want to double-check your join keys. |
pipe1.fit(container.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 2 issues labeled INFO and 0 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Trying 941 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:34 FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:18 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:03
Trained pipeline.
Time taken: 0:00:59.177060.
Pipeline(data_model='users', feature_learners=['FastProp'], feature_selectors=[], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['actors', 'directors', 'movies', 'movies2actors', 'movies2directors', 'u2base'], predictors=['XGBoostClassifier'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['fast_prop', 'container-AA9NXV'])
pipe2.check(container.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 2 issues labeled INFO and 0 issues labeled WARNING.
type | label | message | |
---|---|---|---|
0 | INFO | FOREIGN KEYS NOT FOUND | When joining U2BASE__STAGING_TABLE_4 and MOVIES2DIRECTORS__STAGING_TABLE_3 over 'movieid' and 'movieid', there are no corresponding entries for 0.159513% of entries in 'movieid' in 'U2BASE__STAGING_TABLE_4'. You might want to double-check your join keys. |
1 | INFO | FOREIGN KEYS NOT FOUND | When joining U2BASE__STAGING_TABLE_4 and MOVIES2ACTORS__STAGING_TABLE_2 over 'movieid' and 'movieid', there are no corresponding entries for 0.336492% of entries in 'movieid' in 'U2BASE__STAGING_TABLE_4'. You might want to double-check your join keys. |
pipe2.fit(container.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 2 issues labeled INFO and 0 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:04 Relboost: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 11:48 FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:05 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 06:18 XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:03
Trained pipeline.
Time taken: 0:18:19.843564.
Pipeline(data_model='users', feature_learners=['Relboost'], feature_selectors=[], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['actors', 'directors', 'movies', 'movies2actors', 'movies2directors', 'u2base'], predictors=['XGBoostClassifier'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['relboost', 'container-AA9NXV'])
2.4 Model evaluation¶
fastprop_score = pipe1.score(container.test)
fastprop_score
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:19 FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 15:15:36 | train | target | 0.9139 | 0.9686 | 0.2788 |
1 | 2024-09-12 15:34:18 | test | target | 0.775 | 0.7859 | 0.4779 |
relboost_score = pipe2.score(container.test)
relboost_score
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:04 Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 02:01
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 15:33:57 | train | target | 0.9755 | 0.9962 | 0.1488 |
1 | 2024-09-12 15:36:24 | test | target | 0.8048 | 0.8377 | 0.4429 |
2.6 Studying features¶
Column importances
Because getML uses relational learning, we can apply the principles we used to calculate the feature importances to individual columns as well.
As we can see, most of the predictive accuracy is drawn from the roles played by the actors. This suggests that the text fields contained in this relational database have a higher impact on predictive accuracy than for most other data sets.
names, importances = pipe1.columns.importances()
plt.subplots(figsize=(20, 10))
plt.bar(names, importances)
plt.title("Columns importances")
plt.xlabel("Columns")
plt.ylabel("Importances")
plt.xticks(rotation="vertical")
plt.show()
names, importances = pipe2.columns.importances()
plt.subplots(figsize=(20, 10))
plt.bar(names, importances)
plt.title("Columns importances")
plt.xlabel("Columns")
plt.ylabel("Importances")
plt.xticks(rotation="vertical")
plt.show()
2.7 Features¶
The most important features look as follows:
pipe1.features.to_sql()[pipe1.features.sort(by="importances")[0].name]
DROP TABLE IF EXISTS "FEATURE_1_155";
CREATE TABLE "FEATURE_1_155" AS
SELECT MEDIAN( COALESCE( f_1_1_76."feature_1_1_76", 0.0 ) ) AS "feature_1_155",
t1.rowid AS rownum
FROM "USERS__STAGING_TABLE_1" t1
INNER JOIN "U2BASE__STAGING_TABLE_4" t2
ON t1."userid" = t2."userid"
LEFT JOIN "FEATURE_1_1_76" f_1_1_76
ON t2.rowid = f_1_1_76.rownum
GROUP BY t1.rowid;
pipe2.features.to_sql()[pipe2.features.sort(by="importances")[0].name]
DROP TABLE IF EXISTS "FEATURE_1_1";
CREATE TABLE "FEATURE_1_1" AS
SELECT AVG(
CASE
WHEN ( p_1_1."feature_1_1_76" > 0.241788 ) AND ( p_1_1."feature_1_1_91" > 0.007531 ) AND ( t2."t3__year" > 3.000000 ) THEN -14.47282772262984
WHEN ( p_1_1."feature_1_1_76" > 0.241788 ) AND ( p_1_1."feature_1_1_91" > 0.007531 ) AND ( t2."t3__year" <= 3.000000 OR t2."t3__year" IS NULL ) THEN 18.80608436507123
WHEN ( p_1_1."feature_1_1_76" > 0.241788 ) AND ( p_1_1."feature_1_1_91" <= 0.007531 OR p_1_1."feature_1_1_91" IS NULL ) AND ( t2."t3__year__mapping_1_target_1_avg" > 0.250270 ) THEN 17.15219491452244
WHEN ( p_1_1."feature_1_1_76" > 0.241788 ) AND ( p_1_1."feature_1_1_91" <= 0.007531 OR p_1_1."feature_1_1_91" IS NULL ) AND ( t2."t3__year__mapping_1_target_1_avg" <= 0.250270 OR t2."t3__year__mapping_1_target_1_avg" IS NULL ) THEN 7.625903043390911
WHEN ( p_1_1."feature_1_1_76" <= 0.241788 OR p_1_1."feature_1_1_76" IS NULL ) AND ( p_1_1."feature_1_1_40" > 0.251831 ) AND ( p_1_1."feature_1_1_58" > 2.309180 ) THEN -9.39027739930436
WHEN ( p_1_1."feature_1_1_76" <= 0.241788 OR p_1_1."feature_1_1_76" IS NULL ) AND ( p_1_1."feature_1_1_40" > 0.251831 ) AND ( p_1_1."feature_1_1_58" <= 2.309180 OR p_1_1."feature_1_1_58" IS NULL ) THEN -3.045979108396568
WHEN ( p_1_1."feature_1_1_76" <= 0.241788 OR p_1_1."feature_1_1_76" IS NULL ) AND ( p_1_1."feature_1_1_40" <= 0.251831 OR p_1_1."feature_1_1_40" IS NULL ) AND ( p_1_1."feature_1_1_73" > 0.009437 ) THEN 0.007265281402375446
WHEN ( p_1_1."feature_1_1_76" <= 0.241788 OR p_1_1."feature_1_1_76" IS NULL ) AND ( p_1_1."feature_1_1_40" <= 0.251831 OR p_1_1."feature_1_1_40" IS NULL ) AND ( p_1_1."feature_1_1_73" <= 0.009437 OR p_1_1."feature_1_1_73" IS NULL ) THEN -3.672817201353017
ELSE NULL
END
) AS "feature_1_1",
t1.rowid AS rownum
FROM "USERS__STAGING_TABLE_1" t1
INNER JOIN "U2BASE__STAGING_TABLE_4" t2
ON t1."userid" = t2."userid"
LEFT JOIN "FEATURES_1_1_PROPOSITIONALIZATION" p_1_1
ON t2.rowid = p_1_1."rownum"
GROUP BY t1.rowid;
2.8 Productionization¶
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3
and spark
modules.
# Creates a folder named movie_lens_pipeline containing
# the SQL code.
pipe2.features.to_sql().save("movie_lens_pipeline")
pipe2.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("movie_lens_spark")
2.9 Benchmarks¶
State-of-the-art approaches on this dataset perform as follows:
Approach | Study | Accuracy | AUC |
---|---|---|---|
Probabalistic Relational Model | Ghanem (2009) | -- | 69.2% |
Multi-Relational Bayesian Network | Schulte and Khosravi (2012) | 69% | -- |
Multi-Relational Bayesian Network | Schulte et al (2013) | 66% | -- |
By contrast, getML's algorithms, as used in this notebook, perform as follows:
scores = [fastprop_score, relboost_score]
pd.DataFrame(
data={
"Approach": ["FastProp", "Relboost"],
"Accuracy": [f"{score.accuracy:.1%}" for score in scores],
"AUC": [f"{score.auc:,.1%}" for score in scores],
}
)
Approach | Accuracy | AUC | |
---|---|---|---|
0 | FastProp | 77.5% | 78.6% |
1 | Relboost | 80.5% | 83.8% |
getml.engine.shutdown()
3. Conclusion¶
In this notebook we have demonstrated how getML can be applied to the MovieLens dataset. We have demonstrated the our approach outperforms state-of-the-art relational learning algorithms.
References¶
Motl, Jan, and Oliver Schulte. "The CTU prague relational learning repository." arXiv preprint arXiv:1511.03086 (2015).
Ghanem, Amal S. "Probabilistic models for mining imbalanced relational data." Doctoral dissertation, Curtin University (2009).
Schulte, Oliver, and Hassan Khosravi. "Learning graphical models for relational data via lattice search." Machine Learning 88.3 (2012): 331-368.
Schulte, Oliver, et al. "A hierarchy of independence assumptions for multi-relational Bayes net classifiers." 2013 IEEE Symposium on Computational Intelligence and Data Mining (CIDM). IEEE, 2013.