CORA - Categorizing academic publications¶
In this notebook, we compare getML against existing approaches from the relational learning literature on the CORA data set, which is often used for benchmarking. We demonstrate that getML outperforms the state of the art in the relational learning literature on this data set. Beyond the benchmarking aspect, this notebook showcases getML's excellent capabilities in dealing with categorical data.
Summary:
- Prediction type: Classification model
- Domain: Academia
- Prediction target: The category of a paper
- Population size: 2708
Background¶
CORA is a well-known benchmark dataset in the academic literature on relational learning. The dataset contains 2708 scientific publications on machine learning, divided into 7 categories. The challenge is to predict the category of a paper based on the papers it cites, the papers that cite it, and the keywords it contains.
The dataset was downloaded from the CTU Prague Relational Learning Repository (Motl and Schulte, 2015), which now resides at relational-data.org.
Analysis¶
Let's get started with the analysis and set up our session:
%pip install -q "getml==1.5.0" "matplotlib==3.9.2" "ipywidgets==8.1.5"
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import getml
%matplotlib inline
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
getml.engine.launch(allow_remote_ips=True, token='token')
getml.engine.set_project('cora')
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912134108.log. Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Connected to project 'cora'.
1. Loading data¶
1.1 Download from source¶
We begin by downloading the data from the source file:
conn = getml.database.connect_mysql(
host="db.relational-data.org",
dbname="CORA",
port=3306,
user="guest",
password="relational"
)
conn
Connection(dbname='CORA', dialect='mysql', host='db.relational-data.org', port=3306)
def load_if_needed(name):
"""
Loads the data from the relational learning
repository, if the data frame has not already
been loaded.
"""
if not getml.data.exists(name):
data_frame = getml.data.DataFrame.from_db(
name=name,
table_name=name,
conn=conn
)
data_frame.save()
else:
data_frame = getml.data.load_data_frame(name)
return data_frame
paper = load_if_needed("paper")
cites = load_if_needed("cites")
content = load_if_needed("content")
paper
name | paper_id | class_label |
---|---|---|
role | unused_float | unused_string |
0 | 35 | Genetic_Algorithms |
1 | 40 | Genetic_Algorithms |
2 | 114 | Reinforcement_Learning |
3 | 117 | Reinforcement_Learning |
4 | 128 | Reinforcement_Learning |
... | ... | |
2703 | 1154500 | Case_Based |
2704 | 1154520 | Neural_Networks |
2705 | 1154524 | Rule_Learning |
2706 | 1154525 | Rule_Learning |
2707 | 1155073 | Rule_Learning |
2708 rows x 2 columns
memory usage: 0.09 MB
name: paper
type: getml.DataFrame
cites
name | cited_paper_id | citing_paper_id |
---|---|---|
role | unused_float | unused_float |
0 | 35 | 887 |
1 | 35 | 1033 |
2 | 35 | 1688 |
3 | 35 | 1956 |
4 | 35 | 8865 |
... | ... | |
5424 | 853116 | 19621 |
5425 | 853116 | 853155 |
5426 | 853118 | 1140289 |
5427 | 853155 | 853118 |
5428 | 954315 | 1155073 |
5429 rows x 2 columns
memory usage: 0.09 MB
name: cites
type: getml.DataFrame
content
name | paper_id | word_cited_id |
---|---|---|
role | unused_float | unused_string |
0 | 35 | word100 |
1 | 35 | word1152 |
2 | 35 | word1175 |
3 | 35 | word1228 |
4 | 35 | word1248 |
... | ... | |
49211 | 1155073 | word75 |
49212 | 1155073 | word759 |
49213 | 1155073 | word789 |
49214 | 1155073 | word815 |
49215 | 1155073 | word979 |
49216 rows x 2 columns
memory usage: 1.20 MB
name: content
type: getml.DataFrame
1.2 Prepare data for getML¶
getML requires that we define roles for each of the columns.
paper.set_role("paper_id", getml.data.roles.join_key)
paper.set_role("class_label", getml.data.roles.categorical)
paper
name | paper_id | class_label |
---|---|---|
role | join_key | categorical |
0 | 35 | Genetic_Algorithms |
1 | 40 | Genetic_Algorithms |
2 | 114 | Reinforcement_Learning |
3 | 117 | Reinforcement_Learning |
4 | 128 | Reinforcement_Learning |
... | ... | |
2703 | 1154500 | Case_Based |
2704 | 1154520 | Neural_Networks |
2705 | 1154524 | Rule_Learning |
2706 | 1154525 | Rule_Learning |
2707 | 1155073 | Rule_Learning |
2708 rows x 2 columns
memory usage: 0.02 MB
name: paper
type: getml.DataFrame
cites.set_role(["cited_paper_id", "citing_paper_id"], getml.data.roles.join_key)
cites
name | cited_paper_id | citing_paper_id |
---|---|---|
role | join_key | join_key |
0 | 35 | 887 |
1 | 35 | 1033 |
2 | 35 | 1688 |
3 | 35 | 1956 |
4 | 35 | 8865 |
... | ... | |
5424 | 853116 | 19621 |
5425 | 853116 | 853155 |
5426 | 853118 | 1140289 |
5427 | 853155 | 853118 |
5428 | 954315 | 1155073 |
5429 rows x 2 columns
memory usage: 0.04 MB
name: cites
type: getml.DataFrame
We also assign roles to the columns of the content table:
content.set_role("paper_id", getml.data.roles.join_key)
content.set_role("word_cited_id", getml.data.roles.categorical)
content
name | paper_id | word_cited_id |
---|---|---|
role | join_key | categorical |
0 | 35 | word100 |
1 | 35 | word1152 |
2 | 35 | word1175 |
3 | 35 | word1228 |
4 | 35 | word1248 |
... | ... | |
49211 | 1155073 | word75 |
49212 | 1155073 | word759 |
49213 | 1155073 | word789 |
49214 | 1155073 | word815 |
49215 | 1155073 | word979 |
49216 rows x 2 columns
memory usage: 0.39 MB
name: content
type: getml.DataFrame
The goal is to predict seven different labels. We generate a target column for each of those labels. We also have to separate the data set into a training and testing set.
data_full = getml.data.make_target_columns(paper, "class_label")
data_full
name | paper_id | class_label=Case_Based | class_label=Genetic_Algorithms | class_label=Neural_Networks | class_label=Probabilistic_Methods | class_label=Reinforcement_Learning | class_label=Rule_Learning | class_label=Theory |
---|---|---|---|---|---|---|---|---|
role | join_key | target | target | target | target | target | target | target |
0 | 35 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 40 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | 114 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 117 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 128 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2708 rows
type: getml.data.View
split = getml.data.split.random(train=0.7, test=0.3, validation=0.0)
split
0 | train |
---|---|
1 | test |
2 | train |
3 | test |
4 | test |
... |
infinite number of rows
type: StringColumnView
container = getml.data.Container(population=data_full, split=split)
container.add(cites=cites, content=content, paper=paper)
container.freeze()
container
subset | name | rows | type | |
---|---|---|---|---|
0 | test | paper | 821 | View |
1 | train | paper | 1887 | View |
name | rows | type | |
---|---|---|---|
0 | cites | 5429 | DataFrame |
1 | content | 49216 | DataFrame |
2 | paper | 2708 | DataFrame |
2. Predictive modeling¶
We have loaded the data and defined the roles. Next, we create getML pipelines for relational learning.
2.1 Define relational model¶
To get started with relational learning, we need to specify the data model. Even though the data set itself is quite simple with only three tables and six columns in total, the resulting data model is actually quite complicated.
That is because the class label can be predicted using three different pieces of information:
- The keywords used by the paper
- The keywords used by the papers it cites and by the papers that cite it
- The class labels of the papers it cites and of the papers that cite it
The main challenge here is that cites is used twice: once to connect the cited papers and once to connect the citing papers. To resolve this, we need two placeholders on cites.
dm = getml.data.DataModel(paper.to_placeholder("population"))
# We need two different placeholders for cites.
dm.add(getml.data.to_placeholder(cites=[cites]*2, content=content, paper=paper))
dm.population.join(
dm.cites[0],
on=('paper_id', 'cited_paper_id')
)
dm.cites[0].join(
dm.content,
on=('citing_paper_id', 'paper_id')
)
dm.cites[0].join(
dm.paper,
on=('citing_paper_id', 'paper_id'),
relationship=getml.data.relationship.many_to_one
)
dm.population.join(
dm.cites[1],
on=('paper_id', 'citing_paper_id')
)
dm.cites[1].join(
dm.content,
on=('cited_paper_id', 'paper_id')
)
dm.cites[1].join(
dm.paper,
on=('cited_paper_id', 'paper_id'),
relationship=getml.data.relationship.many_to_one
)
dm.population.join(
dm.content,
on='paper_id'
)
dm
data frames | staging table | |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | cites, paper | CITES__STAGING_TABLE_2 |
2 | cites, paper | CITES__STAGING_TABLE_3 |
3 | content | CONTENT__STAGING_TABLE_4 |
2.2 getML pipeline¶
Set up the feature learners & predictor
We use the FastProp and Relboost algorithms for this problem. Because of the large number of keywords, we regularize the Relboost model a bit by requiring a minimum support for the keywords (min_num_samples).
mapping = getml.preprocessors.Mapping()
fast_prop = getml.feature_learning.FastProp(
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
num_threads=1
)
relboost = getml.feature_learning.Relboost(
num_features=10,
num_subfeatures=10,
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
seed=4367,
num_threads=1,
min_num_samples=30
)
predictor = getml.predictors.XGBoostClassifier()
Build the pipeline
pipe1 = getml.pipeline.Pipeline(
tags=['fast_prop'],
data_model=dm,
preprocessors=[mapping],
feature_learners=[fast_prop],
predictors=[predictor]
)
pipe1
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=[], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['cites', 'content', 'paper'], predictors=['XGBoostClassifier'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['fast_prop'])
pipe2 = getml.pipeline.Pipeline(
tags=['relboost'],
data_model=dm,
feature_learners=[relboost],
predictors=[predictor]
)
pipe2
Pipeline(data_model='population', feature_learners=['Relboost'], feature_selectors=[], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['cites', 'content', 'paper'], predictors=['XGBoostClassifier'], preprocessors=[], share_selected_features=0.5, tags=['relboost'])
2.3 Model training¶
pipe1.check(container.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 3 issues labeled INFO and 0 issues labeled WARNING.
type | label | message | |
---|---|---|---|
0 | INFO | MIGHT TAKE LONG | The number of unique entries in column 'word_cited_id' in CONTENT__STAGING_TABLE_4 is 1432. This might take a long time to fit. You should consider setting its role to unused_string or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only'). |
1 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_2 over 'paper_id' and 'cited_paper_id', there are no corresponding entries for 41.759406% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
2 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_3 over 'paper_id' and 'citing_paper_id', there are no corresponding entries for 17.700053% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
pipe1.fit(container.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 3 issues labeled INFO and 0 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
FastProp: Trying 3780 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:06
FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (×2)
FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01 (×7, one per target)
Trained pipeline.
Time taken: 0:00:16.670446.
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=[], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['cites', 'content', 'paper'], predictors=['XGBoostClassifier'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['fast_prop', 'container-BRPpU2'])
pipe2.check(container.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 3 issues labeled INFO and 0 issues labeled WARNING.
type | label | message | |
---|---|---|---|
0 | INFO | MIGHT TAKE LONG | The number of unique entries in column 'word_cited_id' in CONTENT__STAGING_TABLE_4 is 1432. This might take a long time to fit. You should consider setting its role to unused_string or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only'). |
1 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_2 over 'paper_id' and 'cited_paper_id', there are no corresponding entries for 41.759406% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
2 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_3 over 'paper_id' and 'citing_paper_id', there are no corresponding entries for 17.700053% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
The training output may seem a bit intimidating. That is because the Relboost algorithm needs to train a separate model for each class label, owing to the nature of the generated features.
pipe2.fit(container.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 3 issues labeled INFO and 0 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (repeated per target)
Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (repeated per target)
Relboost: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (repeated per target)
Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (repeated per target)
XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (×7, one per target)
Trained pipeline.
Time taken: 0:00:39.645936.
Pipeline(data_model='population', feature_learners=['Relboost'], feature_selectors=[], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['cites', 'content', 'paper'], predictors=['XGBoostClassifier'], preprocessors=[], share_selected_features=0.5, tags=['relboost', 'container-BRPpU2'])
2.4 Model evaluation¶
pipe1.score(container.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (×2)
FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 13:05:07 | train | class_label=Case_Based | 0.9979 | 0.9999 | 0.02323 |
1 | 2024-09-12 13:05:07 | train | class_label=Genetic_Algorithms | 1.0 | 1. | 0.004862 |
2 | 2024-09-12 13:05:07 | train | class_label=Neural_Networks | 0.9846 | 0.9983 | 0.065852 |
3 | 2024-09-12 13:05:07 | train | class_label=Probabilistic_Methods | 0.9958 | 0.9998 | 0.027649 |
4 | 2024-09-12 13:05:07 | train | class_label=Reinforcement_Learning | 0.9995 | 1. | 0.008878 |
... | ... | ... | ... | ... | ... | |
9 | 2024-09-12 13:05:48 | test | class_label=Neural_Networks | 0.9513 | 0.9787 | 0.163577 |
10 | 2024-09-12 13:05:48 | test | class_label=Probabilistic_Methods | 0.9744 | 0.9873 | 0.082802 |
11 | 2024-09-12 13:05:48 | test | class_label=Reinforcement_Learning | 0.9805 | 0.9736 | 0.073926 |
12 | 2024-09-12 13:05:48 | test | class_label=Rule_Learning | 0.9842 | 0.9937 | 0.052303 |
13 | 2024-09-12 13:05:48 | test | class_label=Theory | 0.9562 | 0.977 | 0.128617 |
pipe2.score(container.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (repeated per target)
Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (repeated per target)
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 13:05:47 | train | class_label=Case_Based | 1.0 | 1. | 0.009385 |
1 | 2024-09-12 13:05:47 | train | class_label=Genetic_Algorithms | 1.0 | 1. | 0.004222 |
2 | 2024-09-12 13:05:47 | train | class_label=Neural_Networks | 0.991 | 0.9996 | 0.03766 |
3 | 2024-09-12 13:05:47 | train | class_label=Probabilistic_Methods | 0.9989 | 1. | 0.013846 |
4 | 2024-09-12 13:05:47 | train | class_label=Reinforcement_Learning | 1.0 | 1. | 0.004409 |
... | ... | ... | ... | ... | ... | |
9 | 2024-09-12 13:05:51 | test | class_label=Neural_Networks | 0.9391 | 0.9757 | 0.193588 |
10 | 2024-09-12 13:05:51 | test | class_label=Probabilistic_Methods | 0.9769 | 0.9892 | 0.072601 |
11 | 2024-09-12 13:05:51 | test | class_label=Reinforcement_Learning | 0.9769 | 0.9773 | 0.094757 |
12 | 2024-09-12 13:05:51 | test | class_label=Rule_Learning | 0.9842 | 0.9912 | 0.060603 |
13 | 2024-09-12 13:05:51 | test | class_label=Theory | 0.9488 | 0.9745 | 0.142024 |
To make things a bit easier, we just look at our test results.
pipe1.scores.filter(lambda score: score.set_used == "test")
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 13:05:48 | test | class_label=Case_Based | 0.9708 | 0.9861 | 0.08689 |
1 | 2024-09-12 13:05:48 | test | class_label=Genetic_Algorithms | 0.9842 | 0.9981 | 0.04915 |
2 | 2024-09-12 13:05:48 | test | class_label=Neural_Networks | 0.9513 | 0.9787 | 0.16358 |
3 | 2024-09-12 13:05:48 | test | class_label=Probabilistic_Methods | 0.9744 | 0.9873 | 0.0828 |
4 | 2024-09-12 13:05:48 | test | class_label=Reinforcement_Learning | 0.9805 | 0.9736 | 0.07393 |
5 | 2024-09-12 13:05:48 | test | class_label=Rule_Learning | 0.9842 | 0.9937 | 0.0523 |
6 | 2024-09-12 13:05:48 | test | class_label=Theory | 0.9562 | 0.977 | 0.12862 |
pipe2.scores.filter(lambda score: score.set_used == "test")
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 13:05:51 | test | class_label=Case_Based | 0.9744 | 0.9895 | 0.08319 |
1 | 2024-09-12 13:05:51 | test | class_label=Genetic_Algorithms | 0.9903 | 0.9988 | 0.03866 |
2 | 2024-09-12 13:05:51 | test | class_label=Neural_Networks | 0.9391 | 0.9757 | 0.19359 |
3 | 2024-09-12 13:05:51 | test | class_label=Probabilistic_Methods | 0.9769 | 0.9892 | 0.0726 |
4 | 2024-09-12 13:05:51 | test | class_label=Reinforcement_Learning | 0.9769 | 0.9773 | 0.09476 |
5 | 2024-09-12 13:05:51 | test | class_label=Rule_Learning | 0.9842 | 0.9912 | 0.0606 |
6 | 2024-09-12 13:05:51 | test | class_label=Theory | 0.9488 | 0.9745 | 0.14202 |
We take the average of the AUC values, which is also the value that appears in the getML monitor (http://localhost:1709/#/listpipelines/cora).
fastprop_auc = np.mean(pipe1.auc)
relboost_auc = np.mean(pipe2.auc)
print(fastprop_auc)
print(relboost_auc)
0.9849300280909873
0.9851644641057435
The accuracy for multiple targets can be calculated using one of two methods. The first method is to simply take the average of the per-target accuracy values, which is also the value that appears in the getML monitor (http://localhost:1709/#/listpipelines/cora).
print(np.mean(pipe1.accuracy))
print(np.mean(pipe2.accuracy))
0.9716373760222724
0.9700713415695145
However, the benchmarking papers actually use a different approach:
- They first generate probabilities for each of the labels:
probabilities1 = pipe1.predict(container.test)
probabilities2 = pipe2.predict(container.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (×2)
FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (repeated per target)
Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (repeated per target)
- They then find the class label with the highest probability:
class_label = paper.class_label.unique()
ix_max = np.argmax(probabilities1, axis=1)
predicted_labels1 = np.asarray([class_label[ix] for ix in ix_max])
ix_max = np.argmax(probabilities2, axis=1)
predicted_labels2 = np.asarray([class_label[ix] for ix in ix_max])
- They then compare that value to the actual class label:
actual_labels = paper[split == "test"].class_label.to_numpy()
fastprop_accuracy = (actual_labels == predicted_labels1).sum() / len(actual_labels)
relboost_accuracy = (actual_labels == predicted_labels2).sum() / len(actual_labels)
print("Share of accurately predicted class labels (pipe1):")
print(fastprop_accuracy)
print()
print("Share of accurately predicted class labels (pipe2):")
print(relboost_accuracy)
print()
Share of accurately predicted class labels (pipe1):
0.9001218026796589

Share of accurately predicted class labels (pipe2):
0.8964677222898904
Since this is the method the benchmark papers use, this is the accuracy score we will report as well.
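For convenience, the three steps above can be wrapped into a single helper. This sketch merely consolidates the code from this section; like that code, it assumes that the order of the labels returned by paper.class_label.unique() matches the order of the target columns in the predicted probabilities.
def benchmark_accuracy(pipe, container, split, paper):
    """
    Multiclass accuracy as computed in the benchmark papers
    (a sketch consolidating the three steps above).
    """
    probabilities = pipe.predict(container.test)
    labels = paper.class_label.unique()
    # Pick the label with the highest predicted probability for each paper.
    predicted = np.asarray([labels[ix] for ix in np.argmax(probabilities, axis=1)])
    # Compare with the actual class labels of the test set.
    actual = paper[split == "test"].class_label.to_numpy()
    return (actual == predicted).mean()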
2.5 Studying features¶
Feature correlations
We want to analyze how the features are correlated with the target variables.
TARGET_NUM = 0
names, correlations = pipe2.features.correlations(target_num=TARGET_NUM)
plt.subplots(figsize=(20, 10))
plt.bar(names, correlations)
plt.title('Feature correlations with class label ' + class_label[TARGET_NUM])
plt.xlabel('Features')
plt.ylabel('Correlations')
plt.xticks(rotation='vertical')
plt.show()
Feature importances
Feature importances are calculated by analyzing the improvement in predictive accuracy on each node of the trees in the XGBoost predictor. They are then normalized, so that all importances add up to 100%.
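To make the normalization concrete, here is a tiny numeric illustration with hypothetical gain values (this is just the arithmetic of the final scaling step, not getML's internal code):
# Hypothetical gains accumulated per feature over all XGBoost tree nodes.
raw_gains = np.array([12.3, 4.1, 0.8])
# Scale the gains so that the importances sum to 1, i.e. 100%.
print(raw_gains / raw_gains.sum())  # ≈ [0.715 0.238 0.047]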
names, importances = pipe2.features.importances()
plt.subplots(figsize=(20, 10))
plt.bar(names, importances)
plt.title('Feature importances for class label ' + class_label[TARGET_NUM])
plt.xlabel('Features')
plt.ylabel('Importances')
plt.xticks(rotation='vertical')
plt.show()
Column importances
Because getML uses relational learning, we can apply the principles we used to calculate the feature importances to individual columns as well.
names, importances = pipe2.columns.importances(target_num=TARGET_NUM)
plt.subplots(figsize=(20, 10))
plt.bar(names, importances)
plt.title('Column importances for class label ' + class_label[TARGET_NUM])
plt.xlabel('Columns')
plt.ylabel('Importances')
plt.xticks(rotation='vertical')
plt.show()
The most important features look as follows:
pipe1.features.to_sql()[pipe1.features.sort(by="importances")[0].name]
DROP TABLE IF EXISTS "FEATURE_1_51";
CREATE TABLE "FEATURE_1_51" AS
SELECT AVG( t2."t4__class_label__mapping_2_target_3_avg" ) AS "feature_1_51",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "CITES__STAGING_TABLE_3" t2
ON t1."paper_id" = t2."citing_paper_id"
GROUP BY t1.rowid;
pipe2.features.to_sql()[pipe2.features.sort(by="importances")[0].name]
DROP TABLE IF EXISTS "FEATURE_6_1";
CREATE TABLE "FEATURE_6_1" AS
SELECT AVG(
CASE
WHEN ( f_6_2."feature_6_2_2" > 3.142350 ) AND ( f_6_2."feature_6_2_11" > 5.516436 ) AND ( f_6_2."feature_6_2_1" > 15.824859 ) THEN 19.11484833926196
WHEN ( f_6_2."feature_6_2_2" > 3.142350 ) AND ( f_6_2."feature_6_2_11" > 5.516436 ) AND ( f_6_2."feature_6_2_1" <= 15.824859 ) THEN 16.25336464210706
WHEN ( f_6_2."feature_6_2_2" > 3.142350 ) AND ( f_6_2."feature_6_2_11" <= 5.516436 ) AND ( f_6_2."feature_6_2_20" > 0.747603 ) THEN 13.8749941754607
WHEN ( f_6_2."feature_6_2_2" > 3.142350 ) AND ( f_6_2."feature_6_2_11" <= 5.516436 ) AND ( f_6_2."feature_6_2_20" <= 0.747603 ) THEN 8.209072454235654
WHEN ( f_6_2."feature_6_2_2" <= 3.142350 ) AND ( f_6_2."feature_6_2_13" > 0.575234 ) THEN 5.856092769106291
WHEN ( f_6_2."feature_6_2_2" <= 3.142350 ) AND ( f_6_2."feature_6_2_13" <= 0.575234 ) AND ( f_6_2."feature_6_2_4" > 1.058131 ) THEN -2.241272133429655
WHEN ( f_6_2."feature_6_2_2" <= 3.142350 ) AND ( f_6_2."feature_6_2_13" <= 0.575234 ) AND ( f_6_2."feature_6_2_4" <= 1.058131 ) THEN -0.6025668375656026
ELSE NULL
END
) AS "feature_6_1",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "CITES__STAGING_TABLE_3" t2
ON t1."paper_id" = t2."citing_paper_id"
LEFT JOIN "FEATURES_6_2" f_6_2
ON t2.rowid = f_6_2."rownum"
GROUP BY t1.rowid;
2.6 Productionization¶
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3 and spark modules.
# Creates a folder containing the SQL code.
pipe1.features.to_sql().save("cora_pipeline")
pipe1.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("cora_spark")
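As a rough illustration of how the saved scripts might be consumed downstream, the sketch below replays them against a SQLite database using Python's standard sqlite3 module. The folder layout (one .sql file per feature), the database file cora.db, and the presence of the staging tables referenced by the scripts are assumptions made for this example.
import pathlib
import sqlite3

# A minimal sketch: execute each saved feature script in order.
# Assumes "cora_pipeline" contains one .sql file per feature and that
# the staging tables referenced by the scripts already exist in cora.db.
with sqlite3.connect("cora.db") as db:
    for script in sorted(pathlib.Path("cora_pipeline").glob("*.sql")):
        db.executescript(script.read_text())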
2.7 Benchmarks¶
State-of-the-art approaches on this data set perform as follows:
Approach | Study | Accuracy | AUC |
---|---|---|---|
RelF | Dinh et al (2012) | 85.7% | -- |
LBP | Dinh et al (2012) | 85.0% | -- |
EPRN | Preisach and Thieme (2006) | 84.0% | -- |
PRN | Preisach and Thieme (2006) | 81.0% | -- |
ACORA | Perlich and Provost (2006) | -- | 97.0% |
As we can see, the accuracy and AUC achieved by both the FastProp and Relboost pipelines in this notebook compare favorably with these benchmarks.
pd.DataFrame(data={
'Approach': ['FastProp', 'Relboost'],
'Accuracy': [f'{score:.1%}' for score in [fastprop_accuracy, relboost_accuracy]],
'AUC': [f'{score:,.1%}' for score in [fastprop_auc, relboost_auc]]
})
Approach | Accuracy | AUC | |
---|---|---|---|
0 | FastProp | 90.0% | 98.5% |
1 | Relboost | 89.6% | 98.5% |
getml.engine.shutdown()
3. Conclusion¶
In this notebook we have demonstrated that getML outperforms state-of-the-art relational learning algorithms on the CORA dataset.
References¶
Dinh, Quang-Thang, Christel Vrain, and Matthieu Exbrayat. "A Link-Based Method for Propositionalization." ILP (Late Breaking Papers). 2012.
Motl, Jan, and Oliver Schulte. "The CTU Prague Relational Learning Repository." arXiv preprint arXiv:1511.03086 (2015).
Perlich, Claudia, and Foster Provost. "Distribution-based aggregation for relational learning with identifier attributes." Machine Learning 62.1-2 (2006): 65-105.
Preisach, Christine, and Lars Schmidt-Thieme. "Relational ensemble classification." Sixth International Conference on Data Mining (ICDM'06). IEEE, 2006.