CORA - Categorizing academic publications¶
In this notebook, we compare getML against existing approaches from the relational learning literature on the CORA data set, which is often used for benchmarking. We demonstrate that getML outperforms the state of the art in the relational learning literature on this data set. Beyond the benchmarking aspect, this notebook showcases getML's excellent capabilities in dealing with categorical data.
Summary:
- Prediction type: Classification model
- Domain: Academia
- Prediction target: The category of a paper
- Population size: 2708
Background¶
CORA is a well-known benchmark dataset in the academic literature on relational learning. The dataset contains 2708 scientific publications on machine learning, divided into 7 categories. The challenge is to predict the category of a paper based on the papers it cites, the papers that cite it, and the keywords it contains.
The dataset was downloaded from the CTU Prague Relational Learning Repository (Motl and Schulte, 2015), which now resides at relational-data.org.
Analysis¶
Let's get started with the analysis and set up our session:
%pip install -q "getml==1.5.0" "matplotlib==3.9.2" "ipywidgets==8.1.5"
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import getml
%matplotlib inline
print(f"getML API version: {getml.__version__}\n")
getML API version: 1.5.0
getml.engine.launch(allow_remote_ips=True, token='token')
getml.engine.set_project('cora')
Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux... Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912134108.log. Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Connected to project 'cora'.
1. Loading data¶
1.1 Download from source¶
We begin by downloading the data from the source file:
conn = getml.database.connect_mysql(
host="db.relational-data.org",
dbname="CORA",
port=3306,
user="guest",
password="relational"
)
conn
Connection(dbname='CORA', dialect='mysql', host='db.relational-data.org', port=3306)
def load_if_needed(name):
"""
Loads the data from the relational learning
repository, if the data frame has not already
been loaded.
"""
if not getml.data.exists(name):
data_frame = getml.data.DataFrame.from_db(
name=name,
table_name=name,
conn=conn
)
data_frame.save()
else:
data_frame = getml.data.load_data_frame(name)
return data_frame
paper = load_if_needed("paper")
cites = load_if_needed("cites")
content = load_if_needed("content")
paper
name | paper_id | class_label |
---|---|---|
role | unused_float | unused_string |
0 | 35 | Genetic_Algorithms |
1 | 40 | Genetic_Algorithms |
2 | 114 | Reinforcement_Learning |
3 | 117 | Reinforcement_Learning |
4 | 128 | Reinforcement_Learning |
... | ... | |
2703 | 1154500 | Case_Based |
2704 | 1154520 | Neural_Networks |
2705 | 1154524 | Rule_Learning |
2706 | 1154525 | Rule_Learning |
2707 | 1155073 | Rule_Learning |
2708 rows x 2 columns
memory usage: 0.09 MB
name: paper
type: getml.DataFrame
cites
name | cited_paper_id | citing_paper_id |
---|---|---|
role | unused_float | unused_float |
0 | 35 | 887 |
1 | 35 | 1033 |
2 | 35 | 1688 |
3 | 35 | 1956 |
4 | 35 | 8865 |
... | ... | |
5424 | 853116 | 19621 |
5425 | 853116 | 853155 |
5426 | 853118 | 1140289 |
5427 | 853155 | 853118 |
5428 | 954315 | 1155073 |
5429 rows x 2 columns
memory usage: 0.09 MB
name: cites
type: getml.DataFrame
content
name | paper_id | word_cited_id |
---|---|---|
role | unused_float | unused_string |
0 | 35 | word100 |
1 | 35 | word1152 |
2 | 35 | word1175 |
3 | 35 | word1228 |
4 | 35 | word1248 |
... | ... | |
49211 | 1155073 | word75 |
49212 | 1155073 | word759 |
49213 | 1155073 | word789 |
49214 | 1155073 | word815 |
49215 | 1155073 | word979 |
49216 rows x 2 columns
memory usage: 1.20 MB
name: content
type: getml.DataFrame
1.2 Prepare data for getML¶
getML requires that we define roles for each of the columns.
paper.set_role("paper_id", getml.data.roles.join_key)
paper.set_role("class_label", getml.data.roles.categorical)
paper
name | paper_id | class_label |
---|---|---|
role | join_key | categorical |
0 | 35 | Genetic_Algorithms |
1 | 40 | Genetic_Algorithms |
2 | 114 | Reinforcement_Learning |
3 | 117 | Reinforcement_Learning |
4 | 128 | Reinforcement_Learning |
... | ... | |
2703 | 1154500 | Case_Based |
2704 | 1154520 | Neural_Networks |
2705 | 1154524 | Rule_Learning |
2706 | 1154525 | Rule_Learning |
2707 | 1155073 | Rule_Learning |
2708 rows x 2 columns
memory usage: 0.02 MB
name: paper
type: getml.DataFrame
cites.set_role(["cited_paper_id", "citing_paper_id"], getml.data.roles.join_key)
cites
name | cited_paper_id | citing_paper_id |
---|---|---|
role | join_key | join_key |
0 | 35 | 887 |
1 | 35 | 1033 |
2 | 35 | 1688 |
3 | 35 | 1956 |
4 | 35 | 8865 |
... | ... | |
5424 | 853116 | 19621 |
5425 | 853116 | 853155 |
5426 | 853118 | 1140289 |
5427 | 853155 | 853118 |
5428 | 954315 | 1155073 |
5429 rows x 2 columns
memory usage: 0.04 MB
name: cites
type: getml.DataFrame
We also assign roles to the columns of the content table:
content.set_role("paper_id", getml.data.roles.join_key)
content.set_role("word_cited_id", getml.data.roles.categorical)
content
name | paper_id | word_cited_id |
---|---|---|
role | join_key | categorical |
0 | 35 | word100 |
1 | 35 | word1152 |
2 | 35 | word1175 |
3 | 35 | word1228 |
4 | 35 | word1248 |
... | ... | |
49211 | 1155073 | word75 |
49212 | 1155073 | word759 |
49213 | 1155073 | word789 |
49214 | 1155073 | word815 |
49215 | 1155073 | word979 |
49216 rows x 2 columns
memory usage: 0.39 MB
name: content
type: getml.DataFrame
The goal is to predict seven different labels. We generate a target column for each of those labels. We also have to separate the data set into a training and testing set.
data_full = getml.data.make_target_columns(paper, "class_label")
data_full
name | paper_id | class_label=Case_Based | class_label=Genetic_Algorithms | class_label=Neural_Networks | class_label=Probabilistic_Methods | class_label=Reinforcement_Learning | class_label=Rule_Learning | class_label=Theory |
---|---|---|---|---|---|---|---|---|
role | join_key | target | target | target | target | target | target | target |
0 | 35 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 40 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
2 | 114 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 117 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 128 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2708 rows
type: getml.data.View
split = getml.data.split.random(train=0.7, test=0.3, validation=0.0)
split
0 | train |
---|---|
1 | test |
2 | train |
3 | test |
4 | test |
... |
infinite number of rows
type: StringColumnView
container = getml.data.Container(population=data_full, split=split)
container.add(cites=cites, content=content, paper=paper)
container.freeze()
container
subset | name | rows | type | |
---|---|---|---|---|
0 | test | paper | 821 | View |
1 | train | paper | 1887 | View |
name | rows | type | |
---|---|---|---|
0 | cites | 5429 | DataFrame |
1 | content | 49216 | DataFrame |
2 | paper | 2708 | DataFrame |
2. Predictive modeling¶
We have loaded the data and defined the roles. Next, we create getML pipelines for relational learning.
2.1 Define relational model¶
To get started with relational learning, we need to specify the data model. Even though the data set itself is quite simple with only three tables and six columns in total, the resulting data model is actually quite complicated.
That is because the class label can be predicted using three different pieces of information:
- The keywords used by the paper
- The keywords used by the papers it cites and by the papers that cite it
- The class labels of the papers it cites and of the papers that cite it
The main challenge here is that cites is used twice: once to connect the cited papers and once to connect the citing papers. To resolve this, we need two placeholders on cites.
dm = getml.data.DataModel(paper.to_placeholder("population"))
# We need two different placeholders for cites.
dm.add(getml.data.to_placeholder(cites=[cites]*2, content=content, paper=paper))
dm.population.join(
dm.cites[0],
on=('paper_id', 'cited_paper_id')
)
dm.cites[0].join(
dm.content,
on=('citing_paper_id', 'paper_id')
)
dm.cites[0].join(
dm.paper,
on=('citing_paper_id', 'paper_id'),
relationship=getml.data.relationship.many_to_one
)
dm.population.join(
dm.cites[1],
on=('paper_id', 'citing_paper_id')
)
dm.cites[1].join(
dm.content,
on=('cited_paper_id', 'paper_id')
)
dm.cites[1].join(
dm.paper,
on=('cited_paper_id', 'paper_id'),
relationship=getml.data.relationship.many_to_one
)
dm.population.join(
dm.content,
on='paper_id'
)
dm
data frames | staging table | |
---|---|---|
0 | population | POPULATION__STAGING_TABLE_1 |
1 | cites, paper | CITES__STAGING_TABLE_2 |
2 | cites, paper | CITES__STAGING_TABLE_3 |
3 | content | CONTENT__STAGING_TABLE_4 |
2.2 getML pipeline¶
Set up the feature learners & predictor
We use the FastProp and Relboost algorithms for this problem. Because of the large number of keywords, we regularize the Relboost model a bit by requiring a minimum support for the keywords (min_num_samples).
mapping = getml.preprocessors.Mapping()
fast_prop = getml.feature_learning.FastProp(
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
num_threads=1
)
relboost = getml.feature_learning.Relboost(
num_features=10,
num_subfeatures=10,
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
seed=4367,
num_threads=1,
min_num_samples=30
)
predictor = getml.predictors.XGBoostClassifier()
Build the pipeline
pipe1 = getml.pipeline.Pipeline(
tags=['fast_prop'],
data_model=dm,
preprocessors=[mapping],
feature_learners=[fast_prop],
predictors=[predictor]
)
pipe1
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=[], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['cites', 'content', 'paper'], predictors=['XGBoostClassifier'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['fast_prop'])
pipe2 = getml.pipeline.Pipeline(
tags=['relboost'],
data_model=dm,
feature_learners=[relboost],
predictors=[predictor]
)
pipe2
Pipeline(data_model='population', feature_learners=['Relboost'], feature_selectors=[], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['cites', 'content', 'paper'], predictors=['XGBoostClassifier'], preprocessors=[], share_selected_features=0.5, tags=['relboost'])
2.3 Model training¶
pipe1.check(container.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 3 issues labeled INFO and 0 issues labeled WARNING.
type | label | message | |
---|---|---|---|
0 | INFO | MIGHT TAKE LONG | The number of unique entries in column 'word_cited_id' in CONTENT__STAGING_TABLE_4 is 1432. This might take a long time to fit. You should consider setting its role to unused_string or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only'). |
1 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_2 over 'paper_id' and 'cited_paper_id', there are no corresponding entries for 41.759406% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
2 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_3 over 'paper_id' and 'citing_paper_id', there are no corresponding entries for 17.700053% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
pipe1.fit(container.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 3 issues labeled INFO and 0 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
FastProp: Trying 3780 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:06
FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (×2)
FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01 (×7, one per target)
Trained pipeline.
Time taken: 0:00:16.670446.
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=[], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['cites', 'content', 'paper'], predictors=['XGBoostClassifier'], preprocessors=['Mapping'], share_selected_features=0.5, tags=['fast_prop', 'container-BRPpU2'])
pipe2.check(container.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 3 issues labeled INFO and 0 issues labeled WARNING.
type | label | message | |
---|---|---|---|
0 | INFO | MIGHT TAKE LONG | The number of unique entries in column 'word_cited_id' in CONTENT__STAGING_TABLE_4 is 1432. This might take a long time to fit. You should consider setting its role to unused_string or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only'). |
1 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_2 over 'paper_id' and 'cited_paper_id', there are no corresponding entries for 41.759406% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
2 | INFO | FOREIGN KEYS NOT FOUND | When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_3 over 'paper_id' and 'citing_paper_id', there are no corresponding entries for 17.700053% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys. |
The training output may seem a bit intimidating. That is because the Relboost algorithm needs to train a separate model for each class label, owing to the nature of the generated features.
pipe2.fit(container.train)
Checking data model...
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
The pipeline check generated 3 issues labeled INFO and 0 issues labeled WARNING.
To see the issues in full, run .check() on the pipeline.
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (repeated per target)
Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (repeated per target)
Relboost: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (repeated per target)
Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (repeated per target)
XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (×7, one per target)
Trained pipeline.
Time taken: 0:00:39.645936.
Pipeline(data_model='population', feature_learners=['Relboost'], feature_selectors=[], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['cites', 'content', 'paper'], predictors=['XGBoostClassifier'], preprocessors=[], share_selected_features=0.5, tags=['relboost', 'container-BRPpU2'])
2.4 Model evaluation¶
pipe1.score(container.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (×2)
FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 13:05:07 | train | class_label=Case_Based | 0.9979 | 0.9999 | 0.02323 |
1 | 2024-09-12 13:05:07 | train | class_label=Genetic_Algorithms | 1.0 | 1. | 0.004862 |
2 | 2024-09-12 13:05:07 | train | class_label=Neural_Networks | 0.9846 | 0.9983 | 0.065852 |
3 | 2024-09-12 13:05:07 | train | class_label=Probabilistic_Methods | 0.9958 | 0.9998 | 0.027649 |
4 | 2024-09-12 13:05:07 | train | class_label=Reinforcement_Learning | 0.9995 | 1. | 0.008878 |
... | ... | ... | ... | ... | ... | |
9 | 2024-09-12 13:05:48 | test | class_label=Neural_Networks | 0.9513 | 0.9787 | 0.163577 |
10 | 2024-09-12 13:05:48 | test | class_label=Probabilistic_Methods | 0.9744 | 0.9873 | 0.082802 |
11 | 2024-09-12 13:05:48 | test | class_label=Reinforcement_Learning | 0.9805 | 0.9736 | 0.073926 |
12 | 2024-09-12 13:05:48 | test | class_label=Rule_Learning | 0.9842 | 0.9937 | 0.052303 |
13 | 2024-09-12 13:05:48 | test | class_label=Theory | 0.9562 | 0.977 | 0.128617 |
pipe2.score(container.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (repeated per target)
Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (repeated per target)
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 13:05:47 | train | class_label=Case_Based | 1.0 | 1. | 0.009385 |
1 | 2024-09-12 13:05:47 | train | class_label=Genetic_Algorithms | 1.0 | 1. | 0.004222 |
2 | 2024-09-12 13:05:47 | train | class_label=Neural_Networks | 0.991 | 0.9996 | 0.03766 |
3 | 2024-09-12 13:05:47 | train | class_label=Probabilistic_Methods | 0.9989 | 1. | 0.013846 |
4 | 2024-09-12 13:05:47 | train | class_label=Reinforcement_Learning | 1.0 | 1. | 0.004409 |
... | ... | ... | ... | ... | ... | |
9 | 2024-09-12 13:05:51 | test | class_label=Neural_Networks | 0.9391 | 0.9757 | 0.193588 |
10 | 2024-09-12 13:05:51 | test | class_label=Probabilistic_Methods | 0.9769 | 0.9892 | 0.072601 |
11 | 2024-09-12 13:05:51 | test | class_label=Reinforcement_Learning | 0.9769 | 0.9773 | 0.094757 |
12 | 2024-09-12 13:05:51 | test | class_label=Rule_Learning | 0.9842 | 0.9912 | 0.060603 |
13 | 2024-09-12 13:05:51 | test | class_label=Theory | 0.9488 | 0.9745 | 0.142024 |
To make things a bit easier, we just look at our test results.
pipe1.scores.filter(lambda score: score.set_used == "test")
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 13:05:48 | test | class_label=Case_Based | 0.9708 | 0.9861 | 0.08689 |
1 | 2024-09-12 13:05:48 | test | class_label=Genetic_Algorithms | 0.9842 | 0.9981 | 0.04915 |
2 | 2024-09-12 13:05:48 | test | class_label=Neural_Networks | 0.9513 | 0.9787 | 0.16358 |
3 | 2024-09-12 13:05:48 | test | class_label=Probabilistic_Methods | 0.9744 | 0.9873 | 0.0828 |
4 | 2024-09-12 13:05:48 | test | class_label=Reinforcement_Learning | 0.9805 | 0.9736 | 0.07393 |
5 | 2024-09-12 13:05:48 | test | class_label=Rule_Learning | 0.9842 | 0.9937 | 0.0523 |
6 | 2024-09-12 13:05:48 | test | class_label=Theory | 0.9562 | 0.977 | 0.12862 |
pipe2.scores.filter(lambda score: score.set_used == "test")
date time | set used | target | accuracy | auc | cross entropy | |
---|---|---|---|---|---|---|
0 | 2024-09-12 13:05:51 | test | class_label=Case_Based | 0.9744 | 0.9895 | 0.08319 |
1 | 2024-09-12 13:05:51 | test | class_label=Genetic_Algorithms | 0.9903 | 0.9988 | 0.03866 |
2 | 2024-09-12 13:05:51 | test | class_label=Neural_Networks | 0.9391 | 0.9757 | 0.19359 |
3 | 2024-09-12 13:05:51 | test | class_label=Probabilistic_Methods | 0.9769 | 0.9892 | 0.0726 |
4 | 2024-09-12 13:05:51 | test | class_label=Reinforcement_Learning | 0.9769 | 0.9773 | 0.09476 |
5 | 2024-09-12 13:05:51 | test | class_label=Rule_Learning | 0.9842 | 0.9912 | 0.0606 |
6 | 2024-09-12 13:05:51 | test | class_label=Theory | 0.9488 | 0.9745 | 0.14202 |
We take the average of the AUC values, which is also the value that appears in the getML monitor (http://localhost:1709/#/listpipelines/cora).
fastprop_auc = np.mean(pipe1.auc)
relboost_auc = np.mean(pipe2.auc)
print(fastprop_auc)
print(relboost_auc)
0.9849300280909873
0.9851644641057435
The accuracy for multiple targets can be calculated using one of two methods. The first method is to simply take the average of the per-target accuracy values, which is also the value that appears in the getML monitor (http://localhost:1709/#/listpipelines/cora).
print(np.mean(pipe1.accuracy))
print(np.mean(pipe2.accuracy))
0.9716373760222724
0.9700713415695145
However, the benchmarking papers actually use a different approach:
- They first generate probabilities for each of the labels:
probabilities1 = pipe1.predict(container.test)
probabilities2 = pipe2.predict(container.test)
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (×2)
FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00
Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (repeated per target)
Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00 (repeated per target)
- They then find the class label with the highest probability:
class_label = paper.class_label.unique()
ix_max = np.argmax(probabilities1, axis=1)
predicted_labels1 = np.asarray([class_label[ix] for ix in ix_max])
ix_max = np.argmax(probabilities2, axis=1)
predicted_labels2 = np.asarray([class_label[ix] for ix in ix_max])
- They then compare that value to the actual class label:
actual_labels = paper[split == "test"].class_label.to_numpy()
fastprop_accuracy = (actual_labels == predicted_labels1).sum() / len(actual_labels)
relboost_accuracy = (actual_labels == predicted_labels2).sum() / len(actual_labels)
print("Share of accurately predicted class labels (pipe1):")
print(fastprop_accuracy)
print()
print("Share of accurately predicted class labels (pipe2):")
print(relboost_accuracy)
print()
Share of accurately predicted class labels (pipe1):
0.9001218026796589

Share of accurately predicted class labels (pipe2):
0.8964677222898904
Since this is the method the benchmark papers use, this is the accuracy score we will report as well.
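For convenience, the three steps above can be wrapped into a single helper. This sketch merely consolidates the code from this section; like that code, it assumes that the order of the labels returned by paper.class_label.unique() matches the order of the target columns in the predicted probabilities.
def benchmark_accuracy(pipe, container, split, paper):
    """
    Multiclass accuracy as computed in the benchmark papers
    (a sketch consolidating the three steps above).
    """
    probabilities = pipe.predict(container.test)
    labels = paper.class_label.unique()
    # Pick the label with the highest predicted probability for each paper.
    predicted = np.asarray([labels[ix] for ix in np.argmax(probabilities, axis=1)])
    # Compare with the actual class labels of the test set.
    actual = paper[split == "test"].class_label.to_numpy()
    return (actual == predicted).mean()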
2.5 Studying features¶
Feature correlations
We want to analyze how the features are correlated with the target variables.
TARGET_NUM = 0
names, correlations = pipe2.features.correlations(target_num=TARGET_NUM)
plt.subplots(figsize=(20, 10))
plt.bar(names, correlations)
plt.title('Feature correlations with class label ' + class_label[TARGET_NUM])
plt.xlabel('Features')
plt.ylabel('Correlations')
plt.xticks(rotation='vertical')
plt.show()
Feature importances
Feature importances are calculated by analyzing the improvement in predictive accuracy on each node of the trees in the XGBoost predictor. They are then normalized, so that all importances add up to 100%.
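To make the normalization concrete, here is a tiny numeric illustration with hypothetical gain values (this is just the arithmetic of the final scaling step, not getML's internal code):
# Hypothetical gains accumulated per feature over all XGBoost tree nodes.
raw_gains = np.array([12.3, 4.1, 0.8])
# Scale the gains so that the importances sum to 1, i.e. 100%.
print(raw_gains / raw_gains.sum())  # ≈ [0.715 0.238 0.047]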
names, importances = pipe2.features.importances()
plt.subplots(figsize=(20, 10))
plt.bar(names, importances)
plt.title('Feature importances for class label ' + class_label[TARGET_NUM])
plt.xlabel('Features')
plt.ylabel('Importances')
plt.xticks(rotation='vertical')
plt.show()
Column importances
Because getML uses relational learning, we can apply the principles we used to calculate the feature importances to individual columns as well.
names, importances = pipe2.columns.importances(target_num=TARGET_NUM)
plt.subplots(figsize=(20, 10))
plt.bar(names, importances)
plt.title('Column importances for class label ' + class_label[TARGET_NUM])
plt.xlabel('Columns')
plt.ylabel('Importances')
plt.xticks(rotation='vertical')
plt.show()
The most important features look as follows:
pipe1.features.to_sql()[pipe1.features.sort(by="importances")[0].name]
DROP TABLE IF EXISTS "FEATURE_1_51";
CREATE TABLE "FEATURE_1_51" AS
SELECT AVG( t2."t4__class_label__mapping_2_target_3_avg" ) AS "feature_1_51",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "CITES__STAGING_TABLE_3" t2
ON t1."paper_id" = t2."citing_paper_id"
GROUP BY t1.rowid;
pipe2.features.to_sql()[pipe2.features.sort(by="importances")[0].name]
DROP TABLE IF EXISTS "FEATURE_6_1";
CREATE TABLE "FEATURE_6_1" AS
SELECT AVG(
CASE
WHEN ( f_6_2."feature_6_2_2" > 3.142350 ) AND ( f_6_2."feature_6_2_11" > 5.516436 ) AND ( f_6_2."feature_6_2_1" > 15.824859 ) THEN 19.11484833926196
WHEN ( f_6_2."feature_6_2_2" > 3.142350 ) AND ( f_6_2."feature_6_2_11" > 5.516436 ) AND ( f_6_2."feature_6_2_1" <= 15.824859 ) THEN 16.25336464210706
WHEN ( f_6_2."feature_6_2_2" > 3.142350 ) AND ( f_6_2."feature_6_2_11" <= 5.516436 ) AND ( f_6_2."feature_6_2_20" > 0.747603 ) THEN 13.8749941754607
WHEN ( f_6_2."feature_6_2_2" > 3.142350 ) AND ( f_6_2."feature_6_2_11" <= 5.516436 ) AND ( f_6_2."feature_6_2_20" <= 0.747603 ) THEN 8.209072454235654
WHEN ( f_6_2."feature_6_2_2" <= 3.142350 ) AND ( f_6_2."feature_6_2_13" > 0.575234 ) THEN 5.856092769106291
WHEN ( f_6_2."feature_6_2_2" <= 3.142350 ) AND ( f_6_2."feature_6_2_13" <= 0.575234 ) AND ( f_6_2."feature_6_2_4" > 1.058131 ) THEN -2.241272133429655
WHEN ( f_6_2."feature_6_2_2" <= 3.142350 ) AND ( f_6_2."feature_6_2_13" <= 0.575234 ) AND ( f_6_2."feature_6_2_4" <= 1.058131 ) THEN -0.6025668375656026
ELSE NULL
END
) AS "feature_6_1",
t1.rowid AS rownum
FROM "POPULATION__STAGING_TABLE_1" t1
INNER JOIN "CITES__STAGING_TABLE_3" t2
ON t1."paper_id" = t2."citing_paper_id"
LEFT JOIN "FEATURES_6_2" f_6_2
ON t2.rowid = f_6_2."rownum"
GROUP BY t1.rowid;
2.6 Productionization¶
It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's sqlite3 and spark modules.
# Creates a folder containing the SQL code.
pipe1.features.to_sql().save("cora_pipeline")
pipe1.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("cora_spark")
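As a rough illustration of how the saved scripts might be consumed downstream, the sketch below replays them against a SQLite database using Python's standard sqlite3 module. The folder layout (one .sql file per feature), the database file cora.db, and the presence of the staging tables referenced by the scripts are assumptions made for this example.
import pathlib
import sqlite3

# A minimal sketch: execute each saved feature script in order.
# Assumes "cora_pipeline" contains one .sql file per feature and that
# the staging tables referenced by the scripts already exist in cora.db.
with sqlite3.connect("cora.db") as db:
    for script in sorted(pathlib.Path("cora_pipeline").glob("*.sql")):
        db.executescript(script.read_text())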
2.7 Benchmarks¶
State-of-the-art approaches on this data set perform as follows:
Approach | Study | Accuracy | AUC |
---|---|---|---|
RelF | Dinh et al (2012) | 85.7% | -- |
LBP | Dinh et al (2012) | 85.0% | -- |
EPRN | Preisach and Thieme (2006) | 84.0% | -- |
PRN | Preisach and Thieme (2006) | 81.0% | -- |
ACORA | Perlich and Provost (2006) | -- | 97.0% |
As we can see, the accuracy and AUC achieved by both the FastProp and Relboost pipelines in this notebook compare favorably with these benchmarks.
pd.DataFrame(data={
'Approach': ['FastProp', 'Relboost'],
'Accuracy': [f'{score:.1%}' for score in [fastprop_accuracy, relboost_accuracy]],
'AUC': [f'{score:,.1%}' for score in [fastprop_auc, relboost_auc]]
})
Approach | Accuracy | AUC | |
---|---|---|---|
0 | FastProp | 90.0% | 98.5% |
1 | Relboost | 89.6% | 98.5% |
getml.engine.shutdown()
3. Conclusion¶
In this notebook we have demonstrated that getML outperforms state-of-the-art relational learning algorithms on the CORA dataset.
References¶
Dinh, Quang-Thang, Christel Vrain, and Matthieu Exbrayat. "A Link-Based Method for Propositionalization." ILP (Late Breaking Papers). 2012.
Motl, Jan, and Oliver Schulte. "The CTU Prague Relational Learning Repository." arXiv preprint arXiv:1511.03086 (2015).
Perlich, Claudia, and Foster Provost. "Distribution-based aggregation for relational learning with identifier attributes." Machine Learning 62.1-2 (2006): 65-105.
Preisach, Christine, and Lars Schmidt-Thieme. "Relational ensemble classification." Sixth International Conference on Data Mining (ICDM'06). IEEE, 2006.