In the previous notebook we compared the predictive performance of getML's relational feature learning approach with off-the-shelf Graph Neural Network (GNN) implementations and demonstrated a slight performance gain for the relational feature learner over the GNN approach. Naturally, the question now arises: is it possible to combine both approaches, pick the best of both worlds, and boost predictive power?
The answer is a resounding Yes!
The input to the GNN is the one-hot-encoded word matrix of the papers' abstracts. This matrix is very sparse and represents target-relevant information inefficiently. getML's power lies in aggregating over large amounts of data and distilling relevant target information into a few, highly optimized features. That means both approaches are perfectly compatible: we engineer features from the word matrix based on the tabular form of the CORA data set, and that smaller set of features is given to the GNN to learn the embeddings for every node and predict its class label.
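To get a feel for the dimensions involved, here is a small sketch (the shapes are taken from this notebook's data: 2708 papers, a vocabulary of roughly 1400 distinct words whose exact size is computed below, and 200 getML features):
import numpy as np
# A sketch of the two node-feature representations a GNN can consume: the
# sparse one-hot word matrix vs. getML's dense, distilled feature matrix.
one_hot_words = np.zeros((2708, 1433), dtype=int)  # mostly zeros, one row per paper
getml_features = np.zeros((2708, 200))  # 200 learned features per paper
print(one_hot_words.shape, "->", getml_features.shape)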
And that is exactly what we will do in this notebook!
Setup and Download of Data
In terms of setup, we keep everything as in the previous notebook: we use the well-established benchmarking data set CORA, which consists of academic papers, their references, and the word composition of their respective abstracts. We start out from the tabular format of the data set.
# You might need to restart the kernel after the installs
%pip install -q "getml==1.4.0" "torch-geometric~=2.5" "pandas~=2.2" "matplotlib~=3.9" "seaborn~=0.13" "numpy~=1.26" "torch~=2.4"
# Download and extract getML software
!wget -q https://static.getml.com/download/1.4.0/getml-1.4.0-x64-linux.tar.gz
!tar -xzf getml-1.4.0-x64-linux.tar.gz >/dev/null 2>&1
# Install getML
!./getml-1.4.0-x64-linux/getML install >/dev/null 2>&1
import copy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import seaborn as sns
import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv
import getml
getml.engine.launch()
getml.engine.set_project("getml_gnn_cora")
Launching ./getML --allow-push-notifications=true --allow-remote-ips=false --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false in /home/user/.getML/getml-1.4.0-x64-linux...
Launched the getML engine.
The log output will be stored in /home/user/.getML/logs/20240827150338.log.
Loading pipelines... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
Connected to project 'getml_gnn_cora'
conn = getml.database.connect_mysql(
host="db.relational-data.org",
dbname="CORA",
port=3306,
user="guest",
password="relational",
)
def load_if_needed(name):
    # Load the data frame from the getML project if it already exists;
    # otherwise, import it from the MySQL server and save it to the project.
    if not getml.data.exists(name):
        data_frame = getml.data.DataFrame.from_db(name=name, table_name=name, conn=conn)
        data_frame.save()
    else:
        data_frame = getml.data.load_data_frame(name)
    return data_frame
paper = load_if_needed("paper")
cites = load_if_needed("cites")
content = load_if_needed("content")
Data Preparation
Data preparation is similar to the "ordinary" GNN analysis: we establish the node index and derive the edge index from it. The main difference is that we do not need to vectorize the word information, because we replace the raw word information with the feature vector optimized by getML. However, we will later use the original GNN implementation for comparison purposes, so we vectorize the word information anyway and add it to the dataframe.
paper_df = paper.to_pandas()
cites_df = cites.to_pandas()
content_df = content.to_pandas()
def vectorize(word_list):
    # One-hot encode a paper's abstract: set the position of every word that
    # occurs in it to 1. (vocab_size is defined in the next cell; Python only
    # resolves it when the function is called.)
    word_vector = np.zeros(vocab_size, dtype=int)
    word_vector[word_list] = 1
    return word_vector
vocab_size = content_df["word_cited_id"].nunique() + 1
content_df["paper_id"] = content_df["paper_id"].astype(int)
# Word ids are strings like "word1234"; parse out the zero-based position.
content_df["word_list"] = content_df["word_cited_id"].apply(lambda x: int(x[4:]) - 1)
vector_content_df = content_df.groupby("paper_id").agg(list)
vector_content_df["word_vector"] = vector_content_df["word_list"].apply(
lambda x: vectorize(x)
)
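To see what this encoding does, here is a tiny sketch with two hypothetical word ids:
# Hypothetical ids "word10" and "word5" map to zero-based positions 9 and 4;
# vectorize() then produces a binary vector with ones exactly there.
example_positions = [int(w[4:]) - 1 for w in ["word10", "word5"]]  # -> [9, 4]
example_vector = vectorize(example_positions)
print(example_vector[:12])  # -> [0 0 0 0 1 0 0 0 0 1 0 0]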
paper_df["paper_id"] = paper_df["paper_id"].astype(int)
paper_df["class_label_enc"] = paper_df["class_label"].astype("category").cat.codes
num_classes = paper_df["class_label_enc"].nunique()
_df = paper_df[["class_label_enc", "class_label"]].drop_duplicates()
_df.index = _df.class_label_enc
label_mapping = _df.sort_index().to_dict()["class_label"]
index_aligned_df = paper_df.merge(vector_content_df, on="paper_id")
index_aligned_df = index_aligned_df.sample(frac=1, random_state=1).reset_index(
drop=True
) # Randomization step
index_aligned_df["node_index"] = index_aligned_df.index
Citations across papers constitute the edges of the network. Now we align them with the index of the nodes.
cites_df = cites_df.astype({"cited_paper_id": int, "citing_paper_id": int})
cites_df = cites_df.merge(
index_aligned_df[["paper_id", "node_index"]],
how="left",
left_on="cited_paper_id",
right_on="paper_id",
copy=False,
)
cites_df.rename(columns={"node_index": "cited_node_index"}, inplace=True)
cites_df.drop("paper_id", axis=1, inplace=True)
cites_df = cites_df.merge(
index_aligned_df[["paper_id", "node_index"]],
how="left",
left_on="citing_paper_id",
right_on="paper_id",
copy=False,
)
cites_df.rename(columns={"node_index": "citing_node_index"}, inplace=True)
cites_df.drop("paper_id", axis=1, inplace=True)
cites_df
|   | cited_paper_id | citing_paper_id | cited_node_index | citing_node_index |
|---|---|---|---|---|
| 0 | 35 | 887 | 1217 | 1930 |
| 1 | 35 | 1033 | 1217 | 1804 |
| 2 | 35 | 1688 | 1217 | 2112 |
| 3 | 35 | 1956 | 1217 | 2333 |
| 4 | 35 | 8865 | 1217 | 726 |
| ... | ... | ... | ... | ... |
| 5424 | 853116 | 19621 | 113 | 1967 |
| 5425 | 853116 | 853155 | 113 | 2577 |
| 5426 | 853118 | 1140289 | 2591 | 683 |
| 5427 | 853155 | 853118 | 2577 | 2591 |
| 5428 | 954315 | 1155073 | 520 | 129 |

5429 rows × 4 columns
_cited_node = cites_df["cited_node_index"].values
_citing_node = cites_df["citing_node_index"].values
# Stack each edge in both directions so the resulting graph is undirected.
cited_node = np.concatenate((_citing_node, _cited_node))
citing_node = np.concatenate((_cited_node, _citing_node))
train_share = 0.7
n_papers = len(index_aligned_df)
cut_off = int(n_papers * train_share)
# Because index_aligned_df was shuffled above, marking the first 70% of node
# indices as training data amounts to a random train/test split.
train_mask = n_papers * [False]
train_mask[:cut_off] = cut_off * [True]
test_mask = [not e for e in train_mask]
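As a quick sanity check, the two masks should partition all papers; the same counts appear in the graph summary printed further below:
# 1895 training papers + 813 test papers = 2708 papers in total
assert sum(train_mask) + sum(test_mask) == n_papers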
Feature Learning with getML
Now that we have an index-aligned feature matrix, a label vector (class_label_enc), and an edge index that builds on the aligned node indices, we can break from the established GNN routine: we enter getML territory. The procedure is very similar to the getML implementation of the previous notebook. The difference is that we won't apply a predictor, but stop at the feature learners.
There is one more important detail to be careful about. Even though we "only" learn the features with the getML routine, we have to guard against subtle data leakage: feature learning must only use the data that is also used for training in the subsequent GNN modeling. This is also the main reason why the getML feature learning doesn't precede data preparation in its entirety: we want to ensure the train/test split is identical for both processing steps.
data_train = getml.DataFrame.from_pandas(
index_aligned_df[["paper_id", "class_label"]][train_mask], name="data_train"
)
data_test = getml.DataFrame.from_pandas(
index_aligned_df[["paper_id", "class_label"]][test_mask], name="data_test"
)
paper, split = getml.data.split.concat("population", train=data_train, test=data_test)
getML requires that we define roles for each of the columns. Also, since the goal is to predict seven different class labels, we generate a target column for each of those labels.
paper.set_role("paper_id", getml.data.roles.join_key)
paper.set_role("class_label", getml.data.roles.categorical)
cites.set_role(["cited_paper_id", "citing_paper_id"], getml.data.roles.join_key)
content.set_role("paper_id", getml.data.roles.join_key)
content.set_role("word_cited_id", getml.data.roles.categorical)
data_full = getml.data.make_target_columns(paper, "class_label")
container = getml.data.Container(population=data_full, split=split)
container.add(cites=cites, content=content, paper=paper)
container.freeze()
To get started with relational learning, we need to specify the data model. Even though the data set itself is quite simple with only three tables and six columns in total, the resulting data model is actually quite complicated.
That is because the class label can be predicted using three different pieces of information:
- The keywords used by the paper
- The keywords used by papers it cites and by papers that cite the paper
- The class label of papers it cites and by papers that cite the paper
The main challenge here is that cites is used twice: once to connect the cited papers and once to connect the citing papers. To resolve this, we need two placeholders on cites.
dm = getml.data.DataModel(paper.to_placeholder("population"))
# We need two different placeholders for cites.
dm.add(getml.data.to_placeholder(cites=[cites] * 2, content=content, paper=paper))
dm.population.join(dm.cites[0], on=("paper_id", "cited_paper_id"))
dm.cites[0].join(dm.content, on=("citing_paper_id", "paper_id"))
dm.cites[0].join(
dm.paper,
on=("citing_paper_id", "paper_id"),
relationship=getml.data.relationship.many_to_one,
)
dm.population.join(dm.cites[1], on=("paper_id", "citing_paper_id"))
dm.cites[1].join(dm.content, on=("cited_paper_id", "paper_id"))
dm.cites[1].join(
dm.paper,
on=("cited_paper_id", "paper_id"),
relationship=getml.data.relationship.many_to_one,
)
dm.population.join(dm.content, on="paper_id")
dm
|   | data frames | staging table |
|---|---|---|
| 0 | population | POPULATION__STAGING_TABLE_1 |
| 1 | cites, paper | CITES__STAGING_TABLE_2 |
| 2 | cites, paper | CITES__STAGING_TABLE_3 |
| 3 | content | CONTENT__STAGING_TABLE_4 |
We use the FastProp algorithm for feature learning. Again, no predictor is required, and the pipeline is built without one.
mapping = getml.preprocessors.Mapping()
fast_prop = getml.feature_learning.FastProp(
loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
num_threads=1,
)
pipe1 = getml.pipeline.Pipeline(
tags=["fast_prop"],
data_model=dm,
preprocessors=[mapping],
feature_learners=[fast_prop],
)
pipe1
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=[], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['cites', 'content', 'paper'], predictors=[], preprocessors=['Mapping'], share_selected_features=0.5, tags=['fast_prop'])
Model training
pipe1.fit(container.train)
Checking data model...
Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
Checking... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
The pipeline check generated 3 issues labeled INFO and 0 issues labeled WARNING. To see the issues in full, run .check() on the pipeline.
Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
FastProp: Trying 3780 features... 100% |██████████| [elapsed: 00:07, remaining: 00:00]
Trained pipeline.
Time taken: 0h:0m:7.226766
Pipeline(data_model='population', feature_learners=['FastProp'], feature_selectors=[], include_categorical=False, loss_function='CrossEntropyLoss', peripheral=['cites', 'content', 'paper'], predictors=[], preprocessors=['Mapping'], share_selected_features=0.5, tags=['fast_prop', 'container-2Cj7ZD'])
Now comes the crucial trick! Instead of applying the pipeline to our test data to make predictions, we apply it to both our train and test data and transform them into the learned (optimized) features. It is these optimized features that will replace the word vectors as node attributes in the graph neural network.
trans_feat_train = pipe1.transform(container.train)
trans_feat_test = pipe1.transform(container.test)
optimized_features = np.concatenate((trans_feat_train, trans_feat_test))
Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
FastProp: Building subfeatures... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
FastProp: Building subfeatures... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
FastProp: Building features... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
Staging... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
Preprocessing... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
FastProp: Building subfeatures... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
FastProp: Building subfeatures... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
FastProp: Building features... 100% |██████████| [elapsed: 00:00, remaining: 00:00]
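Both transforms return plain numpy arrays (which the np.concatenate above relies on), with one row per paper. A quick shape check; the 200 feature columns reappear in the graph summary below:
print(trans_feat_train.shape, trans_feat_test.shape, optimized_features.shape)
# expected: (1895, 200) (813, 200) (2708, 200), matching the train/test masks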
Graph Neural Network with getML Engineered Features
From here on out it is smooth sailing, because everything happens exactly as in the previous notebook. We create the graph object and populate it with the nodes containing the optimized features, the edges and masks.
optimized_x = torch.tensor(optimized_features, dtype=torch.float32)
y = torch.tensor(np.array(index_aligned_df["class_label_enc"].values), dtype=torch.long)
edge_index = torch.tensor([cited_node, citing_node], dtype=torch.int64)
optimized_graph_object = Data(x=optimized_x, edge_index=edge_index, y=y)
optimized_graph_object.train_mask = torch.tensor(train_mask)
optimized_graph_object.test_mask = torch.tensor(test_mask)
print(optimized_graph_object)
print("==============================================================")
print(f"Number of nodes: {optimized_graph_object.num_nodes}")
print(f"Number of edges: {int(optimized_graph_object.num_edges/2)}")
print(
f"Average node degree: {(optimized_graph_object.num_edges) / optimized_graph_object.num_nodes:.2f}"
)
print(f"Number of training nodes: {optimized_graph_object.train_mask.sum()}")
print(f"Number of test nodes: {optimized_graph_object.test_mask.sum()}")
print(
f"Training node label rate: {int(optimized_graph_object.train_mask.sum()) / optimized_graph_object.num_nodes:.2f}"
)
print(
f"Test node label rate: {int(optimized_graph_object.test_mask.sum()) / optimized_graph_object.num_nodes:.2f}"
)
print(f"Contains isolated nodes: {optimized_graph_object.has_isolated_nodes()}")
print(f"Contains self-loops: {optimized_graph_object.has_self_loops()}")
print(f"Is undirected: {optimized_graph_object.is_undirected()}")
Data(x=[2708, 200], edge_index=[2, 10858], y=[2708], train_mask=[2708], test_mask=[2708])
==============================================================
Number of nodes: 2708
Number of edges: 5429
Average node degree: 4.01
Number of training nodes: 1895
Number of test nodes: 813
Training node label rate: 0.70
Test node label rate: 0.30
Contains isolated nodes: False
Contains self-loops: False
Is undirected: True
class GNNTrain:
def __init__(self, graph_object, nn_model):
self.graph_object = graph_object
self.nn_model = nn_model
self.loss_function = torch.nn.CrossEntropyLoss()
self.optimizer = torch.optim.Adam(
nn_model.parameters(), lr=0.01, weight_decay=5e-4
)
def train(self):
self.nn_model.train()
self.optimizer.zero_grad()
out = self.nn_model(self.graph_object.x, self.graph_object.edge_index)
loss = self.loss_function(
out[self.graph_object.train_mask],
self.graph_object.y[self.graph_object.train_mask],
)
        loss.backward()  # backpropagate: compute the gradients of the loss
        self.optimizer.step()  # apply the gradient update
return loss
def test(self, mask):
self.nn_model.eval()
out = self.nn_model(self.graph_object.x, self.graph_object.edge_index)
pred = out.argmax(dim=1)
correct = pred[mask] == self.graph_object.y[mask]
acc = int(correct.sum()) / int(mask.sum())
return acc
def run(self):
max_test_acc = 0
test_acc_list = []
best_model = None
for epoch in range(1, 1001):
self.train()
train_acc = self.test(self.graph_object.train_mask)
test_acc = self.test(self.graph_object.test_mask)
test_acc_list.append(test_acc)
            if test_acc > max_test_acc:
                max_test_acc = test_acc
                # Deep-copy the weights: without this, best_model would merely
                # reference the live model and end up holding the final epoch's
                # weights rather than the best-performing ones.
                best_model = copy.deepcopy(self.nn_model)
if epoch % 50 == 0:
print(
f"Epoch: {epoch:03d}, Train acc: {train_acc:.4f}"
f", Test acc: {test_acc:.4f}"
)
self.test_acc_list = test_acc_list
self.max_test_acc = max_test_acc
self.best_model = best_model
nn_model = GCNConv(
in_channels=optimized_graph_object.num_features,
out_channels=len(optimized_graph_object.y.unique()),
)
optimized_gcn = GNNTrain(graph_object=optimized_graph_object, nn_model=nn_model)
optimized_gcn.run()
print("Maximum Test Accuracy: ", optimized_gcn.max_test_acc)
Epoch: 050, Train acc: 0.9208, Test acc: 0.9114
Epoch: 100, Train acc: 0.9266, Test acc: 0.9176
Epoch: 150, Train acc: 0.9298, Test acc: 0.9164
Epoch: 200, Train acc: 0.9309, Test acc: 0.9188
Epoch: 250, Train acc: 0.9335, Test acc: 0.9188
Epoch: 300, Train acc: 0.9346, Test acc: 0.9188
Epoch: 350, Train acc: 0.9367, Test acc: 0.9188
Epoch: 400, Train acc: 0.9372, Test acc: 0.9188
Epoch: 450, Train acc: 0.9367, Test acc: 0.9188
Epoch: 500, Train acc: 0.9383, Test acc: 0.9176
Epoch: 550, Train acc: 0.9404, Test acc: 0.9164
Epoch: 600, Train acc: 0.9409, Test acc: 0.9164
Epoch: 650, Train acc: 0.9409, Test acc: 0.9164
Epoch: 700, Train acc: 0.9409, Test acc: 0.9164
Epoch: 750, Train acc: 0.9409, Test acc: 0.9176
Epoch: 800, Train acc: 0.9409, Test acc: 0.9164
Epoch: 850, Train acc: 0.9409, Test acc: 0.9176
Epoch: 900, Train acc: 0.9409, Test acc: 0.9164
Epoch: 950, Train acc: 0.9414, Test acc: 0.9164
Epoch: 1000, Train acc: 0.9414, Test acc: 0.9164
Maximum Test Accuracy:  0.9200492004920049
92% accuracy! This outcome looks really promising. It compares favourably to the GNN standalone solution (87.6%) and the getML standalone solution (88.3%) of the previous notebook.
Comparison of Approaches
Let's dig a bit deeper and put our results on more solid ground. We ran a large-scale experiment with 100 random train/test splits; the raw results can be found below:
getml_accs = [0.89012, 0.87812, 0.87627, 0.89474, 0.87442, 0.8892, 0.89751, 0.88273, 0.88181, 0.88089, 0.88366, 0.87996, 0.89566, 0.87627, 0.89104, 0.87719, 0.88366, 0.88366, 0.87258, 0.8892, 0.89104, 0.8855, 0.87812, 0.8892, 0.88827, 0.88273, 0.87165, 0.86981, 0.88273, 0.88735, 0.8735, 0.87627, 0.87719, 0.8892, 0.88366, 0.87627, 0.88181, 0.86981, 0.87627, 0.89751, 0.8855, 0.87165, 0.87073, 0.88827, 0.88827, 0.88181, 0.88827, 0.88089, 0.87073, 0.8892, 0.89012, 0.87442, 0.88735, 0.87073, 0.89197, 0.89751, 0.88827, 0.89289, 0.88089, 0.8855, 0.88366, 0.88458, 0.88181, 0.87442, 0.88735, 0.86981, 0.87812, 0.86888, 0.86981, 0.88827, 0.87627, 0.88827, 0.89289, 0.88181, 0.87442, 0.88181, 0.8855, 0.88181, 0.87535, 0.87812, 0.86519, 0.89104, 0.8892, 0.88366, 0.87258, 0.87719, 0.87719, 0.88366, 0.87535, 0.87904, 0.88089, 0.88643, 0.88273, 0.89197, 0.89381, 0.89381, 0.87258, 0.89197, 0.86981, 0.8855]
original_accs = [0.88284, 0.87823, 0.86162, 0.87085, 0.87638, 0.87915, 0.88192, 0.86624, 0.88838, 0.86531, 0.88376, 0.87269, 0.88284, 0.87269, 0.87177, 0.86439, 0.8607, 0.87454, 0.8607, 0.87454, 0.87269, 0.88469, 0.85701, 0.89207, 0.88838, 0.87731, 0.86993, 0.87269, 0.87085, 0.87454, 0.86439, 0.89207, 0.87823, 0.87546, 0.87454, 0.86993, 0.87362, 0.869, 0.86347, 0.86347, 0.87454, 0.86531, 0.86624, 0.87454, 0.87454, 0.87269, 0.87546, 0.869, 0.86993, 0.88469, 0.87915, 0.86808, 0.87085, 0.86716, 0.87362, 0.87454, 0.88469, 0.88745, 0.86255, 0.87546, 0.87546, 0.88284, 0.85793, 0.86808, 0.87731, 0.87454, 0.88376, 0.85978, 0.86347, 0.881, 0.86624, 0.88561, 0.88653, 0.89391, 0.87362, 0.869, 0.86993, 0.88376, 0.88376, 0.87362, 0.8524, 0.89483, 0.86808, 0.88284, 0.88192, 0.87546, 0.85886, 0.87269, 0.87269, 0.87915, 0.86255, 0.869, 0.87085, 0.87731, 0.87915, 0.87362, 0.869, 0.881, 0.84963, 0.88376]
optimized_accs = [0.92343, 0.92251, 0.9179, 0.93266, 0.9262, 0.93266, 0.94096, 0.92712, 0.93173, 0.91882, 0.93358, 0.92804, 0.92435, 0.92989, 0.9262, 0.92989, 0.92528, 0.92159, 0.9179, 0.92343, 0.92989, 0.93266, 0.9179, 0.93081, 0.92343, 0.92066, 0.91513, 0.91052, 0.91882, 0.92712, 0.92066, 0.92343, 0.93173, 0.92804, 0.91974, 0.91882, 0.92989, 0.9262, 0.91882, 0.92989, 0.92712, 0.92343, 0.92712, 0.92804, 0.92159, 0.92251, 0.93173, 0.92159, 0.90959, 0.92251, 0.92343, 0.91974, 0.92528, 0.92066, 0.92712, 0.93358, 0.93173, 0.9345, 0.92804, 0.92343, 0.92989, 0.9345, 0.92159, 0.92804, 0.93173, 0.91236, 0.91882, 0.9179, 0.91605, 0.92712, 0.9262, 0.93266, 0.93266, 0.92712, 0.92251, 0.92804, 0.92712, 0.92528, 0.92251, 0.9262, 0.90867, 0.93173, 0.93266, 0.9262, 0.92251, 0.9262, 0.91513, 0.91974, 0.92343, 0.90959, 0.92528, 0.92712, 0.92989, 0.9345, 0.93081, 0.9345, 0.92251, 0.92159, 0.91605, 0.9262]
print("getML standalone mean accuracy: ", round(np.mean(getml_accs), 4))
print("GNN standalone mean accuracy: ", round(np.mean(original_accs), 4))
print("GNN + getML accuracy: ", round(np.mean(optimized_accs), 4))
print()
print("Accuracy gain :", round(np.mean(optimized_accs) - np.mean(original_accs), 4))
getML standalone mean accuracy:  0.8823
GNN standalone mean accuracy:  0.8739
GNN + getML accuracy:  0.925

Accuracy gain : 0.0511
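To put the gain on firmer statistical footing, we can run a quick significance test on the 100-run samples. This is a sketch assuming scipy is available in the environment (it is not among the explicit installs above):
from scipy import stats
# Welch's t-test: is the mean accuracy of the combined approach significantly
# higher than that of the standalone GNN?
t_stat, p_value = stats.ttest_ind(optimized_accs, original_accs, equal_var=False)
print(f"t-statistic: {t_stat:.2f}, p-value: {p_value:.2e}")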
There we have it: by simply adding getML, off-the-shelf GNN implementations gain a whopping 5.1 percentage points in accuracy! The following histogram illustrates the difference.
value_list = []
label_list = []
for label, accs in [
("getML standalone", getml_accs),
("GNN standalone", original_accs),
("GNN + getML", optimized_accs),
]:
value_list = value_list + accs
label_list = label_list + len(accs) * [label]
sns_df = pd.DataFrame({"Accuracy": value_list, "Solution": label_list})
sns.displot(sns_df, x="Accuracy", hue="Solution", stat="density", bins=35)
[Figure: overlapping accuracy histograms of the three solutions]
Though the distributions overlap, we see that the getML standalone implementation performs slightly better than the GNN standalone solution. Much more eye-catching, however, is the massive boost in accuracy when both approaches are combined.
The contrast between the two also becomes apparent when visualizing the learned embeddings of both GNN versions. Let's quickly run the standalone GNN model to output its node embeddings.
original_x = torch.tensor(
index_aligned_df.word_vector.values.tolist(), dtype=torch.float32
)
y = torch.tensor(np.array(index_aligned_df["class_label_enc"].values), dtype=torch.long)
edge_index = torch.tensor([cited_node, citing_node], dtype=torch.int64)
original_graph_object = Data(x=original_x, edge_index=edge_index, y=y)
original_graph_object.train_mask = torch.tensor(train_mask)
original_graph_object.test_mask = torch.tensor(test_mask)
nn_model = GCNConv(
in_channels=original_graph_object.num_features,
out_channels=len(original_graph_object.y.unique()),
)
original_gcn = GNNTrain(graph_object=original_graph_object, nn_model=nn_model)
original_gcn.run()
print("Maximum Test Accuracy: ", original_gcn.max_test_acc)
Epoch: 050, Train acc: 0.9435, Test acc: 0.8708
Epoch: 100, Train acc: 0.9573, Test acc: 0.8733
Epoch: 150, Train acc: 0.9641, Test acc: 0.8770
Epoch: 200, Train acc: 0.9641, Test acc: 0.8708
Epoch: 250, Train acc: 0.9652, Test acc: 0.8684
Epoch: 300, Train acc: 0.9668, Test acc: 0.8733
Epoch: 350, Train acc: 0.9662, Test acc: 0.8758
Epoch: 400, Train acc: 0.9673, Test acc: 0.8758
Epoch: 450, Train acc: 0.9668, Test acc: 0.8770
Epoch: 500, Train acc: 0.9673, Test acc: 0.8770
Epoch: 550, Train acc: 0.9673, Test acc: 0.8782
Epoch: 600, Train acc: 0.9678, Test acc: 0.8782
Epoch: 650, Train acc: 0.9678, Test acc: 0.8782
Epoch: 700, Train acc: 0.9678, Test acc: 0.8795
Epoch: 750, Train acc: 0.9678, Test acc: 0.8795
Epoch: 800, Train acc: 0.9678, Test acc: 0.8770
Epoch: 850, Train acc: 0.9673, Test acc: 0.8782
Epoch: 900, Train acc: 0.9673, Test acc: 0.8770
Epoch: 950, Train acc: 0.9673, Test acc: 0.8770
Epoch: 1000, Train acc: 0.9673, Test acc: 0.8770
Maximum Test Accuracy:  0.8794587945879458
def visualize_embeddings(h1, h2, color):
z1 = TSNE(n_components=2).fit_transform(h1.detach().cpu().numpy())
z2 = TSNE(n_components=2).fit_transform(h2.detach().cpu().numpy())
z1 = pd.DataFrame(z1)
z1["label"] = color
z2 = pd.DataFrame(z2)
z2["label"] = color
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))
sns.scatterplot(z1, x=0, y=1, hue="label", ax=ax1, palette="Set1")
sns.scatterplot(z2, x=0, y=1, hue="label", ax=ax2, palette="Set1")
ax1.set_title("GNN with getML features")
ax2.set_title("GNN only")
plt.show()
visualize_embeddings(
optimized_gcn.best_model(
optimized_graph_object.x, optimized_graph_object.edge_index
),
original_gcn.best_model(original_graph_object.x, original_graph_object.edge_index),
color=original_graph_object.y,
)
In the integrated solution, the learned embeddings are much better tuned to distinguish between the labels. The more coherent and well-delineated label patches in the visualization above reflect that ability. That is the difference 5 percentage points of accuracy can make!
Conclusion
In this notebook we have gone beyond the vanilla implementations of getML and GNNs. Instead, we combined them and harnessed the best of both worlds. And the results, again, speak for themselves: a gain of more than 5 percentage points in accuracy pushes this implementation well ahead of state-of-the-art GNN development, and ranks us #1 on Papers with Code.
In the next notebook we will explore the different neural network layers that PyTorch Geometric provides and investigate how the choice of layers affects performance when combined with getML feature engineering. So, stay tuned!