article

Introduction to relational data (doppelt)

Author: Johannes King

Characteristics of relational data structures with code examples in SQL, R & python

Relational data

And how to handle it in Python/R/SQL

The company I work at, getML, offers a tool for automated feature engineering on relational business data. When I was onboarding at getML I faced one major issue: I didn't know what relational data was. I had some experience with data analysis in physics but there I always dealt with what I considered "normal data". Back then I created this notebook in order to understand the concepts behind the relational data model.

When talking to former colleague from science I frequently notice that most of them also have no idea what relational data is. So I decided that this notebook could be useful for other as well. It is aimed at people who have some experience in data analysis but have never heard the term "relational data". The relational data model is introduced step by step and each new concept is illustrated with code examples in Python, R and SQL (I also did't know R and SQL when I started at getML).

Code examples

  • Python is an interpreted, high-level, general-purpose programming language. A standard tool to represent data structures in Python is pandas, an open source, BSD-licensed library. A basic introduction to pandas can be found here.
# Python
import pandas as pd
  • R is an interpreted, high-level programming language specialized in statistics, data analysis and data exploration. It has a builtin data structure called data.frame, but in the past few years people were rather using tibbles for they provide some advanced features like and beyond pandas. A thorough guide on relational data in R can be found here.
# R
library(tibble)
  • SQL is a standard language for storing, manipulating and retrieving data in databases. In order to run SQL commands one needs an actual database. This document uses sqlite. We will connect to a database stored in a file
# SQL
sqlite:///relational_data_model.sqlite

Introduction

Let's start with a simple table. A table is an arrangement of data in rows and columns. Imaging you're working in the analytics department of a big department store. In order to keep track of the available products, you might create the following table

prodnr prodname prodtype quantity
41234 banana fruit 400
44321 apple fruit 2000
53153 carrot vegetable 200
78234 hammer tool 10

This table states that there are 400 units of product 41234 that is of type fruit and is called banana, 2000 units of product 44321 that is also of type fruit but is called apple, and so on.

In order to identify a table, it should have an unique name. It's good practice to associate a certain table with a real-world entity. The above table represent the products in the department store. Therefore, let's call it: Product.

The rows of the table contain information about one specific product in the department store. Each row is a unique, ordered list of attributes. The attributes are specified in the colum names of the table, in this case: prodnr, prodname, prodtype and quantity. Each cell of the table stores exactly one attribute, multiple values for an attribute are not allowed in the relational data model.

We can write the table like this

Product(prodnr, prodname, prodtype, quantity)

and one specific row as

Product(41234, banana, fruit, 400)
# Python

# Define data as dict
product_data = {'prodnr' : [41234, 44321, 53153, 78234],
                'prodname' : ['banana', 'apple', 'carrot', 'hammer'],
                'prodtype' : ['fruit','fruit','vegetable','tool'],
                'quantity': [400, 2000, 200, 10]
               }

# Create pandas DataFrame
product_ = pd.DataFrame(data=product_data)
display(product_)
# R 

# Create a tibble
product <- tibble(
    prodnr = c(41234, 44321, 53153, 78234),
    prodname = c('banana', 'apple', 'carrrot', 'hammer'),
    prodtype = c('fruit', 'fruit', 'vegetable', 'tool'),
    quantity = c(400, 2000, 200, 10)
)
print(product)
--"SQL"

--"Delete table if it already exists (allows re-executing of this cell)"
DROP TABLE IF EXISTS Product;

--"Create table with column definitions"
CREATE TABLE Product
(prodnr TEXT,
 prodname TEXT,
 prodtype TEXT, 
 quantity INTEGER);

--"Populate with data"
INSERT INTO Product VALUES(41234, "banana", "fruit", 400);
INSERT INTO Product VALUES(44321, "apple", "fruit", 2000);
INSERT INTO Product VALUES(53153, "carrot", "vegetable", 200);
INSERT INTO Product VALUES(78234, "hammer", "tool", 10);

--"Show all entries"
SELECT * FROM Product

Keys

Keys are an important concept in the relational data model. A key is an attribute (or set of attributes) such that no two rows in a table have the same values for these attributes. In other words, a key is an attribute (or set of attributes) that allows you to uniquly identify each row in the table.

In the Product table, each product is uniquly identified by its prodnr or its prodname. These attributes fullfill the requirements to be a key and are thus called candidate keys. All other attributes (prodtype and quantity) are redundant because two rows can have the same value for these attributes.

Primary key

If there is more than one candidate key in a relation, one must be chosen to be the primary key. The remaining ones are then called alternative keys. Let's choose prodnr to be the primary key of the Product table.

# Python

# Check if 'prodnr' satisfies the requirement for a primary key
print(product_.prodnr.nunique() == len(product_))


# Define 'prodnr' as primary key
# In pandas the primary key is called 'index'
product = product_.set_index("prodnr")
display(product)
# R

# The `dplyr` package will shadow some base classes with the same name, 
# e.g. `filter`, to provide for more advanced features. Using the
# `warn.conflicts` options the corresponding warning messages will be
# suppressed.
library(dplyr, warn.conflicts = FALSE)

# There is no way to define a primary key in an R tibble. 
# We can, however, check if 'prodnr' satisfies the requirements for a
# primary key. We do so by counting the occurrences of each individual
# instance of the primary key and requiring the results to be 1.
all(select(count(product, prodnr), n))
--"SQL"

--"The common way to define primary keys in SQL is on table definiton"
DROP TABLE IF EXISTS Product;

--"Define primary key and unique (alternative) key which both must not be NULL"
CREATE TABLE Product
(prodnr TEXT NOT NULL PRIMARY KEY,
 prodname TEXT NOT NULL UNIQUE,
 prodtype TEXT,
 quantity INTEGER);

INSERT INTO Product VALUES(41234, "banana", "fruit", 400);
INSERT INTO Product VALUES(44321, "apple", "fruit", 2000);
INSERT INTO Product VALUES(53153, "carrot", "vegetable", 200);
INSERT INTO Product VALUES(78234, "hammer", "tool", 10);
SELECT * from Product

Relations

Finally, we arrive at the concept that give the relational data model its name: Relations. In a typical real world data analysis project you will have to deal with more than one table in order to solve a given problem. The department store, for example, might have two more tables containing information about suppliers and purchase orders:

supnr supname supadress supcity
32 Apple Farm 41, Lansing Rd Lansing
72 Tony's Tools 152, Jackson St Chicago
12 Water Delivery 12, Mississippi St Merillville
ponr podate
12 2019-11-10
13 2019-11-14
14 2019-12-24

Each supplier has a supnr and some further information given by supname, supadress and supcity. Each purchase order is defined by a ponr and a podate. For the sake of brevity, the dates are supplied as strings and not real date types specific to a certain language.

The entire database we have created so far looks like this (primary keys are marked with an asterisk).

Product(prodnr*, prodname, prodtype, quantity)
Supplier(supnr*, supname, supadress, supcity)
Purchase_order(ponr*, podate)
# Python 

supplier_data = {'supnr' : [32, 72, 12],
                 'supname' : ["Apple Farm", "Tony's Tools", "Water Delivery"],
                 'supadress' : ["41, Lansing Rd", "142, Jackson St", "12, Missisippi St"],
                 'supcity' : ["Lansing", "Chicago", "Merillville"]
                }

purchase_order_data = {'ponr' : [12, 13, 14],
                       'podate' : ["2019-11-10", "2019-11-14", "2019-12-24"],
                      }

supplier = pd.DataFrame(data=supplier_data).set_index("supnr")
purchase_order = pd.DataFrame(data=purchase_order_data).set_index("ponr")
display(supplier, purchase_order)
# R

supplier <- tibble(
    supnr = c(32, 72, 12),
    supname = c("Apple Farm", "Tony's Tools", "Water Delivery"),
    supadress = c("41, Lansing Rd", "142, Jackson St", "12, Missisippi St"),
    supcity = c("Lansing", "Chicago", "Merillville")
)

purchase.order <- tibble(
    ponr = c(12, 13, 14),
    podate = c("2019-11-10", "2019-11-14", "2019-12-24")
)

print(supplier)
print(purchase.order)
--"SQL"

DROP TABLE IF EXISTS Supplier;

CREATE TABLE Supplier
(supnr TEXT NOT NULL PRIMARY KEY,
 supname TEXT NOT NULL,
 supadress TEXT,
 supcity TEXT
);

INSERT into Supplier VALUES(32, "Apple Farm", "41, Lansing Rd", "Lansing");
INSERT into Supplier VALUES(72, "Tony's Tools", "142, Jackson St", "Chicago");
INSERT into Supplier VALUES(12, "Water Delivery", "12, Mississippi St", "Merillville");


DROP TABLE IF EXISTS Purchase_order;

CREATE TABLE Purchase_order
(ponr INTEGER NOT NULL PRIMARY KEY,
 podate TEXT
);

INSERT into Purchase_order VALUES(12, "2019-11-10");
INSERT into Purchase_order VALUES(13, "2019-11-10");
INSERT into Purchase_order VALUES(14, "2019-11-10");

SELECT * FROM Supplier;

Foreign key

A relation is always defined between a pair of tables. We want for example to associate each purchase order with a supplier and each supplier with one or more products. In order to establish such relations, we need to introduce the concept of a foreign key. Simply put, a foreign key is an attribute of one table that appears as primary key in another table. It can therefore be used to uniquely identify a row in another table.

The simplest example of a relation is a one-to-one relation, that associates each row of a table with exactly one row from another table. In the department store example we could have added a table

Tax_number(taxnr*, supnr_)

that has supnr as a foreign key (marked with a trailing underscore) and thus associates each tax number with exactly one supplier. Obviously, this number could have been added as attribute to the Supplier table. However, one might want to store the taxing information in another table for reasons of data protection or administration. One-to-one relations seldomly appear in real-word datasets and we will discard the Tax_number table in the following.

One-to-many

Things become more interesting when we try to establish a relation between Supplier and Purchase_order. We might consider adding supnr as a foreign key to Purchase_order or adding ponr as a foreign key to Supplier. Having in mind that each attribute can only have one value, it becomes clear that adding ponr to Supplier is not an option because each supplier might be associated with more than one Purchase order. Therefore, we add supnr as foreign key to Purchase_order since each purchase order is associated with exactly one supplier. This is called a one-to-many relation.

ponr podate supnr
12 2019-11-10 32
13 2019-11-14 32
14 2019-12-24 72

The entire database now looks like this

Product(prodnr*, prodname, prodtype, quantity)
Supplier(supnr*, supname, supadress, supcity)
Purchase_order(ponr*, podate, supnr_)

Joins

Joins are a way to combine attributes from two tables that are having a relation. There are different types of joins, but in the scope of this article we will only consider left joins. A left join between two tables on a foreign key replaces all values of this key in the first table with the attributes of the row that this key identifies in the other table. To give a concrete example: Left joining Purchase_order and Supplier on the foreign key supnr results in the following table

ponr podate supname supadress supcity
12 2019-11-10 Apple Farm 41, Lansing Rd Lansing
13 2019-11-14 Apple Farm 41, Lansing Rd Landing
14 2019-12-24 Tony's Tools 142, Jackson St Chicago

The information on a given Supplier has been added to each row in Purchase_order based on the matching supnr attribute.

# Python

# Add foreign key
purchase_order["supnr"] = [32, 32, 72]

# Left join
purchase_order.join(supplier, how='left', on='supnr')
# R

# Add foreign key
purchase.order["supnr"] <- c(32, 32, 72)

# Left join
purchase.order <- left_join(purchase.order, supplier, by='supnr')
--"SQL"

--"Foreign keys are also defined on table definiton"
DROP TABLE IF EXISTS Purchase_order;

CREATE TABLE Purchase_order
(ponr INTEGER NOT NULL PRIMARY KEY,
 podate TEXT,
 supnr TEXT NOT NULL,
 FOREIGN KEY (supnr) REFERENCES Supplier (supnr)
);

INSERT into Purchase_order VALUES(12, "2019-11-10", 32);
INSERT into Purchase_order VALUES(13, "2019-11-14", 32);
INSERT into Purchase_order VALUES(14, "2019-12-24", 72);

--"Left join"
SELECT * from Purchase_order
LEFT JOIN Supplier
ON Purchase_order.supnr = Supplier.supnr;

Many-to-many

A many-to-many relation exists if each row in one table is associated with many rows of a second one and each row of the second is in turn again related to many rows of the first. You can image a many-to-many relation as a one-to-many plus a many-to-one relation.

In our example a many-to-many relation between Supplier and Product exists. Each supplier can supply more than one product, but each product can in turn be supplied by more than one supplier. However, since it is not allowed to have multivalue attribute, we can neither add supnr as foreign key to Product nor can we add prodnr as foreign key to Supplier.

This dilemma is solved by adding a new table Supplies with two foreign keys that together form the primary key of a new relation

Supplies([supnr_, prodnr_]*, purchase_price)
supnr prodnr purchase_price
32 44321 2.99
32 78234 13.5
72 78234 12
12 41234 3.2
12 53153 2.13

Note the new attribute purchase_price that has been added to the relation.

# Python

# Generate primary key tuple
supnr = [32, 32, 72, 12, 12]
prodnr = [44321, 78234, 78234, 41234, 53153]
index_tuples = [(s_, p_) for (s_,p_) in zip(supnr, prodnr)]
index = pd.MultiIndex.from_tuples(index_tuples, names=['supnr', 'prodnr'])

# Add attributes
supplies_data = {'purchase_price' : [2.99, 13.5, 12, 3.2, 2.13]}
supplies = pd.DataFrame(data=supplies_data).set_index(index)

# Left join
supplies.merge(supplier, right_index=True, left_index=True).merge(product, right_index=True, left_index=True)
# R

# Create new tibble
supplies <- tibble(
    supnr = c(32, 32, 72, 12, 12),
    prodnr = c(44321, 78234, 78234, 41234, 53153),
    purchase.price = c(2.99, 13.5, 12, 3.2, 2.13)
)

# Left join
supplies <- left_join(left_join(supplies, supplier, by='supnr'), 
                      product, by='prodnr')
print(supplies)
--"SQL"

DROP TABLE IF EXISTS Supplies;

CREATE TABLE Supplies
(supnr TEXT NOT NULL,
 prodnr TEXT NOT NULL,
 purchase_price REAL,
 PRIMARY KEY (supnr, prodnr),
 FOREIGN KEY (supnr) REFERENCES Supplier(supnr),
 FOREIGN KEY (prodnr) REFERENCES Product (prodnr)
);

INSERT into Supplies VALUES(32, 44321, 2.99);
INSERT into Supplies VALUES(32, 78234, 13.5);
INSERT into Supplies VALUES(72, 78234, 12);
INSERT into Supplies VALUES(12, 41234, 3.2);
INSERT into Supplies VALUES(12, 53153, 2.13);

select * from Supplies
LEFT JOIN Supplier
ON Supplies.supnr = Supplier.supnr
LEFT JOIN Product
ON Supplies.prodnr = Product.prodnr;

Another many-to-many relation in the supply management database of your department store is between Product and Purchase_order because you can ask the question which product(s) were purchased in which purchase_order(s).

PO_line([ponr_, prodnr_]*, quantitiy)
ponr prodnr quantity_purchased
12 44321 200
12 78234 50
13 78234 20
# Python

ponr = [12, 12, 13]
prodnr = [44321, 78234, 78234]
index_tuples = [(s_, p_) for (s_,p_) in zip(ponr, prodnr)]
index = pd.MultiIndex.from_tuples(index_tuples, names=['ponr', 'prodnr'])

po_line_data = {'quantity_purchased' : [200, 50, 20]}
po_line = pd.DataFrame(data=po_line_data).set_index(index)

po_line.merge(purchase_order, right_index=True, left_index=True).merge(product, right_index=True, left_index=True)
# R

po.line <- tibble(
    ponr = c(12, 12, 13),
    prodnr = c(44321, 78234, 78234),
    quantity.purchased = c(200, 50, 20)
)

# Nested joins can be done elegantly using R pipes (%>%)
po.line <- po.line %>% left_join(purchase.order, by='ponr') %>% 
    left_join(product, by='prodnr')
print(po.line)
--"SQL"

DROP TABLE IF EXISTS Po_line;

CREATE TABLE Po_line
(ponr TEXT NOT NULL,
 prodnr TEXT NOT NULL,
 quantity_purchased INTEGER,
 PRIMARY KEY (ponr, prodnr),
 FOREIGN KEY (ponr) REFERENCES Purchase_order (ponr),
 FOREIGN KEY (prodnr) REFERENCES Product (prodnr)
);

INSERT into Po_line VALUES(12, 44321, 200);
INSERT into Po_line VALUES(12, 78234, 50);
INSERT into Po_line VALUES(13, 78234, 20);

select * from Po_line
LEFT JOIN Purchase_order
ON Po_line.ponr = Purchase_order.ponr
LEFT JOIN Product
ON Po_line.prodnr = Product.prodnr;

The complete database now looks like this

Supplier(*supnr, supname, supadress, supcity)
Product(*prodnr, prodname, prodtype, quantity)
Supplies(*[supnr_, prodnr_], purchase_price)
Purchase_order(*ponr, podate)
PO_line(*[ponr_, prodnr_], quantitiy)

The following figure illustrates the tables in the database and their relation to each other. It was created from the SQLite database file created by the commands above using sqleton.

./relational_data_model.svg

What's next

I hope that this notebook helped you getting your head around the relational data model. Check out the blog section on https://get.ml to see examples of data analyses based on the relational data model that solve real world problems using machine learning.

Sources