Characteristics of relational data structures with code examples in SQL, R & python
The company I work at, getML, offers a tool for automated feature engineering on relational business data. When I was onboarding at getML I faced one major issue: I didn't know what relational data was. I had some experience with data analysis in physics but there I always dealt with what I considered "normal data". Back then I created this notebook in order to understand the concepts behind the relational data model.
When talking to former colleague from science I frequently notice that most of them also have no idea what relational data is. So I decided that this notebook could be useful for other as well. It is aimed at people who have some experience in data analysis but have never heard the term "relational data". The relational data model is introduced step by step and each new concept is illustrated with code examples in Python, R and SQL (I also did't know R and SQL when I started at getML).
# Python
import pandas as pd
# R
library(tibble)
# SQL
sqlite:///relational_data_model.sqlite
Let's start with a simple table. A table is an arrangement of data in rows and columns. Imaging you're working in the analytics department of a big department store. In order to keep track of the available products, you might create the following table
prodnr | prodname | prodtype | quantity |
---|---|---|---|
41234 | banana | fruit | 400 |
44321 | apple | fruit | 2000 |
53153 | carrot | vegetable | 200 |
78234 | hammer | tool | 10 |
This table states that there are 400
units of product 41234
that is of type fruit
and is called banana
, 2000
units of product 44321
that is also of type fruit
but is called apple
, and so on.
In order to identify a table, it should have an unique name. It's good practice to associate a certain table with a real-world entity. The above table represent the products in the department store. Therefore, let's call it: Product
.
The rows of the table contain information about one specific product in the department store. Each row is a unique, ordered list of attributes. The attributes are specified in the colum names of the table, in this case: prodnr
, prodname
, prodtype
and quantity
. Each cell of the table stores exactly one attribute, multiple values for an attribute are not allowed in the relational data model.
We can write the table like this
Product(prodnr, prodname, prodtype, quantity)
and one specific row as
Product(41234, banana, fruit, 400)
# Python
# Define data as dict
product_data = {'prodnr' : [41234, 44321, 53153, 78234],
'prodname' : ['banana', 'apple', 'carrot', 'hammer'],
'prodtype' : ['fruit','fruit','vegetable','tool'],
'quantity': [400, 2000, 200, 10]
}
# Create pandas DataFrame
product_ = pd.DataFrame(data=product_data)
display(product_)
# R
# Create a tibble
product <- tibble(
prodnr = c(41234, 44321, 53153, 78234),
prodname = c('banana', 'apple', 'carrrot', 'hammer'),
prodtype = c('fruit', 'fruit', 'vegetable', 'tool'),
quantity = c(400, 2000, 200, 10)
)
print(product)
--"SQL"
--"Delete table if it already exists (allows re-executing of this cell)"
DROP TABLE IF EXISTS Product;
--"Create table with column definitions"
CREATE TABLE Product
(prodnr TEXT,
prodname TEXT,
prodtype TEXT,
quantity INTEGER);
--"Populate with data"
INSERT INTO Product VALUES(41234, "banana", "fruit", 400);
INSERT INTO Product VALUES(44321, "apple", "fruit", 2000);
INSERT INTO Product VALUES(53153, "carrot", "vegetable", 200);
INSERT INTO Product VALUES(78234, "hammer", "tool", 10);
--"Show all entries"
SELECT * FROM Product
Keys are an important concept in the relational data model. A key is an attribute (or set of attributes) such that no two rows in a table have the same values for these attributes. In other words, a key is an attribute (or set of attributes) that allows you to uniquly identify each row in the table.
In the Product
table, each product is uniquly identified by its prodnr
or its prodname
. These attributes fullfill the requirements to be a key and are thus called candidate keys. All other attributes (prodtype
and quantity
) are redundant because two rows can have the same value for these attributes.
If there is more than one candidate key in a relation, one must be chosen to be the primary key. The remaining ones are then called alternative keys. Let's choose prodnr
to be the primary key of the Product
table.
# Python
# Check if 'prodnr' satisfies the requirement for a primary key
print(product_.prodnr.nunique() == len(product_))
# Define 'prodnr' as primary key
# In pandas the primary key is called 'index'
product = product_.set_index("prodnr")
display(product)
# R
# The `dplyr` package will shadow some base classes with the same name,
# e.g. `filter`, to provide for more advanced features. Using the
# `warn.conflicts` options the corresponding warning messages will be
# suppressed.
library(dplyr, warn.conflicts = FALSE)
# There is no way to define a primary key in an R tibble.
# We can, however, check if 'prodnr' satisfies the requirements for a
# primary key. We do so by counting the occurrences of each individual
# instance of the primary key and requiring the results to be 1.
all(select(count(product, prodnr), n))
--"SQL"
--"The common way to define primary keys in SQL is on table definiton"
DROP TABLE IF EXISTS Product;
--"Define primary key and unique (alternative) key which both must not be NULL"
CREATE TABLE Product
(prodnr TEXT NOT NULL PRIMARY KEY,
prodname TEXT NOT NULL UNIQUE,
prodtype TEXT,
quantity INTEGER);
INSERT INTO Product VALUES(41234, "banana", "fruit", 400);
INSERT INTO Product VALUES(44321, "apple", "fruit", 2000);
INSERT INTO Product VALUES(53153, "carrot", "vegetable", 200);
INSERT INTO Product VALUES(78234, "hammer", "tool", 10);
SELECT * from Product
Finally, we arrive at the concept that give the relational data model its name: Relations. In a typical real world data analysis project you will have to deal with more than one table in order to solve a given problem. The department store, for example, might have two more tables containing information about suppliers and purchase orders:
supnr | supname | supadress | supcity |
---|---|---|---|
32 | Apple Farm | 41, Lansing Rd | Lansing |
72 | Tony's Tools | 152, Jackson St | Chicago |
12 | Water Delivery | 12, Mississippi St | Merillville |
ponr | podate |
---|---|
12 | 2019-11-10 |
13 | 2019-11-14 |
14 | 2019-12-24 |
Each supplier has a supnr
and some further information given by supname
, supadress
and supcity
. Each purchase order is defined by a ponr
and a podate
. For the sake of brevity, the dates are supplied as strings and not real date types specific to a certain language.
The entire database we have created so far looks like this (primary keys are marked with an asterisk).
Product(prodnr*, prodname, prodtype, quantity)
Supplier(supnr*, supname, supadress, supcity)
Purchase_order(ponr*, podate)
# Python
supplier_data = {'supnr' : [32, 72, 12],
'supname' : ["Apple Farm", "Tony's Tools", "Water Delivery"],
'supadress' : ["41, Lansing Rd", "142, Jackson St", "12, Missisippi St"],
'supcity' : ["Lansing", "Chicago", "Merillville"]
}
purchase_order_data = {'ponr' : [12, 13, 14],
'podate' : ["2019-11-10", "2019-11-14", "2019-12-24"],
}
supplier = pd.DataFrame(data=supplier_data).set_index("supnr")
purchase_order = pd.DataFrame(data=purchase_order_data).set_index("ponr")
display(supplier, purchase_order)
# R
supplier <- tibble(
supnr = c(32, 72, 12),
supname = c("Apple Farm", "Tony's Tools", "Water Delivery"),
supadress = c("41, Lansing Rd", "142, Jackson St", "12, Missisippi St"),
supcity = c("Lansing", "Chicago", "Merillville")
)
purchase.order <- tibble(
ponr = c(12, 13, 14),
podate = c("2019-11-10", "2019-11-14", "2019-12-24")
)
print(supplier)
print(purchase.order)
--"SQL"
DROP TABLE IF EXISTS Supplier;
CREATE TABLE Supplier
(supnr TEXT NOT NULL PRIMARY KEY,
supname TEXT NOT NULL,
supadress TEXT,
supcity TEXT
);
INSERT into Supplier VALUES(32, "Apple Farm", "41, Lansing Rd", "Lansing");
INSERT into Supplier VALUES(72, "Tony's Tools", "142, Jackson St", "Chicago");
INSERT into Supplier VALUES(12, "Water Delivery", "12, Mississippi St", "Merillville");
DROP TABLE IF EXISTS Purchase_order;
CREATE TABLE Purchase_order
(ponr INTEGER NOT NULL PRIMARY KEY,
podate TEXT
);
INSERT into Purchase_order VALUES(12, "2019-11-10");
INSERT into Purchase_order VALUES(13, "2019-11-10");
INSERT into Purchase_order VALUES(14, "2019-11-10");
SELECT * FROM Supplier;
A relation is always defined between a pair of tables. We want for example to associate each purchase order with a supplier and each supplier with one or more products. In order to establish such relations, we need to introduce the concept of a foreign key. Simply put, a foreign key is an attribute of one table that appears as primary key in another table. It can therefore be used to uniquely identify a row in another table.
The simplest example of a relation is a one-to-one relation, that associates each row of a table with exactly one row from another table. In the department store example we could have added a table
Tax_number(taxnr*, supnr_)
that has supnr
as a foreign key (marked with a trailing underscore) and thus associates each tax number with exactly one supplier. Obviously, this number could have been added as attribute to the Supplier
table. However, one might want to store the taxing information in another table for reasons of data protection or administration. One-to-one relations seldomly appear in real-word datasets and we will discard the Tax_number
table in the following.
Things become more interesting when we try to establish a relation between Supplier
and Purchase_order
. We might consider adding supnr
as a foreign key to Purchase_order
or adding ponr
as a foreign key to Supplier
. Having in mind that each attribute can only have one value, it becomes clear that adding ponr
to Supplier
is not an option because each supplier might be associated with more than one Purchase order. Therefore, we add supnr
as foreign key to Purchase_order
since each purchase order is associated with exactly one supplier. This is called a one-to-many relation.
ponr | podate | supnr |
---|---|---|
12 | 2019-11-10 | 32 |
13 | 2019-11-14 | 32 |
14 | 2019-12-24 | 72 |
The entire database now looks like this
Product(prodnr*, prodname, prodtype, quantity)
Supplier(supnr*, supname, supadress, supcity)
Purchase_order(ponr*, podate, supnr_)
Joins are a way to combine attributes from two tables that are having a relation. There are different types of joins, but in the scope of this article we will only consider left joins. A left join between two tables on a foreign key replaces all values of this key in the first table with the attributes of the row that this key identifies in the other table. To give a concrete example: Left joining Purchase_order
and Supplier
on the foreign key supnr
results in the following table
ponr | podate | supname | supadress | supcity |
---|---|---|---|---|
12 | 2019-11-10 | Apple Farm | 41, Lansing Rd | Lansing |
13 | 2019-11-14 | Apple Farm | 41, Lansing Rd | Landing |
14 | 2019-12-24 | Tony's Tools | 142, Jackson St | Chicago |
The information on a given Supplier
has been added to each row in Purchase_order
based on the matching supnr
attribute.
# Python
# Add foreign key
purchase_order["supnr"] = [32, 32, 72]
# Left join
purchase_order.join(supplier, how='left', on='supnr')
# R
# Add foreign key
purchase.order["supnr"] <- c(32, 32, 72)
# Left join
purchase.order <- left_join(purchase.order, supplier, by='supnr')
--"SQL"
--"Foreign keys are also defined on table definiton"
DROP TABLE IF EXISTS Purchase_order;
CREATE TABLE Purchase_order
(ponr INTEGER NOT NULL PRIMARY KEY,
podate TEXT,
supnr TEXT NOT NULL,
FOREIGN KEY (supnr) REFERENCES Supplier (supnr)
);
INSERT into Purchase_order VALUES(12, "2019-11-10", 32);
INSERT into Purchase_order VALUES(13, "2019-11-14", 32);
INSERT into Purchase_order VALUES(14, "2019-12-24", 72);
--"Left join"
SELECT * from Purchase_order
LEFT JOIN Supplier
ON Purchase_order.supnr = Supplier.supnr;
A many-to-many relation exists if each row in one table is associated with many rows of a second one and each row of the second is in turn again related to many rows of the first. You can image a many-to-many relation as a one-to-many plus a many-to-one relation.
In our example a many-to-many relation between Supplier
and Product
exists. Each supplier can supply more than one product, but each product can in turn be supplied by more than one supplier. However, since it is not allowed to have multivalue attribute, we can neither add supnr
as foreign key to Product
nor can we add prodnr
as foreign key to Supplier
.
This dilemma is solved by adding a new table Supplies
with two foreign keys that together form the primary key of a new relation
Supplies([supnr_, prodnr_]*, purchase_price)
supnr | prodnr | purchase_price |
---|---|---|
32 | 44321 | 2.99 |
32 | 78234 | 13.5 |
72 | 78234 | 12 |
12 | 41234 | 3.2 |
12 | 53153 | 2.13 |
Note the new attribute purchase_price
that has been added to the relation.
# Python
# Generate primary key tuple
supnr = [32, 32, 72, 12, 12]
prodnr = [44321, 78234, 78234, 41234, 53153]
index_tuples = [(s_, p_) for (s_,p_) in zip(supnr, prodnr)]
index = pd.MultiIndex.from_tuples(index_tuples, names=['supnr', 'prodnr'])
# Add attributes
supplies_data = {'purchase_price' : [2.99, 13.5, 12, 3.2, 2.13]}
supplies = pd.DataFrame(data=supplies_data).set_index(index)
# Left join
supplies.merge(supplier, right_index=True, left_index=True).merge(product, right_index=True, left_index=True)
# R
# Create new tibble
supplies <- tibble(
supnr = c(32, 32, 72, 12, 12),
prodnr = c(44321, 78234, 78234, 41234, 53153),
purchase.price = c(2.99, 13.5, 12, 3.2, 2.13)
)
# Left join
supplies <- left_join(left_join(supplies, supplier, by='supnr'),
product, by='prodnr')
print(supplies)
--"SQL"
DROP TABLE IF EXISTS Supplies;
CREATE TABLE Supplies
(supnr TEXT NOT NULL,
prodnr TEXT NOT NULL,
purchase_price REAL,
PRIMARY KEY (supnr, prodnr),
FOREIGN KEY (supnr) REFERENCES Supplier(supnr),
FOREIGN KEY (prodnr) REFERENCES Product (prodnr)
);
INSERT into Supplies VALUES(32, 44321, 2.99);
INSERT into Supplies VALUES(32, 78234, 13.5);
INSERT into Supplies VALUES(72, 78234, 12);
INSERT into Supplies VALUES(12, 41234, 3.2);
INSERT into Supplies VALUES(12, 53153, 2.13);
select * from Supplies
LEFT JOIN Supplier
ON Supplies.supnr = Supplier.supnr
LEFT JOIN Product
ON Supplies.prodnr = Product.prodnr;
Another many-to-many relation in the supply management database of your department store is between Product
and Purchase_order
because you can ask the question which product(s) were purchased in which purchase_order(s).
PO_line([ponr_, prodnr_]*, quantitiy)
ponr | prodnr | quantity_purchased |
---|---|---|
12 | 44321 | 200 |
12 | 78234 | 50 |
13 | 78234 | 20 |
# Python
ponr = [12, 12, 13]
prodnr = [44321, 78234, 78234]
index_tuples = [(s_, p_) for (s_,p_) in zip(ponr, prodnr)]
index = pd.MultiIndex.from_tuples(index_tuples, names=['ponr', 'prodnr'])
po_line_data = {'quantity_purchased' : [200, 50, 20]}
po_line = pd.DataFrame(data=po_line_data).set_index(index)
po_line.merge(purchase_order, right_index=True, left_index=True).merge(product, right_index=True, left_index=True)
# R
po.line <- tibble(
ponr = c(12, 12, 13),
prodnr = c(44321, 78234, 78234),
quantity.purchased = c(200, 50, 20)
)
# Nested joins can be done elegantly using R pipes (%>%)
po.line <- po.line %>% left_join(purchase.order, by='ponr') %>%
left_join(product, by='prodnr')
print(po.line)
--"SQL"
DROP TABLE IF EXISTS Po_line;
CREATE TABLE Po_line
(ponr TEXT NOT NULL,
prodnr TEXT NOT NULL,
quantity_purchased INTEGER,
PRIMARY KEY (ponr, prodnr),
FOREIGN KEY (ponr) REFERENCES Purchase_order (ponr),
FOREIGN KEY (prodnr) REFERENCES Product (prodnr)
);
INSERT into Po_line VALUES(12, 44321, 200);
INSERT into Po_line VALUES(12, 78234, 50);
INSERT into Po_line VALUES(13, 78234, 20);
select * from Po_line
LEFT JOIN Purchase_order
ON Po_line.ponr = Purchase_order.ponr
LEFT JOIN Product
ON Po_line.prodnr = Product.prodnr;
The complete database now looks like this
Supplier(*supnr, supname, supadress, supcity)
Product(*prodnr, prodname, prodtype, quantity)
Supplies(*[supnr_, prodnr_], purchase_price)
Purchase_order(*ponr, podate)
PO_line(*[ponr_, prodnr_], quantitiy)
The following figure illustrates the tables in the database and their relation to each other. It was created from the SQLite database file created by the commands above using sqleton.
./relational_data_model.svg
I hope that this notebook helped you getting your head around the relational data model. Check out the blog section on https://get.ml to see examples of data analyses based on the relational data model that solve real world problems using machine learning.