Relational Databases
A large portion of applications use relational databases to store information as they allow us to work with large amounts of data efficiently. The general idea is that you can create tables and relations between values in those tables.
Raw SQL
Structured Query Language (SQL) comes in many dialects, since we’re using PostgreSQL as our relational database we’ll be using the PostgreSQL dialect.
Lets explore databases by directly communicating with them.
We can do this using a command line tool.
For example, we can use the tool pgcli.
Grab the premade sandbox
In your own time you can figure out how to install postgresql and pgcli.
For the purposes of this tutorial we have provided a premade docker setup that provide an out of the box sandbox.
To Do
- Clone madetech/relational-database-workshop
- Follow the README for pgcli. We will get to the Koans later.

Create a database
The sandbox environment comes out of the box with a database premade, called workshop_one.
You can also create your own database
CREATE DATABASE my_fancy_database;
By default you will be connected to workshop_one database.
You can USE your new database by doing the following
USE my_fancy_database;
This lets us execute queries against postgres without specifying the database in each query, which saves a lot of time.
Create a table
In postgresql we can show all tables in our database using the following command
\dt
Now that we have a database we need to tell postgresql what our schema is.
Without a schema, we aren’t able to store any data.
Let’s define a simple table using the following query.
CREATE TABLE fruit (
id SERIAL PRIMARY KEY,
name VARCHAR NOT NULL
);
- What does
PRIMARY KEYmean? - What does
VARCHARmean? - What does
NOT NULLmean? - Why do you think we name the table singular
fruitinstead offruits?
You can see this table by rerunning \dt. You should see something like the following:
+------+-----+-----+--------+
|Schema|Name |Type |Owner |
|------+-----+-----+--------|
|public|fruit|table|workshop|
+------+-----+-----+--------+
Insert
We don’t have any data in our table yet because we just created it.
Lets go ahead and add a row:
INSERT INTO fruit ( name )
VALUES (
'Orange'
);
If you get an error about a missing column Orange, make sure you are using the correct quotes!
- Why do you think we don’t need to specify a value for id?
Select
We can then query this new row with a SELECT statement
SELECT * FROM fruit;
To Do
- Add a few more fruit using insert and run the
SELECTagain.
Joins
Using a basic SELECT query we can fetch data from a single table.
If we want to include related data (relational), we can use a JOIN with our SELECT.
Assuming we still have some data in our fruit table from earlier…
Let’s go ahead and create another table…
CREATE TABLE harvest (
id SERIAL PRIMARY KEY,
fruit_id INT NOT NULL REFERENCES fruit(id),
date DATE NOT NULL,
yield INT NOT NULL
);
and add a row…
INSERT INTO harvest ( fruit_id, date, yield )
VALUES (
1, '2018-01-01', 102919
);
- What do you think that
REFERENCES fruit(id)does?
Running another simple SELECT query should yield our new harvest.
SELECT * FROM harvest;
Unfortunately, fruit_id isn’t very useful information - it would be more useful to have the name of the fruit in the result set.
One solution could be to do another query and manually join the tables together, but this isn’t very efficient.
Instead, in postgres we can perform an INNER JOIN as part of a SELECT query (more on different join types soon) with the following sql
Idiomatic PostgreSQL Syntax
SELECT * FROM harvest, fruit
WHERE fruit.id = harvest.fruit_id;
Alternative Syntax
SELECT * FROM harvest
INNER JOIN fruit ON fruit.id = harvest.fruit_id;
- What do you think of the two syntax approaches?
Notice we use the * to which grabs all columns.
It is good practice to specify which columns you want to query.
The main reason for this is that it’s not possible to distinguish between columns with the same name (e.g. harvest.id and fruit.id) in the result set.
Another reason to limit columns is performance related (memory and cpu).
SELECT harvest.date, harvest.yield, fruit.name
FROM harvest, fruit
WHERE fruit.id = harvest.fruit_id;
It is also possible to alias columns in the result set
SELECT
harvest.date as harvest_date,
harvest.yield as harvest_yield,
fruit.name as fruit_name
FROM harvest, fruit
WHERE fruit.id = harvest.fruit_id;
Inner Join
Go ahead and add five more fruit
INSERT INTO fruit ( name )
VALUES ('Kiwi'),
('Mango'),
('Pineapple'),
('Guava'),
('Tomato');
Let’s rerun our INNER JOIN from earlier…
SELECT
harvest.date as harvest_date,
harvest.yield as harvest_yield,
fruit.name as fruit_name
FROM harvest, fruit
WHERE fruit.id = harvest.fruit_id;
- Do you notice that we only get results where there is both a fruit and a harvest.
Even if we make the “left hand” table fruit and use the alternative syntax the result is the same
SELECT
harvest.date as harvest_date,
harvest.yield as harvest_yield,
fruit.name as fruit_name
FROM fruit
INNER JOIN harvest ON fruit.id = harvest.fruit_id;
This is because an INNER JOIN is the most restrictive type of join.
Left Join
Lets say we wanted to get all fruit and their harvests (if they have one):
SELECT
harvest.date as harvest_date,
harvest.yield as harvest_yield,
fruit.name as fruit_name
FROM fruit
LEFT JOIN harvest ON fruit.id = harvest.fruit_id;
- What happened, and why did that happen?
What happens if you swap fruit and harvest tables?
SELECT
harvest.date as harvest_date,
harvest.yield as harvest_yield,
fruit.name as fruit_name
FROM harvest
LEFT JOIN fruit ON fruit.id = harvest.fruit_id;
- What does that tell us about
LEFTand what leftness means?
Right Join
SELECT
harvest.date as harvest_date,
harvest.yield as harvest_yield,
fruit.name as fruit_name
FROM harvest
RIGHT JOIN fruit ON fruit.id = harvest.fruit_id;
Compare the results you get to the LEFT JOIN queries…
- What do you think
RIGHTand rightness means?
Full Outer Join
Try out this query:
SELECT
harvest.date as harvest_date,
harvest.yield as harvest_yield,
fruit.name as fruit_name
FROM harvest
FULL OUTER JOIN fruit ON fruit.id = harvest.fruit_id;
And this query
SELECT
harvest.date as harvest_date,
harvest.yield as harvest_yield,
fruit.name as fruit_name
FROM fruit
FULL OUTER JOIN harvest ON fruit.id = harvest.fruit_id;
- What do you think
FULL OUTER JOINdoes?
Altering tables
Suppose we want to adjust the schema to add an extra field discontinued_on
One solution could be to DROP TABLE fruit; and recreate, but this would delete the data contained within.
Instead we can use an ALTER TABLE query.
ALTER TABLE fruit
ADD COLUMN discontinued_on DATE NULL
If we then do a simple SELECT
SELECT * FROM fruit;
We should see the new column.
You can read all the options for ALTER TABLE at postgresql.org/docs.
Indexes
If we use WHERE clauses to query subsets of information, we can greatly reduce the time taken to perform these queries by using appropriate indexes.
It is good practice to ensure there are indexes on any columns use in WHERE or JOIN .. ON ... clauses.
- By default
PRIMARY KEYcolumns - It does not create an index for any foreign key
REFERENCES
We can list the indexes using this command
\di
We can add an index for fruit_id with this query
CREATE INDEX ON harvest (fruit_id);
Using Explain Analyse
If our tables contain very little data, postgres will automatically ignore any index and perform a basic “sequential scan” because that is quicker on small datasets.
We need to force postgres to use indexes. So lets first create lots of data (you’ll need to make a dataset with roughly 1500 or more rows).
INSERT INTO fruit ( name )
VALUES ('Kiwi'),
('Mango'),
('Pineapple'),
('Guava'),
('Tomato'),
('Kiwi'),
('Mango'),
('Pineapple'),
('Guava'),
('Tomato'),
... copy this many times ...
('Kiwi'),
('Mango'),
('Pineapple'),
('Guava'),
('Tomato');
Hint: You can find out the number of rows that you have created with this aggregate query
SELECT count(*) FROM fruit;
We can find out how postgres is going to “plan” to perform a query by prepending EXPLAIN ANALYSE to a query.
For example consider this query which uses our unindexed column discounted_on.
EXPLAIN ANALYSE SELECT * FROM fruit
WHERE discontinued_on > '2020-01-01';
You will see postgres will plan with Seq Scan on fruit.
If you add an index to this column
CREATE INDEX ON fruit (discontinued_on);
Now rerun the same EXPLAIN ANALYSE, you will see a plan with Index Scan using fruit_discontinued_on_idx on fruit which is much more efficient.
Since postgres is so fast, even at 1500 rows it’s unlikely you will see any performance difference.
Adding Indexes to Large Tables
Adding indexes to large existing tables, for example to fix a performance issue in production caused by a missing index can be incredibly problematic.
- By default
ADD INDEXwill lock a table for writes.
For large tables, adding an index could take several hours. This is problematic as this is effectively downtime from the perspective of your customers.
Always add indexes as early as possible as it’s easier to add indexes to small datasets.
Postgres does provide a way to CREATE INDEX CONCURRENTLY which doesn’t lock the table. There are a few caveats to this approach.
- Postgres may silently fail to create the index.
- These broken indexes still cause a penalty to writes
- Creating an index concurrently is more expensive and may slow down queries (since it uses CPU/Memory).
You can read about this in more depth on the postgres docs.
Ruby + Sequel Koans
We will be using a gem called sequel to provide a nice API to interact with PostgreSQL.
Follow the section in the madetech/relational-database-workshop repository readme to start working on the Koans.
Useful Documentation Links
You can get to Sequel documentation here.
These specific pages might be useful to you.