SQL: we all pretend to be experts at it, and mostly get away with it thanks to StackOverflow. Paired with our vast experience of learning how to code in the 90s, our field work of PHPMyAdmin and LAMP stacks basically makes us experts. Go ahead and chalk up a win for your resume.

SQL has been around longer than our careers have, so why start a series on it now? Surely there’s sufficient enough documentation that we can Google the specifics whenever the time comes for us to write a query? That, my friends, is precisely the problem. Regardless of what tools we have at our disposable, some skills are better learned and practiced by heart. SQL is one of those skills.

Sure, SQLAlchemy or similar ORMs might protect us here-and-there from writing raw queries. Considering SQL is just one of many query languages we'll use regularly (in addition to NoSQL, GraphQL, JQL, etc.), is becoming a SQL expert really that critical? In short, yes: relational databases are not only here to stay, but thinking in queries as a second language solidifies one's understanding of the fine details of data. Marc Laforet recently published a Medium post which drives home just how important leaning on SQL is:

What’s even more interesting is that when these transformation scripts were applied to the 6.5 GB dataset, python completely failed. Out of 3 attempts, python crashed 2 times and my computer completely froze the 3rd time… while SQL took 226 seconds.

Keeping logic out of our apps and pipelines and in SQL results in exponentially faster execution, while also being more readable and universally understood than whatever we’d write in our language of choice. The lower down we can push application logic in our stack, the better. This is why I’d much prefer to see the datasphere saturated with SQL tutorials as opposed to Pandas tutorials.

Relational Database Terminology

I hate it when informational material kicks off with covering obvious terminology definitions. Under normal circumstances, I find this to be cliche, unhelpful, and damaging to an author's credibility; but these aren't normal circumstances. In SQL, vocabulary commonly has multiple meanings depending on context, or even which flavor database you're using. Given this fact, it's entirely possible (and common) for individuals to rack up experience with relational databases while completely misinterpreting fundamental concepts. Let's make sure that doesn't happen:

  • Databases: Every Database instance is separated at the highest level into databases. Yes, a database is a collection of databases - we're already off to a great start.
  • Schemas: In PostgreSQL (and other databases), a schema is a grouping of tables and other objects, including views, relations, etc. A schema is a way of organizing data. Schemas imply that all the data belonging to it is at some form related, even if only by concept. Note that the term schema is sometimes used to describe other concepts depending on the context.
  • Tables: The meat and potatos of relational databases. Tables consist of rows and columns which hold our sweet, sweet data. Columns are best thought of as 'attributes', whereas rows are entries which consist of values for said attributes. All values in a column must share the same data type.
    • Keys: Keys are used to help us organize and optimize data, as well as place certain constraints on data coming in (for example, email addresses of user accounts must be unique). Keys can also help us keep count of our entries, ensure automatically unique values, and provide a bridge to link multiple tables of data.
      • Primary keys: Identification tags for each row of data. The primary key is different for every record in the relational database; values must be provided, and they must be unique between rows.
      • Foreign keys: Enable data searches and manipulation between the primary database table and other related databases.
  • Objects: A blanket term for anything (including relations) that exist in a schema (somewhat PostgreSQL-specific).
    • Views (PostgreSQL): Views display data in a fashion similar to tables, with the difference that views do not store data. Views are a snapshot of data pulled from other tables in the form of a query; a good way to think about views is to consider them to be 'virtual tables.'
    • Functions (PostgreSQL): Logic for interacting with data saved for the purpose of being reused.
In MySQL, a schema is synonymous with a database. These keywords can even be swapped to use SCHEMA and DATABASE interchangably in MySQL. Thus, using CREATE SCHEMA acheives the same effect as instead of CREATE DATABASE.

We've got to start somewhere, so it might as well be with database management. Admittedly, this will be the most useless of the things we'll cover. The act of navigating databases is best suited for a GUI.

Show Databases

If you access your database via command line shell (for some reason), the first logical thing to do is to list the available databases:

SHOW DATABASES;
 
+--------------------+
| Database           |
+--------------------+
| classicmodels      |
| information_schema |
| mysql              |
| performance_schema |
| sys                |
+--------------------+
5 rows in set (0.00 sec)

USE Database

Now that we've listed the possible databases we can connect to, we can explore what each of these contains. To do this, we have to specify which database we want to connect to, AKA "use."

db> USE database_name;
Database changed

Create Database

Creating databases is straightforward. Be sure to pay attention to the character set when creating a database: this will determine which types of characters your database will be able to accept. For example, if we try to insert special encoded characters into a simple UTF-8 database, those characters won’t turn out as we’d expect.

CREATE DATABASE IF NOT EXISTS database_name
CHARACTER SET utf-8
[COLLATE collation_name]

Bonus: here's the shorthand for creating a database and then showing the result:

SHOW CREATE DATABASE database_name;

Creating and Modifying Tables

Creating tables via SQL syntax can be critical when automating data imports. When creating a table, we also set the column names, types, and keys:

CREATE TABLE [IF NOT EXISTS] table_name (
   column_name_1 [COLUMN_DATA_TYPE] [KEY_TYPE] [KEY_ATTRIBUTES] DEFAULT [DEFAULT_VALUE],
   column_name_2 [COLUMN_DATA_TYPE] [KEY_TYPE] [KEY_ATTRIBUTES] DEFAULT [DEFAULT_VALUE],
   PRIMARY KEY (column_name_1)
) ENGINE=[ENGINE_TYPE];

We can specify IF NOT EXISTS when creating our table if we'd like to include validation in our query. When present, the table will only be created if a table of the specified name does not exist.

When creating each of our columns, there are a number of things we can specify per-column:

  • Data Type (required): The data which can be saved to cells of this column (such as INTEGER, TEXT, etc).
  • Key Type: Creates a key for the column.
  • Key Attributes: Any key-related attributes, such as auto-incrementing.
  • Default: If rows are created in the table without values passed to the current column, the value specified as DEFAULT
  • Primary Key: Allows any of the previous specified columns to be set as the table's primary key.

MySQL tables can have a 'storage engine' specified via ENGINE=[engine_type], which determines the core logic of how the table will interpret data. Leaving this blank defaults to InnoDB and is almost certainly fine to be left alone. In case you're interested, you can find more about MySQL engines here.

Here's an example of what an actual CREATE TABLE query would look like:

CREATE TABLE IF NOT EXISTS awards (
   id INTEGER PRIMARY KEY AUTO_INCREMENT,
   recipient TEXT NOT NULL,
   award_name TEXT DEFAULT 'Grammy',
   PRIMARY KEY (id)
) ENGINE=INNODB;

Managing Keys for Existing Tables

If we don't specify our keys at table creation time, we can always do so after the fact. SQL tables can accept the following key types:

  • Primary Key: One or more fields/columns that uniquely identify a record in the table. It can not accept null, duplicate values.
  • Candidate Key: Candidate keys are kind of like groups of non-committed Primary Keys; these keys only accept unique values, and could potentially be used in the place of a Primary Key if need be, but are not actual Primary Keys. Unlike Primary Keys, multiple Candidate Keys may exist per table.
  • Alternate Key: Refers to a single Candidate Key (an alternative which can satisfy the duty of a Primary Key id need be).
  • Composite/Compound Key: Defined by combing the values of multiple columns; the sum of which will always produce a unique value. There can be multiple Candidate Keys in one table. Each Candidate Key can work as Primary Key.
  • Unique Key: A set of one or more fields/columns of a table that uniquely identify a record in a database table. Similar to Primary key, but it can accept only one null value, and it can not have duplicate values.
  • Foreign Key: Foreign keys denote fields that serve as another table's Primary key. Foreign keys are useful for building relationships between tables. While a foreign key is required in the parent table where they are primary, foreign keys can be null or empty in the tables intended to relate to the other table.

Let's look at an example query where we add a key to a table and dissect the pieces:

ALTER TABLE table_name
ADD FOREIGN KEY foreign_key_name (column_name)
REFERENCES parent_table(columns)
ON DELETE { NO ACTION | CASCADE | SET NULL | SET DEFAULT }
ON UPDATE { NO ACTION | CASCADE | SET NULL | SET DEFAULT }

ALTER TABLE is used to make any changes to a table's structure, whether that be modifying columns or keys.

In this example, we ADD a key which happens to be a FOREIGN KEY. While keys always refer to columns, keys themselves must have names of their own to distinguish the column's data and a key's conceptual logic. We name our key foreign_key_name and specify which column the key will act on with (column_name). Because this is a foreign key, we need to specify which table's primary key we want this to be associated with. REFERENCES parent_table(primary_key_column) is stating that the foreign key in this table corresponds to values held in a column named primary_key_column, in a table named parent_table.

The statements ON DELETE and ON UPDATE are actions which take place if the parent table's primary key is deleted or updated, respectively. ON DELETE CASCADE would result in our tables foreign key being deleted if the corresponding primary key were to disappear.

Adding Columns

Adding columns follows the same syntax we used when creating tables. An interesting additional feature is the ability to place the new column before or after preexisting columns:

ALTER TABLE table_name
ADD COLUMN column_name [DATA_TYPE] [FIRST|AFTER existing_column];
When referencing tables in PostgreSQL databases, we must specify the schema belongs to. Thus, ALTER TABLE table_name becomes ALTER TABLE schema_name.table_name. This applies to any time we reference tables, including when we create and delete tables.

Pop Quiz

The below statement uses elements of everything we've learned about modifying and creating table structures thus far. Can you discern what is happening here?

CREATE TABLE vendors(
    vdr_id int not null auto_increment primary key,
    vdr_name varchar(255)
)ENGINE=InnoDB;
 
ALTER TABLE products 
ADD COLUMN vdr_id int not null AFTER cat_id;

ALTER TABLE products
ADD FOREIGN KEY fk_vendor(vdr_id)
REFERENCES vendors(vdr_id)
ON DELETE NO ACTION
ON UPDATE CASCADE;

Dropping Data

DANGER ZONE: this is where we can start to mess things up. Dropping columns or tables results in a complete loss of data: whenever you see the word "drop," be scared.

If you're sure you know what you're doing and would like to remove a table column, this can be done as such:

ALTER TABLE table
DROP column;

Dropping a table destroys the table structure as well as all data within it:

DROP TABLE table_name;

Truncating a table, on the other hand, will purge the table of data but retain the table itself:

TRUNCATE TABLE table_name;

Drop Foreign Key

Like tables and columns, we can drop keys as well:

ALTER TABLE table_name 
DROP FOREIGN KEY constraint_name;

This can also be handed by dropping CONSTRAINT:

ALTER TABLE public.jira_epiccolors
DROP CONSTRAINT jira_epiccolors_pkey;

Working with Views (Specific to PostgreSQL)

Lastly, let's explore the act of creating views. There are three types of views PostgreSQL can handle:

  • Simple Views: Virtual tables which represent data of underlying tables. Simple views are automatically updatable: the system will allow INSERT, UPDATE and DELETE statements to be used on the view in the same way as on a regular table.
  • Materialized Views: PostgreSQL extends the view concept to a next level that allows views to store data 'physically', and we call those views are materialized views. A materialized view caches the result of a complex query and then allow you to refresh the result periodically.
  • Recursive Views: Recursive views are a bit difficult to explain without delving deep into the complicated (but cool!) functionality of recursive reporting. I won't get into the details, but these views are able to represent relationships which go multiple layers deep. Here's a quick taste, if you;re curious:

Sample RECURSIVE query:

WITH RECURSIVE reporting_line AS (
 SELECT
 employee_id,
 full_name AS subordinates
 FROM
 employees
 WHERE
 manager_id IS NULL
 UNION ALL
 SELECT
 e.employee_id,
 (
 rl.subordinates || ' > ' || e.full_name
 ) AS subordinates
 FROM
 employees e
 INNER JOIN reporting_line rl ON e.manager_id = rl.employee_id
) SELECT
 employee_id,
 subordinates
FROM
 reporting_line
ORDER BY
 employee_id;

Output:

 employee_id |                         subordinates
-------------+--------------------------------------------------------------
           1 | Michael North
           2 | Michael North > Megan Berry
           3 | Michael North > Sarah Berry
           4 | Michael North > Zoe Black
           5 | Michael North > Tim James
           6 | Michael North > Megan Berry > Bella Tucker
           7 | Michael North > Megan Berry > Ryan Metcalfe
           8 | Michael North > Megan Berry > Max Mills
           9 | Michael North > Megan Berry > Benjamin Glover
          10 | Michael North > Sarah Berry > Carolyn Henderson
          11 | Michael North > Sarah Berry > Nicola Kelly
          12 | Michael North > Sarah Berry > Alexandra Climo
          13 | Michael North > Sarah Berry > Dominic King
          14 | Michael North > Zoe Black > Leonard Gray
          15 | Michael North > Zoe Black > Eric Rampling
          16 | Michael North > Megan Berry > Ryan Metcalfe > Piers Paige
          17 | Michael North > Megan Berry > Ryan Metcalfe > Ryan Henderson
          18 | Michael North > Megan Berry > Max Mills > Frank Tucker
          19 | Michael North > Megan Berry > Max Mills > Nathan Ferguson
          20 | Michael North > Megan Berry > Max Mills > Kevin Rampling
(20 rows)

Creating a View

Creating a simple view is as simple as writing a standard query! All that is required is the addition of CREATE VIEW view_name AS before the query, and this will create a saved place for us to always come back and reference the results of this query:

CREATE VIEW comedies AS
    SELECT *
    FROM films
    WHERE kind = 'Comedy';

Get Out There and Start SQLing

I highly encourage anybody to get in the habit of always writing SQL queries by hand. With the right GUI, autocompletion can be your best friend.

Explicitly forcing one's self to write queries instead of copy & pasting anything forces us to come to realizations, such as SQL's order of operations. Indeed, this query holds the correct syntax...

SELECT *
FROM table_name
WHERE column_name = 'Value';

...Whereas this one does not:

SELECT *
WHERE column_name = 'Value'
FROM table_name;

Grasping the subtleties of SQL is the difference between being blazing fast and mostly clueless. The good news is, you’ll start to find that these concepts aren’t nearly as daunting as they may have once seemed, so the track from ‘bad data engineer’ to ‘expert’ is an easy win that would be foolish not to take.

Stick around for next time where we actually work with data in SQL: The Sequel, rated PG-13.