SQL: We all pretend to be experts at it, and most of us get away with it thanks to a combination of StackOverflow searches and an overreliance on ORMs. Paired with our vast experience of learning to code in the 90s, our fieldwork with PHPMyAdmin and LAMP stacks basically makes us experts. Go ahead and chalk up a win for your resume.
SQL has been around longer than our careers have, so why start a series on it now? Surely there’s sufficient documentation to Google syntax whenever we need to write a query? That, my friends, is precisely the problem. Regardless of what tools we have, some skills are better learned and practiced by heart. SQL is one of those skills.
Sure, SQLAlchemy or similar ORMs might protect us here and there from writing raw queries. Considering SQL is just one of many query languages we'll use regularly (in addition to NoSQL, GraphQL, JQL, etc.), is becoming a SQL expert really that critical? In short, hard yes. Relational databases are not only here to stay, but thinking in queries as a second language solidifies one's understanding of the fine details of data. Marc Laforet recently published a Medium post which drives home just how important leaning on SQL is:
What’s even more interesting is that when these transformation scripts were applied to the 6.5 GB dataset, python completely failed. Out of 3 attempts, python crashed 2 times and my computer completely froze the 3rd time… while SQL took 226 seconds.
Keeping logic out of our apps and pipelines and in SQL results in dramatically faster execution, while also being more readable and universally understood than whatever we'd write in our language of choice. The lower we can push application logic in our stack, the better. This is why I'd much prefer to see the data sphere saturated with SQL tutorials instead of Pandas tutorials.
Relational Database Terminology
I hate it when informational material kicks off by covering obvious terminology definitions. Under normal circumstances, I find this to be cliché, unhelpful, and damaging to an author's credibility, but these aren't normal circumstances. In SQL, vocabulary has multiple meanings depending on the context or the flavor of database you're using. Paired with the prevalence of ORMs, it's not only possible for engineers to "get by" with a poor understanding of SQL; this trend is becoming the norm.
Let's make sure that doesn't happen to us by understanding SQL terminology that's used when connected to an SQL shell:
- Databases: Every database server instance is separated at the highest level into databases. Yes, a database is a collection of databases - we're already off to a great start.
- Schemas: A schema is the underlying structure of a table (i.e., what a `CREATE TABLE` statement defines): the structure of its columns, keys, and so forth.
  - NOTE: In Postgres, the term "schema" may refer to something in addition to the above definition. By default, Postgres allows users to organize tables into tidy groups. Unfortunately, the term that was selected for these groups of tables is also "schema."
- Tables: Tables consist of rows and columns that hold our sweet, sweet data. Columns are best thought of as 'attributes,' whereas rows are entries consisting of values for said attributes. All values in a column must share the same data type.
- Keys: Database keys help us organize and optimize data reading and writing. Selecting which table columns are "keys" should complement the manner in which the data is used. Keys can also enforce constraints, such as ensuring all column values are unique.
- Primary keys: The main identifier for each row of data. The primary key is different for every record in the relational database. They typically take the form of incrementing integers or identifiers such as GUIDs.
- Foreign keys: Enable data searches and manipulation between one table and other related tables; a foreign key references another table's primary key.
- Objects: A blanket term for anything (including relations) that exists similarly in a schema (somewhat PostgreSQL-specific).
- Views (PostgreSQL): Views display data similarly to tables, with the difference that views do not store data. Views are a snapshot of data pulled from other tables in the form of a query; a good way to think about views is to consider them to be 'virtual tables.'
- Functions (PostgreSQL): Logic for interacting with data saved to be reused.
- NOTE: In MySQL, `CREATE SCHEMA` achieves the same effect as `CREATE DATABASE` - the two are synonyms.
Navigating and Creating Databases
We've got to start somewhere, so it might as well be with database management. Admittedly, this will be the most useless of the things we'll cover. The act of navigating databases is best suited for a GUI.
Show Databases
If you access your database via the command line shell (for some reason), the first logical thing to do is to list the available databases:
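In MySQL, that's a one-liner (in Postgres's psql shell, the `\l` meta-command does the same):

```sql
-- List all databases on the server (MySQL)
SHOW DATABASES;
```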
USE Database
Now that we've listed the possible databases we can connect to, we can explore what each of these contains. To do this, we specify which database we want to connect to, AKA "use."
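Switching the active database looks like this in MySQL (in psql, the equivalent is the `\c database_name` meta-command):

```sql
-- Make subsequent queries run against this database (MySQL)
USE database_name;
```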
Create Database
Creating databases is straightforward. Pay attention to the character set when creating a database: this determines which characters your database will accept. For example, if we try to insert emoji or other 4-byte characters into a database using MySQL's 3-byte `utf8` charset, those characters won't turn out as we'd expect (`utf8mb4` handles the full Unicode range).
CREATE DATABASE IF NOT EXISTS database_name
CHARACTER SET utf8
[COLLATE collation_name];
Bonus: here's how to display the statement that would recreate an existing database, character set and all:
SHOW CREATE DATABASE database_name;
Creating and Modifying Tables
Creating tables via SQL syntax can be critical when automating data imports. When creating a table, we also set the column names, types, and keys:
CREATE TABLE [IF NOT EXISTS] table_name (
column_name_1 [COLUMN_DATA_TYPE] [KEY_TYPE] [KEY_ATTRIBUTES] DEFAULT [DEFAULT_VALUE],
column_name_2 [COLUMN_DATA_TYPE] [KEY_TYPE] [KEY_ATTRIBUTES] DEFAULT [DEFAULT_VALUE],
PRIMARY KEY (column_name_1)
) ENGINE=[ENGINE_TYPE];
We can specify `IF NOT EXISTS` when creating our table if we'd like to include validation in our query. When present, the table will only be created if a table of the specified name does not already exist.
When creating each of our columns, there are several things we can specify per column:
- Data Type (required): The data that can be saved to cells of this column (such as INTEGER, TEXT, etc).
- Key Type: Creates a key for the column.
- Key Attributes: Any key-related attributes, such as auto-incrementing.
- Default: If rows are created in the table without values passed to the current column, the value specified as `DEFAULT` will be used instead.
- Primary Key: Allows any previously defined column to be set as the table's primary key.
MySQL tables can have a 'storage engine' specified via `ENGINE=[engine_type]`, which determines the core logic of how the table will store and interpret data. Leaving this blank defaults to InnoDB, which is usually fine to leave alone. You can find more about MySQL engines here.
Below is an example of what an actual `CREATE TABLE` query looks like. Note the order in which things are defined: table name, columns, keys, engine, charset, and collation:
CREATE TABLE IF NOT EXISTS `users` (
`id` varchar(24) NOT NULL,
`name` varchar(191) NOT NULL,
`slug` varchar(191) NOT NULL,
`password` varchar(60) NOT NULL,
`email` varchar(191) NOT NULL,
`profile_image` varchar(2000) DEFAULT NULL,
`cover_image` varchar(2000) DEFAULT NULL,
`bio` text,
`website` varchar(2000) DEFAULT NULL,
`location` text,
`status` varchar(50) NOT NULL DEFAULT 'active',
`locale` varchar(6) DEFAULT NULL,
`created_at` datetime NOT NULL,
`updated_at` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `users_slug_unique` (`slug`),
UNIQUE KEY `users_email_unique` (`email`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
Managing Keys for Existing Tables
If we don't specify our keys at table creation time, we can always do so after the fact. SQL tables can accept the following key types:
- Primary Key: One or more fields/columns that uniquely identify a record in the table. It cannot accept NULL or duplicate values.
- Candidate Key: Candidate keys are like non-committed Primary Keys. These keys only accept unique values and could potentially be used in place of a Primary Key if need be, but are not actual Primary Keys. Unlike Primary Keys, multiple Candidate Keys may exist per table.
- Alternate Key: Refers to a single Candidate Key (an alternative that can satisfy the duty of a Primary Key if need be).
- Composite/Compound Key: Defined by combining the values of multiple columns, the combination of which will always produce a unique value.
- Unique Key: A set of one or more fields/columns of a table that uniquely identify a record in a database table. Similar to a Primary Key, but a Unique Key can accept NULL values (how many depends on the database flavor) and cannot contain duplicates.
- Foreign Key: Foreign keys denote fields that reference another table's Primary Key. Foreign keys are useful for building relationships between tables. While the referenced key must exist in the parent table, a foreign key column can be NULL in the child table that holds the relationship.
Let's look at an example query where we add a key to a table and dissect the pieces:
ALTER TABLE table_name
ADD FOREIGN KEY foreign_key_name (column_name)
REFERENCES parent_table(columns)
ON DELETE { NO ACTION | CASCADE | SET NULL | SET DEFAULT }
ON UPDATE { NO ACTION | CASCADE | SET NULL | SET DEFAULT };
`ALTER TABLE` is used to make any changes to a table's structure, whether that be modifying columns or keys.
In this example, we `ADD` a key that happens to be a `FOREIGN KEY`. While keys always refer to columns, keys themselves need names of their own to distinguish a key's conceptual logic from the column's data. We name our key `foreign_key_name` and specify which column the key acts on with `(column_name)`. Because this is a foreign key, we need to specify which table's primary key it is associated with: `REFERENCES parent_table(columns)` states that the foreign key in this table corresponds to values held in the named column(s) of a table named `parent_table`.
The statements `ON DELETE` and `ON UPDATE` define actions that take place if the parent table's primary key is deleted or updated, respectively. `ON DELETE CASCADE` would result in this table's matching rows being deleted if the corresponding primary key row were to disappear.
Adding Columns
Adding columns follows the same syntax we used when creating tables. An interesting additional feature is the ability to place the new column before or after preexisting columns:
ALTER TABLE table_name
ADD COLUMN column_name [DATA_TYPE] [FIRST|AFTER existing_column];
NOTE: When tables are organized into PostgreSQL schemas, `ALTER TABLE table_name` becomes `ALTER TABLE schema_name.table_name`. This applies to any time we reference tables, including when we create and delete tables.
Pop Quiz
The below statement uses elements of everything we've learned about modifying and creating table structures thus far. Can you discern what is happening here?
CREATE TABLE vendors(
vdr_id int not null auto_increment primary key,
vdr_name varchar(255)
)ENGINE=InnoDB;
ALTER TABLE products
ADD COLUMN vdr_id int not null AFTER cat_id;
ALTER TABLE products
ADD FOREIGN KEY fk_vendor(vdr_id)
REFERENCES vendors(vdr_id)
ON DELETE NO ACTION
ON UPDATE CASCADE;
Dropping Data
DANGER ZONE: This is where we can start to mess things up. Dropping columns or tables results in a complete loss of data. Whenever you see the word "drop," be scared.
If you're sure you know what you're doing and would like to remove a table column, this can be done as such:
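A minimal example, using placeholder table and column names:

```sql
-- Permanently remove a column and all data stored in it
ALTER TABLE table_name
DROP COLUMN column_name;
```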
Dropping a table destroys the table structure as well as all data within it:
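The syntax is mercifully short, which is exactly why it's dangerous:

```sql
-- Remove the table structure and every row it contains
DROP TABLE table_name;
```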
Truncating a table, on the other hand, will purge the table of data but retain the table itself:
TRUNCATE TABLE table_name;
Drop Foreign Key
Like tables and columns, we can drop keys as well:
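In MySQL, a foreign key is dropped by the name we gave it when it was created:

```sql
-- Remove the key itself (the underlying column and its data remain)
ALTER TABLE table_name
DROP FOREIGN KEY foreign_key_name;
```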
This can also be handled by dropping the key's `CONSTRAINT`:
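Foreign keys are implemented as constraints under the hood, so the standard-SQL spelling works as well (in PostgreSQL, and in MySQL 8.0.19+):

```sql
-- Drop the key via its constraint name
ALTER TABLE table_name
DROP CONSTRAINT constraint_name;
```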
Working with Views (Specific to PostgreSQL)
Lastly, let's explore the act of creating views. There are three types of views PostgreSQL can handle:
- Simple Views: Virtual tables that represent data of underlying tables. Simple views are automatically updatable: the system will allow `INSERT`, `UPDATE`, and `DELETE` statements to be used on the view in the same way as on a regular table.
- Materialized Views: PostgreSQL extends the view concept to the next level by allowing views to store data 'physically'; we call these materialized views. A materialized view caches the result of a complex query and then allows you to refresh the result periodically.
- Recursive Views: Recursive views are difficult to explain without delving deep into the complicated (but cool!) functionality of recursive reporting. I won't get into the details, but these views are able to represent relationships that go multiple layers deep. Here's a quick taste if you're curious:
Sample `RECURSIVE` query:
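A minimal sketch of the idea, assuming a hypothetical `employees` table with a self-referencing `manager_id` column:

```sql
-- A recursive view pairing every manager with all of their
-- direct and indirect reports (the `employees` table is hypothetical)
CREATE RECURSIVE VIEW reporting_chain (manager_id, employee_id) AS
    SELECT manager_id, id AS employee_id
    FROM employees
    WHERE manager_id IS NOT NULL
  UNION ALL
    SELECT rc.manager_id, e.id
    FROM reporting_chain rc
    JOIN employees e ON e.manager_id = rc.employee_id;
```

Under the hood, PostgreSQL rewrites this into an equivalent `WITH RECURSIVE` query, which is why the view can reference its own name.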
Creating a View
Creating a simple view is as simple as writing a standard query! All that's required is adding `CREATE VIEW view_name AS` before the query, and this will create a saved place for us to always come back and reference the results of this query:
CREATE VIEW comedies AS
SELECT *
FROM films
WHERE kind = 'Comedy';
Get Out There and Start SQLing
I highly encourage anybody to get in the habit of always writing SQL queries by hand. With the right GUI, autocompletion can be your best friend.
Explicitly forcing yourself to write queries (instead of copy & pasting) forces us to come to realizations, such as SQL's order of operations. Indeed, this query holds the correct syntax:
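For example, with the same placeholder names used below:

```sql
-- FROM comes before WHERE: filter clauses follow the table reference
SELECT *
FROM table_name
WHERE column_name = 'Value';
```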
...Whereas this one does not:
SELECT *
WHERE column_name = 'Value'
FROM table_name;
Grasping the subtleties of SQL is the difference between being blazing fast and mostly clueless. The good news is, you’ll start to find that these concepts aren’t nearly as daunting as they may have once seemed, so the track from ‘bad data engineer’ to ‘expert’ is an easy win that would be foolish not to take.
Catch us next time where we actually work with data in SQL: The Sequel, rated PG-13. (Just kidding, it's actually right here):