Welcome to SQL 2: Selecting, Updating, and Deleting Data

Now that we've gotten the fundamentals of creating databases and tables out of the way, we can start getting into the meat and potatoes of SQL interactions: selecting, updating, and deleting data.

We'll start with the basic structure of these queries and then break into the powerful operations with enough detail to make you dangerous.

Selecting Data From a Table

As mentioned previously, SQL operations have a rather strict order of operations which clauses have to respect in order to make a valid query. We'll begin by dissecting a common SELECT statement:

SELECT
  column_name_1,
  column_name_2
FROM
  schema_name.table_name
WHERE
  column_name_1 = "Value";

This is perhaps the most common structure of SELECT queries. First, we list the names of the columns we'd like to select separated by commas. To receive all columns, we can simply say SELECT *.

These columns need to come from somewhere, so we specify the table we're referring to next. This either takes the form of FROM table_name (non-PostgreSQL), or FROM schema_name.table_name (PostgreSQL). In theory, a semicolon here would result in a valid query, but we usually want to select rows that meet certain criteria.

This is where the WHERE clause comes in: only rows which return "true" for our WHERE conditional will be returned. In the above example, we're validating that a string matches exactly "Value".

Selecting only Distinct Values

Something that often comes in handy is selecting distinct values in a column. In other words, if a value exists in the same column in 100 rows, running DISTINCT query will only show us that value once. This is a good way of seeing the unique content of a column without yet diving into the distribution of said value. The effect is similar to the United States Senate, or the Electoral College: forget the masses, and prop up Wyoming 2020:

SELECT DISTINCT column_name 
FROM table_name;

Offsetting and Limiting Results in our Queries

When selecting data, the combination of OFFSET and LIMIT are critical at times. If we're selecting from a database with hundreds of thousands of rows, we would be wasting an obscene amount of system resources to fetch all rows at once; instead, we can have our application or API paginate the results.

LIMIT is followed by an integer, which in essence says "return no more than X results."

OFFSET is also followed by an integer, which denotes a numerical starting point for returned results, aka: "return all results which occur after the Xth result:"

SELECT
 *
FROM
 table_name
LIMIT 50 OFFSET 0;

The above returns the first 50 results. If we wanted to build paginated results on the application side, we could construct our query like this:

from SQLAlchemy import engine, session

# Set up a SQLAlchemy session
Session = sessionmaker()
engine = create_engine('sqlite:///example.db')
Session.configure(bind=engine)
sess = Session()

# Appication variables
page_number = 3
page_size = 50
results_subset = page_number * results limit

# Query
session.query(TableName).limit(page_size).offset(results_subset)

Such an application could increment page_number by 1 each time the user clicks on to the next page, which would then appropriately modify our query to return the next page of results.

Another use for OFFSET could be to pick up where a failed script left off. If we were to write an entire database to a CSV and experience a failure. We could pick up where the script left off by setting OFFSET equal to the number of rows in the CSV, to avoid running the entire script all over again.

Sorting Results

Last to consider for now is sorting our results by using the ORDER BY clause. We can sort our results by any specified column, and state whether we'd like the results to be ascending (ASC) or descending (DESC):

SELECT
  *
FROM
  schema_name.table_name
WHERE
  column_name_1 = "Value"
ORDER BY
  updated_date DESC
LIMIT 50 OFFSET 10;

Sophisticated SELECT Statements

Of course, we can select rows with WHERE logic that goes much deeper than an exact match. One of the most versatile of these operations is LIKE.

Using Regex with LIKE

LIKE is perhaps the most powerful way to select columns with string values. With LIKE, we can leverage regular expressions to build highly complex logic. Let's start with some of my favorites:

SELECT
  *
FROM
  people
WHERE
  name LIKE "%Wade%";

Passing a string to LIKE with percentage signs on both sides is essentially a "contains" statement. % is equivalent to a wildcard, thus placing % on either side of our string will return true whether the person's first name, middle name, or last name is Wade. Check out other useful combinations for %:

a%: Finds any values that start with "a".
%a: Finds any values that end with "a".
%or%: Finds any values that have "or" in any position.
_r%: Finds any values that have "r" in the second position.
a_%_%: Finds any values that start with "a" and are at least 3 characters in length.
a%o: Finds any values that start with "a" and ends with "o".

Finding Values that are NOT LIKE

The opposite of LIKE is of course NOT LIKE, which runs the same conditional, but returns the opposite true/false value of LIKE:

SELECT
  *
FROM
  people
WHERE
  name NOT LIKE "%Wade%";

Conditionals With DateTime Columns

DateTime columns are extremely useful for selecting data. Unlike plain strings, we can easily extract numerical values for month, day, and year from a DateTime by using MONTH(column_name), DAY(column_name), and YEAR(column_name) respectively. For example, using MONTH() on a column that contains a DateTime of 2019-01-26 05:42:34 would return 1, aka January. Because the values come back as integers, it is then trivial to find results within a date range:

SELECT 
  * 
FROM 
  posts 
WHERE YEAR(created_at) < 2018;

Finding Rows with NULL Values

NULL is a special datatype which essentially denotes the "absence of something," therefore no conditional will never equal NULL. Instead, we find rows where a value IS NULL:

SELECT 
  * 
FROM 
  posts 
WHERE author IS NULL;

This should not come as a surprise to anybody familiar with validating datatypes.

The reverse of this, of course, is NOT NULL:

SELECT 
  * 
FROM 
  posts 
WHERE author IS NOT NULL;

Inserting Data

An INSERT query creates a new row, and is rather straightforward: we state the columns we'd like to insert data into, followed by the values to insert into said columns:

INSERT INTO table_name (column_1, column_2, column_3)
VALUES ("value1", "value2", "value3");

Many things could result in a failed insert. For one, the number of values must match the number of columns we specify; if we don't we've either provided too few or too many values.

Second, vales must respect a column's data type. If we try to insert an integer into a DateTime column, we'll receive an error.

Finally, we must consider the keys and constraints of the table. If keys exist that specify certain columns must not be empty, or must be unique, those keys must too be respected.

As a shorthand trick, if we're inserting values into all of a table's columns, we can skip the part where we explicitly list the column names:

INSERT INTO table_name
VALUES ("value1", "value2", "value3");

Here's a quick example of an insert query with real data:

INSERT INTO friends (id, name, birthday) 
VALUES (1, 'Jane Doe', '1990-05-30');

UPDATE Records: The Basics

Updating rows is where things get interesting. There's so much we can do here, so let's work our way up:

UPDATE table_name 
SET column_name_1 = 'value' 
WHERE column_name_2 = 'value';

That's as simple as it gets: the value of a column, in a row that matches our conditional. Note that SET always comes before WHERE. Here's the same query with real data:

UPDATE celebs 
SET twitter_handle = '@taylorswift13' 
WHERE id = 4;

UPDATE Records: Useful Logic

Joining Strings Using CONCAT

You will find that it's common practice to update rows based on data which already exists in said rows: in other words, sanitizing or modifying data. A great string operator is CONCAT(). CONCAT("string_1", "string_2") will join all the strings passed to a single string.

Below is a real-world example of using CONCAT() in conjunction with NOT LIKE to determine which post excerpts don't end in punctuation. If the excerpt does not end with a punctuation mark, we add a period to the end:

UPDATE
  posts
SET 
  custom_excerpt = CONCAT(custom_excerpt, '.')
WHERE
  custom_excerpt NOT LIKE '%.'
  AND custom_excerpt NOT LIKE '%!'
  AND custom_excerpt NOT LIKE '%?';

Using REPLACE

REPLACE() works in SQL as it does in nearly every programming language. We pass REPLACE() three values:

The string to be modified.
The substring within the string which will be replaced.
The value of the replacement.

We can do plenty of clever things with REPLACE(). This is an example that changes the featured image of blog posts to contain the “retina image” suffix:

UPDATE
  posts
SET
  feature_image = REPLACE(feature_image, '.jpg', '@2x.jpg');

Scenario: Folder Structure Based on Date

I across a fun exercise the other day when dealing with a nightmare situation involving changing CDNs. It touches on everything we’ve reviewed thus far and serves a great illustration of what can be achieved in SQL alone.

The challenge in moving hundreds of images for hundreds of posts came in the form of a file structure. Ghost likes to save images in a dated folder structure, like 2019/02/image.jpg. Our previous CDN did not abide by this at all, so had a dump of all images in a single folder. Not ideal.

Thankfully, we can leverage the metadata of our posts to discern this file structure. Because images are added to posts when posts are created, we can use the created_at column from our posts table to figure out the right dated folder:

UPDATE
  posts
SET
  feature_image = CONCAT("https://cdn.example.com/posts/", 
	YEAR(created_at),
	"/", 
	LPAD(MONTH(created_at), 2, '0'), 
	"/",
	SUBSTRING_INDEX(feature_image, '/', - 1)
  );

Let's break down the contents in our CONCAT:

https://cdn.example.com/posts/: The base URL of our new CDN.
YEAR(created_at): Extracting the year from our post creation date (corresponds to a folder).
LPAD(MONTH(created_at), 2, '0'): Using MONTH(created_at) returns a single digit for early months, but our folder structure wants to always have months double-digits (ie: 2018/01/ as opposed to 2018/1/). We can use LPAD() here to 'pad' our dates so that months are always two digits long, and shorter dates will be padded with the number 0.
SUBSTRING_INDEX(feature_image, '/', - 1): We're getting the filename of each post's image by finding everything that comes after the last slash in our existing image URL.

The result for every image will now look like this:

https://cdn.example.com/posts/2018/02/image.jpg

DELETE Records

Let's wrap up for today with our last type of query, deleting rows:

DELETE FROM celebs 
WHERE twitter_handle IS NULL;