Welcome to SQL 2: Selecting, Updating, and Deleting Data
Explore the many flavors of SQL data manipulation in part 2 of our series.
Now that we've gotten the fundamentals of creating databases and tables out of the way, we can start getting into the meat and potatoes of SQL interactions: selecting, updating, and deleting data.
We'll start with the basic structure of these queries and then break into the powerful operations with enough detail to make you dangerous.
Selecting Data From a Table
As mentioned previously, SQL operations have a rather strict order of operations which clauses have to respect in order to make a valid query. We'll begin by dissecting a common
SELECT column_name_1, column_name_2 FROM schema_name.table_name WHERE column_name_1 = "Value";
This is perhaps the most common structure of
SELECT queries. First, we list the names of the columns we'd like to select separated by commas. To receive all columns, we can simply say
These columns need to come from somewhere, so we specify the table we're referring to next. This either takes the form of
FROM table_name (non-PostgreSQL), or
FROM schema_name.table_name (PostgreSQL). In theory, a semicolon here would result in a valid query, but we usually want to select rows that meet certain criteria.
This is where the
WHERE clause comes in: only rows which return "true" for our
WHERE conditional will be returned. In the above example, we're validating that a string matches exactly
Selecting only Distinct Values
Something that often comes in handy is selecting distinct values in a column. In other words, if a value exists in the same column in 100 rows, running
DISTINCT query will only show us that value once. This is a good way of seeing the unique content of a column without yet diving into the distribution of said value. The effect is similar to the United States Senate, or the Electoral College: forget the masses, and prop up Wyoming 2020:
SELECT DISTINCT column_name FROM table_name;
Offsetting and Limiting Results in our Queries
When selecting data, the combination of
LIMIT are critical at times. If we're selecting from a database with hundreds of thousands of rows, we would be wasting an obscene amount of system resources to fetch all rows at once; instead, we can have our application or API paginate the results.
LIMIT is followed by an integer, which in essence says "return no more than X results."
OFFSET is also followed by an integer, which denotes a numerical starting point for returned results, aka: "return all results which occur after the Xth result:"
SELECT * FROM table_name LIMIT 50 OFFSET 0;
The above returns the first 50 results. If we wanted to build paginated results on the application side, we could construct our query like this:
from SQLAlchemy import engine, session # Set up a SQLAlchemy session Session = sessionmaker() engine = create_engine('sqlite:///example.db') Session.configure(bind=engine) sess = Session() # Appication variables page_number = 3 page_size = 50 results_subset = page_number * results limit # Query session.query(TableName).limit(page_size).offset(results_subset)
Such an application could increment
page_number by 1 each time the user clicks on to the next page, which would then appropriately modify our query to return the next page of results.
Another use for
OFFSET could be to pick up where a failed script left off. If we were to write an entire database to a CSV and experience a failure. We could pick up where the script left off by setting
OFFSET equal to the number of rows in the CSV, to avoid running the entire script all over again.
Last to consider for now is sorting our results by using the
ORDER BY clause. We can sort our results by any specified column, and state whether we'd like the results to be ascending (
ASC) or descending (
SELECT * FROM schema_name.table_name WHERE column_name_1 = "Value" ORDER BY updated_date DESC LIMIT 50 OFFSET 10;
Sophisticated SELECT Statements
Of course, we can select rows with
WHERE logic that goes much deeper than an exact match. One of the most versatile of these operations is
Using Regex with LIKE
LIKE is perhaps the most powerful way to select columns with string values. With
LIKE, we can leverage regular expressions to build highly complex logic. Let's start with some of my favorites:
SELECT * FROM people WHERE name LIKE "%Wade%";
Passing a string to
LIKE with percentage signs on both sides is essentially a "contains" statement.
% is equivalent to a wildcard, thus placing
% on either side of our string will return true whether the person's first name, middle name, or last name is Wade. Check out other useful combinations for
a%: Finds any values that start with "a".
%a: Finds any values that end with "a".
%or%: Finds any values that have "or" in any position.
_r%: Finds any values that have "r" in the second position.
a_%_%: Finds any values that start with "a" and are at least 3 characters in length.
a%o: Finds any values that start with "a" and ends with "o".
Finding Values that are NOT LIKE
The opposite of
LIKE is of course
NOT LIKE, which runs the same conditional, but returns the opposite true/false value of
SELECT * FROM people WHERE name NOT LIKE "%Wade%";
Conditionals With DateTime Columns
DateTime columns are extremely useful for selecting data. Unlike plain strings, we can easily extract numerical values for month, day, and year from a DateTime by using
YEAR(column_name) respectively. For example, using
MONTH() on a column that contains a DateTime of
2019-01-26 05:42:34 would return
1, aka January. Because the values come back as integers, it is then trivial to find results within a date range:
SELECT * FROM posts WHERE YEAR(created_at) < 2018;
Finding Rows with NULL Values
NULL is a special datatype which essentially denotes the "absence of something," therefore no conditional will never equal
NULL. Instead, we find rows where a value
SELECT * FROM posts WHERE author IS NULL;
This should not come as a surprise to anybody familiar with validating datatypes.
The reverse of this, of course, is
SELECT * FROM posts WHERE author IS NOT NULL;
INSERT query creates a new row, and is rather straightforward: we state the columns we'd like to insert data into, followed by the values to insert into said columns:
INSERT INTO table_name (column_1, column_2, column_3) VALUES ("value1", "value2", "value3");
Many things could result in a failed insert. For one, the number of values must match the number of columns we specify; if we don't we've either provided too few or too many values.
Second, vales must respect a column's data type. If we try to insert an integer into a DateTime column, we'll receive an error.
Finally, we must consider the keys and constraints of the table. If keys exist that specify certain columns must not be empty, or must be unique, those keys must too be respected.
As a shorthand trick, if we're inserting values into all of a table's columns, we can skip the part where we explicitly list the column names:
INSERT INTO table_name VALUES ("value1", "value2", "value3");
Here's a quick example of an insert query with real data:
INSERT INTO friends (id, name, birthday) VALUES (1, 'Jane Doe', '1990-05-30');
UPDATE Records: The Basics
Updating rows is where things get interesting. There's so much we can do here, so let's work our way up:
UPDATE table_name SET column_name_1 = 'value' WHERE column_name_2 = 'value';
That's as simple as it gets: the value of a column, in a row that matches our conditional. Note that
SET always comes before
WHERE. Here's the same query with real data:
UPDATE celebs SET twitter_handle = '@taylorswift13' WHERE id = 4;
UPDATE Records: Useful Logic
Joining Strings Using CONCAT
You will find that it's common practice to update rows based on data which already exists in said rows: in other words, sanitizing or modifying data. A great string operator is
CONCAT("string_1", "string_2") will join all the strings passed to a single string.
Below is a real-world example of using
CONCAT() in conjunction with
NOT LIKE to determine which post excerpts don't end in punctuation. If the excerpt does not end with a punctuation mark, we add a period to the end:
UPDATE posts SET custom_excerpt = CONCAT(custom_excerpt, '.') WHERE custom_excerpt NOT LIKE '%.' AND custom_excerpt NOT LIKE '%!' AND custom_excerpt NOT LIKE '%?';
REPLACE() works in SQL as it does in nearly every programming language. We pass
REPLACE() three values:
- The string to be modified.
- The substring within the string which will be replaced.
- The value of the replacement.
We can do plenty of clever things with
REPLACE(). This is an example that changes the featured image of blog posts to contain the “retina image” suffix:
UPDATE posts SET feature_image = REPLACE(feature_image, '.jpg', '@2x.jpg');
Scenario: Folder Structure Based on Date
I across a fun exercise the other day when dealing with a nightmare situation involving changing CDNs. It touches on everything we’ve reviewed thus far and serves a great illustration of what can be achieved in SQL alone.
The challenge in moving hundreds of images for hundreds of posts came in the form of a file structure. Ghost likes to save images in a dated folder structure, like 2019/02/image.jpg. Our previous CDN did not abide by this at all, so had a dump of all images in a single folder. Not ideal.
Thankfully, we can leverage the metadata of our posts to discern this file structure. Because images are added to posts when posts are created, we can use the created_at column from our posts table to figure out the right dated folder:
UPDATE posts SET feature_image = CONCAT("https://cdn.example.com/posts/", YEAR(created_at), "/", LPAD(MONTH(created_at), 2, '0'), "/", SUBSTRING_INDEX(feature_image, '/', - 1) );
Let's break down the contents in our
https://cdn.example.com/posts/: The base URL of our new CDN.
YEAR(created_at): Extracting the year from our post creation date (corresponds to a folder).
LPAD(MONTH(created_at), 2, '0'): Using MONTH(created_at) returns a single digit for early months, but our folder structure wants to always have months double-digits (ie: 2018/01/ as opposed to 2018/1/). We can use
LPAD()here to 'pad' our dates so that months are always two digits long, and shorter dates will be padded with the number 0.
SUBSTRING_INDEX(feature_image, '/', - 1): We're getting the filename of each post's image by finding everything that comes after the last slash in our existing image URL.
The result for every image will now look like this:
Let's wrap up for today with our last type of query, deleting rows:
DELETE FROM celebs WHERE twitter_handle IS NULL;