Welcome to SQL 4: Aggregate Functions

Become more intimate with your data: use SQL's aggregate functions to explore the traits which make your data unique and beautiful.

    Aggregate functions in SQL are super dope. When combining these functions with clauses such as GROUP BY and HAVING, we discover ways to view our data from completely new perspectives. Instead of looking at the same old endless flat table, we can use these functions to give us entirely new insights; aggregate functions help us to understand bigger-picture things. Those things might include finding outliers in datasets, or simply figuring out which employee with a family to feed should be terminated, based on some arbitrary metric such as sales numbers.

    With the basics of JOINs under our belts, this is when SQL starts to feel really, really powerful. Our plain two-dimensional tables suddenly gain the power to be combined, aggregated, folded onto themselves, expanded infinitely outward like the universe itself, and even transcended into the fourth dimension.*

    *Needs citation

    Our Base Aggregation Functions

    First up, let's see what we mean by "aggregate functions" anyway. These simple functions provide us with a way to mathematically quantify what exactly is in our database. Aggregate functions are performed on table columns to summarize the make-up of said column: the core crew is COUNT, SUM, AVG, MIN, and MAX. On their own, they're quite simple:
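
    Here's a quick sketch of each one in action, run against a hypothetical orders table (the table and column names here are invented purely for illustration):

    SELECT
      COUNT(id),          -- number of rows with a non-NULL id
      SUM(order_total),   -- all order totals added together
      AVG(order_total),   -- the mean order total
      MIN(order_total),   -- the smallest order total
      MAX(order_total)    -- the largest order total
    FROM
      orders;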

    DISTINCT Aggregations

    A particularly useful way of using aggregate functions on their own is when we'd like to know the number of DISTINCT values. While aggregate functions normally take every record into account, adding DISTINCT limits the calculation to unique values only. COUNT(column_name) returns the number of records in a column (NULLs excluded), whereas COUNT(DISTINCT column_name) counts each repeated value only once.
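
    For instance, sticking with a hypothetical orders table (again, invented names), the two flavors of COUNT compare like this:

    SELECT
      COUNT(customer_id),           -- every non-NULL customer_id, repeats and all
      COUNT(DISTINCT customer_id)   -- each customer counted once, no matter how many orders they placed
    FROM
      orders;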

    Using GROUP BY

    The GROUP BY statement is often used with aggregate functions (COUNT, MAX, MIN, SUM, AVG) to group the result-set by one or more columns.
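
    The general shape looks something like this (placeholder names, not a real table):

    SELECT
      grouping_column,
      COUNT(*)            -- or SUM(), AVG(), MIN(), MAX()
    FROM
      some_table
    GROUP BY
      grouping_column;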

    To demonstrate how aggregate functions work moving forward, I'll be using a familiar database: the database which contains all the content for this very blog. Let's get a quick preview of what we're working with:

    title slug feature_image meta_title meta_description created_at updated_at published_at custom_excerpt
    Welcome to Hackers and Slackers welcome-to-hackers-and-slackers /content/images/2017/11/[email protected] Welcome to Hackers and Slackers | Hackers and Slackers Technology for badasses 2017-11-17 20:29:13 2018-07-25 02:06:02 2017-11-13 20:37:00 Technology for badasses.
    Generating Tree Hierarchies with Treelib creating-trees-in-treelib /content/images/2017/11/[email protected] Tree Hierarchies with Treelib | Hackers and Slackers Treelib is a Python library that allows you to create a visual tree hierarchy: a simple plaintext representation of parent-child relationships. 2017-11-17 20:45:10 2019-03-28 09:02:39 2017-11-17 20:56:40 Using Python to visualize file hierarchies as trees.
    About the Squad about https://hackers.nyc3.cdn.digitaloceanspaces.com/posts/2017/11/[email protected] About | Hackers and Slackers Hackers and Slackers is a community which values technology, life, and improving the latter with the former. 2017-11-17 20:58:42 2019-04-22 08:47:02 2017-11-17 20:58:46 Hackers and Slackers is a community which values technology, life, and improving the latter with the former.
    Join join https://hackers.nyc3.cdn.digitaloceanspaces.com/posts/2017/11/[email protected] Join | Hackers and Slackers 2017-11-17 20:59:05 2018-07-25 02:06:02 2017-11-17 21:03:06
    Merge Sets of Data in Python Using Pandas merge-dataframes-with-pandas /content/images/2017/11/[email protected] Merging Dataframes with Pandas | Hackers and Slackers Perform merges of data similar to SQL JOINs using Python's Pandas library: the essential library for data analysis in Python. 2017-11-18 00:09:32 2018-12-26 09:29:22 2017-11-18 00:22:25 Perform SQL-like merges of data using Python's Pandas.

    Item 1 on our agenda: we're going to use aggregates to find which authors have been posting most frequently:

    SELECT
      COUNT(title), author_id
    FROM
      posts
    GROUP BY author_id;
    

    And the result:

    Count author_id
    102 1
    280 5c12c3821345c22dced9f591
    17 5c12c3821345c22dced9f592
    5 5c12c3821345c22dced9f593
    2 5c12c3821345c22dced9f594
    2 5c12c3821345c22dced9f595

    Oh look, a real-life data problem to solve! It seems like authors are represented in Ghost's posts table simply by their IDs. This isn't very useful. Luckily, we've already learned enough about JOINs to know we can fill in the missing information from the users table!

    SELECT
      COUNT(posts.title),
      users.name
    FROM
      posts
    LEFT JOIN users
    ON 
      (posts.author_id = users.id)
    GROUP BY users.id
    ORDER BY COUNT(posts.title) DESC;
    

    Let's see the results this time around:

    Count name
    280 Matthew Alhonte
    102 Todd Birchard
    17 Max Mileaf
    5 Ryan Rosado
    2 Graham Beckley
    2 David Aquino

    Now that's more like it! Matt is crushing the game with his Lynx Roundup series, with me in second place. Max had respectable numbers for a moment but has presumably moved on to other hobbies, such as living his life.

    For the remainder, well, I've got nothing to say other than we're hiring. We don't pay, though. In fact, there are probably zero benefits to joining us.

    Conditional Grouping With "HAVING"

    HAVING is like the WHERE of aggregations. WHERE filters individual rows before any grouping happens, so it can't reference aggregate values; that's why HAVING exists. HAVING instead filters the groups produced by GROUP BY, which means its conditions can be built on aggregate functions like COUNT(). Perhaps this would be easier to visualize in a query:

    SELECT
      tags.name,
      COUNT(DISTINCT posts_tags.post_id)
    FROM posts_tags 
      LEFT JOIN tags ON tags.id = posts_tags.tag_id
      LEFT JOIN posts ON posts.id = posts_tags.post_id
    GROUP BY
      tags.id
    HAVING 
      COUNT(DISTINCT posts_tags.post_id) > 10
    ORDER BY
      COUNT(DISTINCT posts_tags.post_id)
      DESC;
    

    In this scenario, we want to see which tags on our blog have the highest number of associated posts. The query is very similar to the one we made previously, only this time we have a special guest:

    HAVING 
      COUNT(DISTINCT posts_tags.post_id) > 10
    

    This usage of HAVING only gives us tags which have more than ten posts. This should clean up our report by letting Darwinism take its course. Here's how it worked out:

    tag Count
    Roundup 263
    Python 80
    Machine Learning 29
    DevOps 28
    Data Science 28
    Software Development 27
    Data Engineering 23
    Excel 19
    SQL 18
    Architecture 18
    REST APIs 16
    #Adventures in Excel 16
    Pandas 15
    Flask 14
    Data Analysis 12
    JavaScript 12
    AWS 11
    MySQL 11

    As expected, Matt's roundup posts take the lead (and if we compare this to previous data, we can see Matt has made a total of 17 non-Lynx posts: meaning Max and Matt are officially TIED).

    If we hadn't included our HAVING statement, this list would be much longer, filled with tags nobody cares about. Thanks to explicit omission, now we don't need to experience the dark depression that comes when confronting those sad pathetic tags. Out of sight, out of mind.

    More Aggregates

    To explore some of the other aggregates, we're going to switch datasets. This time, we're going to look at wind speeds across US cities:

    datetime | Vancouver | Portland | San Francisco | Seattle | Los Angeles | San Diego | Las Vegas | Phoenix | Albuquerque | Denver | San Antonio | Dallas | Houston | Kansas City | Minneapolis | Saint Louis | Chicago | Nashville | Indianapolis | Atlanta | Detroit | Jacksonville | Charlotte | Miami | Pittsburgh | Toronto | Philadelphia | New York | Montreal | Boston | Beersheba | Tel Aviv District | Eilat | Haifa | Nahariyya | Jerusalem
    2012-10-01 12:00:00 8
    2012-10-01 13:00:00 0 0 2 0 0 0 0 2 4 4 0 3 1 0 3 4 0 4 4 3 0 3 4 3 0 3 4 7 4 3 1 0 8 2 2 2
    2012-10-01 14:00:00 0 0 2 0 0 0 0 2 4 4 0 3 1 0 3 4 0 4 4 3 0 3 4 3 0 3 4 7 4 3 3 0 8 2 2 2
    2012-10-01 15:00:00 0 0 2 0 0 0 0 2 4 3 0 3 1 0 3 4 0 4 4 3 0 3 4 3 0 3 3 7 4 3 3 0 8 2 2 2
    2012-10-01 16:00:00 0 0 2 0 0 0 0 2 4 3 0 3 1 0 3 3 0 4 4 3 0 3 4 3 0 3 3 7 4 3 3 0 8 2 2 2
    2012-10-01 17:00:00 0 0 2 0 0 0 0 2 4 3 0 3 1 0 3 3 0 4 4 3 0 3 4 3 0 3 3 6 3 3 3 0 8 2 2 2
    2012-10-01 18:00:00 0 0 2 0 0 0 0 2 4 3 0 3 2 0 3 3 0 4 4 3 0 3 4 3 0 3 3 6 3 3 3 0 8 2 2 2
    2012-10-01 19:00:00 0 0 2 0 0 0 0 2 4 3 0 3 2 0 3 3 0 4 4 3 0 3 4 4 0 3 3 6 3 3 2 1 8 2 2 2
    2012-10-01 20:00:00 0 0 1 0 0 0 0 1 4 3 0 3 2 0 3 3 0 4 4 3 0 3 4 4 0 3 3 6 3 3 2 1 8 2 2 2
    2012-10-01 21:00:00 0 0 1 0 0 0 0 1 4 3 0 3 2 0 3 3 0 4 4 3 0 3 4 4 0 3 3 6 3 3 2 1 8 2 2 2

    Let's figure out if Chicago really is the windy city, shall we?

    SELECT
    	AVG(Chicago),
    	AVG(`San Francisco`),
    	AVG(`Los Angeles`),
    	AVG(Seattle),
    	AVG(`New York`),
    	AVG(`Boston`),
    	AVG(Vancouver),
    	AVG(Miami)
    FROM
    	wind_speed;
    

    ...Aaand the results!:

    AVG(Chicago) AVG(`San Francisco`) AVG(`Los Angeles`) AVG(Seattle) AVG(`New York`) AVG(`Boston`) AVG(Vancouver) AVG(Miami)
    3.7593 2.7867 1.2195 2.1181 3.2110 3.3809 2.4327 3.2365

    Wow, so it looks like (at first glance) Chicago really is the windiest city! I'm... not sure if I was expecting that, for some reason. Let's see the range of wind speeds in Chicago as well:

    SELECT
    	AVG(Chicago),
    	MIN(Chicago),
    	MAX(Chicago)
    FROM
    	wind_speed;
    
    AVG(Chicago) MIN(Chicago) MAX(Chicago)
    3.7593 0 25

    The lowest wind speed in our dataset for Chicago appears to be 0 (not shocking). On the flip side, the highest wind speed we have recorded for Chicago was 25 mph! Wow! Isn't that... dangerous?

    Get Creative

    Aggregate functions aren't just about counting values or finding averages. Especially in Data Science, these functions are critical to drawing any statistical conclusions from data. That said, attention spans only last so long, and I'm not a scientist. Perhaps that can be your job.
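
    If you do want to dip a toe into that water, MySQL (which is what we've been querying here) ships with statistical aggregates like STDDEV() and VARIANCE() that drop into queries exactly like AVG() does. A rough sketch against our wind_speed table, meant as a starting point rather than rigorous science:

    SELECT
      AVG(Chicago),                                -- mean wind speed
      STDDEV(Chicago),                             -- how much readings vary around that mean
      VARIANCE(Chicago),                           -- the same spread, squared
      MAX(Chicago) - MIN(Chicago) AS speed_range   -- distance between the calmest and windiest readings
    FROM
      wind_speed;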
