Lynx Roundup, September 5th

Git tips! Python tail call recursion! Nested window functions in SQL!

Lynx Roundup, September 5th

    Resident Scientist Snkia works tirelessly towards robot utopia. These are his findings.

    https://medium.freecodecamp.org/follow-these-simple-rules-and-youll-become-a-git-and-github-master-e1045057468f

    Currying in calculus, PDEs, programming, and categories

    In an effort to keep up my grueling regime of publishing at least one blog post per year, I’ve decided to write today about a neat little problem I ran across. The problem is; how do you calculate the cumulative sum of a time series if there are transactions that reset the count to 0?

    Naturally, there are ways to engineer around this problem without complex queries doing the heavy lifting (recording the current balance along side your transaction history, for example). However, if you occasionally find yourself being asked for a number crunch, half with the expectation of your response being:

    leeroy

    …it’s not a bad idea to keep this query in your back pocket.

    A Convoluted Example

    Alice decides to start a bank. Her engineers are clever and they decide that they can save space in their transaction log by just reporting changes in the balance. They keep the current balance in another table and decide if Grace the auditor comes around, they’ll just sum up all of the transactions in their log. Good job guys.

    dilbert_and_wally

    Bob is an exceptional client with good credit history. Grace takes a quick glance at his balance history and decides everything looks fine. As she leafs through the other records and pauses at Sybil’s balance history. It’s massive. The problem is that Sybil’s been through a number of bankruptcies that should have set her balance to 0. She breaks the news to Alice. You can imagine that Alice was not too happy.

    alice

    To fix the problem, the engineers decide to introduce a new transaction that resets the balance to 0.

    The Data Set

    In order to satisfy Grace, we’ve got to come up with a balance history for both Bob and Sybil using just transactions. Here’s our data set in Postgres SQL. We’ll use text and integer columns along with small numbers to simplify the problem.

    CREATE TABLE transactions (
      id integer
      ,account text
      ,type text
      ,amount integer
    );
    
    INSERT INTO transactions
    VALUES
      (1, 'bob', 'deposit', 5)
      ,(2, 'bob', 'withdrawal', -2)
      ,(3, 'bob', 'deposit', 3)
      ,(4, 'sybil', 'deposit', 5)
      ,(5, 'sybil', 'withdrawal', -2)
      ,(6, 'sybil', 'bankrupt', 0)
      ,(7, 'sybil', 'deposit' , 3)
      ,(8, 'sybil', 'deposit', 7)
      ,(9, 'sybil', 'bankrupt', 0)
      ,(10, 'sybil', 'deposit' , 5)
    ;
    

    Simple Cumulative Sums

    Our first job is to calculate Bob’s balance history. This section is a review of basic window functions. If you’re used to using them to create cumulative sums, feel free to skip ahead. Here’s what we’re trying to produce:

     id | account |    type    | amount | sum
    ----+---------+------------+--------+-----
      1 | bob     | deposit    |      5 |   5
      2 | bob     | withdrawal |     -2 |   3
      3 | bob     | deposit    |      3 |   6
    

    A window function has two parts; the partition and the order. The partition defines which columns are going to be grouped together in the window. The order defines how to sort the rows in the window.

    In the example above, we want to group transactions by their account. We then want to order them by their ID. Now, when we use the Sum function, it will sum from the first transaction in the partition, to the current row:

    SELECT
      id
      ,account
      ,type
      ,amount
      ,sum(amount) OVER account
    FROM transactions
    WHERE account = 'bob'
    WINDOW account AS (
      PARTITION BY account
      ORDER BY id
    );
    

    However, if we use the same query for Sybil, we run into our next problem.

     id | account |    type    | amount | sum
    ----+---------+------------+--------+-----
      4 | sybil   | deposit    |      5 |   5
      5 | sybil   | withdrawal |     -2 |   3
      6 | sybil   | bankrupt   |      0 |   3
      7 | sybil   | deposit    |      3 |   6
      8 | sybil   | deposit    |      7 |  13
      9 | sybil   | bankrupt   |      0 |  13
     10 | sybil   | deposit    |      5 |  18
    

    Cumulative Sums With Resets

    The trick to calculating Sybil’s balance is using a nested set of window functions to make the query sensitive to the transaction type. Here’s what we’re trying to achieve:

     id | account |    type    | amount | sum
    ----+---------+------------+--------+-----
      4 | sybil   | deposit    |      5 |   5
      5 | sybil   | withdrawal |     -2 |   3
      6 | sybil   | bankrupt   |      0 |   0
      7 | sybil   | deposit    |      3 |   3
      8 | sybil   | deposit    |      7 |  10
      9 | sybil   | bankrupt   |      0 |   0
     10 | sybil   | deposit    |      5 |   5
    

    One possible solution is to try and subtract the cumulative sum using a CASE statement:

    SELECT
      id
      ,account
      ,type
      ,amount
      ,CASE
        WHEN type = 'bankrupt' THEN
          -1*sum(amount) OVER account
        ELSE
          sum(amount) OVER account
      END AS sum
    FROM transactions
    WHERE account = 'sybil'
    WINDOW account AS (
      PARTITION BY account
      ORDER BY id
    );
    

    The trouble with this approach is that the window function does not use the output of the last row to calculate the next row. The only thing we’ve done is to alter the values of the bankrupt transactions:

     id | account |    type    | amount | sum
    ----+---------+------------+--------+-----
      4 | sybil   | deposit    |      5 |   5
      5 | sybil   | withdrawal |     -2 |   3
      6 | sybil   | bankrupt   |      0 |  -3
      7 | sybil   | deposit    |      3 |   6
      8 | sybil   | deposit    |      7 |  13
      9 | sybil   | bankrupt   |      0 | -13
     10 | sybil   | deposit    |      5 |  18
    

    Nesting window functions solves this problem. We’ll build up the query by first identifying the transactions that reset the count:

    SELECT
      id
      ,account
      ,type
      ,amount
      ,(type = 'bankrupt')::int AS do_reset
    FROM transactions
    WHERE account = 'sybil'
    ;
    

    Running that query we get:

     id | account |    type    | amount | do_reset
    ----+---------+------------+--------+----------
      4 | sybil   | deposit    |      5 |        0
      5 | sybil   | withdrawal |     -2 |        0
      6 | sybil   | bankrupt   |      0 |        1
      7 | sybil   | deposit    |      3 |        0
      8 | sybil   | deposit    |      7 |        0
      9 | sybil   | bankrupt   |      0 |        1
     10 | sybil   | deposit    |      5 |        0
    

    Casting boolean values to integers allows us to then perform a cumulative sum over do_reset. This sum will allow us to group transactions chronologically, by their association to a resetting transaction.

    SELECT
      id
      ,account
      ,type
      ,amount
      ,sum((type = 'bankrupt')::int) OVER account AS reset_id
    FROM transactions
    WHERE account = 'sybil'
    WINDOW account AS (
      PARTITION BY account
      ORDER BY id
    );
    

    Running that query we get:

     id | account |    type    | amount | reset_id
    ----+---------+------------+--------+----------
      4 | sybil   | deposit    |      5 |        0
      5 | sybil   | withdrawal |     -2 |        0
      6 | sybil   | bankrupt   |      0 |        1
      7 | sybil   | deposit    |      3 |        1
      8 | sybil   | deposit    |      7 |        1
      9 | sybil   | bankrupt   |      0 |        2
     10 | sybil   | deposit    |      5 |        2
    

    Finally, we can nest this query inside another window function, and perform a cumulative sum on amount, using the reset_id we’ve just created:

    SELECT
      id
      ,account
      ,type
      ,amount
      ,sum(amount) OVER resets
    FROM (
      SELECT
        id
        ,account
        ,type
        ,amount
        ,sum((type = 'bankrupt')::int) OVER account AS reset_id
      FROM transactions
      WINDOW account AS (
        PARTITION BY account
        ORDER BY id
      )
    ) t
    WINDOW resets AS (
      PARTITION BY
        account
        ,reset_id
      ORDER BY id
    );
    

    Running the query we get the values we expect for both Bob and Sybil. Huzzah!

     id | account |    type    | amount | sum
    ----+---------+------------+--------+-----
      1 | bob     | deposit    |      5 |   5
      2 | bob     | withdrawal |     -2 |   3
      3 | bob     | deposit    |      3 |   6
      4 | sybil   | deposit    |      5 |   5
      5 | sybil   | withdrawal |     -2 |   3
      6 | sybil   | bankrupt   |      0 |   0
      7 | sybil   | deposit    |      3 |   3
      8 | sybil   | deposit    |      7 |  10
      9 | sybil   | bankrupt   |      0 |   0
     10 | sybil   | deposit    |      5 |   5
    

    Conclusion

    Window functions are a really good solution to processing time series with chronologically related data. This post scratches the surface of what’s possible.

    So, the next time your boss asks you to process some time series data, consider giving window functions a try.

    drake

    https://github.com/ac1235/python-tailrec

    https://github.com/Avik-Jain/100-Days-Of-ML-Code/blob/master/README.md

    What A Mathematical Formula Can Teach Us About Coincidence

    http://jkk.name/neural-tagger-tutorial/

    Matthew Alhonte's' avatar
    Center of the Universe
    Super villain in somebody's action hero movie. Experienced a radioactive freak accident at a young age, which rendered him part-snake and strangely adept at Python.
    Matthew Alhonte's' avatar
    Center of the Universe @MattAlhonte

    Super villain in somebody's action hero movie. Experienced a radioactive freak accident at a young age, which rendered him part-snake and strangely adept at Python.