Lynx Roundup, September 5th

Git tips! Python tail call recursion! Nested window functions in SQL!

Resident Scientist Snkia works tirelessly towards robot utopia. These are his findings.

https://medium.freecodecamp.org/follow-these-simple-rules-and-youll-become-a-git-and-github-master-e1045057468f

Currying in calculus, PDEs, programming, and categories

In an effort to keep up my grueling regime of publishing at least one blog post per year, I’ve decided to write today about a neat little problem I ran across. The problem is; how do you calculate the cumulative sum of a time series if there are transactions that reset the count to 0?

Naturally, there are ways to engineer around this problem without complex queries doing the heavy lifting (recording the current balance along side your transaction history, for example). However, if you occasionally find yourself being asked for a number crunch, half with the expectation of your response being:

leeroy

…it’s not a bad idea to keep this query in your back pocket.

A Convoluted Example

Alice decides to start a bank. Her engineers are clever and they decide that they can save space in their transaction log by just reporting changes in the balance. They keep the current balance in another table and decide if Grace the auditor comes around, they’ll just sum up all of the transactions in their log. Good job guys.

dilbert_and_wally

Bob is an exceptional client with good credit history. Grace takes a quick glance at his balance history and decides everything looks fine. As she leafs through the other records and pauses at Sybil’s balance history. It’s massive. The problem is that Sybil’s been through a number of bankruptcies that should have set her balance to 0. She breaks the news to Alice. You can imagine that Alice was not too happy.

alice

To fix the problem, the engineers decide to introduce a new transaction that resets the balance to 0.

The Data Set

In order to satisfy Grace, we’ve got to come up with a balance history for both Bob and Sybil using just transactions. Here’s our data set in Postgres SQL. We’ll use text and integer columns along with small numbers to simplify the problem.

CREATE TABLE transactions (
  id integer
  ,account text
  ,type text
  ,amount integer
);

INSERT INTO transactions
VALUES
  (1, 'bob', 'deposit', 5)
  ,(2, 'bob', 'withdrawal', -2)
  ,(3, 'bob', 'deposit', 3)
  ,(4, 'sybil', 'deposit', 5)
  ,(5, 'sybil', 'withdrawal', -2)
  ,(6, 'sybil', 'bankrupt', 0)
  ,(7, 'sybil', 'deposit' , 3)
  ,(8, 'sybil', 'deposit', 7)
  ,(9, 'sybil', 'bankrupt', 0)
  ,(10, 'sybil', 'deposit' , 5)
;

Simple Cumulative Sums

Our first job is to calculate Bob’s balance history. This section is a review of basic window functions. If you’re used to using them to create cumulative sums, feel free to skip ahead. Here’s what we’re trying to produce:

 id | account |    type    | amount | sum
----+---------+------------+--------+-----
  1 | bob     | deposit    |      5 |   5
  2 | bob     | withdrawal |     -2 |   3
  3 | bob     | deposit    |      3 |   6

A window function has two parts; the partition and the order. The partition defines which columns are going to be grouped together in the window. The order defines how to sort the rows in the window.

In the example above, we want to group transactions by their account. We then want to order them by their ID. Now, when we use the Sum function, it will sum from the first transaction in the partition, to the current row:

SELECT
  id
  ,account
  ,type
  ,amount
  ,sum(amount) OVER account
FROM transactions
WHERE account = 'bob'
WINDOW account AS (
  PARTITION BY account
  ORDER BY id
);

However, if we use the same query for Sybil, we run into our next problem.

 id | account |    type    | amount | sum
----+---------+------------+--------+-----
  4 | sybil   | deposit    |      5 |   5
  5 | sybil   | withdrawal |     -2 |   3
  6 | sybil   | bankrupt   |      0 |   3
  7 | sybil   | deposit    |      3 |   6
  8 | sybil   | deposit    |      7 |  13
  9 | sybil   | bankrupt   |      0 |  13
 10 | sybil   | deposit    |      5 |  18

Cumulative Sums With Resets

The trick to calculating Sybil’s balance is using a nested set of window functions to make the query sensitive to the transaction type. Here’s what we’re trying to achieve:

 id | account |    type    | amount | sum
----+---------+------------+--------+-----
  4 | sybil   | deposit    |      5 |   5
  5 | sybil   | withdrawal |     -2 |   3
  6 | sybil   | bankrupt   |      0 |   0
  7 | sybil   | deposit    |      3 |   3
  8 | sybil   | deposit    |      7 |  10
  9 | sybil   | bankrupt   |      0 |   0
 10 | sybil   | deposit    |      5 |   5

One possible solution is to try and subtract the cumulative sum using a CASE statement:

SELECT
  id
  ,account
  ,type
  ,amount
  ,CASE
    WHEN type = 'bankrupt' THEN
      -1*sum(amount) OVER account
    ELSE
      sum(amount) OVER account
  END AS sum
FROM transactions
WHERE account = 'sybil'
WINDOW account AS (
  PARTITION BY account
  ORDER BY id
);

The trouble with this approach is that the window function does not use the output of the last row to calculate the next row. The only thing we’ve done is to alter the values of the bankrupt transactions:

 id | account |    type    | amount | sum
----+---------+------------+--------+-----
  4 | sybil   | deposit    |      5 |   5
  5 | sybil   | withdrawal |     -2 |   3
  6 | sybil   | bankrupt   |      0 |  -3
  7 | sybil   | deposit    |      3 |   6
  8 | sybil   | deposit    |      7 |  13
  9 | sybil   | bankrupt   |      0 | -13
 10 | sybil   | deposit    |      5 |  18

Nesting window functions solves this problem. We’ll build up the query by first identifying the transactions that reset the count:

SELECT
  id
  ,account
  ,type
  ,amount
  ,(type = 'bankrupt')::int AS do_reset
FROM transactions
WHERE account = 'sybil'
;

Running that query we get:

 id | account |    type    | amount | do_reset
----+---------+------------+--------+----------
  4 | sybil   | deposit    |      5 |        0
  5 | sybil   | withdrawal |     -2 |        0
  6 | sybil   | bankrupt   |      0 |        1
  7 | sybil   | deposit    |      3 |        0
  8 | sybil   | deposit    |      7 |        0
  9 | sybil   | bankrupt   |      0 |        1
 10 | sybil   | deposit    |      5 |        0

Casting boolean values to integers allows us to then perform a cumulative sum over do_reset. This sum will allow us to group transactions chronologically, by their association to a resetting transaction.

SELECT
  id
  ,account
  ,type
  ,amount
  ,sum((type = 'bankrupt')::int) OVER account AS reset_id
FROM transactions
WHERE account = 'sybil'
WINDOW account AS (
  PARTITION BY account
  ORDER BY id
);

Running that query we get:

 id | account |    type    | amount | reset_id
----+---------+------------+--------+----------
  4 | sybil   | deposit    |      5 |        0
  5 | sybil   | withdrawal |     -2 |        0
  6 | sybil   | bankrupt   |      0 |        1
  7 | sybil   | deposit    |      3 |        1
  8 | sybil   | deposit    |      7 |        1
  9 | sybil   | bankrupt   |      0 |        2
 10 | sybil   | deposit    |      5 |        2

Finally, we can nest this query inside another window function, and perform a cumulative sum on amount, using the reset_id we’ve just created:

SELECT
  id
  ,account
  ,type
  ,amount
  ,sum(amount) OVER resets
FROM (
  SELECT
    id
    ,account
    ,type
    ,amount
    ,sum((type = 'bankrupt')::int) OVER account AS reset_id
  FROM transactions
  WINDOW account AS (
    PARTITION BY account
    ORDER BY id
  )
) t
WINDOW resets AS (
  PARTITION BY
    account
    ,reset_id
  ORDER BY id
);

Running the query we get the values we expect for both Bob and Sybil. Huzzah!

 id | account |    type    | amount | sum
----+---------+------------+--------+-----
  1 | bob     | deposit    |      5 |   5
  2 | bob     | withdrawal |     -2 |   3
  3 | bob     | deposit    |      3 |   6
  4 | sybil   | deposit    |      5 |   5
  5 | sybil   | withdrawal |     -2 |   3
  6 | sybil   | bankrupt   |      0 |   0
  7 | sybil   | deposit    |      3 |   3
  8 | sybil   | deposit    |      7 |  10
  9 | sybil   | bankrupt   |      0 |   0
 10 | sybil   | deposit    |      5 |   5

Conclusion

Window functions are a really good solution to processing time series with chronologically related data. This post scratches the surface of what’s possible.

So, the next time your boss asks you to process some time series data, consider giving window functions a try.

drake

https://github.com/ac1235/python-tailrec

https://github.com/Avik-Jain/100-Days-Of-ML-Code/blob/master/README.md

What A Mathematical Formula Can Teach Us About Coincidence

http://jkk.name/neural-tagger-tutorial/

Author image
Center of the Universe Website
Super villain in somebody's action hero movie. Experienced a radioactive freak accident at a young age, which rendered him part-snake and strangely adept at Python.
Author image
Center of the Universe

Super villain in somebody's action hero movie. Experienced a radioactive freak accident at a young age, which rendered him part-snake and strangely adept at Python.