Pandas 0.25, released over the summer, added an easier way to do multiple aggregations on multiple columns.  It used to leave you with a DataFrame that had a multi-index across the columns, which is a huge pain to deal with (I'm usually not even a fan of regular multi-indices, tbh).  

From the announcement:

import numpy as np
import pandas as pd

animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                        'height': [9.1, 6.0, 9.5, 34.0],
                        'weight': [7.9, 7.5, 9.9, 198.0]})

animals.groupby('kind').agg(
    min_height=('height', 'min'),
    max_height=('height', 'max'),
    average_weight=('weight', np.mean),
)

Nice!  But that's a lot of typing the same thing twice!  min_height=('height', 'min')?  Pfft, type min and height twice in one line?  That sounds like it's for someone who's not lazy - i.e., certainly not us.  It's cool to be able to give the column a custom name, but let's be real here - chances are, if I want the mean weight, I probably just wanna name it mean_weight.

Let's make a helper function!

from typing import List, Tuple, Union

import pandas as pd


def add_multiple_aggs(
    df: pd.DataFrame, groupbys: Union[str, List[str]], aggs: List[Tuple[str, str]]
) -> pd.DataFrame:
    return df.groupby(groupbys).agg(**{"_".join(agg): agg for agg in aggs})

Look at how much lazier we can be now!  We can just pass a single tuple containing the target column and the desired aggregation, instead of having to do all that extra typing.

animals.pipe(
    add_multiple_aggs,
    "kind",
    [("height", "min"), ("height", "max"), ("weight", "mean")],
)
      height_min  height_max  weight_mean
kind
cat          9.1         9.5         8.90
dog          6.0        34.0       102.75

It also lets us be even lazier if we want to do something like get the same few summary stats from a bunch of different columns.  Let's say we want to get the mean and standard deviation of both height and weight.  Why type out every combination when the wonderful itertools module gives us a way to easily generate them?

from itertools import product

animals.pipe(
    add_multiple_aggs, "kind", list(product(["height", "weight"], ["mean", "std"]))
)
      height_mean  height_std  weight_mean  weight_std
kind
cat           9.3    0.282843         8.90    1.414214
dog          20.0   19.798990       102.75  134.703842
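If it's not obvious what product is handing the helper here, it just spells out every (column, aggregation) pair for us:

```python
from itertools import product

# Every (column, aggregation) combination - exactly the tuple
# format our helper expects in its aggs argument.
pairs = list(product(["height", "weight"], ["mean", "std"]))
print(pairs)
# → [('height', 'mean'), ('height', 'std'), ('weight', 'mean'), ('weight', 'std')]
```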

So, what exactly did we do here?

The agg method of a Pandas DataFrameGroupBy object takes a bunch of keyword arguments.  You can't programmatically generate keywords directly, but you CAN programmatically generate a dictionary and unpack it with the ** syntax to magically transform it into keywords.  
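Here's that trick in miniature, stripped of anything Pandas-specific (describe is just a made-up stand-in for agg):

```python
def describe(**kwargs):
    # kwargs collects whatever keyword arguments the caller passed
    return sorted(kwargs.items())

# Build the keyword arguments as a plain dict, same as the helper does...
generated = {"_".join(pair): pair for pair in [("height", "min"), ("weight", "mean")]}

# ...then ** unpacks it, exactly as if we'd typed the keywords by hand.
assert describe(**generated) == describe(height_min=("height", "min"),
                                         weight_mean=("weight", "mean"))
```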

The last thing I think warrants some explanation is the type annotations.  This sort of function is a really good candidate for them - there have been lots of times when I wrote a declarative little helper function that let you specify what you wanted in a really convenient way, but then came back to it and forgot exactly what the format was.  The annotations make it super explicit.  Let's walk through them!  Or, at least the non-trivial ones.

groupbys: Union[str, List[str]]

You can group a Pandas DataFrame by a single column, or by a list of columns - the syntax is the same either way.  To represent the fact that there are two acceptable input types we use the Union type - this says that the groupbys argument to the function can be either a string or a list of strings.  Incidentally (and I didn't realize this until typing out the blog post!), this also helpfully provided some redundancy - I wound up naming the variable groupbys, which implies plural.  Helpful, then, to have the type annotation make it clear that it can also be singular (and doesn't have to be a length-1 list)!
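A quick sanity check of that claim, reusing the animals frame from above - a bare string and a one-element list group identically:

```python
import pandas as pd

animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
                        'height': [9.1, 6.0, 9.5, 34.0],
                        'weight': [7.9, 7.5, 9.9, 198.0]})

# A bare string and a one-element list produce the same grouping,
# which is exactly what Union[str, List[str]] is promising callers.
single = animals.groupby("kind")["height"].min()
listed = animals.groupby(["kind"])["height"].min()

assert single.to_dict() == listed.to_dict() == {"cat": 9.1, "dog": 6.0}
```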

List[Tuple[str, str]]

This is definitely the sort of thing I used to wind up breaking, especially if there was a change in the data format an API returned.  Nested data structures like this are good to keep as explicit as possible.
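As a tiny illustration of why the nested shape matters: if you accidentally pass a bare string where a (column, aggregation) tuple belongs, the "_".join inside the helper won't even error - it'll silently mangle the name, which is exactly the kind of bug the annotation helps you catch up front:

```python
# "_".join expects a tuple like ("height", "min")...
assert "_".join(("height", "min")) == "height_min"   # the intended shape

# ...but hand it a bare string and it happily joins the *characters*.
assert "_".join("height") == "h_e_i_g_h_t"           # the silent mangling
```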