Recasting Low-Cardinality Columns as Categoricals

The other day, I was grabbing a way-too-big DB query for local exploration. It was literally over 2GB as a CSV - which is a pain for a number of reasons! Not the least of which being that, while you're doing Exploratory Data Analysis, things take way too long - it doesn't take long for Cognitive Drift to break your rhythm!

Numerical columns can be taken down to size with the downcasting functions from a previous post. But what about Object/String columns? One of the best ways to reduce the size of a file like this is to recast certain columns to the Category type. Rather than allocating the full amount of memory needed for a string, we can represent it internally with a small number. The savings can be huge - the 2GB file went down to about 660MB!

The simple way to do it is to find columns where the number of unique values is less than half (or else we might not actually be saving memory). Then, we recast those columns to the Categorical type, the same way we do with Ints or whatever. And, of course, we'd like to do this in one easy step instead of actually looking at this by hand - if we've got over a hundred columns and don't know which ones to drop (that's the point of doing Exploratory Data Analysis, after all), we don't want to inspect each of these manually!

Here's a test DataFrame:

import random


df = pd.DataFrame(
    {"A": [random.choice(["animal", "vegetable", "mineral"]) for _ in range(10)],
     "B": list(range(10)),
     "C": [random.choice(["alice", "bob", "charlie"]) for _ in range(10)]}
)

Here's our old multi_assign function.

from typing import Callable, List


def multi_assign(
    df: pd.DataFrame,
    transform_fn: Callable[[pd.Series], pd.Series],
    condition: List[str],
) -> pd.DataFrame:
    df_to_use = df.copy()

    return df_to_use.assign(
        **{col: transform_fn(df_to_use[col]) for col in condition(df_to_use)}
    )

Defining a "multi-assign" function

And here's a function that grabs all our low-cardinality columns, and gives us our DF back with them turned into Category columns.

def all_low_cardinality_to_categorical(df: pd.DataFrame) -> pd.DataFrame:
    df_to_use = df.copy()
    transform_fn = lambda x: x.astype("category")
    n_entries = df_to_use.shape[0]
    objects = df_to_use.select_dtypes(include=["object"]).nunique()
    condition = lambda x: (
        x.select_dtypes(include=["object"]).nunique()[
            lambda x: x.div(n_entries).lt(0.5)
        ]
    ).index
    return multi_assign(df_to_use, transform_fn, condition)
    
all_low_cardinality_to_categorical(df).dtypes

Grab low-cardinality columns and return Category columns

A    category
B       int64
C    category
dtype: object

Output

Nice and easy! And we can add it to a chain!

cleaned = (
    df.pipe(all_low_cardinality_to_categorical)
    .pipe(all_float_to_int)
    .pipe(downcast_all, "float")
    .pipe(downcast_all, "integer")
    .pipe(downcast_all, target_type="unsigned", inital_type="integer")
)

Chaining together previous steps

And ooone last step. This is great for Pandas, but what about when we have to save the file? It'd be nice if we didn't have that multi-GB file (which we might make multiple versions of) hogging our memory. We can't do that with a regular CSV - that only lets us save as plain text. Let's take a step into the wider world of binary data formats!

CSVs and JSON save things as plain text. That text is structured in a certain way that makes it easy to parse. Binary formats are more like a picture file - which, among other benefits, lets us encode things in a space-saving way. The format we'll be using is HDF5, because it was the first one I saw that looked easy to use and would let us save things as Categoricals. There are others - Wes McKinney, creator of Pandas itself, is working on one called Arrow.

HDF5 doesn't have anything to do with Hadoop, by the way. The name comes from "Hierarchical Data Format".

Pandas makes it pretty easy, thanks to a package called PyTables that's operating under the hood.

cleaned.to_hdf(
    "data.h5",
    "df",
    format="table"
)

Output to HDF5 table

So, it's almost exactly the same as saving a regular CSV, with a few small changes. For one thing, the .h5 file extension. We also need to specify a "key" argument, which I'm sure would make sense if I knew more about all the neat things you can do with HDF5 files. In order to save our Categoricals, we also need to specify that format should be "table" (as opposed to the default "fixed").

This will actually be saved and loaded a LOT faster than CSVs! Another benefit!

Related Posts