Recasting Low-Cardinality Columns as Categoricals

Shrink string columns in Pandas by recasting them as Categoricals, then save the result with HDF5.


    The other day, I was grabbing a way-too-big DB query for local exploration.  It was literally over 2GB as a CSV - which is a pain for a number of reasons!  Not the least of which is that, while you're doing Exploratory Data Analysis, things take way too long - and it doesn't take much waiting for Cognitive Drift to break your rhythm!

    Numerical columns can be taken down to size with the downcasting functions from a previous post.  But what about Object/String columns?  One of the best ways to reduce the size of a file like this is to recast certain columns to the Category type.  Rather than allocating the full amount of memory needed for a string, we can represent it internally with a small number.  The savings can be huge - the 2GB file went down to about 660MB!
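To see the savings in action, here's a quick sketch comparing the deep memory usage of a repeated string column before and after the cast (the data is made up for illustration):

```python
import pandas as pd

# A column with many repeats of just three distinct strings
s = pd.Series(["animal", "vegetable", "mineral"] * 100_000)

as_object = s.memory_usage(deep=True)  # full Python string objects
as_category = s.astype("category").memory_usage(deep=True)  # small integer codes

print(f"object:   {as_object:,} bytes")
print(f"category: {as_category:,} bytes")
```

The category version stores each value as a small integer code plus one copy of each distinct string, so the gap grows with the number of repeats.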

    The simple way to do it is to find columns where the number of unique values is less than half the number of rows (or else we might not actually be saving memory).  Then, we recast those columns to the Categorical type, same way we do with Ints or whatever.  And, of course, we'd like to do this in one easy step instead of actually looking at this by hand - if we've got over a hundred columns and don't know which ones to drop (that's the point of doing Exploratory Data Analysis, after all), we don't want to inspect each of them manually!
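Spelled out, that rule of thumb is just a cardinality-ratio check.  A minimal sketch (column names are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["apple", "pear", "apple", "pear", "apple", "pear"],
    "id": ["a", "b", "c", "d", "e", "f"],  # every value unique - no savings
})

# Unique values per object column, as a fraction of the row count
ratio = df.select_dtypes(include=["object"]).nunique() / len(df)
worth_casting = ratio[ratio < 0.5].index.tolist()

print(worth_casting)  # "fruit" qualifies; "id" doesn't
```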

    Here's a test DataFrame:

    import random
    
    import pandas as pd
    
    df = pd.DataFrame(
        {"A": [random.choice(["animal", "vegetable", "mineral"]) for _ in range(10)],
         "B": list(range(10)),
         "C": [random.choice(["alice", "bob", "charlie"]) for _ in range(10)]}
    )

    Here's our old multi_assign function.

    def multi_assign(df, transform_fn, condition):
        df_to_use = df.copy()
        
        return (df_to_use
            .assign(
                **{col: transform_fn(df_to_use[col])
                   for col in condition(df_to_use)})
               )
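As a refresher on how it behaves, here's multi_assign applied to a toy transform - upper-casing every object column (the example data is made up):

```python
import pandas as pd

def multi_assign(df, transform_fn, condition):
    df_to_use = df.copy()
    return df_to_use.assign(
        **{col: transform_fn(df_to_use[col]) for col in condition(df_to_use)}
    )

df = pd.DataFrame({"name": ["alice", "bob"], "score": [1, 2]})

result = multi_assign(
    df,
    transform_fn=lambda col: col.str.upper(),
    condition=lambda d: d.select_dtypes(include=["object"]).columns,
)

print(result["name"].tolist())  # the copy inside means df itself is untouched
```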

    And here's a function that grabs all our low-cardinality columns, and gives us our DF back with them turned into Category columns.

    def all_low_cardinality_to_categorical(df):
        df_to_use = df.copy()
        
        transform_fn = lambda x: x.astype("category")
        
        # Object columns whose unique-value count is under half the row count
        condition = lambda x: (
            x.select_dtypes(include=["object"])
             .nunique()
             .loc[lambda n: n < len(x) * 0.5]
             .index
        )
        
        return multi_assign(df_to_use, transform_fn, condition)
        
    all_low_cardinality_to_categorical(df).dtypes
    A    category
    B       int64
    C    category
    dtype: object
    

    Nice and easy!  And we can add it to a chain!

    cleaned = (
        df.pipe(all_low_cardinality_to_categorical)
        .pipe(all_float_to_int)
        .pipe(downcast_all, "float")
        .pipe(downcast_all, "integer")
        .pipe(downcast_all, target_type="unsigned", initial_type="integer")
    )
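In case you don't have the earlier post handy, here's my best reconstruction of downcast_all - a hedged sketch of the idea, not necessarily the original implementation (all_float_to_int is omitted, and the selector mapping is my assumption):

```python
import numpy as np
import pandas as pd

# Assumption: which existing dtypes each target should sweep up
_SELECTORS = {"float": np.floating, "integer": np.integer, "unsigned": np.integer}

def downcast_all(df, target_type, initial_type=None):
    """Downcast every column of initial_type to the smallest fitting dtype."""
    if initial_type is None:
        initial_type = target_type
    df = df.copy()
    for col in df.select_dtypes(include=[_SELECTORS[initial_type]]).columns:
        # pd.to_numeric picks the smallest dtype the values actually fit in
        df[col] = pd.to_numeric(df[col], downcast=target_type)
    return df

df = pd.DataFrame({"n": [1, 2, 3], "x": [0.5, 1.5, 2.5]})  # int64 / float64
smaller = df.pipe(downcast_all, "float").pipe(downcast_all, "integer")

print(smaller.dtypes)  # n shrinks to int8, x to float32
```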

    And ooone last step.  This is great for Pandas, but what about when we have to save the file?  It'd be nice if we didn't have that multi-GB file (which we might make multiple versions of) hogging our disk.  A regular CSV can't help us there - it only lets us save as plain text.  Let's take a step into the wider world of binary data formats!

    CSVs and JSON save things as plain text.  That text is structured in a certain way that makes it easy to parse.  Binary formats are more like a picture file - which, among other benefits, lets us encode things in a space-saving way.  The format we'll be using is HDF5, because it was the first one I saw that looked easy to use and would let us save things as Categoricals.  There are others - Wes McKinney, creator of Pandas itself, is working on one called Arrow.

    HDF5 doesn't have anything to do with Hadoop, by the way.  The name comes from "Hierarchical Data Format".

    Pandas makes it pretty easy, thanks to a package called PyTables that's operating under the hood.  

    cleaned.to_hdf("data.h5",
                   key="df",
                   format="table")

    So, it's almost exactly the same as saving a regular CSV, with a few small changes.  For one thing, the .h5 file extension.  We also need to specify a "key" argument, which I'm sure would make sense if I knew more about all the neat things you can do with HDF5 files.  In order to save our Categoricals, we also need to specify that format should be "table" (as opposed to the default "fixed").

    This will actually be saved and loaded a LOT faster than CSVs!  Another benefit!
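For completeness, here's the round trip - a sketch assuming PyTables is installed, writing to a temp directory so nothing clutters your working folder:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "A": pd.Series(["animal", "mineral", "animal"], dtype="category"),
    "B": [1, 2, 3],
})

# Temp path just for the example - use whatever .h5 path you like
path = os.path.join(tempfile.mkdtemp(), "data.h5")
df.to_hdf(path, key="df", format="table")

loaded = pd.read_hdf(path, key="df")
print(loaded.dtypes)  # "A" comes back as category, not object
```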

    Matthew Alhonte (@MattAlhonte)
    Super villain in somebody's action hero movie. Experienced a radioactive freak accident at a young age, which rendered him part-snake and strangely adept at Python.