Picking Low-Hanging Fruit With Dask

Ah, laziness. You love it, I love it, everyone agrees it's better.

Flesh-and-blood is famously lazy. Pandas the package, however, uses Eager Evaluation. What's Eager Evaluation, you ask? Is Pandas really judgey, hanging out on the street corner and being fierce to the style choices of people walking by? Well, yes, but that's not the most relevant sense in which I mean it here.

Eager evaluation means that once you call pd.read_csv(), Pandas immediately jumps to read the whole CSV into memory.

"Wait!" I hear you ask. "Isn't that what we want? Why would I call the function if I didn't want it to happen?"

Eventually, yes that is what you want. But sometimes you want time between when you give the command and when the computer hops to it. That's why it's Lazy and not Inactive - it will get to the job at some point; it'll just procrastinate a bit.

For example, last week I was tasked with searching through the output of a government API. It had records since the 90s, and was about 300MB. Now, this isn't actually outside the realm of what Pandas can handle - it's quite optimized, and as long as the file can fit into memory, Pandas can mess with. However, it won't necessarily be fast. Furthermore, my laptop is old, and I didn't feel like offloading what I was doing to a remote machine.

Furthermore, I knew I only needed a subset of the file. Here's where Laziness comes in handy. With Eager evaluation, Pandas would have to load the whole thing into memory, then filter based on my criteria.

Enter Dask: Dask is a cool little library that seamlessly allows you to parallelize Pandas. Pandas by itself is pretty well-optimized, but it's designed to only work on one core. Dask, on the other hand, lets you split the work between different cores - both on a single machine or on a distributed system. It doesn't implement every single thing you can do with Pandas, though, so only use it when you have to.

I probably should have titled this post "Parallel Pandas", but hey, too late now - plus "Lazy Pandas" more easily lends itself to a nice visual metaphor. Anyway, Laziness is part of the process. Because Dask is lazy by default (much like your humble narrator), we can define our fileout loading it, like so:

import dask.dataframe as dd

df = dd.read_csv("giantThing.csv")

Create a Dask DataFrame from a CSV

Pandas was taking a long time to parse the file. What's more, this file had a few quirks - I'd figured out that it needed a special text encoding, and I wasn't sure if other weirdness was happening. Was my computer taking a long time to nom the file, or was I going to wait for a few minutes to find an error message? It's a catch-22 - I needed to figure out how to cut the file down to size, but to do that I would have needed to be able to load it into memory. Dask to the rescue!

This file wasn't terribly well-documented - I had an inkling as to what column would tell me which rows I wanted, but I wasn't sure. So, first thing I did was check out the first few rows. Remember, in order to see these first 5 rows in Pandas, I'd have to load the whole thing into memory (which might or might not even work!).

df.head()

With that, I was able to do a simple spot-check and see if there were any weird columns that might get in the way of parsing. Furthermore, I confirmed that the ID columns I was looking at contained something vaguely like what I was looking for. Even MORE interestingly, I found that it was formatted slightly irregularly. Even more use for laziness! Let's load just that one column into memory (you could do this with a loop, sure - but selecting a single column is a lot clumsier)

df["ORG_NAME"].compute()

Note the .compute() method at the end. That's necessary because of the Lazy Evaluation - just calling a column name doesn't make Dask think you necessarily want the thing now. I'm not sure why I didn't have to call it with df.head(), though (that's the Hackers & Slackers Codeblogging vérité style!).

So, now that I've seen the formatting, I found out that I'm going to have to filter it with a call of a str.contains() method instead of an exact value. Let's poke around a teensy bit more.

orgDF = df["ORG_NAME"]

orgFiltered = corp[
    corp.str.contains("baseName", na=False)
].compute().shape

Turns out it was only about 800 rows!

So, let's filter that and make a regular Pandas Dataframe (and probably a new CSV for later!)

df = dd.read_csv("giantThing.csv")

orgFiltered = df[df["ORG_NAME"].str.contains(
    "baseName",
    na=False
)].compute()

df2 = pd.DataFrame(orgFiltered)
df2.to_csv("filteredThing.csv")

Note that I actually could have done this with base Pandas, through the use of the iterator flag. However, I didn't realize that it's only wind up being so few rows. It also would have been slower - and the speed difference makes a huge difference in terms of how fluidly you can explore.

For instance, the na=False flag was something I discovered would be needed because of a quirk in the file - again, this sort of thing becomes a lot easier to diagnose when you can iterate quickly, and you know you're not going to time out from running out of memory.

For comparison's sake, here's the code for filtering on the fly and loading into Pandas:

iter_csv = pd.read_csv(
    "giantThing.csv",
    iterator=True, 
    chunksize=1000
)

df = pd.concat(
    [chunk[chunk["ORG_NAME"].str.contains("baseName", na=False)] for chunk in iter_csv]
)

On my computer, that took a little over 3 minutes. While the Dask code took about a minute.

Hackers and Slackers

Lazy Pandas and Dask

Related Posts