Sometimes you wind up with duplicate column names in your Pandas DataFrame. This isn't necessarily a huge deal if we're just messing with a smallish file in Jupyter. But, if we wanna do something like load it into a database, that'll be a problem. It can also interfere with our other cleaning functions - I ran into this the other day when reducing the size of a giant data file by downcasting it (as per this previous post). The cleaning functions required a 1D input (so, a Series or List) - but calling the name of a duplicate column gave me a 2D DataFrame.
Getting rid of these duplicates, especially when there's more than one pair, can actually be harder than ya might think. If you try renaming them, you'll just rename both. You can't use the
.drop() method to drop columns by location. As far as I know, the only way to do it is to find the locations, then reconstruct a new DataFrame excluding those locations.
The only problem is that we're lazy and in a hurry, and that's error-prone and boring. Soo... here's a cleaning function to
.pipe() into our process (plus some little helper functions).
Here's our test DataFrame:
First, we grab which columns have duplicates.
Here's a function that, given a DataFrame and the name of a duplicate column, gives us the indices of all instances after the first one.
Note that I'm using Type Annotations. They're a pretty recent addition to Python (good guide) and I'm finding them very useful. Especially if you make lots of little functions, it can be hard to remember what exactly they all do when you're coming back to code after a while (or if someone else has to use them). I like long, descriptive names - but that doesn't always get the point across. Combine those with a short description of the inputs and outputs, though, and it can be really helpful for remembering how all the teeth of the gears fit together. And since the types are optional, we still get the best of both worlds - we get the convenience of Python duck typing while we're hacking around, but when we're packaging up code (like, say, for a blog post :P ), we can add these and hopefully make them easier to follow.
So, for instance, this takes a DataFrame and a string.
df almost always refers to a Pandas DataFrame, but
col could refer just as easily to a string or a Pandas Series (or a List). And it outputs a list of integers.
And here's one that leverages the two previous functions to get the indices for all the duplicate columns in a DataFrame.
And this puts it all together. Note how, since the goal is to create a function that we can toss into a big method chain we're slowly trying to get to a point where we can just input a DataFrame and get a DataFrame back (or as close to that as possible).
And here we do the really tedious, error-prone part. Personally, I find packaging a steps like this into their its own function to be immensely stress-relieving.
And now we can get on with what we were doing before! This let me use my sequence of downcasting pipes the other day, letting me turn a ~660MB DataFrame down to ~300MB.