- Recasting Low-Cardinality Columns as Categoricals
- Removing Duplicate Columns in Pandas
- Downcast Numerical Data Types with Pandas
- Using Random Forests for Feature Selection with Categorical Features
- Tuning Random Forests Hyperparameters with Binary Search Part III: min_samples_leaf
- Tuning Random Forests Hyperparameters with Binary Search Part II: max_depth
- Tuning Machine Learning Hyperparameters with Binary Search
- Importing Excel Datetimes Into Pandas, Part II
- Importing Excel Datetimes Into Pandas, Part I
- All That Is Solid Melts Into Graphs
- Trash Pandas: Messy, Convenient DB Operations via Pandas
- A Dirty Way of Cleaning Data (ft. Pandas & SQL)
- Getting Conda Envs (And Environment Variables!) To Play Nicely With Cron
- Using Pandas and SQLAlchemy to Simplify Databases
Sometimes you wind up with duplicate column names in your Pandas DataFrame. This isn't necessarily a huge deal if we're just messing with a smallish file in Jupyter. But, if we wanna do something like load it into a database, that'll be a problem. It can also interfere with our other cleaning functions - I ran into this the other day when reducing the size of a giant data file by downcasting it (as per this previous post). The cleaning functions required a 1D input (so, a Series or List) - but calling the name of a duplicate column gave me a 2D DataFrame.
Getting rid of these duplicates, especially when there's more than one pair, can actually be harder than ya might think. If you try renaming them, you'll just rename both. You can't use the
.drop() method to drop columns by location. As far as I know, the only way to do it is to find the locations, then reconstruct a new DataFrame excluding those locations.
The only problem is that we're lazy and in a hurry, and that's error-prone and boring. Soo...here's a cleaning function to
.pipe() into our process (plus some little helper functions).
Here's our test DF
pd.concat( [pd.Series(list(range(10)), name=x) for x in ["A", "B", "A", "C", "D", "C", "E"]], axis=1, )
First, we grab which columns have duplicates.
def get_duplicate_cols(df: pd.DataFrame) -> pd.Series: return pd.Series(df.columns).value_counts()[lambda x: x>1]
get_duplicate_cols(df) A 2 C 2 dtype: int64
Here's a function that, given a DataFrame and the name of a duplicate column, gives us the indices of all instances after the first one.
from typing import List def get_dup_col_indices(df: pd.DataFrame, col: str) -> List[int]: return [x for x in enumerate(df.columns) if x==col][1:] get_dup_col_indices(df, "A") 
Note that I'm using Type Annotations. They're a pretty recent addition to Python (good guide) and I'm finding them very useful. Especially if you make lots of little functions, it can be hard to remember what exactly they all do when you're coming back to code after a while (or if someone else has to use them). I like long, descriptive names - but that doesn't always get the point across. Combine those with a short description of the inputs and outputs, though, and it can be really helpful for remembering how all the teeth of the gears fit together. And since the types are optional, we still get the best of both worlds - we get the convenience of Python duck typing while we're hacking around, but when we're packaging up code (like, say, for a blog post :P ), we can add these and hopefully make them easier to follow.
So, for instance, this takes a DataFrame and a string.
df almost always refers to a Pandas DataFrame, but
col could refer just as easily to a string or a Pandas Series (or a List). And it outputs a list of integers.
And here's one that leverages the two previous functions to get the indices for all the duplicate columns in a DataFrame.
import itertools as it def get_all_dup_col_indices(df: pd.DataFrame) -> List[int]: dup_cols = get_duplicate_cols(df).index return sorted(list(it.chain.from_iterable(get_dup_col_indices(df, x) for x in dup_cols))) get_all_dup_col_indices(df) [2, 5]
And this puts it all together. Note how, since the goal is to create a function that we can toss into a big method chain we're slowly trying to get to a point where we can just input a DataFrame and get a DataFrame back (or as close to that as possible).
from toolz.itertoolz import sliding_window def remove_dup_cols(df: pd.DataFrame) -> df: pd.DataFrame: indices_to_remove = get_all_dup_col_indices(df) if len(indices_to_remove)==0: return df window = list(sliding_window(2, indices_to_remove)) first = df.iloc[:, :indices_to_remove] middle = [df.iloc[:, x + 1 : x] for x in window] last = df.iloc[:, indices_to_remove[-1] + 1:] if (indices_to_remove[-1]) == (df.shape-1): return pd.concat([first] + middle, axis=1) else: return pd.concat([first] + middle + [last], axis=1) remove_dup_cols(df)
And here we do the really tedious, error-prone part. Personally, I find packaging a steps like this into their its own function to be immensely stress-relieving.
And now we can get on with what we were doing before! This let me use my sequence of downcasting pipes the other day, letting me turn a ~660MB dataframe down to ~300MB.
cleaned = ( df.pipe(remove_dup_cols) .pipe(all_float_to_int) .pipe(downcast_all, "float") .pipe(downcast_all, "integer") .pipe(downcast_all, target_type="unsigned", inital_type="integer") )