Removing Duplicate Columns in Pandas

Dealing with duplicate column names in your Pandas DataFrame.

    Sometimes you wind up with duplicate column names in your Pandas DataFrame. This isn't necessarily a huge deal if we're just messing with a smallish file in Jupyter.  But, if we wanna do something like load it into a database, that'll be a problem.  It can also interfere with our other cleaning functions - I ran into this the other day when reducing the size of a giant data file by downcasting it (as per this previous post).  The cleaning functions required a 1D input (so, a Series or List) - but selecting a duplicated column by name gave me a 2D DataFrame instead.
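    Here's a quick demo of that 2D-result problem, using the same duplicate-column layout as the test DF below:

```python
import pandas as pd

# build a DataFrame where "A" and "C" each appear twice as column names
df = pd.concat(
    [pd.Series(list(range(10)), name=x) for x in ["A", "B", "A", "C", "D", "C", "E"]],
    axis=1,
)

# selecting a duplicated name returns a 2D DataFrame, not a 1D Series
sub = df["A"]
print(type(sub).__name__)  # DataFrame
print(sub.shape)           # (10, 2)
```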

    Getting rid of these duplicates, especially when there's more than one pair, can actually be harder than ya might think.  If you try renaming them, you'll just rename both.  You can't use the .drop() method to drop columns by location.  As far as I know, the only way to do it is to find the locations, then reconstruct a new DataFrame excluding those locations.
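    A quick illustration of both of those dead ends:

```python
import pandas as pd

df = pd.concat(
    [pd.Series(list(range(10)), name=x) for x in ["A", "B", "A", "C", "D", "C", "E"]],
    axis=1,
)

# .rename() hits every column with that name, not just one of them
renamed = df.rename(columns={"A": "A_fixed"})
print(list(renamed.columns))  # ['A_fixed', 'B', 'A_fixed', 'C', 'D', 'C', 'E']

# likewise, .drop() removes *all* columns named "A" at once
dropped = df.drop(columns="A")
print(dropped.shape)  # (10, 5)
```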

    The only problem is that we're lazy and in a hurry, and that's error-prone and boring.  Soo...here's a cleaning function to .pipe() into our process (plus some little helper functions).

    Here's our test DF:

    import pandas as pd

    df = pd.concat(
        [pd.Series(list(range(10)), name=x) for x in ["A", "B", "A", "C", "D", "C", "E"]],
        axis=1,
    )

    First, we grab which columns have duplicates.

    def get_duplicate_cols(df: pd.DataFrame) -> pd.Series:
        return pd.Series(df.columns).value_counts()[lambda x: x>1]
    
    get_duplicate_cols(df)
    A    2
    C    2
    dtype: int64

    Here's a function that, given a DataFrame and the name of a duplicate column, gives us the indices of all instances after the first one.

    from typing import List
    
    def get_dup_col_indices(df: pd.DataFrame, col: str) -> List[int]:
        return [x[0] for x in enumerate(df.columns)
                if x[1]==col][1:]
                
    get_dup_col_indices(df, "A")
    [2]

    Note that I'm using Type Annotations.  They're a pretty recent addition to Python (good guide) and I'm finding them very useful.  Especially if you make lots of little functions, it can be hard to remember what exactly they all do when you're coming back to code after a while (or if someone else has to use them).  I like long, descriptive names - but that doesn't always get the point across.  Combine those with a short description of the inputs and outputs, though, and it can be really helpful for remembering how all the teeth of the gears fit together.  And since the types are optional, we still get the best of both worlds - we get the convenience of Python duck typing while we're hacking around, but when we're packaging up code (like, say, for a blog post :P ), we can add these and hopefully make them easier to follow.

    So, for instance, this takes a DataFrame and a string.  df almost always refers to a Pandas DataFrame, but col could refer just as easily to a string or a Pandas Series (or a List).  And it outputs a list of integers.

    And here's one that leverages the two previous functions to get the indices for all the duplicate columns in a DataFrame.

    import itertools as it
    
    def get_all_dup_col_indices(df: pd.DataFrame) -> List[int]:
        dup_cols = get_duplicate_cols(df).index
        
        return sorted(list(it.chain.from_iterable(get_dup_col_indices(df, x)
                               for x in dup_cols)))
                               
    get_all_dup_col_indices(df)
    [2, 5]

    And this puts it all together.  Note how, since the goal is a function we can toss into a big method chain, we're slowly getting to a point where we can just input a DataFrame and get a DataFrame back (or as close to that as possible).

    from toolz.itertoolz import sliding_window
    
    def remove_dup_cols(df: pd.DataFrame) -> pd.DataFrame:
        indices_to_remove = get_all_dup_col_indices(df)
        
        if len(indices_to_remove)==0:
            return df
        window = list(sliding_window(2, indices_to_remove))
        
        first = df.iloc[:, :indices_to_remove[0]]
        middle = [df.iloc[:, x[0] + 1 : x[1]] for x in window]
        last = df.iloc[:, indices_to_remove[-1] + 1:]
        
        if (indices_to_remove[-1]) == (df.shape[1]-1):
            return pd.concat([first] + middle, axis=1) 
        else:
            return pd.concat([first] + middle + [last], axis=1)
            
    remove_dup_cols(df)
    A B C D E
    0 0 0 0 0 0
    1 1 1 1 1 1
    2 2 2 2 2 2
    3 3 3 3 3 3
    4 4 4 4 4 4
    5 5 5 5 5 5
    6 6 6 6 6 6
    7 7 7 7 7 7
    8 8 8 8 8 8
    9 9 9 9 9 9

    And here we do the really tedious, error-prone part.  Personally, I find packaging steps like this into their own function to be immensely stress-relieving.
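    (As an aside: if you don't need the window-slicing machinery, pandas can express the same "keep the first occurrence of each name" idea via Index.duplicated - a compact alternative sketch:)

```python
import pandas as pd

def remove_dup_cols_alt(df: pd.DataFrame) -> pd.DataFrame:
    # keep only the first occurrence of each column name
    return df.loc[:, ~df.columns.duplicated()]

df = pd.concat(
    [pd.Series(list(range(10)), name=x) for x in ["A", "B", "A", "C", "D", "C", "E"]],
    axis=1,
)
print(list(remove_dup_cols_alt(df).columns))  # ['A', 'B', 'C', 'D', 'E']
```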

    And now we can get on with what we were doing before!  This let me use my sequence of downcasting pipes the other day, shrinking a ~660MB DataFrame down to ~300MB.

    cleaned = (
        df.pipe(remove_dup_cols)
        .pipe(all_float_to_int)
        .pipe(downcast_all, "float")
        .pipe(downcast_all, "integer")
        .pipe(downcast_all, target_type="unsigned", inital_type="integer")
    )
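    The downcasting helpers come from that previous post; to make this read self-contained, here's a rough sketch of what they might look like.  The names (and the inital_type keyword) mirror the pipeline above, but the bodies are my guesses built on pd.to_numeric's downcast argument, not the original implementations:

```python
from typing import Optional

import pandas as pd

def all_float_to_int(df: pd.DataFrame) -> pd.DataFrame:
    # hypothetical sketch: cast float columns holding only whole numbers to int
    out = df.copy()
    for col in out.select_dtypes("float").columns:
        if out[col].notna().all() and (out[col] % 1 == 0).all():
            out[col] = out[col].astype("int64")
    return out

def downcast_all(
    df: pd.DataFrame, target_type: str, inital_type: Optional[str] = None
) -> pd.DataFrame:
    # hypothetical sketch: downcast every column of a given dtype family
    # to the smallest dtype that fits its values
    out = df.copy()
    for col in out.select_dtypes(inital_type or target_type).columns:
        out[col] = pd.to_numeric(out[col], downcast=target_type)
    return out
```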
    Matthew Alhonte
    Center of the Universe
    Super villain in somebody's action hero movie. Experienced a radioactive freak accident at a young age, which rendered him part-snake and strangely adept at Python.