The Jupyter notebook for this post can be found here:
Tidy Data principles tell us that we need one "entry" per row and one attribute per column. Sometimes, though, you have a column with more than one attribute in it. Sometimes it's a few different attributes; sometimes it's several instances of the same attribute. This will not do - at least, not if those columns mean anything of value.
Generally, this will be a string, so let's assume that it is. R (or at least the Tidyverse) actually has a built-in DataFrame operation for this (`tidyr::separate`), but for Pandas we'll have to approximate. For an example of multiple attributes in the same column, let's look at this dataset of municipal employees in Houston, Texas.
 | title | dept | salary | race | gender | hire_date
---|---|---|---|---|---|---
0 | POLICE OFFICER | Houston Police Department-HPD | 45279.0 | White | Male | 2015-02-03 |
1 | ENGINEER/OPERATOR | Houston Fire Department (HFD) | 63166.0 | White | Male | 1982-02-08 |
2 | SENIOR POLICE OFFICER | Houston Police Department-HPD | 66614.0 | Black | Male | 1984-11-26 |
3 | ENGINEER | Public Works & Engineering-PWE | 71680.0 | Asian | Male | 2012-03-26 |
4 | CARPENTER | Houston Airport System (HAS) | 42390.0 | White | Male | 2013-11-04 |
The full name of the department and the department's acronym is currently in the same column, which could become a problem. For instance, we might have an output from a database table with department info that we'd like to merge it with, and it just uses the acronym as its foreign key. And let's also say that we're ingesting this as part of a long function chain and don't want to break our stride - so let's put that all in one convenient function.
Split
from typing import List, Optional

import pandas as pd


def assign_split_col(
    df: pd.DataFrame, col: str, name_list: List[str], pat: Optional[str] = None
) -> pd.DataFrame:
    df = df.copy()
    split_col = df[col].str.split(pat, expand=True)
    return df.assign(
        **dict(
            zip(name_list, [split_col.iloc[:, x] for x in range(split_col.shape[1])])
        )
    )
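The call that produced the table below isn't shown in the post; presumably it was something like `emp.pipe(assign_split_col, col="dept", name_list=["Dept Name", "Dept Acronym"], pat="[-(]")` (the `emp` name and the split pattern are borrowed from later in the post). Unrolled on a two-row stand-in, the same steps look like this:

```python
import pandas as pd

# Two-row stand-in for the real employee DataFrame (called `emp` later in the post)
emp = pd.DataFrame(
    {"dept": ["Houston Police Department-HPD", "Houston Fire Department (HFD)"]}
)

# Split on either "-" or "(" -- a regex character class
split_col = emp["dept"].str.split("[-(]", expand=True)
result = emp.assign(**{"Dept Name": split_col[0], "Dept Acronym": split_col[1]})
print(result["Dept Acronym"].tolist())  # ['HPD', 'HFD)'] -- note the stray ')'
```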
 | title | dept | salary | race | gender | hire_date | Dept Name | Dept Acronym
---|---|---|---|---|---|---|---|---
0 | POLICE OFFICER | Houston Police Department-HPD | 45279.0 | White | Male | 2015-02-03 | Houston Police Department | HPD |
1 | ENGINEER/OPERATOR | Houston Fire Department (HFD) | 63166.0 | White | Male | 1982-02-08 | Houston Fire Department | HFD) |
2 | SENIOR POLICE OFFICER | Houston Police Department-HPD | 66614.0 | Black | Male | 1984-11-26 | Houston Police Department | HPD |
3 | ENGINEER | Public Works & Engineering-PWE | 71680.0 | Asian | Male | 2012-03-26 | Public Works & Engineering | PWE |
4 | CARPENTER | Houston Airport System (HAS) | 42390.0 | White | Male | 2013-11-04 | Houston Airport System | HAS) |
So, what did we do here?
df = df.copy()
First, we make a copy of our DataFrame. This can slow things down, but for our purposes, it's better to have a "Pure Function" that won't change the underlying data.
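A quick sketch of why the copy matters, with hypothetical helper names:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})


def add_col_impure(d: pd.DataFrame) -> pd.DataFrame:
    d["b"] = 0  # mutates the caller's DataFrame in place
    return d


def add_col_pure(d: pd.DataFrame) -> pd.DataFrame:
    d = d.copy()
    d["b"] = 0  # only the copy changes
    return d


add_col_pure(df)
print("b" in df.columns)  # False: the original is untouched
```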
split_col = df[col].str.split(pat, expand=True)
Next, we use the column's `str` accessor (more on these here), which gives us vectorized versions of Python's string methods on a string column. In particular, we call its `split` method (docs here), which takes one additional argument in Pandas. Normally, `split` takes a string and returns a list. Lists aren't a native Pandas datatype, and in our case they don't really help us. But when you set the `expand` flag to `True`, instead of a list you get a new DataFrame, with each column holding one element of the split list.
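To see the difference on a toy Series:

```python
import pandas as pd

s = pd.Series(["Houston Police Department-HPD", "Houston Airport System-HAS"])

as_lists = s.str.split("-")               # a Series of Python lists
as_frame = s.str.split("-", expand=True)  # a DataFrame, one column per piece

print(as_lists.iloc[0])  # ['Houston Police Department', 'HPD']
print(as_frame.shape)    # (2, 2)
```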
return df.assign(
    **dict(
        zip(name_list, [split_col.iloc[:, x] for x in range(split_col.shape[1])])
    )
)
The final line returns a new DataFrame that's the same as our original, except that we've assigned each of the new string columns to a new column in it. It uses similar syntax to the multi-assigner from this previous post: you can unpack a dictionary (with the `**` syntax) into the `assign` method, and it will give you a DataFrame with the new columns you wanted.
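The same trick in isolation, with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
new_cols = {"doubled": df["a"] * 2, "squared": df["a"] ** 2}

# df.assign(**new_cols) is equivalent to df.assign(doubled=..., squared=...)
out = df.assign(**new_cols)
print(list(out.columns))  # ['a', 'doubled', 'squared']
```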
Hrm, looks like this isn't actually as clean as we wanted! Some of the abbreviations end with a closing parenthesis `)` and some do not. We could add an extra step and clean those up, or we could look at an alternate way of doing this!
In our original version, we told it to split on either a `-` or a `(` character. That works when we want to keep the whole thing. In this case, though, we want to do a little cleaning as we extract, and we know enough about the format to specify exactly what we want to take. So let's make a new function that uses the `str.extract` method instead.
Extract
def assign_regex_col(
    df: pd.DataFrame, col: str, name_list: List[str], pat: str
) -> pd.DataFrame:
    df = df.copy()
    split_col = df[col].str.extract(pat, expand=True)
    return df.assign(
        **dict(
            zip(name_list, [split_col.iloc[:, x] for x in range(split_col.shape[1])])
        )
    )
Let's set up some capture groups. We only want two things: a run of characters that are not `-` or `(`, and then a continuous run of uppercase letters. We can leave the separator behind.
emp.pipe(
    assign_regex_col,
    col="dept",
    name_list=["Dept Name", "Dept Acronym"],
    pat=r"([^-(]*?)\s*[-(]([A-Z]+)",
).head()
 | title | dept | salary | race | gender | hire_date | Dept Name | Dept Acronym
---|---|---|---|---|---|---|---|---
0 | POLICE OFFICER | Houston Police Department-HPD | 45279.0 | White | Male | 2015-02-03 | Houston Police Department | HPD |
1 | ENGINEER/OPERATOR | Houston Fire Department (HFD) | 63166.0 | White | Male | 1982-02-08 | Houston Fire Department | HFD |
2 | SENIOR POLICE OFFICER | Houston Police Department-HPD | 66614.0 | Black | Male | 1984-11-26 | Houston Police Department | HPD |
3 | ENGINEER | Public Works & Engineering-PWE | 71680.0 | Asian | Male | 2012-03-26 | Public Works & Engineering | PWE |
4 | CARPENTER | Houston Airport System (HAS) | 42390.0 | White | Male | 2013-11-04 | Houston Airport System | HAS |
Bam! Perfectly captured.
List-like Data
Let's also look at an example where a column holds something more list-like: a variable number of instances of the same attribute. The data I had handy with this structure was from work, so let's generate some fake data with the wonderful faker package instead. We'll make a bunch of addresses, each with a variable number of people attached, plus the number of items each address ordered.
import random

from faker import Faker

fake = Faker()
res_df = pd.DataFrame(
    {
        "Address": [fake.address() for _ in range(10)],
        "Residents": [
            ",".join(fake.name() for _ in range(random.randint(1, 3)))
            for _ in range(10)
        ],
        "N_Items_Ordered": [random.randint(1, 15) for _ in range(10)],
    }
)
 | Address | Residents | N_Items_Ordered
---|---|---|---
0 | 544 Hill Harbor Suite 928\nWest Angel, NV 51251 | Jessica Kelly,Christina Morales,Melissa Jennings | 3 |
1 | 748 Phillip Hollow\nPort Timothyview, AL 93192 | Charles Roberts,Patricia Barber | 7 |
2 | 1884 Christopher Cove Suite 471\nWilliamstad, NC 72167 | Emma Ware,Veronica Watson,Ms. Jennifer Lang | 9 |
3 | 324 Johnson Bridge Apt. 348\nCookmouth, VA 73966 | Bobby Madden,Colleen Wallace,Erica Savage | 13 |
4 | 948 Newton Fields Apt. 455\nGallegosberg, WV 04569 | Brandon Jones,Jared Cole,Ryan Rivera | 13 |
Now let's make a function for painlessly splitting these.
def split_list_like(df: pd.DataFrame, col: str, new_col_prefix: str, pat: str = None):
    df = df.copy()
    split_col = df[col].str.split(pat, expand=True)
    return df.assign(
        **{
            f"{new_col_prefix}_{x}": split_col.iloc[:, x]
            for x in range(split_col.shape[1])
        }
    )
Notice that the last line is a little bit different. For the other two, we had a fixed number of output columns, and so it made sense to zip the split column with a list of column names, and then make a dictionary with that. We could do that here, but it would involve an extra step of seeing how long the longest list is - and we'd like this to work as painlessly as possible! Plus, it'll be resilient to new data - what if we get a new output that has an entry with an even longer list? Better to supply it with a prefix that can be used on an arbitrary number of columns, rather than a premade list.
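A quick illustration that the column count automatically tracks the longest entry:

```python
import pandas as pd

tags = pd.Series(["a,b", "a,b,c,d", "a"])
split_col = tags.str.split(",", expand=True)

# One column per element of the longest list; shorter rows are padded with missing values
print(split_col.shape[1])  # 4
```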
split = res_df.pipe(
    split_list_like,
    col="Residents",
    new_col_prefix="Resident",
    pat=",",
)
 | Address | Residents | N_Items_Ordered | Resident_0 | Resident_1 | Resident_2
---|---|---|---|---|---|---
0 | 544 Hill Harbor Suite 928\nWest Angel, NV 51251 | Jessica Kelly,Christina Morales,Melissa Jennings | 3 | Jessica Kelly | Christina Morales | Melissa Jennings |
1 | 748 Phillip Hollow\nPort Timothyview, AL 93192 | Charles Roberts,Patricia Barber | 7 | Charles Roberts | Patricia Barber | None |
2 | 1884 Christopher Cove Suite 471\nWilliamstad, NC 72167 | Emma Ware,Veronica Watson,Ms. Jennifer Lang | 9 | Emma Ware | Veronica Watson | Ms. Jennifer Lang |
3 | 324 Johnson Bridge Apt. 348\nCookmouth, VA 73966 | Bobby Madden,Colleen Wallace,Erica Savage | 13 | Bobby Madden | Colleen Wallace | Erica Savage |
4 | 948 Newton Fields Apt. 455\nGallegosberg, WV 04569 | Brandon Jones,Jared Cole,Ryan Rivera | 13 | Brandon Jones | Jared Cole | Ryan Rivera |
Now that we've split them into different columns, we can do fun things like indexing by individuals instead of households with Pandas' `melt` (docs here), or its more-automated cousin `wide_to_long` (docs here). This could be useful if you wanted to join the data to an output that had the name as a foreign key. `wide_to_long` is the better fit here, because it was made for when you've got multiple columns representing different instances of the same thing.
long = pd.wide_to_long(
    split, stubnames="Resident", i="Address", j="nth_res", sep="_"
)
Address | nth_res | Residents | N_Items_Ordered | Resident
---|---|---|---|---
544 Hill Harbor Suite 928\nWest Angel, NV 51251 | 0 | Jessica Kelly,Christina Morales,Melissa Jennings | 3 | Jessica Kelly |
748 Phillip Hollow\nPort Timothyview, AL 93192 | 0 | Charles Roberts,Patricia Barber | 7 | Charles Roberts |
1884 Christopher Cove Suite 471\nWilliamstad, NC 72167 | 0 | Emma Ware,Veronica Watson,Ms. Jennifer Lang | 9 | Emma Ware |
324 Johnson Bridge Apt. 348\nCookmouth, VA 73966 | 0 | Bobby Madden,Colleen Wallace,Erica Savage | 13 | Bobby Madden |
948 Newton Fields Apt. 455\nGallegosberg, WV 04569 | 0 | Brandon Jones,Jared Cole,Ryan Rivera | 13 | Brandon Jones |
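For comparison, the `melt` route on a hand-made stand-in of the same shape might look like this sketch:

```python
import pandas as pd

# A hand-made stand-in for the split DataFrame above
split = pd.DataFrame(
    {
        "Address": ["123 Main St", "456 Oak Ave"],
        "Resident_0": ["Alice Smith", "Bob Jones"],
        "Resident_1": ["Carol White", None],
    }
)

long_form = split.melt(
    id_vars="Address",
    value_vars=["Resident_0", "Resident_1"],
    var_name="nth_res",
    value_name="Resident",
)
print(long_form.shape)  # (4, 3): one row per (address, resident slot) pair
```

Note that with `melt` you name the value columns yourself, whereas `wide_to_long` finds every column matching the stub prefix for you.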
Wrap-up
One of the main advantages of cleaning & manipulating data in code as opposed to a GUI is that you can get much, much more precise control - which lets you deal with whatever weird format your data came to you in. The downside is that you can lose some fluency as you go from a visual interface to code. I think the sweet spot is to build up a little collection of functions that essentially make a new "interface" for dealing with your dataset. Make functions that let you focus on substance instead of having to fuss with indices.