The Jupyter notebook for this post can be found here:

mattalhonte/code_for_blog_posts

Tidy Data principles tell us that we need one "entry" per row and one attribute per column.  Sometimes, though, you have a column with more than one attribute in it.  Sometimes it's a few different attributes - sometimes it's several instances of the same attribute.  This will not do - at least, not if those columns mean anything of value.

Generally, this will be a string, so let's assume that it is.  R (or at least the Tidyverse) has a built-in DataFrame operation for this (tidyr's separate), but in Pandas we'll have to approximate it.  For an example of multiple attributes in the same column, let's look at this dataset of municipal employees in Houston, Texas.

title dept salary race gender hire_date
0 POLICE OFFICER Houston Police Department-HPD 45279.0 White Male 2015-02-03
1 ENGINEER/OPERATOR Houston Fire Department (HFD) 63166.0 White Male 1982-02-08
2 SENIOR POLICE OFFICER Houston Police Department-HPD 66614.0 Black Male 1984-11-26
3 ENGINEER Public Works & Engineering-PWE 71680.0 Asian Male 2012-03-26
4 CARPENTER Houston Airport System (HAS) 42390.0 White Male 2013-11-04

The full name of the department and the department's acronym are currently in the same column, which could become a problem.  For instance, we might have an output from a database table with department info that we'd like to merge with, and it just uses the acronym as its foreign key.  And let's also say that we're ingesting this as part of a long function chain and don't want to break our stride - so let's put that all in one convenient function.

Split

import pandas as pd
from typing import List


def assign_split_col(df: pd.DataFrame, col: str, name_list: List[str], pat: str = None):
    # Work on a copy so the caller's DataFrame is never mutated
    df = df.copy()
    # expand=True turns the split pieces into the columns of a new DataFrame
    split_col = df[col].str.split(pat, expand=True)

    return df.assign(
        **dict(
            zip(name_list, [split_col.iloc[:, x] for x in range(split_col.shape[1])])
        )
    )
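The table below comes from piping the data through this function, splitting on either - or (.  Sketched here with a two-row stand-in for the emp DataFrame (the split pattern is an assumption, reconstructed from the prose later in the post):

```python
import pandas as pd
from typing import List


def assign_split_col(df: pd.DataFrame, col: str, name_list: List[str], pat: str = None):
    df = df.copy()
    split_col = df[col].str.split(pat, expand=True)
    return df.assign(
        **dict(zip(name_list, [split_col.iloc[:, x] for x in range(split_col.shape[1])]))
    )


# Toy stand-in for the municipal-employee data
emp = pd.DataFrame({"dept": ["Houston Police Department-HPD",
                             "Houston Fire Department (HFD)"]})

# A multi-character pattern is treated as a regex, so this splits on "-" or "("
out = emp.pipe(assign_split_col, col="dept",
               name_list=["Dept Name", "Dept Acronym"], pat=r"[-(]")
```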
title dept salary race gender hire_date Dept Name Dept Acronym
0 POLICE OFFICER Houston Police Department-HPD 45279.0 White Male 2015-02-03 Houston Police Department HPD
1 ENGINEER/OPERATOR Houston Fire Department (HFD) 63166.0 White Male 1982-02-08 Houston Fire Department HFD)
2 SENIOR POLICE OFFICER Houston Police Department-HPD 66614.0 Black Male 1984-11-26 Houston Police Department HPD
3 ENGINEER Public Works & Engineering-PWE 71680.0 Asian Male 2012-03-26 Public Works & Engineering PWE
4 CARPENTER Houston Airport System (HAS) 42390.0 White Male 2013-11-04 Houston Airport System HAS)

So, what did we do here?  

df = df.copy()

First, we make a copy of our DataFrame.  This can slow things down, but for our purposes, it's better to have a "Pure Function" that won't change the underlying data.
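A quick illustration of the difference (a toy sketch, not from the original notebook):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})


def add_doubled(d: pd.DataFrame) -> pd.DataFrame:
    d = d.copy()           # all changes happen on the copy...
    d["b"] = d["a"] * 2
    return d


result = add_doubled(df)   # ...so the caller's DataFrame is untouched
```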

split_col = df[col].str.split(pat, expand=True)

Next, we use the str accessor of the column in question (more on these here), which gives us direct access to vectorized versions of string methods on a string column.  In particular, we use the split method (docs here), which has one additional option in Pandas.  Normally, split takes a string and returns a list.  Lists aren't a native Pandas datatype, and in our case don't really help us.  But when you set the expand argument to True, instead of a list it gives you a new DataFrame, with each column being one element of the split list.
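A minimal sketch of the difference the expand flag makes:

```python
import pandas as pd

s = pd.Series(["Houston Police Department-HPD",
               "Public Works & Engineering-PWE"])

as_lists = s.str.split("-")               # a Series whose values are Python lists
as_frame = s.str.split("-", expand=True)  # a DataFrame, one column per piece
```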

return df.assign(
        **dict(
            zip(name_list, [split_col.iloc[:, x] for x in range(split_col.shape[1])])
        )
    )

The final line returns a new DataFrame that's the same as our original, except where we've assigned each of the new string columns to a new column in the new DataFrame.  It uses similar syntax to the multi-assigner from this previous post.  You can unpack a dictionary (with the ** syntax) into the assign method, and it will give you a DataFrame with the new columns you wanted.
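The same pattern in miniature (with made-up column names):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# Build the name -> values mapping, then unpack it into assign
new_cols = dict(zip(["double", "square"], [df["x"] * 2, df["x"] ** 2]))
df2 = df.assign(**new_cols)
```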

Hrm, looks like this isn't actually as clean as we wanted!  Some of the abbreviations end with a close-parentheses ) and some do not.  We could add an extra step and clean those up, or we could look at an alternate way of doing this!

In our original version, we told it to split on either a - or ( character.  That'd be fine if we wanted to keep everything on both sides of the separator.  In this case, though, we want to do a little cleaning as we extract - and we know enough about the format to specify exactly what to take.  So, let's make a new function that instead uses the str.extract method.

Extract

def assign_regex_col(df: pd.DataFrame, col: str, name_list: List[str], pat: str = None):
    df = df.copy()
    split_col = df[col].str.extract(pat, expand=True)

    return df.assign(
        **dict(
            zip(name_list, [split_col.iloc[:, x] for x in range(split_col.shape[1])])
        )
    )

Let's set up some capture groups.  We only want two things - a bunch of characters that are not - or (, and then a continuous bunch of uppercase letters.  We can leave the separator behind.

emp.pipe(
    assign_regex_col,
    col="dept",
    name_list=["Dept Name", "Dept Acronym"],
    pat=r"([^-(]*)[-(]([A-Z]*)",
).head()
title dept salary race gender hire_date Dept Name Dept Acronym
0 POLICE OFFICER Houston Police Department-HPD 45279.0 White Male 2015-02-03 Houston Police Department HPD
1 ENGINEER/OPERATOR Houston Fire Department (HFD) 63166.0 White Male 1982-02-08 Houston Fire Department HFD
2 SENIOR POLICE OFFICER Houston Police Department-HPD 66614.0 Black Male 1984-11-26 Houston Police Department HPD
3 ENGINEER Public Works & Engineering-PWE 71680.0 Asian Male 2012-03-26 Public Works & Engineering PWE
4 CARPENTER Houston Airport System (HAS) 42390.0 White Male 2013-11-04 Houston Airport System HAS

Bam!  Perfectly captured.
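One variation worth knowing (not used in the post): str.extract takes column names straight from named capture groups, so you can skip the name_list machinery entirely.

```python
import pandas as pd

dept = pd.Series(["Houston Police Department-HPD",
                  "Houston Fire Department (HFD)"])

# (?P<name>...) groups become the output columns automatically
out = dept.str.extract(r"(?P<Dept_Name>[^-(]*)[-(](?P<Dept_Acronym>[A-Z]*)")
```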

List-like Data

Let's also look at an example where the column holds something more list-like: a variable number of instances of the same attribute.

The data I had handy with this structure was from work, so let's generate some fake data with the wonderful faker package instead.  Let's make a bunch of addresses, each with a variable number of people attached, plus the number of items each address ordered.

from faker import Faker
import random

fake = Faker()

res_df = pd.DataFrame(
    {
        "Address": [fake.address() for _ in range(10)],
        "Residents": [
            ",".join(fake.name() for _ in range(random.randint(1, 3)))
            for _ in range(10)
        ],
        "N_Items_Ordered": [random.randint(1, 15) for _ in range(10)]
    }
)
Address Residents N_Items_Ordered
0 544 Hill Harbor Suite 928\nWest Angel, NV 51251 Jessica Kelly,Christina Morales,Melissa Jennings 3
1 748 Phillip Hollow\nPort Timothyview, AL 93192 Charles Roberts,Patricia Barber 7
2 1884 Christopher Cove Suite 471\nWilliamstad, NC 72167 Emma Ware,Veronica Watson,Ms. Jennifer Lang 9
3 324 Johnson Bridge Apt. 348\nCookmouth, VA 73966 Bobby Madden,Colleen Wallace,Erica Savage 13
4 948 Newton Fields Apt. 455\nGallegosberg, WV 04569 Brandon Jones,Jared Cole,Ryan Rivera 13

Now let's make a function for painlessly splitting these.

def split_list_like(df: pd.DataFrame, col: str, new_col_prefix: str, pat: str = None):
    df = df.copy()
    split_col = df[col].str.split(pat, expand=True)

    return df.assign(
        **{
            f"{new_col_prefix}_{x}": split_col.iloc[:, x]
            for x in range(split_col.shape[1])
        }
    )

Notice that the last line is a little bit different.  For the other two, we had a fixed number of output columns, and so it made sense to zip the split column with a list of column names, and then make a dictionary with that.  We could do that here, but it would involve an extra step of seeing how long the longest list is - and we'd like this to work as painlessly as possible!  Plus, it'll be resilient to new data - what if we get a new output that has an entry with an even longer list?  Better to supply it with a prefix that can be used on an arbitrary number of columns, rather than a premade list.

split = res_df.pipe(
    split_list_like,
    col="Residents",
    new_col_prefix="Resident",
    pat=",",
)
Address Residents N_Items_Ordered Resident_0 Resident_1 Resident_2
0 544 Hill Harbor Suite 928\nWest Angel, NV 51251 Jessica Kelly,Christina Morales,Melissa Jennings 3 Jessica Kelly Christina Morales Melissa Jennings
1 748 Phillip Hollow\nPort Timothyview, AL 93192 Charles Roberts,Patricia Barber 7 Charles Roberts Patricia Barber None
2 1884 Christopher Cove Suite 471\nWilliamstad, NC 72167 Emma Ware,Veronica Watson,Ms. Jennifer Lang 9 Emma Ware Veronica Watson Ms. Jennifer Lang
3 324 Johnson Bridge Apt. 348\nCookmouth, VA 73966 Bobby Madden,Colleen Wallace,Erica Savage 13 Bobby Madden Colleen Wallace Erica Savage
4 948 Newton Fields Apt. 455\nGallegosberg, WV 04569 Brandon Jones,Jared Cole,Ryan Rivera 13 Brandon Jones Jared Cole Ryan Rivera

Now that we've split them into different columns, we can do fun things like index the data by individuals instead of households with Pandas' melt (docs here), or its more-automated cousin, wide_to_long (docs here).  This could be useful if you wanted to join the data to an output that had the name as a foreign key.
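A melt version might look like this (a sketch with made-up mini data; you list the value columns yourself, which is the manual step wide_to_long automates):

```python
import pandas as pd

split = pd.DataFrame({
    "Address": ["12 A St", "99 B Ave"],
    "Resident_0": ["Ann Ham", "Bo Diaz"],
    "Resident_1": ["Al Poe", None],
})

# One row per (address, resident slot) pair
long = split.melt(
    id_vars="Address",
    value_vars=["Resident_0", "Resident_1"],
    var_name="nth_res",
    value_name="Resident",
)
```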

wide_to_long is better for this instance because it was made for when you've got multiple columns representing different instances of the same thing.

long = pd.wide_to_long(
    split, stubnames="Resident", i="Address", j="nth_res", sep="_"
)
Residents N_Items_Ordered Resident
Address nth_res
544 Hill Harbor Suite 928\nWest Angel, NV 51251 0 Jessica Kelly,Christina Morales,Melissa Jennings 3 Jessica Kelly
748 Phillip Hollow\nPort Timothyview, AL 93192 0 Charles Roberts,Patricia Barber 7 Charles Roberts
1884 Christopher Cove Suite 471\nWilliamstad, NC 72167 0 Emma Ware,Veronica Watson,Ms. Jennifer Lang 9 Emma Ware
324 Johnson Bridge Apt. 348\nCookmouth, VA 73966 0 Bobby Madden,Colleen Wallace,Erica Savage 13 Bobby Madden
948 Newton Fields Apt. 455\nGallegosberg, WV 04569 0 Brandon Jones,Jared Cole,Ryan Rivera 13 Brandon Jones

Wrap-up

One of the main advantages of cleaning & manipulating data in code as opposed to a GUI is that you can get much, much more precise control - which lets you deal with whatever weird format your data came to you in.  The downside is that you can lose some fluency as you go from a visual interface to code.  I think the sweet spot is to build up a little collection of functions that essentially make a new "interface" for dealing with your dataset.  Make functions that let you focus on substance instead of having to fuss with indices.