The Jupyter notebook for this post can be found here:
Tidy Data principles tell us that we need one "entry" per row and one attribute per column. Sometimes, though, you have a column with more than one attribute in it. Sometimes it's a few different attributes; sometimes it's several instances of the same attribute. This will not do - at least, not if those columns mean anything of value.
Generally, this will be a string, so let's assume that it is. R (or at least the Tidyverse) actually has a built-in DataFrame operation for this (`tidyr::separate`), but for Pandas we'll have to approximate. For an example of multiple attributes in the same column, let's look at this dataset of municipal employees in Houston, Texas.
 | title | dept | salary | race | gender | hire_date
---|---|---|---|---|---|---
0 | POLICE OFFICER | Houston Police Department-HPD | 45279.0 | White | Male | 2015-02-03 |
1 | ENGINEER/OPERATOR | Houston Fire Department (HFD) | 63166.0 | White | Male | 1982-02-08 |
2 | SENIOR POLICE OFFICER | Houston Police Department-HPD | 66614.0 | Black | Male | 1984-11-26 |
3 | ENGINEER | Public Works & Engineering-PWE | 71680.0 | Asian | Male | 2012-03-26 |
4 | CARPENTER | Houston Airport System (HAS) | 42390.0 | White | Male | 2013-11-04 |
The full name of the department and the department's acronym is currently in the same column, which could become a problem. For instance, we might have an output from a database table with department info that we'd like to merge it with, and it just uses the acronym as its foreign key. And let's also say that we're ingesting this as part of a long function chain and don't want to break our stride - so let's put that all in one convenient function.
Split
from typing import List, Optional

import pandas as pd


def assign_split_col(
    df: pd.DataFrame, col: str, name_list: List[str], pat: Optional[str] = None
) -> pd.DataFrame:
    df = df.copy()
    split_col = df[col].str.split(pat, expand=True)
    return df.assign(
        **dict(
            zip(name_list, [split_col.iloc[:, x] for x in range(split_col.shape[1])])
        )
    )
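The call that produced the table below isn't shown in the post; presumably it was something like `emp.pipe(assign_split_col, col="dept", name_list=["Dept Name", "Dept Acronym"], pat="[-(]")` (the `emp` name and the split pattern are borrowed from later in the post). Unrolled on a two-row stand-in, the same steps look like this:

```python
import pandas as pd

# Two-row stand-in for the real employee DataFrame (called `emp` later in the post)
emp = pd.DataFrame(
    {"dept": ["Houston Police Department-HPD", "Houston Fire Department (HFD)"]}
)

# Split on either "-" or "(" -- a regex character class
split_col = emp["dept"].str.split("[-(]", expand=True)
result = emp.assign(**{"Dept Name": split_col[0], "Dept Acronym": split_col[1]})
print(result["Dept Acronym"].tolist())  # ['HPD', 'HFD)'] -- note the stray ')'
```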
 | title | dept | salary | race | gender | hire_date | Dept Name | Dept Acronym
---|---|---|---|---|---|---|---|---
0 | POLICE OFFICER | Houston Police Department-HPD | 45279.0 | White | Male | 2015-02-03 | Houston Police Department | HPD |
1 | ENGINEER/OPERATOR | Houston Fire Department (HFD) | 63166.0 | White | Male | 1982-02-08 | Houston Fire Department | HFD) |
2 | SENIOR POLICE OFFICER | Houston Police Department-HPD | 66614.0 | Black | Male | 1984-11-26 | Houston Police Department | HPD |
3 | ENGINEER | Public Works & Engineering-PWE | 71680.0 | Asian | Male | 2012-03-26 | Public Works & Engineering | PWE |
4 | CARPENTER | Houston Airport System (HAS) | 42390.0 | White | Male | 2013-11-04 | Houston Airport System | HAS) |
So, what did we do here?
df = df.copy()
First, we make a copy of our DataFrame. This can slow things down, but for our purposes, it's better to have a "Pure Function" that won't change the underlying data.
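A quick sketch of why the copy matters, with hypothetical helper names:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})


def add_col_impure(d: pd.DataFrame) -> pd.DataFrame:
    d["b"] = 0  # mutates the caller's DataFrame in place
    return d


def add_col_pure(d: pd.DataFrame) -> pd.DataFrame:
    d = d.copy()
    d["b"] = 0  # only the copy changes
    return d


add_col_pure(df)
print("b" in df.columns)  # False: the original is untouched
```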
split_col = df[col].str.split(pat, expand=True)
Next, we use the column's `str` accessor (more on these here), which gives us vectorized versions of Python's string methods on a string column. In particular, we call its `split` method (docs here), which takes one additional argument in Pandas. Normally, `split` takes a string and returns a list. Lists aren't a native Pandas datatype, and in our case they don't really help us. But when you set the `expand` flag to `True`, instead of a list you get a new DataFrame, with each column holding one element of the split list.
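To see the difference on a toy Series:

```python
import pandas as pd

s = pd.Series(["Houston Police Department-HPD", "Houston Airport System-HAS"])

as_lists = s.str.split("-")               # a Series of Python lists
as_frame = s.str.split("-", expand=True)  # a DataFrame, one column per piece

print(as_lists.iloc[0])  # ['Houston Police Department', 'HPD']
print(as_frame.shape)    # (2, 2)
```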
return df.assign(
    **dict(
        zip(name_list, [split_col.iloc[:, x] for x in range(split_col.shape[1])])
    )
)
The final line returns a new DataFrame that's the same as our original, except that we've assigned each of the new string columns to a new column in it. It uses similar syntax to the multi-assigner from this previous post: you can unpack a dictionary (with the `**` syntax) into the `assign` method, and it will give you a DataFrame with the new columns you wanted.
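The same trick in isolation, with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
new_cols = {"doubled": df["a"] * 2, "squared": df["a"] ** 2}

# df.assign(**new_cols) is equivalent to df.assign(doubled=..., squared=...)
out = df.assign(**new_cols)
print(list(out.columns))  # ['a', 'doubled', 'squared']
```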
Hrm, looks like this isn't actually as clean as we wanted! Some of the abbreviations end with a closing parenthesis `)` and some do not. We could add an extra step and clean those up, or we could look at an alternate way of doing this!
In our original version, we told it to split on either a `-` or a `(` character. That works when we want to keep the whole thing. In this case, though, we want to do a little cleaning as we extract, and we know enough about the format to specify exactly what we want to take. So let's make a new function that uses the `str.extract` method instead.
Extract
def assign_regex_col(
    df: pd.DataFrame, col: str, name_list: List[str], pat: str
) -> pd.DataFrame:
    df = df.copy()
    split_col = df[col].str.extract(pat, expand=True)
    return df.assign(
        **dict(
            zip(name_list, [split_col.iloc[:, x] for x in range(split_col.shape[1])])
        )
    )
Let's set up some capture groups. We only want two things: a run of characters that are not `-` or `(`, and then a continuous run of uppercase letters. We can leave the separator behind.
emp.pipe(
    assign_regex_col,
    col="dept",
    name_list=["Dept Name", "Dept Acronym"],
    pat=r"([^-(]*?)\s*[-(]([A-Z]+)",
).head()
 | title | dept | salary | race | gender | hire_date | Dept Name | Dept Acronym
---|---|---|---|---|---|---|---|---
0 | POLICE OFFICER | Houston Police Department-HPD | 45279.0 | White | Male | 2015-02-03 | Houston Police Department | HPD |
1 | ENGINEER/OPERATOR | Houston Fire Department (HFD) | 63166.0 | White | Male | 1982-02-08 | Houston Fire Department | HFD |
2 | SENIOR POLICE OFFICER | Houston Police Department-HPD | 66614.0 | Black | Male | 1984-11-26 | Houston Police Department | HPD |
3 | ENGINEER | Public Works & Engineering-PWE | 71680.0 | Asian | Male | 2012-03-26 | Public Works & Engineering | PWE |
4 | CARPENTER | Houston Airport System (HAS) | 42390.0 | White | Male | 2013-11-04 | Houston Airport System | HAS |
Bam! Perfectly captured.
List-like Data
Let's also look at an example where a column holds something more list-like: a variable number of instances of the same attribute. The data I had handy with this structure was from work, so let's generate some fake data with the wonderful faker package instead. We'll make a bunch of addresses, each with a variable number of people attached, plus the number of items each address ordered.
import random

from faker import Faker

fake = Faker()
res_df = pd.DataFrame(
    {
        "Address": [fake.address() for _ in range(10)],
        "Residents": [
            ",".join(fake.name() for _ in range(random.randint(1, 3)))
            for _ in range(10)
        ],
        "N_Items_Ordered": [random.randint(1, 15) for _ in range(10)],
    }
)
 | Address | Residents | N_Items_Ordered
---|---|---|---
0 | 544 Hill Harbor Suite 928\nWest Angel, NV 51251 | Jessica Kelly,Christina Morales,Melissa Jennings | 3 |
1 | 748 Phillip Hollow\nPort Timothyview, AL 93192 | Charles Roberts,Patricia Barber | 7 |
2 | 1884 Christopher Cove Suite 471\nWilliamstad, NC 72167 | Emma Ware,Veronica Watson,Ms. Jennifer Lang | 9 |
3 | 324 Johnson Bridge Apt. 348\nCookmouth, VA 73966 | Bobby Madden,Colleen Wallace,Erica Savage | 13 |
4 | 948 Newton Fields Apt. 455\nGallegosberg, WV 04569 | Brandon Jones,Jared Cole,Ryan Rivera | 13 |
Now let's make a function for painlessly splitting these.
def split_list_like(df: pd.DataFrame, col: str, new_col_prefix: str, pat: str = None):
    df = df.copy()
    split_col = df[col].str.split(pat, expand=True)
    return df.assign(
        **{
            f"{new_col_prefix}_{x}": split_col.iloc[:, x]
            for x in range(split_col.shape[1])
        }
    )
Notice that the last line is a little bit different. For the other two, we had a fixed number of output columns, and so it made sense to zip the split column with a list of column names, and then make a dictionary with that. We could do that here, but it would involve an extra step of seeing how long the longest list is - and we'd like this to work as painlessly as possible! Plus, it'll be resilient to new data - what if we get a new output that has an entry with an even longer list? Better to supply it with a prefix that can be used on an arbitrary number of columns, rather than a premade list.
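A quick illustration that the column count automatically tracks the longest entry:

```python
import pandas as pd

tags = pd.Series(["a,b", "a,b,c,d", "a"])
split_col = tags.str.split(",", expand=True)

# One column per element of the longest list; shorter rows are padded with missing values
print(split_col.shape[1])  # 4
```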
split = res_df.pipe(
    split_list_like,
    col="Residents",
    new_col_prefix="Resident",
    pat=",",
)
 | Address | Residents | N_Items_Ordered | Resident_0 | Resident_1 | Resident_2
---|---|---|---|---|---|---
0 | 544 Hill Harbor Suite 928\nWest Angel, NV 51251 | Jessica Kelly,Christina Morales,Melissa Jennings | 3 | Jessica Kelly | Christina Morales | Melissa Jennings |
1 | 748 Phillip Hollow\nPort Timothyview, AL 93192 | Charles Roberts,Patricia Barber | 7 | Charles Roberts | Patricia Barber | None |
2 | 1884 Christopher Cove Suite 471\nWilliamstad, NC 72167 | Emma Ware,Veronica Watson,Ms. Jennifer Lang | 9 | Emma Ware | Veronica Watson | Ms. Jennifer Lang |
3 | 324 Johnson Bridge Apt. 348\nCookmouth, VA 73966 | Bobby Madden,Colleen Wallace,Erica Savage | 13 | Bobby Madden | Colleen Wallace | Erica Savage |
4 | 948 Newton Fields Apt. 455\nGallegosberg, WV 04569 | Brandon Jones,Jared Cole,Ryan Rivera | 13 | Brandon Jones | Jared Cole | Ryan Rivera |
Now that we've split them into different columns, we can do fun things like indexing by individuals instead of households with Pandas' `melt` (docs here), or its more-automated cousin `wide_to_long` (docs here). This could be useful if you wanted to join the data to an output that had the name as a foreign key. `wide_to_long` is the better fit here, because it was made for when you've got multiple columns representing different instances of the same thing.
long = pd.wide_to_long(
    split, stubnames="Resident", i="Address", j="nth_res", sep="_"
)
Address | nth_res | Residents | N_Items_Ordered | Resident
---|---|---|---|---
544 Hill Harbor Suite 928\nWest Angel, NV 51251 | 0 | Jessica Kelly,Christina Morales,Melissa Jennings | 3 | Jessica Kelly |
748 Phillip Hollow\nPort Timothyview, AL 93192 | 0 | Charles Roberts,Patricia Barber | 7 | Charles Roberts |
1884 Christopher Cove Suite 471\nWilliamstad, NC 72167 | 0 | Emma Ware,Veronica Watson,Ms. Jennifer Lang | 9 | Emma Ware |
324 Johnson Bridge Apt. 348\nCookmouth, VA 73966 | 0 | Bobby Madden,Colleen Wallace,Erica Savage | 13 | Bobby Madden |
948 Newton Fields Apt. 455\nGallegosberg, WV 04569 | 0 | Brandon Jones,Jared Cole,Ryan Rivera | 13 | Brandon Jones |
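For comparison, the `melt` route on a hand-made stand-in of the same shape might look like this sketch:

```python
import pandas as pd

# A hand-made stand-in for the split DataFrame above
split = pd.DataFrame(
    {
        "Address": ["123 Main St", "456 Oak Ave"],
        "Resident_0": ["Alice Smith", "Bob Jones"],
        "Resident_1": ["Carol White", None],
    }
)

long_form = split.melt(
    id_vars="Address",
    value_vars=["Resident_0", "Resident_1"],
    var_name="nth_res",
    value_name="Resident",
)
print(long_form.shape)  # (4, 3): one row per (address, resident slot) pair
```

Note that with `melt` you name the value columns yourself, whereas `wide_to_long` finds every column matching the stub prefix for you.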
Wrap-up
One of the main advantages of cleaning & manipulating data in code as opposed to a GUI is that you can get much, much more precise control - which lets you deal with whatever weird format your data came to you in. The downside is that you can lose some fluency as you go from a visual interface to code. I think the sweet spot is to build up a little collection of functions that essentially make a new "interface" for dealing with your dataset. Make functions that let you focus on substance instead of having to fuss with indices.