A Dirty Way of Cleaning Data (ft. Pandas & SQL)

Code Snippet Corner ft. Pandas & SQL

Warning The following is FANTASTICALLY not-secure.  Do not put this in a script that's going to be running unsupervised.  This is for interactive sessions where you're prototyping the data cleaning methods that you're going to use, and/or just manually entering stuff.  Especially if there's any chance there could be something malicious hiding in the data to be uploaded.  We're going to be executing formatted strings of SQL unsanitized code.  Also, this will lead to LOTS of silent failures, which are arguably The Worst Thing - if guaranteed correctness is a requirement, leave this for the tinkering table.  Alternatively, if it's a project where "getting something in there is better than nothing", this can provide a lot of bang for your buck.  Actually, it's purely for entertainment purposes and not for human consumption.

Let's say you were helping someone take a bunch of scattered Excel files and CSVs and input them all into a MySQL database.  This is a very iterative, trial & error process.  We certainly don't want to be re-entering a bunch of boilerplate.  Pandas to the rescue!  We can painlessly load those files into a DataFrame, then just export them to the db!

Well, not so fast  First off, loading stuff into a DB is a task all its own - Pandas and your RDBMS have different kinds of tolerance for mistakes, and differ in often-unpredictable ways.  For example, one time I was performing a task similar to the one described here (taking scattered files and loading them into a DB) - I was speeding along nicely, but then ran into a speedbump: turns out Pandas generally doesn't infer that a column is a date unless you tell it specifically, and will generally parse dates as strings.  Now, this was fine when the dates were present - MySQL is pretty smart about accepting different forms of dates & times.  But one thing it doesn't like is accepting an empty string '' into a date or time column.  Not a huge deal, just had to cast the column as a date:

df['date'] = pd.to_datetime(df['date'])

Now the blank strings are NaT, which MySQL knows how to handle!

This was simple enough, but there's all kinds of little hiccups that can happen.  And, unfortunately, writing a DataFrame to a DB table is an all-or-nothing affair - if there's one error, that means none of the rows will write.  Which can get pretty annoying if you were trying to write a decent-sized DataFrame, especially if the first error doesn't show up until one of the later rows.  Waiting sucks.  And it's not just about being impatient - long waiting times can disrupt your flow.

Rapid prototyping & highly-interactive development are some of Python's greatest strengths, and they are great strengths indeed!  Paul Graham (one of the guys behind Y Combinator) once made the comparison between REPL-heavy development and the popularizing of oil paints (he was talking about LISP, but it's also quite true of Python, as Python took a lot of its cues from LISP):

Before oil paint became popular, painters used a medium, called tempera , that cannot be blended or over-painted. The cost of mistakes was high, and this tended to make painters conservative. Then came oil paint, and with it a great change in style. Oil "allows for second thoughts". This proved a decisive advantage in dealing with difficult subjects like the human figure.The new medium did not just make painters' lives easier. It made possible a new and more ambitious kind of painting. Janson writes:Without oil, the Flemish Masters'conquest of visible reality would have been much more limited. Thus, from a technical point of view, too, they deserve to be called the "fathers of modern painting" , for oil has been the painter's basic medium ever since. As a material, tempera is no lesss beautiful than oil. But the flexibility of oil paint gives greater scope to the imagination--that was the deciding factor.
Programming is now undergoing a similar change...Meanwhile, ideas borrowed from Lisp increasingly turn up in the mainstream: interactive programming environments, garbage collection, and run-time typing to name a few.More powerful tools are taking the risk out of exploration. That's good news for programmers, because it means that we will be able to undertake more ambitious projects. The use of oil paint certainly had this effect. The period immediately following its adoption was a golden age for painting. There are signs already that something similar is happening in programming.
(Emphasis mine)
From here: http://www.cs.oswego.edu/~blue/xhx/books/ai/ns1/section02/main.html

A little scenario to demonstrate:

Let's pretend we have a MySQL instance running, and have already created a database named items

import pymysql
from sqlalchemy import create_engine
import sqlalchemy
import pandas as pd

cnx = create_engine('mysql+pymysql://analyst:[email protected]:3306/items)

pd.io.sql.execute("""CREATE TABLE books( \
id                               VARCHAR(40) PRIMARY KEY NOT NULL \
,author                          VARCHAR(255) \
,copies                          INT)""", cnx)

df = pd.DataFrame({
    "author": ["Alice", "Bob", "Charlie"],
    "copies": [2, "", 7, ],}, 
    index = [1, 2, 3])
    #Notice that one of these has the wrong data type!
    
df.to_sql(name='books',con=cnx,if_exists='append',index=False)
#Yeah, I'm not listing this whole stacktrace.  Fantastic package with some extremely helpful Exceptions, but you've gotta scroll a whole bunch to find em.  Here's the important part:
InternalError: (pymysql.err.InternalError) (1366, "Incorrect integer value: '' for column 'copies' at row 1") [SQL: 'INSERT INTO books (id, author, copies) VALUES (%(id)s, %(author)s, %(copies)s)'] [parameters: {'id': 2, 'author': 'Bob', 'copies': ''}] (Background on this error at: http://sqlalche.me/e/2j85)

Soo, let's tighten this feedback loop, shall we?

We'll iterate through the DataFrame with the useful iterrows() method.  This gives us essentially an enum made from our DataFrame - we'll get a bunch of tuples giving us the index as the first element and the row as its own Pandas Series as the second.

for x in df.iterrows():
    try:
        pd.DataFrame(x[1]).transpose().to_sql(name='books',
                          con=cnx,
                          if_exists='append',
                         index_label='id')
    except:
        continue

Let's unpack that a bit.

Remember that we're getting a two-element tuple, with the good stuff in the second element, so

x[1]

Next, we convert the Series to a one-entry DataFrame, because the Series doesn't have the DataFrame's to_sql() method.

pd.DataFrame(x[1])

The default behavior will assume this is a single column with, each variable being the address of a different row.  MySQL isn't going to be having it.  Sooo, we transpose!

pd.DataFrame(x[1]).transpose()

And finally, we use our beloved to_sql method on that.

Let's check our table now!

pd.io.sql.read_sql_table("books", cnx, index_col='id')
  	author	copies
id
1	Alice	2

It wrote the first row!  Not much of a difference with this toy example, but once you were writing a few thousand rows and the error didn't pop up until the 3000th, this would make a pretty noticeable difference in your ability to quickly experiment with different cleaning schemes.

Note that this will still short-circuit as soon as we hit the error.  If we wanted to make sure we got all the valid input before working on our tough cases, we could make a little try/except block.


for x in df.iterrows():
    try:
        pd.DataFrame(x[1]).transpose().to_sql(name='books',
                          con=cnx,
                          if_exists='append',
                         index=False,)
    except:
        continue

This will try to write each line, and if it encounters an Exception it'll continue the loop.

pd.io.sql.read_sql_table("books", cnx, index_col='id')
	author	copies
id		
1	Alice	2
3	Charlie	7

Alright, now the bulk of our data's in the db!  Whatever else happens, you've done that much!  Now you can relax a bit, which is useful for stimulating the creativity you'll need for the more complicated edge cases.

So, we're ready to start testing new cleaning schemes?  Well, not quite yet...

Let's say we went and tried to think up a fix.  We go to test it out and...

#Note that we want to see our exceptions here, so either do without the the try/except block
for x in df.iterrows():
    pd.DataFrame(x[1]).transpose().to_sql(name='books',
                              con=cnx,
                              if_exists='append',
                             index=False,
                             )

#OR have it print the exception
for x in df.iterrows():
    try:
        pd.DataFrame(x[1]).transpose().to_sql(name='books',
                          con=cnx,
                          if_exists='append',
                         index_label='id')
    except Exception as e:
        print(e)
        continue
        
#Either way, we get...
(pymysql.err.IntegrityError) (1062, "Duplicate entry '1' for key 'PRIMARY'") [SQL: 'INSERT INTO books (id, author, copies) VALUES (%(id)s, %(author)s, %(copies)s)'] [parameters: {'id': 1, 'author': 'Alice', 'copies': 2}] (Background on this error at: http://sqlalche.me/e/gkpj)
(pymysql.err.InternalError) (1366, "Incorrect integer value: '' for column 'copies' at row 1") [SQL: 'INSERT INTO books (id, author, copies) VALUES (%(id)s, %(author)s, %(copies)s)'] [parameters: {'id': 2, 'author': 'Bob', 'copies': ''}] (Background on this error at: http://sqlalche.me/e/2j85)
(pymysql.err.IntegrityError) (1062, "Duplicate entry '3' for key 'PRIMARY'") [SQL: 'INSERT INTO books (id, author, copies) VALUES (%(id)s, %(author)s, %(copies)s)'] [parameters: {'id': 3, 'author': 'Charlie', 'copies': 7}] (Background on this error at: http://sqlalche.me/e/gkpj)            

The error we're interested is in there, but what's all this other nonsense crowding it?

Well, one of the handy things about a database is that it'll enforce uniqueness based on the constraints you give it.  It's already got an entry with an id value of 1, so it's going to complain if you try to put another one.  In addition to providing a lot of distraction, this'll also slow us down considerably - after all, part of the point was to make our experiments with data-cleaning go faster!

Luckily, Pandas' wonderful logical indexing will make it a snap to ensure that we only bother with entries that aren't in the database yet.

#First, let's get the indices that are in there
usedIDs = pd.read_sql_table("books", cnx, columns=["id"])["id"].values

df[~df.index.isin(usedIDs)]
    author	copies
2	Bob	
#Remember how the logical indexing works: We want every element of the dataframe where the index ISN'T in our array of IDs that are already in the DB

This will also be shockingly quick - Pandas' logical indexing takes advantage of all that magic going on under the hood.  Using it, instead of manually iteration, can literally bring you from waiting minutes to waiting seconds.

Buuut, that's a lot of stuff to type!  We're going to be doing this A LOT, so how about we just turn it into a function?

#Ideally we'd make a much more modular version, but for this toy example we'll be messy and hardcode some paramaters
def filterDFNotInDB(df):
    usedIDs = pd.read_sql_table("books", cnx, columns=["id"])["id"].values
    return df[~df.index.isin(usedIDs)]

So, next time we think we've made some progress on an edge case, we just call...

#Going back to the to_sql method here - we don't want to have to loop through every single failing case, or get spammed with every variety of error message the thing can throw at us.

filterDFNotInDB(cleanedDF).to_sql(name='books',
                          con=cnx,
                          if_exists='append',
                         index_label='id')

Actually, let's clean that up even more - the more keys we hit, the more opportunities to make a mistake!  The most bug-free code is the code you don't write.

def writeNewRows(df):
    filterDFNotInDB(df).to_sql(name='books',
                          con=cnx,
                          if_exists='append',
                         index_label='id')

So, finally, we can work on our new cleaning scheme, and whenever we think we're done...

writeNewRows(cleanedDF)

And boom!  Instant feedback!

Author image
Center of the Universe Website
Super villain in somebody's action hero movie. Experienced a radioactive freak accident at a young age, which rendered him part-snake and strangely adept at Python.
Author image
Center of the Universe

Super villain in somebody's action hero movie. Experienced a radioactive freak accident at a young age, which rendered him part-snake and strangely adept at Python.