Trash Pandas: Messy, Convenient DB Operations via Pandas

(And a way to clean it up with SQLAlchemy)

Let's say you were continuing our task from last week: Taking a bunch of inconsistent Excel files and CSVs, and putting them into a database.

Let's say you've been given a new CSV that conflicts with some rows you've already entered, and you're told that these rows are the correct values.

Why Not Use Pandas' Built-in Method?

Pandas' built-in to_sql DataFrame method won't be useful here.  Remember, it writes as a block - if you set the if_exists flag to "replace", that'll make it replace the entire DB table with a new one based on the DF you're uploading.  And if you're doing this piecemeal, you presumably don't want that.

Last week, we just made a new DataFrame out of each row and appended it to the DB table.  That won't work here - we need an Update.  Writing each update by hand would be annoying, though - luckily we can use code to generate more code!  Python has an extremely handy little tool called f-strings that make string templating a snap!  

def updateStr(row):
    return (f"UPDATE books "
            f"""SET author = '{x.author}' """
            f"""WHERE id = {x.id};""")

Let's walk through that.  It takes a row from a Dataframe - note that we're using dot notation here instead of the bracket notation.  The reason we're doing that is because, instead of using iterrows() like last week, we'll be using itertuples because the docstring for iterrows() said I should.  One reason for this is that iterrows() gives a pandas Series, which will store everything as the same datatype, which will be annoying in some cases.  I think it's supposed to be faster too?  itertuples() instead gives us Named Tuples, which is kind of like a dictionary, except we have to use dot notation instead of square brackets.  

Sooo, we take a Named Tuple, and then the f-string goes to work.  It's mostly just a convenient way of formatting strings with variables - any code inside curly parentheses will be evaluated.  They're convenient, flexible, and supposedly pretty well-optimized!  Let's give it a spin.  Let's say we have a DataFrame df2 that only contains the rows to be updated...

cnx = create_engine('mysql+pymysql://root:[email protected]/appointments', echo=False)
for x in df2.itertuples(index=False):
    print(updateStr(x))
UPDATE books SET author = 'Abby' WHERE id = 3;
UPDATE books SET author = 'Brian' WHERE id = 7;
UPDATE books SET author = 'Celestine' WHERE id = 9;

Sweet!  Now let's actually execute it.  We'll be using the execute() function in Pandas' io.sql module.  I get the feeling I'm not supposed to, primarily because it doesn't have an entry in the official Pandas documentation, and I only found it by poking around the module code.  But hey, it works!  (Warning from last time applies super-duper-extra-double this time!)

for x in df2.itertuples(index=False):
    pd.io.sql.execute(updateStr(x), cnx)

And now let's see if it worked...

pd.io.sql.read_sql_table("books", cnx)
   author copies  id
     Abby      2   3
    Brian          7
Celestine      7   9`

Sweet!

Now all that's well and good, but surely we're not the first person to try to make SQL statements by calling Python functions!  How about a slightly less error-prone way of doing this?

SQLAlchemy

I'll level with you - I've never actually used SQLAlchemy for anything but connecting Pandas to databases before via the create_engine() function.  But that's why blogging's great - gives you an excuse to finally check out that thing you knew was gonna be useful!

SQLAlchemy first needs some information about our table, then it'll let us create statements doing things to said table automagically.  We can either define it ourselves (maybe in a future post!) or read an existing table.  I found the default way of doing this a little to "has-a-couple-too-many-steps-and-function-args"-y, so I packaged the version of the command that worked into a little function.  I encourage you all to do the same!

def loadTable(cnx, tableName):
    meta = MetaData(bind=cnx) 
    return Table(tableName, meta, autoload=True, autoload_with=cnx)

#Binding it to the Engine will make sure it uses the right SQL dialect

There we go!  Now, let's load our books table...

books = loadTable(cnx, "books")

And here's the cool part!  Now that we have our table object, it has a bunch of built-in methods for doing SQL things!  We can print an example...

str(books.update())
'UPDATE books SET index=:index, author=:author, copies=:copies, id=:id'

If we call books.update, it'll do exactly that.  It also has a handy string representation, for debugging and sanity checks.

SQLAlchemy wants us to have a Connection in addition to our Engine.  Well, alright then.

conn = cnx.connect()

Fine, happy now?  Good.

SQLAlchemy lets us build SQL statements by chain methods, which is fantastically useful.  Less error-prone, easier to pass collections.  Our basic pattern would be, based on iterating with itertuples...

for x in df2.itertuples(index=False):
    stmt = (books
          .update()
          .where(books.c.id == x.id)
          .values(author=x.author)
         )
    conn.execute(stmt)

Success!

Author image
Center of the Universe Website
Super villain in somebody's action hero movie. Experienced a radioactive freak accident at a young age, which rendered him part-snake and strangely adept at Python.
Author image
Center of the Universe

Super villain in somebody's action hero movie. Experienced a radioactive freak accident at a young age, which rendered him part-snake and strangely adept at Python.