Easily Adding DB Variables To Match New Input

APIs. They're wonderful. For every headache they've given me, I'm glad I live in the age where hitting an API endpoint is a standard way of retrieving data - I recently had to crawl a bunch of records from the Brazilian census in 200, and that was an ordeal (but this isn't about that!).

The thing about APIs is that you generally shouldn't be hitting them all day long - generally you should be doing regular imports to a database (or whatever). And the other thing about APIs is that they're not quite as fussy about names as databases. There's nothing stopping you from having the first row of your API's output include field names like "Account Canceled?", which a typical SQL RDBMS will not care for one bit.

How do we translate them? Well, simple enough - we just have to think of everything that might be in our input string that won't be allowed in our database, and change that! I'm going to use the pipe function from my beloved toolz library, provider of Functional Programming goodies for Python.

from toolz import pipe
import string

def dbReady(toConvert):
    return pipe(toConvert,
                lambda x: x.lower(),
                lambda x: filter(lambda y: 
                                y in string.ascii_lowercase + " ",
                                x),
                lambda x: "".join(x),
                lambda x: x.split(),
                lambda x: "_".join(x))

We made it lowercase.
We filtered everything that wasn't a lowercase letter or a space.
We joined the filter back into a string.
We split the resulting string (in case we wound up with any double spaces, such as from deleting a &.
We joined the list of strings with underscores.

dbReady("Account Canceled?")
'account_canceled'

For comparison's sake, here's what that same function looks like without pipe:

def dbReady(toConvert):
    return '_'.join(
               "".join(
                   filter(lambda x:
                          x in string.ascii_lowercase + " ",
                          (toConvert
                           .lower()))
                   ).split())

But wait! There's more!

What if your API changes? Oh no, that could break your import!

Well, if you don't care about the new columns, you could just filter them out. Let's say we have a list of lists called latestResponse that came from a request to our API, with the first row as the labels.

import pymysql
from sqlalchemy import create_engine
import sqlalchemy
import pandas as pd

latestCols = latestResponse[0]

#Change to what they'll be in the database
dbReadies = [dbReady(x) for x in latestCols] 

#Grab the columns currently in the database
cnx = create_engine('mysql+pymysql://root:cracked1@localhost/appointments', echo=False)
dbCols = pd.io.sql.read_sql_table("appointments_tableau", 
                                  cnx).columns

#Make a dataframe with what the columns would be named in a database
df = pd.DataFrame(latestResponse[1:], columns = dbReady)

#Only select the columns that are currently in the db, and upload
df[dbCols].to_sql(name="my_table",
                  con=cnx, 
                  if_exists='append')

But what if you DO want the new columns from now on? But you're already in the zone, don't feel like manually searching for which columns are new, and opening a new terminal window to add the new variables? What if you are, in a word, lazy? And what if it's sorta important to preserve the order of the fields, and the new ones are in the middle?

Never fear!

First, let's cook up a little function to produce an SQL ALTER TABLE statement (standard disclaimers apply: do NOT do this blindly, or automatically). Oh, and for our purposes let's say these new columns all have the same type ( VARCHAR(255)), because if that's not the case then we have to be slightly less lazy.

def alterStatement(existingCol, newCol):
    return (f"ALTER TABLE appointments_tableau "
            f"ADD COLUMN {newCol} VARCHAR(255) AFTER {existingCol};")

Let's use the wonderful sliding_window function from toolz to feed us a bunch of column names.

Example from the official docs:

list(sliding_window(2, [1, 2, 3, 4]))
[(1, 2), (2, 3), (3, 4)]

Back to the show!

from toolz.itertoolz import sliding_window

# Get the variables that aren't currently in the db
newVars = [x for x in dbReadies if x not in dbCols]

# Get a list where each entry is a tuple that has every new variable, and the existing preceding one
tuples = list(sliding_window(2, dbReadies))
newVarTups = [x for x in tuples if x[1] in newVars]

And, finally, let's set up our statements, and have Pandas execute them! I know I promised I was going to start using SQLAlchemy for this kind of thing instead of unsanitized raw SQL, but I'm back on my bullshit again.

for x in newVarTups:
    pd.io.sql.execute(alterStatement(*x), cnx)

Related Posts