Back in August of last year (roughly 8 months ago), I hunched over my desk at 4 am desperate to fire off a post before boarding a flight the next morning. The article was titled Creating Database Schemas: a Job for Robots, or Perhaps Pandas. It was my intent at the time to solve a common annoyance: creating database tables out of raw data, without the obnoxious process of explicitly setting each column's datatype. I had a few leads that led me to believe I had the answer... boy was I wrong.

The task seems somewhat reasonable from the surface. Surely we can spot columns where the data is always in integers, or match the expected format of a date, right? If anything, we'll fall back to text or varchar and call it a day. Hell, even MongoDB's Compass does a great job of this by merely uploading a CSV... this has got to be some trivial task handled by third-party libraries by now.

For one reason or another, searching for a solution to this problem almost always comes up empty. Software developers probably have little need for dynamically generated tables if their applications run solely on self-defined models. Full-time Data Scientists have access to plenty of expensive tools which seem to claim this functionality, yet it all seems so... inaccessible.

Is This NOT a Job For Pandas?

From my experience, no. Pandas does offer hope but doesn't seem to get the job done quite right. Let's start with a dataset so you can see what I mean. Here's a bunch of fake identities I'll be using to mimic the outcome I experienced when working with real data:

id initiated hiredate email firstname lastname title department location country type
100035435 2015-12-11T09:16:20.722-08:00 3/22/67 GretchenRMorrow@jourrapide.com Gretchen Morrow Power plant operator Physical Product Britling Cafeterias United Kingdom Employee
100056435 2015-12-15T10:11:24.604-08:00 6/22/99 ElizabethLSnow@armyspy.com Elizabeth Snow Oxygen therapist Physical Product Grade A Investment United States of America Employee
100037955 2015-12-16T14:31:32.765-08:00 5/31/74 AlbertMPeterson@einrot.com Albert Peterson Psychologist Physical Product Grass Roots Yard Services United States of America Employee
100035435 2016-01-20T11:15:47.249-08:00 9/9/69 JohnMLynch@dayrep.com John Lynch Environmental hydrologist Physical Product Waccamaw's Homeplace United States of America Employee
100057657 2016-01-21T12:45:38.261-08:00 4/9/83 TheresaJCahoon@teleworm.us Theresa Cahoon Personal chef Physical Product Cala Foods United States of America Employee
100056747 2016-02-01T11:25:39.317-08:00 6/26/98 KennethHPayne@dayrep.com Kenneth Payne Central office operator Frontline Magna Consulting United States of America Employee
100035435 2016-02-01T11:28:11.953-08:00 4/16/82 LeifTSpeights@fleckens.hu Leif Speights Staff development director Frontline Rivera Property Maintenance United States of America Employee
100035435 2016-02-01T12:21:01.756-08:00 8/6/80 JamesSRobinson@teleworm.us James Robinson Scheduling clerk Frontline Discount Furniture Showcase United States of America Employee
100074688 2016-02-01T13:29:19.147-08:00 12/14/74 AnnaDMoberly@jourrapide.com Anna Moberly Playwright Physical Product The Wiz United States of America Employee
100665778 2016-02-04T14:40:05.223-08:00 9/13/66 MarjorieBCrawford@armyspy.com Marjorie Crawford Court, municipal, and license clerk Physical Product The Serendipity Dip United Kingdom Employee
100876876 2016-02-24T12:39:25.872-08:00 12/19/67 LyleCHackett@fleckens.hu Lyle Hackett Airframe mechanic Physical Product Infinity Investment Plan United States of America Employee
100658565 2016-02-29T15:52:12.933-08:00 11/17/83 MaryJDensmore@jourrapide.com Mary Densmore Employer relations representative Frontline One-Up Realtors United States of America Employee
100766547 2016-03-01T12:32:53.357-08:00 10/1/87 CindyRDiaz@armyspy.com Cindy Diaz Student affairs administrator Physical Product Mr. AG's United States of America Employee
100045677 2016-03-02T12:07:44.264-08:00 8/16/65 AndreaTLigon@einrot.com Andrea Ligon Railroad engineer Central Growth Robinson Furniture United States of America Employee

There are some juicy datatypes in there: integers, timestamps, dates, strings.... and those are only the first four columns! Let's load this thing into a DataFrame and see what information we can get that way:

import pandas as pd

csv = 'data/fake.csv'

workers_df = pd.read_csv(csv, header=0, encoding='utf-8')
meta = workers_df.info(verbose=True)
print(meta)

Using Pandas' info() should do the trick! This returns a list of columns and their data types:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 11 columns):
id            14 non-null float64
initiated     14 non-null object
hiredate      14 non-null object
email         14 non-null object
firstname     14 non-null object
lastname      14 non-null object
title         14 non-null object
department    14 non-null object
location      14 non-null object
country       14 non-null object
type          14 non-null object
dtypes: float64(1), object(10)
memory usage: 2.0+ KB
None

...Or not. What is this garbage? Only one of our 11 columns identified a data type, and it was incorrectly listed as a float! Okay, so maybe Pandas doesn't have a secret one-liner for this. So who does?

What about PySpark?

It's always been a matter of time before we'd turn to Apache's family of aged data science products. Hadoop, Spark, Kafka... all of them have a particular musty stench about them that tastes like "I feel like I should be writing in Java right now." Heads up: they do want you to write in Java. Misery loves company.

Nonetheless, PySpark does support reading data as DataFrames in Python, and also comes with the elusive ability to infer schemas. Installing Hadoop and Spark locally still kind of sucks for solving this one particular problem. Cue Databricks: a company that spun off from the Apache team way back in the day, and offers free cloud notebooks integrated with- you guessed it: Spark.

With Databricks, we can upload our CSV and load it into a DataFrame by spinning up a free notebook. The source looks something like this:

# File location and type
file_location = "/FileStore/tables/fake.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

Let's see out the output looks:

df:pyspark.sql.dataframe.DataFrame
id:integer
initiated:timestamp
hiredate:string
email:string
firstname:string
lastname:string
title:string
department:string
location:string
country:string
type:string

Not bad! We correctly 'upgraded' our ID from float to integer, and we managed to get the timestamp correct also. With a bit of messing around, we could probably have even gotten the date correct too, given that we stated the format beforehand.

A look at the Databricks Notebook interface.

And Yet, This Still Kind of Sucks

Even though we can solve our problem in a notebook, we still haven't solved the use case: I want a drop-in solution to create tables out of CSVs... whenever I want! I want to accomplish this while writing any app, at the drop of a hat without warning. I don't want to install Hadoop and have Java errors coming back at me through my terminal. Don't EVER let me see Java in my terminal. UGH:

py4j.protocol.Py4JJavaError: An error occurred while calling o43.count.
: java.lang.IllegalArgumentException: Unsupported class file major version 55
        at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:166)
        at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:148)
        at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:136)
        at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:237)
        at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:49)
        at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:517)
        at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:500)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
        at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
        at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)

        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
        at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
        at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:500)
        at org.apache.xbean.asm6.ClassReader.readCode(ClassReader.java:2175)
        at org.apache.xbean.asm6.ClassReader.readMethod(ClassReader.java:1238)
        at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:631)

Python's "tableschema" Library

Thankfully, there's at least one other person out there who has shared this desire. That brings us to tableschema, a not-quite-perfect-but-perhaps-good-enough library to gunsling data like some kind of wild data cowboy. Let's give it a go:

import csv
from tableschema import Table


data = 'data/fake.csv'
schema = infer(data, limit=500, headers=1, confidence=0.85)
print(schema)

If our dataset is particularly large, we can use the limit attribute to limit the sample size to the first X number of rows. Another nice feature is the confidence attribute: a 0-1 ratio for allowing casting errors during the inference. Here's what comes back:

{
  "fields": [{
    "name": "id",
    "type": "integer",
    "format": "default"
  }, {
    "name": "initiated",
    "type": "string",
    "format": "default"
  }, {
    "name": "hiredate",
    "type": "date",
    "format": "default"
  }, {
    "name": "email",
    "type": "string",
    "format": "default"
  }, {
    "name": "firstname",
    "type": "string",
    "format": "default"
  }, {
    "name": "lastname",
    "type": "string",
    "format": "default"
  }, {
    "name": "title",
    "type": "string",
    "format": "default"
  }, {
    "name": "department",
    "type": "string",
    "format": "default"
  }, {
    "name": "location",
    "type": "string",
    "format": "default"
  }, {
    "name": "country",
    "type": "string",
    "format": "default"
  }, {
    "name": "type",
    "type": "string",
    "format": "default"
  }],
  "missingValues": [""]
}

Hey, that's good enough for me! Now let's automate the shit out this.

Creating a Table in SQLAlchemy With Our New Schema

I'm about to throw a bunch in your face right here. Here's a monster of a class:

from sqlalchemy import create_engine
import config
import pandas as pd
import psycopg2
from tableschema import Table, infer, Schema
from functions.recursivejson import extract_values
from sqlalchemy.types import Integer, Text, Date


class CreateTablesFromCSVs:
    """Infer a table schema from a CSV."""

    __uri = config.PG_URI
    __engine = create_engine(__uri, convert_unicode=True, echo=True)
    __data = 'data/fake.csv'
    
    @classmethod
    def get_data(cls):
        """Pull latest data."""
        test_df = pd.read_csv(cls.__data, header=0, encoding='utf-8')
        return test_df

    @classmethod
    def get_schema_from_csv(cls, csv):
        """Infers schema from CSV."""
        table = Table(csv)
        table.infer(limit=500, confidence=0.55)
        schema = table.schema.descriptor
        names = cls.get_column_names(schema, 'name')
        datatypes = cls.get_column_datatypes(schema, 'type')
        schema_dict = dict(zip(names, datatypes))
        return schema_dict

    @classmethod
    def get_column_names(cls, schema, key):
        """Get names of columns."""
        names = extract_values(schema, key)
        return names

    @classmethod
    def get_column_datatypes(cls, schema, key):
        """Convert schema to recognizable by SQLAlchemy."""
        values = extract_values(schema, key)
        for i, value in enumerate(values):
            if value == 'integer':
                values[i] = Integer
            elif value == 'string':
                values[i] = Text
            elif value == 'date':
                values[i] = Date
        return values
        
    @classmethod
    def create_new_table(cls, data, schema):
          """Create new table from CSV and generated schema."""
          workday_table.to_sql('faketable',
                               con=cls.__engine,
                               schema='testschema',
                               if_exists='replace',
                               chunksize=300,
                               dtype=schema)
                                 
data = CreateTablesFromCSVs.get_schema_from_csv()
schema = CreateTablesFromCSVs.get_schema_from_csv(data)
CreateTablesFromCSVs.create_new_table(data, schema)

The first thing worth mentioning is I'm importing a function from my personal secret library to extract values from JSON objects. I've spoken about it before.

Let's break down this class:

  • get_data() reads our CSV into a Pandas DataFrame.
  • get_schema_from_csv() kicks off building a Schema that SQLAlchemy can use to build a table.
  • get_column_names() simply pulls column names as half our schema.
  • get_column_datatypes() manually replaces the datatype names we received from tableschema and replaces them with SQLAlchemy datatypes.
  • create_new_table Uses a beautiful marriage between Pandas and SQLAlchemy to create a table in our database with the correct datatypes mapped.

Promising Potential, Room to Grow

While tableschema works some of the time, it isn't perfect. The base of what we accomplish still stands: we now have a reliable formula for how we would create schemas on the fly if we trust our schemas to be accurate.

Just wait until next time when we introduce Google BigQuery into the mix.