From CSVs to Tables: Infer Data Types From Raw Spreadsheets

The quest to never explicitly set a table schema ever again.

    Back in August of last year (roughly 8 months ago), I hunched over my desk at 4 am, desperate to fire off a post before boarding a flight the next morning. The article was titled Creating Database Schemas: a Job for Robots, or Perhaps Pandas. My intent at the time was to solve a common annoyance: creating database tables out of raw data without the obnoxious process of explicitly setting each column's datatype. I had a few leads that made me believe I had the answer... boy, was I wrong.

    The task seems reasonable enough on the surface. Surely we can spot columns where the data is always integers, or where values match the expected format of a date, right? Worst case, we fall back to text or varchar and call it a day. Hell, even MongoDB's Compass does a great job of this from nothing more than an uploaded CSV... this has got to be a trivial task handled by third-party libraries by now.
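
    In fact, a naive version of that approach is easy enough to sketch by hand. Below is a rough illustration of the idea (a throwaway helper of my own, not anything from a library): try to cast every value in a column as an integer, then as a date, and fall back to text when both fail:

    from datetime import datetime


    def guess_column_type(values):
        """Naively guess a column's type by attempting casts, loosest last."""
        def is_int(value):
            try:
                int(value)
                return True
            except ValueError:
                return False

        def is_date(value):
            # Formats here are assumptions based on the sample data below.
            for fmt in ('%Y-%m-%d', '%m/%d/%y'):
                try:
                    datetime.strptime(value, fmt)
                    return True
                except ValueError:
                    pass
            return False

        non_empty = [v for v in values if v != '']
        if all(is_int(v) for v in non_empty):
            return 'integer'
        if all(is_date(v) for v in non_empty):
            return 'date'
        return 'text'  # fall back to text and call it a day

    The trouble, of course, is everything this toy version ignores: timestamps with timezones, nulls, mixed date formats, and a hundred other edge cases.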

    For one reason or another, searching for a solution to this problem almost always comes up empty. Software developers probably have little need for dynamically generated tables if their applications run solely on self-defined models. Full-time Data Scientists have access to plenty of expensive tools which seem to claim this functionality, yet it all seems so... inaccessible.

    Is This NOT a Job For Pandas?

    From my experience, no. Pandas does offer hope but doesn't seem to get the job done quite right. Let's start with a dataset so you can see what I mean. Here's a bunch of fake identities I'll be using to mimic the outcome I experienced when working with real data:

    id,initiated,hiredate,email,firstname,lastname,title,department,location,country,type
    100035435,2015-12-11T09:16:20.722-08:00,3/22/67,[email protected],Gretchen,Morrow,Power plant operator,Physical Product,Britling Cafeterias,United Kingdom,Employee
    100056435,2015-12-15T10:11:24.604-08:00,6/22/99,[email protected],Elizabeth,Snow,Oxygen therapist,Physical Product,Grade A Investment,United States of America,Employee
    100037955,2015-12-16T14:31:32.765-08:00,5/31/74,[email protected],Albert,Peterson,Psychologist,Physical Product,Grass Roots Yard Services,United States of America,Employee
    100035435,2016-01-20T11:15:47.249-08:00,9/9/69,[email protected],John,Lynch,Environmental hydrologist,Physical Product,Waccamaw's Homeplace,United States of America,Employee
    100057657,2016-01-21T12:45:38.261-08:00,4/9/83,[email protected],Theresa,Cahoon,Personal chef,Physical Product,Cala Foods,United States of America,Employee
    100056747,2016-02-01T11:25:39.317-08:00,6/26/98,[email protected],Kenneth,Payne,Central office operator,Frontline,Magna Consulting,United States of America,Employee
    100035435,2016-02-01T11:28:11.953-08:00,4/16/82,[email protected],Leif,Speights,Staff development director,Frontline,Rivera Property Maintenance,United States of America,Employee
    100035435,2016-02-01T12:21:01.756-08:00,8/6/80,[email protected],James,Robinson,Scheduling clerk,Frontline,Discount Furniture Showcase,United States of America,Employee
    100074688,2016-02-01T13:29:19.147-08:00,12/14/74,[email protected],Anna,Moberly,Playwright,Physical Product,The Wiz,United States of America,Employee
    100665778,2016-02-04T14:40:05.223-08:00,9/13/66,[email protected],Marjorie,Crawford,"Court, municipal, and license clerk",Physical Product,The Serendipity Dip,United Kingdom,Employee
    100876876,2016-02-24T12:39:25.872-08:00,12/19/67,[email protected],Lyle,Hackett,Airframe mechanic,Physical Product,Infinity Investment Plan,United States of America,Employee
    100658565,2016-02-29T15:52:12.933-08:00,11/17/83,[email protected],Mary,Densmore,Employer relations representative,Frontline,One-Up Realtors,United States of America,Employee
    100766547,2016-03-01T12:32:53.357-08:00,10/1/87,[email protected],Cindy,Diaz,Student affairs administrator,Physical Product,Mr. AG's,United States of America,Employee
    100045677,2016-03-02T12:07:44.264-08:00,8/16/65,[email protected],Andrea,Ligon,Railroad engineer,Central Growth,Robinson Furniture,United States of America,Employee

    There are some juicy datatypes in there: integers, timestamps, dates, strings.... and those are only the first four columns! Let's load this thing into a DataFrame and see what information we can get that way:

    import pandas as pd
    
    csv = 'data/fake.csv'
    
    workers_df = pd.read_csv(csv, header=0, encoding='utf-8')
    meta = workers_df.info(verbose=True)
    print(meta)
    

    Using Pandas' info() should do the trick! It prints each column along with its inferred data type (and returns None, which is why a stray None shows up at the end):

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 22 entries, 0 to 21
    Data columns (total 11 columns):
    id            14 non-null float64
    initiated     14 non-null object
    hiredate      14 non-null object
    email         14 non-null object
    firstname     14 non-null object
    lastname      14 non-null object
    title         14 non-null object
    department    14 non-null object
    location      14 non-null object
    country       14 non-null object
    type          14 non-null object
    dtypes: float64(1), object(10)
    memory usage: 2.0+ KB
    None
    

    ...Or not. What is this garbage? Only one of our 11 columns was pegged as anything more specific than object, and even that one was incorrectly typed as a float rather than an integer. Okay, so maybe Pandas doesn't have a secret one-liner for this. So who does?
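
    To be fair, Pandas will cooperate if we hand it the answers up front, which is exactly the chore we're trying to avoid. Something along these lines works, but only because we're spelling out the types ourselves (the parse_dates and dtype values below are knowledge we brought to the table, not anything Pandas inferred):

    import pandas as pd

    workers_df = pd.read_csv(
        'data/fake.csv',
        header=0,
        encoding='utf-8',
        parse_dates=['initiated', 'hiredate'],  # we have to name the date columns
        dtype={'id': 'Int64'}                   # nullable integer, set by hand
    )
    print(workers_df.dtypes)

    Spelling out every column by hand is precisely the chore we set out to eliminate.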

    What about PySpark?

    It was only a matter of time before we turned to Apache's family of aged data science products. Hadoop, Spark, Kafka... all of them have a particular musty stench about them that says, "I feel like I should be writing in Java right now." Heads up: they do want you to write in Java. Misery loves company.

    Nonetheless, PySpark does support reading data as DataFrames in Python, and it comes with the elusive ability to infer schemas. Installing Hadoop and Spark locally still kind of sucks for solving this one particular problem. Cue Databricks: a company founded by the original creators of Apache Spark, offering free cloud notebooks integrated with (you guessed it) Spark.

    With Databricks, we can upload our CSV and load it into a DataFrame by spinning up a free notebook. The source looks something like this:

    # File location and type
    file_location = "/FileStore/tables/fake.csv"
    file_type = "csv"
    
    # CSV options
    infer_schema = "true"
    first_row_is_header = "true"
    delimiter = ","
    
    df = spark.read.format(file_type) \
      .option("inferSchema", infer_schema) \
      .option("header", first_row_is_header) \
      .option("sep", delimiter) \
      .load(file_location)
    
    display(df)
    

    Let's see how the output looks:

    df:pyspark.sql.dataframe.DataFrame
    id:integer
    initiated:timestamp
    hiredate:string
    email:string
    firstname:string
    lastname:string
    title:string
    department:string
    location:string
    country:string
    type:string
    

    Not bad! We correctly 'upgraded' our id from float to integer, and the timestamp came through correctly as well. With a bit of messing around, we could probably have gotten the date right too, had we stated the format beforehand.
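
    For what it's worth, Spark's CSV reader exposes dateFormat and timestampFormat options, so "stating the format beforehand" would presumably look something like the snippet below. I haven't chased this one down, and whether inferSchema actually promotes the column to a date may depend on your Spark version, so take it as a sketch:

    df = spark.read.format("csv") \
      .option("inferSchema", "true") \
      .option("header", "true") \
      .option("sep", ",") \
      .option("dateFormat", "M/d/yy") \
      .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss.SSSXXX") \
      .load("/FileStore/tables/fake.csv")

    df.printSchema()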

    A look at the Databricks Notebook interface.

    And Yet, This Still Kind of Sucks

    Even though we can solve our problem in a notebook, we still haven't solved the use case: I want a drop-in solution to create tables out of CSVs... whenever I want! I want to accomplish this while writing any app, at the drop of a hat without warning. I don't want to install Hadoop and have Java errors coming back at me through my terminal. Don't EVER let me see Java in my terminal. UGH:

    py4j.protocol.Py4JJavaError: An error occurred while calling o43.count.
    : java.lang.IllegalArgumentException: Unsupported class file major version 55
            at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:166)
            at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:148)
            at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:136)
            at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:237)
            at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:49)
            at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:517)
            at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:500)
            at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
            at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
            at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
            at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
            at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
            at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134)
            at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
            at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:500)
            at org.apache.xbean.asm6.ClassReader.readCode(ClassReader.java:2175)
            at org.apache.xbean.asm6.ClassReader.readMethod(ClassReader.java:1238)
            at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:631)
    

    Python's "tableschema" Library

    Thankfully, there's at least one other person out there who has shared this desire. That brings us to tableschema, a not-quite-perfect-but-perhaps-good-enough library to gunsling data like some kind of wild data cowboy. Let's give it a go:

    from tableschema import infer


    data = 'data/fake.csv'
    schema = infer(data, limit=500, headers=1, confidence=0.85)
    print(schema)
    

    If our dataset is particularly large, we can use the limit parameter to cap the sample at the first X rows. Another nice feature is the confidence parameter: a 0-1 ratio specifying how many casting errors to tolerate during inference. Here's what comes back:

    {
      "fields": [{
        "name": "id",
        "type": "integer",
        "format": "default"
      }, {
        "name": "initiated",
        "type": "string",
        "format": "default"
      }, {
        "name": "hiredate",
        "type": "date",
        "format": "default"
      }, {
        "name": "email",
        "type": "string",
        "format": "default"
      }, {
        "name": "firstname",
        "type": "string",
        "format": "default"
      }, {
        "name": "lastname",
        "type": "string",
        "format": "default"
      }, {
        "name": "title",
        "type": "string",
        "format": "default"
      }, {
        "name": "department",
        "type": "string",
        "format": "default"
      }, {
        "name": "location",
        "type": "string",
        "format": "default"
      }, {
        "name": "country",
        "type": "string",
        "format": "default"
      }, {
        "name": "type",
        "type": "string",
        "format": "default"
      }],
      "missingValues": [""]
    }
    

    Hey, that's good enough for me! Now let's automate the shit out of this.

    Creating a Table in SQLAlchemy With Our New Schema

    I'm about to throw a bunch of code in your face right here. Here's a monster of a class:

    from sqlalchemy import create_engine
    import config
    import pandas as pd
    import psycopg2
    from tableschema import Table, infer, Schema
    from functions.recursivejson import extract_values
    from sqlalchemy.types import Integer, Text, Date
    
    
    class CreateTablesFromCSVs:
        """Infer a table schema from a CSV."""
    
        __uri = config.PG_URI
        __engine = create_engine(__uri, convert_unicode=True, echo=True)
        __data = 'data/fake.csv'
        
        @classmethod
        def get_data(cls):
            """Pull latest data."""
            test_df = pd.read_csv(cls.__data, header=0, encoding='utf-8')
            return test_df
    
        @classmethod
        def get_schema_from_csv(cls, csv):
            """Infers schema from CSV."""
            table = Table(csv)
            table.infer(limit=500, confidence=0.55)
            schema = table.schema.descriptor
            names = cls.get_column_names(schema, 'name')
            datatypes = cls.get_column_datatypes(schema, 'type')
            schema_dict = dict(zip(names, datatypes))
            return schema_dict
    
        @classmethod
        def get_column_names(cls, schema, key):
            """Get names of columns."""
            names = extract_values(schema, key)
            return names
    
        @classmethod
        def get_column_datatypes(cls, schema, key):
            """Convert schema to recognizable by SQLAlchemy."""
            values = extract_values(schema, key)
            for i, value in enumerate(values):
                if value == 'integer':
                    values[i] = Integer
                elif value == 'string':
                    values[i] = Text
                elif value == 'date':
                    values[i] = Date
            return values
            
        @classmethod
        def create_new_table(cls, data, schema):
            """Create new table from CSV and generated schema."""
            data.to_sql('faketable',
                        con=cls.__engine,
                        schema='testschema',
                        if_exists='replace',
                        chunksize=300,
                        dtype=schema)


    data = CreateTablesFromCSVs.get_data()
    schema = CreateTablesFromCSVs.get_schema_from_csv('data/fake.csv')  # same CSV the class reads
    CreateTablesFromCSVs.create_new_table(data, schema)
    

    The first thing worth mentioning is I'm importing a function from my personal secret library to extract values from JSON objects. I've spoken about it before.
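
    If you don't happen to have such a function lying around, the idea behind extract_values() is nothing exotic: recursively walk a nested structure of dicts and lists and collect every value stored under a given key. A stand-in might look like this (my rough approximation, not the exact function from that post):

    def extract_values(obj, key):
        """Recursively collect every value stored under `key` in nested dicts/lists."""
        results = []

        def _walk(node):
            if isinstance(node, dict):
                for k, v in node.items():
                    if k == key:
                        results.append(v)
                    _walk(v)
            elif isinstance(node, list):
                for item in node:
                    _walk(item)

        _walk(obj)
        return results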

    Let's break down this class:

    • get_data() reads our CSV into a Pandas DataFrame.
    • get_schema_from_csv() kicks off building a schema that SQLAlchemy can use to create a table.
    • get_column_names() simply pulls the column names as one half of our schema.
    • get_column_datatypes() takes the datatype names we received from tableschema and swaps them for SQLAlchemy datatypes.
    • create_new_table() uses a beautiful marriage between Pandas and SQLAlchemy to create a table in our database with the correct datatypes mapped (see the sketch of the resulting mapping below).
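
    Concretely, the dtype mapping that ends up being handed to to_sql() looks roughly like this, assuming the inference matches what we saw from tableschema earlier (abbreviated for brevity):

    schema_dict = {
        'id': Integer,
        'initiated': Text,
        'hiredate': Date,
        'email': Text,
        # ...and Text for the remaining string columns
    }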

    Promising Potential, Room to Grow

    While tableschema works most of the time, it isn't perfect. Still, the core of what we set out to accomplish stands: we now have a repeatable formula for creating schemas on the fly, provided we're willing to sanity-check the types it infers.

    Just wait until next time when we introduce Google BigQuery into the mix.
