Connecting Pandas to a Database with SQLAlchemy

Easily drop data into Pandas from a SQL database, or upload your DataFrames to a SQL table.

    Databases. You love them, you need them, but let's face it... you've already mastered working with them. There's only so much fun to be had in the business of opening database connections, pulling rows, and putting them back where they came from. Wouldn't it be great if we could skip the boring stuff and work with data?

    Pandas and SQLAlchemy are a match made in Python heaven. They're individually amongst Python's most frequently used libraries. Together they're greater than the sum of their parts, thanks to Pandas' built-in SQLAlchemy integration.

    Create a SQLAlchemy Connection

    As you might imagine, the first two libraries we need to install are Pandas and SQLAlchemy. We need to install a database connector as our third and final library, but the library you need depends on the type of database you'll be connecting to. If you're connecting to MySQL I recommend installing PyMySQL ( pip install pymysql ). If you're connecting to Postgres, go with Psycopg2 ( pip install psycopg2 ). The only time we'll ever interact with this connector directly is when we establish a database connection.

    SQLAlchemy URIs

    A URI (or connection string), is simply a string containing the information needed to connect to something like a database. Here's an example:

    postgresql+psycopg2://myuser:mypassword@myhost:5432/mydatabase

    The first part of our string is postgresql+psycopg2, which is a combination of our target database type and our connector. If you're connecting to MySQL, replace this with mysql+pymysql. The rest of the URI should be self-explanatory.
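
    For comparison, a complete MySQL URI follows the same pattern (the credentials, host, and port below are placeholders you'd swap for your own):

    mysql+pymysql://myuser:mypassword@myhost:3306/mydatabase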

    SQLAlchemy Engines

    An "engine" is an object used to connect to databases using the information in our URI. Once we create an engine, downloading and uploading data is as simple as passing this object to Pandas:

    from os import environ
    from sqlalchemy import create_engine

    db_URI = environ.get('SQLALCHEMY_DATABASE_URI')
    engine = create_engine(db_URI)

    Besides accepting a URI, create_engine() can accept a few optional kwargs as well. If you'd like to see the output of all database transactions taking place, pass echo=True. If your database requires SSL, pass connect_args={'sslmode': 'require'}.
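
    As a quick illustration, here's roughly what that looks like with both options in play (a sketch, assuming a Postgres database that requires SSL):

    # Echo all SQL statements and require SSL on the connection
    engine = create_engine(db_URI,
                           echo=True,
                           connect_args={'sslmode': 'require'})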

    Believe it or not, we're already done dealing with databases! From here forward we're able to pull or upload data into Pandas via easy one-liners.

    Create a SQL Table From a DataFrame

    For our first trick, let's upload data from a CSV to our SQL database. I'm going to pull a CSV containing NYC job data and clean the column names up a bit to be database-friendly:

    import pandas as pd

    # Load the raw CSV and make the column names database-friendly
    jobs_DF = pd.read_csv('data/nyc-jobs.csv')
    new_columns = [column.replace(' ', '_').lower() for column in jobs_DF.columns]
    jobs_DF.columns = new_columns

    We now have a DataFrame ready to be saved as a table! We'll accomplish this with a single method called to_sql which we can call on any DataFrame:

    from sqlalchemy.types import Integer, Text, String, DateTime
    
    ...
    
    jobs_DF.to_sql("nyc_jobs",
                   engine,
                   if_exists='replace',
                   schema='public',
                   index=False,
                   chunksize=500,
                   dtype={"job_id": Integer,
                          "agency": Text,
                          "business_title": Text,
                          "job_category":  Text,
                          "salary_range_from": Integer,
                          "salary_range_to": Integer,
                          "salary_frequency": String(50),
                          "work_location": Text,
                          "division/work_unit": Text,
                          "job_description": Text,
                          "posting_date": DateTime,
                          "posting_updated": DateTime})

    There's quite a bit happening here! to_sql() attempts to create a table with the name nyc_jobs in the database associated with engine. These two positional arguments are technically the only required parameters we need to pass, but it's a very good idea to take advantage of Pandas' ability to be more specific in table creation:

    • if_exists: This argument specifies what to do when a table named nyc_jobs already exists in the database. By default, Pandas will throw an error, which isn't very useful unless we only care about creating this table the first time. Passing replace to this argument will drop the existing table and replace it with the data & data types associated with the current DataFrame. append will keep the existing table the same, but append all rows in the DataFrame to the existing table (see the short sketch after this list).
    • schema: Accepts the name of the Postgres schema to save your table in.
    • index: When True, the resulting table will honor your DataFrame's index to create a column with the appropriate key in your database.
    • chunksize: Passing a number to this parameter will stream data to your database in "chunks" of n rows at a time, as opposed to all at once. Passing a chunksize is useful for particularly large datasets which may be at risk of interruption during upload.
    • dtype: Passing a dictionary to dtype gives us a chance to explicitly set the datatypes of each column in our database, which I highly recommend doing. You'll notice we import various data types from sqlalchemy.types, which we then associate to columns in a key/value pair with each column's name. This ensures that data lands in our database as the appropriate type. If data in your DataFrame doesn't match the constraint of a column's data type, you'll receive a SQL error during the upload.
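
    To illustrate the append behavior mentioned above, here's a minimal sketch; new_jobs_DF is a hypothetical DataFrame with the same columns as our original upload:

    # Add new rows to the existing nyc_jobs table without dropping it
    new_jobs_DF.to_sql("nyc_jobs",
                       engine,
                       if_exists='append',
                       schema='public',
                       index=False)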

    Create DataFrame from SQL Table

    Loading data from a database into a Pandas DataFrame is surprisingly easy. To load an entire table, use the read_sql_table() method:

    sql_DF = pd.read_sql_table("nyc_jobs",
                               con=engine)

    The first two parameters we pass are the same as last time: first is our table name, and then our SQLAlchemy engine. The above snippet is perhaps the quickest and simplest way to translate a SQL table into a Pandas DataFrame, with essentially no configuration needed!

    When working with data, we should try to be explicit about what we expect wherever possible. Let's expand on the above:

    sql_DF = pd.read_sql_table("nyc_jobs",
                               con=engine,
                               schema='public',
                               index_col='job_id',
                               coerce_float=True,
                               columns=['job_id', 'business_title', 'job_category', 'posting_date', 'posting_updated'],
                               parse_dates=['posting_date', 'posting_updated'],
                               chunksize=500)

    Some arguments should look familiar from when we ran to_sql() earlier. schema has the same meaning as it did previously, and chunksize again controls how many rows are handled at a time; note that when chunksize is passed to read_sql_table(), the call returns an iterator of DataFrames (one per chunk) rather than a single DataFrame (see the short sketch after the list below). We have a few new arguments as well:

    • index_col: We can select any column of our SQL table to become an index in our Pandas DataFrame, regardless of whether or not the column is an index in SQL. We can pass the name of a single column as a string, or a list of strings representing the names of multiple columns.
    • coerce_float: When set to True, Pandas will attempt to convert non-string, non-numeric values (such as decimal.Decimal) in numeric columns to floating point numbers. This attribute is set to True by default.
    • columns: Passing a list of column names to this attribute will create a DataFrame from only the columns we provide (similar to a SQL select on x columns).
    • parse_dates: When moving data into Pandas, we explicitly state which columns should be parsed as DateTime columns by passing their names to this argument.
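
    Since we passed chunksize above, here's a minimal sketch of how you'd consume the resulting iterator; the concat() call at the end simply stitches the chunks back together into one DataFrame:

    chunks = pd.read_sql_table("nyc_jobs",
                               con=engine,
                               parse_dates=['posting_date', 'posting_updated'],
                               chunksize=500)

    # Each chunk is a DataFrame of up to 500 rows
    sql_DF = pd.concat(chunks)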

    Create DataFrames From Query Results

    There will probably be times when you're just looking for a subset of data in a table as opposed to the entire table. In this scenario we can use read_sql(), which creates a DataFrame from the results of a SQL query you run on a table:

    sql_DF = pd.read_sql("SELECT * FROM nyc_jobs",
                         con=engine,
                         parse_dates=['posting_date', 'posting_updated'])
    

    This time around our first parameter is a SQL query instead of the name of a table. We can modify this query to select only specific columns, rows which match criteria, or anything else you can do with SQL.
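
    For example, here's a rough sketch of narrowing the result set down to a few columns and a filtered, ordered set of rows (the salary_frequency value of 'Annual' is an assumed value from the NYC jobs dataset):

    # Select a handful of columns and only annual-salary postings
    sql_DF = pd.read_sql(
        "SELECT job_id, business_title, salary_range_to "
        "FROM nyc_jobs "
        "WHERE salary_frequency = 'Annual' "
        "ORDER BY salary_range_to DESC",
        con=engine
    )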

    That's all folks! If you're interested, the source is up on Github here.
