Something we've taken for granted thus far on Hackers and Slackers is a library most data professionals have accepted as standard: SQLAlchemy.

In the past, we've covered database connection management and querying using libraries such as PyMySQL and Psycopg2, both of which do an excellent job of interacting with databases just as we'd expect them to. The nature of opening/closing DB connections and working with cursors hasn't changed much in the past few decades. While boilerplate is annoying, at least it has remained consistent, one might figure. That held true for a while, but the boom of MVC frameworks nearly a decade ago sparked a surge in the popularity of ORMs. With the world singing the praises of object-oriented programming, containing database-oriented functionality within objects must have been a wet dream.

The benefits of SQLAlchemy can mostly be broken down into 3 categories: simplifying database workflows, providing an ORM, and integrating into other major Python libraries. We'll cover these one at a time.

Simple Database Connection Management

SQLAlchemy gives us a few methods for managing our connections with databases. These connection types can be broken down into engines, sessions, and connections.

Engines

An SQLAlchemy "engine" is an object which contains information about our target database. Before we can open any database connections, we need an engine object present to represent the database we're trying to connect to. Here's a basic engine:

from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:', echo=True)

The first argument passed into an engine is a database URI. If you're not entirely familiar with these, a URI is a string containing information about a database we want to target. Here's the structure of a URI:

[DB_TYPE]+[DB_CONNECTOR]://[USERNAME]:[PASSWORD]@[HOST]:[PORT]/[DB_NAME]
  • [DB_TYPE] represents the kind of database we're connecting to (MySQL, Postgres, etc.).
  • [DB_CONNECTOR] is the name of the Python library SQLAlchemy will use to connect to said database (for the examples above, we might use PyMySQL or Psycopg2-binary respectively).
  • The variables which come after should all look familiar: the target database's host and port, a database user, that user's password, and the name of the database itself.
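For example, here's what a MySQL URI using PyMySQL and a Postgres URI using Psycopg2 might look like (with made-up credentials, purely for illustration):

mysql+pymysql://myuser:mypassword@localhost:3306/mydatabase
postgresql+psycopg2://myuser:mypassword@localhost:5432/mydatabase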

We can run ad hoc queries on an engine object directly by using engine.execute(). When we do this, SQLAlchemy opens a connection for us and runs our query. This way of interacting with a database is called explicit connectionless execution. This is great for when we want to run a quick query but don't want to deal with opening connections:

engine = create_engine('sqlite:///:memory:', echo=True)
results = engine.execute([YOUR_QUERY])
for row in results:
    ...
results.close()

Connections

We can open a persistent connection to our database with engine.connect():

engine = create_engine('sqlite:///:memory:', echo=True)
connection = engine.connect()
result = connection.execute([YOUR_QUERY])
for row in result:
    ...
connection.close()

engine.execute() and engine.connect() do almost the same thing: both open a connection to our database which we'll need to close when we're done. engine.execute() saves a line of code, but it also means the same variable is storing your results and managing your connection (hence results.close()).
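If remembering to call .close() sounds like a chore, engine.connect() can also be used as a context manager, which closes the connection for us the moment the block ends. Here's a quick sketch (the query itself is just a stand-in):

engine = create_engine('sqlite:///:memory:', echo=True)

with engine.connect() as connection:
    result = connection.execute("SELECT 1")  # any query will do here
    for row in result:
        ...
# the connection is closed automatically once we leave the "with" block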

Sessions

A session manages a persistent conversation with our database and lets us add, remove, and undo changes with ease. Python applications which involve a lot of database reads and writes will almost always use a session to quickly execute queries via the SQLAlchemy ORM.

Sessions are created by binding them to an engine:

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('sqlite:///:memory:', echo=True)
Session = sessionmaker(bind=engine)
session = Session()

That's all it takes! We can now run powerful queries against the session variable with several shorthand methods:

  • session.add(): We can pass an instance of a data model into .add() to quickly create a new record to be added to our database.
  • session.delete(): Similar to the above, .delete() accepts an instance of a data model. If that record exists in our database, it will be staged for deletion.
  • session.commit(): In both of the above cases, neither change is made until the changes of a session are explicitly committed. At this point, all staged changes are committed at once.
  • session.close(): Finally, .close() is a clean way of closing our connection when we're done.
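To make those concrete, here's a rough sketch of the full lifecycle, using the session from above and a User data model like the one we'll build shortly:

user = User(username='example', email='example@example.com')

session.add(user)       # stage the new record
session.commit()        # write all staged changes to the database

session.delete(user)    # stage that same record for deletion
session.rollback()      # changed our mind: discard anything not yet committed

session.close()         # clean up when we're done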

Putting The 'M' In MVC

Frameworks which enforce the Model-View-Controller pattern have held undisputed consensus for long enough: none of us need to recap why creating apps this way is unequivocally correct. To understand why skipping an ORM is such a big deal, let's recall what ORM stands for:

Object-Relational Mapping, commonly referred to as its abbreviation ORM, is a technique that connects the rich objects of an application to tables in a relational database management system. Using ORM, the properties and relationships of the objects in an application can be easily stored and retrieved from a database without writing SQL statements directly and with less overall database access code.

ORMs allow us to interact with databases simply by modifying objects in code (such as classes) as opposed to generating SQL queries by hand for each database interaction. Bouncing from application code to SQL is a major context switch, and the more interactions we introduce, the more out of control our app becomes.
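To make the contrast concrete, here's a rough side-by-side (assuming the connection from earlier and the User model we'll define in a moment):

# By hand: write the SQL string ourselves and manage the connection
rows = connection.execute("SELECT * FROM user WHERE email = 'admin@example.com'")

# With the ORM: stay in Python objects and let SQLAlchemy generate the SQL
user = User.query.filter_by(email='admin@example.com').first()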

Creating Data Models

Data models are Python classes that represent records in our database. To create one, we write a new class that extends a SQLAlchemy Model base class; in the example below, that base is db.Model, which comes from Flask-SQLAlchemy (more on that later). We'll stick to the basics of data models for now, as these can get reasonably complex.

The bread and butter of any data model class are its Column() objects. Every Column variable inside our data model becomes a column in our User table once our app starts up and creates its tables:

from datetime import datetime

from flask_sqlalchemy import SQLAlchemy
from sqlalchemy import Column
from sqlalchemy.types import Integer, String, DateTime

db = SQLAlchemy()  # our Flask-SQLAlchemy instance (more on Flask-SQLAlchemy below)


class User(db.Model):
    __tablename__ = 'user'

    id = Column(Integer,
                primary_key=True)
    username = Column(String(80),
                      unique=True,
                      nullable=False)
    email = Column(String(120),
                   unique=True,
                   nullable=False)
    joined = Column(DateTime,
                    unique=False,
                    nullable=False,
                    default=datetime.utcnow)  # set automatically on creation

    def __repr__(self):
        return '<User %r>' % self.username
models.py

Each column accepts a "type", which we import from sqlalchemy.types. We can then set constraints on these columns as well, such as unique or nullable.

SQLAlchemy Session Queries

SQLAlchemy offers what it dubs function-based query construction, which is to say we can build nearly any conceivable SQL query purely in Python using the methods it gives us. Let's see what it takes to create a new user from our User class:

newuser = User(username='admin',
               email='admin@example.com')
session.add(newuser)
session.commit()

That's three short statements to create an instance of a new user, stage it for creation, and finally commit it to our database via an open session. Not bad! If we wanted to remove the same record, we could run the following:

session.delete(newuser)
session.commit()

Like magic, we delete the record represented by newuser, our instance of the User data model.

Of course, we can also query all users like so:

users = User.query.all()

Calling query on a data model class builds a query against the table our class is tied to. .all() returns every matching record (as opposed to .first(), which returns only the first record). We can add logic to our query as well:

users = User.query.filter_by(username='todd').all()

Now we're querying records only where the username is "todd". What a dumb name.

Every time we write an ORM query, SQLAlchemy is constructing raw SQL to run against our database for us. By letting our libraries do the dirty work, we keep our heads in the same Python context, modifying objects instead of SQL records, which means we work faster and screw up less.
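If you're curious what SQLAlchemy is writing on our behalf, you can peek at it: casting an ORM query to a string renders the underlying SQL (and the echo=True flag we passed to create_engine logs every statement as it runs). For example:

query = User.query.filter_by(username='todd')
print(str(query))
# prints something along the lines of:
# SELECT user.id, user.username, user.email, user.joined
# FROM user
# WHERE user.username = ?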

There's a lot more we can do with SQLAlchemy ORM queries, like creating table joins. I'll let you jump into that yourself! Check the full reference.
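To give you a small taste before you go, here's a rough sketch of a join, assuming a hypothetical Post data model with a foreign key pointing back to our User table:

from sqlalchemy import Column, ForeignKey
from sqlalchemy.types import Integer, String


class Post(db.Model):
    id = Column(Integer, primary_key=True)
    title = Column(String(120), nullable=False)
    user_id = Column(Integer, ForeignKey('user.id'), nullable=False)


# Every post written by users named "todd"
todds_posts = Post.query.join(User, Post.user_id == User.id) \
                        .filter(User.username == 'todd') \
                        .all()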

Integrations With Other Data Libraries

SQLAlchemy has solidified its place as Python's de facto SQL library. Here are just a few of the noteworthy libraries which have native SQLAlchemy integrations:

Pandas

The pairing of Pandas and SQLAlchemy is a match made in heaven. Pandas' read_sql() method has built-in support for reading data via a SQLAlchemy connectable, while to_sql() enables us to write.

Here's what it takes to turn a database table into a Pandas DataFrame with SQLAlchemy as our connector:

df = pd.read_sql('SELECT * FROM table', engine)

Easy as pie: the first parameter is a SQL query as a string, and engine is our engine variable from earlier.

Now let's write back to our database:

df.to_sql('my_table',
          engine,
          if_exists='append',
          schema='my_schema')

Here we pass the name of our destination SQL table, along with our engine again. We also specify if_exists='append', which says "if our table already exists, add the rows from our DataFrame to it." Lastly, if we happen to be using Postgres, we can specify our schema with schema='my_schema'.

Flask-SQLAlchemy

Flask-SQLAlchemy is my go-to for creating any applications with database interactions (AKA all of them). We've actually covered a whole piece about using Flask-SQLAlchemy in this tutorial.

Apache Airflow

I'm tossing Apache Airflow an honorable mention here for using SQLAlchemy as its SQL library of choice. There isn't really much to be said here, but it drives the point home: everybody is using SQLAlchemy, and you should too.

Takeaway Goodies

It's worth mentioning that SQLAlchemy works with nearly every type of database, and does so by leveraging the go-to Python library for each respective database. For example, an outsider might think we've spent some time shitting on Psycopg2. On the contrary: when SQLAlchemy connects to a Postgres database, it uses the Psycopg2 library under the hood to manage the boilerplate for us. The same goes for every other type of relational database and its standard library.

There are plenty more reasons why SQLAlchemy is beneficial, to the point where it's arguably critical to data analysis workflows. The key point here is that leaving SQLAlchemy out of any data workflow only hurts the person writing the code, and more importantly, all those who come after.