Pythonic Database Management with SQLAlchemy
Something we've taken for granted thus far on Hackers and Slackers is a library most data professionals have accepted as standard: SQLAlchemy.
In the past, we've covered database connection management and querying using libraries such as PyMySQL and Psycopg2, both of which do an excellent job of interacting with databases just as we'd expect them to. The nature of opening/closing DB connections and working with cursors hasn't changed much in the past few decades. While boilerplate is boring, at least it has remained consistent, one might figure. That may have been the case, but the philosophical boom of MVC frameworks nearly a decade ago sparked a surge in the popularity of ORMs. While the world was singing the praises of object-oriented programming, containing database-oriented functionality within objects must have been a wet dream.
The benefits of SQLAlchemy can mostly be broken down into 3 categories: simplifying database workflows, providing an ORM, and integrating into other major Python libraries. We'll cover these one at a time.
Simple Database Connection Management
SQLAlchemy gives us a few methods for managing our connections with databases. These connection types can be broken down into engines, sessions, and connections.
An SQLAlchemy "engine" is an object which contains information about our target database. Before we can open any database connections, we need an engine object present to represent the database we're trying to connect to. Here's a basic engine:
```python
from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:', echo=True)
```
The first argument passed to `create_engine()` is a database URI. If you're not entirely familiar with these, a URI is a string containing all the information needed to target a database. Here's the structure of a URI: `[DB_TYPE]+[DB_CONNECTOR]://[USERNAME]:[PASSWORD]@[HOST]:[PORT]/[DB_NAME]`
- [DB_TYPE] represents the kind of database we're connecting to (MySQL, Postgres, etc.).
- [DB_CONNECTOR] is the name of the Python library SQLAlchemy will use to connect to said database (for the examples above, we might use PyMySQL or Psycopg2-binary respectively).
- The variables which come after should all look familiar; these are referring to your target database's URL, a database user, that user's password, etc.
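To make those pieces concrete, here's a quick sketch of how the placeholders combine into a full URI. The credentials, host, and database name below are made up for illustration:

```python
# Hypothetical connection details -- substitute your own.
db_uri = "{db_type}+{db_connector}://{user}:{password}@{host}:{port}/{db_name}".format(
    db_type="postgresql",      # [DB_TYPE]
    db_connector="psycopg2",   # [DB_CONNECTOR]
    user="admin",
    password="hunter2",
    host="localhost",
    port=5432,
    db_name="mydb",
)
print(db_uri)
# postgresql+psycopg2://admin:hunter2@localhost:5432/mydb
```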
We can run ad hoc queries on an engine object directly by using `engine.execute()`. When we do this, SQLAlchemy opens a connection for us and runs our query. This way of interacting with a database is called explicit connectionless execution. This is great for when we want to run a quick query but don't want to deal with opening connections:
```python
engine = create_engine('sqlite:///:memory:', echo=True)

results = engine.execute([YOUR_QUERY])
for row in results:
    ...
results.close()
```
We can open a persistent connection to our database with `engine.connect()`:

```python
engine = create_engine('sqlite:///:memory:', echo=True)

connection = engine.connect()
result = connection.execute([YOUR_QUERY])
for row in result:
    ...
connection.close()
```

`engine.execute()` and `engine.connect()` are almost the same thing: they both open a connection to our database which we'll need to close when we're done. Using `engine.execute()` saves a line of code, but it also means dealing with the fact that you're using the same variable to store results as you are for managing your connection (hence `results.close()` instead of `connection.close()` above).
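A tidy variant of the explicit-connection pattern uses a context manager, which closes the connection for us even if the query raises. This is a small sketch, assuming a throwaway in-memory SQLite database and using `text()` to wrap the raw SQL:

```python
from sqlalchemy import create_engine, text

engine = create_engine('sqlite:///:memory:')

# The `with` block closes the connection automatically on exit.
with engine.connect() as connection:
    result = connection.execute(text("SELECT 1 AS answer"))
    for row in result:
        print(row)
```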
A session is a persistent database connection which lets us add, remove, and undo changes with ease. Python applications which involve a lot of database reads and writes will almost always use a session to quickly execute queries using the SQLAlchemy ORM.
Sessions are created by binding them to an engine:
```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine('sqlite:///:memory:', echo=True)
Session = sessionmaker(bind=engine)
session = Session()
```
That's all it takes! We can now run powerful queries against the `session` variable with several shorthand methods:

- `session.add()`: We can pass an instance of a data model into `.add()` to quickly stage a new record to be added to our database.
- `session.delete()`: Similar to the above, `.delete()` accepts an instance of a data model. If that record exists in our database, it will be staged for deletion.
- `session.commit()`: In both of the above cases, neither change is made until the changes of a session are explicitly committed. At this point, all staged changes are applied at once.
- `session.close()`: `.close()` is a clean way of closing our connection when we're done.
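Putting those methods together, here's a minimal end-to-end sketch of a session's lifecycle. It assumes a throwaway in-memory SQLite database and a bare-bones, hypothetical `User` model:

```python
from sqlalchemy import Column, create_engine
from sqlalchemy.types import Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    username = Column(String(80), unique=True, nullable=False)

engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)  # create the table behind our model

Session = sessionmaker(bind=engine)
session = Session()

user = User(username='todd')
session.add(user)       # stage the new record
session.commit()        # apply all staged changes at once

session.delete(user)    # stage the record for deletion
session.commit()
session.close()         # clean up when we're done
```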
Putting The 'M' In MVC
Frameworks which enforce a Model-View-Controller pattern have held undisputed consensus for long enough: none of us need to recap why creating apps this way is unequivocally correct. To understand why side-stepping an ORM is so significant, let's recall what ORM stands for:
> Object-Relational Mapping, commonly referred to as its abbreviation ORM, is a technique that connects the rich objects of an application to tables in a relational database management system. Using ORM, the properties and relationships of the objects in an application can be easily stored and retrieved from a database without writing SQL statements directly and with less overall database access code.
ORMs allow us to interact with databases simply by modifying objects in code (such as classes) as opposed to generating SQL queries by hand for each database interaction. Bouncing from application code to SQL is a major context switch, and the more interactions we introduce, the more out of control our app becomes.
Creating Data Models
Data models are Python classes which represent the records in our database. To create a data model, we create a new class and extend SQLAlchemy's `Model` base class (in Flask-SQLAlchemy, that's `db.Model`). We'll stick to the basics of data models for now, as these can get fairly complex.
The bread and butter of any data model class are `Column()` objects. Every column variable inside our data model gets created as a column in a `User` table once our app starts up and creates the appropriate tables:

```python
from sqlalchemy import Column
from sqlalchemy.types import Integer, String, DateTime


class User(db.Model):
    id = Column(Integer, primary_key=True)
    username = Column(String(80), unique=True, nullable=False)
    email = Column(String(120), unique=True, nullable=False)
    joined = Column(DateTime, unique=False, nullable=False)

    def __repr__(self):
        return '<User %r>' % self.username
```
Each column accepts a "type", which we import from `sqlalchemy.types`. We can also set constraints on these columns, such as `unique` or `nullable`.
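As a quick illustration of constraints at work, here's a sketch (against an in-memory SQLite database, with a stripped-down, hypothetical `User` model) showing the `unique` constraint rejecting a duplicate username:

```python
from sqlalchemy import Column, create_engine
from sqlalchemy.types import Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker
from sqlalchemy.exc import IntegrityError

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    username = Column(String(80), unique=True, nullable=False)

engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add(User(username='todd'))
session.commit()

# The unique constraint on username rejects a second 'todd'.
session.add(User(username='todd'))
try:
    session.commit()
except IntegrityError:
    session.rollback()
    print("duplicate username rejected")
```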
SQLAlchemy Session Queries
SQLAlchemy contains what they've dubbed function-based query construction, which is to say we can construct nearly any conceivable SQL query purely in Python using the methods offered to us. Let's see what it would take to create a new user from our User class:
```python
newuser = User(username='admin', email='firstname.lastname@example.org')
session.add(newuser)
session.commit()
```
That's 3 simple lines of code to create an instance of a new user, stage it for creation, and finally commit it to our database via an open session. Not bad! If we wanted to remove the same record, we could run the following:

```python
session.delete(newuser)
session.commit()
```

Like magic, we delete records which match exactly the values stored in `newuser`: the instance of our User data model.
Of course, we can also query all users like so:
```python
users = User.query.all()
```
Calling `query` on a data model class will immediately look for records in the table our class is tied to. `.all()` specifies that we'd like all records (as opposed to `.first()`, which would only return the first record). We can add logic to our query as well:

```python
users = User.query.filter_by(username='todd').all()
```
Now we're querying records only where the username is "todd". What a dumb name.
Every time we write an ORM query, SQLAlchemy is actually constructing raw SQL to run against our database for us. By letting our libraries do the dirty work, we're able to keep our heads in the same Python context by modifying objects instead of SQL records, thus we can work faster and screw up less.
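You can actually peek at the SQL a query would produce by rendering the query object as a string. A small sketch, assuming the same kind of bare-bones, hypothetical `User` model and a plain SQLAlchemy session:

```python
from sqlalchemy import Column, create_engine
from sqlalchemy.types import Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'
    id = Column(Integer, primary_key=True)
    username = Column(String(80))

engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

query = session.query(User).filter_by(username='todd')
print(str(query))  # the SELECT statement SQLAlchemy built for us
```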
There's a lot more we can do with SQLAlchemy ORM queries, like creating table joins. I'll let you jump into that yourself! Check the full reference.
Integrations With Other Data Libraries
SQLAlchemy has solidified its place as Python's de facto SQL library. Here are just a few of the noteworthy libraries which have native SQLAlchemy integrations:
The pairing of Pandas and SQLAlchemy is a match made in heaven. Pandas' `read_sql()` method has a built-in integration for reading data through SQLAlchemy, while `to_sql()` enables us to write.
Here's what it takes to turn a database table into a Pandas DataFrame with SQLAlchemy as our connector:
```python
df = pd.read_sql('SELECT * FROM table', engine)
```
Easy as pie: the first parameter is a SQL query as a string, and `engine` is our engine variable from earlier.
Now let's write back to our database:
```python
df.to_sql('my_table', engine, if_exists='append', schema='my_schema')
```
Here we pass the name of our SQL table, as well as our `engine` again. We also specify `if_exists='append'`, which says "if our table already exists, add rows from our DataFrame to it." Lastly, if we happen to be using Postgres, we can specify our schema with `schema='my_schema'`.
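Here's a minimal round trip sketching both directions, assuming an in-memory SQLite engine and a made-up `users` table (SQLite has no schemas, so the `schema` parameter is omitted here):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:')

# Write a DataFrame out to a new table, then read it back.
df = pd.DataFrame({'username': ['todd', 'anne'], 'score': [1, 2]})
df.to_sql('users', engine, index=False, if_exists='replace')

round_trip = pd.read_sql('SELECT * FROM users', engine)
print(round_trip)
```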
Flask-SQLAlchemy is my go-to for creating any applications with database interactions (AKA all of them). We've actually covered a whole piece about using Flask-SQLAlchemy in this tutorial.
I'm tossing Apache Airflow an honorable mention here for using SQLAlchemy as its SQL library of choice. There isn't really much to be said here, but it drives the point home: everybody is using SQLAlchemy, and you should too.
It's worth mentioning that SQLAlchemy works with nearly every type of database, and does so by leveraging the base Python library for the respective type of database. For example, it probably seems to the outsider that we've spent some time shitting on Psycopg2. On the contrary, when SQLAlchemy connects to a Postgres database, it is using the Psycopg2 library under the hood to manage the boilerplate for us. The same goes for every other type of relational database along with their standard libraries.
There are plenty more reasons why SQLAlchemy is beneficial, to the point where it is arguably critical to data analysis workflows. The key point to be made here is that leaving SQLAlchemy out of any data workflow only hurts the person writing the code, or more importantly, all those who come after.