Something we've taken for granted thus far on Hackers and Slackers is a library most data professionals have accepted as an undisputed standard: SQLAlchemy.
In the past, we've covered database connection management and querying using libraries such as PyMySQL and Psycopg2, both of which do an excellent job of interacting with databases just as we'd expect them to. The nature of opening/closing DB connections and working with cursors hasn't changed much in the past few decades (nearly the lifespan of relational databases themselves). While boilerplate is boring, at least it has remained consistent, one might figure. That may have been the case, but the philosophical boom of MVC frameworks nearly a decade ago sparked the emergence of popularity for ORMs. While the world was singing praises of object-oriented programming, containing database-oriented functionality within objects must have been a wet dream.
The only thing shocking about SQLAlchemy's popularity is its flip side: the contingency of those functioning without SQLAlchemy as a part of their regular stack. Whether this stems from unawareness or active reluctance to change, data teams using Python without a proper ORM are surprisingly prevalent. It's easy to forget the reality of the workforce when our interactions with other professionals come mostly from blogs published by those at the top of their field.
I realize the "this is how we've always done it" attitude is a cliché with no shortage of commentary. Tales of adopting new (relatively speaking) practices dominate Silicon Valley blogs every day- it's the manner in which this is manifested, however, that catches me off guard. In this case, resistance to a single Python library can shed light on a frightening mental model that has implications up and down a corporation's stack.
Putting The 'M' In MVC
Frameworks which enforce a Model-View-Controller have held undisputed consensus for long enough: none of us need to recap why creating apps this way is unequivocally correct. To understand why side-stepping an ORM is so significant, let's recall what ORM stands for:
Object-Relational Mapping, commonly referred to as its abbreviation ORM, is a technique that connects the rich objects of an application to tables in a relational database management system. Using ORM, the properties and relationships of the objects in an application can be easily stored and retrieved from a database without writing SQL statements directly and with less overall database access code. - Active Record
ORMs allow us to interact with databases simply by modifying objects in code (such as classes) as opposed to generating SQL queries by hand for each database interaction. Bouncing from application code to SQL is a major context switch, and the more interactions we introduce, the more out of control our app becomes.
To illustrate the alternative to this using models, I'll use an example offered by Flask-SQLAlchemy. Let's say we have a table of users which contains columns for id, username, and email. A model for such a table would look as such:
class User(db.Model): id = db.Column(db.Integer, primary_key=True) username = db.Column(db.String(80), unique=True, nullable=False) email = db.Column(db.String(120), unique=True, nullable=False) def __repr__(self): return '<User %r>' % self.username
The 'model' is an object representing the structure of a single entry in our table. Once our model exists, this is all it takes to create an entry:
newuser = User(username='admin', email='[email protected]')
That's a single readable line of code without writing a single line of SQL. Compare this to the alternative, which would be to use Psycopg2:
query = "INSERT INTO users VALUES username='admin', email='[email protected]';" def query_function(query): """Runs a database query.""" try: conn = psycopg2.connect( user = config.username, password = config.password, host = config.host, port = config.port, database = config.database) with conn.cursor() as cur: cur.execute(query) cur.close() conn.close() except Exception as e: print(e) query_function(query)
query_function() only needs to be set once, but compare the readability of using a model to the following:
query = "INSERT INTO users VALUES username='admin', email='[email protected]';" query_function(query)
Despite achieving the same effect, the latter is much less readable or maintainable by human beings. Building an application around raw string queries can quickly become a nightmare.
Integration With Other Data Libraries
When it comes to golden standards of Python libraries, there is none more quintessential to data analysis than Pandas. The pairing of Pandas and SQLAlchemy is standard to the point where Pandas has built-in integrations to interact with data from SQLAlchemy. Here's what it takes to turn a database table into a Pandas dataframe with SQLAlchemy as our connector:
df = pd.read_sql(session.query(Table).filter(User.id == 2).statement,session.bind)
Once again, a single line of Python code!
Writing Queries Purely in Python
So far by using SQLAlchemy, we haven't needed to write a single line of SQL: how far could we take this? As far as we want, in fact. SQLAlchemy contains what they've dubbed as function-based query construction, which is to say we can construct nearly any conceivable SQL query purely in Python by using the methods offered to us. For example, here's an update query:
stmt = users.update().values(fullname="Fullname: " + users.c.name) conn.execute(stmt)
Check the full reference to see what I mean. Every query you've ever needed to write: it's all there. All of it.
Simple Connection Management
Seeing as how we all now agree that SQLAlchemy is beneficial to our workflow, let's visit square one and see how simple it is to manage connections. The two key words to remember here are engines and sessions.
An engine in SQLAlchemy is merely a bare-bones object representing our database. Making SQLAlchemy aware of our database is as simple as these two lines:
from sqlalchemy import create_engine engine = create_engine('sqlite:///:memory:', echo=True)
The Engine can interact with our database by accepting a simple URI. Once
engine exists, we could in theory use engine exclusively via functions such as
To interact with our database in a Pythonic manner via the ORM, we'll need to create a session from the engine we just declared. Thus our code expands:
from sqlalchemy import create_engine from sqlalchemy.orm import sessionmaker engine = create_engine('sqlite:///:memory:', echo=True) Session = sessionmaker(bind=engine)
That's all it takes! Now just as before, we can use SQLAlchemy's ORM and built-in functions to make simple interacts:
new_user = User(name='todd', fullname='Todd Hacker', password='toddspassword') session.add(new_user)
It's worth mentioning that SQLAlchemy works with nearly every type of database, and does so by leveraging the base Python library for the respective type of database. For example, it probably seems to the outsider that we've spent some time shitting on Psycopg2. On the contrary, when SQLAlchemy connects to a Postgres database, it is using the Psycopg2 library under the hood to manage the boilerplate for us. The same goes for every other type of relational database along with their standard libraries.
There are plenty of more reasons why SQLAlchemy is beneficial to the point where it is arguably critical to data analysis workflows. The critical point to be made here is that leaving SQLAlchemy out of any data workflow only hurts the person writing the code, or more importantly, all those who come after.