- Being REALLY Lazy With Multiple Aggregations in Pandas
- Splitting Columns With Pandas
- Recasting Low-Cardinality Columns as Categoricals
- Removing Duplicate Columns in Pandas
- Downcast Numerical Data Types with Pandas
- Using Random Forests for Feature Selection with Categorical Features
- Tuning Random Forests Hyperparameters: min_samples_leaf
- Tuning Random Forests Hyperparameters: max_depth
- Tuning Machine Learning Hyperparameters with Binary Search
- Importing Excel Datetimes Into Pandas, Part II
- Importing Excel Datetimes Into Pandas, Part I
- All That Is Solid Melts Into Graphs
- Trash Pandas: Messy, Convenient Database Operations via Pandas
- A Dirty Way of Cleaning Data (ft. Pandas & SQL)
- Getting Conda Environments To Play Nicely With Cron
- Using Pandas and SQLAlchemy to Simplify Databases
Manually opening and closing cursors? Iterating through DB output by hand? Remembering which function is the actual one that matches the Python data structure you're gonna be using?
There has to be a better way!
There totally is.
One of Pandas' most useful abilities is easy I/O. Whether it's a CSV, JSON, an Excel file, or a database - Pandas gets you what you want painlessly. In fact,I'd say that even if you don't have the spare bandwidth at the moment to rewire your brain to learn all the wonderful ways Pandas lets you manipulate data (array-based programming generally isn't the first paradigm people learn), then it's STILL worth using just for the I/O. Grab your data with one of its delicious one-liners, and once it's in a Dataframe there's have a method to convert it to practically any combination of native Python data structures that you want (lists of lists? Lists of dictionaries? Lots of other stuff? You've got it!).
Here we'll be looking at interacting with a database with Pandas, because that's what I was finishing up when I saw Todd texted me that a lot of the people who wind up on the site are searching for Pandas stuff.
So, for our purposes, let's say you're doing Python stuff with a MySQL database. I'm also assuming you're using Linux (including via Windows Subsystem for Linux, which is what I use in order to have Ubuntu on my Windows laptop), because I've never done this with Windows.
Annoyingly, there's more than one potential connector - I generally use PyMySQL, for no particular reason. Installing this involves an extra step - telling pip or Conda to install it won't do the whole thing. Annoying, but them's the breaks.
sudo apt-get install python3-pymysql -y
Should get you there. I remember fussing with this every time I have to do it on a new machine.
You also have to install a mysql-connector, for reasons I won't pretend to understand.
conda install mysql-connector-python -y
I've never actually done this with pip, but I assume it'd be
pip install mysql-connector
If that doesn't work, try
pip install --allow-external mysql-connector-python mysql-connector-python
WE'RE ALMOST DONE INSTALLING THINGS!
Last, but not least, is a lovely little thing called SQLAlchemy. Remember when I said I didn't actually know anything about PyMySQL and its relative advantages and disadvantages to other MySQL connectors for Python? Part of the reason for that is that SQLAlchemy, as the name implies, allows you to MAGICALLY wrap Python SQL connectors at a higher level of abstraction and not have to deal with the particulars. So, grab a little chunk of Philosopher's Stone, and - oh, it's actually a default Anaconda package. BUT, if you're in a fresh environment (or not using Conda)...
(Always remember the little
-y at the end. That ensures that you can walk away while it installs and it'll be done when you get back, instead of finishing the long process of whatever's happening when you first tell it to install and then waiting to start on the next long process that takes place after you hit y.)
pip install sqlalchemy (that IS one of the nice things about pip! Doesn't need you to confirm).
So, to review, before we start Pythoning...
$ sudo apt-get install python3-pymysql -y $ conda install mysql-connector-python -y $ conda install -c anaconda sqlalchemy -y
$ sudo apt-get install python3-pymysql -y $ pip install mysql-connector $ pip install sqlalchemy
Cool! Ready to learn about SQLAlchemy? Well, you won't - this is about Pandas! We're going YET ANOTHER LEVEL OF ABSTRACTION HIGHER. SQLAlchemy is just what Pandas uses to connect to databases.
Aaaand one (possible) last step. If you're doing this with a locally-installed db, you might have to
sudo service mysql start.
(Like, every time you do dev work on it. At least on my machine. Your mileage may vary)
On to the Actual Python
So, what does all this do for us? Well, first off, it wasn't THAT much - and you only have to do it once (per machine (really per environment per machine (you are taking advantage of Conda envs or virtualenvs, right?))). But what it really lets us skip all the implementation details and focus on the actual logic of what we're doing. That's really what both SQL and Pandas are all about- they wed short, declarative bits of code to highly optimized libraries underneath the hood. Relational Database Management Systems sit on top of code that's been optimized for decades, turning your short queries into highly efficient compiled code. Pandas takes advantage of scientific computing libraries for doing matrix operations that have been around since before there was an internet. Drill down deep enough in Pandas and you'll actually find FORTRAN code that was originally written to simulate nuclear explosions. Our whole goal is to just focus on our data manipulation via SQL and Python's scientific computing family - and we don't want to add a bunch of additional cognitive load from worrying about cursors.
Okay, I Lied - NOW on to the Actual Python
import pandas as pd import pymysql from sqlalchemy import create_engine
From all the database connector stuff we installed, we actually just need ONE function - and then it's pure blissful operations on data from here on out. create_engine() will create a connection for us, and then we'll use that as an argument to Pandas functions that talk to the DB. You might have to reopen it sometimes. Whatever.
First, let's make a string that describes our database. Note that normally you shouldn't actually store this directly in your code (IT will totally be mad at you), but for our purposes let's include it.
The format is:
So, since we're using mysql, and the PyMySQL package, we'd start with:
The rest is fairly straightforward. So, if we've got the user "analyst" with the password "badsecuritykills" with a local MySQL running on port 3306 (pretty sure you don't have to specify a port if it's localhost, but bear with me), and the database itself is named terrible_purchases, it'd be:
And now let's actually make our connection!
cnx = create_engine('mysql+pymysql://analyst:badsecuritykills@localhost:3306/terrible_purchases)
Done! Now Pandas can read and write from our database.
If you don't wanna have to see a whole bunch of extraneous stuff, I'd also add the flag echo=False (create_engine has a ton of useful little flags, letting you do stuff like automatically strip unicode that your database might not be able to read).
Behold! We can write an SQL query and then get a Pandas Dataframe with those exact parameters. We can rapidly prototype without worrying that the reason our query isn't working is that we forgot some fussy intermediary stage. We can get stuff wrong and experiment!
Reading a table is easy as:
df = pd.io.sql.sql_query("SELECT * FROM cursed_items", cnx)
Done in one line!
Writing is just as easy (with a teeeensy little asterisk, that I'll probably talk about in the future). Let's say we dusted off an old CSV with more cursed items that aren't in our database. After we've loaded it (and maybe done some cleaning)
newCursedItemsDF.to_sql(name='cursed_items', con=cnx, if_exists='append')
if_exists flag adds a ton of flexibility. You can create a whole new DB table from a Dataframe, and that's actually the default behavior. For our case, we appended - but there's other times when we might want to replace, for instance.
So, this is a super useful workflow for doing interactive analytics work with stuff that's on a database. It's also quite useful for quick-and-dirty scripts, or for prototyping the logic of something that'll be expanded upon later. And, ah, bigger stuff too - especially if there's gonna be really complex transformations that. As long as the data can fit in memory (RAM), Pandas is pretty awesome.
That being said, there are times when you'd actually want to deal with the levels we bypassed here. Everything has tradeoffs. One example would be that all those compiled Linear Algebra libraries that Pandas sit on top of have a massive memory footprint, and AWS Lambda charges you according to memory usage. I'm pretty sure there are also use cases where this method of handling the lower-level nitty-gritty of database interactions (ie, by mostly not handling it) will cause problems. But at that point, we're dealing with the difference between Data Science & Data Engineering.