Warning: The following is FANTASTICALLY not-secure. Do not put this in a script that's going to be running unsupervised. This is for interactive sessions where you're prototyping the data cleaning methods that you're going to use, and/or just manually entering stuff. Especially if there's any chance there could be something malicious hiding in the data to be uploaded. We're going to be executing formatted strings of SQL unsanitized code. Also, this will lead to LOTS of silent failures, which are arguably The Worst Thing - if guaranteed correctness is a requirement, leave this for the tinkering table. Alternatively, if
Analyze data with the Pandas data analysis library for Python. Start from the basics or see real-life examples of pros using Pandas to solve problems.
Taxation without representation. Colonialism. Not letting people eat cake. Human beings rightfully meet atrocities with action in an effort to change the worked for the better. Cruelty by mankind justifies revolution, and it is this writer’s opinion that API limitations are one such cruelty.
The data we need and crave is stashed in readily available APIs all around us. It’s as though we have the keys to the world, but that power often comes with a few caveats:
- Your “key” only lasts a couple of hours, and if you want another one, you’ll have to use some
Databases. You love them, you need them, but let's face it... you've already mastered working with them. There's only so much fun to be had in the business of opening database connections, pulling rows, and putting them back where they came from. Wouldn't it be great if we could skip the boring stuff and work with data?
Pandas and SQLAlchemy are a mach made in Python heaven. They're individually amongst Python's most frequently used libraries. Together they're greater than the sum of their parts, thanks to Pandas' built-in SQLAlchemy integration.
Create a SQLAlchemy Connection
As you might imagine, the first
Manually opening and closing cursors? Iterating through DB output by hand? Remembering which function is the actual one that matches the Python data structure you're gonna be using?
There has to be a better way!
There totally is.
One of Pandas' most useful abilities is easy I/O. Whether it's a CSV, JSON, an Excel file, or a database - Pandas gets you what you want painlessly. In fact,I'd say that even if you don't have the spare bandwidth at the moment to rewire your brain to learn all the wonderful ways Pandas lets you manipulate data (array-based programming
In one corner we have Pandas: Python's beloved data analysis library. In the other, AWS: the unstoppable cloud provider we're obligated to use for all eternity. We should have known this day would come.
While not the prettiest workflow, uploaded Python package dependencies for usage in AWS Lambda is typically straightforward. We install the packages locally to a virtual env, package them with our app logic, and upload a neat CSV to Lambda. In some cases this doesn't always work: some packages result in a cryptic error message with absolutely no helpful instruction. Pandas is one of those packages.
You've heard the cliché before: it is often cited that roughly %80~ of a data scientist's role is dedicated to cleaning data sets. I Personally haven't looked in to the papers or clinical trials which prove this number (that was a joke), but the idea holds true: in the data profession, we find ourselves doing away with blatantly corrupt or useless data. The simplistic approach is to discard such data entirely, thus here we are.
What constitutes 'filthy' data is project-specific, and at times borderline subjective. Occasionally, the offenders are more obvious: these might include chunks of data which are
Let's say you have two obscenely large sets of data.
These sets of data contain information on a similar topic, such as customers. Dataset #1 might contain a high-level view of all customers of a business, while Datatset #2 contains a lifetime history of orders for a company. Unsurprisingly, the customers in Dataset #1 appear in Dataset x#2, as any business' orders are made by customers.
Welcome to Relational Databases
What we just described is the core foundation for relational databases which have been running at the core of businesses since the 1970s. Starting with
Let’s face it: the last thing the world needs is another “Intro to Pandas” post. Anybody strange enough to read this blog surely had the same reaction to discovering Pandas as I did: a manic euphoria that can only be described as love at first sight. We wanted to tell the world, and that we did. A lot. Yet here I am, about to helplessly sing cliche praises one more time.
I’m a prisoner of circumstance here. As it turns out, the vast (and I mean vast) majority of our fans have a raging Pandas addiction. They come