Pandas

Analyze data with the Pandas data analysis library for Python. Start from the basics or see real-life examples of pros using Pandas to solve problems.
Code Snippet Corner
03 Jun 2019

Recasting Low-Cardinality Columns as Categoricals

Downcast strings in Pandas to their proper data-types using HDF5.

The other day, I was grabbing a way-too-big DB query for local exploration.  It was literally over 2GB as a CSV - which is a pain for a number of reasons!  Not the least of which is that, while you're doing Exploratory Data Analysis, things take way too long - it doesn't take long for Cognitive Drift to break your rhythm!

Numerical columns can be taken down to size with the downcasting functions from a previous post.  But what about Object/String columns?  One of the best ways to reduce the size of a file like this is to recast

Continue Reading
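For a taste of the technique the teaser describes, here's a minimal sketch (the file and column names are hypothetical stand-ins, not the post's actual data):

import pandas as pd

df = pd.read_csv("big_query.csv")  # hypothetical stand-in for that 2GB file

# A string column with few unique values ("low cardinality") stores every
# repeated value in full under the object dtype. Recasting it as a
# categorical stores each unique value once, plus cheap integer codes:
print(df["status"].memory_usage(deep=True))    # before: object dtype
df["status"] = df["status"].astype("category")
print(df["status"].memory_usage(deep=True))    # after: a fraction of the size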
Code Snippet Corner
28 May 2019

Removing Duplicate Columns in Pandas

Dealing with duplicate column names in your Pandas DataFrame.

Sometimes you wind up with duplicate column names in your Pandas DataFrame. This isn't necessarily a huge deal if we're just messing with a smallish file in Jupyter.  But, if we wanna do something like load it into a database, that'll be a problem.  It can also interfere with our other cleaning functions - I ran into this the other day when reducing the size of a giant data file by downcasting it (as per this previous post).  The cleaning functions required a 1D input (so, a Series or List) - but calling the name of a duplicate column gave

Continue Reading
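One common fix, sketched with a toy frame (this may not be the exact approach the full post lands on):

import pandas as pd

# A toy DataFrame with a duplicated column name:
df = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "a"])

# df["a"] now returns a two-column DataFrame rather than a Series,
# which is exactly what breaks functions expecting 1D input.
# Keeping only the first occurrence of each name solves it:
df = df.loc[:, ~df.columns.duplicated()]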
Pandas
28 May 2019

Using Hierarchical Indexes With Pandas

Use Pandas' MultiIndex to make your data work harder for you.

I've been wandering into a lot of awkward conversations lately, most of them being about how I spend my free time. Apparently "rifling through Python library documentation in hopes of finding dope features" isn't considered a relatable hobby by most people. At least you're here, so I must be doing something right occasionally.

Today we'll be venturing off into the world of Pandas indexes. Not just any old indexes... hierarchical indexes. Hierarchical indexes take the idea of having identifiers for rows, and extend this concept by allowing us to set multiple identifiers with a twist: these indexes hold parent/child

Continue Reading
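In miniature, the idea looks like this (toy data; the column names are hypothetical):

import pandas as pd

df = pd.DataFrame({
    "state": ["NY", "NY", "PA"],
    "city": ["New York", "Buffalo", "Philadelphia"],
    "population": [8_400_000, 256_000, 1_580_000],
})

# Promote two columns into a hierarchical (MultiIndex) row index:
df = df.set_index(["state", "city"])

df.loc["NY"]               # select by the parent level alone...
df.loc[("NY", "Buffalo")]  # ...or by a parent/child pair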
Pandas
20 May 2019

Reshaping Pandas DataFrames

A guide to DataFrame manipulation using groupby, melt, pivot tables, pivot, transpose, and stack.

Summer is just around the corner and everybody seems to be asking the same question: “does my data look... out of shape?” Whether you’re a scientist or an engineer, data-image dysmorphia can lead to serious negative thoughts which leave you second-guessing your data.

Much has already been said about modifying DataFrames on a “micro” level, such as column-wise operations. But what about modifying entire DataFrames at once? When considering NumPy’s role in general mathematics, it should come as no surprise that Pandas DataFrames have a lot of similarities to the matrices we learned about in high school pre-calc; namely, they

Continue Reading
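To make a couple of the listed verbs concrete, here's a tiny melt/pivot round-trip (toy data, not the guide's examples):

import pandas as pd

wide = pd.DataFrame({
    "product": ["widget", "gadget"],
    "jan": [10, 20],
    "feb": [15, 25],
})

# melt: wide -> long (one row per product/month observation)
long = wide.melt(id_vars="product", var_name="month", value_name="sales")

# pivot: long -> wide again
long.pivot(index="product", columns="month", values="sales")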
Data Science
29 Apr 2019

Plotting Data With Seaborn and Pandas

Create beautiful data visualizations out-of-the-box with Python’s Seaborn.

There are plenty of good libraries for charting data in Python, perhaps too many. Plotly is great, but a limit of 25 free charts is hardly a starting point. Sure, there's Matplotlib, but surely we can find something a little less... well, lame. Where are all the simple-yet-powerful chart libraries at?

As you’ve probably guessed, this is where Seaborn comes in. Seaborn isn’t a third-party service, so you can get started without creating user accounts or worrying about API limits, etc. Seaborn is also built on top of Matplotlib, making it the logical next step up for anybody wanting

Continue Reading
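Getting started really is this quick - a minimal sketch using one of Seaborn's bundled sample datasets:

import seaborn as sns

# Seaborn ships with sample data, so this runs out of the box:
tips = sns.load_dataset("tips")

# One call, one reasonably pretty chart (requires seaborn >= 0.9):
sns.scatterplot(x="total_bill", y="tip", hue="day", data=tips)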
Data Engineering
28 Mar 2019

Building an ETL Pipeline: From JIRA to SQL

An example data pipeline which extracts data from the JIRA Cloud API and loads it to a SQL database.

Something we haven't done just yet on this site is walking through the humble process of creating data pipelines: the art of taking a bunch of data, changing said data, and putting it somewhere else. It's kind of a weird thing to be into, hence why the MoMA has been rejecting my submissions of Github repositories. Don't worry; I'll keep at it.

Something you don't see every day is people sharing their pipelines, which is understandable. Presumably, the other people who do this kind of stuff do it for work; nobody is happily building stupid pipelines in their free time

Continue Reading
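In the broadest strokes, the extract and load halves of such a pipeline might look like the sketch below. The site URL, credentials, connection string, and table name are all hypothetical stand-ins, and this is not the post's actual code:

import pandas as pd
import requests
from sqlalchemy import create_engine

# Extract: pull issues from the JIRA Cloud REST API
resp = requests.get(
    "https://your-domain.atlassian.net/rest/api/2/search",
    auth=("user@example.com", "your_api_token"),
)
issues = resp.json()["issues"]

# Transform: flatten the nested JSON into a table (pandas >= 1.0)
df = pd.json_normalize(issues)

# Load: write the result to a SQL database
engine = create_engine("postgresql://user:password@localhost/dbname")
df.to_sql("jira_issues", engine, if_exists="replace", index=False)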
Pandas
28 Jan 2019

Downcast Numerical Data Types with Pandas

Using an example where we downcast numerical columns.

Recently, I had to find a way to reduce the memory footprint of a Pandas DataFrame in order to actually do operations on it.  Here's a trick that came in handy!

By default, when you read a DataFrame from a file, Pandas casts numerical columns to the widest types: int64 for whole numbers, and float64 for anything with decimals or missing values.  This is in keeping with the philosophy behind Pandas and NumPy - by using strict types (instead of normal Python "duck typing"), you can do things a lot faster.  The float64 is the most flexible numerical type - it can handle fractions, as well as turning missing values into

Continue Reading
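The trick itself fits in a few lines - a minimal sketch with toy data:

import pandas as pd

df = pd.DataFrame({"small_ints": [1, 2, 3], "small_floats": [0.5, 1.5, 2.5]})

# pd.to_numeric can downcast each column to the smallest type that
# still fits its values:
df["small_ints"] = pd.to_numeric(df["small_ints"], downcast="integer")    # int64 -> int8
df["small_floats"] = pd.to_numeric(df["small_floats"], downcast="float")  # float64 -> float32

print(df.dtypes)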
Pandas
04 Nov 2018

Liberating Data from PDFs with Tabula and Pandas

Making 'open' data more open.

Check out the accompanying GitHub repo for this article here.

Technically, the School District of Philadelphia's budget data for the 2019 fiscal year is "open". It is, after all, made available through the district's Open Data portal and is freely available to download.

But just because data is freely available doesn't mean it's easy to work with. That's what I found out when I downloaded the zipped folder, opened it up, and found a heap of PDFs. Joy.

As a member of Code for Philly, I thought of my compatriots who might want to use school district data in their projects.

Continue Reading
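The liberation itself starts with tabula-py, roughly like this (the file name is a hypothetical stand-in):

import tabula  # the tabula-py package, a Python wrapper around Tabula

# Extract every table Tabula can find in the PDF. Depending on your
# tabula-py version, this returns a list of DataFrames (or, in older
# versions, a single DataFrame unless multiple_tables=True):
tables = tabula.read_pdf("budget.pdf", pages="all", multiple_tables=True)
df = tables[0]  # from here, it's regular Pandas cleanup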
Pandas
20 Aug 2018

Importing Excel Datetimes Into Pandas, Part II

Import dates & times from Excel .xlsx files into Pandas!

What if, like during my data import task a few months back, the dates & times are in separate columns?  This gives us a few new issues.  Let's import that Excel file!

import pandas as pd
import xlrd
import datetime

# Read the sheet into a DataFrame...
df = pd.read_excel("hasDatesAndTimes.xlsx", sheet_name="Sheet1")

# ...and grab the workbook's datemode (which epoch its date serial
# numbers count from), which we'll need for the conversion.
book = xlrd.open_workbook("hasDatesAndTimes.xlsx")
datemode = book.datemode

And let's see that time variable!

df["Time"]
Index  Time
0      0.909907
1      0.909919
2      0.909931
3      0.909942
4      0.909954
df["Time"].map(lambda x: xlrd.xldate_
Continue Reading
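For the curious, the conversion that snippet is heading toward can be finished along these lines (a sketch of one approach, not necessarily the post's exact code):

# Map each Excel time fraction to a datetime.time, using the workbook's
# datemode from above:
df["Time"] = df["Time"].map(lambda x: xlrd.xldate_as_datetime(x, datemode).time())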
Pandas
13 Aug 2018

Importing Excel Datetimes Into Pandas, Part I

Pandas & Excel, Part 1.

Different file formats are different!  For all kinds of reasons!

A few months back, I had to import some Excel files into a database. In this process I learned so much about the delightfully unique way Excel stores dates & times!  

The basic datetime will be a decimal number, like 43324.909907407404.  The number before the decimal is the date, the number afterwards is the time.  So far, so good - this is pretty common for computers.  The date is often the number of days past a certain epoch; the time, in Excel's case, is the fraction of the day that has elapsed.

So, let's load

Continue Reading
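To make that concrete with the number above, xlrd does the arithmetic, given the workbook's datemode (0 means the standard 1900-based system):

import xlrd

# Integer part: days since Excel's epoch. Fractional part: how much
# of that day has elapsed (0.909907... of a day = 21:50:16).
xlrd.xldate_as_datetime(43324.909907407404, 0)
# -> datetime.datetime(2018, 8, 12, 21, 50, 16)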
Pandas
06 Aug 2018

Lazy Pandas and Dask

Increase the performance of Pandas with Dask.

Ah, laziness.  You love it, I love it, everyone agrees it's just better.

Flesh-and-blood pandas are famously lazy.  Pandas the package, however, uses Eager Evaluation.  What's Eager Evaluation, you ask?  Is Pandas really judgey, hanging out on the street corner and being fierce to the style choices of people walking by?  Well, yes, but that's not the most relevant sense in which I mean it here.  

Eager evaluation means that once you call pd.read_csv(), Pandas immediately jumps to read the whole CSV into memory.

"Wait!" I hear you ask.  "Isn't that what we want?  Why would I call the

Continue Reading
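Dask's answer, in miniature (the file name is a hypothetical stand-in):

import dask.dataframe as dd

# Looks just like Pandas, but nothing is read yet - Dask only builds
# a task graph (lazy evaluation):
df = dd.read_csv("way_too_big.csv")
result = df["some_column"].mean()  # still lazy; still no work done

# Only .compute() triggers the actual reading and number-crunching:
result.compute()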
Python
30 Jul 2018

All That Is Solid Melts Into Graphs

Reshaping Pandas dataframes with a real-life example, and graphing it with Altair.

The last few Code Snippet Corners were about using Pandas as an easy way to handle input and output between files & databases.  Let's shift gears a little bit!  Among other reasons, because earlier today I discovered a package that exclusively does that, which means I can stop importing the massive Pandas package when all I really wanted to do with it was take advantage of its I/O modules.  Check it out!

So, rather than the entrances & exits, let's focus on all the crazy ways you can reshape data with Pandas!

Our Data

For our demonstration, I'll use a

Continue Reading
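As a taste of where this is headed, here's roughly what the reshape-then-graph pattern looks like with Altair (toy data, not the post's actual dataset):

import altair as alt
import pandas as pd

wide = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar"],
    "apples": [3, 4, 5],
    "oranges": [6, 5, 4],
})

# Reshape wide -> long, so each row is one (month, fruit, count) observation...
long = wide.melt(id_vars="month", var_name="fruit", value_name="count")

# ...which is the shape Altair's encodings expect:
alt.Chart(long).mark_line().encode(x="month", y="count", color="fruit")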
Pandas
28 Jul 2018

Automagically Turn JSON into Pandas DataFrames

Let Pandas do the heavy lifting for you when turning JSON into a DataFrame.

In his post about extracting data from APIs, Todd demonstrated a nice way to massage JSON into a pandas DataFrame. This method works great when our JSON response is flat, because dict.keys() only gets the keys on the first "level" of a dictionary. It gets a little trickier when our JSON starts to become nested though, as I experienced when working with Spotify's API via the Spotipy library. For example, take a look at a response from their https://api.spotify.com/v1/tracks/{id} endpoint:

import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

spotify_client_id = 'YOUR_ID'
spotify_
Continue Reading
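The punchline the post builds toward is Pandas' built-in JSON flattener. A sketch with a made-up, Spotify-ish nested response (in pandas 1.0+ this is pd.json_normalize; when the post was written it lived at pandas.io.json.json_normalize):

import pandas as pd

track = {
    "name": "Some Song",
    "duration_ms": 215000,
    "album": {"name": "Some Album", "release_date": "2018-01-01"},
}

# Nested dicts become dotted column names in a flat, one-row DataFrame:
pd.json_normalize(track)
# -> columns: name, duration_ms, album.name, album.release_date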
Pandas
23 Jul 2018

Trash Pandas: Messy, Convenient DB Operations via Pandas

(And a way to clean it up with SQLAlchemy).

Let's say you were continuing our task from last week: Taking a bunch of inconsistent Excel files and CSVs, and putting them into a database.

Let's say you've been given a new CSV that conflicts with some rows you've already entered, and you're told that these rows are the correct values.

Why Not Use Pandas' Built-in Method?

Pandas' built-in to_sql DataFrame method won't be useful here.  Remember, it writes as a block - if you set the if_exists flag to "replace", that'll make it replace the entire DB table with a new one based on the DF you're

Continue Reading
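To make that limitation concrete (a sketch; the table, file, and connection string are hypothetical stand-ins):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///example.db")
corrections = pd.read_csv("corrected_rows.csv")

# if_exists="replace" drops and recreates the ENTIRE table from this
# DataFrame - every existing row not in `corrections` is lost:
corrections.to_sql("measurements", engine, if_exists="replace", index=False)

# if_exists="append" just adds rows, leaving the stale ones in place.
# Neither option updates conflicting rows - hence the SQLAlchemy
# cleanup the post goes on to describe.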