Another "Intro to Data Analysis in Python Using Pandas" Post

Obligatory Pandas tutorial by a questionably qualified stranger.

Let’s face it: the last thing the world needs is another “Intro to Pandas” post. Anybody strange enough to read this blog surely had the same reaction to discovering Pandas as I did: a manic euphoria that can only be described as love at first sight. We wanted to tell the world, and that we did. A lot. Yet here I am, about to helplessly sing cliche praises one more time.

I’m a prisoner of circumstance here. As it turns out, the vast (and I mean vast) majority of our fans have a raging Pandas addiction. They come to our humble mom-and-pop shop here at Hackers and Slackers foaming at the mouth, going on raging benders for all Pandas-related content. If I had half a brain, I’d rename this site Pandas and Pandas and delete all non-Pandas-related content. Talk about cash money.

As a middle-ground, I’ve decided to do a bit of housekeeping. My previous “Intro to Pandas” post was an unflattering belligerent mess jotted into a Confluence instance long ago during a Friday night pregame. That mess snuck its way on to this blog, and has gone unnoticed for a year now. I've decided that this probably wasn't the best way to open up a series about the most influential Python library of all time. We're going to try this again... For the Pandas.

Intro to Pandas Rereleased: in IMAX 8k 4D

Pandas is used to analyze and modify tabular data in Python. When we say “tabular data,” we mean any instance in life where data is represented in a table format. Excel, SQL databases, shitty HTML tables.... they’ve all been the same thing with different syntax this whole time. Pandas can achieve anything that any other table can.

If you’re reasonably green to data analysis and are experiencing the “oh-my-God-all-data-professions-are-kinda-just-Excel” realization as we speak, feel free to take a moment. Great, that’s behind us now.

The Anatomy of a DataFrame

Tabular data in Pandas is referred to as a “DataFrame.” We can’t call everything “tables-” otherwise, our choice of vague terminology would grow horribly confusing when we refer to data in different systems. Between you and me though, DataFrames are basically tables.

So how do we represent two-dimensional data via command line: a concept which inherently interprets and displays information one-dimensionally?

"Oh nothing, we think it's cute."

DataFrames consist of individual parts which are easy-to-understand at face value. It’s the complexity of these things together, creating a sum greater than the whole of its parts, which fuels the seemingly endless power of DataFrames. If we want any hope of contributing to the field of Data Science, we need to not only understand the terminology but at least be aware of core concepts of what a DataFrame is beneath the hood. This understanding is what separates engineers from Excel monkeys.

Engineers

1

Win

Excel Nerds

0

Lose

Parts of a DataFrame

Were you expecting this post just to be a bunch of one-liners in Pandas? Good, I hope you're disappointed. Strap yourself in, we might actually learn something today. Class is now in session, baby. Let's break apart what makes a DataFrame, piece-by-piece:

The most basic description of any table would be a collection of columns and rows. Thinking abstractly, we can use the same definition for columns as we do for rows: a sequence of values separated by cells. The only distinction between the two is the direction (horizontal or vertical). Considering we can flip any table on its side and retain the same meaning, we could almost argue that rows and columns are actually the same thing. In Pandas, that's exactly what's happening: both rows and columns are considered to be a Series.

  • Series' are objects native to Pandas (and Numpy) which refer to one-dimensional sequences of data. Another example of a one-dimensional sequence of data could be an array, but series' are much more than arrays: they're a class of their own for many powerful reasons, which we'll see in a moment.
  • Axis refers to the 'direction' of a series, or in other words, whether a series is a column or a row. A series with an axis of 0 is a row, whereas a series with an axis of 1 is a column. This should help break the conception that columns are separate entities from rows: instead, they're the same object with a different attribute.
  • A series contains labels, which are given visual names for a row/column specifying labels allows us to call upon any labeled series in the same way we would access a value in a Python dictionary. For instance, accessing dataframe['awayTeamName'] returns the entire column matching the header "awayTeamName".
  • Every row and column has a numerical index. Most of the time, a row's label will be equivalent to the row's index. While it's common practice to define headers for columns, columns have indexes as well, which simply aren't shown. In this regard, Series share an attribute with lists/arrays, in that they are a collection of indexed values

Consider the last two points: we just described a series to work the same way as a Python dictionary, but also the same way as a Python list. That's right: series' objects are like the biracial offspring of lists and dicts. We can access any column by either its name or its index, and the same goes for rows. Even if we rip a column out from a DataFrame, each cell in that series will still retain the row labels for every cell. This means we can say things like get me column #3, and then find me the value for whatever was in the row labeled "Y". Of course, this works in the reverse as well. It's crazy how things get exponentially more powerful and complicated when we add entire dimensions, isn't it?

Loading Up Your Data

If you've made it this far, you've earned the right to start getting hands-on. Luckily, Pandas has plenty of methods to load tabular data into DataFrames, regardless if you're using static files, SQL, or quirkier methods, Pandas has you covered. Here are some of my favorite examples:

import pandas as pd

# Reads a local CSV file.
csv_df = pd.read_csv('data.csv')

# Similar to above
excel_df = pd.read_excel('data.xlsx')

# Creating tabular data from non-tabular JSON
json_df = pd.read_json('data.json')

# Direct db access utilizing SQLAlchemy
read_sql = read_sql('SELECT * FROM blah', conn=sqlalchemy_engine)

# My personal ridiculous favorite: HTML table to DataFrame.
read_html = read_html('examplePageWithTable.html')

# The strength of Google BigQuery: already officially supported by Pandas
read_gbq = read_gbq('SELECT * FROM test_dataset.test_table', projectid)

All of these achieve the same result of creating a DataFrame. No matter what horrible data sources you may have been forced to inherit, Pandas is here to help. Pandas knows our pain. Pandas is love. Pandas is life.

With data loaded, let's see how we can apply our new knowledge of series' to interact with out data.

Finding Data in Our Dataframe

Pandas has a method for finding a series by label, as well as a separate method for finding a series by index. These methods are .iloc and .loc, respectively. Let's say our DataFrame from the example above is stored as a variable named baseball_df. To get the values of a column by name, we would do the following:

baseball_df = baseball_df.iloc['homeTeamName']
print(baseball_df)

This would return the following:

0   Cubs
1   Indians
2   Padres
3   Diamondbacks
4   Giants
5   Blue Jays
6   Reds
7   Cubs
8   Rockies
9   Yankees
Name: homeTeamName, dtype: object

That's our column! We can see the row labels being listed alongside each row's value. Told ya so. Getting a column will also return the column's dtype, or data type. Data types can be set on columns explicitly. If they aren't, Pandas will generally either default to detecting that the data in the column is a float (returned for any column which only holds numerical values, despite number of decimal points) or an 'object', which is a fancy catch-all meaning "fuck if I know, there's letters and shit in there, it could be anything probably." Pandas doesn't try hard on its own to discern the types of data in each field.

If you're thinking ahead, you might see a looming conflict of interest with iloc. Since we've established that columns and rows are the same, and we're accessing series' based on criteria that is met by both columns and rows (every table has a first row and a first column), how does Pandas know what we want with .loc()? Short answer: It doesn't, so it just returns both!

baseball_df = baseball_df.loc[3]
print(baseball_df)
    homeTeamName    awayTeamName   startTime      duration_minutes
0   Cubs            Reds           18:20:00 UTC   188
1   Indians         Astros         18:20:00 UTC   194
2   Padres          Giants         18:20:00 UTC   185
3   Diamondbacks    Brewers        18:20:00 UTC   211

Ahhh, a 4x4 grid! This does, in fact, satisfy what we asked for- albiet in a clever, intentional way. "Clever and intentional" is actually a great way to describe Pandas as a library. This combination of ease and power is what makes Pandas so magnetic to curious newcomers.

Want another example? How about leveraging the unique attributes of series' to splice DataFrames as though they were arrays?

sliced_df = df.loc['homeTeamName':'awayTeamName']
print(sliced_df)
    homeTeamName    awayTeamName
0   Cubs            Reds        
1   Indians         Astros      
2   Padres          Giants      
3   Diamondbacks    Brewers     

...Did we just do that? We totally did. We were able to slice a two-dimensional set of data by using the same syntax that we'd used to slice arrays, thanks to the power of the series object.

Welcome to the Club

There are a lot more entertaining, mind-blowing ways to introduce people to Pandas. If our goal had been sheer amusement, we would have leveraged the cookie-cutter route to Pandas tutorials: overloading readers with Pandas "tricks" displaying immense power in minimal effort. Unfortunately, we took the applicable approach to actually retaining information. Surely this model of "informational and time consuming" will beat out "useless but instantly gratifying," right? RIGHT?

Whatever. I’ll schedule the Pandas and Pandas rebrand for next week. From now on when people want that quick fix, you can call me Pablo Escobar. Join us next time when we use Pandas data analysis to determine which private Caribbean island offers the best return on investment with all the filthy money we’ll make.

Hint: it’s definitely not the Fyre festival one.

Todd Birchard's' avatar
New York City Website
Product manager turned engineer with an ongoing identity crisis. Breaks everything before learning best practices. Completely normal and emotionally stable.