Data Analysis

Prepare reports and draw meaningful conclusions from data. Learn to clean and manipulate datasets to draw meaningful conclusions from existing information.
Code Snippet Corner
03 Jun 2019

Recasting Low-Cardinality Columns as Categoricals

Downcast strings in Pandas to their proper data-types using HDF5.
Recasting Low-Cardinality Columns as Categoricals

The other day, I was grabbing a way-too-big DB query for local exploration.  It was literally over 2GB as a CSV - which is a pain for a number of reasons!  Not the least of which being that, while you're doing Exploratory Data Analysis, things take way too long - it doesn't take long for Cognitive Drift to break your rhythm!

Numerical columns can be taken down to size with the downcasting functions from a previous post.  But what about Object/String columns?  One of the best ways to reduce the size of a file like this is to recast

Continue Reading
Code Snippet Corner
28 May 2019

Removing Duplicate Columns in Pandas

Dealing with duplicate column names in your Pandas DataFrame.
Removing Duplicate Columns in Pandas

Sometimes you wind up with duplicate column names in your Pandas DataFrame. This isn't necessarily a huge deal if we're just messing with a smallish file in Jupyter.  But, if we wanna do something like load it into a database, that'll be a problem.  It can also interfere with our other cleaning functions - I ran into this the other day when reducing the size of a giant data file by downcasting it (as per this previous post).  The cleaning functions required a 1D input (so, a Series or List) - but calling the name of a duplicate column gave

Continue Reading
Pandas
28 May 2019

Using Hierarchical Indexes With Pandas

Use Panda's Multiindex to make your data work harder for you.
Using Hierarchical Indexes With Pandas

I've been wandering into a lot of awkward conversations lately, most of them being about how I spend my free time. Apparently "rifling through Python library documentation in hopes of finding dope features" isn't considered a relatable hobby by most people. At least you're here, so I must be doing something right occasionally.

Today we'll be venturing off into the world of Pandas indexes. Not just any old indexes... hierarchical indexes. Hierarchical indexes take the idea of having identifiers for rows, and extends this concept by allowing us to set multiple identifiers with a twist: these indexes hold parent/child

Continue Reading
Pandas
20 May 2019

Reshaping Pandas DataFrames

A guide to DataFrame manipulation using groupby, melt, pivot tables, pivot, transpose, and stack.
Reshaping Pandas DataFrames

Summer is just around the corner and everybody seems to be asking the same question: “does my data look... out of shape?” Whether you’re a scientist or an engineer, data-image dysmorphia can lead to serious negative thoughts which leave you second-guessing our data.  

Much has already been said about modifying DataFrames on a “micro” level, such as column-wise operations. But what about modifying entire DataFrames at once? When considering Numpy’s role in general mathematics, it should come as no surprise that Pandas DataFrames have a lot of similarities to the matrices we learned high school pre-calc; namely, they

Continue Reading
SQL
19 Feb 2019

Welcome to SQL: Modifying Databases and Tables

Brush up on SQL fundamentals such as creating tables, schemas, and views.
Welcome to SQL: Modifying Databases and Tables

SQL: we all pretend to be experts at it, and mostly get away with it thanks to StackOverflow. Paired with our vast experience of learning how to code in the 90s, our field work of PHPMyAdmin and LAMP stacks basically makes us experts. Go ahead and chalk up a win for your resume.

SQL has been around longer than our careers have, so why start a series on it now? Surely there’s sufficient enough documentation that we can Google the specifics whenever the time comes for us to write a query? That, my friends, is precisely the problem. Regardless

Continue Reading
Data Vis
18 Dec 2018

Geocoding Raw Datasets for Mapbox

Use the Mapbox Python SDK to transform a collection of addresses into lat/long coordinates.
Geocoding Raw Datasets for Mapbox

This wouldn't be a proper data blog unless we spend a vast majority of our time talking about cleaning data. Chances are if you're pursuing analysis that's groundbreaking (or worthwhile), we're probably starting with some ugly, untapped information. It turns out Mapbox has an API specifically for this purpose: the Mapbox Geocoding API.

Geocoding is a blanket term for turning vague information into specific Lat/Long coordinates. How vague, you ask? The API covers:

  • Pinpointing exact location via street address.
  • Locating regions or cities by recognizable name (ie: Rio de Janeiro).
  • Locating cities by highly unspecific name (Geocoding for "Springfield"
Continue Reading
Excel
31 Aug 2018

Dynamic Tension! Creating and Using Dynamic Named Ranges in Excel

Dynamically load data in smart pivot tables.
Dynamic Tension! Creating and Using Dynamic Named Ranges in Excel

First things first, 10 points to anyone who understands what the title is referring to (that's either Charles Atlas' workout philosophy, or one of core tenants of Bokononism in Kurt Vonnegut's standout Cat's Cradle).

Now, with my obscure reference quota filled for the day, let's assume that you've been working as an analyst for some time now. As stated in a previous post, generally you're likely to get promoted into a management position, where you no longer need your formidable skills of analysis, and they too, like all good things, will atrophy to nothingness. However, for those of you who

Continue Reading
Excel
10 Jun 2018

Getting Iffy With it: Conditional Statements in Excel

Effectively utilize conditionals such as IF statements in your Excel workflow.
Getting Iffy With it: Conditional Statements in Excel

If you've been following along, we discussed in the last several posts of this series how, if you're not working in a very "tech forward" organization (like my two compatriots on his site), but you have the same title, you're probably obtaining your data from another department (or it might be a sentient sponge, or a gang of squirrels with dreams of world domination, you'll actually have no idea) who you will have no contact with. As a side effect of this...rather strange situation (ya know, like being a child and being abandoned briefly by your mother in a

Continue Reading
Excel
05 Jun 2018

Taking Out the Trash: Dirty Data in Excel

Dealing With Dirty Data in Excel (continued).
Taking Out the Trash: Dirty Data in Excel

In my last post, we explored the organizational structure of many large companies and how this pertains to one's duties as a fledgling data analyst. I highly recommend you go back and read the first post on "dirty" data, but just in case you're one of those rebels who thinks that they're too cool to read part 1, here's a quick refresher to put you back in the analytical mindset (which is the perfect combination tactical laziness, ADHD, and a complete disregard for downside risk).

Essentially, the reason why most data analysts exist within a company (remember, NOT a tech

Continue Reading
Excel
31 May 2018

Dealing with Dirty Data in Excel

Cleaning data Excel in the absence of tools designed to do so.
Dealing with Dirty Data in Excel

In my last post, we discussed what separates a true analyst (read: technical) from a project manager wearing the mask of an analyst like some Scott Snyder era Joker (I figure that there's a solid overlap between fans of comic books and fans of the real world application of data. Note that this is a study with an N = 1 so it bares no statistical significance, but I have a funny feeling...call it spidey sense).

Full disclosure, this post comes mostly out of my inability to sleep in my hotel room in Chicago following a grueling day of doing

Continue Reading
Excel
29 May 2018

Doing the Excel Data Dance

Taking a step back to reflect on the problems Excel intends to solve.
Doing the Excel Data Dance

If you've been reading along, over the last several posts you've learned the two major skills that any self-respecting Excel jockey counts as their go-tos: the ability to lookup (remember, I'm partial to index-match, but if you learned VH lookup, ride that until you crash your system) and the ability to pivot.

Now here's something really interesting: until we pierce the veil on doing some VB macros, by truly mastering just those two skills, you can handle basically any job that Excel can actually handle, anything more advanced is just a matter of doing it faster (and faster, and faster)

Continue Reading
Excel
24 May 2018

Adventures in Excel: Power to the Pivot

Power up your Excel Pivot Tables with these pro moves.
Adventures in Excel: Power to the Pivot

During the last discussion, you've (hopefully) learned how to generate a pivot table, and learned about the four "buckets" that can house your columns:

  • Filter
  • Row
  • Column
  • Value

I'm also going to make the wild assumption that you've played around with your newly birthed pivot table, taking your column headings from your "raw" data (in the lingua franca of the numerically inclined, this means the data just as you've received it, before you add formulae or engage in any other manipulation.) Hopefully this gave you a feel of how a pivot table works, and if your synapses are wired in

Continue Reading
Excel
18 May 2018

The One Excel Formula to Rule Them All

Unlock the secrets of Excel's Index Match.
The One Excel Formula to Rule Them All

In my last entry, we discussed how to write a formula, and you've been armed with what each piece of the formula represents (the command, the variables, and the definition of an array). With this knowledge, you've actually been armed with the keys to the kingdom, and you're finally going to learn how to do something fancy.

There comes a time in every person's life where just by learning one piece of information, suddenly everything else you've learned clicks into place. For those who utilize Excel, learning this formula WILL BE THAT FOR YOU. In our last entry, I stated

Continue Reading
Pandas
18 Apr 2018

Dropping Rows of Data Using Pandas

Square one of cleaning your Pandas Dataframes: dropping empty or problematic data.
Dropping Rows of Data Using Pandas

You've heard the cliché before: it is often cited that roughly %80~ of a data scientist's role is dedicated to cleaning data sets. I Personally haven't looked in to the papers or clinical trials which prove this number (that was a joke), but the idea holds true: in the data profession, we find ourselves doing away with blatantly corrupt or useless data. The simplistic approach is to discard such data entirely, thus here we are.

What constitutes 'filthy' data is project-specific, and at times borderline subjective. Occasionally, the offenders are more obvious: these might include chunks of data which are

Continue Reading