- Recasting Low-Cardinality Columns as Categoricals
- Removing Duplicate Columns in Pandas
- Downcast Numerical Data Types with Pandas
- Using Random Forests for Feature Selection with Categorical Features
- Tuning Random Forests Hyperparameters with Binary Search Part III: min_samples_leaf
- Tuning Random Forests Hyperparameters with Binary Search Part II: max_depth
- Tuning Machine Learning Hyperparameters with Binary Search
- Importing Excel Datetimes Into Pandas, Part II
- Importing Excel Datetimes Into Pandas, Part I
- All That Is Solid Melts Into Graphs
- Trash Pandas: Messy, Convenient DB Operations via Pandas
- A Dirty Way of Cleaning Data (ft. Pandas & SQL)
- Getting Conda Envs (And Environment Variables!) To Play Nicely With Cron
- Using Pandas and SQLAlchemy to Simplify Databases
Last few Code Snippet Corners were about using Pandas as an easy way to handle input and output between files & databases. Let's shift gears a little bit! Among other reasons, because earlier today I discovered a package that exclusively does that, which means I can stop importing the massive Pandas package when all I really wanted to do with it was take advantage of its I/O modules. Check it out!
So, rather than the entrances & exits, let's focus on all the crazy ways you can reshape data with Pandas!
For our demonstration, I'll use a dataset based on something I was once actually sent. It was a big CSV with sensor readings from HVAC systems. Each line had a different house, a different room, a datetime, and readings from a bunch of different types of sensors. Oh, hrm, I probably shouldn't use data I got from a client. Uh...
GENERATING DUMMY TEMPERATURE DATA
(Feel free to skip to next part if you don't care)
We want it to fluctuate, but we don't want to just make a bunch of totally random values - a reading should have some relationship to the reading taken a second earlier.
NumPy for some Randomness, and the
repeat functions from
itertools. Maybe I'll do an in-depth post on these at some point, but the code I'll be writing with them will be pretty short and hopefully somewhat self-demonstrating. If you wanna go deeper here's some good material: Official Docs, Good article
import numpy as np from itertools import accumulate, repeat
We want there to be some random "noise", but we also want the occasional substantive change. We'll reflect this by having it so that 90% of the time we get a small fluctuation, with a 10% chance of a smaller fluctuation.
def genTempDataPoint(x, *args): if np.random.rand(1) <= 0.9: return x + np.random.uniform(-3,3,1) else: return x + np.random.uniform(-10,10,1)
Now let's see some test points!
list(accumulate(repeat(70, 5), genTempDataPoint)) [70, 69.00258239202094, 59.34919781643355, 56.60722073795931, 57.265078261782946]
Sure, fine, why not. Good enough for our purposes! Now let's put it all together so we can just call it with a base temp and the number of points we want.
def genTempData(base, n): return list(accumulate(repeat(base, n), genTempDataPoint))
To simulate the dataset, we actually need to mix it up. Or else what good are the GroupBys gonna be? So, let's create a problem to fix later! Here's a function to create a simplified version of the dataset - each row will have a location ID, a number corresponding to time (just raw ints, I'm not making actual datetimes - I've spent too much time on this part already). We'll also generate humidity values, to add another monkey wrench to fix later (we'll still use the
from itertools import chain def makeLocation(name, base1, n1, base2, n2): return [(x, name, x, x) for x in enumerate(zip(genTempData(base1, n1), genTempData(base2, n2)) )] bigList = list(chain.from_iterable(makeLocation(str(x), 70, 15, 40, 15) for x in range(5))) np.random.shuffle(bigList) df = pd.DataFrame(bigList, columns = ["Time", "Loc", "Temp", "Hum"])
Back To The Main Plot
Let's look at some test rows!
# Viewing test rows df.iloc[:5] Time Loc Temp Hum 10 4 68.396970 34.169753 13 0 80.288846 42.076786 7 4 69.923273 37.967951 6 0 71.781362 41.186802 5 2 62.678844 37.321636
Now, when I'm getting started with a new dataset, one of the first things I like to do is make some graphs. As of late, my favorite package has been Altair. Looks very nice by default, is pretty easy to iterate with, and has nice declarative syntax.
Only one problem! It wants date in "long-form" - as in, rather than each row having several variables of interest, each row has one (or more) "ID" variables, one numerical value, and the name of the variable we're measuring. So for instance, something more like this:
Time Loc variable value 10 4 Temp 68.396970 13 0 Temp 80.288846 7 4 Temp 69.923273 6 0 Temp 71.781362 5 2 Temp 62.678844
Not quite sure why! Buuut, that's kind of a feature of modern coding - we're sitting on an inheritance of libraries that have built up over the years, and so more often than not we're just building the "plumbing" between existing stuff. It's cool! And good! It lets us separate Function from Implementation. We don't need to know what's going on under the hood - we just need to know thing X will produce an output we want, and that in order to get it we first need to reshape what we've already got into an input that it'll accept. Since that's such a huge part of coding these days, Pandas' power in that realm is super useful.
Sooo, how do we get from here to there? Shockingly easily!
melted = pd.melt(df, id_vars=["Time", "Loc"])
Well, obviously we're not REALLY done yet. Half the point of having such terse, expressive code is that we can do MORE things!
Let's say we want to see how humidity & temperature change over the course of the day. First, we'll have to grab all the readings from a single location. Let's say Location 3!
loc3 = melted[melted["Loc"]=="3"]
Altair's pretty neat.
(alt.Chart(loc3) .mark_line() .encode(x='Time:O', #We're encoding time as an Ordinal y='value:Q', color='variable:N'))
Hrm, lot of room there at the bottom. If we were in an interactive session, we could make this interactive (zoomable and navigable!) by just adding the
.interactive() method to the end, but I don't know how to do that in the blog. Regardless, it's pretty easy to rescale if we want a closer look!
(alt.Chart(loc3) .mark_line() .encode(x='Time:O', y=alt.Y('value:Q', scale=alt.Scale(zero=False)), color='variable:N'))
Let's try it with just temperature, and color will encode the location!
meltedJustTemp = pd.melt(df, id_vars=["Time", "Loc"], value_vars= ["Temp"]) (alt.Chart(meltedJustTemp) .mark_line() .encode(x='Time:O', y='value:Q', color='Loc:N'))
Let's zoom in again...
(alt.Chart(meltedJustTemp) .mark_line() .encode(x='Time:O', y=alt.Y('value:Q', scale=alt.Scale(zero=False)), color='Loc:N'))
Altair also lets us Facet our graphs extremely flexibly & painlessly.
alt.Chart(melted).mark_line().encode( x='Time:O', y=alt.Y('value:Q', scale=alt.Scale(zero=False)), color='Loc:N', column="variable")
Or how about another way! Let's see humidity & temp, location by location.
alt.Chart(melted).mark_line().encode( x='Time:O', y=alt.Y('value:Q', scale=alt.Scale(zero=False)), color='variable:N', row="Loc")
We could make them nicer (there's a WIDE array of customizations), but I'm looking to simulate Exploratory Data Analysis. I can't think of another graphing package in Python that has quite this level of "instant gratification" for so many different variations.