Last few Code Snippet Corners were about using Pandas as an easy way to handle input and output between files & databases.  Let's shift gears a little bit!  Among other reasons, because earlier today I discovered a package that exclusively does that, which means I can stop importing the massive Pandas package when all I really wanted to do with it was take advantage of its I/O modules.  Check it out!

So, rather than the entrances & exits, let's focus on all the crazy ways you can reshape data with Pandas!

Our Data

For our demonstration, I'll use a dataset based on something I was once actually sent.  It was a big CSV with sensor readings from HVAC systems.  Each line had a different house, a different room, a datetime, and readings from a bunch of different types of sensors.  Oh, hrm, I probably shouldn't use data I got from a client.  Uh...

GENERATING DUMMY TEMPERATURE DATA

(Feel free to skip to next part if you don't care)

We want it to fluctuate, but we don't want to just make a bunch of totally random values - a reading should have some relationship to the reading taken a second earlier.

Let's use NumPy for some Randomness, and the accumulate and repeat functions from itertools.  Maybe I'll do an in-depth post on these at some point, but the code I'll be writing with them will be pretty short and hopefully somewhat self-demonstrating.  If you wanna go deeper here's some good material: Official Docs, Good article.

import numpy as np
from itertools import accumulate, repeat

We want there to be some random "noise", but we also want the occasional substantive change.  We'll reflect this by having it so that 90% of the time we get a small fluctuation, with a 10% chance of a smaller fluctuation.

def genTempDataPoint(x, *args):
    if np.random.rand(1) <= 0.9:
        return x + np.random.uniform(-3,3,1)[0]
    else:
        return x + np.random.uniform(-10,10,1)[0]

Now let's see some test points!

list(accumulate(repeat(70, 5), genTempDataPoint))
[70,
 69.00258239202094,
 59.34919781643355,
 56.60722073795931,
 57.265078261782946]

Sure, fine, why not.  Good enough for our purposes!   Now let's put it all together so we can just call it with a base temp and the number of points we want.

def genTempData(base, n):
    return list(accumulate(repeat(base, n), 
                           genTempDataPoint))

To simulate the dataset, we actually need to mix it up.  Or else what good are the GroupBys gonna be?  So, let's create a problem to fix later!  Here's a function to create a simplified version of the dataset - each row will have a location ID, a number corresponding to time (just raw ints, I'm not making actual datetimes - I've spent too much time on this part already).  We'll also generate humidity values, to add another monkey wrench to fix later (we'll still use the genTempData function).

from itertools import chain

def makeLocation(name, base1, n1, base2, n2):
    return [(x[0], name, x[1][0], x[1][1]) 
        for x in enumerate(zip(genTempData(base1, n1),
              genTempData(base2, n2)) )]

bigList = list(chain.from_iterable(makeLocation(str(x), 
                                                70, 
                                                15,
                                                40, 
                                                15) 
                         for x in range(5)))
np.random.shuffle(bigList)

df = pd.DataFrame(bigList, 
                  columns = ["Time", "Loc", "Temp", "Hum"])

Back To The Main Plot

Let's look at some test rows!

# Viewing test rows

df.iloc[:5]
Time	Loc	Temp     	Hum
10	4	68.396970	34.169753
13	0	80.288846	42.076786
7	4	69.923273	37.967951
6	0	71.781362	41.186802
5	2	62.678844	37.321636

Now, when I'm getting started with a new dataset, one of the first things I like to do is make some graphs.  As of late, my favorite package has been Altair.  Looks very nice by default, is pretty easy to iterate with, and has nice declarative syntax.

Only one problem!  It wants date in "long-form" - as in, rather than each row having several variables of interest, each row has one (or more) "ID" variables, one numerical value, and the name of the variable we're measuring.  So for instance, something more like this:

Time	Loc	variable	value
10	4	Temp	        68.396970
13	0	Temp	        80.288846
7	4	Temp	        69.923273
6	0	Temp	        71.781362
5	2	Temp	        62.678844

Not quite sure why!  Buuut, that's kind of a feature of modern coding - we're sitting on an inheritance of libraries that have built up over the years, and so more often than not we're just building the "plumbing" between existing stuff.  It's cool!  And good!  It lets us separate Function from Implementation.  We don't need to know what's going on under the hood - we just need to know thing X will produce an output we want, and that in order to get it we first need to reshape what we've already got into an input that it'll accept.  Since that's such a huge part of coding these days, Pandas' power in that realm is super useful.

Sooo, how do we get from here to there?  Shockingly easily!

melted = pd.melt(df, id_vars=["Time", "Loc"])

Done!

Well, obviously we're not REALLY done yet.  Half the point of having such terse, expressive code is that we can do MORE things!

Let's say we want to see how humidity & temperature change over the course of the day.  First, we'll have to grab all the readings from a single location.  Let's say Location 3!

loc3 = melted[melted["Loc"]=="3"]

Altair's pretty neat.

(alt.Chart(loc3)
 .mark_line()
 .encode(x='Time:O', #We're encoding time as an Ordinal 
         y='value:Q',
         color='variable:N'))

Hrm, lot of room there at the bottom.  If we were in an interactive session, we could make this interactive (zoomable and navigable!) by just adding the .interactive() method to the end, but I don't know how to do that in the blog.  Regardless, it's pretty easy to rescale if we want a closer look!

(alt.Chart(loc3)
 .mark_line()
 .encode(x='Time:O',
         y=alt.Y('value:Q', scale=alt.Scale(zero=False)),
         color='variable:N'))

Let's try it with just temperature, and color will encode the location!

meltedJustTemp = pd.melt(df, 
                         id_vars=["Time", "Loc"],
                        value_vars= ["Temp"])

(alt.Chart(meltedJustTemp)
 .mark_line()
 .encode(x='Time:O',
         y='value:Q',
         color='Loc:N'))

Let's zoom in again...

(alt.Chart(meltedJustTemp)
 .mark_line()
 .encode(x='Time:O',
         y=alt.Y('value:Q', scale=alt.Scale(zero=False)),
         color='Loc:N'))

Altair also lets us Facet our graphs extremely flexibly & painlessly.

alt.Chart(melted).mark_line().encode(
      x='Time:O',
      y=alt.Y('value:Q', scale=alt.Scale(zero=False)),
      color='Loc:N',
      column="variable")

Or how about another way!  Let's see humidity & temp, location by location.

alt.Chart(melted).mark_line().encode(
      x='Time:O',
      y=alt.Y('value:Q', scale=alt.Scale(zero=False)),
      color='variable:N',
      row="Loc")

We could make them nicer (there's a WIDE array of customizations), but I'm looking to simulate Exploratory Data Analysis.  I can't think of another graphing package in Python that has quite this level of "instant gratification" for so many different variations.