Code Snippet Corner

Real-world examples of Python being used to solve complex data problems, primarily using Jupyter notebooks.

Recasting Low-Cardinality Columns as Categoricals

Downcast strings in Pandas to their proper data types using HDF5.

The other day, I was grabbing a way-too-big DB query for local exploration.  It was literally over 2GB as a CSV - which is a pain for a number of reasons!  Not least because, while you're doing Exploratory Data Analysis, everything takes way too long - and it doesn't take much waiting for Cognitive Drift to break your rhythm!

Numerical columns can be taken down to size with the downcasting functions from a previous post.  But what about Object/String columns?  One of the best ways to reduce the size of a file like this is to recast any low-cardinality string columns as Categoricals.
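
The gist looks something like this (a minimal sketch with a made-up column; figuring out which columns have low enough cardinality to qualify is the interesting part):

import pandas as pd

# Hypothetical: a string column with only a handful of distinct values
df = pd.DataFrame({"state": ["NY", "CA", "NY", "TX", "CA"] * 100_000})

print(df["state"].memory_usage(deep=True))   # object dtype: reports tens of MB
df["state"] = df["state"].astype("category")
print(df["state"].memory_usage(deep=True))   # category dtype: a small fraction of that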

Removing Duplicate Columns in Pandas

Dealing with duplicate column names in your Pandas DataFrame.

Sometimes you wind up with duplicate column names in your Pandas DataFrame. This isn't necessarily a huge deal if we're just messing with a smallish file in Jupyter.  But, if we wanna do something like load it into a database, that'll be a problem.  It can also interfere with our other cleaning functions - I ran into this the other day when reducing the size of a giant data file by downcasting it (as per this previous post).  The cleaning functions required a 1D input (so, a Series or List) - but calling the name of a duplicate column gave back a two-column DataFrame instead of a single Series.
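
One common fix (a sketch, not necessarily the exact approach from the post) is to keep only the first occurrence of each column name:

import pandas as pd

# Hypothetical DataFrame with a repeated column name
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "a"])

print(type(df["a"]))                      # DataFrame, not Series - two columns share the name

df = df.loc[:, ~df.columns.duplicated()]  # keep the first copy of each duplicated name
print(type(df["a"]))                      # now a proper 1D Series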

Using Random Forests for Feature Selection with Categorical Features

Python helper functions for adding feature importances and displaying them as a single variable.

Notebook here.  Helper functions here.

One of the best features of Random Forests is that it has built-in Feature Selection.  Explicability is one of the things we often lose when we go from traditional statistics to Machine Learning, but Random Forests lets us actually get some insight into our dataset instead of just having to treat our model as a black box.

One problem, though - it doesn't work that well for categorical features.  Since you'll generally have to One-Hot Encode a categorical feature (for instance, turn something with 7 categories into 7 variables that are a "True/False"), you'll get a separate importance score for each of those dummy columns instead of a single score for the original feature.
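
Presumably the helper functions do something along these lines - add the dummy columns' importances back together (a sketch; the prefix-matching scheme and names are my own assumptions, not the post's code):

import pandas as pd

def grouped_importances(model, feature_names, original_features):
    """Sum one-hot dummy importances back into their original categorical features."""
    imp = pd.Series(model.feature_importances_, index=feature_names)
    grouped = {}
    for feat in original_features:
        mask = imp.index.str.startswith(feat + "_")
        grouped[feat] = imp[mask].sum() if mask.any() else imp.get(feat, 0.0)
    return pd.Series(grouped).sort_values(ascending=False)

# e.g. after fitting a forest on pd.get_dummies(X):
#   grouped_importances(rf, X_encoded.columns, ["day_of_week", "color", "price"])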

Tuning Random Forests Hyperparameters with Binary Search Part III: min_samples_leaf

Tune the min_samples_leaf parameter for a Random Forests classifier in scikit-learn in Python.

Part 1 (n_estimators) here
Part 2 (max_depth) here
Notebook here


Another parameter, another set of quirks!

min_samples_leaf is sort of similar to max_depth.  It helps us avoid overfitting.  It's also non-obvious what you should use as your upper and lower limits to search between.  Let's do what we did last week - build a forest with default parameters, see what it does, and use that to pick our upper and lower limits!

import pandas as pd

from sklearn.metrics import precision_recall_curve
from sklearn.metrics import auc
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer  # the excerpt cuts off here; assuming a binary-classification toy dataset
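
The excerpt ends at the imports, but the approach described above looks roughly like this (a sketch continuing from the imports, with the breast-cancer toy dataset standing in for the real one): fit a default forest, then look at how big its leaves actually get to pick sensible search bounds.

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Collect the number of samples in every leaf of every tree
leaf_sizes = []
for tree in rf.estimators_:
    t = tree.tree_
    is_leaf = t.children_left == -1
    leaf_sizes.extend(t.n_node_samples[is_leaf])

# With the default min_samples_leaf=1 the smallest leaves are tiny;
# the largest leaf gives a rough ceiling for the binary search.
print(min(leaf_sizes), max(leaf_sizes))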

Tuning Random Forests Hyperparameters with Binary Search Part II: max_depth

Tune the max_depth parameter for a Random Forests classifier in scikit-learn in Python.

Continued from here.

Notebook for this post is here.

Binary search code itself is here.


max_depth is an interesting parameter.  While n_estimators has a tradeoff between speed & score, max_depth has the possibility of improving both.  By limiting the depth of your trees, you can reduce overfitting.

Unfortunately, deciding on upper & lower bounds is less than straightforward.  It'll depend on your dataset.  Luckily, I found a post on StackOverflow that had a link to a blog post that had a promising methodology.  

First, we build a tree with default arguments and fit it to our data.
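
The excerpt stops here, but that first step presumably looks something like this (a sketch with a forest rather than a single tree, and a stand-in dataset):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# With max_depth=None, each tree grows until its leaves are pure.
# The depths the trees reach on their own give a natural upper bound to search under.
depths = [tree.tree_.max_depth for tree in rf.estimators_]
print(min(depths), max(depths))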

Tuning Machine Learning Hyperparameters with Binary Search

Tune the n_estimators parameter for a Random Forests classifier in scikit-learn in Python.

Ah, hyperparameter tuning.  Time & compute-intensive.  Frequently containing weird non-linearities in how changing a parameter changes the score and/or the time it takes to train the model.

RandomizedSearchCV goes noticeably faster than a full GridSearchCV but it still takes a while - which can be rough, because in my experience you do still need to be iterative with it and experiment with different distributions.  Plus, then you've got hyper-hyperparameters to tune - how many iterations SHOULD you run it for, anyway?

I've been experimenting with using the trusty old Binary Search to tune hyperparameters.  I'm finding it has two big advantages.
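
The core idea is just binary search over a single integer hyperparameter - something like this sketch (the dataset and scoring setup here are placeholders, not the post's actual code):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def score_forest(n_estimators):
    rf = RandomForestClassifier(n_estimators=n_estimators, random_state=42, n_jobs=-1)
    return cross_val_score(rf, X, y, cv=3).mean()

# Halve the search interval each step, keeping the half whose midpoint
# scores at least as well, until the interval is small.
low, high = 10, 1000
while high - low > 10:
    mid = (low + high) // 2
    if score_forest(mid) >= score_forest(high):
        high = mid   # the smaller forest does at least as well - search lower
    else:
        low = mid
print("chosen n_estimators:", high)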

Importing Excel Datetimes Into Pandas, Part II

Import dates & times from Excel .xlsx files into Pandas!

What if, like during my data import task a few months back, the dates & times are in separate columns?  This gives us a few new issues.  Let's import that Excel file!

import pandas as pd
import xlrd
import datetime

# Read the sheet with Pandas as usual...
df = pd.read_excel("hasDatesAndTimes.xlsx", sheet_name="Sheet1")

# ...and also open the workbook with xlrd to grab its datemode
# (Excel workbooks can use either the 1900-based or 1904-based epoch)
book = xlrd.open_workbook("hasDatesAndTimes.xlsx")
datemode = book.datemode

And let's see that time variable!

df["Time"]
Index Time
0 0.909907
1 0.909919
2 0.909931
3 0.909942
4 0.909954
df["Time"].map(lambda x: xlrd.xldate_

Importing Excel Datetimes Into Pandas, Part I

Pandas & Excel, Part 1.

Different file formats are different!  For all kinds of reasons!

A few months back, I had to import some Excel files into a database. In this process I learned so much about the delightfully unique way Excel stores dates & times!  

The basic datetime will be a decimal number, like 43324.909907407404.  The number before the decimal is the day, the number afterwards is the time.  So far, so good - this is pretty common for computers.  The date is often the number of days past a certain date, and the time is the number of seconds.  
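
As a quick sanity check (my own sketch, using xlrd's helper rather than whatever the original post does), that example value decodes like so - the catch being that Excel stores the time part as a fraction of the day, not a count of seconds:

import xlrd

# datemode 0 is the usual 1900-based epoch; some old Mac-produced files use 1904 (datemode 1)
serial = 43324.909907407404
print(xlrd.xldate_as_datetime(serial, 0))
# -> 2018-08-12 21:50:16  (0.9099... of a day is about 21 hours 50 minutes)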

So, let's load it up and take a look.

Lazy Pandas and Dask

Increase the performance of Pandas with Dask.

Ah, laziness.  You love it, I love it, everyone agrees it's just better.

Flesh-and-blood pandas are famously lazy.  Pandas the package, however, uses Eager Evaluation.  What's Eager Evaluation, you ask?  Is Pandas really judgey, hanging out on the street corner and being fierce to the style choices of people walking by?  Well, yes, but that's not the most relevant sense in which I mean it here.  

Eager evaluation means that once you call pd.read_csv(), Pandas immediately jumps to read the whole CSV into memory.

"Wait!" I hear you ask.  "Isn't that what we want?  Why would I call the

All That Is Solid Melts Into Graphs

Reshaping Pandas dataframes with a real-life example, and graphing it with Altair.

The last few Code Snippet Corners were about using Pandas as an easy way to handle input and output between files & databases.  Let's shift gears a little bit!  Among other reasons, because earlier today I discovered a package that exclusively does that, which means I can stop importing the massive Pandas package when all I really wanted to do with it was take advantage of its I/O modules.  Check it out!

So, rather than the entrances & exits, let's focus on all the crazy ways you can reshape data with Pandas!

Our Data

For our demonstration, I'll use a real-life dataset.
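
The dataset itself is cut off in this excerpt, but the reshape-then-plot pattern the post is about looks roughly like this (entirely made-up data):

import pandas as pd
import altair as alt

# Wide format: one column per city
wide = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar"],
    "NYC": [35, 38, 45],
    "Austin": [62, 65, 72],
})

# Melt to long format, which is the shape Altair wants
long = wide.melt(id_vars="month", var_name="city", value_name="avg_temp")

alt.Chart(long).mark_line().encode(x="month", y="avg_temp", color="city")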

Getting Conda Envs (And Environment Variables!) To Play Nicely With Cron

Set up CRON jobs to interact with Conda environments.

This isn't really a tutorial on cron in general; people who are better at Linux have written far better ones than I could.  Here's one: http://mediatemple.net/blog/news/complete-beginners-guide-cron-part-1/  This is more of a code journaling exercise for a problem that I didn't find a neat-and-tidy answer to online when I was looking for it, and that I presume at least one person will encounter at some point between now and the heat death of the universe.

Let's say you've got two different Conda envs:  production and development.  Let's say that, in addition to having different packages installed, they also rely on different environment variables.
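
The excerpt stops there, but the eventual fix has the shape of the crontab entry below (the paths, env name, and script are all placeholders - the point is that cron starts jobs with an almost empty environment, so you have to source conda's setup script and activate the env explicitly):

# m h  dom mon dow   command
0 6 * * * /bin/bash -c "source /home/me/miniconda3/etc/profile.d/conda.sh && conda activate production && python /home/me/jobs/nightly_job.py >> /home/me/logs/nightly_job.log 2>&1"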