Code Snippet Corner

Real-world examples of Python being used to solve complex data problems, primarily using Jupyter notebooks.

Using Random Forests for Feature Selection with Categorical Features

Python helper functions for adding up feature importances and displaying them as a single variable.

Notebook here.  Helper functions here.

One of the best features of Random Forests is that it has built-in Feature Selection.  Explicability is one of the things we often lose when we go from traditional statistics to Machine Learning, but Random Forests lets us actually get some insight into our dataset instead of just having to treat our model as a
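The helper functions themselves aren't reproduced in this excerpt, so here is a minimal sketch of the general idea rather than the post's actual code. It assumes the categorical columns were one-hot encoded with names like `color_red` (the default naming from `pd.get_dummies`), and it simply sums each dummy column's importance back onto its parent column. The function name, the sample column names, and the importance values are all made up for illustration.

```python
from collections import defaultdict


def combine_dummy_importances(feature_names, importances, categorical_columns, sep="_"):
    """Sum one-hot column importances back onto their original categorical column.

    Columns that don't start with "<categorical column><sep>" are kept as-is.
    """
    totals = defaultdict(float)
    for name, importance in zip(feature_names, importances):
        parent = name
        for column in categorical_columns:
            if name.startswith(column + sep):
                parent = column
                break
        totals[parent] += importance
    # Largest combined importance first.
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Hypothetical values standing in for forest.feature_importances_.
feature_names = ["age", "color_red", "color_blue", "city_NYC", "city_LA"]
importances = [0.40, 0.15, 0.10, 0.25, 0.10]

combined = combine_dummy_importances(feature_names, importances, ["color", "city"])
print(combined)
```

With a fitted scikit-learn forest, `feature_names` would typically be the columns of the encoded dataframe and `importances` would be `forest.feature_importances_`. One caveat of prefix matching: a numeric column such as `city_population` would be folded into `city`, so real column names may need a stricter rule.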

Code Snippet Corner | Matthew Alhonte
matt | Jan 15
Sep 24

Tuning Random Forests Hyperparameters with Binary Search Part III: min_samples_leaf

Tune the min_samples_leaf parameter for a Random Forests classifier in scikit-learn in Python

Part 1 (n_estimators) here
Part 2 (max_depth) here
Notebook here

Another parameter, another set of quirks!

min_samples_leaf is sort of similar to max_depth.  It helps us avoid overfitting.  It's also non-obvious what you should use as your upper and lower limits to search between.  Let's do what we did last week - build a forest

Code Snippet Corner | Matthew Alhonte
matt | Jan 15
Sep 17

Tuning Random Forests Hyperparameters with Binary Search Part II: max_depth

Code Snippet Corner is back! Tune the max_depth parameter for a Random Forests classifier in scikit-learn in Python

Continued from here

Notebook for this post is here

Binary search code itself is here

max_depth is an interesting parameter.  While n_estimators has a tradeoff between speed & score, max_depth has the possibility of improving both.  By limiting the depth of your trees, you can reduce overfitting.

Unfortunately, deciding on upper & lower bounds is less than
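One possible way to get a feel for the upper bound, offered as a sketch under assumptions and not necessarily what the notebook does: fit a forest with no depth limit and look at how deep its trees actually grow. The synthetic dataset from `make_classification` and the specific sizes below are placeholders, not the post's data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder data; swap in your own X and y.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# With max_depth=None the trees grow until their leaves are pure (or too small to split).
unrestricted = RandomForestClassifier(n_estimators=20, max_depth=None, random_state=0)
unrestricted.fit(X, y)

# The deepest tree gives a natural ceiling for any max_depth search.
depths = [tree.get_depth() for tree in unrestricted.estimators_]
upper_bound = max(depths)
print("deepest unrestricted tree:", upper_bound)

# For comparison, capping max_depth really does cap every tree.
shallow = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
shallow.fit(X, y)
shallow_depths = [tree.get_depth() for tree in shallow.estimators_]
print("deepest capped tree:", max(shallow_depths))
```

Searching above `upper_bound` can't change the model, since no tree would reach that depth anyway, so it is a reasonable place to stop.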

Code Snippet Corner | Matthew Alhonte
matt | Jan 15
Sep 10

Code Snippet Corner: Tuning Machine Learning Hyperparameters with Binary Search

Tune the n_estimators parameter for a Random Forests classifier in scikit-learn in Python

Ah, hyperparameter tuning.  Time & compute-intensive.  Frequently containing weird non-linearities in how changing a parameter changes the score and/or the time it takes to train the model.

RandomizedSearchCV goes noticeably faster than a full GridSearchCV but it still takes a while - which can be rough, because in my experience you do still need to be iterative with it
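To make the binary-search idea concrete, here is a minimal sketch. It is not the post's own implementation: it assumes the score is roughly non-decreasing in the parameter (plausible for n_estimators, less so for others) and looks for the smallest value whose score is within a tolerance of the score at the upper bound. `smallest_good_enough`, `toy_score`, and the tolerance are hypothetical names and values.

```python
def smallest_good_enough(score_fn, low, high, tolerance=0.01):
    """Binary-search the integers low..high for the smallest value whose score
    is within `tolerance` of score_fn(high).

    Assumes score_fn is (roughly) non-decreasing in its argument.
    """
    target = score_fn(high) - tolerance
    while low < high:
        mid = (low + high) // 2
        if score_fn(mid) >= target:
            high = mid      # mid is good enough; try something smaller
        else:
            low = mid + 1   # mid is too small; search the upper half
    return low


def toy_score(n):
    # Stand-in for a cross-validated score with diminishing returns.
    return 1 - 1 / n


best = smallest_good_enough(toy_score, low=1, high=512)
print(best)  # 84
```

In practice `score_fn` might wrap something like `cross_val_score(RandomForestClassifier(n_estimators=n), X, y).mean()`. Real scores are noisy, so the monotonicity assumption only holds approximately, and caching scores is worth it since each evaluation means training a model.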

Code Snippet Corner | Matthew Alhonte
matt | Jan 15
Sep 03

Importing Excel Datetimes Into Pandas II

Pandas and Excel Pt. 2

What if, like during my data import task a few months back, the dates & times are in separate columns?  This gives us a few new issues.  Let's import that Excel file!

import pandas as pd
import xlrd
import datetime

df = pd.read_excel("hasDatesAndTimes.xlsx", sheet_name="Sheet1")

book = xlrd.open_workbook("hasDatesAndTimes.xlsx")
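For the separate-columns case, one way the pieces can be stitched back together is sketched below. It assumes the date column comes through as date-like objects and the time column as `datetime.time` values, which depends on how the file gets parsed. The `rows` data and the function name are invented for illustration.

```python
import datetime


def combine_date_and_time(date_value, time_value):
    """Merge a date (or a midnight datetime) and a time-of-day into one datetime."""
    # datetime.combine ignores the time part of a datetime passed as the date.
    return datetime.datetime.combine(date_value, time_value)


# Hypothetical rows: (date cell, time cell).
rows = [
    (datetime.datetime(2018, 8, 12), datetime.time(21, 50, 16)),
    (datetime.date(2018, 8, 13), datetime.time(9, 5, 0)),
]

combined = [combine_date_and_time(d, t) for d, t in rows]
for value in combined:
    print(value)
```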
Pandas | Matthew Alhonte
matt | Jan 15
Aug 20

Importing Excel Datetimes Into Pandas

Pandas & Excel, Part 1

Different file formats are different!  For all kinds of reasons!

A few months back, I had to import some Excel files into a database. In this process I learned so much about the delightfully unique way Excel stores dates & times!  

The basic datetime will be a decimal number, like 43324.909907407404.  The number before the decimal is the day,
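As a quick illustration of that serial-number format, here is a small sketch using only the standard library. It assumes Excel's default 1900 date system, where 1899-12-30 works as day zero for any date from March 1900 onward, and it treats the fractional part as the fraction of a 24-hour day. The function name is made up.

```python
import datetime

# Day zero for Excel's 1900 date system (valid for dates from March 1900 on,
# because Excel counts a nonexistent 1900-02-29).
EXCEL_EPOCH = datetime.datetime(1899, 12, 30)


def excel_serial_to_datetime(serial):
    """Convert an Excel serial number (days.fraction_of_day) to a datetime."""
    days = int(serial)
    fraction = serial - days
    seconds = round(fraction * 86400)  # 86400 seconds in a day
    return EXCEL_EPOCH + datetime.timedelta(days=days, seconds=seconds)


print(excel_serial_to_datetime(43324.909907407404))  # 2018-08-12 21:50:16
```

Rounding to the nearest second hides the floating-point fuzz in the fraction. Workbooks saved with the 1904 date system would need a different epoch.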

Pandas | Matthew Alhonte
matt | Jan 15
Aug 13

Lazy Pandas and Dask

Picking Low-Hanging Fruit With Dask

Ah, laziness.  You love it, I love it, everyone agrees it's just better.

Flesh-and-blood pandas are famously lazy.  Pandas the package, however, uses Eager Evaluation.  What's Eager Evaluation, you ask?  Is Pandas really judgey, hanging out on the street corner and being fierce to the style choices of people walking by?  Well, yes, but that's not the most relevant sense in
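The eager-versus-lazy distinction can be seen in plain Python, with no Pandas or Dask involved. This is only an analogy: an eager list comprehension does all its work the moment it is written, while a generator does nothing until something asks for results. `slow_double` and the `calls` log are hypothetical stand-ins for an expensive operation.

```python
calls = []


def slow_double(x):
    """Pretend-expensive function that records every time it runs."""
    calls.append(x)
    return 2 * x


# Eager: every call happens right away, whether or not the result is used.
eager = [slow_double(x) for x in range(3)]
calls_after_eager = len(calls)

calls.clear()

# Lazy: building the generator runs nothing yet.
lazy = (slow_double(x) for x in range(3))
calls_after_lazy_setup = len(calls)

# Work only happens once a result is actually requested.
total = sum(lazy)
calls_after_lazy_run = len(calls)

print(calls_after_eager, calls_after_lazy_setup, calls_after_lazy_run, total)  # 3 0 3 6
```

Dask takes the lazy approach much further, building up a graph of pending work and running it only when asked, typically through `.compute()`.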

Pandas | Matthew Alhonte
matt | Jan 15
Aug 06

All That Is Solid Melts Into Graphs

Reshaping Pandas dataframes with a real-life example, and graphing it with Altair

Last few Code Snippet Corners were about using Pandas as an easy way to handle input and output between files & databases.  Let's shift gears a little bit!  Among other reasons, because earlier today I discovered a package that exclusively does that, which means I can stop importing the massive Pandas package when all I really wanted to do with

Python | Matthew Alhonte
matt | Jan 15
Jul 30

Getting Conda Envs (And Environment Variables!) To Play Nicely With Cron

Code Snippet Corner

This isn't really a tutorial on cron in general; people who are better at Linux have written far better ones than I could.  Here's one: http://mediatemple.net/blog/news/complete-beginners-guide-cron-part-1/  This is more of a code journaling exercise for a problem that I didn't find a neat-and-tidy answer to online when I was looking for it, and that I presume

Python | Matthew Alhonte
matt | Jan 15
Jul 09