Machine Learning

The latest developments in machine learning tools and technology available to data scientists.

Using Random Forests for Feature Selection with Categorical Features

Python helper functions for adding feature importances together and displaying them as a single variable.

Notebook here.  Helper functions here.

One of the best features of Random Forests is that it has built-in Feature Selection.  Explainability is one of the things we often lose when we move from traditional statistics to Machine Learning, but Random Forests lets us actually get some insight into our dataset instead of just having to treat our model as a black box.

One problem, though - it doesn't work that well for categorical features.  Since you'll generally have to One-Hot Encode a categorical feature (for instance, turn something with 7 categories into 7 True/False variables), you'll end up with that feature's importance scattered across 7 separate scores instead of reported as a single number.
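
A minimal sketch of the idea, assuming a toy DataFrame and get_dummies' default column-naming convention (none of this is the post's actual helper code):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: "color" is categorical, "size" is numeric.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=200),
    "size": rng.random(200),
    "target": rng.integers(0, 2, size=200),
})

# One-Hot Encode: "color" becomes color_red, color_green, color_blue.
X = pd.get_dummies(df[["color", "size"]])
y = df["target"]

rf = RandomForestClassifier(random_state=42).fit(X, y)

# Sum each dummy column's importance back into its original feature
# by stripping the suffix get_dummies appended after the underscore.
importances = pd.Series(rf.feature_importances_, index=X.columns)
combined = importances.groupby(lambda col: col.split("_")[0]).sum()
print(combined)  # one importance score each for "color" and "size"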


Tuning Random Forests Hyperparameters with Binary Search Part III: min_samples_leaf

Tune the min_samples_leaf parameter for a Random Forests classifier in scikit-learn in Python.

Part 1 (n_estimators) here
Part 2 (max_depth) here
Notebook here


Another parameter, another set of quirks!

min_samples_leaf is sort of similar to max_depth.  It helps us avoid overfitting.  It's also non-obvious what you should use as your upper and lower limits to search between.  Let's do what we did last week - build a forest with no parameters, see what it does, and use that to set the upper and lower limits!

import pandas as pd

from sklearn.metrics import precision_recall_curve
from sklearn.metrics import auc
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer  # the excerpt cuts off mid-import; breast cancer is assumed here
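
The excerpt ends here, but following the stated plan, one way to get bounds from an unconstrained forest might look like this (the leaf-size heuristic and dataset are my assumptions, not necessarily the post's):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(random_state=42).fit(X, y)

# Gather the sample count at every leaf of every tree.  Leaves are the
# nodes whose children_left is -1 in scikit-learn's tree representation.
leaf_sizes = np.concatenate([
    est.tree_.n_node_samples[est.tree_.children_left == -1]
    for est in rf.estimators_
])
print(leaf_sizes.min(), leaf_sizes.max())

# A plausible search range: binary-search min_samples_leaf between 1
# (the default, prone to overfitting) and the largest leaf we observed.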

Tuning Random Forests Hyperparameters with Binary Search Part II: max_depth

Tune the max_depth parameter for a Random Forests classifier in scikit-learn in Python.

Continued from here.

Notebook for this post is here.

Binary search code itself is here.


max_depth is an interesting parameter.  While n_estimators has a tradeoff between speed & score, max_depth has the possibility of improving both.  By limiting the depth of your trees, you can reduce overfitting - and shallower trees are faster to train, too.

Unfortunately, deciding on upper & lower bounds is less than straightforward.  It'll depend on your dataset.  Luckily, I found a post on Stack Overflow that linked to a blog post with a promising methodology.

First, we build a tree with default arguments and fit it to our data.
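
A minimal sketch of that step, assuming the breast cancer dataset stands in for the post's actual data:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)  # dataset choice is an assumption
rf = RandomForestClassifier(random_state=42).fit(X, y)

# With no max_depth set, each tree grows until its leaves are pure.
# The deepest tree gives a natural upper bound for the search.
depths = [est.get_depth() for est in rf.estimators_]
print(min(depths), max(depths))  # e.g. binary-search max_depth in [1, max(depths)]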


Tuning Machine Learning Hyperparameters with Binary Search

Tune the n_estimators parameter for a Random Forests classifier in scikit-learn in Python.

Ah, hyperparameter tuning.  Time & compute-intensive.  Frequently containing weird non-linearities in how changing a parameter changes the score and/or the time it takes to train the model.

RandomizedSearchCV goes noticeably faster than a full GridSearchCV but it still takes a while - which can be rough, because in my experience you do still need to be iterative with it and experiment with different distributions.  Plus, then you've got hyper-hyperparameters to tune - how many iterations SHOULD you run it for, anyway?

I've been experimenting with using the trusty old Binary Search to tune hyperparameters.  I'm finding it has two big advantages: it zeroes in on a good value in relatively few fits, and there are no hyper-hyperparameters to babysit.
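
A minimal sketch of the idea, assuming score improves roughly monotonically with n_estimators and then plateaus (the bounds, tolerance, and dataset here are all my assumptions, not the post's code):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def forest_score(n_estimators):
    # Mean cross-validated accuracy for a forest of the given size.
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    return cross_val_score(clf, X, y, cv=3).mean()

def binary_search_n_estimators(lo=10, hi=500, tol=0.001):
    # Find the smallest forest whose score is within `tol` of a big one,
    # halving the search interval instead of trying every value.
    target = forest_score(hi)
    while hi - lo > 10:
        mid = (lo + hi) // 2
        if target - forest_score(mid) <= tol:
            hi = mid  # mid is good enough - look for something smaller
        else:
            lo = mid  # mid underperforms - we need more trees
    return hi

print(binary_search_n_estimators())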