Tuning Random Forests Hyperparameters: max_depth
max_depth is an interesting parameter. While n_estimators has a tradeoff between speed & score, max_depth has the possibility of improving both. By limiting the depth of your trees, you can reduce overfitting.
Unfortunately, deciding on upper & lower bounds for the search is less than straightforward; it depends on your dataset. Luckily, I found a StackOverflow post that linked to a blog post with a promising methodology.
First, we build a tree with default arguments and fit it to our data.
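The original dataset isn't reproduced here, so as a minimal sketch (a synthetic classification dataset stands in for the real data, and random_state=42 is my choice), fitting a forest with default arguments looks like:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the original dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Default arguments: no max_depth, so each tree grows until its leaves are pure
rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)
```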
Now, let's see how deep the trees get when we don't impose any max_depth. We'll use the code from that wonderful blog post to crawl our Random Forest and get the height of every tree.
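The crawling code from that post isn't reproduced here, but as a sketch of the same idea: each fitted tree in scikit-learn exposes its depth directly via its tree_.max_depth attribute (again using a synthetic dataset as a stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the original dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
rf = RandomForestClassifier(random_state=42).fit(X, y)

# Height of every tree in the forest
depths = [est.tree_.max_depth for est in rf.estimators_]
print(min(depths), max(depths))
```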
Here's the output:
We'll be searching between 2 and 9!
Let's bring back our old helper function for easily returning scores.
Now let's see it in action:
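The helper itself isn't shown here; a hypothetical version (the name get_score, the synthetic data, the 5-fold cross-validation, and the timing approach are all my assumptions) might look like this, along with a loop over the depths discussed:

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the original dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

def get_score(max_depth):
    """Return mean 5-fold CV accuracy and elapsed time for a given max_depth."""
    rf = RandomForestClassifier(max_depth=max_depth, random_state=42)
    start = time.perf_counter()
    score = cross_val_score(rf, X, y, cv=5).mean()
    elapsed = time.perf_counter() - start
    return score, elapsed

# Compare a few depths within the 2-9 search range
for depth in [2, 4, 6, 9]:
    score, elapsed = get_score(depth)
    print(f"max_depth={depth}: score={score:.3f}, time={elapsed:.2f}s")
```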
Here's what we've got:
So, for our purposes, 9 will serve as our baseline, since that was the deepest tree built with default arguments.
Looks like a max_depth of 2 has a slightly higher score than 9, and is slightly faster! Interestingly, it's slightly slower than 4 or 6. Not sure why that is.