max_depth is an interesting parameter. While n_estimators has a tradeoff between speed & score, max_depth has the potential to improve both: limiting the depth of your trees reduces overfitting, and shallower trees are also faster to build and query.
Unfortunately, deciding on upper & lower bounds is less than straightforward; it'll depend on your dataset. Luckily, I found a StackOverflow post linking to a blog post with a promising methodology.
First, we build a Random Forest with default arguments and fit it to our data.
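Something like this minimal sketch, assuming X and y hold the features and labels prepared earlier:

```python
from sklearn.ensemble import RandomForestClassifier

# Assumption: X and y are the feature matrix and labels from earlier.
forest = RandomForestClassifier()
forest.fit(X, y)
```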
Now, let's see how deep the trees get when we don't impose any sort of max_depth. We'll use the code from that wonderful blog post to crawl our Random Forest, and get the height of every tree.
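That snippet isn't reproduced here, but as a sketch of the same idea: any recent scikit-learn can do the crawl in one line, since every fitted tree in estimators_ exposes get_depth().

```python
# Each tree in the fitted forest is a DecisionTreeClassifier;
# get_depth() returns its height (scikit-learn >= 0.21).
depths = [tree.get_depth() for tree in forest.estimators_]
print(sorted(depths))
print(f"Deepest tree: {max(depths)}")
```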
The deepest tree comes in at a depth of 9, so we'll be searching between 2 and 9!
Let's bring back our old helper function to easily return scores.
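The helper itself isn't shown in this excerpt, so here's a minimal sketch of what it might look like; the name get_score, the use of cross-validation, and timing via perf_counter are all assumptions:

```python
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def get_score(X, y, **kwargs):
    """Fit a forest with the given kwargs; return (mean CV score, seconds elapsed)."""
    start = time.perf_counter()
    model = RandomForestClassifier(**kwargs)
    score = cross_val_score(model, X, y).mean()
    return score, time.perf_counter() - start
```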
Now let's see it in action:
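A sketch of the loop, testing the depths that appear in the table below (2, 4, 6, and 9) and assuming the hypothetical get_score helper above:

```python
import pandas as pd

results = [(depth, *get_score(X, y, max_depth=depth)) for depth in (2, 4, 6, 9)]
df = pd.DataFrame(results, columns=["max_depth", "score", "time"])
print(df.sort_values("score", ascending=False))
```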
Here's what we've got:
| max_depth | score    | time (s) |
|-----------|----------|----------|
| 2         | 0.987707 | 0.145360 |
| 9         | 0.987029 | 0.147563 |
| 6         | 0.986247 | 0.140514 |
| 4         | 0.968316 | 0.140164 |
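The scoreTimeRatio column in the next table isn't derived on the page; as a purely hypothetical reconstruction (it won't necessarily reproduce the exact numbers below), one way to build such a ratio is to rescale score-per-unit-of-time into [0, 1]:

```python
from sklearn.preprocessing import minmax_scale

# Hypothetical: score per unit of fitting time, rescaled so the
# best tradeoff reads as 1.0 and the worst as 0.0.
df["scoreTimeRatio"] = minmax_scale(df["score"] / df["time"])
```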
| max_depth | score    | time     | scoreTimeRatio |
|-----------|----------|----------|----------------|
| 2         | 1.051571 | 0.837377 | 0.175986       |
| 9         | 1.016649 | 1.135158 | 0.103478       |
| 6         | 0.976311 | 0.182516 | 1.000000       |
| 4         | 0.051571 | 0.135158 | 0.000000       |
So, for our purposes, 9 will function as our baseline, since that was the deepest the forest grew with default arguments.
Looks like a max_depth of 2 has a slightly higher score than 9, and is slightly faster! Interestingly, it's slightly slower than 4 or 6. Not sure why that is.