# Tuning Random Forests Hyperparameters with Binary Search Part II: max_depth

#### Code snippet corner is back! Tune the max_depth parameter in for a Random Forests classifier in scikit-learn in Python

Continued from here

Notebook for this post is here

Binary search code itself is here

`max_depth` is an interesting parameter.  While `n_estimators` has a tradeoff between speed & score, `max_depth` has the possibility of improving both.  By limiting the depth of your trees, you can reduce overfitting.

Unfortunately, deciding on upper & lower bounds is less than straightforward.  It'll depend on your dataset.  Luckily, I found a post on StackOverflow that had a link to a blog post that had a promising methodology.

First, we build a tree with default arguments and fit it to our data.

``````import pandas as pd

from sklearn.metrics import precision_recall_curve
from sklearn.metrics import auc
from sklearn.ensemble import RandomForestClassifier

X, y = data.data, data.target

rfArgs = {"random_state": 0,
"n_jobs": -1,
"class_weight": "balanced",
"n_estimators": 18,
"oob_score": True}

clf = RandomForestClassifier(**rfArgs)
clf.fit(X, y)``````

Now, let's see how deep the trees get when we don't impose any sort of `max_depth`. We'll use the code from that wonderful blog post to crawl our Random Forest, and get the height of every tree.

``````#From here: https://aysent.github.io/2015/11/08/random-forest-leaf-visualization.html
def leaf_depths(tree, node_id = 0):

'''
tree.children_left and tree.children_right store ids
of left and right chidren of a given node
'''
left_child = tree.children_left[node_id]
right_child = tree.children_right[node_id]

'''
If a given node is terminal,
both left and right children are set to _tree.TREE_LEAF
'''
if left_child == _tree.TREE_LEAF:

'''
Set depth of terminal nodes to 0
'''
depths = np.array([0])
else:
'''
Get depths of left and right children and
increment them by 1
'''
left_depths = leaf_depths(tree, left_child) + 1
right_depths = leaf_depths(tree, right_child) + 1

depths = np.append(left_depths, right_depths)

return depths

allDepths = [leaf_depths(estimator.tree_)
for estimator in clf.estimators_]

np.hstack(allDepths).min()
#> 2
np.hstack(allDepths).max()
#> 9``````

We'll be searching between 2 and 9!

Let's bring back our old make a helper function to easily return scores.

``````def getForestAccuracy(X, y, kwargs):
clf = RandomForestClassifier(**kwargs)
clf.fit(X, y)
y_pred = clf.oob_decision_function_[:, 1]
precision, recall, _ = precision_recall_curve(y, y_pred)
return auc(recall, precision)``````
``````max_depth = bgs.compareValsBaseCase(X,
y,
getForestAccuracy,
rfArgs,
"max_depth",
0,
2,
9)
bgs.showTimeScoreChartAndGraph(max_depth, html=True)``````
max_depth score time
2 0.987707 0.145360
9 0.987029 0.147563
6 0.986247 0.140514
4 0.968316 0.140164

max_depth score time scoreTimeRatio
2 1.051571 0.837377 0.175986
9 1.016649 1.135158 0.103478
6 0.976311 0.182516 1.000000
4 0.051571 0.135158 0.000000

So, for our purposes, 9 will function as our baseline since that was the biggest depth that it built with default arguments.

Looks like a `max_depth` of 2 has a slightly higher score than 9, and is slightly faster!  Interestingly, it's slightly slower than  4 or 6.  Not sure why that is.

Center of the Universe Website
Super villain in somebody's action hero movie. Experienced a radioactive freak accident at a young age, which rendered him part-snake and strangely adept at Python.
Center of the Universe

Super villain in somebody's action hero movie. Experienced a radioactive freak accident at a young age, which rendered him part-snake and strangely adept at Python.