Tuning Random  Forests Hyperparameters with Binary Search Part II: max_depth

Tuning Random Forests Hyperparameters with Binary Search Part II: max_depth

Tune the max_depth parameter in for a Random Forests classifier in scikit-learn in Python

    Continued from here.

    Notebook for this post is here.

    Binary search code itself is here.


    max_depth is an interesting parameter.  While n_estimators has a tradeoff between speed & score, max_depth has the possibility of improving both.  By limiting the depth of your trees, you can reduce overfitting.

    Unfortunately, deciding on upper & lower bounds is less than straightforward.  It'll depend on your dataset.  Luckily, I found a post on StackOverflow that had a link to a blog post that had a promising methodology.  

    First, we build a tree with default arguments and fit it to our data.

    import pandas as pd
    
    from sklearn.metrics import precision_recall_curve
    from sklearn.metrics import auc
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import load_breast_cancer
    
    
    data = load_breast_cancer()
    X, y = data.data, data.target
    
    rfArgs = {"random_state": 0,
              "n_jobs": -1,
              "class_weight": "balanced",
             "n_estimators": 18,
             "oob_score": True}
    
    clf = RandomForestClassifier(**rfArgs)
    clf.fit(X, y)
    

    Now, let's see how deep the trees get when we don't impose any sort of max_depth. We'll use the code from that wonderful blog post to crawl our Random Forest, and get the height of every tree.

    #From here: https://aysent.github.io/2015/11/08/random-forest-leaf-visualization.html
    def leaf_depths(tree, node_id = 0):
        
        '''
        tree.children_left and tree.children_right store ids
        of left and right chidren of a given node
        '''
        left_child = tree.children_left[node_id]
        right_child = tree.children_right[node_id]
    
        '''
        If a given node is terminal, 
        both left and right children are set to _tree.TREE_LEAF
        '''
        if left_child == _tree.TREE_LEAF:
            
            '''
            Set depth of terminal nodes to 0
            '''
            depths = np.array([0])
        else:
            '''
            Get depths of left and right children and
            increment them by 1
            '''
            left_depths = leaf_depths(tree, left_child) + 1
            right_depths = leaf_depths(tree, right_child) + 1
     
            depths = np.append(left_depths, right_depths)
     
        return depths
    
    allDepths = [leaf_depths(estimator.tree_) 
                 for estimator in clf.estimators_]
    
    np.hstack(allDepths).min()
    #> 2
    np.hstack(allDepths).max()
    #> 9
    

    We'll be searching between 2 and 9!  

    Let's bring back our old make a helper function to easily return scores.

    def getForestAccuracy(X, y, kwargs):
        clf = RandomForestClassifier(**kwargs)
        clf.fit(X, y)
        y_pred = clf.oob_decision_function_[:, 1]
        precision, recall, _ = precision_recall_curve(y, y_pred)
        return auc(recall, precision)
    
    max_depth = bgs.compareValsBaseCase(X, 
                        y, 
                        getForestAccuracy,        
                        rfArgs, 
                        "max_depth", 
                        0, 
                        2, 
                        9)
    bgs.showTimeScoreChartAndGraph(max_depth, html=True)
    
    max_depth score time
    2 0.987707 0.145360
    9 0.987029 0.147563
    6 0.986247 0.140514
    4 0.968316 0.140164

    max_depth score time scoreTimeRatio
    2 1.051571 0.837377 0.175986
    9 1.016649 1.135158 0.103478
    6 0.976311 0.182516 1.000000
    4 0.051571 0.135158 0.000000

    So, for our purposes, 9 will function as our baseline since that was the biggest depth that it built with default arguments.  

    Looks like a max_depth of 2 has a slightly higher score than 9, and is slightly faster!  Interestingly, it's slightly slower than  4 or 6.  Not sure why that is.

    Matthew Alhonte's' avatar
    Center of the Universe
    Super villain in somebody's action hero movie. Experienced a radioactive freak accident at a young age, which rendered him part-snake and strangely adept at Python.
    Matthew Alhonte's' avatar
    Center of the Universe @MattAlhonte

    Super villain in somebody's action hero movie. Experienced a radioactive freak accident at a young age, which rendered him part-snake and strangely adept at Python.