• Part 1 (n_estimators) here
  • Part 2 (max_depth) here
  • Notebook here

Another parameter, another set of quirks!

min_samples_leaf is similar in spirit to max_depth: it helps us avoid overfitting, and it's likewise non-obvious what upper and lower limits you should search between.  Let's do what we did last week - build a forest with no parameters, see what it does, and use that to set the upper and lower limits!

import numpy as np
import pandas as pd

from sklearn.metrics import precision_recall_curve
from sklearn.metrics import auc
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.tree import _tree


data = load_breast_cancer()
X, y = data.data, data.target

rfArgs = {"random_state": 0,
          "n_jobs": -1,
          "class_weight": "balanced",
          "n_estimators": 18,
          "oob_score": True}

clf = RandomForestClassifier(**rfArgs)
clf.fit(X, y)
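
Since we turned oob_score on, we can also peek at the forest's out-of-bag accuracy as a quick baseline sanity check (a small aside of mine, not part of the original walkthrough):

# With oob_score=True, the fitted forest exposes its out-of-bag accuracy.
print(clf.oob_score_)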

Let's use the handy function from here to crawl the number of samples in a tree's leaf nodes:

def leaf_samples(tree, node_id=0):
    # Recursively collect the sample count of every leaf under node_id.
    left_child = tree.children_left[node_id]
    right_child = tree.children_right[node_id]

    if left_child == _tree.TREE_LEAF:
        # node_id is a leaf - record how many samples landed in it
        samples = np.array([tree.n_node_samples[node_id]])
    else:
        # internal node - gather the leaves of both subtrees
        left_samples = leaf_samples(tree, left_child)
        right_samples = leaf_samples(tree, right_child)
        samples = np.append(left_samples, right_samples)

    return samples
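
As a quick spot check (again, my aside), we can point it at the first tree of the forest we just fit:

# Leaf sizes for a single tree - the forest-wide min/max comes next.
samples = leaf_samples(clf.estimators_[0].tree_)
print(samples.min(), samples.max())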

Last week we made a function to grab them for a whole forest - since this is the second time we're doing this, and we may do it again, let's make a modular little function that takes a crawler function as an argument!

def getForestParams(X, y, param, kwargs):
    # Fit a forest, run the supplied crawler (param) over every tree,
    # and return the smallest and largest values it finds.
    clf = RandomForestClassifier(**kwargs)
    clf.fit(X, y)
    params = np.hstack([param(estimator.tree_)
                        for estimator in clf.estimators_])
    return {"min": params.min(),
            "max": params.max()}

Let's see it in action!

getForestParams(X, y, leaf_samples, rfArgs)
#> {'max': 199, 'min': 1}
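
And because the crawler is just an argument, we could swap in something like last week's depth crawler with no other changes.  A hypothetical sketch (leaf_depths is my own stand-in, not code from a previous part):

def leaf_depths(tree, node_id=0, depth=0):
    # Depth of every leaf under node_id, instead of its sample count.
    if tree.children_left[node_id] == _tree.TREE_LEAF:
        return np.array([depth])
    return np.append(leaf_depths(tree, tree.children_left[node_id], depth + 1),
                     leaf_depths(tree, tree.children_right[node_id], depth + 1))

getForestParams(X, y, leaf_depths, rfArgs)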

Almost ready to start optimizing!  Since part of what we get out of tuning min_samples_leaf is regularization (and because it's just good practice!), let's build a metric with some cross-validation.  Luckily, scikit-learn has a built-in cross_val_score function.  We'll just need a teensy bit of tweaking to make it score with the area under a precision-recall curve.

from sklearn.model_selection import cross_val_score

def auc_prc(estimator, X, y):
    # Custom scorer: fit, then score on the out-of-bag predictions
    # (this is why rfArgs sets oob_score=True).
    estimator.fit(X, y)
    y_pred = estimator.oob_decision_function_[:, 1]
    precision, recall, _ = precision_recall_curve(y, y_pred)
    return auc(recall, precision)

def getForestAccuracyCV(X, y, kwargs):
    # Average the AUC-PRC scorer over 5 cross-validation folds.
    clf = RandomForestClassifier(**kwargs)
    return np.mean(cross_val_score(clf, X, y, scoring=auc_prc, cv=5))
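
Before feeding this into the search, one standalone call makes a decent smoke test (my aside, not from the original post):

# Smoke test: should print a single AUC-PRC value for the baseline forest.
print(getForestAccuracyCV(X, y, rfArgs))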

Awesome, now we have a metric that can be fed into our binary search.

min_samples_leaf = bgs.compareValsBaseCase(X,
                                           y,
                                           getForestAccuracyCV,
                                           rfArgs,
                                           "min_samples_leaf",
                                           0,
                                           1,
                                           199)
bgs.showTimeScoreChartAndGraph(min_samples_leaf)
min_samples_leaf     score      time
               1  0.981662  1.402102
             199  0.506455  1.416349
             100  0.506455  1.401090
              51  0.506455  1.394548
              26  0.975894  1.396503
              14  0.982954  1.398522
               7  0.979888  1.398929
              10  0.984789  1.404815
              12  0.986302  1.391171

min_samples_leaf     score      time  scoreTimeRatio
               1  0.992414  0.473848        0.082938
             199  0.002084  1.039718        0.000000
             100  0.002084  0.433676        0.000111
              51  0.002084  0.173824        0.000396
              26  0.980393  0.251484        0.154448
              14  0.995105  0.331692        0.118839
               7  0.988716  0.347858        0.112585
              10  0.998930  0.581632        0.067998
              12  1.002084  0.039718        1.000000

Looks like the action's between 1 and 51.  More than that, and the score goes down while the runtime stays just as long - the opposite of what we want!

min_samples_leaf = bgs.compareValsBaseCase(X,
                                           y,
                                           getForestAccuracyCV,
                                           rfArgs,
                                           "min_samples_leaf",
                                           0,
                                           1,
                                           51)
bgs.showTimeScoreChartAndGraph(min_samples_leaf)
min_samples_leaf     score      time
               1  0.981662  1.389387
              51  0.506455  1.403807
              26  0.975894  1.404517
              14  0.982954  1.385420
               7  0.979888  1.398840
              10  0.984789  1.393863
              12  0.986302  1.411774

min_samples_leaf     score      time  scoreTimeRatio
               1  0.992414  0.188492        0.200671
              51  0.002084  0.735618        0.000000
              26  0.980393  0.762561        0.048920
              14  0.995105  0.037944        1.000000
               7  0.988716  0.547179        0.068798
              10  0.998930  0.358303        0.106209
              12  1.002084  1.037944        0.036709

Big drop-off after 26, it seems!

min_samples_leaf = bgs.compareValsBaseCase(X,
                                           y,
                                           getForestAccuracyCV,
                                           rfArgs,
                                           "min_samples_leaf",
                                           0,
                                           1,
                                           26)
bgs.showTimeScoreChartAndGraph(min_samples_leaf)
min_samples_leaf     score      time
               1  0.981662  1.407957
              26  0.975894  1.398042
              14  0.982954  1.396782
               7  0.979888  1.396096
              10  0.984789  1.402322
              12  0.986302  1.401080

min_samples_leaf     score      time  scoreTimeRatio
               1  0.650270  1.084306        0.040144
              26  0.096077  0.248406        0.000000
              14  0.774346  0.142157        0.954016
               7  0.479788  0.084306        1.000000
              10  0.950677  0.609184        0.221294
              12  1.096077  0.504512        0.336668

One more with 14 as our upper limit!

min_samples_leaf = bgs.compareValsBaseCase(X,
                                           y,
                                           getForestAccuracyCV,
                                           rfArgs,
                                           "min_samples_leaf",
                                           0,
                                           1,
                                           14)
bgs.showTimeScoreChartAndGraph(min_samples_leaf)
min_samples_leaf     score      time
               1  0.981662  1.401341
              14  0.982954  1.400361
               7  0.979888  1.402408
               4  0.981121  1.401396
               3  0.983580  1.401332

3 it is!

I suppose when the range gets this small we could just finish with a regular grid search, but...maybe next week!  Or maybe another variable!  Or maybe benchmarks vs. GridSearchCV and/or RandomizedSearchCV.  Who knows what the future holds?