• Part 1 (n_estimators) here
  • Part 2 (max_depth) here
  • Notebook here

Another parameter, another set of quirks!

min_samples_leaf is similar in spirit to max_depth: it helps us avoid overfitting, and it's likewise non-obvious what upper and lower limits you should search between.  Let's do what we did last week - build a forest with no parameters, see what it does, and use that to set the upper and lower limits!

import numpy as np
import pandas as pd

from sklearn.metrics import precision_recall_curve
from sklearn.metrics import auc
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.tree import _tree


data = load_breast_cancer()
X, y = data.data, data.target

rfArgs = {"random_state": 0,
          "n_jobs": -1,
          "class_weight": "balanced",
          "n_estimators": 18,
          "oob_score": True}

clf = RandomForestClassifier(**rfArgs)
clf.fit(X, y)
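
Since we turned oob_score on, we can also peek at the forest's out-of-bag accuracy as a quick baseline sanity check (a small aside of mine, not part of the original walkthrough):

# With oob_score=True, the fitted forest exposes its out-of-bag accuracy.
print(clf.oob_score_)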

Let's use the handy function from here to crawl the number of samples in a tree's leaf nodes:

def leaf_samples(tree, node_id=0):
    # Recursively collect the sample count of every leaf under node_id.
    left_child = tree.children_left[node_id]
    right_child = tree.children_right[node_id]

    if left_child == _tree.TREE_LEAF:
        # node_id is a leaf - record how many samples landed in it
        samples = np.array([tree.n_node_samples[node_id]])
    else:
        # internal node - gather the leaves of both subtrees
        left_samples = leaf_samples(tree, left_child)
        right_samples = leaf_samples(tree, right_child)
        samples = np.append(left_samples, right_samples)

    return samples
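
As a quick spot check (again, my aside), we can point it at the first tree of the forest we just fit:

# Leaf sizes for a single tree - the forest-wide min/max comes next.
samples = leaf_samples(clf.estimators_[0].tree_)
print(samples.min(), samples.max())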

Last week we made a function to grab them for a whole forest - since this is the second time we're doing this, and we may do it again, let's make a modular little function that takes a crawler function as an argument!

def getForestParams(X, y, param, kwargs):
    # Fit a forest, run the supplied crawler (param) over every tree,
    # and return the smallest and largest values it finds.
    clf = RandomForestClassifier(**kwargs)
    clf.fit(X, y)
    params = np.hstack([param(estimator.tree_)
                        for estimator in clf.estimators_])
    return {"min": params.min(),
            "max": params.max()}

Let's see it in action!

getForestParams(X, y, leaf_samples, rfArgs)
#> {'max': 199, 'min': 1}
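
And because the crawler is just an argument, we could swap in something like last week's depth crawler with no other changes.  A hypothetical sketch (leaf_depths is my own stand-in, not code from a previous part):

def leaf_depths(tree, node_id=0, depth=0):
    # Depth of every leaf under node_id, instead of its sample count.
    if tree.children_left[node_id] == _tree.TREE_LEAF:
        return np.array([depth])
    return np.append(leaf_depths(tree, tree.children_left[node_id], depth + 1),
                     leaf_depths(tree, tree.children_right[node_id], depth + 1))

getForestParams(X, y, leaf_depths, rfArgs)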

Almost ready to start optimizing!  Since part of what we get out of tuning min_samples_leaf is regularization (and because it's just good practice!), let's build a metric with some cross-validation.  Luckily, scikit-learn has a built-in cross_val_score function.  We'll just need a teensy bit of tweaking to make it score with the area under a precision-recall curve.

from sklearn.model_selection import cross_val_score

def auc_prc(estimator, X, y):
    # Custom scorer: fit, then score on the out-of-bag predictions
    # (this is why rfArgs sets oob_score=True).
    estimator.fit(X, y)
    y_pred = estimator.oob_decision_function_[:, 1]
    precision, recall, _ = precision_recall_curve(y, y_pred)
    return auc(recall, precision)

def getForestAccuracyCV(X, y, kwargs):
    # Average the AUC-PRC scorer over 5 cross-validation folds.
    clf = RandomForestClassifier(**kwargs)
    return np.mean(cross_val_score(clf, X, y, scoring=auc_prc, cv=5))
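
Before feeding this into the search, one standalone call makes a decent smoke test (my aside, not from the original post):

# Smoke test: should print a single AUC-PRC value for the baseline forest.
print(getForestAccuracyCV(X, y, rfArgs))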

Awesome, now we have a metric that can be fed into our binary search.

min_samples_leaf = bgs.compareValsBaseCase(X,
                                           y,
                                           getForestAccuracyCV,
                                           rfArgs,
                                           "min_samples_leaf",
                                           0,
                                           1,
                                           199)
bgs.showTimeScoreChartAndGraph(min_samples_leaf)
min_samples_leaf     score      time
               1  0.981662  1.402102
             199  0.506455  1.416349
             100  0.506455  1.401090
              51  0.506455  1.394548
              26  0.975894  1.396503
              14  0.982954  1.398522
               7  0.979888  1.398929
              10  0.984789  1.404815
              12  0.986302  1.391171

min_samples_leaf     score      time  scoreTimeRatio
               1  0.992414  0.473848        0.082938
             199  0.002084  1.039718        0.000000
             100  0.002084  0.433676        0.000111
              51  0.002084  0.173824        0.000396
              26  0.980393  0.251484        0.154448
              14  0.995105  0.331692        0.118839
               7  0.988716  0.347858        0.112585
              10  0.998930  0.581632        0.067998
              12  1.002084  0.039718        1.000000

Looks like the action's between 1 and 51.  More than that, and the score goes down while the runtime stays just as long - the opposite of what we want!

min_samples_leaf = bgs.compareValsBaseCase(X,
                                           y,
                                           getForestAccuracyCV,
                                           rfArgs,
                                           "min_samples_leaf",
                                           0,
                                           1,
                                           51)
bgs.showTimeScoreChartAndGraph(min_samples_leaf)
min_samples_leaf     score      time
               1  0.981662  1.389387
              51  0.506455  1.403807
              26  0.975894  1.404517
              14  0.982954  1.385420
               7  0.979888  1.398840
              10  0.984789  1.393863
              12  0.986302  1.411774

min_samples_leaf     score      time  scoreTimeRatio
               1  0.992414  0.188492        0.200671
              51  0.002084  0.735618        0.000000
              26  0.980393  0.762561        0.048920
              14  0.995105  0.037944        1.000000
               7  0.988716  0.547179        0.068798
              10  0.998930  0.358303        0.106209
              12  1.002084  1.037944        0.036709

Big drop-off after 26, it seems!

min_samples_leaf = bgs.compareValsBaseCase(X,
                                           y,
                                           getForestAccuracyCV,
                                           rfArgs,
                                           "min_samples_leaf",
                                           0,
                                           1,
                                           26)
bgs.showTimeScoreChartAndGraph(min_samples_leaf)
min_samples_leaf     score      time
               1  0.981662  1.407957
              26  0.975894  1.398042
              14  0.982954  1.396782
               7  0.979888  1.396096
              10  0.984789  1.402322
              12  0.986302  1.401080

min_samples_leaf     score      time  scoreTimeRatio
               1  0.650270  1.084306        0.040144
              26  0.096077  0.248406        0.000000
              14  0.774346  0.142157        0.954016
               7  0.479788  0.084306        1.000000
              10  0.950677  0.609184        0.221294
              12  1.096077  0.504512        0.336668

One more with 14 as our upper limit!

min_samples_leaf = bgs.compareValsBaseCase(X,
                                           y,
                                           getForestAccuracyCV,
                                           rfArgs,
                                           "min_samples_leaf",
                                           0,
                                           1,
                                           14)
bgs.showTimeScoreChartAndGraph(min_samples_leaf)
min_samples_leaf     score      time
               1  0.981662  1.401341
              14  0.982954  1.400361
               7  0.979888  1.402408
               4  0.981121  1.401396
               3  0.983580  1.401332

3 it is!

I suppose when the range gets this small we could just finish with a regular grid search, but...maybe next week!  Or maybe another variable!  Or maybe benchmarks vs. GridSearchCV and/or RandomizedSearchCV.  Who knows what the future holds?