Using Random Forests for Feature Selection with Categorical Features

Python helper functions for adding up the feature importances of a one-hot encoded categorical feature and displaying them as a single variable.

Notebook here.  Helper functions here.

One of the best things about Random Forests is that they come with built-in feature importances, which you can use for Feature Selection.  Explicability is one of the things we often lose when we go from traditional statistics to Machine Learning, but Random Forests let us actually get some insight into our dataset instead of just having to treat our model as a black box.

One problem, though - it doesn't work that well for categorical features.  Since you'll generally have to One-Hot Encode a categorical feature (for instance, turn something with 7 categories into 7 "True/False" variables), you'll wind up with a bunch of small features.  This gets tough to read, especially if you're dealing with a lot of categories.  It also makes that feature look less important than it is - rather than one feature appearing near the top, you'll maybe have 17 weak-seeming features near the bottom - which gets worse if you're filtering so that you only see features above a certain importance threshold.
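To make the column blow-up concrete, here's a minimal sketch using pandas' `get_dummies` rather than the helper functions from this post. The `weekday` column and its values are made up for illustration.

```python
import pandas as pd

# Toy column with 7 distinct categories (made-up values).
toy = pd.DataFrame({"weekday": ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]})

# prefix_sep="__" keeps the original variable name recoverable
# from each new column name, e.g. "weekday__mon".
encoded = pd.get_dummies(toy, columns=["weekday"], prefix_sep="__")

print(encoded.shape)           # one new column per category
print(list(encoded.columns))
```

Each of those 7 columns gets its own importance score, so the importance of `weekday` ends up split 7 ways.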

So, here are some helper functions for adding up their importances and displaying them as a single variable.  I did have to "reinvent the wheel" a bit and roll my own One-Hot function, rather than using scikit-learn's built-in one.
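The general idea can be sketched roughly like this. This is not the code from the helper module, just a small illustration under a few assumptions: the toy data, the target, and the `"__"` separator are all made up, and it relies on variable names not containing the separator themselves.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

SEP = "__"  # hypothetical separator between variable name and category

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue", "grey"], size=200),
    "size": rng.choice(["s", "m", "l"], size=200),
})
# Made-up target that depends only on color.
target = toy["color"].isin(["red", "blue"]).astype(int).to_numpy()

# One-hot encode, keeping the variable name as a prefix of each column.
encoded = pd.get_dummies(toy, columns=["color", "size"], prefix_sep=SEP)

model = RandomForestClassifier(n_estimators=18, random_state=0)
model.fit(encoded, target)

# One importance per dummy column...
per_column = pd.Series(model.feature_importances_, index=encoded.columns)

# ...summed back up per original variable by stripping the category suffix.
per_variable = (per_column
                .groupby(lambda col: col.split(SEP)[0])
                .sum()
                .sort_values(ascending=False))
print(per_variable)
```

Summing is a reasonable way to combine them because scikit-learn normalizes `feature_importances_` to add up to 1, so the grouped values still add up to 1.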

First, let's grab a dataset.  I'm using this Kaggle dataset because it has a good number of categorical predictors.  I'm also only using the first 500 rows because the whole dataset is roughly 1 GB.

import pandas as pd

df = pd.read_csv("train.csv", 
                   nrows=500)

Let's just use the categorical variables as our predictors because that's what we're focusing on.  In actual usage, the list of predictors and the list of variables to One-Hot Encode don't have to be the same.

predVars = [
    "site_category",
    "app_category",
    "device_model",
    "device_type",
    "device_conn_type",
]

# fh refers to the helper module linked above; its import isn't shown in this post.
X = (df
     .dropna()
     [predVars]
     .pipe((fh.oneHotEncodeMultipleVars, "df"),
           varList=predVars)  # Change this if you don't have solely categoricals
    )

labels = X.columns

y = (df
     .dropna()
     ["click"]
     .values)

Let's use log_loss as our metric, because I saw this blog post that used it for this dataset.
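In case log loss is new to you, here's a quick sketch of what it measures for a binary target like `click`, checked against scikit-learn's `log_loss`. The labels and probabilities are made up, and this says nothing about how the helper function calls the metric internally.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 0, 1, 1])
p_click = np.array([0.1, 0.4, 0.35, 0.8])  # made-up predicted P(click = 1)

# Binary log loss: the average negative log-probability assigned to the true label.
manual = -np.mean(y_true * np.log(p_click)
                  + (1 - y_true) * np.log(1 - p_click))

print(manual, log_loss(y_true, p_click))
```

Lower is better, and confident wrong predictions get punished hard.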

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

fi.displayFeatureImportances(
    X,
    y,
    labels,
    log_loss,
    {"n_estimators": 18, "oob_score": True},
)
Score is 3.6297600214665064

           Variable  Importance
0      device_model    0.843122
1     site_category    0.083392
2      app_category    0.037216
3       device_type    0.025057
4  device_conn_type    0.011213

Center of the Universe Website
Super villain in somebody's action hero movie. Experienced a radioactive freak accident at a young age, which rendered him part-snake and strangely adept at Python.