### Data Science

Watch as we attempt to maintain a delicate harmony of math, engineering, and intuition to solve larger-than-life problems.

## Using Random Forests for Feature Selection with Categorical Features

##### Python helper functions for adding up feature importances and displaying them as a single variable.

Notebook here.  Helper functions here.

One of the best features of Random Forests is that they have built-in feature selection. Explicability is one of the things we often lose when we go from traditional statistics to Machine Learning, but Random Forests let us actually get some insight into our dataset instead of just having to treat our model as a
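The helper functions themselves aren't reproduced in this excerpt, so here is a minimal sketch of the general idea, not the post's actual code. It assumes the categorical column is one-hot encoded with `pd.get_dummies`, and that a made-up separator (`SEP`) lets us map each dummy column back to its source variable. The toy data and column names are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy data (made up for illustration): one categorical and one numeric feature.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=200),
    "size": rng.normal(size=200),
})
y = ((df["color"] == "red") | (df["size"] > 1)).astype(int)

# One-hot encode with a separator that does not appear in the column names,
# so each dummy column can be traced back to the variable it came from.
SEP = "__"
X = pd.get_dummies(df, columns=["color"], prefix_sep=SEP, dtype=float)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# One importance per encoded column, e.g. "color__red", "color__blue", "size".
importances = pd.Series(forest.feature_importances_, index=X.columns)

# Add the dummy importances back up per original variable: "color__red" -> "color".
grouped = importances.groupby(lambda name: name.split(SEP)[0]).sum()
grouped = grouped.sort_values(ascending=False)
print(grouped)
```

Summing is only one way to combine the pieces; it keeps the totals adding up to 1, which makes the grouped numbers comparable to the ungrouped ones.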

Code Snippet Corner
matt, Jan 15
Sep 24

## Tuning Random Forests Hyperparameters with Binary Search Part III: min_samples_leaf

##### Tune the min_samples_leaf parameter for a Random Forests classifier in scikit-learn in Python

Part 1 (n_estimators) here
Part 2 (max_depth) here
Notebook here

Another parameter, another set of quirks!

`min_samples_leaf` is sort of similar to `max_depth`.  It helps us avoid overfitting.  It's also non-obvious what you should use as your upper and lower limits to search between.  Let's do what we did last week - build a forest
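To get a feel for how `min_samples_leaf` reins in tree growth, here is a minimal sketch on synthetic data. The dataset and the candidate values (1, 5, 25) are arbitrary assumptions chosen for illustration, not the bounds the post ends up searching between.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, made up for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

avg_leaves = {}
for leaf in (1, 5, 25):
    forest = RandomForestClassifier(
        n_estimators=20, min_samples_leaf=leaf, random_state=0
    ).fit(X, y)
    # Average number of leaves per tree: a rough proxy for how complex each tree is.
    counts = [tree.get_n_leaves() for tree in forest.estimators_]
    avg_leaves[leaf] = sum(counts) / len(counts)

for leaf, n_leaves in avg_leaves.items():
    print(f"min_samples_leaf={leaf}: {n_leaves:.1f} leaves per tree on average")
```

Requiring more samples per leaf forces each tree to stop splitting earlier, which is why it behaves a bit like a depth limit.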

Code Snippet Corner
matt, Jan 15
Sep 17

## Tuning Random Forests Hyperparameters with Binary Search Part II: max_depth

##### Code snippet corner is back! Tune the max_depth parameter for a Random Forests classifier in scikit-learn in Python

Continued from here

Notebook for this post is here

Binary search code itself is here

`max_depth` is an interesting parameter.  While `n_estimators` has a tradeoff between speed & score, `max_depth` has the possibility of improving both.  By limiting the depth of your trees, you can reduce overfitting.
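A minimal sketch of what limiting depth looks like in practice, assuming synthetic data and an arbitrary cap of 4 (both made up for illustration). It fits one unconstrained forest and one capped forest, then records the deepest tree and the train/test accuracy for each, so you can eyeball the gap between the two scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, made up for illustration.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for depth in (None, 4):  # None lets every tree grow until its leaves are pure
    forest = RandomForestClassifier(
        n_estimators=50, max_depth=depth, random_state=0
    ).fit(X_train, y_train)
    results[depth] = {
        "deepest_tree": max(tree.get_depth() for tree in forest.estimators_),
        "train_acc": forest.score(X_train, y_train),
        "test_acc": forest.score(X_test, y_test),
    }

for depth, stats in results.items():
    print(depth, stats)
```

The depth of the deepest unconstrained tree is one possible starting point for an upper bound, though that is a suggestion of this sketch rather than something the excerpt states.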

Unfortunately, deciding on upper & lower bounds is less than

Code Snippet Corner
matt, Jan 15
Sep 10

## Code Snippet Corner: Tuning Machine Learning Hyperparameters with Binary Search

##### Tune the n_estimators parameter for a Random Forests classifier in scikit-learn in Python

Ah, hyperparameter tuning.  Time & compute-intensive.  Frequently containing weird non-linearities in how changing a parameter changes the score and/or the time it takes to train the model.

`RandomizedSearchCV` goes noticeably faster than a full `GridSearchCV` but it still takes a while - which can be rough, because in my experience you do still need to be iterative with it
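The post's own binary search code isn't included in this excerpt, so the following is a rough sketch of the idea rather than the author's implementation. `cv_score` and `smallest_good_enough` are hypothetical helpers, and the search assumes the cross-validated score is roughly non-decreasing in `n_estimators`. When that assumption fails, the result is still a valid value within tolerance, just not necessarily the smallest one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, made up for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)


def cv_score(n_estimators, X, y):
    """Mean 3-fold cross-validated accuracy for a forest of the given size."""
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    return cross_val_score(model, X, y, cv=3).mean()


def smallest_good_enough(lo, hi, X, y, tol=0.01):
    """Hypothetical helper: binary search for a small n_estimators in [lo, hi]
    whose score is within `tol` of the score at the upper bound `hi`."""
    target = cv_score(hi, X, y) - tol
    while lo < hi:
        mid = (lo + hi) // 2
        if cv_score(mid, X, y) >= target:
            hi = mid        # mid is good enough, so try fewer trees
        else:
            lo = mid + 1    # mid falls short, so we need more trees
    return lo


best = smallest_good_enough(1, 64, X, y)
print("n_estimators:", best)
```

Each step halves the search range, so this fits a handful of models where a grid over every value from 1 to 64 would fit 64.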

Code Snippet Corner
matt, Jan 15
Sep 03

## Automagically Turn JSON into Pandas DataFrames

##### Let pandas do the heavy lifting for you when turning JSON into a DataFrame.

In his post about extracting data from APIs, Todd demonstrated a nice way to massage JSON into a pandas DataFrame. This method works great when our JSON response is flat, because `dict.keys()` only gets the keys on the first "level" of a dictionary. It gets a little trickier when our JSON starts to become nested though, as I experienced
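To make the flat-versus-nested difference concrete, here is a small sketch. The records are invented for illustration, and `pandas.json_normalize` is one pandas function that handles nesting; the excerpt doesn't show which approach the post itself takes.

```python
import pandas as pd

# Made-up nested records, standing in for a JSON API response.
records = [
    {"id": 1, "user": {"name": "Ada", "address": {"city": "London"}}},
    {"id": 2, "user": {"name": "Grace", "address": {"city": "Arlington"}}},
]

# The plain constructor only looks at first-level keys,
# so the "user" column ends up holding whole dicts.
shallow = pd.DataFrame(records)
print(shallow.dtypes)

# json_normalize walks the nested dicts and joins the key path with `sep`,
# producing one column per leaf value.
flat = pd.json_normalize(records, sep="_")
print(flat)
```

The `sep` argument is optional (the default is a dot); an underscore just makes the resulting column names easier to use as attributes.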

## Trash Pandas: Messy, Convenient DB Operations via Pandas

##### (And a way to clean it up with SQLAlchemy)

Let's say you were continuing our task from last week: Taking a bunch of inconsistent Excel files and CSVs, and putting them into a database.

Let's say you've been given a new CSV that conflicts with some rows you've already entered, and you're told that these rows are the correct values.

## Why Not Use Pandas' Built-in Method?

Pandas' built-in `to_`

Pandas
matt, Jan 15
Jul 23

## Data Could Save Humanity if it Weren't for Humanity

##### A compelling case for robot overlords.

A decade has passed since I stumbled into technical product development. Looking back, I've spent that time almost exclusively in the niche of data-driven products and engineering. While it seems obvious now, I realized in the 2000s that you could generally create two types of product: you could either build a (likely uninspired) UI for existing data, or you could

Data Science
matt, Jan 15
Jul 19
