One of the best features of Random Forests is that they have built-in Feature Selection. Explainability is one of the things we often lose when we move from traditional statistics to Machine Learning, but Random Forests let us actually get some insight into our dataset instead of having to treat our model as a black box.
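A minimal sketch of what that insight looks like in scikit-learn: a fitted forest exposes per-feature importance scores. The dataset here is synthetic and purely illustrative, not from the post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, just to have something to fit.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ is normalized: the scores sum to 1.0.
for i, imp in enumerate(clf.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```

Higher scores mean the forest leaned on that feature more when splitting, which is the "free" insight the post is pointing at.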
Code Snippet Corner
Tune the min_samples_leaf parameter for a Random Forests classifier in scikit-learn in Python
Another parameter, another set of quirks!
min_samples_leaf is similar in spirit to max_depth: it helps us avoid overfitting, and it's also non-obvious what you should use as your upper and lower limits to search between. Let's do what we did last week and build a forest.
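As a rough sketch of what that search might look like, here's a small cross-validated sweep over a few candidate leaf sizes. The candidates, dataset, and forest settings are all made up for illustration, not taken from the post's notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Guessed bounds: try a handful of leaf sizes and score each by CV.
scores = {}
for leaf in (1, 2, 5, 10, 25):
    clf = RandomForestClassifier(
        n_estimators=50, min_samples_leaf=leaf, random_state=0
    )
    scores[leaf] = cross_val_score(clf, X, y, cv=3).mean()

best = max(scores, key=scores.get)
print("best min_samples_leaf:", best)
```

Larger leaf sizes force each leaf to cover more samples, which smooths the trees out and fights overfitting, at the cost of flexibility.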
Code Snippet Corner is back! Tune the max_depth parameter for a Random Forests classifier in scikit-learn in Python
Continued from here
Notebook for this post is here
Binary search code itself is here
max_depth is an interesting parameter. While n_estimators has a tradeoff between speed & score, max_depth has the possibility of improving both: by limiting the depth of your trees, you can reduce overfitting. Unfortunately, deciding on upper & lower bounds is less than obvious.
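A quick sketch of that speed-and-score possibility: fit the same forest at a few depths and watch both the cross-validated score and the wall-clock time. The depths and dataset are arbitrary choices for illustration.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

results = {}
for depth in (3, 10, None):  # None lets the trees grow fully
    clf = RandomForestClassifier(
        n_estimators=50, max_depth=depth, random_state=0
    )
    t0 = time.perf_counter()
    results[depth] = cross_val_score(clf, X, y, cv=3).mean()
    elapsed = time.perf_counter() - t0
    print(f"max_depth={depth}: score={results[depth]:.3f}, {elapsed:.2f}s")
```

Shallower trees are cheaper to build, and when the full-depth trees are overfitting, capping the depth can raise the CV score at the same time.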
Tune the n_estimators parameter for a Random Forests classifier in scikit-learn in Python
Ah, hyperparameter tuning. Time & compute-intensive. Frequently containing weird non-linearities in how changing a parameter changes the score and/or the time it takes to train the model.
RandomizedSearchCV goes noticeably faster than a full
GridSearchCV, but it still takes a while - which can be rough, because in my experience you do still need to be iterative with it
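For reference, a minimal sketch of the RandomizedSearchCV shape being discussed; the parameter range, n_iter, and dataset are illustrative guesses, not the post's actual settings.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# n_iter is the main lever: how many random draws from the
# distribution get tried, trading thoroughness for wall-clock time.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(10, 200)},
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The iterative part comes from inspecting best_params_ after each run and re-centering the distribution around whatever region looked promising.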
Pandas and Excel Pt. 2
What if, like during my data import task a few months back, the dates & times are in separate columns? This gives us a few new issues. Let's import that Excel file!
```python
import pandas as pd
import xlrd
import datetime

df = pd.read_excel("hasDatesAndTimes.xlsx", sheet_name="Sheet1")
book = xlrd.open_workbook("hasDatesAndTimes.xlsx")
```
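Once the two columns are parsed into date and time objects, stitching them back together is the core move. A minimal stdlib sketch with made-up rows (the column values here are hypothetical, not from the post's spreadsheet):

```python
from datetime import date, time, datetime

# Hypothetical rows as they might come out of the two Excel columns:
# one column parsed as a date, the other as a time.
rows = [
    (date(2018, 8, 12), time(21, 50, 16)),
    (date(2018, 8, 13), time(9, 5, 0)),
]

# datetime.combine merges a date and a time into one datetime.
combined = [datetime.combine(d, t) for d, t in rows]
print(combined[0])  # 2018-08-12 21:50:16
```

In pandas the same idea applies row-wise, e.g. applying datetime.combine across the two columns.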
Pandas & Excel, Part 1
Different file formats are different! For all kinds of reasons!
A few months back, I had to import some Excel files into a database. In this process I learned so much about the delightfully unique way Excel stores dates & times!
The basic datetime will be a decimal number, like
43324.909907407404. The number before the decimal is the day count since Excel's epoch, and the number after the decimal is the fraction of that day that has elapsed - in other words, the time.
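That split suggests a small converter. This is a sketch assuming the common 1900 date system, where Excel's day 0 works out to 1899-12-30 (the offset also absorbs Excel's phantom 1900-02-29):

```python
from datetime import datetime, timedelta

def excel_serial_to_datetime(serial):
    """Convert an Excel 1900-system serial number to a datetime.

    Rounding to whole seconds papers over floating-point fuzz
    in the fractional (time-of-day) part.
    """
    epoch = datetime(1899, 12, 30)
    return epoch + timedelta(seconds=round(serial * 86400))

print(excel_serial_to_datetime(43324.909907407404))
# 2018-08-12 21:50:16
```

xlrd also ships a proper helper, xlrd.xldate.xldate_as_datetime(serial, book.datemode), which additionally handles workbooks saved in the 1904 date system.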
Picking Low-Hanging Fruit With Dask
Ah, laziness. You love it, I love it, everyone agrees it's just better.
Flesh-and-blood pandas are famously lazy. Pandas the package, however, uses Eager Evaluation. What's Eager Evaluation, you ask? Is Pandas really judgey, hanging out on the street corner and being fierce to the style choices of people walking by? Well, yes, but that's not the most relevant sense here.
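The eager-vs-lazy distinction can be sketched with plain Python, using a generator as a stand-in for Dask's deferred computation (this is an analogy, not actual Dask code):

```python
# Pandas-style eager evaluation: the work happens immediately.
eager = [x * 2 for x in range(5)]  # computed right now
print(eager)  # [0, 2, 4, 6, 8]

# Dask-style lazy evaluation, sketched with a generator:
# nothing runs until something actually asks for the values.
lazy = (x * 2 for x in range(5))  # a recipe, not a result
print(list(lazy))  # forcing the computation
```

Deferring work like this is what lets Dask plan a whole pipeline before executing any of it.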
Reshaping Pandas dataframes with a real-life example, and graphing it with Altair
The last few Code Snippet Corners were about using Pandas as an easy way to handle input and output between files & databases. Let's shift gears a little bit! Among other reasons, because earlier today I discovered a package that exclusively does that, which means I can stop importing the massive Pandas package when all I really wanted was to move data between files & databases.
Code Snippet Corner
This isn't really a tutorial on
cron in general; people better at Linux than me have written far better ones than I could. Here's one: http://mediatemple.net/blog/news/complete-beginners-guide-cron-part-1/ This is more of a code-journaling exercise for a problem that I didn't find a neat-and-tidy answer to online when I was looking for it, and that I presume others have run into too.