- Recasting Low-Cardinality Columns as Categoricals
- Removing Duplicate Columns in Pandas
- Downcast Numerical Data Types with Pandas
- Using Random Forests for Feature Selection with Categorical Features
- Tuning Random Forests Hyperparameters with Binary Search Part III: min_samples_leaf
- Tuning Random Forests Hyperparameters with Binary Search Part II: max_depth
- Tuning Machine Learning Hyperparameters with Binary Search
- Importing Excel Datetimes Into Pandas, Part II
- Importing Excel Datetimes Into Pandas, Part I
- All That Is Solid Melts Into Graphs
- Trash Pandas: Messy, Convenient DB Operations via Pandas
- A Dirty Way of Cleaning Data (ft. Pandas & SQL)
- Getting Conda Envs (And Environment Variables!) To Play Nicely With Cron
- Using Pandas and SQLAlchemy to Simplify Databases
Different file formats are different! For all kinds of reasons!
A few months back, I had to import some Excel files into a database. In this process I learned so much about the delightfully unique way Excel stores dates & times!
The basic datetime will be a decimal number, like
43324.909907407404. The number before the decimal is the day, the number afterwards is the time. So far, so good - this is pretty common for computers. The date is often the number of days past a certain date, and the time is the number of seconds.
So, let's load our excel sheet! Pandas of course has a painless way of doing this.
import pandas as pd dfRaw = pd.read_excel("hasDates.xlsx", sheet_name="Sheet1") dfRaw["dateTimes"]
Sadly, we can't yet convert these. Different Excel files start at different dates, and you'll get a very wrong result if you use the wrong one. Luckily, there are tools that'll go into the file and extract what we need! Enter
import xlrd book = xlrd.open_workbook("hasDates.xlsx") datemode = book.datemode
xlrd also has a handy function for turning those dates into a
datetime tuple that'll play nicely with Python.
dfRaw["dateTimes"].map(lambda x: xlrd.xldate_as_tuple(x, datemode))
|0||(2018, 8, 12, 21, 50, 16)|
|1||(2018, 8, 12, 21, 50, 17)|
|2||(2018, 8, 12, 21, 50, 18)|
|3||(2018, 8, 12, 21, 50, 19)|
|4||(2018, 8, 12, 21, 50, 20)|
And once we've got that, simple enough to convert to proper datetimes!
import datetime dfRaw["dateTimes"].map(lambda x: datetime.datetime(*xlrd.xldate_as_tuple(x, datemode)))
Stick around for Part 2, where we look at some messier situations.