Importing Excel Datetimes Into Pandas, Part II

What if the dates and times are in separate columns, as they were in a data import task I had a few months back? That raises a few new issues. Let's import that Excel file!

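import pandas as pd
import xlrd
import datetime

df = pd.read_excel("hasDatesAndTimes.xlsx", sheet_name="Sheet1")

# The workbook's datemode (0 = 1900-based, 1 = 1904-based) is needed to decode Excel serial values.
# Note: xlrd 2.0 and later only opens .xls files, so this call needs an older xlrd for .xlsx.
book = xlrd.open_workbook("hasDatesAndTimes.xlsx")
datemode = book.datemode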

And let's see that time variable!

df["Time"]

Index	Time
0	0.909907
1	0.909919
2	0.909931
3	0.909942
4	0.909954
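
Those floats are how Excel stores a time of day: the fraction of a 24-hour day that has elapsed. Here is a minimal sketch of that arithmetic using only the standard library. The helper name `fraction_to_hms` is hypothetical, and rounding to the nearest whole second is an assumption on my part.

```python
def fraction_to_hms(day_fraction):
    """Convert a fraction of a day (0 <= value < 1) to (hour, minute, second)."""
    # 86400 seconds per day; the modulo wraps a value that rounds up to midnight.
    total_seconds = int(round(day_fraction * 86400)) % 86400
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return hours, minutes, seconds


print(fraction_to_hms(0.909907))  # (21, 50, 16)
```

So 0.909907 is a little before ten at night, with no date information attached at all.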

df["Time"].map(lambda x: xlrd.xldate_as_tuple(x, datemode))

Index	Time
0	(0, 0, 0, 21, 50, 16)
1	(0, 0, 0, 21, 50, 17)
2	(0, 0, 0, 21, 50, 18)
3	(0, 0, 0, 21, 50, 19)
4	(0, 0, 0, 21, 50, 20)

So far, so good....

df["Time"].map(lambda x: datetime.datetime(*xlrd.xldate_as_tuple(x,
                                                                 datemode)))
ValueError: year 0 is out of range

Agh! Our time-only values decode to year 0, and plain datetime won't let us have 0 as our year.

We'll want two packages to fix this. One is arrow, an awesome package for handling dates. For arrow to recognize our values, though, we'll need a few more manipulations, so I'll use the pipe function from toolz to keep them readable.

Pipe lets us have a nice workflow where we just list the transformations we want, and our value will be "piped" sequentially through each one.
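
To make that concrete, here is a rough stand-in for what pipe does, written with the standard library only. `pipe_sketch` is a hypothetical name and is not toolz's implementation; it only mirrors the `pipe(data, *funcs)` calling convention.

```python
from functools import reduce


def pipe_sketch(value, *funcs):
    """Feed value through each function in order, left to right."""
    return reduce(lambda acc, fn: fn(acc), funcs, value)


# Same result as "-".join(map(str, (21, 50, 16))), but read top to bottom.
result = pipe_sketch((21, 50, 16),
                     lambda t: map(str, t),
                     "-".join)
print(result)  # 21-50-16
```

The nested version has to be read inside out; the piped version lists the steps in the order they happen.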

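import arrow
from toolz import pipe

fns = [lambda x: xlrd.xldate_as_tuple(x, datemode),  # Excel serial -> (Y, M, D, h, m, s)
       lambda x: x[3:6],                             # keep (h, m, s)
       lambda x: map(str, x),                        # each part to a string (lazy)
       lambda x: "-".join(x),                        # e.g. "21-50-16"
       lambda x: arrow.get(x, "H-m-s"),              # parse as a time
       lambda x: x.format('HH:mm:ss')                # zero-padded output
       ]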

Let's see a blow-by-blow of how one of our values gets transformed by that.

fnRanges = [fns[:i+1] for i in range(len(fns))]
[pipe(0.909907, *x) for x in fnRanges]

[(0, 0, 0, 21, 50, 16),
 (21, 50, 16),
 <map at 0x7f105151af98>,
 '21-50-16',
 <Arrow [0001-01-01T21:50:16+00:00]>,
 '21:50:16']

The first function takes us from an Excel datetime to a datetime tuple.

The next extracts just the time components: hour, minute, and second.

We then convert each component to a string (the result displays as a bare map object because map is evaluated lazily).

Then we put a dash between those elements so they're easier to parse as a time.

Then arrow parses the value using the format we specified.

And finally it gives us a neatly formatted time, ready to be consumed by a database!
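
If the opaque map object in that trace bothers you, one option is to materialize lazy results while stepping through. This is only a sketch: `trace_steps` is a hypothetical helper, and it is shown on the pure-Python middle steps so it runs without xlrd or arrow.

```python
from collections.abc import Iterator


def trace_steps(value, fns):
    """Apply fns in order and collect every intermediate result."""
    results = []
    for fn in fns:
        value = fn(value)
        # A map object is a one-shot iterator: turn it into a list so it
        # can be displayed and still be passed on to the next step.
        if isinstance(value, Iterator):
            value = list(value)
        results.append(value)
    return results


middle_steps = [lambda t: t[3:6],
                lambda t: map(str, t),
                lambda parts: "-".join(parts)]
print(trace_steps((0, 0, 0, 21, 50, 16), middle_steps))
# [(21, 50, 16), ['21', '50', '16'], '21-50-16']
```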

Helper Functions

def mapPipe(ser, fns):
    return ser.map(lambda a: pipe(a, *fns),
                   na_action="ignore")

mapPipe(df['Time'], fns)

Index	Time
0	21:50:16
1	21:50:17
2	21:50:18
3	21:50:19
4	21:50:20

Dates are a bit easier, though the pipe syntax is still helpful!

dateFns = [lambda x: xlrd.xldate_as_tuple(x, datemode),
           lambda x: arrow.get(*x),
           lambda x: x.format('YYYY-MM-DD')
           ]
mapPipe(df['Date'], dateFns)

Index	Date
0	2018-08-12
1	2018-08-12
2	2018-08-12
3	2018-08-12
4	2018-08-12
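
For intuition about what the date column holds before conversion, here is a standard-library sketch of decoding an Excel date serial. It assumes the 1900 date system (datemode 0) and serials of 61 or more, where counting days from 1899-12-30 works. The serial 43324 is an illustrative value I computed for 2018-08-12, not one read from the spreadsheet, and `serial_to_date_string` is a hypothetical helper.

```python
import datetime

# Assumed epoch for datemode 0; only valid for serials >= 61.
EXCEL_1900_EPOCH = datetime.date(1899, 12, 30)


def serial_to_date_string(serial):
    """Turn an Excel date serial into a 'YYYY-MM-DD' string."""
    day = EXCEL_1900_EPOCH + datetime.timedelta(days=int(serial))
    return day.strftime("%Y-%m-%d")


print(serial_to_date_string(43324))  # 2018-08-12
```

In the post itself, xlrd handles the datemode and the early-1900 edge cases, which is why dateFns starts with xldate_as_tuple rather than this arithmetic.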

Put it all together....

(df.assign(Date=mapPipe(df['Date'], dateFns))
   .assign(Time=mapPipe(df['Time'], fns)))

Index	Date	Time
0	2018-08-12	21:50:16
1	2018-08-12	21:50:17
2	2018-08-12	21:50:18
3	2018-08-12	21:50:19
4	2018-08-12	21:50:20
