Using Pandas with AWS Lambda Functions

In one corner we have Pandas: Python's beloved data analysis library. In the other, AWS: the unstoppable cloud provider we're obligated to use for all eternity. We should have known this day would come.

While not the prettiest workflow, uploading Python package dependencies for use in AWS Lambda is typically straightforward. We install the packages locally to a virtual env, package them with our app logic, and upload a neat zip file to Lambda. This doesn't always work: some packages result in a cryptic error message with absolutely no helpful instruction. Pandas is one of those packages.

Why is this? I can't exactly speak to that, but I can speak to how to fix it.

Spin up an EC2 Instance

Certain Python packages need to be installed and compiled on an EC2 instance in order to work properly with AWS microservices. I wish I could say that this fun little fact is well-documented somewhere in AWS with a perfectly good explanation. It isn't, and there isn't. It's probably best not to ask questions.

Spin up a free tier EC2 instance, update your system packages, and make sure Python3 is installed. Some people theorize that the Python dependency package errors happen when said dependencies are installed via versions of Python which differ from the version AWS is running. Those people are wrong. I've already wasted the time to debunk this. They are liars.

With Python installed, create a virtual environment inside any empty directory:

$ sudo apt-get install python3-venv
$ python3 -m venv pandasenv
$ source pandasenv/bin/activate

Create a virtual environment

With the environment active, install pandas by running pip3 install pandas. This saves pandas and all its dependencies to the site-packages folder our environment is running from, at a path such as this: pandasenv/lib/python3.6/site-packages.
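
The python3.6 part of that path depends on whichever interpreter built the environment, so it may not match yours. A minimal sketch for asking the active interpreter where pip puts packages, using only the standard library (the helper name site_packages_dir is made up for illustration):

```python
import sysconfig
from pathlib import Path


def site_packages_dir() -> Path:
    """Return the directory where pip installs pure-Python packages
    for the interpreter that is running this code."""
    return Path(sysconfig.get_paths()["purelib"])


if __name__ == "__main__":
    # Run this with the virtual environment activated.
    print(site_packages_dir())
```

Run inside the activated environment, this should print the site-packages folder to cd into for the zip step.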

Pandas is actually five packages in total: pandas itself plus numpy, pytz, six, and dateutil. We're going to add each of these libraries to a zip file by installing zip and adding each one to the archive in turn. Finally, we'll apply some liberal permissions to the zip file we just created so we can grab it via FTP.

$ cd pandasenv/lib/python3.6/site-packages
$ apt-get install zip
$ zip -r pandas_archive.zip pandas
$ zip -r pandas_archive.zip numpy
$ zip -r pandas_archive.zip pytz
$ zip -r pandas_archive.zip six.py
$ zip -r pandas_archive.zip dateutil
$ chmod 777 pandas_archive.zip

Zip Pandas & dependencies.

The archive should now be ready for you to FTP into your instance and grab (assuming you want to work locally). Alternatively, we could copy those packages into the directory we'd like to work out of and zip everything once we're done.
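
If you'd rather not install zip at all, the same bundling step can be sketched with Python's built-in zipfile module. This is an illustration rather than the method used above: the function name zip_packages and the PACKAGES tuple are invented for the example, and the package names are simply the five zipped above.

```python
import zipfile
from pathlib import Path

# The five top-level names zipped out of site-packages above.
PACKAGES = ("pandas", "numpy", "pytz", "six.py", "dateutil")


def zip_packages(site_packages, archive_path, names=PACKAGES):
    """Add each named package (a directory or a single file) to one archive.

    Returns the names that were not found, so a missing dependency
    shows up now instead of as an import error inside Lambda.
    """
    site_packages = Path(site_packages)
    missing = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in names:
            source = site_packages / name
            if source.is_dir():
                # Walk the package and store paths relative to site-packages.
                for path in sorted(source.rglob("*")):
                    if path.is_file():
                        archive.write(path, path.relative_to(site_packages).as_posix())
            elif source.is_file():
                archive.write(source, source.name)
            else:
                missing.append(name)
    return missing
```

Calling zip_packages("pandasenv/lib/python3.6/site-packages", "pandas_archive.zip") from the directory containing the environment should produce an archive with the same layout as the shell version.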

Upload Source Code to S3

At this point, you should have the AWS-friendly version of Pandas, ready to be included in the final source code which will become your Lambda Function. You might notice that pandas alone is nearly 30MB, which is roughly the file size of countless intelligent people creating their life's work. When Lambda Functions go above this file size, it's best to upload our final package (with source and dependencies) as a zip file to S3, and link it to Lambda that way. This is considerably faster than the alternative of uploading the zip to Lambda directly.
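
For those who'd rather script this than click through the console, here is a rough sketch of the "big zips go through S3" rule using boto3's upload_file and update_function_code calls. Everything named here is an assumption for illustration: the deploy_package helper, the bucket, key, and function names, and the 30MB cutoff, which is borrowed from the rough figure above and is not an official AWS limit.

```python
import os

# Hypothetical cutoff based on the article's rough figure, not an AWS limit.
S3_THRESHOLD_BYTES = 30 * 1024 * 1024


def deploy_package(zip_path, function_name, bucket, key,
                   s3_client, lambda_client, threshold=S3_THRESHOLD_BYTES):
    """Point a Lambda function at a new deployment zip.

    Archives at or above the threshold are uploaded to S3 and linked from
    there; smaller ones are sent to Lambda directly.
    Returns "s3" or "direct" to show which route was taken.
    """
    if os.path.getsize(zip_path) >= threshold:
        s3_client.upload_file(zip_path, bucket, key)
        lambda_client.update_function_code(
            FunctionName=function_name, S3Bucket=bucket, S3Key=key
        )
        return "s3"
    with open(zip_path, "rb") as archive:
        lambda_client.update_function_code(
            FunctionName=function_name, ZipFile=archive.read()
        )
    return "direct"


# Typical wiring, assuming boto3 is installed and AWS credentials are configured:
#   import boto3
#   deploy_package("function.zip", "my-function", "my-bucket", "function.zip",
#                  boto3.client("s3"), boto3.client("lambda"))
```

The clients are passed in rather than created inside the function so the routing logic can be tried out with stand-in objects before touching a real account.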

Bonus Round: Saving Exports

What? You want to save a CSV result of all the cool stuff you're doing in Pandas? You really are needy.

Because AWS is invoking the function, any attempt to to_csv() to a local file path will be worthless to us. To get around this, we can write the CSV to an in-memory buffer and use boto3 to put it in an S3 bucket instead:

from io import StringIO

import boto3
import pandas as pd

# Inside Lambda, boto3 picks up credentials from the function's execution role,
# so no access keys need to be passed in here.
s3_resource = boto3.resource('s3')
bucket = 'your_bucket_name'

# Write the CSV to an in-memory buffer instead of the filesystem.
csv_buffer = StringIO()

example_df = pd.DataFrame()
example_df.to_csv(csv_buffer)

# Upload the buffer's contents as a single S3 object.
s3_resource.Object(bucket, 'export.csv').put(Body=csv_buffer.getvalue())

Write DataFrame to a CSV in S3
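
Since Lambda is the one calling our code, the export eventually has to live inside a handler. Here is one way that might look, as a sketch only: the helper export_dataframe, the sample DataFrame, and the bucket and key names are all placeholders, and it uses the S3 client's put_object call rather than the resource API shown above.

```python
from io import StringIO


def export_dataframe(df, bucket, key, s3_client):
    """Serialize a DataFrame to CSV in memory and upload it as one S3 object."""
    csv_buffer = StringIO()
    df.to_csv(csv_buffer, index=False)  # index=False leaves out the row index
    s3_client.put_object(
        Bucket=bucket, Key=key, Body=csv_buffer.getvalue().encode("utf-8")
    )
    return f"s3://{bucket}/{key}"


def lambda_handler(event, context):
    """Entry point Lambda invokes; bucket and key below are placeholders."""
    # Imported here so the helper above can be exercised without AWS libraries.
    import boto3
    import pandas as pd

    df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
    location = export_dataframe(df, "your_bucket_name", "export.csv",
                                boto3.client("s3"))
    return {"statusCode": 200, "body": location}
```

For this to work, the function's execution role needs permission to write to the bucket in question.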

Word of Advice

This isn't the prettiest process in the world, but we're somewhat at fault here. Lambda functions are intended to be small tidbits of logic aimed to serve a single simple purpose. We just jammed 30MB of Python libraries into that simple purpose.

There are alternatives to Pandas that are better suited for usage in Lambda, such as Toolz (thanks to Snkia for the heads up). Enjoy your full Pandas library for now, but remember to feel bad about what you’ve done for next time.
