Extract Massive Amounts of Data from APIs in Python

Abusing APIs for all they’re worth

Taxation without representation. Colonialism. Not letting people eat cake. Human beings rightfully meet atrocities with action in an effort to change the world for the better. Cruelty by mankind justifies revolution, and it is this writer's opinion that API limitations are one such cruelty.

The data we need and crave is stashed in readily available APIs all around us. It's as though we have the keys to the world, but that power often comes with a few caveats:

  • Your "key" only lasts a couple of hours, and if you want another one, you'll have to use some other keys to get it.
  • You can have the ten thousand records you're looking for, but you can only pull 50 at a time.
  • You won't know the exact structure of the data you're getting, but it'll probably be a JSON hierarchy designed by an 8-year-old.

All men may be created equal, but APIs are not. In the spirit of this 4th of July, let us declare independence from repetitive tasks: One Script, under Python, for Liberty and Justice for all.

Project Setup

We'll split our project up by separation of concerns into just a few files:

myProject
├── main.py
├── config.py
└── token.py

main.py will unsurprisingly hold the core logic of our script.

config.py contains variables such as client secrets and endpoints, which we can easily swap when applying this script to different APIs. For now, we'll just keep the variables client_id and client_secret in there.
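For the curious, here is a minimal sketch of what config.py might look like. It assumes the credentials live in environment variables rather than in the file itself; the variable names FAKEAPI_CLIENT_ID and FAKEAPI_CLIENT_SECRET are made up for illustration, and the two URLs are simply the fake endpoints used elsewhere in this post.

```python
# config.py (a sketch, not the only way to do this)
import os

# Hypothetical environment variable names; use whatever your setup provides.
client_id = os.environ.get("FAKEAPI_CLIENT_ID", "")
client_secret = os.environ.get("FAKEAPI_CLIENT_SECRET", "")

# Endpoints are kept here so they can be swapped per API.
token_url = "https://api.fakeapi.com/auth/oauth2/v2/token"
base_url = "https://api.fakeapi.com/api/1/"
```

Keeping secrets out of the file means config.py can be committed without leaking anything, which is the main reason to bother.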

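token.py handles token generation. Let's start there. One caution: Python's standard library already ships a module named token, and a local token.py can shadow it. If you run into strange import errors, rename the file (and the matching import) to something like api_token.py.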

That's the Token

Since we're assuming worst-case scenarios, let's focus on atrocity number one: APIs which require expiring tokens. There are some tyrants in this world who believe that in order to use their API, it is necessary to first use a client ID and client secret to generate a token which becomes useless hours later. In other words, you need to use an API every time you want to use the actual API. Fuck that.

import requests
from config import client_id, client_secret

token_url = 'https://api.fakeapi.com/auth/oauth2/v2/token'

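def generateToken():
    # Exchange the client ID and secret for a short-lived bearer token.
    r = requests.post(token_url, auth=(client_id, client_secret), json={"grant_type": "client_credentials"}, timeout=30)
    r.raise_for_status()  # Fail loudly if the credentials are rejected.
    bearer_token = r.json()['access_token']
    print('new token = ', bearer_token)
    return bearer_token

# Generated once at import time, so anything that imports token gets a fresh one.
token = generateToken()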

We import client_id and client_secret from our config file right off the bat: most services will hand these over as soon as you sign up for their API.

Many APIs have an endpoint which specifically serves the purpose of accepting these variables and spitting out a generated token. token_url is the variable we use to store this endpoint.

Our token variable stores the result of calling generateToken(). With this out of the way, we can call this function whenever we use the API, so we never have to worry about expired tokens.
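Requesting a brand new token on every call works, but it is a bit wasteful. Below is a rough sketch of a small cache that only asks for a new token once the old one is about to expire. It assumes the token response includes an expires_in field (in seconds), which is common for OAuth2 but not guaranteed for any given API; TokenCache, fetch, and margin are names invented for this example.

```python
import time


class TokenCache:
    """Hands out a cached token and refreshes it shortly before it expires."""

    def __init__(self, fetch, margin=60, clock=time.monotonic):
        self._fetch = fetch      # callable returning the token response as a dict
        self._margin = margin    # refresh this many seconds before expiry
        self._clock = clock      # injectable clock, handy for testing
        self._token = None
        self._expires_at = 0.0

    def get(self):
        # Fetch a new token if we have none or the current one is (nearly) expired.
        if self._token is None or self._clock() >= self._expires_at:
            payload = self._fetch()
            self._token = payload["access_token"]
            # Assumption: fall back to one hour if the API omits expires_in.
            lifetime = payload.get("expires_in", 3600)
            self._expires_at = self._clock() + max(lifetime - self._margin, 0)
        return self._token
```

In our script, fetch could be a small function that makes the same requests.post call as generateToken() and returns r.json() instead of just the token string.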

Pandas to the Rescue

We've established that we're looking to pull a large set of data, probably somewhere in the range of thousands of records. While JSON is all fine and dandy, it probably isn't very useful for human beings to consume a JSON file with thousands of records.

Again, we have no idea what the data coming through will look like. I don't really care to manually map values to fields, and I'm guessing you don't either. Pandas can help us out here: by passing the first page of records to Pandas, we can turn the keys of the first record into the columns of a DataFrame. It's almost like having a database-type schema created for you simply by looking at the data coming through:

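import requests
import pandas as pd
from token import token

# Root of the API; the pagination code later on uses the same value.
base_url = 'https://api.fakeapi.com/api/1/'

def setKeys():
    # Request the first page and use the keys of its first record as column names.
    headers = {"Authorization": "Bearer " + token}
    r = requests.get(base_url + 'users', headers=headers)
    dataframe = pd.DataFrame(columns=r.json()['data'][0].keys())
    return dataframe

records_df = setKeys()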

We can now store all data into records_df moving forward, allowing us to build a table of results.
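To see what that looks like without hitting a real API, here is a small sketch using a made-up page of records (the sample data is invented purely for illustration). It also shows pd.json_normalize, which is one option when the JSON is nested and you would rather have flat columns than a column full of dictionaries.

```python
import pandas as pd

# Invented sample payload, shaped like the responses used in this post.
sample_page = {
    "data": [
        {"id": 1, "name": "Ada", "address": {"city": "London"}},
        {"id": 2, "name": "Grace", "address": {"city": "New York"}},
    ]
}

# Same idea as setKeys(): an empty DataFrame whose columns are the record's keys.
empty_df = pd.DataFrame(columns=list(sample_page["data"][0].keys()))

# Alternative for nested records: flatten nested keys into dotted column names.
flat_df = pd.json_normalize(sample_page["data"])

print(list(empty_df.columns))
print(list(flat_df.columns))
```

Note that the first approach keeps address as a single column, while json_normalize splits it into address.city.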

No Nation for Pagination

And here we are, one of the most obnoxious parts of programming: paginated results. We want thousands of records, but we're only allowed 50 at a time. Joy.

We've already set records_df earlier as a global variable, so we're going to append every page of results we get to that Dataframe, starting at page #1. The function getRecords is going to pull that first page for us.

base_url = 'https://api.fakeapi.com/api/1/'

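def getRecords():
    global records_df  # Without this, the assignment below only creates a local variable.
    headers = {"Authorization": "Bearer " + token}
    r = requests.get(base_url + 'users', headers=headers)
    nextpage = r.json()['pagination']['next_link']
    # Load the first page's records, not just its column names.
    records_df = pd.DataFrame(r.json()['data'])
    if nextpage:
        getNextPage(nextpage)

# getNextPage is defined further down; in the finished script, make this call after that definition.
getRecords()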

Luckily, if there are additional pages of results to a request, most APIs will provide a URL to the next page, usually stored as a value in the response. In our case, you can see we find this value after making the request: nextpage = r.json()['pagination']['next_link']. If this value exists, we make a call to get the next page of results.

page = 1

def getNextPage(nextpage):
    global page, records_df
    page = page + 1
    print('PAGE ', page)
    headers = {"Authorization": "Bearer " + token}
    r = requests.get(nextpage, headers=headers)
    nextpage = r.json()['pagination']['next_link']
    records = r.json()['data']
    # Append each record on this page as a new row.
    for user in records:
        s = pd.Series(user, index=user.keys())
        records_df.loc[len(records_df)] = s
    # Save progress after every page so a failed request doesn't lose everything.
    records_df.to_csv('records.csv')
    if nextpage:
        getNextPage(nextpage)

Our function getNextPage hits that next page of results, and appends them to the pandas Dataframe we created earlier. If another page exists after that, the function runs again, and our page increments by 1. As long as more pages exist, this function will fire again and again until all innocent records are driven out of their comfortable native resting place and forced into our contained dataset. There's not much more American than that.
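Recursion is fun until an API hands you more pages than Python's recursion limit allows. As an alternative, here is a sketch of the same "follow the next link" idea written as a plain loop. It assumes the same response shape as above (a data list plus pagination.next_link); iter_pages, collect_records, and fetch_json are names made up for this example, and fetch_json is passed in so the loop can be tried without a network connection.

```python
import pandas as pd


def iter_pages(first_url, fetch_json):
    """Yield the list of records from each page, following next_link until it runs out."""
    url = first_url
    while url:
        payload = fetch_json(url)
        yield payload["data"]
        # A missing or empty next_link ends the loop.
        url = payload.get("pagination", {}).get("next_link")


def collect_records(first_url, fetch_json):
    """Gather every page into one list, then build the DataFrame in a single step."""
    rows = []
    for page_records in iter_pages(first_url, fetch_json):
        rows.extend(page_records)
    return pd.DataFrame(rows)


# With requests, fetch_json could be something like:
#   lambda url: requests.get(url, headers=headers, timeout=30).json()
```

Building the DataFrame once at the end also avoids growing it row by row, which tends to get slow as the record count climbs.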

There's More We Can Do

This script is fine, but it can be optimized to be even more modular to truly be one-size-fits-all. For instance, some APIs don't tell you the number of pages you should expect, but rather the number of records. In those cases, we'd have to divide the total number of records by the number of records per page to know how many pages to expect. As much as I want to go into detail about writing loops on the 4th of July, I don't. At all.
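For anyone who does want the arithmetic, here is a tiny sketch. The only trick is rounding up, since a partially filled last page still counts as a page; page_count is a name invented for this example.

```python
def page_count(total_records, per_page):
    """Number of pages needed to hold total_records at per_page records each."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    # Integer ceiling division: a partial last page still counts as a page.
    return (total_records + per_page - 1) // per_page


# Ten thousand records at 50 per page, as in the example from the intro.
pages_needed = page_count(10000, 50)
page_numbers = list(range(1, pages_needed + 1))
```

From there, a loop over page_numbers can request each page by number, assuming the API accepts a page parameter of some kind.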

There are plenty more examples, but this should be enough to get us thinking how we can replace tedious work with machines. That sounds like a flavor that pairs perfectly with Bud Light and hotdogs if you ask me.

Product manager turned engineer with an ongoing identity crisis. Breaks everything before learning best practices. Completely normal and emotionally stable.
