Hackers and Slackers

Extract Nested Data From Complex JSON

Steal our code and never manually walk through JSON objects again

We're all data people here, so you already know the scenario: it happens perhaps once a day, perhaps 5, or even more. There's an API you're working with, and it's great. It contains all the information you're looking for, but there's just one problem: the complexity of nested JSON objects is endless, and suddenly the job you love needs to be put on hold to painstakingly retrieve the data you actually want, and it's 5 levels deep in a nested JSON hell. Nobody feels like much of a "scientist" or an "engineer" when half their day becomes dealing with key value errors.

Luckily, we code in Python (okay fine, that doesn't make much of a difference here but it felt like a rallying call for a brief moment there).

Real World Example

To visualize the problem, let's take an example somebody might actually want to use.  I think the Google Maps API is a good candidate to fit the bill here.

While Google Maps is actually a collection of APIs, the Google Maps Distance Matrix. The idea is that with a single API call, a user can calculate the distance and time traveled between an origin and an infinite number of destinations. It's a great full-featured API, but as you might imagine the resulting JSON for calculating commute time between where you stand and every location in the conceivable universe makes an awfully complex JSON structure.

Real quick, here's an example of the types of parameters this request accepts:

import API_KEY

def __init__():
    params = {
       'units': 'imperial',
       'key': API_KEY,
       'origins': 'New York City, NY',
       'destinations': 'Philadelphia,PA',
       'transit_mode': 'car'
    }

One origin, one destination. Pretty simple response:

{
    "destination_addresses": [
        "Philadelphia, PA, USA"
    ],
    "origin_addresses": [
        "New York, NY, USA"
    ],
    "rows": [
        {
            "elements": [
                {
                    "distance": {
                        "text": "94.6 mi",
                        "value": 152193
                    },
                    "duration": {
                        "text": "1 hour 44 mins",
                        "value": 6227
                    },
                    "status": "OK"
                }
            ]
        }
    ],
    "status": "OK"
}

So for each destination, we're getting two significant kinds of data: the commute distance, and estimated duration.

Now let's make things interesting by adding a few more stops on our trip:

import API_KEY

def __init__():
    params = {
       'units': 'imperial',
       'key': API_KEY,
       'origins': 'New York City, NY',
       'destinations': 'Washington,DC|Philadelphia,PA|Santa Barbara,CA|Miami,FL|Austin,TX|Napa County,CA',
       'transit_mode': 'car'
    }

Oh boy:

{
    "destination_addresses": [
        "Washington, DC, USA",
        "Philadelphia, PA, USA",
        "Santa Barbara, CA, USA",
        "Miami, FL, USA",
        "Austin, TX, USA",
        "Napa County, CA, USA"
    ],
    "origin_addresses": [
        "New York, NY, USA"
    ],
    "rows": [
        {
            "elements": [
                {
                    "distance": {
                        "text": "227 mi",
                        "value": 365468
                    },
                    "duration": {
                        "text": "3 hours 54 mins",
                        "value": 14064
                    },
                    "status": "OK"
                },
                {
                    "distance": {
                        "text": "94.6 mi",
                        "value": 152193
                    },
                    "duration": {
                        "text": "1 hour 44 mins",
                        "value": 6227
                    },
                    "status": "OK"
                },
                {
                    "distance": {
                        "text": "2,878 mi",
                        "value": 4632197
                    },
                    "duration": {
                        "text": "1 day 18 hours",
                        "value": 151772
                    },
                    "status": "OK"
                },
                {
                    "distance": {
                        "text": "1,286 mi",
                        "value": 2069031
                    },
                    "duration": {
                        "text": "18 hours 43 mins",
                        "value": 67405
                    },
                    "status": "OK"
                },
                {
                    "distance": {
                        "text": "1,742 mi",
                        "value": 2802972
                    },
                    "duration": {
                        "text": "1 day 2 hours",
                        "value": 93070
                    },
                    "status": "OK"
                },
                {
                    "distance": {
                        "text": "2,871 mi",
                        "value": 4620514
                    },
                    "duration": {
                        "text": "1 day 18 hours",
                        "value": 152913
                    },
                    "status": "OK"
                }
            ]
        }
    ],
    "status": "OK"
}

There's a lot happening here. There are objects. There are lists. There are lists of objects which are part of an object. The last thing I'd want to deal with is trying to parse this data only to accidentally get a useless key:value pair like "status": "OK".

Obligatory Code Snippet Solution

Let's say we only want the human-readable data from this JSON, which is labeled "text" for both distance and duration. We've created a class below dubbed extractValues to help us resolve this very issue. The idea is that extractValues is flexible and agnostic, therefore can be imported as a module into any project you might need.

class ExtractValues:
    """Recursively pull values of specified key from nested JSON."""

    def __init__(self, obj, key):
        self.obj = obj
        self.key = key
        self.arr = []
        self.res = self.extract(self.obj, self.arr, self.key)

    @classmethod
    def extract(self, obj, arr, key):
        """Returns all matching values."""
        if isinstance(obj, dict):
            for k, v in obj.items():
                if isinstance(v, (dict, list)):
                    self.extract(v, arr, key)
                elif k == key:
                   arr.append(v)
        elif isinstance(obj, list):
              for item in obj:
                  self.extract(item, arr, key)
        return arr

Because we're dealing with a class, we'll need to define an instance of the class first before invoking the method within. This class accepts two values:

  • A JSON object, such as a request from an API requests
  • The key we're looking to extract values from
myValues = ExtractValues(res, 'text')
myValueArray = myValues.res

Regardless of where the key "text" lives in the JSON, this function will return every value for the instance of "key".

So a script like this:

import requests
import API_KEY
from recursivejson import ExtractValues

endpoint = "https://maps.googleapis.com/maps/api/distancematrix/json"

import API_KEY

def __init__():
    params = {
       'units': 'imperial',
       'key': "AIzaSyAhgdmWDWEqukxL4hVq3wB00VMr5wVOPfs",
       'origins': "New York City,NY",
       'destinations': "Washington,DC|Philadelphia,PA|Santa Barbara,CA|Miami,FL|Austin,TX|Napa Valley,CA",
       'transit_mode': 'car',
       }

    r = requests.get(endpoint, params=params)

   myValues = ExtractValues(r.json(), 'text')
   arr = myValues.res
   print(arr)

Results in this:

['227 mi', '3 hours 54 mins', '94.6 mi', '1 hour 44 mins', '2,878 mi', '1 day 18 hours', '1,286 mi', '18 hours 43 mins', '1,742 mi', '1 day 2 hours', '2,871 mi', '1 day 18 hours']

Alas, an array of only the things we want. Maybe we should separate this into two arrays, where each array contains the same type of information:

myValues = ExtractValues(r.json(), 'text')
arr = myValues.res
print('durations = ', arr[1::2])
print('distances = ', arr[2::1])

# -----------------------------------
# Output:
# -----------------------------------
durations = ['3 hours 54 mins', '1 hour 44 mins', '1 day 18 hours', '18 hours 43 mins', '1 day 2 hours', '1 day 18 hours']

distances = ['94.6 mi', '1 hour 44 mins', '2,878 mi', '1 day 18 hours', '1,286 mi', '18 hours 43 mins', '1,742 mi', '1 day 2 hours', '2,871 mi', '1 day 18 hours']

Feel Free to Improve

While it's nice to have a function that seems to solve all your problems at the moment, there's obvious room for improvement here. I'll leave that in your hands.

Author image
New York City Website
Product manager turned engineer with ongoing identity crisis. Breaks everything before learning best practices. Completely normal and emotionally stable.

Product manager turned engineer with ongoing identity crisis. Breaks everything before learning best practices. Completely normal and emotionally stable.