In his post about extracting data from APIs, Todd demonstrated a nice way to massage JSON into a pandas DataFrame. This method works great when our JSON response is flat, because dict.keys()
only gets the keys on the first "level" of a dictionary. It gets a little trickier when our JSON starts to become nested though, as I experienced when working with Spotify's API via the Spotipy library. For example, take a look at a response from their https://api.spotify.com/v1/tracks/{id}
endpoint:
Here's the output:
In addition to plenty of information about the track, Spotify also includes information about the album that contains the track. If we were to just use the dict.keys()
method to turn this response into a DataFrame, we'd be missing out on all that extra album information. Well, it would be there, just not readily accessible.
Output:
So how do we get around this? Well, we could write our own function, but because pandas is amazing, it already has a built in tool that takes care of this for us.
Data Normalization
Meet json_normalize()
:
Output:
album.album_type | album.artists | album.available_markets | album.external_urls.spotify | album.href | album.id | album.images | album.name | album.release_date | album.release_date_precision | ... | external_urls.spotify | href | id | is_local | name | popularity | preview_url | track_number | type | uri | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | album | [{'external_urls': {'spotify': 'https://open.s... | [AR, BO, BR, CA, CL, CO, CR, EC, GT, HK, HN, I... | https://open.spotify.com/album/6pWpb4IdPu9vp9m... | https://api.spotify.com/v1/albums/6pWpb4IdPu9v... | 6pWpb4IdPu9vp9mOdh5DjY | [{'height': 640, 'url': 'https://i.scdn.co/ima... | Real Emotional Trash | 2008-03-04 | day | ... | https://open.spotify.com/track/0BDYBajZydY54OT... | https://api.spotify.com/v1/tracks/0BDYBajZydY5... | 0BDYBajZydY54OTgQsH940 | False | Real Emotional Trash | 21 | https://p.scdn.co/mp3-preview/4fcbcd5a99fc7590... | 4 | track | spotify:track:0BDYBajZydY54OTgQsH940 |
Yep – it's that easy. pandas takes our nested JSON object, flattens it out, and turns it into a DataFrame.
This makes our life easier when we're dealing with one record, but it really comes in handy when we're dealing with a response that contains multiple records.
Output:
Output:
album.album_type | album.artists | album.available_markets | album.external_urls.spotify | album.href | album.id | album.images | album.name | album.release_date | album.release_date_precision | ... | external_urls.spotify | href | id | is_local | name | popularity | preview_url | track_number | type | uri | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | album | [{'external_urls': {'spotify': 'https://open.s... | [AR, BO, BR, CA, CL, CO, CR, EC, GT, HK, HN, I... | https://open.spotify.com/album/6pWpb4IdPu9vp9m... | https://api.spotify.com/v1/albums/6pWpb4IdPu9v... | 6pWpb4IdPu9vp9mOdh5DjY | [{'height': 640, 'url': 'https://i.scdn.co/ima... | Real Emotional Trash | 3/4/2008 | day | ... | https://open.spotify.com/track/0BDYBajZydY54OT... | https://api.spotify.com/v1/tracks/0BDYBajZydY5... | 0BDYBajZydY54OTgQsH940 | FALSE | Real Emotional Trash | 21 | https://p.scdn.co/mp3-preview/4fcbcd5a99fc7590... | 4 | track | spotify:track:0BDYBajZydY54OTgQsH940 |
1 | album | [{'external_urls': {'spotify': 'https://open.s... | [AR, BO, BR, CA, CL, CO, CR, EC, GT, HK, HN, I... | https://open.spotify.com/album/6pWpb4IdPu9vp9m... | https://api.spotify.com/v1/albums/6pWpb4IdPu9v... | 6pWpb4IdPu9vp9mOdh5DjY | [{'height': 640, 'url': 'https://i.scdn.co/ima... | Real Emotional Trash | 3/4/2008 | day | ... | https://open.spotify.com/track/7fdUqrzb8oCcIoK... | https://api.spotify.com/v1/tracks/7fdUqrzb8oCc... | 7fdUqrzb8oCcIoKvFuzMrs | FALSE | Cold Son | 25 | https://p.scdn.co/mp3-preview/4cf4e21727def470... | 3 | track | spotify:track:7fdUqrzb8oCcIoKvFuzMrs |
2 | album | [{'external_urls': {'spotify': 'https://open.s... | [AR, BO, BR, CA, CL, CO, CR, EC, GT, HK, HN, I... | https://open.spotify.com/album/6pWpb4IdPu9vp9m... | https://api.spotify.com/v1/albums/6pWpb4IdPu9v... | 6pWpb4IdPu9vp9mOdh5DjY | [{'height': 640, 'url': 'https://i.scdn.co/ima... | Real Emotional Trash | 3/4/2008 | day | ... | https://open.spotify.com/track/0islTY4Fw6lhYbf... | https://api.spotify.com/v1/tracks/0islTY4Fw6lh... | 0islTY4Fw6lhYbfqi8Qtdj | FALSE | Hopscotch Willie | 24 | https://p.scdn.co/mp3-preview/c7782dc6d7c0bb12... | 2 | track | spotify:track:0islTY4Fw6lhYbfqi8Qtdj |
3 | album | [{'external_urls': {'spotify': 'https://open.s... | [AR, BO, BR, CA, CL, CO, CR, EC, GT, HK, HN, I... | https://open.spotify.com/album/6pWpb4IdPu9vp9m... | https://api.spotify.com/v1/albums/6pWpb4IdPu9v... | 6pWpb4IdPu9vp9mOdh5DjY | [{'height': 640, 'url': 'https://i.scdn.co/ima... | Real Emotional Trash | 3/4/2008 | day | ... | https://open.spotify.com/track/3jyFLbljUTKjE13... | https://api.spotify.com/v1/tracks/3jyFLbljUTKj... | 3jyFLbljUTKjE13nIWXchH | FALSE | Dragonfly Pie | 26 | https://p.scdn.co/mp3-preview/50f419e7d3e8a6a7... | 1 | track | spotify:track:3jyFLbljUTKjE13nIWXchH |
4 | album | [{'external_urls': {'spotify': 'https://open.s... | [AR, AU, BO, BR, CA, CL, CO, CR, DO, EC, GT, H... | https://open.spotify.com/album/5DMvSCwRqfNVlMB... | https://api.spotify.com/v1/albums/5DMvSCwRqfNV... | 5DMvSCwRqfNVlMB5LjHOwG | [{'height': 640, 'url': 'https://i.scdn.co/ima... | Sparkle Hard | 5/18/2018 | day | ... | https://open.spotify.com/track/6dNmC2YWtWbVOFO... | https://api.spotify.com/v1/tracks/6dNmC2YWtWbV... | 6dNmC2YWtWbVOFOdTuRDQs | FALSE | Difficulties - Let Them Eat Vowels | 35 | https://p.scdn.co/mp3-preview/787be9d1bbebcd84... | 11 | track | spotify:track:6dNmC2YWtWbVOFOdTuRDQs |
Separate Ways (Worlds Apart)
By default, json_normalize()
uses periods .
to indicate nested levels of the JSON object (which is actually converted to a Python dict
by Spotipy). In our case, the album id is found in track['album']['id']
, hence the period between album and id in the DataFrame. This makes things slightly annoying if we want to grab a Series from our new DataFrame. In pandas, we can grab a Series from a DataFrame in many ways. To grab the album.id
column, for example:
tracks_df['album.id']
Output:
0 6pWpb4IdPu9vp9mOdh5DjY
1 6pWpb4IdPu9vp9mOdh5DjY
2 6pWpb4IdPu9vp9mOdh5DjY
3 6pWpb4IdPu9vp9mOdh5DjY
4 5DMvSCwRqfNVlMB5LjHOwG
Name: album.id, dtype: object
or...
tracks_df.loc[:,'album.id']
Output:
0 6pWpb4IdPu9vp9mOdh5DjY
1 6pWpb4IdPu9vp9mOdh5DjY
2 6pWpb4IdPu9vp9mOdh5DjY
3 6pWpb4IdPu9vp9mOdh5DjY
4 5DMvSCwRqfNVlMB5LjHOwG
Name: album.id, dtype: object
Pandas also allows us to use dot notation (i.e. dataframe.column_name
) to grab a column as a Series, but only if our column name doesn't include a period already. Since json_normalize()
uses a period as a separator by default, this ruins that method. Never fear though – overriding this behavior is as simple as overriding the default argument in the function call:
Output:
album_album_type | album_artists | album_available_markets | album_external_urls_spotify | album_href | album_id | album_images | album_name | album_release_date | album_release_date_precision | ... | external_urls_spotify | href | id | is_local | name | popularity | preview_url | track_number | type | uri | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | album | [{'external_urls': {'spotify': 'https://open.s... | [AR, BO, BR, CA, CL, CO, CR, EC, GT, HK, HN, I... | https://open.spotify.com/album/6pWpb4IdPu9vp9m... | https://api.spotify.com/v1/albums/6pWpb4IdPu9v... | 6pWpb4IdPu9vp9mOdh5DjY | [{'height': 640, 'url': 'https://i.scdn.co/ima... | Real Emotional Trash | 3/4/2008 | day | ... | https://open.spotify.com/track/0BDYBajZydY54OT... | https://api.spotify.com/v1/tracks/0BDYBajZydY5... | 0BDYBajZydY54OTgQsH940 | FALSE | Real Emotional Trash | 21 | https://p.scdn.co/mp3-preview/4fcbcd5a99fc7590... | 4 | track | spotify:track:0BDYBajZydY54OTgQsH940 |
1 | album | [{'external_urls': {'spotify': 'https://open.s... | [AR, BO, BR, CA, CL, CO, CR, EC, GT, HK, HN, I... | https://open.spotify.com/album/6pWpb4IdPu9vp9m... | https://api.spotify.com/v1/albums/6pWpb4IdPu9v... | 6pWpb4IdPu9vp9mOdh5DjY | [{'height': 640, 'url': 'https://i.scdn.co/ima... | Real Emotional Trash | 3/4/2008 | day | ... | https://open.spotify.com/track/7fdUqrzb8oCcIoK... | https://api.spotify.com/v1/tracks/7fdUqrzb8oCc... | 7fdUqrzb8oCcIoKvFuzMrs | FALSE | Cold Son | 25 | https://p.scdn.co/mp3-preview/4cf4e21727def470... | 3 | track | spotify:track:7fdUqrzb8oCcIoKvFuzMrs |
2 | album | [{'external_urls': {'spotify': 'https://open.s... | [AR, BO, BR, CA, CL, CO, CR, EC, GT, HK, HN, I... | https://open.spotify.com/album/6pWpb4IdPu9vp9m... | https://api.spotify.com/v1/albums/6pWpb4IdPu9v... | 6pWpb4IdPu9vp9mOdh5DjY | [{'height': 640, 'url': 'https://i.scdn.co/ima... | Real Emotional Trash | 3/4/2008 | day | ... | https://open.spotify.com/track/0islTY4Fw6lhYbf... | https://api.spotify.com/v1/tracks/0islTY4Fw6lh... | 0islTY4Fw6lhYbfqi8Qtdj | FALSE | Hopscotch Willie | 24 | https://p.scdn.co/mp3-preview/c7782dc6d7c0bb12... | 2 | track | spotify:track:0islTY4Fw6lhYbfqi8Qtdj |
3 | album | [{'external_urls': {'spotify': 'https://open.s... | [AR, BO, BR, CA, CL, CO, CR, EC, GT, HK, HN, I... | https://open.spotify.com/album/6pWpb4IdPu9vp9m... | https://api.spotify.com/v1/albums/6pWpb4IdPu9v... | 6pWpb4IdPu9vp9mOdh5DjY | [{'height': 640, 'url': 'https://i.scdn.co/ima... | Real Emotional Trash | 3/4/2008 | day | ... | https://open.spotify.com/track/3jyFLbljUTKjE13... | https://api.spotify.com/v1/tracks/3jyFLbljUTKj... | 3jyFLbljUTKjE13nIWXchH | FALSE | Dragonfly Pie | 26 | https://p.scdn.co/mp3-preview/50f419e7d3e8a6a7... | 1 | track | spotify:track:3jyFLbljUTKjE13nIWXchH |
4 | album | [{'external_urls': {'spotify': 'https://open.s... | [AR, AU, BO, BR, CA, CL, CO, CR, DO, EC, GT, H... | https://open.spotify.com/album/5DMvSCwRqfNVlMB... | https://api.spotify.com/v1/albums/5DMvSCwRqfNV... | 5DMvSCwRqfNVlMB5LjHOwG | [{'height': 640, 'url': 'https://i.scdn.co/ima... | Sparkle Hard | 5/18/2018 | day | ... | https://open.spotify.com/track/6dNmC2YWtWbVOFO... | https://api.spotify.com/v1/tracks/6dNmC2YWtWbV... | 6dNmC2YWtWbVOFOdTuRDQs | FALSE | Difficulties - Let Them Eat Vowels | 35 | https://p.scdn.co/mp3-preview/787be9d1bbebcd84... | 11 | track | spotify:track:6dNmC2YWtWbVOFOdTuRDQs |
Now we can go back to using dot notation to access a column as a Series. This saves us some typing every time we want to grab a column, and it looks a bit nicer (to me, at least). I say worth it.
tracks_df.album_id
Output:
0 6pWpb4IdPu9vp9mOdh5DjY
1 6pWpb4IdPu9vp9mOdh5DjY
2 6pWpb4IdPu9vp9mOdh5DjY
3 6pWpb4IdPu9vp9mOdh5DjY
4 5DMvSCwRqfNVlMB5LjHOwG
Name: album_id, dtype: object
I Need That Record
By including more parameters when we use json_normlize()
, we can really extract just the data that we want from our API response.
From our responses above, we can see that the artist
property contains a list of artists that are associated with a track:
tracks_response['tracks'][0]['artists']
Output:
[{
"external_urls": {
"spotify": "https://open.spotify.com/artist/7wyRA7deGRxozTyBc6QXPe"
},
"href": "https://api.spotify.com/v1/artists/7wyRA7deGRxozTyBc6QXPe",
"id": "7wyRA7deGRxozTyBc6QXPe",
"name": "Stephen Malkmus & The Jicks",
"type": "artist",
"uri": "spotify:artist:7wyRA7deGRxozTyBc6QXPe"
},
{
"external_urls": {
"spotify": "https://open.spotify.com/artist/0WISkx0PwT6lYWdPqKUJY8"
},
"href": "https://api.spotify.com/v1/artists/0WISkx0PwT6lYWdPqKUJY8",
"id": "0WISkx0PwT6lYWdPqKUJY8",
"name": "Stephen Malkmus",
"type": "artist",
"uri": "spotify:artist:0WISkx0PwT6lYWdPqKUJY8"
},
{
"external_urls": {
"spotify": "https://open.spotify.com/artist/7uStwCeP54Za8gXUFCf5L7"
},
"href": "https://api.spotify.com/v1/artists/7uStwCeP54Za8gXUFCf5L7",
"id": "7uStwCeP54Za8gXUFCf5L7",
"name": "The Jicks",
"type": "artist",
"uri": "spotify:artist:7uStwCeP54Za8gXUFCf5L7"
}
]
Let's say I want to load this data into a database later. It would be nice to have a join table that maps each of the artists that are associated with each track. Luckily, this is possible with json_normalize()
's record_path
and meta
parameters.
record_path
tells json_normalize()
what path of keys leads to each individual record in the JSON object. In our case, we want to grab every artist id, so our function call will look like:
json_normalize(
tracks_response['tracks'],
record_path=['artists'],
sep="_"
)
external_urls | href | id | name | type | uri | |
---|---|---|---|---|---|---|
1 | {'spotify': 'https://open.spotify.com/artist/7... | https://api.spotify.com/v1/artists/7wyRA7deGRx... | 7wyRA7deGRxozTyBc6QXPe | Stephen Malkmus & The Jicks | artist | spotify:artist:7wyRA7deGRxozTyBc6QXPe |
1 | {'spotify': 'https://open.spotify.com/artist/0... | https://api.spotify.com/v1/artists/0WISkx0PwT6... | 0WISkx0PwT6lYWdPqKUJY8 | Stephen Malkmus | artist | spotify:artist:0WISkx0PwT6lYWdPqKUJY8 |
2 | {'spotify': 'https://open.spotify.com/artist/7... | https://api.spotify.com/v1/artists/7uStwCeP54Z... | 7uStwCeP54Za8gXUFCf5L7 | The Jicks | artist | spotify:artist:7uStwCeP54Za8gXUFCf5L7 |
3 | {'spotify': 'https://open.spotify.com/artist/7... | https://api.spotify.com/v1/artists/7wyRA7deGRx... | 7wyRA7deGRxozTyBc6QXPe | Stephen Malkmus & The Jicks | artist | spotify:artist:7wyRA7deGRxozTyBc6QXPe |
4 | {'spotify': 'https://open.spotify.com/artist/0... | https://api.spotify.com/v1/artists/0WISkx0PwT6... | 0WISkx0PwT6lYWdPqKUJY8 | Stephen Malkmus | artist | spotify:artist:0WISkx0PwT6lYWdPqKUJY8 |
5 | {'spotify': 'https://open.spotify.com/artist/7... | https://api.spotify.com/v1/artists/7uStwCeP54Z... | 7uStwCeP54Za8gXUFCf5L7 | The Jicks | artist | spotify:artist:7uStwCeP54Za8gXUFCf5L7 |
6 | {'spotify': 'https://open.spotify.com/artist/7... | https://api.spotify.com/v1/artists/7wyRA7deGRx... | 7wyRA7deGRxozTyBc6QXPe | Stephen Malkmus & The Jicks | artist | spotify:artist:7wyRA7deGRxozTyBc6QXPe |
7 | {'spotify': 'https://open.spotify.com/artist/0... | https://api.spotify.com/v1/artists/0WISkx0PwT6... | 0WISkx0PwT6lYWdPqKUJY8 | Stephen Malkmus | artist | spotify:artist:0WISkx0PwT6lYWdPqKUJY8 |
8 | {'spotify': 'https://open.spotify.com/artist/7... | https://api.spotify.com/v1/artists/7uStwCeP54Z... | 7uStwCeP54Za8gXUFCf5L7 | The Jicks | artist | spotify:artist:7uStwCeP54Za8gXUFCf5L7 |
9 | {'spotify': 'https://open.spotify.com/artist/7... | https://api.spotify.com/v1/artists/7wyRA7deGRx... | 7wyRA7deGRxozTyBc6QXPe | Stephen Malkmus & The Jicks | artist | spotify:artist:7wyRA7deGRxozTyBc6QXPe |
10 | {'spotify': 'https://open.spotify.com/artist/0... | https://api.spotify.com/v1/artists/0WISkx0PwT6... | 0WISkx0PwT6lYWdPqKUJY8 | Stephen Malkmus | artist | spotify:artist:0WISkx0PwT6lYWdPqKUJY8 |
11 | {'spotify': 'https://open.spotify.com/artist/7... | https://api.spotify.com/v1/artists/7uStwCeP54Z... | 7uStwCeP54Za8gXUFCf5L7 | The Jicks | artist | spotify:artist:7uStwCeP54Za8gXUFCf5L7 |
12 | {'spotify': 'https://open.spotify.com/artist/7... | https://api.spotify.com/v1/artists/7wyRA7deGRx... | 7wyRA7deGRxozTyBc6QXPe | Stephen Malkmus & The Jicks | artist | spotify:artist:7wyRA7deGRxozTyBc6QXPe |
Cool – we're almost there. Now we want to use the meta
parameter to specify what data we want to include from the rest of the JSON object. In our case, we want to keep the track id and map it to the artist id. If we look back at our API response, the name of the column that included the track is is called, appropriately, id
, so our full function call should look like this:
json_normalize(
tracks_response['tracks'],
record_path=['artists'],
meta=['id'],
sep="_"
)
Output:
-----------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-77e00a98c3c0> in <module>()
----> 1 json_normalize(tracks_response['tracks'],record_path=['artists'],meta=['id'],sep="_")
~/anaconda3/envs/music_data/lib/python3.6/site-packages/pandas/io/json/normalize.py in json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep)
268 if k in result:
269 raise ValueError('Conflicting metadata name {name}, '
--> 270 'need distinguishing prefix '.format(name=k))
271
272 result[k] = np.array(v).repeat(lengths)
ValueError: Conflicting metadata name id, need distinguishing prefix
Uh oh – an error! What's going on? Well, it turns out that both the album id and track id were given the key id
. pandas doesn't like that, and it gives us a helpful error to tell us so: ValueError: Conflicting metadata name id, need distinguishing prefix
.
There are two more parameters we can use to overcome this error: record_prefix
and meta_prefix
. These are strings we'll add to the beginning of our records and metadata to prevent these naming conflicts. Since we're dealing with Spotify artist ids for our records and Spotify track ids as the metadata, I'll use sp_artist_
and sp_track_
respectively. When that's done, I'll select only the columns that we're interested in.
artist_and_track = json_normalize(
data=tracks_response['tracks'],
record_path='artists',
meta=['id'],
record_prefix='sp_artist_',
meta_prefix='sp_track_',
sep="_"
)
artist_and_track = artist_and_track[['sp_track_id','sp_artist_id']]
artist_and_track
Output:
sp_track_id | sp_artist_id | |
---|---|---|
0 | 0BDYBajZydY54OTgQsH940 | 7wyRA7deGRxozTyBc6QXPe |
1 | 0BDYBajZydY54OTgQsH940 | 0WISkx0PwT6lYWdPqKUJY8 |
2 | 0BDYBajZydY54OTgQsH940 | 7uStwCeP54Za8gXUFCf5L7 |
3 | 7fdUqrzb8oCcIoKvFuzMrs | 7wyRA7deGRxozTyBc6QXPe |
4 | 7fdUqrzb8oCcIoKvFuzMrs | 0WISkx0PwT6lYWdPqKUJY8 |
5 | 7fdUqrzb8oCcIoKvFuzMrs | 7uStwCeP54Za8gXUFCf5L7 |
6 | 0islTY4Fw6lhYbfqi8Qtdj | 7wyRA7deGRxozTyBc6QXPe |
7 | 0islTY4Fw6lhYbfqi8Qtdj | 0WISkx0PwT6lYWdPqKUJY8 |
8 | 0islTY4Fw6lhYbfqi8Qtdj | 7uStwCeP54Za8gXUFCf5L7 |
9 | 3jyFLbljUTKjE13nIWXchH | 7wyRA7deGRxozTyBc6QXPe |
10 | 3jyFLbljUTKjE13nIWXchH | 0WISkx0PwT6lYWdPqKUJY8 |
11 | 3jyFLbljUTKjE13nIWXchH | 7uStwCeP54Za8gXUFCf5L7 |
12 | 6dNmC2YWtWbVOFOdTuRDQs | 7wyRA7deGRxozTyBc6QXPe |
TL;DR
- Use
pd.io.json.json_normalize()
to automagically flatten a nested JSON object into a DataFrame - Make your life slightly easier when it comes to selecting columns by overriding the default
sep
parameter - Specify what data constitutes a record with the
record_path
parameter - Include data from outside of the record path with the
meta
parameter - Fix naming conflicts if they arise with the
record_prefix
andmeta_prefix
parameters