Geocoding Raw Datasets for Mapbox

Make sense of unstructured data with enough precision to put it on a map.

    This wouldn't be a proper data blog unless we spent the vast majority of our time talking about cleaning data. Chances are, if you're pursuing analysis that's groundbreaking (or worthwhile), you're probably starting with some ugly, untapped information. It turns out Mapbox has an API specifically for this purpose: the Mapbox Geocoding API.

    Geocoding is a blanket term for turning vague information into specific Lat/Long coordinates. How vague, you ask? The API covers:

    • Pinpointing exact location via street address.
    • Locating regions or cities by recognizable name (e.g., Rio de Janeiro).
    • Locating cities by highly unspecific name (geocoding "Springfield" will return results for 41 American cities).
    • Locating cities or venues by name within a given region (such as searching for Ray's Pizza in NYC).
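    As a rough sketch of what a forward-geocoding request for the cases above could look like: the helper below only builds the URL and query parameters, so it runs without a network call. The endpoint shown (the v5 mapbox.places endpoint) and the helper name forward_geocode_request are assumptions for illustration, as is the placeholder token. The proximity parameter takes a longitude,latitude pair and biases results toward that point, which is one way to search for a venue within a region.

```python
from urllib.parse import quote

# Assumption: the v5 "mapbox.places" forward-geocoding endpoint.
GEOCODING_URL = "https://api.mapbox.com/geocoding/v5/mapbox.places"


def forward_geocode_request(query, access_token, proximity=None, limit=1):
    """Build the URL and query parameters for a forward-geocoding call.

    `proximity` is an optional (longitude, latitude) pair used to bias
    results toward a location, e.g. to look for "Ray's Pizza" near NYC.
    """
    # Percent-encode the search text so spaces and quotes are URL-safe.
    url = f"{GEOCODING_URL}/{quote(query, safe='')}.json"
    params = {"access_token": access_token, "limit": limit}
    if proximity is not None:
        lon, lat = proximity
        params["proximity"] = f"{lon},{lat}"
    return url, params


# Hypothetical usage: a venue search biased toward midtown Manhattan.
url, params = forward_geocode_request(
    "Ray's Pizza", "YOUR_MAPBOX_TOKEN", proximity=(-73.9857, 40.7484)
)
print(url)
print(params)
```

    The returned pair can be handed straight to requests.get(url, params=params) once a real token is in place.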

    We can also use Geocoding to do the reverse of this, where passing in coordinates will return location names. If you find this useful, I'm assuming you're a spy.
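    For the reverse direction, a minimal sketch might look like the following. It assumes the same v5 mapbox.places endpoint, where the coordinates go in the URL path with longitude first. The function names and the trimmed-down sample payload are hypothetical; a real response carries many more fields than place_name.

```python
# Assumption: the v5 "mapbox.places" endpoint also serves reverse lookups.
GEOCODING_URL = "https://api.mapbox.com/geocoding/v5/mapbox.places"


def reverse_geocode_request(latitude, longitude, access_token):
    """Build the URL and query parameters for a reverse-geocoding call."""
    # The path takes longitude first, then latitude.
    url = f"{GEOCODING_URL}/{longitude},{latitude}.json"
    params = {"access_token": access_token}
    return url, params


def first_place_name(payload):
    """Return the top result's place name, or None if nothing matched."""
    features = payload.get("features") or []
    return features[0]["place_name"] if features else None


url, params = reverse_geocode_request(40.7484, -73.9857, "YOUR_MAPBOX_TOKEN")
print(url)

# Hypothetical, heavily trimmed response used only to exercise the parser.
sample_payload = {"features": [{"place_name": "Some Street, New York, New York"}]}
print(first_place_name(sample_payload))
```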

    Chipping Away at a Real Use Case

    In a real-life example, I have two sets of data: one represents general places of residence for a particular sample group. The goal is to see how they interact with the second dataset: a list of locations they will be traveling to. I'd love to go into more detail, but:

    I get one cliche meme per year.

    So how can we use the Mapbox Geocoding API to systematically extract coordinates for thousands of addresses, from multiple datasets? With Pandas, of course!

    I'm Just Happy to be Writing About Pandas Right Now

    Pardon my excitement; I've been far overdue for posting anything Pandas-related. It's been killing me on the inside.

    We need to make sense of some vague data. As seen in our Citibike example, New York has plenty of public datasets with information like taxi pickups/dropoffs, public transit, etc. These start and end points are typically too fluid to have Lat/Long coordinates associated with them, so we'll add them in ourselves. Given that we're about to pass hundreds or thousands of addresses and locations, we'll use Pandas .apply() to fill out the missing Lat/Long columns in our dataset.
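    To make the .apply() pattern concrete before any API calls get involved, here is a small sketch that swaps the real geocoder for a stand-in lookup. The fake_geocode function and the two-row DataFrame are purely illustrative; the one coordinate pair is taken from the sample data later in this post.

```python
import pandas as pd


def fake_geocode(address):
    """Stand-in for a real geocoder: returns (lat, long) or (None, None)."""
    lookup = {
        "761 ST ANNS AVE NY NY 10451": (40.754466, -73.9794552),
    }
    return lookup.get(address, (None, None))


def fill_coords(row):
    """Return the two new column values for a single row."""
    lat, long = fake_geocode(row["home_address"])
    return pd.Series({"Lat": lat, "Long": long})


df = pd.DataFrame(
    {"home_address": ["761 ST ANNS AVE NY NY 10451", "NOT A REAL ADDRESS"]}
)

# axis=1 calls fill_coords once per row; each returned Series becomes
# one row of the two new columns.
df[["Lat", "Long"]] = df.apply(fill_coords, axis=1)
print(df)
```

    Rows the lookup cannot resolve simply end up with empty Lat/Long values, which is the same behavior we want from the real thing.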

    Instead of using Mapbox's Python SDK, I'll actually be using requests to hit the Mapbox REST API. For some reason, the Python SDK was a bit unpredictable on my last run.*

    *UPDATE: the Python SDK "wasn't working" because I apparently don't know the difference between longitude and latitude. Awesome, so I'm a moron.

    import pandas as pd
    import requests


    class GeocodeAddresses:
        """Add missing lat/long information to an existing dataset."""

        def __init__(self, address_data):
            self.address_data = address_data
            self.address_df = pd.read_csv(self.address_data)
            self.complete_data = self.get_coords(self.address_df)

        def get_coords(self, address_df):
            """Fill DataFrame lat/long columns."""

            def fill_coords(row):
                """Geocode one row's address and return its lat/long."""
                base_url = ''  # Mapbox Geocoding endpoint URL goes here
                address = str(row.home_address)
                extension = '.json'
                endpoint = base_url + address + extension
                params = {
                    'access_token':  'pk.eyJ1IjNoYXJkd2VthisisreallolaXdyNHQ3OTUifQ.VTAUrmzD91Ppxr1AJww'
                }
                headers = {
                    'Content-Type': 'application/json'
                }
                r = requests.get(endpoint, params=params, headers=headers)
                try:
                    # Mapbox returns coordinates as [longitude, latitude].
                    Long, Lat = r.json()['features'][0]['geometry']['coordinates']
                    print(pd.Series([Lat, Long]))
                    return pd.Series([Lat, Long])
                except IndexError:
                    # No match for this address: leave both columns empty.
                    return pd.Series([None, None])

            # Geocode the first 100 rows; the remaining rows are left as NaN.
            address_df[['Lat', 'Long']] = address_df.head(100).apply(fill_coords, axis=1)
            return address_df

    In the above example, .apply() runs fill_coords() once per row (axis=1), and we assign the result to our empty Lat/Long columns rather than overwriting the entire DataFrame. Because fill_coords() returns two values as a Series, those values fill the two empty columns on a per-row basis.
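    Since mixing up longitude and latitude is apparently an occupational hazard, it may help to isolate the response parsing into its own small function. This is a sketch, not the exact code used above: extract_lat_long is a hypothetical helper, and the sample payload is a trimmed-down stand-in for a real response. The one thing worth remembering is that GeoJSON positions are ordered longitude first, then latitude.

```python
def extract_lat_long(payload):
    """Return (latitude, longitude) for the top result, or (None, None)."""
    features = payload.get("features") or []
    if not features:
        # The geocoder found nothing for this query.
        return None, None
    # GeoJSON positions are ordered [longitude, latitude].
    longitude, latitude = features[0]["geometry"]["coordinates"][:2]
    return latitude, longitude


# Hypothetical trimmed response; real responses include many more fields.
sample = {
    "features": [
        {"geometry": {"type": "Point", "coordinates": [-73.9815704, 40.7667595]}}
    ]
}

print(extract_lat_long(sample))
print(extract_lat_long({"features": []}))
```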

    For the scope of this tutorial, we'll simply focus on getting these points plotted. Don't worry, this is only part 2 of our Mapbox series! Yes, an entire series!

    Turning Your Datasets into Tilesets

    In Mapbox terms, a Tileset is essentially a layer of data we can overlay on top of our blank map. The map style we created last time was really just the aesthetic underpinning of all the interesting data we can pile on top.

    Tilesets can be stacked on one another, thus giving us infinite possibilities of the types of data we can communicate: especially when you consider that Mapbox supports Heatmaps and topology - far more than just plotted points.

    First, we'll plot our origins. I've put together a dataset of completely falsified names (with presumably real addresses?) to demonstrate how we'd plot these points. Here's a sample of the garbage I'll be feeding into Mapbox:

    address,name,lat,long
    761 ST ANNS AVE NY NY 10451,Royal Hiett,40.754466,-73.9794552
    5 COLUMBUS CIR NY NY 10019,Yolanda Antonio,40.8201997,-73.9110324
    1145 LENOX RD NY NY 11212,Marguerita Autry,40.7667595,-73.9815704
    2800 VICTORY BLVD NY NY 10314,Alyse Peranio,40.6597804,-73.9183181
    750 LEXINGTON AVE NY NY 10022,Sina Walberg,40.6080557,-74.1532418
    29 BAY RIDGE AVE NY NY 11220,Ignacia Frasher,40.7625148,-73.9685564
    550 RIVERSIDE DR NY NY 10027,Marta Haymond,40.6386587,-74.034633
    808 W END AVE NY NY 10025,Angie Tseng,40.8159612,-73.960318
    41-03 69 ST NY NY 11377,Marcella Weinstock,40.797233,-73.9713245
    50 PARK AVE NY NY 10016,Filiberto Everett,40.7444514,-73.8956728
    739 BROOK AVE NY NY 10451,Vernia Mcgregor,40.7492656,-73.9803386
    777 W END AVE NY NY 10025,Michelina Althoff,40.8199675,-73.9122757
    866 E 165 ST NY NY 10459,Dave Tauber,40.7965956,-73.9726135
    130 E 37 ST NY NY 10016,Tandra Gowen,40.8237011,-73.8990202
    797 ST ANNS AVE NY NY 10451,Toby Philbrick,40.7482336,-73.978566
    41 AARON LN NY NY 10309,Aisha Grief,40.82089,-73.9109118
    641 LEXINGTON AVE NY NY 10022,Tarah Sinkler,40.5541368,-74.2126653
    4201 4 AVE NY NY 11232,Coletta Jeansonne,40.7590297,-73.9703219
    1021 PARK AVE NY NY 10028,Lorie Shriver,40.650317,-74.0081672
    127 RIVERSIDE DR NY NY 10024,Antwan Fullilove,40.7794132,-73.9572475
    5120 BROADWAY NY NY 10034,Normand Beerman,40.7890613,-73.9806569
    7124 20 AVE NY NY 11204,Wes Nieman,40.8714856,-73.9130362
    3506 BEDFORD AVE NY NY 11210,Marlen Hutcherson,40.6127972,-73.9901551
    550 GRAND ST NY NY 10002,Leonie Lablanc,40.6168306,-73.9501481
    1711 GROVE ST NY NY 11385,Doris Herrman,40.7143151,-73.9800558
    785 W END AVE NY NY 10025,Cyndy Kossman,40.7032053,-73.9111942
    6040 HUXLEY AVE NY NY 10471,Donya Ponte,40.796763,-73.972483

    Head Over to Mapbox Studio

    While we can technically do everything programmatically, Mapbox's GUI is simply too easy to ignore. Using Mapbox Studio, we can upload our data and turn it into a tileset: the heart and soul of what makes our maps interesting.

    Once you've uploaded your CSV (or JSON, or whatever) as a dataset, we can immediately see what this information looks like on a map by previewing it as a tileset. Mapbox is surprisingly intelligent in that it can deduce lat/long values from poorly named or formatted columns (such as Lat/Long, Latitutde/Longitude, start_longitude_lol/start_latitude_lmao, etc). Mapbox gets it right most of the time.
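    If you'd rather not rely on Mapbox guessing your column names, one option is to hand it GeoJSON instead, where the coordinates are explicit. The sketch below converts a geocoded DataFrame into a FeatureCollection of points. The to_feature_collection helper and its column-name defaults are assumptions for illustration; the two rows reuse values from the sample above, plus one deliberately un-geocoded row.

```python
import json

import pandas as pd


def to_feature_collection(df, lat_col="lat", lon_col="long"):
    """Convert rows with lat/long columns into a GeoJSON FeatureCollection."""
    features = []
    # Skip rows that were never geocoded.
    geocoded = df.dropna(subset=[lat_col, lon_col])
    for record in geocoded.to_dict(orient="records"):
        lat = record.pop(lat_col)
        lon = record.pop(lon_col)
        features.append(
            {
                "type": "Feature",
                # GeoJSON positions are ordered [longitude, latitude].
                "geometry": {"type": "Point", "coordinates": [lon, lat]},
                # Every remaining column becomes a feature property.
                "properties": record,
            }
        )
    return {"type": "FeatureCollection", "features": features}


df = pd.DataFrame(
    {
        "address": ["761 ST ANNS AVE NY NY 10451", "NOWHERE IN PARTICULAR"],
        "name": ["Royal Hiett", "Nobody"],
        "lat": [40.754466, None],
        "long": [-73.9794552, None],
    }
)

geojson = to_feature_collection(df)
print(json.dumps(geojson, indent=2))
```

    Writing that dictionary out with json.dump() gives you a file you can upload in place of the CSV, and the leftover columns ride along as properties you can use for labels.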

    If all went well, you should see a cluster of points on a map - this is a preview of your Tileset. Think of this as a layer in Photoshop: we can stack these layers of information atop one another continuously to make something greater than the sum of its parts.

    If all looks good, export your Tileset via the "export" button on the top right.

    Upload your dataset and click "edit"

    Switch Over to Your Map "Style"

    Your map 'style' is your blank canvas. Get in there and add a layer, and from there select the Tileset you just created. Once your Tileset is loaded, you can style the points themselves and even label them with the data in your dataset as you see fit:

    So many colorful layers.

    Simply clicking around the preloaded Tilesets should start giving you ideas of what's possible down the line. Just look at those horrifically bright Miami Vice themed streets.

    Feel free to get creative with Mapbox's tools to clarify the visual story you're trying to tell. I've distinguished points from one another after adding a third dataset: every Starbucks in New York City. Yes, those map pins have been replaced with that terrifying Starbucks logo mermaid sea-demon.

    Take a look at that perfect grid of mocha frappa-whatevers and tell me these guys don't have a business strategy:

    God that's an ugly map.

    For what it's worth, I'd like to sincerely apologize for blinding your eyes with classless use of gifs paired with the useless corporate monstrosity of a map I've created. I have faith that you'll do better.

    Now that we've spent enough time covering the n00b stuff, it's time to take the gloves off. While Mapbox Studio's GUI serves as an amazing crutch and way to customize the look of our data, we must not forget: we're programmers, God damn it! True magic lies in 1s and 0s, not WYSIWYG editors.

    Until we start using Dash, that is.

    (Suddenly, thousands of fans erupt into a roaring cheer at the very mention of It's about time.™)
    Todd Birchard
    New York City
    Product manager turned engineer with an ongoing identity crisis. Breaks everything before learning best practices. Completely normal and emotionally stable.
