Geocoding Raw Datasets for Mapbox

Make sense of unstructured data with enough precision to put it on a map.

This wouldn't be a proper data blog unless we spent the vast majority of our time talking about cleaning data. Chances are, if you're pursuing analysis that's groundbreaking (or worthwhile), you're probably starting with some ugly, untapped information. It turns out Mapbox has an API specifically for this purpose: the Mapbox Geocoding API.

Geocoding is a blanket term for turning vague information into specific Lat/Long coordinates. How vague, you ask? The API covers:

  • Pinpointing exact location via street address.
  • Locating regions or cities by recognizable name (e.g., Rio de Janeiro).
  • Locating cities by highly unspecific name (geocoding "Springfield" will return results for 41 American cities).
  • Locating cities or venues by name within a given region (such as searching for Ray's Pizza in NYC).

We can also use Geocoding to do the reverse of this, where passing in coordinates will return location names. If you find this useful, I'm assuming you're a spy.
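To make the forward and reverse cases concrete, here is a minimal sketch that only builds the request URLs and never touches the network. The helper names and the "TOKEN" placeholder are made up for illustration. The proximity and limit query parameters are the ones I'd reach for to handle the "Ray's Pizza in NYC" case, so check them against the current Geocoding API docs before relying on them.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://api.mapbox.com/geocoding/v5/mapbox.places/"


def forward_geocode_url(query, token, proximity=None, limit=None):
    """Build a forward-geocoding URL: free text in, coordinates out."""
    params = {"access_token": token}
    if proximity is not None:
        # proximity biases results toward a point, given as (longitude, latitude).
        params["proximity"] = f"{proximity[0]},{proximity[1]}"
    if limit is not None:
        params["limit"] = limit
    # safe="" also escapes "/" so an address can't break the URL path.
    return f"{BASE_URL}{quote(query, safe='')}.json?{urlencode(params)}"


def reverse_geocode_url(longitude, latitude, token):
    """Build a reverse-geocoding URL: coordinates in, place names out."""
    query = urlencode({"access_token": token})
    return f"{BASE_URL}{longitude},{latitude}.json?{query}"


# Hypothetical usage: search for a venue near a point in Manhattan.
print(forward_geocode_url("Ray's Pizza", "TOKEN", proximity=(-73.99, 40.73), limit=1))
print(reverse_geocode_url(-73.99, 40.73, "TOKEN"))
```

Note that both helpers take longitude first, which is the order Mapbox expects.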

Chipping Away at a Real Use Case

In a real-life example, I have two sets of data: one represents general places of residence for a particular sample group. The goal is to see how they interact with the second dataset: a list of locations they will be traveling to. I'd love to go into more detail, but:

I get one cliche meme per year.

So how can we use the Mapbox Geocoding API to systematically extract coordinates for thousands of addresses, from multiple datasets? With Pandas, of course!

I'm Just Happy to be Writing About Pandas Right Now

Pardon my excitement; I've been far overdue for posting anything Pandas-related. It's been killing me on the inside.

We need to make sense of some vague data. As seen in our Citi Bike example, New York has plenty of public datasets with information like taxi pickups/dropoffs, public transit, etc. These start and end points are typically too fluid to have Lat/Long coordinates associated with them, so we'll add them ourselves. Given that we're about to pass in hundreds or thousands of addresses and locations, we'll use Pandas' .apply() to fill out the missing Lat/Long columns in our dataset.

Instead of using Mapbox's Python SDK, I'll actually be using requests to hit the Mapbox REST API. For some reason, the Python SDK was a bit unpredictable on my last run.*

*UPDATE: the Python SDK "wasn't working" because I apparently don't know the difference between longitude and latitude. Awesome, so I'm a moron.
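Since that mix-up is so easy to make, here's a small sketch of pulling coordinates out of a geocoding response. The response follows GeoJSON, which lists each point as [longitude, latitude]. The sample dictionary below is a hand-written, heavily trimmed stand-in with made-up values, not real API output, and extract_lat_long is a hypothetical helper name.

```python
def extract_lat_long(response_json):
    """Return (latitude, longitude) for the top match, or None if no match."""
    features = response_json.get("features", [])
    if not features:
        return None
    # GeoJSON order is [longitude, latitude], the reverse of how we say "lat/long".
    longitude, latitude = features[0]["geometry"]["coordinates"]
    return latitude, longitude


# Hand-written stand-in for a response body (values are illustrative only).
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-73.98, 40.75]},
        }
    ],
}

print(extract_lat_long(sample))
print(extract_lat_long({"features": []}))
```

Unpacking both values in one line makes the ordering explicit, and returning None for an empty result keeps the "no match" case visible to the caller.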

import os
from urllib.parse import quote

import pandas as pd
import requests


class GeocodeAddresses:
    """Add missing lat/long information to an existing dataset."""

    def __init__(self, address_data):
        self.data = address_data
        self.address_df = pd.read_csv(self.data)
        self.complete_data = self.get_coords(self.address_df)

    @staticmethod
    def get_coords(address_df):
        """Fill the DataFrame's Lat/Long columns."""

        def fill_coords(row):
            """Geocode a single row's home address via the Mapbox Geocoding API."""
            base_url = 'https://api.mapbox.com/geocoding/v5/mapbox.places/'
            # URL-encode the address so spaces and symbols survive the request.
            address = quote(str(row.home_address), safe='')
            endpoint = base_url + address + '.json'
            params = {
                # Read the token from the environment instead of hard-coding it.
                # The variable name MAPBOX_ACCESS_TOKEN is just a convention.
                'access_token': os.environ['MAPBOX_ACCESS_TOKEN']
            }
            r = requests.get(endpoint, params=params, timeout=10)
            try:
                # Mapbox returns coordinates as [longitude, latitude].
                long, lat = r.json()['features'][0]['geometry']['coordinates']
                return pd.Series([lat, long])
            except (KeyError, IndexError):
                # No match for this address: leave both columns empty.
                return pd.Series([None, None])

        # Only geocode the first 100 rows; the rest are left empty.
        address_df[['Lat', 'Long']] = address_df.head(100).apply(fill_coords, axis=1)
        address_df.to_csv('geocoded.csv')
        return address_df

In the above example, we're using .apply() with axis=1 to run fill_coords() against each row of our DataFrame, then assigning the result to two new columns (Lat and Long). Because fill_coords() returns a two-value Series, those values fill the new columns on a per-row basis.
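If the mechanics of that assignment feel opaque, this stripped-down sketch shows the same pattern with the network call swapped out for a dictionary lookup. Everything here is invented for illustration: the addresses, the coordinate values, and the FAKE_RESULTS and fake_fill_coords names.

```python
import pandas as pd

# Stand-in for the geocoder: made-up addresses mapped to made-up (lat, long) pairs.
FAKE_RESULTS = {
    "A ST": (10.0, 20.0),
    "B AVE": (30.0, 40.0),
}


def fake_fill_coords(row):
    """Return a two-value Series per row, just like the real fill_coords()."""
    lat, long = FAKE_RESULTS.get(row.home_address, (None, None))
    return pd.Series([lat, long])


df = pd.DataFrame({"home_address": ["A ST", "B AVE", "C BLVD"]})

# axis=1 hands each row to the function; the two returned values
# land in the two new columns, aligned on the row index.
df[["Lat", "Long"]] = df.apply(fake_fill_coords, axis=1)
print(df)
```

The row with no match ("C BLVD") ends up with missing values in both columns rather than raising an error, which is the behavior we want when a real address fails to geocode.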

For the scope of this tutorial, we'll simply focus on getting these points plotted. Don't worry, this is only part 2 of our Mapbox series! Yes, an entire series!

Turning Your Datasets into Tilesets

In Mapbox terms, a Tileset is essentially a layer of data we can overlay on top of our blank map. The map style we created last time was really just the aesthetic underpinning of all the interesting data we can pile on top.

Tilesets can be stacked on one another, giving us endless possibilities for the types of data we can communicate, especially when you consider that Mapbox supports heatmaps and topology: far more than just plotted points.

First, we'll plot our origins. I've put together a dataset of completely falsified names (with presumably real addresses?) to demonstrate how we'd plot these points. Here's a sample of the garbage I'll be feeding into Mapbox:

address name lat long
761 ST ANNS AVE NY NY 10451 Royal Hiett 40.754466 -73.9794552
5 COLUMBUS CIR NY NY 10019 Yolanda Antonio 40.8201997 -73.9110324
1145 LENOX RD NY NY 11212 Marguerita Autry 40.7667595 -73.9815704
2800 VICTORY BLVD NY NY 10314 Alyse Peranio 40.6597804 -73.9183181
750 LEXINGTON AVE NY NY 10022 Sina Walberg 40.6080557 -74.1532418
29 BAY RIDGE AVE NY NY 11220 Ignacia Frasher 40.7625148 -73.9685564
550 RIVERSIDE DR NY NY 10027 Marta Haymond 40.6386587 -74.034633
808 W END AVE NY NY 10025 Angie Tseng 40.8159612 -73.960318
41-03 69 ST NY NY 11377 Marcella Weinstock 40.797233 -73.9713245
50 PARK AVE NY NY 10016 Filiberto Everett 40.7444514 -73.8956728
739 BROOK AVE NY NY 10451 Vernia Mcgregor 40.7492656 -73.9803386
777 W END AVE NY NY 10025 Michelina Althoff 40.8199675 -73.9122757
866 E 165 ST NY NY 10459 Dave Tauber 40.7965956 -73.9726135
130 E 37 ST NY NY 10016 Tandra Gowen 40.8237011 -73.8990202
797 ST ANNS AVE NY NY 10451 Toby Philbrick 40.7482336 -73.978566
41 AARON LN NY NY 10309 Aisha Grief 40.82089 -73.9109118
641 LEXINGTON AVE NY NY 10022 Tarah Sinkler 40.5541368 -74.2126653
4201 4 AVE NY NY 11232 Coletta Jeansonne 40.7590297 -73.9703219
1021 PARK AVE NY NY 10028 Lorie Shriver 40.650317 -74.0081672
127 RIVERSIDE DR NY NY 10024 Antwan Fullilove 40.7794132 -73.9572475
5120 BROADWAY NY NY 10034 Normand Beerman 40.7890613 -73.9806569
7124 20 AVE NY NY 11204 Wes Nieman 40.8714856 -73.9130362
3506 BEDFORD AVE NY NY 11210 Marlen Hutcherson 40.6127972 -73.9901551
550 GRAND ST NY NY 10002 Leonie Lablanc 40.6168306 -73.9501481
1711 GROVE ST NY NY 11385 Doris Herrman 40.7143151 -73.9800558
785 W END AVE NY NY 10025 Cyndy Kossman 40.7032053 -73.9111942
6040 HUXLEY AVE NY NY 10471 Donya Ponte 40.796763 -73.972483
640 5 AVE NY NY 10019 Tomika Amundsen 40.9087793 -73.8976726
789 ST ANNS AVE NY NY 10451 Carmon Troche 40.7594313 -73.9771225
6 CROWN AVE NY NY 10312 Octavio Cheatham 40.8206299 -73.9107598
673 MADISON AVE NY NY 10065 Niesha Whitelow 40.5531407 -74.1828179
641 5 AVE NY NY 10022 Ivette Labadie 40.7647958 -73.9701323
785 ST ANNS AVE NY NY 10451 Barb Kane 40.7590778 -73.9762805
1280 53 ST NY NY 11219 Robena Hendren 40.8205987 -73.9106042
177 W END AVE NY NY 11235 Noel Bender 40.6327791 -73.9946515
5302 21 AVE NY NY 11204 Kelley Forsberg 40.5775561 -73.9527078
375 RIVERSIDE DR NY NY 10025 Grazyna Victory 40.6220287 -73.9768454
173 RIVERSIDE DR NY NY 10024 Carin Ploof 40.8206272 -73.9107468
799 ST ANNS AVE NY NY 10451 Samara Arn 40.6007256 -74.1514735
23 WAVERLY PL NY NY 10003 Nga Uhlman 40.8044903 -73.9678653
2370 OCEAN AVE NY NY 11229 Elvie Scheuerman 40.7916213 -73.9781146
623 5 AVE NY NY 10022 Berna Wince 40.8208802 -73.9108469
805 ST ANNS AVE NY NY 10451 Denese Ollis 40.7306449 -73.9944405
239 CENTRAL PARK W NY NY 10024 Haydee Kunkel 40.6037567 -73.9526531
77-01 45 AVE NY NY 11373 Kiyoko Fontes 40.7580741 -73.9763654
109 MONT SEC AVE NY NY 10305 Dwana Mcgill 40.8208328 -73.9106101
1028 5 AVE NY NY 10028 Yasuko Furlong 40.784054 -73.9707483
610 W END AVE NY NY 10024 Corrie Roca 40.7412524 -73.8874201
8200 BAY PKWY NY NY 11214 Shanon Henninger 40.6073983 -74.0576452
1245 PARK AVE NY NY 10029 Harris Caldwell 40.7800202 -73.9611665
32 WASHINGTON SQ W NY NY 10011 Lidia Abdalla 40.7910765 -73.9757384
1160 PARK AVE NY NY 10128 Karin Rudy 40.8205961 -73.9105913
666 PARK AVE NY NY 10065 Min Winkleman 40.6042992 -73.9918571
1841 BROADWAY NY NY 10023 Estefana Mercedes 40.7866462 -73.9521502
30-02 BROADWAY NY NY 11106 Shaunna Pino 40.7318156 -73.9992709
47 MARKHAM LN NY NY 10310 Marybelle Gerth 40.7842533 -73.9547967
950 5 AVE NY NY 10021 Mervin Sterner 40.6018127 -74.1481708
902 59 ST NY NY 11219 Garrett Mata 40.7681576 -73.9664751
1115 5 AVE NY NY 10128 Ashly Stansfield 40.7691585 -73.9824282
3 COLUMBUS CIR NY NY 10019 Delila Cassinelli 40.76217 -73.9261799
5401 FILLMORE AVE NY NY 11234 Elfriede Buttram 40.6393636 -74.1167883

Head Over to Mapbox Studio

While we can technically do everything programmatically, Mapbox's GUI is simply too easy to ignore. Using Mapbox Studio, we can upload our data and turn it into a tileset: the heart and soul of what makes our maps interesting.

Once you've uploaded your CSV (or JSON, or whatever) as a dataset, we can immediately see what this information looks like on a map by previewing it as a tileset. Mapbox is surprisingly intelligent in that it can deduce lat/long values from poorly named or formatted columns (such as Lat/Long, LataiUtutde/LoNGitdue, start_longitude_lol/start_latitude_lmao, etc). Mapbox gets it right most of the time.
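If you'd rather not lean on Mapbox's column guessing, one option is to convert the CSV to GeoJSON yourself before uploading, so the coordinates are unambiguous. The sketch below uses only the standard library. The csv_to_geojson name, the default column names, and the sample row are all assumptions for illustration, so adjust them to match your own file.

```python
import csv
import io
import json


def csv_to_geojson(csv_text, lat_col="lat", long_col="long"):
    """Convert CSV text with lat/long columns into a GeoJSON FeatureCollection."""
    features = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lat = float(row.pop(lat_col))
        long = float(row.pop(long_col))
        features.append({
            "type": "Feature",
            # GeoJSON puts longitude first.
            "geometry": {"type": "Point", "coordinates": [long, lat]},
            # Remaining columns ride along as properties, handy for labels.
            "properties": row,
        })
    return {"type": "FeatureCollection", "features": features}


# Illustrative row with made-up values, not taken from the real dataset.
sample_csv = "address,name,lat,long\n1 EXAMPLE ST,Jane Doe,40.5,-73.5\n"
geojson = csv_to_geojson(sample_csv)
print(json.dumps(geojson, indent=2))
```

Keeping the non-coordinate columns as properties means they stay available when you label points in your map style later.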

If all went well, you should see a cluster of points on a map: this is a preview of your tileset. Think of this as a layer in Photoshop: we can stack these layers of information atop one another continuously to make something greater than the sum of its parts.

If all looks good, export your tileset via the "export" button on the top right.

Upload your dataset and click "edit"

Switch Over to Your Map "Style"

Your map 'style' is your blank canvas. Get in there and add a layer, and from there select the tileset you just created. Once your tileset is loaded, you can style the points themselves and even label them with the data in your dataset as you see fit:

So many colorful layers.

Simply clicking around the preloaded tilesets should start giving you ideas of what's possible down the line. Just look at those horrifically bright Miami Vice themed streets.

Feel free to get creative with Mapbox's tools to clarify the visual story you're trying to tell. I've distinguished points from others after adding a third dataset: every Starbucks in New York City. Yes, those map pins have been replaced with that terrifying Starbucks Logo Mermaid Sea-demon.

Take a look at that perfect grid of mocha frappa-whatevers and tell me these guys don't have a business strategy:

God that's an ugly map.

For what it's worth, I'd like to sincerely apologize for blinding your eyes with classless use of gifs paired with the useless corporate monstrosity of a map I've created. I have faith that you'll do better.

Now that we've spent enough time covering the n00b stuff, it's time to take the gloves off. While Mapbox Studio's GUI serves as an amazing crutch and way to customize the look of our data, we must not forget: we're programmers, God damn it! True magic lies in 1s and 0s, not WYSIWYG editors.

Until we start using Plot.ly Dash, that is.

(Suddenly, thousands of fans erupt into a roaring cheer at the very mention of Plot.ly. It's about time.™)
Todd Birchard
New York City
Product manager turned engineer with an ongoing identity crisis. Breaks everything before learning best practices. Completely normal and emotionally stable.