Geocoding Raw Datasets for Mapbox

    This wouldn't be a proper data blog unless we spent the vast majority of our time talking about cleaning data. Chances are, if we're pursuing analysis that's groundbreaking (or at least worthwhile), we're starting with some ugly, untapped information. It turns out Mapbox has an API specifically for this purpose: the Mapbox Geocoding API.

    Geocoding is a blanket term for turning vague information into specific Lat/Long coordinates. How vague, you ask? The API covers:

    • Pinpointing exact location via street address.
    • Locating regions or cities by recognizable name (e.g., Rio de Janeiro).
    • Locating cities by highly unspecific name (geocoding "Springfield" will return results for 41 American cities).
    • Locating cities or venues by name within a given region (such as searching for Ray's Pizza in NYC).

    We can also use Geocoding to do the reverse of this, where passing in coordinates will return location names. If you find this useful, I'm assuming you're a spy.
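
    For the spies among us, here is a minimal sketch of a reverse lookup with the Mapbox Python SDK (introduced in the next section). It assumes the SDK's Geocoder exposes reverse(lon=..., lat=...) and picks up the token from the MAPBOX_ACCESS_TOKEN environment variable. The reverse_geocode() helper and its geocoder parameter are hypothetical names used only for illustration:

```python
def reverse_geocode(lon, lat, geocoder=None):
    """Return the top-ranked place name for a coordinate pair, or None."""
    if geocoder is None:
        # Imported lazily so the helper can also be exercised with a stub.
        from mapbox import Geocoder
        geocoder = Geocoder()  # assumes MAPBOX_ACCESS_TOKEN is set

    # Note the argument order: longitude first, then latitude.
    response = geocoder.reverse(lon=lon, lat=lat)
    response.raise_for_status()

    features = response.json().get('features', [])
    return features[0]['place_name'] if features else None
```

    Passing in a coordinate pair returns whichever place name Mapbox ranks first for that point, or None when nothing matches.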

    The Mapbox Python SDK

    The magic we'll be using today is the Mapbox Python SDK. Installing the Mapbox Python library is simple:

    pip3 install mapbox

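    The only configuration we need to worry about is saving our Mapbox token. The SDK looks for it in the MAPBOX_ACCESS_TOKEN environment variable; one way to set it is to save our key to a .env file stored in the directory of our script (and load that file with a tool such as pipenv or python-dotenv):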

    MAPBOX_ACCESS_TOKEN="pk.eyJ1IjoidGhldGCjfdUTFDZhdGhlciIsImEiOiJjanBkcW5tM3cza3gxM3FvYmtqMWc4N2lmIn0.Ec_d0ScVv-icz7-EGRGDF"
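
    A .env file only helps once its contents reach the process environment, and Python won't read it on its own. Below is a stdlib-only sketch of that step, run before Geocoder() is created. load_env_file() is a hypothetical helper, and its parsing is deliberately naive (no multiline values, no "export" prefixes); the python-dotenv package is the sturdier choice:

```python
import os


def load_env_file(path='.env'):
    """Copy KEY=value pairs from a .env-style file into os.environ."""
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Skip blanks, comments, and anything that isn't KEY=value.
            if not line or line.startswith('#') or '=' not in line:
                continue
            key, _, value = line.partition('=')
            # Drop optional surrounding quotes; never clobber a variable
            # that is already set in the real environment.
            os.environ.setdefault(key.strip(), value.strip().strip('"\''))
```

    After calling load_env_file(), os.environ['MAPBOX_ACCESS_TOKEN'] holds the token for the rest of the script.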

    Getting to Know the Mapbox Python SDK

    Mapbox has a lot of capabilities, and the Python SDK covers many of them by separating each into its own part of the greater Mapbox package. Here's a high-level view of the most common services we can import from mapbox:

    • Geocoding: This is the feature we'll be using today to transform street addresses into lat/long coordinates.
    • Directions: Get step-by-step directions between two points and draw routes to visualize these directions.
    • Static Maps: Generate static map images.
    • Upload: Upload data to be used in Mapbox Studio.
    • Datasets: Manage collections of GeoJSON data.

    Geocoding Lat/Long From Addresses

    In our Citibike example, we started to play with New York public datasets. These are great sources of information containing things like taxi pickup/dropoff locations, public transit stations, etc. These start and end points are typically stored as addresses as opposed to coordinates. Today, we'll look at how to "geocode" these values into longitude and latitude.

    Geocoding a Single Input

    To get started, we need to import Geocoder and create a global geocoder object:

    from mapbox import Geocoder
    
    geocoder = Geocoder()

    Our geocoder object has a method called forward() which accepts an address in the form of a string and outputs a whole bunch of more useful information. It's that simple:

    response = geocoder.forward('New York City, NY')
    print(response.content)

    forward() returns a requests Response object. We can view the raw body using response.content, or parse it into a dictionary with response.json():

    {
      "type": "FeatureCollection",
      "query": ["new", "york", "city", "ny"],
      "features": [{
            "id": "place.15278078705964500",
            "type": "Feature",
            "place_type": ["place"],
            "relevance": 1,
            "properties": {
              "wikidata": "Q60"
            },
            "text": "New York",
            "place_name": "New York, New York, United States",
            "matching_text": "New York City",
            "matching_place_name": "New York City, New York, United States",
            "bbox": [-74.2590879797556, 40.477399, -73.7008392055224, 40.917576401307],
            "center": [-73.9808, 40.7648],
            "geometry": {
              "type": "Point",
              "coordinates": [-73.9808, 40.7648]
            },
            "context": [{
              "id": "region.14044236392855570",
              "short_code": "US-NY",
              "wikidata": "Q1384",
              "text": "New York"
            }, {
              "id": "country.9053006287256050",
              "short_code": "us",
              "wikidata": "Q30",
              "text": "United States"
            }]
          }
          ...
         ]
    }

    Let's appreciate how awesome this is. Not only do we get the lat/long coordinates of NYC's center point, but we also have the coordinates of the bounding box which encompasses all of NYC, stored as bbox. There's also all sorts of useful metadata like the country and context of this location.
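
    To make that concrete, here's a small sketch that pulls the fields mentioned above out of a single feature. summarize_feature() is a hypothetical helper, and it assumes the feature has the same shape as the sample response shown here:

```python
def summarize_feature(feature):
    """Flatten the most useful parts of one geocoding feature."""
    # 'center' is ordered [longitude, latitude].
    lon, lat = feature['center']
    return {
        'name': feature['place_name'],
        'lon': lon,
        'lat': lat,
        # [min lon, min lat, max lon, max lat]; not every feature has one.
        'bbox': feature.get('bbox'),
        # Enclosing areas (region, country, ...) as plain names.
        'context': [item['text'] for item in feature.get('context', [])],
    }
```

    With the response from earlier, summarize_feature(response.json()['features'][0]) gives a flat dictionary that is easy to drop into a DataFrame row.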

    If we offer too vague a location (like I did), we can receive multiple results in our list of features. The number stored as relevance is a score between 0 and 1 indicating how closely each result matches our query.
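
    One way to tame vague queries is to narrow the search area and then keep only confident matches. The sketch below assumes forward() accepts country and bbox keyword arguments, which is my understanding of the SDK's signature, so check your installed version. best_match() and the 0.8 cutoff are hypothetical choices, not anything Mapbox prescribes:

```python
# Bounding box copied from the New York City response above:
# (min longitude, min latitude, max longitude, max latitude).
NYC_BBOX = (-74.2590879797556, 40.477399, -73.7008392055224, 40.917576401307)


def best_match(address, geocoder, min_relevance=0.8):
    """Return the most relevant feature inside NYC, or None if nothing qualifies."""
    response = geocoder.forward(address, country=['us'], bbox=NYC_BBOX)
    features = response.json().get('features', [])

    # Discard weak matches, then take the highest-scoring survivor.
    candidates = [f for f in features if f.get('relevance', 0) >= min_relevance]
    return max(candidates, key=lambda f: f['relevance'], default=None)
```

    Returning None for weak matches makes it easy to flag rows that need a human look instead of silently accepting a bad guess.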

    Geocoding En Masse

    Geocoding one input is fun, but not useful. Let's expand our script to handle geocoding hundreds or thousands of addresses.
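
    At that scale the same station name shows up again and again, and every lookup is a network request that counts against our Mapbox usage. One option (not part of the original script) is to remember answers we've already fetched. make_cached_geocoder() is a hypothetical helper; geocode can be any function that maps an address string to coordinates, such as the geocode_address() defined further down:

```python
def make_cached_geocoder(geocode):
    """Wrap an address -> coordinates function so each address is looked up once."""
    cache = {}

    def cached(address):
        # Only hit the underlying function for addresses we haven't seen.
        if address not in cache:
            cache[address] = geocode(address)
        return cache[address]

    return cached
```

    Wrapping the lookup once, e.g. cached_geocode = make_cached_geocoder(geocode_address), and calling cached_geocode() everywhere else keeps the rest of the script unchanged.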

    We're going to use Pandas to load a CSV of Citibike data, then use the Pandas .apply() method to fill out the missing lat/long columns in our dataset. First things first, let's load our data:

    ...
    import pandas as pd
    
    
    def load_dataset():
        """Load data from CSV."""
        citiDF = pd.read_csv('data/citibike.csv').head(5)
        return citiDF
    
    
    citiDF = load_dataset()
    

    Let's see how that looks:

    start_station_name    end_station_name
    1 Ave & E 16 St       Mott St & Prince St
    1 Ave & E 30 St       E 39 St & 2 Ave
    1 Ave & E 62 St       E 75 St & 3 Ave
    2 Ave & E 31 St       9 Ave & W 14 St

    So, we'll need to geocode two columns per row. We can use the Pandas .apply() method here to geocode our data. I've put together a quick script for us to do just that:

    from mapbox import Geocoder
    import pandas as pd
    
    geocoder = Geocoder()
    
    
    def load_dataset():
        """Load data from CSV."""
        citiDF = pd.read_csv('data/citibike.csv').head(5)
        return citiDF
    
    
    def geocode_address(address):
        """Geocode street address into a 'longitude, latitude' string."""
        response = geocoder.forward(address)
        features = response.json().get('features', [])
        if not features:
            return None  # nothing matched this address
        # Mapbox returns 'center' as [longitude, latitude].
        lon, lat = features[0]['center']
        return f'{lon}, {lat}'
    
    
    def geocode_dataframe(row):
        """Geocode the start and end address of a single row."""
        # Assign to the row, not to citiDF: writing a scalar to citiDF[...]
        # here would overwrite the entire column on every call.
        row['start_station_latlong'] = geocode_address(row['start_station_name'])
        row['end_station_latlong'] = geocode_address(row['end_station_name'])
        return row


    citiDF = load_dataset()
    # apply() builds a new DataFrame from the returned rows, so keep the result.
    citiDF = citiDF.apply(geocode_dataframe, axis=1)
    citiDF.to_csv('data/citibike_output.csv')

    • load_dataset() loads a CSV into a Pandas DataFrame.
    • geocode_address() accepts a single address input and outputs a "longitude, latitude" string (Mapbox returns coordinates in that order, whatever we name our columns).
    • geocode_dataframe() is applied to every row of our data to geocode the start and end stations.

    Running this script adds both columns to every row. Here's the last row of the output:

    FIELD1  start_station_name  end_station_name  start_station_latlong  end_station_latlong
    4       2 Ave & E 31 St     9 Ave & W 14 St   -70.655286, 45.961259  -74.0053437, 40.7409462

    Mostly a success: the end station lands in Manhattan, but a latitude of 45.96 puts the start station hundreds of miles north of New York City. Bare intersection names are ambiguous, so spot-check a few rows before trusting the rest. Let's create a visual for this, shall we?

    Turning Your Datasets into Tilesets

    In Mapbox terms, a Tileset is essentially a layer of data we can overlay on top of our blank map. The map style we created last time was really just the aesthetic underpinning of all the interesting data we can pile on top.

    Tilesets can be stacked on one another, thus giving us infinite possibilities of the types of data we can communicate: especially when you consider that Mapbox supports Heatmaps and topology - far more than just plotted points.

    We're going to learn how to plot points on a map using Mapbox Studio. I've put together a dataset of fake data to demonstrate how we'd plot these points. Here's a sample of this garbage:

    address                        name                lat         long
    761 ST ANNS AVE NY NY 10451    Royal Hiett         40.754466   -73.9794552
    5 COLUMBUS CIR NY NY 10019     Yolanda Antonio     40.8201997  -73.9110324
    1145 LENOX RD NY NY 11212      Marguerita Autry    40.7667595  -73.9815704
    2800 VICTORY BLVD NY NY 10314  Alyse Peranio       40.6597804  -73.9183181
    750 LEXINGTON AVE NY NY 10022  Sina Walberg        40.6080557  -74.1532418
    29 BAY RIDGE AVE NY NY 11220   Ignacia Frasher     40.7625148  -73.9685564
    550 RIVERSIDE DR NY NY 10027   Marta Haymond       40.6386587  -74.034633
    808 W END AVE NY NY 10025      Angie Tseng         40.8159612  -73.960318
    41-03 69 ST NY NY 11377        Marcella Weinstock  40.797233   -73.9713245
    50 PARK AVE NY NY 10016        Filiberto Everett   40.7444514  -73.8956728
    739 BROOK AVE NY NY 10451      Vernia Mcgregor     40.7492656  -73.9803386
    777 W END AVE NY NY 10025      Michelina Althoff   40.8199675  -73.9122757
    866 E 165 ST NY NY 10459       Dave Tauber         40.7965956  -73.9726135
    130 E 37 ST NY NY 10016        Tandra Gowen        40.8237011  -73.8990202
    797 ST ANNS AVE NY NY 10451    Toby Philbrick      40.7482336  -73.978566
    41 AARON LN NY NY 10309        Aisha Grief         40.82089    -73.9109118
    641 LEXINGTON AVE NY NY 10022  Tarah Sinkler       40.5541368  -74.2126653
    4201 4 AVE NY NY 11232         Coletta Jeansonne   40.7590297  -73.9703219
    1021 PARK AVE NY NY 10028      Lorie Shriver       40.650317   -74.0081672
    127 RIVERSIDE DR NY NY 10024   Antwan Fullilove    40.7794132  -73.9572475
    5120 BROADWAY NY NY 10034      Normand Beerman     40.7890613  -73.9806569
    7124 20 AVE NY NY 11204        Wes Nieman          40.8714856  -73.9130362
    3506 BEDFORD AVE NY NY 11210   Marlen Hutcherson   40.6127972  -73.9901551
    550 GRAND ST NY NY 10002       Leonie Lablanc      40.6168306  -73.9501481
    1711 GROVE ST NY NY 11385      Doris Herrman       40.7143151  -73.9800558
    785 W END AVE NY NY 10025      Cyndy Kossman       40.7032053  -73.9111942
    6040 HUXLEY AVE NY NY 10471    Donya Ponte         40.796763   -73.972483

    Head Over to Mapbox Studio

    While we can technically do everything programmatically, Mapbox's GUI is simply too easy to ignore. Using Mapbox Studio, we can upload our data and turn it into a tileset: the heart and soul of what makes our maps interesting.

    Once you've uploaded your CSV (or JSON, or whatever) as a dataset, we can immediately see what this information looks like on a map by previewing it as a tileset. Mapbox is surprisingly intelligent in that it can deduce lat/long values from poorly named or formatted columns (such as Lat/Long, Latitutde/Longitude, start_longitude_lol/start_latitude_lmao, etc). Mapbox gets it right most of the time.
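
    If you'd rather not lean on column-name guessing, one option is to hand Mapbox GeoJSON instead, since GeoJSON always orders coordinates as [longitude, latitude]. A minimal sketch, assuming each row is a dict whose lat and long keys hold the latitude and longitude respectively (for example, rows read with csv.DictReader). rows_to_geojson() is a hypothetical helper:

```python
def rows_to_geojson(rows):
    """Convert dict rows with 'lat'/'long' keys into a GeoJSON FeatureCollection."""
    features = []
    for row in rows:
        features.append({
            'type': 'Feature',
            'geometry': {
                'type': 'Point',
                # GeoJSON position order is longitude first, then latitude.
                'coordinates': [float(row['long']), float(row['lat'])],
            },
            # Carry the remaining columns along for labeling points later.
            'properties': {'address': row['address'], 'name': row['name']},
        })
    return {'type': 'FeatureCollection', 'features': features}
```

    The resulting dictionary can be written to disk with json.dump() and uploaded in place of the CSV.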

    If all went well, you should see a cluster of points on a map - this is a preview of your Tileset. Think of this as a layer in Photoshop: we can stack these layers of information atop one another continuously to make something greater than the sum of its parts.

    If all looks good, export your Tileset via the "export" button on the top right.

    Upload your dataset and click "edit"

    Switch Over to Your Map "Style"

    Your map 'style' is your blank canvas. Get in there and add a layer, and from there select the Tileset you just created. Once your Tileset is loaded, you can style the points themselves and even label them with the data in your dataset as you see fit:

    So many colorful layers.

    Simply clicking around the preloaded Tilesets should start giving you ideas of what's possible down the line. Just look at those horrifically bright Miami Vice themed streets.

    Feel free to get creative with Mapbox's tools to clarify the visual story you're trying to tell. I've distinguished some points from the others after adding a third dataset: every Starbucks in New York City. Yes, those map pins have been replaced with that terrifying Starbucks Logo Mermaid Sea-demon.

    Take a look at that perfect grid of mocha frappa-whatevers and tell me these guys don't have a business strategy:

    God that's an ugly map.

    For what it's worth, I'd like to sincerely apologize for blinding your eyes with classless use of gifs paired with the useless corporate monstrosity of a map I've created. I have faith that you'll do better.

    Now that we've spent enough time covering the n00b stuff, it's time to take the gloves off. While Mapbox Studio's GUI serves as an amazing crutch and way to customize the look of our data, we must not forget: we're programmers, God damn it! True magic lies in 1s and 0s, not WYSIWYG editors.

    Until we start using Plot.ly Dash, that is.

    (Suddenly, thousands of fans erupt into a roaring cheer at the very mention of Plot.ly. It's about time.™)
    Todd Birchard's' avatar
    Todd Birchard
    New York City Website
    Engineer with an ongoing identity crisis. Breaks everything before learning best practices. Completely normal and emotionally stable.