This wouldn't be a proper data blog unless we spent the vast majority of our time talking about cleaning data. Chances are, if you're pursuing analysis that's groundbreaking (or at least worthwhile), you're starting with some ugly, untapped information. It turns out Mapbox has an API specifically for this purpose: the Mapbox Geocoding API.
Geocoding is a blanket term for turning vague information into specific Lat/Long coordinates. How vague, you ask? The API covers:
- Pinpointing exact location via street address.
- Locating regions or cities by recognizable name (e.g., Rio de Janeiro).
- Locating cities by highly unspecific name (Geocoding for "Springfield" will return results for 41 American cities).
- Locating cities or venues by name within a given region (such as searching for Ray's Pizza in NYC).
We can also use Geocoding to do the reverse of this, where passing in coordinates will return location names. If you find this useful, I'm assuming you're a spy.
The Mapbox Python SDK
The magic we'll be using today is the Mapbox Python SDK. Installing the Mapbox Python library is simple:
pip3 install mapbox
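The only configuration we need to worry about is our Mapbox access token. The SDK reads it from the MAPBOX_ACCESS_TOKEN environment variable, so one option is to keep the key in a .env file in the same directory as our script and load it into the environment before the script runs: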
MAPBOX_ACCESS_TOKEN="pk.eyJ1IjoidGhldGCjfdUTFDZhdGhlciIsImEiOiJjanBkcW5tM3cza3gxM3FvYmtqMWc4N2lmIn0.Ec_d0ScVv-icz7-EGRGDF"
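A .env file is not picked up on its own as far as I know, so something has to copy it into the environment first. Here's a minimal sketch that does so with only the standard library. Note that `load_env_file` is a hypothetical helper written for this post, not part of the Mapbox SDK, and a package like python-dotenv handles the same job more thoroughly:

```python
import os


def load_env_file(path=".env"):
    """Copy KEY=value pairs from a .env-style file into os.environ.

    Hypothetical helper for illustration; not part of the Mapbox SDK.
    """
    loaded = {}
    with open(path, encoding="utf-8") as env_file:
        for raw_line in env_file:
            line = raw_line.strip()
            # Skip blank lines, comments, and anything that is not KEY=value.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop optional surrounding quotes, as in the example above.
            loaded[key.strip()] = value.strip().strip("\"'")
    os.environ.update(loaded)
    return loaded
```

Calling `load_env_file()` at the top of the script, before any Mapbox object is created, should be enough for the token to be found.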
Getting to Know the Mapbox Python SDK
Mapbox has a lot of capabilities, and the Python SDK covers many of them by splitting them into separate services within the greater Mapbox package. Here's a high-level view of the most common ones we can import from mapbox:
- Geocoding: This is the feature we'll be using today to transform street addresses into lat/long coordinates.
- Directions: Get step-by-step directions between two points and draw routes to visualize these directions.
- Static Maps: Generate static map images.
- Upload: Upload data to be used in Mapbox Studio.
- Datasets: Manage collections of GeoJSON data.
Geocoding Lat/Long From Addresses
In our Citibike example we started to play with New York public datasets. These are great sources of information containing things like Taxi pickup/dropoff locations, public transit stations, etc. These start and end points are typically stored as addresses as opposed to coordinates. Today, we'll look at how to "geocode" these values into longitude and latitude.
Geocoding a Single Input
To get started, we need to import Geocoder and create a geocoder object:
from mapbox import Geocoder
geocoder = Geocoder()
Our geocoder object has a method called forward(), which accepts an address in the form of a string and outputs a whole bunch of more useful information. It's that simple:
response = geocoder.forward('New York City, NY')
print(response.content)
forward() returns a Python requests response, from which we can view the results using response.content:
{
"type": "FeatureCollection",
"query": ["new", "york", "city", "ny"],
"features": [{
"id": "place.15278078705964500",
"type": "Feature",
"place_type": ["place"],
"relevance": 1,
"properties": {
"wikidata": "Q60"
},
"text": "New York",
"place_name": "New York, New York, United States",
"matching_text": "New York City",
"matching_place_name": "New York City, New York, United States",
"bbox": [-74.2590879797556, 40.477399, -73.7008392055224, 40.917576401307],
"center": [-73.9808, 40.7648],
"geometry": {
"type": "Point",
"coordinates": [-73.9808, 40.7648]
},
"context": [{
"id": "region.14044236392855570",
"short_code": "US-NY",
"wikidata": "Q1384",
"text": "New York"
}, {
"id": "country.9053006287256050",
"short_code": "us",
"wikidata": "Q30",
"text": "United States"
}]
}
...
]
}
Let's appreciate how awesome this is. Not only do we get the lat/long coordinates of NYC's center point, but we also have the coordinates of the bounding box which encompasses all of NYC, stored as bbox. There's also all sorts of useful metadata like the country and context of this location.
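To make that concrete, here's a small sketch that pulls the center point and bounding box out of a decoded response. `summarize_top_feature` is a name I made up for this example, and the sample payload is a trimmed copy of the response shown above. One thing worth calling out: Mapbox orders coordinates as longitude first, then latitude, and the bbox runs west, south, east, north.

```python
def summarize_top_feature(payload):
    """Return the name, center, and bounding box of the first feature."""
    features = payload.get("features", [])
    if not features:
        return None
    top = features[0]
    lon, lat = top["center"]  # ordered [longitude, latitude]
    summary = {"name": top.get("place_name"), "lon": lon, "lat": lat}
    bbox = top.get("bbox")  # not every feature carries a bbox, so treat it as optional
    if bbox:
        west, south, east, north = bbox
        summary["bbox"] = {"west": west, "south": south, "east": east, "north": north}
    return summary


# Trimmed copy of the New York City response shown above.
sample = {
    "type": "FeatureCollection",
    "features": [{
        "place_name": "New York, New York, United States",
        "bbox": [-74.2590879797556, 40.477399, -73.7008392055224, 40.917576401307],
        "center": [-73.9808, 40.7648],
    }],
}

print(summarize_top_feature(sample))
```

With a live response, the same function would take `response.json()` instead of the hand-written sample.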
If we offer too vague a location (like I did), we can receive multiple results in our list of features. The number stored as relevance is a score between 0 and 1 indicating how closely Mapbox thinks each result matches our query.
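When a query does come back with several candidates, one simple approach is to keep only the results above a relevance threshold. The sketch below does that; the function name, the 0.9 cutoff, and the sample payload are all made up for illustration, so tune them against your own data.

```python
def confident_matches(payload, min_relevance=0.9):
    """Return (place_name, relevance) pairs at or above a threshold, best first."""
    matches = [
        (feature.get("place_name"), feature.get("relevance", 0))
        for feature in payload.get("features", [])
        if feature.get("relevance", 0) >= min_relevance
    ]
    # sorted() is stable, so ties keep the order Mapbox returned them in.
    return sorted(matches, key=lambda match: match[1], reverse=True)


# Made-up payload for illustration; real ones come from response.json().
ambiguous = {
    "features": [
        {"place_name": "Springfield, Illinois, United States", "relevance": 1},
        {"place_name": "Springfield, Missouri, United States", "relevance": 1},
        {"place_name": "Springfield Avenue, Example Town", "relevance": 0.5},
    ]
}

for name, relevance in confident_matches(ambiguous):
    print(f"{relevance:.2f}  {name}")
```

It's usually better to avoid the ambiguity in the first place. To my knowledge forward() accepts optional arguments such as country, types, and limit, so something like `geocoder.forward('Springfield', country=['us'], types=['place'], limit=5)` should narrow things down before any filtering is needed.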
Geocoding En Masse
Geocoding one input is fun, but not useful. Let's expand our script to handle geocoding hundreds or thousands of addresses.
We're going to use Pandas to load a CSV of Citibike data, then use Pandas' .apply() to fill out the missing Lat/Long columns in our dataset. First things first, let's load our data:
...
import pandas as pd


def load_dataset():
    """Load data from CSV."""
    citiDF = pd.read_csv('data/citibike.csv').head(5)
    return citiDF


citiDF = load_dataset()
Let's see how that looks:
start_station_name | end_station_name |
---|---|
1 Ave & E 16 St | Mott St & Prince St |
1 Ave & E 30 St | E 39 St & 2 Ave |
1 Ave & E 62 St | E 75 St & 3 Ave |
2 Ave & E 31 St | 9 Ave & W 14 St |
So, we'll need to geocode two columns per row. We can use the Pandas .apply() method here to geocode our data. I've put together a quick script for us to do just that:
from mapbox import Geocoder
import pandas as pd

# Geocoder() picks up the token from the MAPBOX_ACCESS_TOKEN environment variable.
geocoder = Geocoder()


def load_dataset():
    """Load the first five rows of the Citibike CSV."""
    citiDF = pd.read_csv('data/citibike.csv').head(5)
    return citiDF


def geocode_address(address):
    """Geocode a street address into a 'long, lat' string."""
    response = geocoder.forward(address)
    # Mapbox returns the center as [longitude, latitude].
    lng, lat = response.json()['features'][0]['center']
    return f'{lng}, {lat}'


def geocode_dataframe(row):
    """Geocode the start and end address of a single row."""
    # Write to the row, not to the whole DataFrame column, so each row keeps its own result.
    row['start_station_latlong'] = geocode_address(row['start_station_name'])
    row['end_station_latlong'] = geocode_address(row['end_station_name'])
    return row


citiDF = load_dataset()
citiDF = citiDF.apply(geocode_dataframe, axis=1)
citiDF.to_csv('data/citibike_output.csv')
- load_dataset() loads a CSV into a Pandas DataFrame.
- geocode_address() accepts a single address input and outputs a lat/long string.
- geocode_dataframe() is applied to every row of our data to geocode the start and end stations.
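Station names repeat a lot in trip data, and not every address will find a match, so a slightly more defensive variant is worth sketching. The version below geocodes each distinct address once and records None when nothing comes back. `geocode_unique` and `fake_lookup` are hypothetical names for this example. The fake lookup stands in for the real API call so the sketch runs offline, and its one known coordinate is borrowed from the sample output in this post.

```python
def geocode_unique(addresses, lookup):
    """Geocode each distinct address once and return {address: coords or None}.

    `lookup` is any callable that takes an address string and returns the
    decoded JSON payload, e.g. lambda a: geocoder.forward(a).json().
    """
    results = {}
    for address in addresses:
        if address in results:
            continue  # already geocoded, so skip the repeat request
        features = lookup(address).get("features", [])
        # An empty feature list means no match was found for this address.
        results[address] = tuple(features[0]["center"]) if features else None
    return results


def fake_lookup(address):
    """Stand-in for the real API call so the sketch runs offline."""
    known = {"9 Ave & W 14 St": [-74.0053437, 40.7409462]}
    if address in known:
        return {"features": [{"center": known[address]}]}
    return {"features": []}


stations = ["9 Ave & W 14 St", "Nowhere St", "9 Ave & W 14 St"]
print(geocode_unique(stations, fake_lookup))
```

The resulting dictionary can then be mapped back onto a DataFrame column, which saves one request for every duplicate station.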
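Running this script gives us output in the following shape. One caveat about this sample: every row shows the same pair of coordinates, and the start pair doesn't even land in New York, so read the values as placeholders, not verified geocodes. Each row should end up with its own coordinates, and bare intersections like these may need a city appended (for example "1 Ave & E 16 St, New York, NY") to resolve correctly.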
FIELD1 | start_station_name | end_station_name | start_station_latlong | end_station_latlong |
---|---|---|---|---|
0 | 1 Ave & E 15 St | 1 Ave & E 18 St | -70.655286, 45.961259 | -74.0053437, 40.7409462 |
1 | 1 Ave & E 16 St | Mott St & Prince St | -70.655286, 45.961259 | -74.0053437, 40.7409462 |
2 | 1 Ave & E 30 St | E 39 St & 2 Ave | -70.655286, 45.961259 | -74.0053437, 40.7409462 |
3 | 1 Ave & E 62 St | E 75 St & 3 Ave | -70.655286, 45.961259 | -74.0053437, 40.7409462 |
4 | 2 Ave & E 31 St | 9 Ave & W 14 St | -70.655286, 45.961259 | -74.0053437, 40.7409462 |
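Before trusting geocoded output, it's worth a quick sanity check that the points fall where we expect. Here's a rough sketch that tests each "long, lat" string against the New York City bounding box from the earlier response. The helper names are mine, and a bounding box is only a coarse filter, not proof of a correct match.

```python
# Bounding box from the New York City response earlier: west, south, east, north.
NYC_BBOX = (-74.2590879797556, 40.477399, -73.7008392055224, 40.917576401307)


def parse_lon_lat(text):
    """Turn a 'long, lat' string like the ones in the table into two floats."""
    lon, lat = (float(part) for part in text.split(","))
    return lon, lat


def in_bbox(lon, lat, bbox=NYC_BBOX):
    """Return True when the point falls inside the bounding box."""
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north


for value in ("-70.655286, 45.961259", "-74.0053437, 40.7409462"):
    print(value, "->", in_bbox(*parse_lon_lat(value)))
```

Of the two coordinate pairs in the table, the first falls outside the box and the second falls inside it, which is a hint that the start addresses need more context to geocode properly.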
With coordinates attached to our data, let's create a visual for this, shall we?
Turning Your Datasets into Tilesets
In Mapbox terms, a Tileset is essentially a layer of data we can overlay on top of our blank map. The map style we created last time was really just the aesthetic underpinning of all the interesting data we can pile on top.
Tilesets can be stacked on one another, thus giving us infinite possibilities of the types of data we can communicate: especially when you consider that Mapbox supports Heatmaps and topology - far more than just plotted points.
We're going to learn how to plot points on a map using Mapbox Studio. I've put together a dataset of fake data to demonstrate how we'd plot these points. Here's a sample of this garbage:
address | name | lat | long |
---|---|---|---|
761 ST ANNS AVE NY NY 10451 | Royal Hiett | 40.754466 | -73.9794552 |
5 COLUMBUS CIR NY NY 10019 | Yolanda Antonio | 40.8201997 | -73.9110324 |
1145 LENOX RD NY NY 11212 | Marguerita Autry | 40.7667595 | -73.9815704 |
2800 VICTORY BLVD NY NY 10314 | Alyse Peranio | 40.6597804 | -73.9183181 |
750 LEXINGTON AVE NY NY 10022 | Sina Walberg | 40.6080557 | -74.1532418 |
29 BAY RIDGE AVE NY NY 11220 | Ignacia Frasher | 40.7625148 | -73.9685564 |
550 RIVERSIDE DR NY NY 10027 | Marta Haymond | 40.6386587 | -74.034633 |
808 W END AVE NY NY 10025 | Angie Tseng | 40.8159612 | -73.960318 |
41-03 69 ST NY NY 11377 | Marcella Weinstock | 40.797233 | -73.9713245 |
50 PARK AVE NY NY 10016 | Filiberto Everett | 40.7444514 | -73.8956728 |
739 BROOK AVE NY NY 10451 | Vernia Mcgregor | 40.7492656 | -73.9803386 |
777 W END AVE NY NY 10025 | Michelina Althoff | 40.8199675 | -73.9122757 |
866 E 165 ST NY NY 10459 | Dave Tauber | 40.7965956 | -73.9726135 |
130 E 37 ST NY NY 10016 | Tandra Gowen | 40.8237011 | -73.8990202 |
797 ST ANNS AVE NY NY 10451 | Toby Philbrick | 40.7482336 | -73.978566 |
41 AARON LN NY NY 10309 | Aisha Grief | 40.82089 | -73.9109118 |
641 LEXINGTON AVE NY NY 10022 | Tarah Sinkler | 40.5541368 | -74.2126653 |
4201 4 AVE NY NY 11232 | Coletta Jeansonne | 40.7590297 | -73.9703219 |
1021 PARK AVE NY NY 10028 | Lorie Shriver | 40.650317 | -74.0081672 |
127 RIVERSIDE DR NY NY 10024 | Antwan Fullilove | 40.7794132 | -73.9572475 |
5120 BROADWAY NY NY 10034 | Normand Beerman | 40.7890613 | -73.9806569 |
7124 20 AVE NY NY 11204 | Wes Nieman | 40.8714856 | -73.9130362 |
3506 BEDFORD AVE NY NY 11210 | Marlen Hutcherson | 40.6127972 | -73.9901551 |
550 GRAND ST NY NY 10002 | Leonie Lablanc | 40.6168306 | -73.9501481 |
1711 GROVE ST NY NY 11385 | Doris Herrman | 40.7143151 | -73.9800558 |
785 W END AVE NY NY 10025 | Cyndy Kossman | 40.7032053 | -73.9111942 |
6040 HUXLEY AVE NY NY 10471 | Donya Ponte | 40.796763 | -73.972483 |
Head Over to Mapbox Studio
While we can technically do everything programmatically, Mapbox's GUI is simply too easy to ignore. Using Mapbox Studio, we can upload our data and turn it into a tileset: the heart and soul of what makes our maps interesting.
Once you've uploaded your CSV (or JSON, or whatever) as a dataset, we can immediately see what this information looks like on a map by previewing it as a tileset. Mapbox is surprisingly intelligent in that it can deduce lat/long values from poorly named or formatted columns (such as Lat/Long, Latitutde/Longitude, start_longitude_lol/start_latitude_lmao, etc). Mapbox gets it right most of the time.
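If you'd rather not lean on that guesswork, you can hand over GeoJSON with the coordinates already in place. Here's a minimal sketch, using only the standard library, that converts CSV rows into a GeoJSON FeatureCollection. The function name and the default column names are assumptions for this example, and the two sample rows come from the table above with the 40.x values treated as latitude.

```python
import csv
import io
import json


def csv_to_geojson(csv_text, lat_field="lat", lon_field="long"):
    """Convert CSV rows with lat/long columns into a GeoJSON FeatureCollection."""
    features = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lat = float(row.pop(lat_field))
        lon = float(row.pop(lon_field))
        features.append({
            "type": "Feature",
            # GeoJSON positions are ordered [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": row,  # remaining columns, e.g. address and name
        })
    return {"type": "FeatureCollection", "features": features}


# Two rows from the sample table, with the 40.x values treated as latitude.
sample_csv = """address,name,lat,long
761 ST ANNS AVE NY NY 10451,Royal Hiett,40.754466,-73.9794552
5 COLUMBUS CIR NY NY 10019,Yolanda Antonio,40.8201997,-73.9110324
"""

print(json.dumps(csv_to_geojson(sample_csv), indent=2))
```

Writing that output to a .geojson file gives you something unambiguous to upload, with the remaining columns available as properties for labels and styling.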
If all went well you should see a cluster of points on a map - this is a preview of your Tileset. Think of this as a layer in Photoshop: we can stack these layers of information atop one another continuously to make something greater than the sum of its parts.
If all looks good, export your Tileset via the "export" button on the top right.
Switch Over to Your Map "Style"
Your map 'style' is your blank canvas. Get in there and add a layer, and from there select the Tileset you just created. Once your Tileset is loaded, you can style the points themselves and even label them with the data in your dataset as you see fit.
Simply clicking around the preloaded Tilesets should start giving you ideas of what's possible down the line. Just look at those horrifically bright Miami Vice themed streets.
Feel free to get creative with Mapbox's tools to clarify the visual story you're trying to tell. I've distinguished one set of points from the others after adding a third data set: Every Starbucks in New York City. Yes, those map pins have been replaced with that terrifying Starbucks Logo Mermaid Sea-demon.
Take a look at that perfect grid of mocha frappa-whatevers and tell me these guys don't have a business strategy.
For what it's worth, I'd like to sincerely apologize for blinding your eyes with classless use of gifs paired with the useless corporate monstrosity of a map I've created. I have faith that you'll do better.
Now that we've spent enough time covering the n00b stuff, it's time to take the gloves off. While Mapbox Studio's GUI serves as an amazing crutch and way to customize the look of our data, we must not forget: we're programmers, God damn it! True magic lies in 1s and 0s, not WYSIWYG editors.
Until we start using Plot.ly Dash, that is.
(Suddenly, thousands of fans erupt into a roaring cheer at the very mention of Plot.ly. It's about time.™)