Whether it be Kaggle, Google Cloud, or the federal government, there's plenty of reliable open-sourced data on the web. While there are plenty of reasons to hate being alive in our current chapter of humanity, open data is one of the few redeeming qualities of life on Earth today. But what is the opposite of "open" data, anyway?
Like anything free and easily accessible, the only data inherently worth anything is either harvested privately or stolen from sources that would prefer you didn't have it. This is the sort of data business models can be built around, as social media platforms such as LinkedIn have shown us while our personal information is bought and sold by data brokers. These companies have attempted to sue individual programmers like ourselves for scraping the data they collected via the same means, and epically lost in a court of law:
The topic of scraping data on the web tends to raise questions about the ethics and legality of scraping, to which I plead: don't hold back. If you aren't personally disgusted by the prospect of your life being transcribed, sold, and frequently leaked, the court system has ruled that you legally have a right to scrape data. The name of this publication is not People Who Play It Safe And Slackers. We're a home for those who fight to take power back, and we're going to scrape the shit out of you.
Tools for the Job
Web scraping in Python is dominated by three major libraries: BeautifulSoup, Scrapy, and Selenium. Each of these libraries is designed for very different use cases, so it's essential to understand what we're choosing and why.
- BeautifulSoup is one of the most prolific Python libraries, in some part having shaped the web as we know it. BeautifulSoup is a lightweight, easy-to-learn, and highly effective way to programmatically isolate information on a single webpage at a time. It's common to use BeautifulSoup in conjunction with the requests library, where `requests` fetches a page and `BeautifulSoup` extracts the resulting data.
- Scrapy has an agenda that is much closer to mass pillaging than BeautifulSoup. Scrapy is a tool for building crawlers: these are absolute monstrosities unleashed upon the web like a swarm, loosely following links and hastily grabbing data where data exists to be grabbed. Because Scrapy serves the purpose of mass-scraping, it is much easier to get in trouble with Scrapy.
- Selenium isn't exclusively a scraping tool as much as an automation tool that can scrape sites. Selenium is the nuclear option for attempting to navigate sites programmatically and should be treated as such: there are much better options for simple data extraction.
We'll be using BeautifulSoup, which should be anybody's default choice until the circumstances ask for more. BeautifulSoup is more than enough to steal data.
Preparing Our Extraction
Before we steal any data, we need to set the stage. We'll start by installing our two libraries of choice:
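Assuming pip is your package manager of choice, something along these lines should do the trick:

```shell
pip install requests beautifulsoup4
```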
As mentioned, requests will provide us with our target's HTML, and beautifulsoup4 will parse that data.
We need to recognize that many sites have precautions to fend off scrapers from accessing their data. The first thing we can do to get around this is spoofing the headers we send along with our requests to make our scraper look like a legitimate browser:
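Here's a rough sketch of what that might look like. The exact header values are illustrative; any realistic set of browser headers will do:

```python
import requests

# Spoofed headers to make our requests look like they come from a real browser.
# These values are examples; swap in whatever your own browser sends.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}
```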
This is only a first line of defense (or offense, in our case). There are plenty of ways sites can still keep us at bay, but setting headers works shockingly well to fix most issues.
Now, let's fetch a page and inspect it with BeautifulSoup:
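A minimal setup might look like this, reusing the spoofed headers from above:

```python
import requests
from bs4 import BeautifulSoup

# Fetch the page while posing as a regular browser.
req = requests.get("https://example.com", headers=headers)

# Parse the raw HTML of the response.
soup = BeautifulSoup(req.content, "html.parser")

# Dump the parsed document, nicely indented.
print(soup.prettify())
```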
We set things up by making a request to https://example.com. We then create a BeautifulSoup object that accepts the raw content of that response via `req.content`. The second parameter, `'html.parser'`, is our way of telling BeautifulSoup that this is an HTML document. There are other parsers available for parsing stuff like XML, if you're into that.
When we create a BeautifulSoup object from a page's HTML, our object contains the HTML structure of that page, which all sorts of methods can easily parse. First, let's see what our `soup` variable looks like by using `print(soup.prettify())`.
Targeting HTML Elements
Many methods are available for pinpointing and grabbing the information we're trying to get out of a page. Finding the exact information we want from a web page is an art form: effective scraping requires us to recognize patterns in the document's HTML that we can take advantage of to ensure we only grab the pieces we need. This is especially the case when dealing with sites that try to prevent us from doing just that.
Understanding the tools we have at our disposal is the first step to developing a keen eye for what's possible. We'll start with the meat and potatoes.
Using find() & find_all()
The most straightforward way to find information in our `soup` variable is by utilizing `soup.find(...)` or `soup.find_all(...)`. These two methods work the same with one exception: `find()` returns the first HTML element found, whereas `find_all()` returns a list of all elements matching the criteria (even if only one element is found, `find_all()` will return a list of a single item).
We can search for DOM elements in our `soup` variable by certain criteria. Passing a positional argument to `find_all()` will return all anchor tags on the site:
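In code, that's as simple as:

```python
# Returns a list of every <a> tag on the page.
all_links = soup.find_all("a")
```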
We can also find all anchor tags with the class name "boy". Passing the `class_` argument allows us to filter by class name. Note the underscore!
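For example:

```python
# Only <a> tags that carry the "boy" class.
boy_links = soup.find_all("a", class_="boy")
```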
If we wanted to get any element with the class name "boy" besides anchor tags, we can do that too:
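Dropping the tag name leaves only the class filter:

```python
# Any element with the "boy" class, regardless of tag type.
boys = soup.find_all(class_="boy")
```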
We can search for elements by ID, similar to how we searched using classes. Remember that we should only expect a single element to be returned with an id, so we should use find here:
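A quick sketch; the id value here is just a placeholder for whatever the page actually uses:

```python
# IDs should be unique per page, so find() is the right tool.
header = soup.find(id="page-header")
```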
Often, we'll run into situations where elements don't have reliable class or id values. Luckily, we can search for DOM elements with any attribute, including non-standard ones:
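The `attrs` argument takes a dictionary of attribute/value pairs; the attribute name below is purely illustrative:

```python
# Match elements by any attribute, even non-standard ones.
cards = soup.find_all(attrs={"data-category": "featured"})
```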
CSS Selectors
Searching HTML using CSS selectors is one of the most powerful ways to find what you're looking for, especially for sites trying to make your life difficult. CSS selectors enable us to find and leverage highly specific patterns in the target's DOM structure. This is the best way to ensure we're grabbing exactly the content we need. If you're rusty on CSS selectors, I highly recommend becoming reacquainted. Here are a few examples:
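Selectors are passed to `select()`, which always returns a list of matches:

```python
# Paragraphs inside an element carrying both the "widget" and "author" classes.
author_paragraphs = soup.select(".widget.author p")
```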
In this example, we're looking for an element with a "widget" class and an "author" class. Once we have that element, we go deeper to find any paragraph tags held within that widget. We could also modify this to get only the second paragraph tag inside the author widget:
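One way to express that (still a list, even though only one element can match):

```python
# Just the second paragraph inside the author widget.
second_paragraph = soup.select(".widget.author p:nth-of-type(2)")
```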
To understand why this is so powerful, imagine a site that intentionally has no identifying attributes on its tags to keep people like you from scraping their data. Even without names to select by, we could observe the DOM structure of the page and find a unique way to navigate to the element we want:
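The structure below is hypothetical, but it shows the idea: chain parent/child relationships until only the elements you want can possibly match:

```python
# Walk the structure: the second <div> inside <body>, then its list items.
items = soup.select("body > div:nth-of-type(2) > ul > li")
```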
A specific pattern like this is likely unique to only a single collection of `<li>` tags on the page we're exploiting. The downside of this method is that we're at the whim of the site owner, as their HTML structure could change.
Get Some Attributes
Chances are we'll almost always want the contents or the attributes of a tag, as opposed to the entirety of a tag's HTML. If we're scraping anchor tags, for instance, we probably just want the `href` value as opposed to the entire tag. The `.get()` method can be used here to retrieve values of attributes on a tag:
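For example:

```python
# Pull the href value from every anchor tag on the page.
urls = [link.get("href") for link in soup.find_all("a")]
```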
The above finds the destination URLs for all `<a>` tags on a page. Another example can have us grab a site's logo image:
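Assuming the logo is an `<img>` tag with a "logo" class (an assumption on my part; adjust to the site in question):

```python
# Grab the source URL of the site's logo image, if one exists.
logo = soup.find("img", class_="logo")
logo_url = logo.get("src") if logo else None
```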
Sometimes, it's not attributes we're looking for but just the text within a tag:
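For instance, pulling the text of the page's first `<h1>`:

```python
# .get_text() returns only the human-readable text inside a tag.
headline = soup.find("h1").get_text()
```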
Pesky Tags to Deal With
In our example of creating link previews, a good first source of information would be the page's meta tags, specifically the `og` tags they've specified to openly provide the bite-sized information we're looking for. Grabbing these tags is a bit more difficult to deal with:
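Something along these lines, using `og:title` as the example:

```python
# Meta tags need two identifiers: the tag name and the property we care about.
og_title = soup.find("meta", property="og:title").get("content")
```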
Now that's ugly. Meta tags are an especially interesting case; they're all uselessly dubbed 'meta'. Thus, we need a second identifier (in addition to the tag name) to specify which meta tag we care about. Only then can we bother to get the actual content of the aforementioned tag.
Scripting a Scraper
Let's get comfortable with scraping the web by writing a small script that returns the metadata for a given URL. To accomplish this, we can think of our script as having three "steps":
- Fetching the HTML content of a URL using Python's `requests` library.
- Parsing the returned HTML into a `BeautifulSoup` object, from which we can extract individual elements, such as "title."
- Returning the output as a simple dictionary.
Here's how I've chosen to create an entry point for our script. This function isolates logic for requesting URLs in a fetch module and handles the scraping logic via a scrape module:
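A sketch of what that entry point might look like (the function name here is my own; the `fetch` and `scrape` modules hold the two functions covered below):

```python
"""Entry point: generate metadata for a given URL."""
from fetch import fetch_html_from_url
from scrape import scrape_page_metadata


def scrape_metadata_from_url(url: str) -> dict:
    """Fetch a page and return its scraped metadata as a dictionary."""
    resp = fetch_html_from_url(url)
    return scrape_page_metadata(resp, url)
```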
Requesting the Contents of a URL
The `fetch_html_from_url()` function is a simple implementation of making a GET request and returning the contents:
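A bare-bones version, reusing the spoofed browser headers from earlier (the header values remain illustrative):

```python
"""Fetch raw HTML from a target URL."""
import requests
from requests import Response

# Illustrative browser-like headers, as discussed earlier.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}


def fetch_html_from_url(url: str) -> Response:
    """Make a GET request to `url` and return the raw response."""
    return requests.get(url, headers=headers)
```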
Scraping Metadata from an HTML Page
The `scrape_page_metadata()` function will attempt to scrape various elements of the given HTML body and output a dictionary with these values:
"""Scrape metadata attributes from a requested URL."""
from typing import Optional
from bs4 import BeautifulSoup
from requests import Response
def scrape_page_metadata(resp: Response, url: str) -> dict:
"""
Parse page & return metadata.
:param Response resp: Raw HTTP response.
:param str url: URL of targeted page.
:return: dict
"""
html = BeautifulSoup(resp.content, "html.parser")
metadata = {
"title": get_title(html),
"description": get_description(html),
"image": get_image(html),
"favicon": get_favicon(html, url),
"theme_color": get_theme_color(html),
}
return metadata
...
Each key in our dictionary has a corresponding function that attempts to scrape the corresponding information. Here's what we have for fetching a page's title, description, image, favicon, and color values:
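What follows is a sketch of those helpers. The fallback chain matches the description below, but the specific meta and link selectors are assumptions about how typical pages are tagged, not the only way to do it:

```python
from typing import Optional

from bs4 import BeautifulSoup


def get_title(html: BeautifulSoup) -> Optional[str]:
    """Scrape page title; fall back to og:title, then the first <h1>."""
    title = html.find("title")
    if title and title.get_text():
        return title.get_text().strip()
    og_title = html.find("meta", property="og:title")
    if og_title and og_title.get("content"):
        return og_title["content"]
    h1 = html.find("h1")
    return h1.get_text().strip() if h1 else None


def get_description(html: BeautifulSoup) -> Optional[str]:
    """Scrape page description; fall back to og:description, then the first <p>."""
    description = html.find("meta", attrs={"name": "description"})
    if description and description.get("content"):
        return description["content"]
    og_description = html.find("meta", property="og:description")
    if og_description and og_description.get("content"):
        return og_description["content"]
    first_paragraph = html.find("p")
    return first_paragraph.get_text().strip() if first_paragraph else None


def get_image(html: BeautifulSoup) -> Optional[str]:
    """Scrape the page's "share" image; fall back to the first <img> with a src."""
    og_image = html.find("meta", property="og:image")
    if og_image and og_image.get("content"):
        return og_image["content"]
    first_image = html.find("img", src=True)
    return first_image["src"] if first_image else None


def get_favicon(html: BeautifulSoup, url: str) -> str:
    """Scrape a high-res icon; fall back to [URL]/favicon.ico."""
    icon = html.find("link", rel="apple-touch-icon") or html.find("link", rel="icon")
    if icon and icon.get("href"):
        return icon["href"]
    return f"{url.rstrip('/')}/favicon.ico"


def get_theme_color(html: BeautifulSoup) -> Optional[str]:
    """Scrape the page's theme color; no fallback if the meta tag is missing."""
    theme_color = html.find("meta", attrs={"name": "theme-color"})
    if theme_color and theme_color.get("content"):
        return theme_color["content"]
    return None
```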
You may not have been expecting this wall-of-text level of complexity! When scraping the web, it's best to assume that something will go wrong. Not every page on the internet is perfectly tagged with the correct data; thus, we need to implement fallbacks for the common scenario where data is missing.
- `get_title()` attempts to parse the classic `<title>` HTML tag, which has a very low chance of failing. If the page is missing this tag, we fall back to parsing the Open Graph equivalent from the `og:title` meta tag. If all of this still fails, we finally resort to trying to pull the first `<h1>` tag on the page (if we get to this point, we're probably scraping a garbage site).
- `get_description()` is nearly identical to our method for scraping page titles. The last resort is a desperate attempt to pull the first paragraph on the page.
- `get_image()` looks for the page's "share" image, which is used to generate link previews on social media platforms. Our last resort is to pull the first `<img>` tag containing a source image.
- `get_favicon()` attempts to fetch high-res "icons" via meta tags before resorting to a last-ditch effort of simply grabbing `[URL]/favicon.ico`.
- `get_theme_color()` is somewhat unique in that there are zero fallbacks if the `"theme-color"` meta tag does not exist. There aren't any simple alternative ways of determining this, which is fine. This is arguably the least important attribute we're fetching.
What Did We Build?
This script we threw together is the basis for how popular chat clients generate "link previews" (think Facebook, Slack, and Discord). By fetching and parsing a posted URL, we have enough data to build a rather good synopsis of a given page before we, as users, visit it ourselves.
This started as a tutorial about BeautifulSoup, but we got carried away and wrote an entire application. Whoops.
I've uploaded a working demo of this script to GitHub: