Scraping Data on the Web with BeautifulSoup

The honest act of robotically stealing data

There are plenty of reliable and open sources of data on the web. Datasets are freely released to the public domain by the likes of Kaggle, Google Cloud, and of course local & federal government. Like most things free and open, however, following the rules to obtain public data can be a bit... boring. I'm not suggesting we go and blatantly break some grey-area laws by stealing data, but this blog isn't exactly called People Who Play It Safe And Slackers, either.

My personal Python roots can actually be traced back to an ambitious side project: to aggregate all new music from across the web and deliver it to the masses. While that project may have been abandoned (after I realized it already existed), BeautifulSoup was more or less my first-ever experience with Python.

The Tool(s) for the Job(s)

Before going any further, we'd be ill-advised not to at least mention Python's other web-scraping behemoth, Scrapy. BeautifulSoup and Scrapy have two very different agendas. BeautifulSoup is intended to parse or extract data one page at a time, with each page being served up via the requests library or equivalent. Scrapy, on the other hand, is for creating crawlers: absolute monstrosities unleashed upon the web like a swarm, loosely following links and hastily grabbing data wherever data exists to be grabbed. To put this in perspective, Google Cloud Functions won't even let you import Scrapy as a usable library.

This isn't to say that BeautifulSoup can't be made into a similar monstrosity of its own. For now, we'll focus on a modest task: generating link previews for URLs by grabbing their metadata.

Step 1: Stalk Your Prey

Before we steal any data, we should take a look at the data we're hoping to steal.

import requests
from bs4 import BeautifulSoup

def scrape(url):
    """Scrape URLs to generate previews."""
    headers = requests.utils.default_headers()
    headers.update({
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
    })
    r = requests.get(url, headers=headers)
    raw_html = r.content
    soup = BeautifulSoup(raw_html, 'html.parser')
    print(soup.prettify())

The above is the minimum needed to retrieve the DOM structure of an HTML page. BeautifulSoup accepts the .content output from a request, from which we can investigate the contents.

Using BeautifulSoup will often produce different results for your scraper than what you see as a human in a browser, such as 403 errors or blocked content. An easy way around this is to fake your headers so the request looks like it came from a normal browser, as we do here:

headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})

The result of print(soup.prettify()) will predictably output a "pretty" printed version of your target DOM structure:

<html class="gr__example_com"><head>
    <title>Example Domain</title>
    <meta charset="utf-8">
    <meta http-equiv="Content-type" content="text/html; charset=utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <meta property="og:site_name" content="Example dot com">
    <meta property="og:type" content="website">
    <meta property="og:title" content="Example">
    <meta property="og:description" content="An Example website.">
    <meta property="og:image" content="http://example.com/img/image.jpg">
    <meta name="twitter:title" content="Hackers and Slackers">
    <meta name="twitter:description" content="An Example website.">
    <meta name="twitter:url" content="http://example.com/">
    <meta name="twitter:image" content="http://example.com/img/image.jpg">
</head>

<body data-gr-c-s-loaded="true">
  <div>
    <h1>Example Domain</h1>
      <p>This domain is established to be used for illustrative examples in documents.</p>
      <p>You may use this domain in examples without prior coordination or asking for permission.</p>
    <p><a href="http://www.iana.org/domains/example">More information...</a></p>
  </div>
</body>
    
</html>

Step 2: The Extraction

After turning our request content into a BeautifulSoup object, we access items in the DOM via dot notation, like so:

title = soup.title.string

.string gives us the actual content of the tag which is Example Domain, whereas soup.title would return the entirety of the tag as <title>Example Domain</title>.
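To make that concrete, here's what each returns against the example.com soup we built above:

title_tag = soup.title            # <title>Example Domain</title>
title_text = soup.title.string    # 'Example Domain'
print(title_text)                 # Example Domain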

Dot notation is fine when pages have predictable hierarchies or structures, but it's much less useful for extracting repeated patterns from a document: soup.a only returns the first <a> tag on the page, which probably isn't what we want.

If we wanted to extract all <a> tags of a page's content while avoiding the noise of nav links etc, we can use CSS selectors to return a list of all elements matching the selection. soup.select('body p > a') retrieves all links embedded in paragraph text, limited to the body of the page.

Some other methods of grabbing elements (there's a quick sketch of these in action after the list):

  • soup.find(id="example"): Useful for when a single element is expected.
  • soup.find_all('a'): Returns a list of all elements matching the selection after searching the document recursively.
  • .parent and .children: Relative navigation from the currently selected element, up to its parent or down through its direct children.
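Here's a quick sketch of those selectors run against the soup from our earlier request (the id passed to find() is purely hypothetical; example.com doesn't actually have one):

first_link = soup.a                     # first <a> tag only
all_links = soup.find_all('a')          # every <a> tag, searched recursively
body_links = soup.select('body p > a')  # CSS selector: links inside body paragraphs
by_id = soup.find(id="example")         # single-element lookup by id (hypothetical id, returns None here)

for link in body_links:
    print(link.get('href'), link.parent.name)  # .parent walks up to the enclosing <p>

first_div = soup.find('div')
print([child.name for child in first_div.children if child.name])  # direct child tags of the <div>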

Get Some Attributes

Chances are we'll almost always want the contents or the attributes of a tag, rather than the entirety of a tag's HTML. A common example of going after a tag's attributes is with img and a tags, where we're usually after the src and href attributes, respectively.

The .get method refers specifically to getting the value of attributes on a tag. For example, soup.find("img", class_="logo").get('src') would find an image tag with the class "logo" (class_ gets a trailing underscore because class is a reserved word in Python) and return the URL of that image.
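As a minimal sketch against the example.com soup from earlier (the "logo" class is hypothetical, so we guard against the tag not existing):

more_info = soup.find('a')              # first link in the document
print(more_info.get('href'))            # http://www.iana.org/domains/example

logo = soup.find('img', class_='logo')  # hypothetical class name
if logo is not None:
    print(logo.get('src'))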

Pesky Tags to Deal With

In our example of creating link previews, a good first source of information would obviously be the page's meta tags: specifically the og tags sites provide to openly serve up the bite-sized information we're looking for. Grabbing these tags is a bit more difficult to deal with:

soup.find("meta", property="og:description").get('content')

Oh yeah, now that's some ugly shit right there. Meta tags are especially interesting because they're all uselessly dubbed 'meta', thus we need a second differentiator in addition to the tag name to specify which meta tag we care about. Only then can we bother to get the actual content of said tag.
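For what it's worth, here's a quick sketch of one way to sweep every og tag into a dict in a single pass; the helper functions we build below take them one at a time instead:

# Collect every og:* meta tag into a dict, e.g. {'og:title': 'Example', ...}
og_tags = {
    tag['property']: tag.get('content', '')
    for tag in soup.find_all('meta', property=True)
    if tag['property'].startswith('og:')
}
print(og_tags.get('og:description', ''))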

Step 3: Realizing Something Will Always Break

If we were to try the above selector on an HTML page which did not contain an og:description, find() would return None and the call to .get() would throw an AttributeError, breaking our script unforgivingly. Not only do we miss that one piece of data, we lose the entire scrape - this means we always need to build in a plan B, and at the very least handle the case of a missing tag altogether.

It's best to break out this logic one tag at a time. First, let's look at an example for a base scraper with all the knowledge we have so far:

import requests
from bs4 import BeautifulSoup

def scrape(url):
    """Scrape scheduled link previews."""
    headers = requests.utils.default_headers()
    headers.update({
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
    })
    r = requests.get(url, headers=headers)
    raw_html = r.content
    soup = BeautifulSoup(raw_html, 'html.parser')
    links = soup.select('body p > a')
    previews = []
    for link in links:
        url = link.get('href')
        r2 = requests.get(url, headers=headers)
        link_html = r2.content
        embedded_link = BeautifulSoup(link_html, 'html.parser')
        link_preview_dict = {
            'title': getTitle(embedded_link),
            'description': getDescription(embedded_link),
            'image': getImage(embedded_link),
            'sitename': getSiteName(embedded_link, url),
            'url': url
            }
        previews.append(link_preview_dict)
        print(link_preview_dict)

Great - there's a base function for snatching all links out of the body of a page. For each of those links we ultimately build a dictionary of preview data, link_preview_dict.

To handle each value of our dict, we have individual functions:

def getTitle(link):
    """Attempt to get a title."""
    title = ''
    if link.title is not None and link.title.string is not None:
        title = link.title.string
    elif link.find("h1") is not None:
        title = link.find("h1").get_text()
    return title


def getDescription(link):
    """Attempt to get description."""
    description = ''
    if link.find("meta", property="og:description") is not None:
        description = link.find("meta", property="og:description").get('content')
    elif link.find("p") is not None:
        description = link.find("p").get_text()
    return description


def getImage(link):
    """Attempt to get a preview image."""
    image = ''
    if link.find("meta", property="og:image") is not None:
        image = link.find("meta", property="og:image").get('content')
    elif link.find("img") is not None:
        image = link.find("img").get('src')
    return image


def getSiteName(link, url):
    """Attempt to get the site's base name."""
    sitename = ''
    if link.find("meta", property="og:site_name") is not None:
        sitename = link.find("meta", property="og:site_name").get('content')
    else:
        # Fall back to the domain in the URL, e.g. "Example" from "http://www.example.com/page"
        domain = url.split('//')[1].split('/')[0]
        name = domain.rsplit('.', 2)[-2]
        return name.capitalize()
    return sitename

In case you're wondering:

  • getTitle tries to get the <title> tag, and falls back to the page's first <h1> tag (surprisingly enough some pages are in fact missing a title).
  • getDescription looks for the OG description, and falls back to the content of the page's first paragraph.
  • getImage looks for the OG image, and falls back to the page's first image.
  • getSiteName similarly tries to grab the OG attribute, otherwise it does its best to extract the domain name from the URL string under the assumption that this is the origin's name (look, it ain't perfect).

What Did We Just Build?

Believe it or not, the above is considered to be enough logic to be a paid service with a monthly fee. Go ahead and Google it; or better yet, just steal my source code entirely:

import requests
from bs4 import BeautifulSoup
from flask import make_response


def getTitle(link):
    """Attempt to get a title."""
    title = ''
    if link.title is not None and link.title.string is not None:
        title = link.title.string
    elif link.find("h1") is not None:
        title = link.find("h1").get_text()
    return title


def getDescription(link):
    """Attempt to get description."""
    description = ''
    if link.find("meta", property="og:description") is not None:
        description = link.find("meta", property="og:description").get('content')
    elif link.find("p") is not None:
        description = link.find("p").get_text()
    return description


def getImage(link):
    """Attempt to get image."""
    image = ''
    if link.find("meta", property="og:image") is not None:
        image = link.find("meta", property="og:image").get('content')
    elif link.find("img") is not None:
        image = link.find("img").get('src')
    return image


def getSiteName(link, url):
    """Attempt to get the site's base name."""
    sitename = ''
    if link.find("meta", property="og:site_name") is not None:
        sitename = link.find("meta", property="og:site_name").get('content')
    else:
        # Fall back to the domain in the URL, e.g. "Example" from "http://www.example.com/page"
        domain = url.split('//')[1].split('/')[0]
        name = domain.rsplit('.', 2)[-2]
        return name.capitalize()
    return sitename


def scrape(request):
    """Scrape scheduled link previews."""
    # Allows POST requests from any origin with the Content-Type
    # header and caches the preflight response for 3600s
    cors_headers = {
        'Access-Control-Allow-Origin': '*',
        'Access-Control-Allow-Methods': 'POST',
        'Access-Control-Allow-Headers': 'Content-Type',
        'Access-Control-Max-Age': '3600'
    }
    if request.method == 'POST':
        request_json = request.get_json()
        target_url = request_json['url']
        # Spoof a browser User-Agent for the outgoing scrape requests
        request_headers = requests.utils.default_headers()
        request_headers.update({
            'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
        })
        r = requests.get(target_url, headers=request_headers)
        raw_html = r.content
        soup = BeautifulSoup(raw_html, 'html.parser')
        links = soup.select('.post-content p > a')
        previews = []
        for link in links:
            url = link.get('href')
            r2 = requests.get(url, headers=request_headers)
            link_html = r2.content
            embedded_link = BeautifulSoup(link_html, 'html.parser')
            preview_dict = {
                'title': getTitle(embedded_link),
                'description': getDescription(embedded_link),
                'image': getImage(embedded_link),
                'sitename': getSiteName(embedded_link, url),
                'url': url
                }
            previews.append(preview_dict)
        return make_response(str(previews), 200, cors_headers)
    return make_response('bruh pls', 400, cors_headers)
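Once it's deployed as a Cloud Function, a quick sanity check looks something like this (the endpoint URL is a placeholder; substitute wherever your function actually lives):

import requests

# Placeholder endpoint; substitute your deployed function's URL
resp = requests.post(
    'https://REGION-PROJECT.cloudfunctions.net/scrape',
    json={'url': 'https://example.com'}
)
print(resp.status_code)
print(resp.text)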