Scrape Structured Data with Python and Extruct

Unless you're entirely oblivious to scraping data in Python (and probably ended up here by accident), you're well aware that scraping in Python begins and ends with BeautifulSoup. BeautifulSoup is Python's scraping powerhouse: we first demonstrated this in a previous post where we put together a script to fetch site metadata (title, description, preview images, etc.) from any target URL. We built a scraper that fetched a target site's <meta> tags (and various fallbacks) to create a fairly reliable tool for summarizing the contents of any URL, which is precisely the logic used to generate link "previews" such as these:

Example of a preview link with data fetched via BeautifulSoup

Perusing the various sites and entities we refer to as "the internet" has traditionally felt like navigating an unstandardized wild west. There's never a guarantee that the website you're targeting adheres to any web standards (even when doing so is in its best interest). These situations lead us to write scripts with complicated fallbacks in case the owner of myhorriblewebsite.angelfire.com somehow forgot to give their page a <title>, and so forth. Search engines and other big players recognized this, and JSON-LD was born: a standardized, reliable format for site publishers to include machine-readable (and also quite human-readable) metadata to appease search engines and fight for relevancy.

This post will build upon the goal of scraping site metadata we previously explored with BeautifulSoup via a different method: by parsing JSON-LD metadata with Python's extruct library.

What's so great about JSON-LD, you might ask? Aside from dodging the hellish experience of traversing the DOM by hand, JSON-LD is a specification with notable advantages over old-school HTML <meta> tags. The benefits can mostly be divided into two categories: data granularity and linked data.

Data Granularity

JSON-LD allows web pages to express an impressive amount of granular information about what each page is. For instance, here's the JSON-LD for one of my posts:

{
  "@context": "https://schema.org/",
  "@type": "Article",
  "author": {
    "@type": "Person",
    "name": "Todd Birchard",
    "image": "https://cdn.hackersandslackers.com/2020/04/todd@2x.jpg",
    "sameAs": [
        "https://toddbirchard.com",
        "https://twitter.com/ToddRBirchard"
    ]
  },
  "keywords": "Golang, DevOps, Software Development",
  "headline": "Deploy a Golang Web Application Behind Nginx",
  "url": "https://hackersandslackers.com/deploy-golang-app-nginx/",
  "datePublished": "2020-06-01T07:30:00.000-04:00",
  "dateModified": "2020-06-01T09:03:55.000-04:00",
  "image": {
    "@type": "ImageObject",
    "url": "https://cdn.hackersandslackers.com/2020/05/golang-nginx-3.jpg",
    "width": "1000",
    "height": "523"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Hackers and Slackers",
    "founder": "Todd Birchard",
    "logo": {
      "@type": "ImageObject",
      "url": "https://cdn.hackersandslackers.com/2020/03/logo-blue-full.png",
      "width": 60,
      "height": 60
    }
  },
  "description": "Deploy a self-hosted Go web application using Nginx as a reverse proxy. ",
  "mainEntityOfPage": {
    "@type": "WebPage",
    "@id": "https://hackersandslackers.com"
  }
}

Example JSON-LD for a Hackers and Slackers post

There's significantly more information stored in the above snippet than in all other meta tags on the same page combined. JSON-LD surely supports more attributes than traditional meta tags do, but the bigger win is that representing the data as a JSON hierarchy makes it obvious how the pieces of metadata relate. It's immediately clear that we're looking at an object representing an article, written by an author, as part of an "organization."
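To make that hierarchy concrete, here's a minimal sketch that loads a trimmed copy of the snippet above with Python's standard json module and walks from the article to its author and publisher. The `raw` string and the variable names are only for illustration:

```python
import json

# Trimmed copy of the JSON-LD shown above (illustration only).
raw = """
{
  "@context": "https://schema.org/",
  "@type": "Article",
  "headline": "Deploy a Golang Web Application Behind Nginx",
  "author": {"@type": "Person", "name": "Todd Birchard"},
  "publisher": {"@type": "Organization", "name": "Hackers and Slackers"}
}
"""

article = json.loads(raw)

# Every nested object declares its own @type, so the relationships are explicit.
summary = "{} '{}' by {} ({}), published by {} ({})".format(
    article["@type"],
    article["headline"],
    article["author"]["name"],
    article["author"]["@type"],
    article["publisher"]["name"],
    article["publisher"]["@type"],
)
print(summary)
```

No DOM traversal and no guessing which tag holds what: the keys tell us.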

Google's explanation of the benefits of structuring metadata goes something like this:

Structured data is a standardized format for providing information about a page and classifying the page content; for example, on a recipe page, what are the ingredients, the cooking time and temperature, the calories, and so on.

Type

The term "web page" is useless and ambiguous, as web pages are documents that can provide information in various forms. Web pages might be articles, recipes, product pages, events, and far more. The official schema of possible page types includes over one thousand possibilities for what "type" or "subtype" a page might be considered. Knowing the "type" of a page reduces ambiguity, and declaring a page "type" allows us to attach type-specific metadata to pages as well! For instance, let's compare the attributes of an Episode type to an Article type:

Episode
Property	Description
`actor`	An actor, e.g. in tv, radio, movie, video games etc., or in an event. Actors can be associated with individual items or with a series, episode, clip. Supersedes actors.
`director`	A director of e.g. tv, radio, movie, video gaming etc. content, or of an event. Directors can be associated with individual items or with a series, episode, clip. Supersedes directors.
`episodeNumber`	Position of the episode within an ordered group of episodes.
`musicBy`	The composer of the soundtrack.
`partOfSeason`	The season to which this episode belongs.
`partOfSeries`	The series to which this episode or season belongs. Supersedes partOfTVSeries.
`productionCompany`	The production company or studio responsible for the item e.g. series, video game, episode etc.
`trailer`	The trailer of a movie or tv/radio series, season, episode, etc.

Article
Property	Description
`articleBody`	The actual body of the article.
`articleSection`	Articles may belong to one or more 'sections' in a magazine or newspaper, such as Sports, Lifestyle, etc.
`backstory`	For an Article, typically a NewsArticle, the backstory property provides a textual summary giving a brief explanation of why and how an article was created. In a journalistic setting this could include information about reporting process, methods, interviews, data sources, etc.
`pageEnd`	The page on which the work ends; for example "138" or "xvi".
`pageStart`	The page on which the work starts; for example "135" or "xiii".
`pagination`	Any description of pages that is not separated into pageStart and pageEnd; for example, "1-6, 9, 55" or "10-12, 46-49".
`speakable`	Indicates sections of a Web page that are particularly 'speakable' in the sense of being highlighted as being especially appropriate for text-to-speech conversion. Other sections of a page may also be usefully spoken in particular circumstances; the 'speakable' property serves to indicate the parts most likely to be generally useful for speech.
`wordCount`	The number of words in the text of the Article.

There are data attributes of television shows which don't apply to news articles (such as actors, directors, etc.), and vice versa. The level of specificity achievable is nearly unfathomable when we discover that types have subtypes. For instance, our article might be an opinion piece article, which has extended the Article type with even more attributes.
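As a rough sketch of what "type-specific metadata" means for a scraper, the snippet below picks out only the properties that belong to an item's declared @type. The `TYPE_PROPERTIES` mapping and the `type_specific_fields` helper are hypothetical names, and the mapping covers just a handful of the properties from the tables above:

```python
# Hypothetical mapping from a schema.org type to a few of its properties.
TYPE_PROPERTIES = {
    "Episode": ["episodeNumber", "partOfSeason", "partOfSeries", "director"],
    "Article": ["articleSection", "wordCount", "pageStart", "pageEnd"],
}


def type_specific_fields(item: dict) -> dict:
    """Return only the properties that apply to the item's declared @type."""
    item_type = item.get("@type")
    # @type may be a single string or a list of types.
    types = item_type if isinstance(item_type, list) else [item_type]
    wanted = [prop for t in types for prop in TYPE_PROPERTIES.get(t, [])]
    return {prop: item[prop] for prop in wanted if prop in item}


# A made-up item carrying a property that doesn't belong to its type.
episode = {"@type": "Episode", "episodeNumber": 4, "wordCount": 900}
print(type_specific_fields(episode))
```

This sketch matches type names literally, so a subtype isn't resolved to its parent; doing that properly would require the schema.org type hierarchy.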

Who

All content has a creator, yet content creators can take many forms. Authors, publishers, and organizations could simultaneously be considered the responsible party for any content, as these properties are not mutually exclusive. For instance, here's how my author data is parsed on posts like this one:

Author
Property	Description
@type	Person
name	Todd Birchard
image	https://cdn.hackersandslackers.com/2020/04/todd@2x.jpg
sameAs	https://toddbirchard.com/
sameAs	https://twitter.com/ToddRBirchard

What makes this data especially interesting is the values listed under the sameAs attribute, which associates the "Todd Birchard" in question to the very same Todd Birchard of the website https://toddbirchard.com/, and Twitter account https://twitter.com/ToddRBirchard. This undoubtedly assists search engines in making associations between entities on the web. Still, a keen imagination will quickly recognize the opportunity to leverage these strong associations to dox or harass strangers on the internet.
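Collecting those associations from parsed JSON-LD takes only a few lines. Here's a small sketch; the `same_as_links` helper is a made-up name, and it assumes sameAs may arrive either as a single string or as a list:

```python
from typing import List


def same_as_links(entity: dict) -> List[str]:
    """Return the sameAs URLs of a JSON-LD entity as a list of strings."""
    value = entity.get("sameAs") or []
    # sameAs may be a single URL string or a list of URLs.
    if isinstance(value, str):
        value = [value]
    return [link for link in value if isinstance(link, str)]


author = {
    "@type": "Person",
    "name": "Todd Birchard",
    "sameAs": ["https://toddbirchard.com", "https://twitter.com/ToddRBirchard"],
}
print(same_as_links(author))
```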

Scrape Something Together

Along with Extruct, we'll be installing our good friend requests to fetch pages for us:

$ pip3 install requests extruct

Install libraries

You already know the drill — pick a single URL for now and loot them for all they've got by returning .text from our request's response:

import requests


def get_html(url):
    """Get raw HTML from a URL."""
    # Access-Control-* headers belong on responses, so only a User-Agent is sent.
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }
    req = requests.get(url, headers=headers)
    return req.text

Retrieve a page's HTML.

Simple stuff. Here's where extruct comes in; I'm tossing together a function called get_metadata, which will do precisely what you'd assume. We can take the raw HTML we grabbed with get_html and pass it to our new function to pillage:

"""Fetch structured JSON-LD data from a given URL."""
from pprint import pprint
import requests
import extruct
from w3lib.html import get_base_url


def scrape(url: str):
    """Parse structured data from a target page."""
    html = get_html(url)
    metadata = get_metadata(html, url)
    pprint(metadata, indent=2, width=150)
    return metadata

...

def get_metadata(html: str, url: str):
    """Fetch JSON-LD structured data."""
    metadata = extruct.extract(
        html,
        base_url=get_base_url(html, url),
        syntaxes=['json-ld'],
    )['json-ld'][0]
    return metadata

Getting structured data with extruct.

Using extruct is as easy as passing raw HTML as a string and a "base URL" with extruct.extract(html, base_url=url). The base URL is the address that relative links in a page are resolved against: normally the page's own URL, unless the HTML declares a different one with a <base> tag. The page you're on right now is https://hackersandslackers.com/scrape-metadata-json-ld/, so unless a <base> tag says otherwise, that URL is also the base URL. There's a library called w3lib that has a function to handle this exact task: get_base_url(html, url) returns the <base> href if the page declares one and falls back to the URL we passed in otherwise, hence our usage of base_url=get_base_url(html, url).
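Whichever base URL ends up being passed, its job is to turn relative links into absolute ones. Here's a minimal standard-library sketch of that resolution step using urllib.parse.urljoin. It illustrates the general idea rather than extruct's internals, and the relative paths are made up:

```python
from urllib.parse import urljoin

# Example page URL from the text; the relative paths below are hypothetical.
page_url = "https://hackersandslackers.com/scrape-metadata-json-ld/"

# A root-relative path resolves against the domain.
root_relative = urljoin(page_url, "/images/cover.jpg")

# A plain relative path resolves against the page's own "directory".
page_relative = urljoin(page_url, "images/cover.jpg")

# An absolute URL is left untouched.
absolute = urljoin(page_url, "https://cdn.hackersandslackers.com/2020/03/logo-blue-full.png")

print(root_relative)
print(page_relative)
print(absolute)
```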

This is what our extract function returns, using one of my posts as an example:

{
  "@context": "https://schema.org/",
  "@type": "Article",
  "author": {
    "@type": "Person",
    "name": "Todd Birchard",
    "image": "https://cdn.hackersandslackers.com/2020/04/todd@2x.jpg",
    "sameAs": [
      "https://toddbirchard.com",
      "https://twitter.com/ToddRBirchard"
    ]
  },
  "keywords": "Django, Python, Software Development",
  "headline": "Creating Interactive Views in Django",
  "url": "https://hackersandslackers.com/creating-django-views/",
  "datePublished": "2020-04-23T12:21:00.000-04:00",
  "dateModified": "2020-05-02T13:31:33.000-04:00",
  "image": {
    "@type": "ImageObject",
    "url": "https://cdn.hackersandslackers.com/2020/04/django-views-1.jpg",
    "width": "1000",
    "height": "523"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Hackers and Slackers",
    "founder": "Todd Birchard",
    "logo": {
      "@type": "ImageObject",
      "url": "https://cdn.hackersandslackers.com/2020/03/logo-blue-full.png",
      "width": 60,
      "height": 60
    }
  },
  "description": "Create interactive user experiences by writing Django views to handle dynamic content, submitting forms, and interacting with data.",
  "mainEntityOfPage": {
    "@type": "WebPage",
    "@id": "https://hackersandslackers.com"
  }
}

JSON-LD data for a Hackers and Slackers post.

One of the keyword arguments we passed to extruct was syntaxes, an optional argument where we specify which flavor of structured data we're after (apparently there's more than one). Possible options to pass are 'microdata', 'json-ld', 'opengraph', 'microformat', and 'rdfa'. If nothing is passed, extruct will attempt to fetch all of the above and return the results in a dictionary. This is why we follow up our extruct call by accessing the ['json-ld'] key.
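For a sense of what the 'json-ld' flavor actually looks like on a page: JSON-LD lives inside <script type="application/ld+json"> tags. The sketch below pulls those blocks out using only the standard library. To be clear, this is not how extruct is implemented; `JsonLdParser` and the sample HTML are made up purely for illustration:

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script contents arrive here as raw text.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip malformed blocks instead of failing the whole page.


sample_html = """
<html><head>
<title>Example</title>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Article", "headline": "Hello"}
</script>
</head><body></body></html>
"""

parser = JsonLdParser()
parser.feed(sample_html)
print(parser.items)
```

In practice you'd still reach for extruct, which handles the other syntaxes and plenty of edge cases this sketch ignores.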

Dealing with Inconsistent Results

You might be wondering why we index [0] after getting our results from extruct. This is a symptom of structured data: where traditional <meta> tags are predictably 1-dimensional, the "structure" of structured data is flexible and determined by developers. This flexibility gives developers the power to do things like define a page's share image as a list of dicts (several images) as opposed to a single dict. It also makes the output of any given site's data unpredictable, which poses problems for Python scripts that can't know in advance whether they should index into a list or access a dictionary value.

The way I handle this is by explicitly checking the Python type of data being returned before extracting it:

...


def get_metadata(html: str, url: str):
    """Fetch JSON-LD structured data."""
    metadata = extruct.extract(
        html,
        base_url=get_base_url(html, url),
        syntaxes=['json-ld'],
        uniform=True
    )['json-ld']
    if bool(metadata) and isinstance(metadata, list):
        metadata = metadata[0]
    return metadata

Check the "type" of structured data

This uncertainty of returned data types occurs everywhere. In the example where a page may have multiple meta images, I might write a function like get_image() below, where I explicitly check the type of data being returned for a given attribute while traversing the data tree:

...
    
def get_image(_data) -> Optional[str]:
    """Scrape the `share image` from parsed metadata (a dict or a list of dicts)."""
    image = None
    if bool(_data):
        # A page may return a list of items; use the first one.
        if isinstance(_data, list):
            _data = _data[0]
        if isinstance(_data, dict):
            image = _data.get('image')
    return image

Extract metadata depending on the type
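The same type-checking idea can be pulled into a small reusable helper so each lookup doesn't repeat the isinstance dance. This is only a sketch under my own assumptions: `first_item` and `image_url` are hypothetical names, and it assumes an image may show up as a string, a dict with a `url` key, or a list of either:

```python
from typing import Any, Optional


def first_item(value: Any) -> Any:
    """Collapse a list to its first element; pass anything else through."""
    if isinstance(value, list):
        return value[0] if value else None
    return value


def image_url(metadata: Any) -> Optional[str]:
    """Return a share-image URL whether `image` is a str, a dict, or a list of either."""
    item = first_item(metadata)
    if not isinstance(item, dict):
        return None
    image = first_item(item.get("image"))
    if isinstance(image, dict):
        image = image.get("url")
    return image if isinstance(image, str) else None


# Three shapes a site might publish, all reduced to a single URL string.
print(image_url({"image": "https://example.com/a.jpg"}))
print(image_url({"image": {"@type": "ImageObject", "url": "https://example.com/b.jpg"}}))
print(image_url([{"image": [{"url": "https://example.com/c.jpg"}]}]))
```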

Put it to Work

A script to fetch and return structured data from a site would look something like this:

"""Fetch structured JSON-LD data from a given URL."""
from pprint import pprint
import requests
import extruct
from w3lib.html import get_base_url


def scrape(url: str):
    """Parse structured data from a target page."""
    html = get_html(url)
    metadata = get_metadata(html, url)
    pprint(metadata, indent=2, width=150)
    return metadata


def get_html(url: str):
    """Get raw HTML from a URL."""
    # Access-Control-* headers belong on responses, so only a User-Agent is sent.
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }
    req = requests.get(url, headers=headers)
    return req.text


def get_metadata(html: str, url: str):
    """Fetch JSON-LD structured data."""
    metadata = extruct.extract(
        html,
        base_url=get_base_url(html, url),
        syntaxes=['json-ld'],
        uniform=True
    )['json-ld']
    if bool(metadata) and isinstance(metadata, list):
        metadata = metadata[0]
    return metadata

Scrape a URL for structured data

Testing our Scraper

Since we're grownups, it's best if we write a simple test or two for a script that could potentially be run on a massive scale. The bare minimum we could do is point our scraper to a site containing structured data and compare the output to the data we'd expect to see. Below is a small test written with Pytest to see that our scrape() function outputs data that matches a hardcoded copy of what I expect to get back:

"""Validate JSON-LD Scrape outcome."""
import pytest
from extruct_tutorial import scrape


@pytest.fixture
def url() -> str:
    """
    Target URL to scrape metadata.
    
    :returns: str
    """
    return 'https://hackersandslackers.com/creating-django-views/'


@pytest.fixture
def expected_json() -> dict:
    """
    Expected metadata to be returned.
    
    :returns: dict
    """
    return {
      "@context": "https://schema.org/",
      "@type": "Article",
      "author": {
        "@type": "Person",
        "name": "Todd Birchard",
        "image": "https://cdn.hackersandslackers.com/2020/04/todd@2x.jpg",
        "sameAs": [
          "https://toddbirchard.com",
          "https://twitter.com/ToddRBirchard"
        ]
      },
      "keywords": "Django, Python, Software Development",
      "headline": "Creating Interactive Views in Django",
      "url": "https://hackersandslackers.com/creating-django-views/",
      "datePublished": "2020-04-23T12:21:00.000-04:00",
      "dateModified": "2020-05-02T13:31:33.000-04:00",
      "image": {
        "@type": "ImageObject",
        "url": "https://cdn.hackersandslackers.com/2020/04/django-views-1.jpg",
        "width": "1000",
        "height": "523"
      },
      "publisher": {
        "@type": "Organization",
        "name": "Hackers and Slackers",
        "founder": "Todd Birchard",
        "logo": {
          "@type": "ImageObject",
          "url": "https://cdn.hackersandslackers.com/2020/03/logo-blue-full.png",
          "width": 60,
          "height": 60
        }
      },
      "description": "Create interactive user experiences by writing Django views to handle dynamic content, submitting forms, and interacting with data.",
      "mainEntityOfPage": {
        "@type": "WebPage",
        "@id": "https://hackersandslackers.com"
      }
    }


def test_scrape(url: str, expected_json: dict):
    """
    Match target URL's expected metadata JSON to actual metadata.
    
    :param str url: URL of page to extract JSON-LD
    
    """
    metadata = scrape(url)
    assert metadata == expected_json

test_scrape.py
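One caveat with comparing against a full hardcoded copy: the test breaks as soon as the live page changes any field, such as dateModified after an edit. A looser option is to assert only on the fields you expect to stay stable. Here's a minimal sketch of that idea; `is_subset` is a hypothetical helper, not part of pytest or extruct:

```python
def is_subset(expected, actual) -> bool:
    """Return True if every key/value in `expected` appears in `actual`, recursively."""
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            key in actual and is_subset(value, actual[key])
            for key, value in expected.items()
        )
    return expected == actual


# Stand-in for what scrape() might return (trimmed from the example above).
actual = {
    "@type": "Article",
    "headline": "Creating Interactive Views in Django",
    "dateModified": "2020-05-02T13:31:33.000-04:00",
    "author": {"@type": "Person", "name": "Todd Birchard"},
}

# Only the fields we expect to remain stable over time.
stable = {"@type": "Article", "author": {"name": "Todd Birchard"}}

assert is_subset(stable, actual)
```

In the test above, this would mean replacing the equality check with assert is_subset(expected_json, metadata) and trimming expected_json down to the stable keys.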

Build a Metadata Scraper

Of course, the above simply puts data on a silver platter for you; there's still the work of grabbing the values. To give you an example of a fully fleshed-out script to scrape metadata with extruct, I'll share with you my treasure: the script I use to generate link previews:

"""Fetch structured JSON-LD data from a given URL."""
from typing import Optional, List
import requests
import extruct
from w3lib.html import get_base_url


def scrape(url: str) -> Optional[List[dict]]:
    """
    Parse structured data from a URL.
    
    :param str url: HTTP URL to extract metadata from.
    
    :returns: Optional[List[dict]]
    """
    html = get_html(url)
    json_ld = get_metadata(html, url)
    card = [
        "bookmark", {
            "type": "bookmark",
            "url": get_canonical(json_ld),
            "metadata": {
                "url": get_canonical(json_ld),
                "title": get_title(json_ld),
                "description": get_description(json_ld),
                "author": get_author(json_ld),
                "publisher": get_publisher(json_ld),
                "thumbnail": get_image(json_ld),
            }
        }
    ]
    return card


def get_html(url: str) -> str:
    """
    Get raw HTML from a URL.
    
    :param str url: URL of page to extract metadata from.
    
    :returns: str
    """
    # Access-Control-* headers belong on responses, so only a User-Agent is sent.
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }
    resp = requests.get(url, headers=headers, timeout=20)
    return resp.text


def get_metadata(html: str, url: str) -> dict:
    """
    Fetch JSON-LD structured data from raw HTML.

    :param str html: Raw HTML body of page.
    :param str url: URL from which the HTML was extracted.

    :returns: dict
    """
    metadata = extruct.extract(
        html,
        base_url=get_base_url(html, url),
        syntaxes=['json-ld'],
        uniform=True
    )['json-ld']
    if bool(metadata) and isinstance(metadata, list):
        metadata = metadata[0]
    return metadata


def get_title(json_ld: dict) -> Optional[str]:
    """
    Fetch title via extruct.
    
    :param dict json_ld: Parsed JSON-LD metadata from URL.
    
    :returns: Optional[str]
    """
    title = None
    if bool(json_ld) and json_ld.get('headline'):
        if isinstance(json_ld.get('headline'), list):
            title = json_ld['headline'][0]
        elif isinstance(json_ld.get('headline'), str):
            title = json_ld.get('headline')
        if isinstance(title, str):
            return title.replace("'", "")
    if bool(json_ld) and json_ld.get('title'):
        if isinstance(json_ld.get('title'), list):
            title = json_ld['title'][0]
        elif isinstance(json_ld.get('title'), str):
            title = json_ld.get('title')
    return title


def get_image(json_ld: dict) -> Optional[str]:
    """
    Fetch share image via extruct (if exists).
    
    :param dict json_ld: Parsed JSON-LD metadata from URL.
    
    :returns: Optional[str]
    """
    image = None
    if bool(json_ld) and json_ld.get('image'):
        if isinstance(json_ld['image'], list):
            image = json_ld['image'][0]
            if isinstance(image, dict):
                image = image.get('url')
            if isinstance(image, str):
                return image
        elif isinstance(json_ld.get('image'), dict):
            image = json_ld['image'].get('url')
    return image


def get_description(json_ld: dict) -> Optional[str]:
    """
    Fetch description via extruct (if exists).
    
    :param dict json_ld: Parsed JSON-LD metadata from URL.
    
    :returns: Optional[str]
    """
    if bool(json_ld) and json_ld.get('description'):
        return json_ld['description']
    return None


def get_author(json_ld: dict) -> Optional[str]:
    """
    Fetch author name via extruct (if exists).
    
    :param dict json_ld: Parsed JSON-LD metadata from URL.
    
    :returns: Optional[str]
    """
    author = None
    if bool(json_ld) and json_ld.get('author'):
        if isinstance(json_ld['author'], list):
            author = json_ld['author'][0].get('name')
        elif isinstance(json_ld['author'], dict):
            author = json_ld['author'].get('name')
    return author


def get_publisher(json_ld: dict) -> Optional[str]:
    """
    Fetch publisher name via extruct (if exists).
    
    :param dict json_ld: Parsed JSON-LD metadata from URL.
    
    :returns: Optional[str]
    """
    publisher = None
    if bool(json_ld) and json_ld.get('publisher'):
        if isinstance(json_ld['publisher'], list):
            publisher = json_ld['publisher'][0].get('name')
        elif isinstance(json_ld['publisher'], dict):
            publisher = json_ld['publisher'].get('name')
    return publisher


def get_canonical(json_ld: dict) -> Optional[str]:
    """
    Fetch canonical URL via extruct (if exists).
    
    :param dict json_ld: Parsed JSON-LD metadata from URL.
    
    :returns: Optional[str]
    """
    canonical = None
    if bool(json_ld) and json_ld.get('mainEntityOfPage'):
        if isinstance(json_ld['mainEntityOfPage'], dict):
            canonical = json_ld['mainEntityOfPage'].get('@id')
        elif isinstance(json_ld['mainEntityOfPage'], str):
            return json_ld['mainEntityOfPage']
    return canonical

Metadata scraper with extruct
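You may have noticed that get_author and get_publisher are nearly identical. If the repetition bothers you, one possible refactor is a single helper that takes the key as an argument. This is just a sketch: `get_name` is a hypothetical function that isn't part of the script above, and the bare-string branch is my own assumption about what some sites might publish:

```python
from typing import Optional


def get_name(json_ld: dict, key: str) -> Optional[str]:
    """Return the `name` of the entity stored under `key` (e.g. author, publisher)."""
    entity = json_ld.get(key) if isinstance(json_ld, dict) else None
    # The entity may be wrapped in a list; use the first entry.
    if isinstance(entity, list):
        entity = entity[0] if entity else None
    if isinstance(entity, dict):
        return entity.get("name")
    # Assumption: a site may publish the entity as a bare string.
    return entity if isinstance(entity, str) else None


json_ld = {
    "author": {"@type": "Person", "name": "Todd Birchard"},
    "publisher": [{"@type": "Organization", "name": "Hackers and Slackers"}],
}
print(get_name(json_ld, "author"))
print(get_name(json_ld, "publisher"))
```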

One More For the Toolbox

Unless you're looking to create link previews like the one I included, using extruct as a standalone library without a more extensive plan or toolkit isn't going to deliver much to you other than an easy interface for getting better metadata from individual web pages. Instead, consider the bigger picture of what a single page's metadata gives us. We now have effortless access to information crawlers can use to move through sites, associate data with individuals, and ultimately create a picture of an entity's entire web presence, whether that entity is a person, organization, or whatever.

If you look closely, one of extruct's main dependencies is BeautifulSoup. You could argue that you may have been able to write this library, and you might be right, but that isn't the point. Data mining behemoths aren't nuclear arsenals; they're collections of tools used in conjunction cleverly to wreak havoc upon the world as efficiently as possible. We're getting there.

This has been a quick little script, but if you're interested I've thrown the source up on GitHub here:

Until next time.