Unless you're entirely oblivious to scraping data in Python (and probably ended up here by accident), you're well aware that scraping data in Python begins and ends with BeautifulSoup. BeautifulSoup is Python's scraping powerhouse: we first demonstrated this in a previous post, where we put together a script to fetch site metadata (title, description, preview images, etc.) from any target URL. That scraper fetched a target site's <meta> tags (and various fallbacks) to create a fairly reliable tool for summarizing the contents of any URL, which is precisely the logic used to generate link "previews" such as these:

Scraping Data on the Web with BeautifulSoup
Use Python’s BeautifulSoup library to assist in the honest act of systematically stealing data without permission.
Example of a preview link with data fetched via BeautifulSoup.

Perusing the various sites and entities we refer to as "the internet" has traditionally felt like navigating an unstandardized wild west. There's never a guarantee that the website you're targeting adheres to any web standards (despite its own best interests). These situations lead us to write scripts with complicated fallbacks in case the owner of myhorriblewebsite.angelfire.com somehow managed to forget to give their page a <title>, and so forth. Search engines and other big players recognized this, and JSON-LD was standardized as a reliable format for site publishers to include machine-readable (and quite human-readable) metadata: a way to appease search engines and fight for relevancy.

This post is going to build upon the goal of scraping site metadata we previously explored with BeautifulSoup via a different method: by parsing JSON-LD metadata with Python's extruct library.

What's so great about JSON-LD, you might ask? Aside from dodging the hellish experience of traversing the DOM by hand, JSON-LD is a specification with notable advantages over old-school HTML <meta> tags. The multitude of benefits can mostly be boiled down into two categories: data granularity and linked data.

Data Granularity

JSON-LD allows web pages to express an impressive amount of granular information about what each page is. For instance, here's the JSON-LD for one of my posts:

{
  "@context": "https://schema.org/",
  "@type": "Article",
  "author": {
    "@type": "Person",
    "name": "Todd Birchard",
    "image": "https://hackersandslackers-cdn.storage.googleapis.com/2020/04/todd@2x.jpg",
    "sameAs": ["https://toddbirchard.com", "https://twitter.com/ToddRBirchard"]
  },
  "keywords": "Golang, DevOps, Software Development",
  "headline": "Deploy a Golang Web Application Behind Nginx",
  "url": "https://hackersandslackers.com/deploy-golang-app-nginx/",
  "datePublished": "2020-06-01T07:30:00.000-04:00",
  "dateModified": "2020-06-01T09:03:55.000-04:00",
  "image": {
    "@type": "ImageObject",
    "url": "https://hackersandslackers-cdn.storage.googleapis.com/2020/05/golang-nginx-3.jpg",
    "width": "1000",
    "height": "523"
  },
  "publisher": {
    "@type": "Organization",
    "name": "Hackers and Slackers",
    "founder": "Todd Birchard",
    "logo": {
      "@type": "ImageObject",
      "url": "https://hackersandslackers-cdn.storage.googleapis.com/2020/03/logo-blue-full.png",
      "width": 60,
      "height": 60
    }
  },
  "description": "Deploy a self-hosted Go web application using Nginx as a reverse proxy. ",
  "mainEntityOfPage": {
    "@type": "WebPage",
    "@id": "https://hackersandslackers.com"
  }
}
Example JSON-LD for a Hackers and Slackers post.

There's significantly more information stored in the above snippet than in all other meta tags on the same page combined. Not only does JSON-LD support far more attributes than traditional meta tags, but representing the data as a JSON hierarchy makes it immediately clear how a page's metadata is related: we're looking at an object representing an article, written by an author, as part of an "organization."
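For context, JSON-LD like the snippet above typically lives in a page's <head>, embedded in a <script type="application/ld+json"> tag. You could pull it out with nothing but the standard library; here's a minimal sketch using a hypothetical page string (real pages deserve a proper HTML parser):

```python
import json
import re

# A hypothetical page with an embedded JSON-LD block in its <head>.
html = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Article", "headline": "Example Post"}
</script>
</head></html>
"""

# Naive regex extraction; fine for a demo, fragile in the wild.
match = re.search(
    r'<script type="application/ld\+json">(.*?)</script>',
    html,
    re.DOTALL,
)
metadata = json.loads(match.group(1))
print(metadata["headline"])  # Example Post
```

This fragility is exactly why we'll lean on a purpose-built library shortly instead of rolling our own.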

Google's explanation of the benefits of structuring metadata goes something like this:

Structured data is a standardized format for providing information about a page and classifying the page content; for example, on a recipe page, what are the ingredients, the cooking time and temperature, the calories, and so on.

Type

The term "web page" is uselessly ambiguous, as web pages are documents that can provide information in any number of forms. Web pages might be articles, recipes, product pages, events, and far more. The official schema of possible page types includes over one thousand possibilities for what "type" (or "subtype") a page might be. Knowing the "type" of a page reduces ambiguity, and declaring a page "type" allows us to attach type-specific metadata to pages as well! For instance, let's compare the attributes of an Episode type to an Article type:

Episode
Property Description
actor An actor, e.g. in tv, radio, movie, video games etc., or in an event. Actors can be associated with individual items or with a series, episode, clip. Supersedes actors.
director A director of e.g. tv, radio, movie, video gaming etc. content, or of an event. Directors can be associated with individual items or with a series, episode, clip. Supersedes directors.
episodeNumber Position of the episode within an ordered group of episodes.
musicBy The composer of the soundtrack.
partOfSeason The season to which this episode belongs.
partOfSeries The series to which this episode or season belongs. Supersedes partOfTVSeries.
productionCompany The production company or studio responsible for the item e.g. series, video game, episode etc.
trailer The trailer of a movie or tv/radio series, season, episode, etc.
Article
Property Description
articleBody The actual body of the article.
articleSection Articles may belong to one or more 'sections' in a magazine or newspaper, such as Sports, Lifestyle, etc.
backstory For an Article, typically a NewsArticle, the backstory property provides a textual summary giving a brief explanation of why and how an article was created. In a journalistic setting this could include information about reporting process, methods, interviews, data sources, etc.
pageEnd The page on which the work ends; for example "138" or "xvi".
pageStart The page on which the work starts; for example "135" or "xiii".
pagination Any description of pages that is not separated into pageStart and pageEnd; for example, "1-6, 9, 55" or "10-12, 46-49".
speakable Indicates sections of a Web page that are particularly 'speakable' in the sense of being highlighted as being especially appropriate for text-to-speech conversion. Other sections of a page may also be usefully spoken in particular circumstances; the 'speakable' property serves to indicate the parts most likely to be generally useful for speech.

wordCount The number of words in the text of the Article.

There are obviously data attributes of television shows which don't apply to news articles (such as actors, directors, etc.), and vice versa. The level of specificity achievable is nearly unfathomable when we discover that types have subtypes: for instance, our article might be an opinion piece, a subtype which extends the Article type with even more attributes.
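When scraping at scale, knowing a page's type up front lets you branch your parsing logic accordingly. A minimal sketch; the type set and helper below are hypothetical conveniences, not part of any library:

```python
# Hypothetical set of schema.org types we'd choose to treat as "articles."
ARTICLE_TYPES = {"Article", "NewsArticle", "OpinionNewsArticle", "BlogPosting"}


def is_article(metadata: dict) -> bool:
    """Check whether a parsed JSON-LD block describes some flavor of article."""
    return metadata.get("@type") in ARTICLE_TYPES


print(is_article({"@type": "OpinionNewsArticle"}))  # True
print(is_article({"@type": "Episode"}))  # False
```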

Who

All content has a creator, yet content-creators can take many forms. Authors, publishers, and organizations could simultaneously be considered the responsible party for any given content, as these properties are not mutually exclusive. For instance, here's how my author data is parsed on posts like this one:

Author
Property Description
@type Person
name Todd Birchard
image https://hackersandslackers-cdn.storage.googleapis.com/2020/04/todd@2x.jpg
sameAs https://toddbirchard.com/
sameAs https://twitter.com/ToddRBirchard

What makes this data especially interesting is the values listed under the sameAs attribute, which associate the "Todd Birchard" in question with the very same Todd Birchard of the website https://toddbirchard.com/ and Twitter account https://twitter.com/ToddRBirchard. This undoubtedly assists search engines in making associations between entities on the web. Still, a keen imagination may recognize the opportunity to leverage these strong associations to dox or harass strangers on the internet quite easily.

Scrape Something Together

Along with Extruct, we'll be installing our good friend requests to fetch pages for us:

$ pip3 install requests extruct
Install libraries

You already know the drill — pick a single URL for now and loot it for all it's got by returning .text from our request's response:

import requests


def get_html(url):
    """Get raw HTML from a URL."""
    headers = {
        'Access-Control-Allow-Origin': '*',
        'Access-Control-Allow-Methods': 'GET',
        'Access-Control-Allow-Headers': 'Content-Type',
        'Access-Control-Max-Age': '3600',
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }
    req = requests.get(url, headers=headers)
    return req.text
Retrieve a page's HTML.

Simple stuff. Here's where extruct comes in; I'm tossing together a function called get_metadata, which will do precisely what you'd assume. We can take the raw HTML we grabbed with get_html and pass it to our new function to pillage:

"""Fetch structured JSON-LD data from a given URL."""
from pprint import pprint
import requests
import extruct
from w3lib.html import get_base_url


def scrape(url):
    """Parse structured data from a target page."""
    html = get_html(url)
    metadata = get_metadata(html, url)
    pprint(metadata, indent=2, width=150)
    return metadata


...


def get_metadata(html, url):
    """Fetch JSON-LD structured data."""
    metadata = extruct.extract(
        html,
        base_url=get_base_url(html, url),
        syntaxes=['json-ld'],
    )['json-ld'][0]
    return metadata
Getting structured data with extruct.

Using extruct is as easy as passing raw HTML as a string and a site's "base URL" with extruct.extract(html, base_url=url). A "base URL" refers to a site's entry point (or homepage, whatever) for the targeted page. The page you're on right now is https://hackersandslackers.com/scrape-metadata-json-ld/; the base URL, in this case, would be https://hackersandslackers.com/. A handy library called w3lib has a function to handle this exact task, hence our usage of base_url=get_base_url(html, url).
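If you're curious what that base-URL logic amounts to, a rough standard-library equivalent looks like this (a naive sketch: it only keeps the scheme and host, whereas w3lib's get_base_url also respects any <base> tag in the HTML):

```python
from urllib.parse import urlparse


def naive_base_url(url: str) -> str:
    """Derive a site's base URL from any page URL (scheme + host only)."""
    parts = urlparse(url)
    return f"{parts.scheme}://{parts.netloc}/"


print(naive_base_url("https://hackersandslackers.com/scrape-metadata-json-ld/"))
# https://hackersandslackers.com/
```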

This is what our extract function returns, using one of my posts as an example:

{ '@context': 'https://schema.org/',
  '@type': 'Article',
  'author': { '@type': 'Person',
              'image': 'https://hackersandslackers-cdn.storage.googleapis.com/2020/04/todd@2x.jpg',
              'name': 'Todd Birchard',
              'sameAs': ['https://toddbirchard.com', 'https://twitter.com/ToddRBirchard']},
  'dateModified': '2020-06-11T16:57:57.000-04:00',
  'datePublished': '2018-11-11T08:35:09.000-05:00',
  'description': "Use Python's BeautifulSoup library to assist in the honest act of systematically stealing data without permission.",
  'headline': 'Scraping Data on the Web with BeautifulSoup',
  'image': { '@type': 'ImageObject',
             'height': '523',
             'url': 'https://hackersandslackers-cdn.storage.googleapis.com/2020/06/beautifulsoup-1-1.jpg',
             'width': '1000'},
  'keywords': 'Python, Data Engineering',
  'mainEntityOfPage': {'@id': 'https://hackersandslackers.com', '@type': 'WebPage'},
  'publisher': { '@type': 'Organization',
                 'founder': 'Todd Birchard',
                 'logo': { '@type': 'ImageObject',
                           'height': 60,
                           'url': 'https://hackersandslackers-cdn.storage.googleapis.com/2020/03/logo-blue-full.png',
                           'width': 60},
                 'name': 'Hackers and Slackers'},
  'url': 'https://hackersandslackers.com/scraping-urls-with-beautifulsoup/'}
JSON-LD data for a Hackers and Slackers post.

One of the keyword arguments we passed to extruct was syntaxes, which is an optional argument where we specify which flavor of structured data we're after (apparently there's more than one). Possible options to pass are 'microdata', 'json-ld', 'opengraph', 'microformat', and 'rdfa'. If nothing is passed, extruct will attempt to fetch all of the above and return the results in a dictionary. This is why we follow up our extruct call by accessing the ['json-ld'] key.

Lastly, you're probably wondering why we index [0] after getting our results from extruct. I honestly couldn't tell you why, but extruct always returns each syntax's data in the form of a list — almost always a list of a single item. My best guess is this is to accommodate pages which might have multiple blocks of structured content on a single page. This seems like a strange design decision considering such pages would be misusing structured data, but whatever. I'm not a critic; I'm just a guy who writes tutorials on the internet.
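If you do stumble onto one of those pages with multiple JSON-LD blocks, you could drop the [0] and filter the list yourself. A sketch, with hypothetical data standing in for extruct's output:

```python
# Hypothetical output of extruct.extract(...)['json-ld'] for a page
# carrying more than one structured-data block.
blocks = [
    {"@type": "WebSite", "name": "Hackers and Slackers"},
    {"@type": "Article", "headline": "Scrape the Web with JSON-LD"},
]

# Prefer the Article block; fall back to the first block if none exists.
article = next(
    (block for block in blocks if block.get("@type") == "Article"),
    blocks[0],
)
print(article["headline"])  # Scrape the Web with JSON-LD
```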

Here's the script in its entirety:

"""Fetch structured JSON-LD data from a given URL."""
from pprint import pprint
import requests
import extruct
from w3lib.html import get_base_url


def scrape(url):
    """Parse structured data from a target page."""
    html = get_html(url)
    metadata = get_metadata(html, url)
    pprint(metadata, indent=2, width=150)
    return metadata


def get_html(url):
    """Get raw HTML from a URL."""
    headers = {
        'Access-Control-Allow-Origin': '*',
        'Access-Control-Allow-Methods': 'GET',
        'Access-Control-Allow-Headers': 'Content-Type',
        'Access-Control-Max-Age': '3600',
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }
    req = requests.get(url, headers=headers)
    return req.text


def get_metadata(html, url):
    """Fetch JSON-LD structured data."""
    metadata = extruct.extract(
        html,
        base_url=get_base_url(html, url),
        syntaxes=['json-ld'],
    )['json-ld'][0]
    return metadata
Scrape a single page for structured data.

One More For the Toolbox

Unless you're actually looking to create link previews like the one I included, using extruct as a standalone library without a more extensive plan or toolkit isn't going to deliver much to you other than an easy interface for getting better metadata from individual web pages. Instead, consider looking at the bigger picture of what a single page's metadata gives us. We now have effortless access to information that crawlers can use to move through sites, associate data with individuals, and ultimately create a picture of an entity's entire web presence, whether that entity is a person, organization, or whatever.

If you look closely, one of extruct's main dependencies is actually BeautifulSoup. You could argue that you may have been able to write this library yourself, and you might be right, but that isn't the point. Data mining behemoths aren't nuclear arsenals; they're collections of tools cleverly used in conjunction to wreak havoc upon the world as efficiently as possible. We're getting there.

This has been a quick little script, but if you're interested, I've thrown the source up on GitHub here:

hackersandslackers/jsonld-scraper-tutorial
🌎 🖥 Supercharge your scraper to extract quality page metadata by parsing JSON-LD data via Python’s extruct library. - hackersandslackers/jsonld-scraper-tutorial

Until next time.