Scrape Structured Data with Python and Extruct
Unless you're entirely oblivious to scraping data in Python (and probably ended up here by accident), you're well aware that scraping data in Python begins and ends with BeautifulSoup. BeautifulSoup is Python's scraping powerhouse: we first demonstrated this in a previous post where we put together a script to fetch site metadata (title, description, preview images, etc.) from any target URL. We were able to build a scraper that fetched a target site's <meta> tags (and various fallbacks) to create a fairly reliable tool to summarize the contents of any URL, which is precisely the logic used to generate link "previews."
Perusing the various sites and entities we refer to as "the internet" has traditionally felt like navigating an unstandardized Wild West. There's never a guarantee that the website you're targeting adheres to any web standards (despite its own best interests). These situations lead us to write scripts with complicated fallbacks in case the owner of myhorriblewebsite.angelfire.com somehow managed to forget to give their page a <title>, and so forth. Search engines and other big players recognized this, and JSON-LD emerged as a reliable format for site publishers to include machine-readable (and also quite human-readable) metadata to appease search engines and fight for relevancy.
This post is going to build upon the goal of scraping site metadata we previously explored with BeautifulSoup via a different method: by parsing JSON-LD metadata with Python's extruct library.
What's so great about JSON-LD, you might ask? Aside from dodging the hellish experience of traversing the DOM by hand, JSON-LD is a specification with notable advantages over old-school HTML <meta> tags. The multitude of benefits can mostly be boiled down into two categories: data granularity and linked data.
JSON-LD allows web pages to express an impressive amount of granular information about what each page is. For instance, here's the JSON-LD for one of my posts:
There's significantly more information stored in the above snippet than in all other meta tags on the same page combined. JSON-LD surely supports more attributes than traditional meta tags, and representing the data as a JSON hierarchy makes it immediately clear how page metadata is related: we're looking at an object representing an article, written by an author, as part of an "organization."
Google's explanation of the benefits of structuring metadata goes something like this:
Structured data is a standardized format for providing information about a page and classifying the page content; for example, on a recipe page, what are the ingredients, the cooking time and temperature, the calories, and so on.
The term "web page" is uselessly ambiguous, as web pages are documents that can provide information in any number of forms. Web pages might be articles, recipes, product pages, events, and far more. The official schema of possible page types includes over one thousand possibilities for what "type" or "subtype" a page might be considered to be. Knowing the "type" of a page reduces ambiguity, and declaring a page "type" allows us to attach type-specific metadata to pages as well! For instance, let's compare the attributes of an Episode type to an Article type:
Episode
|Property|Description|
|actor|An actor, e.g. in tv, radio, movie, video games etc., or in an event. Actors can be associated with individual items or with a series, episode, clip. Supersedes actors.|
|director|A director of e.g. tv, radio, movie, video gaming etc. content, or of an event. Directors can be associated with individual items or with a series, episode, clip. Supersedes directors.|
|episodeNumber|Position of the episode within an ordered group of episodes.|
|musicBy|The composer of the soundtrack.|
|partOfSeason|The season to which this episode belongs.|
|partOfSeries|The series to which this episode or season belongs. Supersedes partOfTVSeries.|
|productionCompany|The production company or studio responsible for the item e.g. series, video game, episode etc.|
|trailer|The trailer of a movie or tv/radio series, season, episode, etc.|
Article
|Property|Description|
|articleBody|The actual body of the article.|
|articleSection|Articles may belong to one or more 'sections' in a magazine or newspaper, such as Sports, Lifestyle, etc.|
|backstory|For an Article, typically a NewsArticle, the backstory property provides a textual summary giving a brief explanation of why and how an article was created. In a journalistic setting this could include information about reporting process, methods, interviews, data sources, etc.|
|pageEnd|The page on which the work ends; for example "138" or "xvi".|
|pageStart|The page on which the work starts; for example "135" or "xiii".|
|pagination|Any description of pages that is not separated into pageStart and pageEnd; for example, "1-6, 9, 55" or "10-12, 46-49".|
|speakable|Indicates sections of a Web page that are particularly 'speakable' in the sense of being highlighted as being especially appropriate for text-to-speech conversion. Other sections of a page may be usefully spoken in particular circumstances; the 'speakable' property serves to indicate the parts most likely to be generally useful for speech.|
|wordCount|The number of words in the text of the Article.|
There are obviously data attributes of television shows which don't apply to news articles (such as actors, director, etc.), and vice versa. The level of specificity achievable is nearly unfathomable when we discover that types have subtypes. For instance, our article might be an opinion piece article, which has extended the Article type with even more attributes.
All content has a creator, yet content-creators can take many forms. Authors, publishers, and organizations could simultaneously be considered the responsible party for any given content, as these properties are not mutually exclusive. For instance, here's how my author data is parsed on posts like this one:
What makes this data especially interesting is the values listed under the sameAs attribute, which associate the "Todd Birchard" in question with the very same Todd Birchard of the website https://toddbirchard.com/ and the Twitter account https://twitter.com/ToddRBirchard. This undoubtedly assists search engines in making associations between entities on the web. Still, a keen imagination may recognize how easily these strong associations could be leveraged to dox or harass strangers on the internet.
Scrape Something Together
Along with extruct, we'll be installing our good friend requests to fetch pages for us (both install from PyPI with pip install extruct requests):
You already know the drill: pick a single URL for now and loot it for all it's got by returning .text from our request's response:
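The fetching step described above can be sketched roughly as follows. This is a minimal version, not necessarily the post's original code: the User-Agent string, the 10-second timeout, and the call to raise_for_status() are my own assumptions, added so that a blocked or failed request surfaces as an error instead of being parsed as if it were the page.

```python
import requests


def get_html(url: str) -> str:
    """Fetch a page and return its raw HTML as a string."""
    # Assumption: a browser-like User-Agent, since some sites reject the default one.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; metadata-scraper-sketch)"}
    response = requests.get(url, headers=headers, timeout=10)
    # Raise on 4xx/5xx responses rather than returning an error page's HTML.
    response.raise_for_status()
    return response.text
```

Calling get_html with a URL returns the page's markup as a plain string, which is the form extruct expects.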
Simple stuff. Here's where extruct comes in; I'm tossing together a function called get_metadata, which will do precisely what you'd assume. We can take the raw HTML we grabbed with get_html and pass it to our new function to pillage:
Using extruct is as easy as passing raw HTML as a string along with a "base URL": extruct.extract(html, base_url=url). The base URL is what relative links in the document are resolved against: the href of the page's <base> tag if it declares one, otherwise the URL the page was fetched from. The page you're on right now is https://hackersandslackers.com/scrape-metadata-json-ld/, so unless its markup declares a <base> tag, that URL is also its base URL. w3lib, a library extruct already depends on, has a function to handle this exact task, hence our usage of get_base_url().
This is what our extract function returns, using one of my posts as an example:
One of the keyword arguments we passed to extruct was syntaxes, which is an optional argument where we specify which flavor of structured data we're after (apparently there's more than one). Possible options to pass are 'microdata', 'json-ld', 'opengraph', 'microformat', and 'rdfa'. If nothing is passed, extruct will attempt to fetch all of the above and return the results in a dictionary. This is why we follow up our extruct call by accessing the 'json-ld' key.
Dealing with Inconsistent Results
You might be wondering why we index [0] after getting our results from extruct. This is a symptom of structured data: where traditional <meta> tags are predictably one-dimensional, the "structure" of structured data is flexible and determined by developers. This flexibility gives developers the power to do things like define a site's share image as a list of dicts as opposed to a single dict. It also makes the output of any given site's data unpredictable, which poses problems for Python scripts that don't know whether they should be indexing a list or accessing a dictionary value.
The way I handle this is by explicitly checking the Python type of data being returned before extracting it:
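One way this kind of type check could look is sketched below. The function name normalize_json_ld is hypothetical, and the decision to return None for anything unexpected is my own assumption rather than the post's behaviour.

```python
from typing import Any, Optional


def normalize_json_ld(data: Any) -> Optional[dict]:
    """Reduce extruct's JSON-LD output to a single dict, whatever shape it arrived in."""
    if isinstance(data, list):
        # A list of objects: take the first dict, if there is one.
        for item in data:
            if isinstance(item, dict):
                return item
        return None
    if isinstance(data, dict):
        # Already a single object.
        return data
    # Anything else (None, a string, ...) is treated as "no metadata".
    return None
```

Callers can then rely on getting either a dict or None, and only need a single check before reading attributes.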
This uncertainty about returned data types occurs everywhere. In the example where a page may have multiple meta images, I might write a function like get_image() below, where I explicitly check the type of data being returned for a given attribute while traversing the data tree:
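A sketch of how a get_image() along these lines might be written. It assumes the image attribute can arrive as a URL string, as a dict with a url key, or as a list of either. Those three shapes are the cases I'm choosing to handle, and real sites may produce others.

```python
from typing import Optional


def get_image(metadata: dict) -> Optional[str]:
    """Return a single image URL from a JSON-LD object, or None if there isn't one."""
    image = metadata.get("image")
    if isinstance(image, list):
        # Several images declared: fall through with the first one.
        image = image[0] if image else None
    if isinstance(image, dict):
        # An image object: the URL lives under its "url" key.
        image = image.get("url")
    if isinstance(image, str):
        return image
    return None
```

Each check narrows the value one level, so a list of image objects passes through the list branch and then the dict branch before a string is returned.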
Put it to Work
A script to fetch and return structured data from a site would look something like this:
Testing our Scraper
Since we're grownups, it's best if we write a simple test or two for a script that could potentially be run on a massive scale. The bare minimum we could do is point our scraper at a site containing structured data and compare the output to the data we'd expect to see. Below is a small test written with pytest to check that our scrape() function outputs data that matches a hardcoded copy of what I expect to get back:
Build a Metadata Scraper
Of course, the above simply puts data on a silver platter for you - there's still the work of grabbing the values. To give you an example of a fully fleshed-out script to scrape metadata with extruct, I'll share with you my own personal treasure: the script I use to generate link previews:
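As a rough illustration of that "grabbing the values" step, and not the link preview script itself, here is a small sketch that flattens one JSON-LD object into the handful of fields a preview needs. The build_preview and first_of names, the output keys, and the choice of fallbacks (headline before name) are all hypothetical.

```python
from typing import Any, Optional


def first_of(data: dict, *keys: str) -> Optional[Any]:
    """Return the first non-empty value found under any of the given keys."""
    for key in keys:
        value = data.get(key)
        if value:
            return value
    return None


def build_preview(metadata: dict) -> dict:
    """Flatten one JSON-LD object into the fields a link preview needs."""
    publisher = metadata.get("publisher")
    return {
        "title": first_of(metadata, "headline", "name"),
        "description": first_of(metadata, "description"),
        "url": first_of(metadata, "url"),
        # The publisher may be a nested object or a plain string.
        "site": publisher.get("name") if isinstance(publisher, dict) else publisher,
    }
```

Missing attributes simply come back as None, leaving it to the caller to decide which fields a preview can live without.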
One More For the Toolbox
Unless you're actually looking to create link previews like the one I included, using extruct as a standalone library without a more extensive plan or toolkit isn't going to deliver much to you other than an easy interface for getting better metadata from individual web pages. Instead, consider looking at the bigger picture of what a single page's metadata gives us. We now have effortless access to information that crawlers can use to move through sites, associate data with individuals, and ultimately create a picture of an entity's entire web presence, whether that entity is a person, organization, or whatever.
If you look closely, one of extruct's main dependencies is actually BeautifulSoup. You could argue that you may have been able to write this library yourself, and you might be right, but that isn't the point. Data mining behemoths aren't nuclear arsenals; they're collections of tools used in conjunction cleverly to wreak havoc upon the world as efficiently as possible. We're getting there.
This has been a quick little script, but if you're interested I've thrown the source up on GitHub here:
Until next time.