Unless you're entirely oblivious to scraping data in Python (and probably ended up here by accident), you're well aware that scraping in Python begins and ends with BeautifulSoup. BeautifulSoup is Python's scraping powerhouse: we first demonstrated this in a previous post where we put together a script to fetch site metadata (title, description, preview images, etc.) from any target URL. We built a scraper which fetched a target site's <meta> tags (and various fallbacks) to create a fairly reliable tool for summarizing the contents of any URL, which is precisely the logic used to generate link "previews."
Perusing the various sites and entities we refer to as "the internet" has traditionally felt like navigating an unstandardized Wild West. There's never a guarantee that the website you're targeting adheres to any web standards (despite their own best interests). These situations lead us to write scripts with complicated fallbacks in case the owner of myhorriblewebsite.angelfire.com somehow managed to forget to give their page a <title>, and so forth. Search engines and other big players recognized this. JSON-LD was standardized as a reliable format for site publishers to include machine-readable (and also quite human-readable) metadata to appease search engines and fight for relevancy.
This post is going to build upon the goal of scraping site metadata we previously explored with BeautifulSoup via a different method: by parsing JSON-LD metadata with Python's extruct library.
What's so great about JSON-LD, you might ask? Aside from dodging the hellish experience of traversing the DOM by hand, JSON-LD is a specification with notable advantages over old-school HTML <meta> tags. The multitude of benefits can mostly be boiled down into two categories: data granularity and linked data.
JSON-LD allows web pages to express an impressive amount of granular information about what each page is. For instance, here's the JSON-LD for one of my posts:
There's significantly more information stored in the above snippet than all other meta tags on the same page combined. There are surely more supported attributes in JSON-LD than traditional meta tags, yet the representation of data in a JSON hierarchy makes it immediately clear how page metadata is related. It's immediately clear that we're looking at an object representing an article, written by an author, as part of an "organization."
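To make that hierarchy concrete, here is a minimal sketch using only Python's standard library. The page below is made up (the headline, names, and organization are hypothetical, not the metadata of any real post); the JSON-LD sits in a `<script type="application/ld+json">` tag, which is where publishers embed it.

```python
import json
from html.parser import HTMLParser

# Hypothetical page: every value below is invented for illustration.
PAGE = """
<html><head>
<title>Example post</title>
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "An Example Headline",
  "author": {"@type": "Person", "name": "Jane Doe"},
  "publisher": {"@type": "Organization", "name": "Example Org"}
}
</script>
</head><body></body></html>
"""


class JsonLdParser(HTMLParser):
    """Collect the parsed contents of every JSON-LD <script> block."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        # Script contents may arrive in several chunks, so buffer them.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append(json.loads("".join(self._buffer)))
            self._in_json_ld = False


parser = JsonLdParser()
parser.feed(PAGE)
article = parser.blocks[0]

# The nesting itself tells us how the entities relate.
print(article["@type"])              # Article
print(article["author"]["name"])     # Jane Doe
print(article["publisher"]["name"])  # Example Org
```

Nothing here required walking the DOM: once the script tag is found, the relationships between article, author, and publisher come straight out of the JSON.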
Google's explanation of the benefits of structuring metadata goes something like this:
Structured data is a standardized format for providing information about a page and classifying the page content; for example, on a recipe page, what are the ingredients, the cooking time and temperature, the calories, and so on.
The term "web page" is uselessly ambiguous, as web pages are documents that can provide information in any number of forms. Web pages might be articles, recipes, product pages, events, and far more. The official schema of possible page types includes over one thousand possibilities for what "type" or "subtype" a page might be considered to be. Knowing the "type" of a page reduces ambiguity, and declaring a page "type" allows us to attach type-specific metadata to pages as well! For instance, let's compare the attributes of an Episode type to an Article type:
|Episode property|Description|
|actor|An actor, e.g. in tv, radio, movie, video games etc., or in an event. Actors can be associated with individual items or with a series, episode, clip. Supersedes actors.|
|director|A director of e.g. tv, radio, movie, video gaming etc. content, or of an event. Directors can be associated with individual items or with a series, episode, clip. Supersedes directors.|
|episodeNumber|Position of the episode within an ordered group of episodes.|
|musicBy|The composer of the soundtrack.|
|partOfSeason|The season to which this episode belongs.|
|partOfSeries|The series to which this episode or season belongs. Supersedes partOfTVSeries.|
|productionCompany|The production company or studio responsible for the item e.g. series, video game, episode etc.|
|trailer|The trailer of a movie or tv/radio series, season, episode, etc.|

|Article property|Description|
|articleBody|The actual body of the article.|
|articleSection|Articles may belong to one or more 'sections' in a magazine or newspaper, such as Sports, Lifestyle, etc.|
|backstory|For an Article, typically a NewsArticle, the backstory property provides a textual summary giving a brief explanation of why and how an article was created. In a journalistic setting this could include information about reporting process, methods, interviews, data sources, etc.|
|pageEnd|The page on which the work ends; for example "138" or "xvi".|
|pageStart|The page on which the work starts; for example "135" or "xiii".|
|pagination|Any description of pages that is not separated into pageStart and pageEnd; for example, "1-6, 9, 55" or "10-12, 46-49".|
|speakable|Indicates sections of a Web page that are particularly 'speakable' in the sense of being highlighted as being especially appropriate for text-to-speech conversion. Other sections of a page may be usefully spoken in particular circumstances; the 'speakable' property serves to indicate the parts most likely to be generally useful for speech.|
|wordCount|The number of words in the text of the Article.|
There are obviously data attributes of television shows which don't apply to news articles (such as actors, director, etc.), and vice versa. The level of specificity achievable is nearly unfathomable when we discover that types have subtypes. For instance, our article might be an opinion piece article, which has extended the Article type with even more attributes.
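In code, this means a consumer can branch on a node's `@type` and only look for the attributes that type defines. A small sketch: the `TYPE_FIELDS` mapping is a hand-picked subset of schema.org property names chosen for illustration, and both the helper and the sample node are hypothetical.

```python
# Hand-picked subset of type-specific properties (not the full schema.org lists).
TYPE_FIELDS = {
    "Episode": ["episodeNumber", "partOfSeason", "partOfSeries", "director"],
    "Article": ["articleSection", "wordCount", "pageStart", "pageEnd"],
}


def type_specific_fields(node: dict) -> dict:
    """Return only the properties that are meaningful for the node's @type."""
    node_type = node.get("@type")
    # JSON-LD allows @type to be a list; use the first type we know about.
    candidates = node_type if isinstance(node_type, list) else [node_type]
    for candidate in candidates:
        if candidate in TYPE_FIELDS:
            wanted = TYPE_FIELDS[candidate]
            return {key: node[key] for key in wanted if key in node}
    return {}


# Invented sample node: "director" is present but ignored, since it is
# not an Article property in the mapping above.
node = {"@type": "Article", "headline": "Example", "wordCount": 1200, "director": "n/a"}
print(type_specific_fields(node))  # {'wordCount': 1200}
```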
All content has a creator, yet content-creators can take many forms. Authors, publishers, and organizations could simultaneously be considered the responsible party for any given content, as these properties are not mutually exclusive. For instance, here's how my author data is parsed on posts like this one:
What makes this data especially interesting is the values listed under the sameAs attribute, which associates the "Todd Birchard" in question to the very same Todd Birchard of the website https://toddbirchard.com/, and Twitter account https://twitter.com/ToddRBirchard. This undoubtedly assists search engines in making associations between entities on the web. Still, a keen imagination may recognize the opportunity to leverage these strong associations to dox or harass strangers on the internet quite easily.
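Following those links programmatically takes very little code. A sketch, assuming an author node reduced to the fields mentioned above; the `same_as_links` helper is hypothetical, and the `Person` type on the node is an assumption.

```python
def same_as_links(entity: dict) -> list:
    """Return the entity's sameAs URLs as a list, whatever shape they arrive in."""
    value = entity.get("sameAs", [])
    # sameAs may be a single URL string or a list of URLs.
    if isinstance(value, str):
        return [value]
    return list(value)


# Author node reduced to the fields mentioned in the text ("@type" is assumed).
author = {
    "@type": "Person",
    "name": "Todd Birchard",
    "sameAs": [
        "https://toddbirchard.com/",
        "https://twitter.com/ToddRBirchard",
    ],
}

for link in same_as_links(author):
    print(f"{author['name']} is also found at {link}")
```

Each of those URLs is a candidate for the next page to fetch, which is how a crawler would hop from one profile of an entity to the next.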
Scrape Something Together
Along with extruct, we'll be installing our good friend requests to fetch pages for us. Both are available from PyPI (pip install extruct requests).
You already know the drill: pick a single URL for now and loot it for all it's got by returning .text from our request's response:
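A minimal sketch of such a `get_html` function, assuming `requests` is installed. The `User-Agent` string and the timeout are illustrative choices, not values taken from the original script.

```python
def get_html(url: str, timeout: float = 10.0) -> str:
    """Fetch `url` and return the raw HTML of the response as a string."""
    # Third-party package, imported lazily so this module loads without it.
    import requests

    # Illustrative header; some sites reject requests without a User-Agent.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; metadata-sketch/0.1)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.text
```

Calling `get_html("https://hackersandslackers.com/scrape-metadata-json-ld/")` would hand back that page's markup as one long string, ready to be parsed.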
Simple stuff. Here's where extruct comes in; I'm tossing together a function called get_metadata, which will do precisely what you'd assume. We can take the raw HTML we grabbed with get_html and pass it to our new function to pillage:
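A sketch of what `get_metadata` could look like, assuming `extruct` and `w3lib` are installed. It uses `w3lib.html.get_base_url` to work out the base URL and asks extruct for JSON-LD only; whether the original function is written exactly this way is an assumption, and returning an empty dict when nothing is found is a choice made for this sketch.

```python
def get_metadata(html: str, url: str) -> dict:
    """Return the first JSON-LD object found in `html`, or {} if there is none."""
    # Third-party packages, imported lazily so this module loads without them.
    import extruct
    from w3lib.html import get_base_url

    results = extruct.extract(
        html,
        base_url=get_base_url(html, url),
        syntaxes=["json-ld"],  # only look for JSON-LD, skipping other formats
    )
    json_ld = results["json-ld"]  # extruct keys its results by syntax name
    return json_ld[0] if json_ld else {}
```

With a fetching helper such as `get_html`, usage would be `get_metadata(get_html(url), url)`.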
Using extruct is as easy as passing raw HTML as a string, along with the page's "base URL," to extruct.extract(html, base_url=url). A "base URL" is the URL that relative links in the page are resolved against: the href of the page's <base> tag if it declares one, and otherwise the URL of the page itself (not necessarily the site's homepage). The page you're on right now is https://hackersandslackers.com/scrape-metadata-json-ld/, so unless a <base> tag says otherwise, that is also its base URL. There's a library called w3lib with a function that handles this exact task, which is what our function relies on.
This is what our extract function returns, using one of my posts as an example:
One of the keyword arguments we passed to extruct was syntaxes, which is an optional argument where we specify which flavor of structured data we're after (apparently there's more than one). Possible options include 'json-ld', 'microdata', 'opengraph', 'microformat', and 'rdfa'. If nothing is passed, extruct will attempt to fetch all of the above and return the results in a dictionary keyed by syntax. This is why we follow up our extruct call by accessing the 'json-ld' key.
Lastly, you're probably wondering why we index [0] after getting our results from extruct. I honestly couldn't tell you why, but extruct always returns each syntax's data in the form of a list, almost always a list of a single item. My best guess is this is to accommodate pages which might have multiple blocks of structured content on a single page. This seems like a strange design decision considering such pages would be misusing structured data, but whatever. I'm not a critic; I'm just a guy who writes tutorials on the internet.
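If a page does carry several JSON-LD blocks, blindly taking the first one may grab the wrong object. A small sketch of a more defensive pick: the `pick_item` helper and the sample result are hypothetical, shaped like the dictionary-of-lists described above.

```python
from typing import Optional


def pick_item(items: list, preferred_type: str = "Article") -> Optional[dict]:
    """Pick the item whose @type matches, falling back to the first item."""
    for item in items:
        types = item.get("@type", [])
        if isinstance(types, str):  # @type may be a string or a list of strings
            types = [types]
        if preferred_type in types:
            return item
    return items[0] if items else None


# Invented result: a dict keyed by syntax, each value a list of extracted objects.
results = {
    "json-ld": [
        {"@type": "WebSite", "name": "Example Site"},
        {"@type": "Article", "headline": "Example Headline"},
    ]
}

print(pick_item(results["json-ld"]))  # {'@type': 'Article', 'headline': 'Example Headline'}
print(pick_item([]))                  # None
```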
Here's the script in its entirety:
One More For the Toolbox
Unless you're actually looking to create link previews like the one I included, using extruct as a standalone library without a more extensive plan or toolkit isn't going to deliver much to you other than an easy interface for getting better metadata from individual web pages. Instead, consider looking at the bigger picture of what a single page's metadata gives us. We now have effortless access to information that crawlers can use to move through sites, associate data with individuals, and ultimately create a picture of an entity's entire web presence, whether that entity is a person, organization, or whatever.
If you look closely, one of extruct's main dependencies is actually BeautifulSoup. You could argue that you may have been able to write this library yourself, and you might be right, but that isn't the point. Data mining behemoths aren't nuclear arsenals; they're collections of tools used in conjunction cleverly to wreak havoc upon the world as efficiently as possible. We're getting there.
This has been a quick little script, but if you're interested, I've thrown the source up on GitHub.
Until next time.