There are plenty of reliable and open sources of data on the web. Datasets are freely released to the public domain by the likes of Kaggle, Google Cloud, and of course local & federal governments. Like most things free and open, however, following the rules to obtain public data can be a bit... boring. I'm not suggesting we go and blatantly break some grey-area laws by stealing data, but this blog isn't exactly called People Who Play It Safe And Slackers, either.

My personal Python roots can actually be traced back to an ambitious side-project: to aggregate all new music from across the web and deliver it to the masses. While that project may have been abandoned (after realizing it already existed), BeautifulSoup was more-or-less my first ever experience with Python.

Tools for the Job

Web scraping in Python is dominated by three major libraries: BeautifulSoup, Scrapy, and Selenium. Each of these libraries is intended to solve a very different use case, so it's important to understand what we're choosing and why.

  • BeautifulSoup is one of the most prolific Python libraries in existence, in some part having shaped the web as we know it. BeautifulSoup is a lightweight, easy-to-learn, and highly effective way to programmatically isolate information on a single webpage at a time. BeautifulSoup is typically paired with the requests library, where requests will fetch a page and BeautifulSoup will extract the resulting data.
  • Scrapy has an agenda much closer to mass pillaging than BeautifulSoup. Scrapy is designed to create crawlers: absolute monstrosities unleashed upon the web like a swarm, loosely following links and hastily grabbing data where data exists to be grabbed. Because Scrapy serves the purpose of mass-scraping, it is also much easier to get into trouble with.
  • Selenium isn't exclusively a scraping tool as much as an automation tool which can be used to scrape sites. Selenium is the nuclear option for attempting to programmatically navigate sites, and should be treated as such: there are much better options for simple data extraction.

We'll be using BeautifulSoup, which should truly be anybody's default choice until the circumstances ask for more. BeautifulSoup is more than enough to steal data.

Preparing Our Extraction

Before we steal any data, we need to set the stage. We'll start by installing our two libraries of choice:

pip3 install beautifulsoup4 requests

As we mentioned before, requests will provide us with our target's HTML, and beautifulsoup4 will parse that data.

Before we do anything, we need to recognize that a lot of sites have precautions in place to fend off scrapers from accessing their data. The first thing we can do to get around this is spoofing the headers we send along with our requests to make it look like we're a legitimate browser:

import requests

# Set headers  
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})

This is only a first line of defense (or offense, in our case). There are plenty of ways sites can still keep us at bay, but setting headers works shockingly well to fix most issues.

Now let's fetch a page and inspect it with BeautifulSoup:

from bs4 import BeautifulSoup

...
url = "http://example.com"
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')
print(soup.prettify())

We set things up by making a request to http://example.com. We then create a BeautifulSoup object which accepts the raw content of that response via req.content. The second parameter, 'html.parser', is our way of telling BeautifulSoup this is an HTML document. There are actually other parsers available for parsing stuff like XML, if you're into that.
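
If you ever want to try one of those other parsers, it's just a matter of swapping out that second argument. A quick sketch - note that lxml is a separate pip install:

# A faster third-party HTML parser (pip3 install lxml)
soup = BeautifulSoup(req.content, 'lxml')

# An XML parser (also provided by lxml), for when the target isn't HTML
soup = BeautifulSoup(req.content, 'xml')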

When we create a BeautifulSoup object from a page's HTML, our object contains the full HTML structure of that page which can now be easily parsed by all sorts of methods. First, let's see what our variable soup looks like by using print(soup.prettify()):

<html class="gr__example_com"><head>
    <title>Example Domain</title>
    <meta charset="utf-8">
    <meta http-equiv="Content-type" content="text/html; charset=utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <meta property="og:site_name" content="Example dot com">
    <meta property="og:type" content="website">
    <meta property="og:title" content="Example">
    <meta property="og:description" content="An Example website.">
    <meta property="og:image" content="http://example.com/img/image.jpg">
    <meta name="twitter:title" content="Hackers and Slackers">
    <meta name="twitter:description" content="An Example website.">
    <meta name="twitter:url" content="http://example.com/">
    <meta name="twitter:image" content="http://example.com/img/image.jpg">
</head>

<body data-gr-c-s-loaded="true">
  <div>
    <h1>Example Domain</h1>
      <p>This domain is established to be used for illustrative examples in documents.</p>
      <p>You may use this domain in examples without prior coordination or asking for permission.</p>
    <p><a href="http://www.iana.org/domains/example">More information...</a></p>
  </div>
</body>
    
</html>
HTML for example.com

Targeting HTML Elements

There are a lot of methods available to us for pinpointing and grabbing the information we're trying to get out of a page. Finding the exact information we want out of a web page is a bit of an art form: effective scraping requires us to recognize patterns in a document's HTML that we can take advantage of to ensure we only grab the pieces we need. This is especially the case when dealing with sites which actively try to prevent us from doing just that.

Understanding the tools we have at our disposal is the first step to developing a keen eye for what's possible. We'll start with the meat and potatoes.

Using find() & find_all()

The most straightforward way to find information in our soup variable is by utilizing soup.find(...) or soup.find_all(...). These two methods work the same with one exception: find returns the first HTML element found, whereas find_all returns a list of all elements matching the criteria (even if only one element is found, find_all still returns a list, just with a single item in it).
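
To make the distinction concrete, here's a quick illustrative snippet:

first_link = soup.find("a")      # a single Tag object (or None if nothing matches)
all_links = soup.find_all("a")   # always a list, even if only one match exists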

We can search for DOM elements in our soup variable by searching for certain criteria. Passing a positional argument to find_all will return all anchor tags on the site:

soup.find_all("a")
# <a href="http://example.com/elsie" class="boy" id="link1">Elsie</a>
# <a href="http://example.com/lacie" class="boy" id="link2">Lacie</a> 
# <a href="http://example.com/tillie" class="girl" id="link3">Tillie</a>

We can also find all anchor tags which have the class name "boy". Passing the class_ argument allows us to filter by class name. Note the underscore!

soup.find_all("a" class_="boy")
# <a href="http://example.com/elsie" class="boy" id="link1">Elsie</a>
# <a href="http://example.com/lacie" class="boy" id="link2">Lacie</a> 

If we wanted to get any element with the class name "boy" besides anchor tags, we can do that too:

soup.find_all(class_="boy")
# <a href="http://example.com/elsie" class="boy" id="link1">Elsie</a>
# <a href="http://example.com/lacie" class="boy" id="link2">Lacie</a> 

We can search for elements by id in the same way we searched for classes. Remember that we should only expect a single element to be returned with an id, so we should use find here:

soup.find("a", id="link1")
# <a href="http://example.com/elsie" class="boy" id="link1">Elsie</a>

Oftentimes we'll run into situations where elements don't have reliable class or id values. Luckily for us, we can search for DOM elements with any attribute, including non-standard ones:

soup.find_all(attrs={"data-args": "bologna"})
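
The attrs filter also plays nicely with a tag name if we want to narrow things down further (reusing the made-up data-args attribute from above):

soup.find_all("div", attrs={"data-args": "bologna"})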

CSS Selectors

Searching HTML using CSS selectors is one of the most powerful ways to find exactly what you're looking for, especially for sites trying to make your life difficult. By using CSS selectors, we can find patterns in the target's DOM structure to create elaborate searches based on the patterns we recognize. If you're rusty on CSS selectors, I highly recommend becoming reacquainted. Here are a few examples:

soup.select(".widget.author p")

In this example, we're looking for an element which has a "widget" class, as well as an "author" class. Once we have that element, we go deeper to find any paragraph tags held within that widget. We could also modify this to get only the second paragraph tag inside the author widget:

soup.select(".widget.author p:nth-of-type(2)")

To understand why this is so powerful, imagine a site which intentionally has no identifying attributes on its tags to keep people like you from scraping their data. Even without names to select by, we could observe the DOM structure of the page and find a unique way to navigate to the element we want:

soup.select("body > div:first-of-type > div > ul li")

A specific pattern like this is likely unique to only a single collection of <li> tags on the page we're exploiting. The downside of this method is we're at the whim of the site owner, as their HTML structure could change.

Get Some Attributes

Chances are we'll almost always want the contents or the attributes of a tag, as opposed to the entirety of a tag's HTML. For example, if we're scraping anchor tags we probably just want the destination of the link, as opposed to the entire tag. The .get() method can be used here to retrieve the values of attributes on a tag:

[link.get('href') for link in soup.find_all('a')]

The above finds the destination URLs for all <a> tags on a page. Another example can have us grab a site's logo image:

soup.find(id="logo").get('src') 

Sometimes it's not attributes we're looking for, but just the text within a tag:

soup.find('p').get_text()
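
get_text() also accepts a couple of handy keyword arguments for tidying up the result, like stripping whitespace and choosing a separator for nested text:

soup.find('p').get_text(separator=' ', strip=True)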

Pesky Tags to Deal With

In our example of creating link previews, a good first source of information would obviously be the page's meta tags: specifically the og tags they've specified to openly provide the bite-sized information we're looking for. Grabbing these tags is a bit more difficult to deal with:

soup.find("meta", property="og:description").get('content')

Oh yeah, now that's some ugly shit right there. Meta tags are especially interesting because they're all uselessly dubbed 'meta', thus we need a second differentiator in addition to the tag name to specify which meta tag we care about. Only then can we bother to get the actual content of said tag.
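
If we'd rather not chase og tags one at a time, one illustrative trick is to pass a function as the attribute filter and scoop every Open Graph tag into a dict in a single pass:

# Collect every og:* meta tag into a {property: content} dict
og_tags = {
    tag.get("property"): tag.get("content")
    for tag in soup.find_all("meta", property=lambda p: p and p.startswith("og:"))
}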

Realizing Something Will Always Break

If we were to try the above selector on an HTML page which did not contain an og:description, find() would return None, the call to .get() would throw an AttributeError, and our script would break unforgivingly. Not only do we miss this data, but we miss out on everything entirely - this means we always need to build in a plan B, and at the very least handle the case where a tag is missing altogether.
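
The safest pattern is to grab the tag first, make sure it actually exists, and only then ask for its content. A minimal sketch of that plan B:

# Fall back to an empty string when the og:description tag is missing
tag = soup.find("meta", property="og:description")
description = tag.get('content') if tag is not None else ''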

It's best to break out this logic one tag at a time. First, let's look at an example for a base scraper with all the knowledge we have so far:

import requests
from bs4 import BeautifulSoup

def scrape(url):
    """Scrape scheduled link previews."""
    headers = requests.utils.default_headers()
    headers.update({
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
    })
    r = requests.get(url, headers=headers)
    raw_html = r.content
    soup = BeautifulSoup(raw_html, 'html.parser')
    links = soup.select('body p > a')
    previews = []
    for link in links:
        url = link.get('href')
        r2 = requests.get(url, headers=headers)
        link_html = r2.content
        embedded_link = BeautifulSoup(link_html, 'html.parser')
        link_preview_dict = {
            'title': getTitle(embedded_link),
            'description': getDescription(embedded_link),
            'image': getImage(embedded_link),
            'sitename': getSiteName(embedded_link, url),
            'url': url
            }
        previews.append(link_preview_dict)
        print(link_preview_dict)
    return previews

Great - there's a base function for snatching all links out of the body of a page and returning a list of previews. For each of those links we build a dictionary of preview data, link_preview_dict, which is what we'll ultimately serve up as JSON.
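
Calling it is as simple as pointing it at a page (the URL here is only a stand-in - point it at whatever you're actually scraping):

previews = scrape("https://example.com")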

To handle each value of our dict, we have individual functions:

def getTitle(link):
    """Attempt to get a title."""
    title = ''
    if link.title is not None and link.title.string is not None:
        title = link.title.string
    elif link.find("h1") is not None:
        title = link.find("h1").get_text()
    return title


def getDescription(link):
    """Attempt to get description."""
    description = ''
    if link.find("meta", property="og:description") is not None:
        description = link.find("meta", property="og:description").get('content')
    elif link.find("p") is not None:
        description = link.find("p").content
    return description


def getImage(link):
    """Attempt to get a preview image."""
    image = ''
    if link.find("meta", property="og:image") is not None:
        image = link.find("meta", property="og:image").get('content')
    elif link.find("img") is not None:
        image = link.find("img").get('href')
    return image


def getSiteName(link, url):
    """Attempt to get the site's base name."""
    sitename = ''
    if link.find("meta", property="og:site_name") is not None:
        sitename = link.find("meta", property="og:site_name").get('content')
    else:
        sitename = url.split('//')[1]
        name = sitename.split('/')[0]
        name = name.split('.')[-2]
        return name.capitalize()
    return sitename

In case you're wondering:

  • getTitle tries to get the <title> tag, and falls back to the page's first <h1> tag (surprisingly enough some pages are in fact missing a title).
  • getDescription looks for the OG description, and falls back to the content of the page's first paragraph.
  • getImage looks for the OG image, and falls back to the page's first image.
  • getSiteName similarly tries to grab the OG attribute, otherwise it does its best to extract the domain name from the URL string under the assumption that this is the origin's name (look, it ain't perfect - there's a slightly sturdier take on that fallback sketched below).
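
If that string-splitting fallback in getSiteName feels too crude, here's a sketch of the same idea leaning on the standard library instead - an optional alternative, not what the function above actually uses:

from urllib.parse import urlparse

def get_sitename_fallback(url):
    """Guess a site's name from its hostname, e.g. 'http://www.example.com/page' -> 'Example'."""
    hostname = urlparse(url).hostname  # e.g. 'www.example.com'
    parts = hostname.split('.')
    name = parts[-2] if len(parts) >= 2 else parts[0]
    return name.capitalize()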

What Did We Just Build?

Believe it or not, the above is considered to be enough logic to be a paid service with a monthly fee. Go ahead and Google it; or better yet, just steal my source code entirely:

import requests
from bs4 import BeautifulSoup
from flask import make_response


def getTitle(link):
    """Attempt to get a title."""
    title = ''
    if link.title is not None and link.title.string is not None:
        title = link.title.string
    elif link.find("h1") is not None:
        title = link.find("h1").get_text()
    return title


def getDescription(link):
    """Attempt to get description."""
    description = ''
    if link.find("meta", property="og:description") is not None:
        description = link.find("meta", property="og:description").get('content')
    elif link.find("p") is not None:
        description = link.find("p").content
    return description


def getImage(link):
    """Attempt to get image."""
    image = ''
    if link.find("meta", property="og:image") is not None:
        image = link.find("meta", property="og:image").get('content')
    elif link.find("img") is not None:
        image = link.find("img").get('href')
    return image


def getSiteName(link, url):
    """Attempt to get the site's base name."""
    sitename = ''
    if link.find("meta", property="og:site_name") is not None:
        sitename = link.find("meta", property="og:site_name").get('content')
    else:
        sitename = url.split('//')[1]
        name = sitename.split('/')[0]
        name = name.split('.')[-2]
        return name.capitalize()
    return sitename


def scrape(request):
    """Scrape scheduled link previews."""
    # CORS headers returned with every response: allow POST requests from
    # any origin with the Content-Type header and cache the preflight
    # response for 3600s
    cors_headers = {
        'Access-Control-Allow-Origin': '*',
        'Access-Control-Allow-Methods': 'POST',
        'Access-Control-Allow-Headers': 'Content-Type',
        'Access-Control-Max-Age': '3600'
    }
    if request.method == 'POST':
        request_json = request.get_json()
        target_url = request_json['url']
        # Spoofed browser headers for the outgoing scraping requests
        req_headers = requests.utils.default_headers()
        req_headers.update({
            'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
        })
        r = requests.get(target_url, headers=req_headers)
        raw_html = r.content
        soup = BeautifulSoup(raw_html, 'html.parser')
        links = soup.select('.post-content p > a')
        previews = []
        for link in links:
            url = link.get('href')
            r2 = requests.get(url, headers=req_headers)
            link_html = r2.content
            embedded_link = BeautifulSoup(link_html, 'html.parser')
            preview_dict = {
                'title': getTitle(embedded_link),
                'description': getDescription(embedded_link),
                'image': getImage(embedded_link),
                'sitename': getSiteName(embedded_link, url),
                'url': url
                }
            previews.append(preview_dict)
        return make_response(str(previews), 200, cors_headers)
    return make_response('bruh pls', 400, cors_headers)
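
If you were to deploy this as, say, a Google Cloud Function (the request/make_response signature fits that mold), hitting it is just a POST with a JSON body. The trigger URL below is a placeholder - substitute wherever your deployment ends up living:

import requests

# Hypothetical trigger URL - replace with your own function's endpoint
FUNCTION_URL = "https://us-east1-my-project.cloudfunctions.net/scrape"

response = requests.post(FUNCTION_URL, json={"url": "https://example.com"})
print(response.text)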