Web scraping involves extracting data from websites and presenting it in a meaningful format. Raw HTML, however, is only loosely structured, and pulling content out of it is often difficult. This article focuses on extracting content from a given URL using Perl and Mojo::DOM.
Before we dive into the details, here’s a short summary of the steps involved in the process:
- Getting the HTML of the web page using a Headless Browser
- Converting unstructured HTML into a DOM tree using an HTML Parser
- Selecting required elements and extracting content using CSS Selectors
- Dynamically generating CSS Selectors which can be extended to other pages that have the same template
Headless browsers run without a graphical interface and are used for browser automation tasks such as rendering pages, taking screenshots, and testing.
A recent release of Chrome added a built-in headless mode, which allows a page’s rendered HTML to be dumped with a single command:
chrome --headless --disable-gpu --dump-dom https://www.chromestatus.com/
Though this release has a few rough edges, it looks promising and is expected to improve. Several other headless browsers and browser automation tools are available as alternatives.
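Since the rest of the pipeline is in Perl, it can be convenient to capture the output of that command from a script. The following is a minimal sketch, not a definitive implementation: the helper name dump_dom is hypothetical, and the name of the browser binary (chrome, google-chrome, chromium, and so on) is an assumption that depends on the system.

```perl
use strict;
use warnings;

# Run a headless browser with --dump-dom and return what it prints.
# $browser is an assumption: the binary name differs between systems.
sub dump_dom {
    my ($url, $browser) = @_;
    $browser //= 'chrome';

    # The list form of a pipe open bypasses the shell, so the URL is
    # passed to the browser as a single, unmodified argument.
    open(my $fh, '-|', $browser, '--headless', '--disable-gpu', '--dump-dom', $url)
        or die "Cannot run $browser: $!";
    binmode($fh, ':encoding(UTF-8)');

    # Slurp the whole output in one read.
    my $html = do { local $/; <$fh> };
    close($fh) or warn "$browser exited with status $?\n";

    return $html;
}
```

A call such as dump_dom('https://www.chromestatus.com/') would then return the page HTML as a string, ready to be handed to the parser.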
Let’s say we have the HTML of a page and need to extract all images from it.
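One way to do this with Mojo::DOM is sketched below. The sample markup is an assumption standing in for the HTML fetched earlier.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Sample markup standing in for a fetched page (an assumption).
my $html = <<'HTML';
<html><body>
  <img src="/logo.png" alt="Logo">
  <p>Some text <img src="/photo.jpg"></p>
  <img alt="no source">
</body></html>
HTML

# Parse the HTML into a DOM tree.
my $dom = Mojo::DOM->new($html);

# find('img') returns a Mojo::Collection of matching elements.
# map(attr => 'src') calls ->attr('src') on each of them, and
# compact drops undefined or empty values (images without a src).
my @images = $dom->find('img')->map(attr => 'src')->compact->each;

print "$_\n" for @images;
```

The images are collected in document order, regardless of how deeply they are nested.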
We can also recursively traverse the DOM tree and process each node individually:
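A minimal sketch of such a traversal follows. The walk helper and the sample markup are hypothetical; child_nodes, type, and tag are Mojo::DOM methods.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Visit a node and all of its descendants depth-first, calling
# $callback with each node and its depth in the tree.
sub walk {
    my ($node, $callback, $depth) = @_;
    $depth //= 0;
    $callback->($node, $depth);
    walk($_, $callback, $depth + 1) for $node->child_nodes->each;
}

my $dom = Mojo::DOM->new('<div><p>Hello <b>world</b></p><img src="a.png"></div>');

my @seen;
walk($dom, sub {
    my ($node, $depth) = @_;
    # type is "root", "tag", "text", "comment", and so on;
    # only nodes of type "tag" have a tag name.
    return unless $node->type eq 'tag';
    push @seen, ('  ' x $depth) . $node->tag;
});

print "$_\n" for @seen;
```

The callback receives every node, including text and comment nodes, so the same walker can be reused for tasks such as collecting text or counting elements.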
Another way to process nodes is to select them using CSS selectors. A CSS selector matches a specific group of elements in the HTML. Once the DOM tree has been constructed using Mojo::DOM, CSS selectors can be applied to extract the required contents.
Here’s an example of how the document title can be extracted:
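A minimal sketch, assuming a small sample document:

```perl
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new(
    '<html><head><title>Example Domain</title></head><body></body></html>'
);

# at() returns the first matching element, or undef when nothing
# matches, so guard before calling text on the result.
my $node  = $dom->at('title');
my $title = $node ? $node->text : undef;

print "$title\n" if defined $title;
```

The guard matters in practice because scraped pages do not always contain the element being looked for.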
Many websites define a lot of metadata in their HTML using schema markup, which helps search engines understand the page and provide better search results. itemprop is one such attribute, and elements carrying it can be selected using the CSS selector *[itemprop="name"]
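The sketch below applies that selector and also collects every itemprop on the page into a hash. The sample product markup is an assumption.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Sample markup with schema attributes (an assumption).
my $html = <<'HTML';
<div itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name">Acme Anvil</h1>
  <span itemprop="brand">Acme</span>
</div>
HTML

my $dom = Mojo::DOM->new($html);

# Select the first element whose itemprop attribute equals "name".
my $node = $dom->at('*[itemprop="name"]');
my $name = $node ? $node->all_text : undef;

# Collect every itemprop on the page, keeping the first value seen.
my %props;
$dom->find('[itemprop]')->each(sub {
    my $el = shift;
    $props{ $el->attr('itemprop') } //= $el->all_text;
});

print "$_: $props{$_}\n" for sort keys %props;
```

Here all_text is used rather than text so that content nested inside child elements is included as well.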
Open Graph meta tags also provide rich meta information about pages. These control what shows up when the page is shared on Facebook. We can select the title tag using the CSS selector meta[property="og:title"]
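Unlike the earlier examples, the value of a meta tag lives in its content attribute rather than in its text. A minimal sketch, assuming sample markup:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Sample markup with Open Graph tags (an assumption).
my $dom = Mojo::DOM->new(<<'HTML');
<html><head>
  <meta property="og:title" content="The Rock">
  <meta property="og:type" content="video.movie">
</head><body></body></html>
HTML

# The value is stored in the content attribute, not in the text.
my $meta     = $dom->at('meta[property="og:title"]');
my $og_title = $meta ? $meta->attr('content') : undef;

print "$og_title\n" if defined $og_title;
```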
Generating CSS Selectors Dynamically
Writing CSS selectors by hand for many different elements is cumbersome and difficult to maintain. It would be a lot easier if we could auto-generate the CSS selector for those elements. Let’s say we have found a node that matches certain text and want to generate a CSS selector for it, so that the same node can be selected every time to extract its contents.
Skeleton code for this is given below:
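One possible shape for such a generator is sketched here. It is an illustration under stated assumptions, not a definitive implementation: the helper describeElement, the list of preferred attributes, and the sample markup are all hypothetical. The idea is to describe each element by its id, a preferred attribute, or its classes, and to walk up the tree until an id is reached.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Attributes considered stable enough to identify an element. This list
# is an assumption and would normally be tuned per domain.
my @GOOD_ATTRS = qw(itemprop property name);

# Describe one element as tag#id, tag[attr="value"], tag.class or tag.
sub describeElement {
    my ($el) = @_;
    my $tag = $el->tag;

    my $id = $el->attr('id');
    return "$tag#$id" if defined $id && $id =~ /^[A-Za-z][\w-]*$/;

    for my $attr (@GOOD_ATTRS) {
        my $value = $el->attr($attr);
        return qq{$tag\[$attr="$value"]}
            if defined $value && $value !~ /["\\]/;
    }

    # Fall back to class names that are safe to use unescaped.
    my @classes = grep { /^[A-Za-z][\w-]*$/ } split ' ', $el->attr('class') // '';
    return @classes ? join('.', $tag, @classes) : $tag;
}

# Walk from the element up through its ancestors, joining the steps with
# the child combinator. Stop at the first step anchored by an id.
sub generateCssSelector {
    my ($el) = @_;
    my @parts;
    for my $node ($el, $el->ancestors->each) {
        unshift @parts, describeElement($node);
        last if $parts[0] =~ /^[^\[]*#/;
    }
    return join ' > ', @parts;
}

# Sample markup (an assumption).
my $dom = Mojo::DOM->new(<<'HTML');
<div id="product">
  <div class="details">
    <h1 itemprop="name">Acme Anvil</h1>
    <span class="price">$19.99</span>
  </div>
</div>
HTML

# Find the node whose own text matches, then generate its selector.
my $node     = $dom->find('*')->first(sub { $_->text =~ /Acme Anvil/ });
my $selector = generateCssSelector($node);

print "$selector\n";
```

Stopping at an id keeps the selector short and less sensitive to layout changes higher up in the page. Whether an id, an attribute, or a class is the better anchor depends on the site being scraped.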
After the CSS selector is generated, we can check whether it selects exactly one element.
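A simple way to perform this check is to count the matches. In the sketch below, the helper matchCount and the sample markup are hypothetical.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Count how many elements a selector matches in the given DOM.
sub matchCount {
    my ($dom, $selector) = @_;
    return $dom->find($selector)->size;
}

# Sample markup (an assumption).
my $dom = Mojo::DOM->new(<<'HTML');
<ul>
  <li class="item" itemprop="name">First</li>
  <li class="item">Second</li>
</ul>
HTML

for my $selector ('li.item', 'li[itemprop="name"]') {
    my $count = matchCount($dom, $selector);
    printf "%-22s matches %d element(s): %s\n",
        $selector, $count, $count == 1 ? 'unique' : 'not unique';
}
```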
If multiple elements are selected, the logic in generateCssSelector can be fine-tuned to suit the use case. Some recommendations for handling this:
- include nearby element(s) in the CSS selector
- add the :nth-child() pseudo-class to select a specific child
- depending on the domain, keep a list of good attributes that can be used, e.g. the itemprop attribute is useful whereas data-reactid is not
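As an illustration of the :nth-child() recommendation, the sketch below derives an element’s position from the number of sibling elements that precede it. The helper withNthChild and the sample markup are hypothetical.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Build a "tag:nth-child(n)" step for an element. preceding returns the
# sibling elements before this one, so its size plus one is the position.
sub withNthChild {
    my ($el) = @_;
    my $position = $el->preceding->size + 1;
    return sprintf '%s:nth-child(%d)', $el->tag, $position;
}

# Sample markup where the list items have no distinguishing attributes.
my $dom = Mojo::DOM->new(
    '<ul id="specs"><li>Steel</li><li>25 kg</li><li>Black</li></ul>'
);

my $target   = $dom->find('li')->first(sub { $_->text eq '25 kg' });
my $selector = 'ul#specs > ' . withNthChild($target);

print "$selector\n";
```

Positional selectors are more brittle than attribute-based ones, since inserting a sibling shifts every later position, so they are probably best kept as a fallback.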
Once the CSS selector is generated for a particular element in a web page, it can be used to extract contents from other pages which follow the same template.
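A minimal sketch of that reuse is shown below. The page names, markup, and selector are assumptions, standing in for two pages that were rendered from the same template.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Two pages sharing one template (sample data, an assumption).
my %pages = (
    'page-1' => '<div id="product"><h1 itemprop="name">Acme Anvil</h1></div>',
    'page-2' => '<div id="product"><h1 itemprop="name">Acme Rocket</h1></div>',
);

# A selector generated once, from either of the pages.
my $selector = 'div#product > h1[itemprop="name"]';

my %names;
for my $page (sort keys %pages) {
    my $node = Mojo::DOM->new($pages{$page})->at($selector);
    # A page that deviates from the template simply yields undef.
    $names{$page} = $node ? $node->text : undef;
    print "$page: ", $names{$page} // '(no match)', "\n";
}
```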
Starting from a dump of raw HTML, it is possible to apply a series of steps to process webpages and extract meaningful information from them.
Whether you aim to navigate the DOM, write CSS selectors by hand, or generate them on the fly, I hope this article has provided some useful starting points to guide you in the process.
At Semantics3, we work on related challenges such as unsupervised extraction of structured content from never-seen-before websites and HTML templates, using machine learning. If you’re interested in chatting further or would like to join us, drop us a note!