Writing crawlers to extract data from websites is a seemingly intractable problem. The issue is that while it’s easy to build a one-off crawler, writing systems that generalize across sites is not easy, since websites usually have distinct unique underlying patterns. What’s more, website structures change with time, so these systems have to be robust to change.

In the age of machine learning, is there a smarter, more hands-off way of doing crawling? This is a goal that we’ve been chipping away at for years now, and over time we’ve made decent progress in doing automated generalizable crawling for a specific domain — ecommerce. In this article, I’d like to describe the system that we’ve built and the algorithms behind it; this work is the subject of a recent patent filing by our team.

The goal of our automated crawling project

Our goal in this project is to extract the entire catalog of an ecommerce site given just its homepage URL (see image above). This involves three key challenges.

Challenge #1 — Identifying Product URLs

Types of Pages

Websites contain all sorts of pages, so at the outset, we need a system to separate the useful pages from the rest.

In the case of ecommerce, websites have:

  • Category landing pages (e.g. a landing page for women’s apparel)
  • Brand specific pages (e.g. a page for Nike products)
  • Deal specific pages (discounts and offers)
  • Seller information pages (about the business selling the product, in the case of marketplaces)
  • Review pages and forums (customer generated content)
  • Product pages (which describe a specific product and allow you to purchase it)

… and more. Since our goal is to capture product details, we’ll look to isolating just the product pages. Other pages aren’t of use to use, except to discover product pages.

Challenge #1 — Identifying product URLs
Sample product page

Supervised Classification?

Our initial approach to tackling this problem was to frame it as a supervised multi-class classification problem, with the HTML and screenshot of each page being sent through a classifier.

Supervised classification approach to identifying product URLs, both visually and using HTML (features)

The issue with this approach though is that it requires that we crawl URLs first in order to classify them. And wasted crawling is resource inefficient. To reduce throwaway crawls, we need a system that can make a decision using just the URL of the webpage.

Enter, URL clustering

The aim of URL clustering is to divide URLs into groups based on string patterns.

Clustering URLs into groups

The first step in doing this is to split URLs by path and query string, and count the rank of each component group.

Identify number of URL parameters

Next, we attempt to represent each component with a set of comprehensive pre-defined mini regular expressions.

Find a mini-regex for each parameter

Finally, we break down each URL into a tree, specific to its path rank — query string rank combo as above.

Create a URL tree for each rank-pair (URL #1)

This is a binary tree in which each node splits into two children, one which represents the specific text of the URL component, and another which represents the generalized mini regex. At each leaf node, we note down two numbers — the count of number of URLs that decomposed into that leaf node, and a count of the number of generalized regex patterns used in that path.

Let’s run another URL through the same tree above and see how this changes.

Create a URL tree for each rank-pair (URL #2)

Note how the frequency counters of overlapping tree paths have changed with the  second URL coming in.

To determine the representative regex of a URL, we pick the path that has the highest leaf frequency, with the fewest regex generalizations. This allows us to cluster the URLs as shown above — in this example, the representative regex is acme.com/forum?sid=([^\]+).

Tagging the group

We’re almost there. Now that we’ve divided URLs into groups, how do we decide which group represents product URLs? We use the supervised classification approach shown above of course!

This time around though, since we need to label groups rather than each individual URL, all we need to do is to sample a few URLs from each group and voila, our scalability challenge is overcome.

Sidebar: This regex based approach also allows us to canonicalize URLs, i.e., deduplicate different URL forms that effectively point to the same webpage. This is because one of the regex groups in the URL cluster usually points to the unique web SKU ID of the product.

Challenge #2 — Mining Product URLs

Spidering overload

Now that we know what to look for, we need to figure out how to find as many product URLs as we can.

The obvious way to do this is to visit the homepage, collect all the URLs that it leads to, visit each of them in turn, and repeat.

Spidering explosion

This issue with this approach, however, is that the number of URLs gathered explodes within a few levels of spidering. Crawling every single one of these spidering URLs is prohibitive from both a cost and time perspective. Are there smarter ways of doing this?

What would a human do?

If an intern were tasked with mining product URLs, how would he/she go about it? One strategy would be to look for category listing pages.

Category listing pages

And then systematically paginate through.

Pagination elements

Some strategies will be more optimal than others. For example, in comparison, visiting recommended similar products from product URL pages is less likely to systematically maximize the number of URLs discovered.

Product recommendation pages usually offer diminishing returns

How do we distil this intuition around discovering strategies and evaluating them into algorithm form?

Let’s gamify our challenge

Multi-armed bandits to the fore

We can set the problem up as a game, with the mechanics defined as follows:

  • Choices: Each pull is the choice of a URL (cluster) to crawl.
  • Actions: At each turn, an agent selects a URL cluster from which a sampled URL is crawled.
  • Reward: Awarded for each unique product URL discovered.
  • Cost: Every action/crawl accrues a fixed negative reward; this penalizes the agent for triggering excessive fruitless crawls.
  • State: Keeps track of which URLs have been discovered, which of them have been visited, and how many reward points have accrued to each cluster.

From here, even a simple Epsilon-Greedy strategy works effectively:

  • Initial Exploration: Agent randomly samples k URLs from each cluster.
  • Exploitation: Steady state, the cluster with the highest payout rate is selected 1-e% of the time.
  • Random Exploration: e% of the time, a random cluster is selected.

And there we have it. Combined with our approach towards identifying product URLs, we now have a way to generate product URLs given just the homepage of the ecommerce domain.

One step to go

Challenge #3 — Content Extraction from Product URLs

Final boss

Finally, we come to the challenge of extracting structured attributes from product URLs, the hardest part of web crawling.

Challenge #3 — Content extraction from product URLs

Baseline wins

We use meta tag mappings and DOM tree rules to achieve quick wins where possible — a baseline set of effective heuristics go a long way in chipping away at the problem. Where these rules don’t deliver sufficient recall though, we turn to element-wise classification.

Input features

Our approach is to build deep learning classifiers to tag HTML elements to a list of specific known attributes.

We render the webpage through a Headless Chrome session, and retrieve key elements from the webpage, along with the following features:

  1. HTML features: e.g., tag type, font-size, font-weight
  2. Visual features: e.g., color, width, height, area
  3. Text features: e.g. length of string, data type
  4. Positional features: x-y coordinates of the bounding box

Then, elements with similar coordinates and properties are merged together. This group ensures that only the distinctive elements on the page that could represent the kind of information that we are looking for are sent to the next stage.

The classifier

Once this data has been gathered, the inputs are fed into the classifier, along with other available multi-modal data, such as a relevant screenshot of the element or text characters.

HTML Element Classifier

The output classes conform to a standard group of possible fields such as name, crumb, price, images, specifications, description and so on.

How, you ask, do we deal with a need for new fields, namely category-specific fields such as screen size or RAM memory? Most ecommerce websites represent such fields under umbrella sections of the webpage such as specifications or description. Our approach is to capture all such information as either unstructured text or key-value pairs, and subsequently run it through our text attribute extraction pipelines, a subject for a different discussion.

A word on features

Notice that in the list of input features to the model, we didn’t significantly emphasize text characters or image pixel inside the element; even without these key signals, using just metadata features, it is often possible to confidently predict what a particular element refers to.

Consider, for example, a screenshot from a Japanese ecommerce website shown below — even to an observer who doesn’t understand Japanese, it’s not difficult to map segments of the page to commonly understood attributes. This is because ecommerce websites often conform to norms of design patterns.

Content extraction demo on Japanese website — note the extracted fields and boxes drawn

This works for Arabic websites too, even though the text is right-to-left oriented.

Content extraction demo on Arabic website (right to left oriented) — note the extracted fields and boxes drawn

What about variations?

Here’s where it gets really tricky. Ecommerce products often come with variation options (think color and size) — extracting such information requires interaction with the webpage, and identifying nested patterns and combinations where relevant.

Variation palette #1
Variation palette #2

Our solution to this involves two steps. First, we identify which elements have variant options, using a binary classifier.

Detect variation elements using binary classifiers

Then, using Puppeteer on a Headless Chrome session, we systematically click through and capture the state and feature bundles of each variation combination.

Finally, each combination of features is run through our previously discussed element classification system to generate one output product set per variation.

Content extraction classifiers run on every variation combination

Thus, we’re able to generate catalogs from ecommerce websites using completely automated methods.

In practice, since the goal of most crawling projects is to be as close to perfect precision and recall as possible, these unsupervised methods are used in tandem with human-in-the-loop approaches. These algorithms complement our operational efforts to build parsers and spiders at scale, and are layered with inputs in our in-house DSL (Domain Specific Language) to achieve optimal business outcomes.

Shout-out to Praveen Sekar, our lead data scientist on this project, and to our entire Data Ops team for their valuable insights.