Web crawling is one of those tasks that is easy in theory (visit a page, extract its outgoing links, queue up the ones that haven't been visited, pop the queue, visit the next page, repeat) but really hard in practice, especially at scale. Launching a badly written web crawler is equivalent to unleashing a DoS attack (and with a distributed crawler, it constitutes a DDoS attack!)
At Semantics3, we aggregate a large number of products from quite a few websites. Doing so requires some heavy-duty web crawling, and we have built a distributed web crawler to suit our needs. In this post I will describe the architecture of our web crawler, some implementation details, and finally some planned improvements.
A 60-node cluster (the bulk of them spot micro-instances) running on Amazon AWS. We crawl 1-3 million pages a day at a cost of ~$3 a day (excluding storage costs).
Built using the Gearman map-reduce framework, based on a supervisor-worker architecture.
The stack is Redis (supervisor) + Gearman (coordinator) + Perl workers on AWS. Chef and Capistrano are used to manage the server herd.
Future plans include using Bloom filters to keep track of visited URLs and switching to a distributed DB like Riak or Cassandra for storing the state of the crawl.
Let me begin with how our crawling has evolved over the past few months. We started off with a single-threaded, single-process crawler written in 200 lines of Perl. As the number of requests increased, we rewrote it to run in a multi-threaded, multi-process manner. When even that was not enough, we went for bigger machines (more CPU power and more memory), like the extra-large high-CPU EC2 instances. Finally, even this didn't seem adequate. It was pretty obvious that we had to figure out a way for our web crawler to run on multiple machines.
The original plan was to go with something out of the box. There were two solutions which we investigated. One was 80Legs, a web-crawler-as-a-service API. The other was running Nutch on a Hadoop cluster. The former was very restrictive as we had no control on how the pages were crawled, while the latter required a lot of grokking of internals to customize crawling to our needs.
We came to the conclusion that building our own web crawler architecture made a lot of sense. Another reason was that we had quite a few other tasks (processing our data, resizing images, etc.) that needed to run in a distributed manner. Our web crawler is written as one specific type of job within a distributed job-work system built to suit our needs.
High level overview
Our architecture is one based on a supervisor-worker model and follows the map-reduce pattern. The supervisor controls the state of each of the web crawls and determines which page to crawl next.
The workers are the ones that do the actual work (duh) – in this case the downloading of pages and processing of HTML. Once the workers have finished their work, they update the supervisor with details such as the HTTP status of the page download and links discovered. They then store the processed HTML content in our database.
Failed pages are retried (5XX responses) or discarded (404s). If the number of failed URLs crosses a threshold, the crawl is immediately paused.
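This retry/discard policy can be sketched as follows (a minimal illustration in Python, not our production Perl; the retry cap and failure threshold are assumed values):

```python
MAX_RETRIES = 5          # assumed retry cap for 5XX responses
FAILURE_THRESHOLD = 100  # assumed failure count that pauses a crawl

class CrawlState:
    def __init__(self):
        self.retry_counts = {}   # url -> 5XX attempts so far
        self.failed_urls = set() # permanently failed URLs
        self.paused = False

    def handle_response(self, url, status):
        """Decide what to do with a fetched URL based on its HTTP status."""
        if 200 <= status < 300:
            self.retry_counts.pop(url, None)
            return "ok"
        if status == 404:
            self.failed_urls.add(url)        # permanent failure: discard
            self._maybe_pause()
            return "discard"
        if 500 <= status < 600:
            attempts = self.retry_counts.get(url, 0) + 1
            self.retry_counts[url] = attempts
            if attempts <= MAX_RETRIES:
                return "retry"               # transient failure: requeue
            self.failed_urls.add(url)
            self._maybe_pause()
            return "discard"
        return "discard"

    def _maybe_pause(self):
        # Pause the whole crawl once failures cross the threshold.
        if len(self.failed_urls) >= FAILURE_THRESHOLD:
            self.paused = True
```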
[Architecture diagram: (c) Mark Handley – http://bit.ly/QSX5q8]
Perl is our weapon of choice. We do a lot of text processing and Perl has been a godsend (though the quirky syntax sometimes annoys me). We use it for pretty much everything except for our API, which is written in Node.js.
The actual distribution of work tasks is done through Gearman, a work distribution library. This could be said to be the bedrock of our distributed web crawler. It handles concurrency-related issues and failures, and, most importantly, figures out which machine to farm work out to, all automagically without any fuss. The best thing about Gearman is that it automatically takes care of horizontal scaling. We can just keep spinning up more instances and Gearman will auto-detect them and start farming work to them.
If Gearman is the bedrock of our distributed system, then Redis is the oracle. Redis is used to hold the state of all web crawls. It stores the status of each crawl (running, paused, cancelled), the URL crawl queue and the hash of all the URLs that have been previously visited.
Our entire distributed system is built on top of Amazon AWS:
The Supervisor runs on a small instance, running Redis and a Perl script that figures out which URL to crawl next, which it then sends to the Gearman server.
Gearman runs on a micro instance. It farms out work to all the workers that have registered with the Gearman server.
The Workers all run on spot micro-instances. We have, at any given time, about 20-50 worker instances running based on our workload. The 'Worker' script runs as a single process, in an evented manner. The workers update the 'Supervisor' server with the status of the crawl and write the processed HTML to our 'DB' servers, which run MongoDB. Another design feature is that the 'Workers' are stateless – they can be spun up or terminated at will without affecting job state.
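The worker's job can be boiled down to a stateless function (a sketch in Python rather than our production Perl; the regex-based link extraction and the `fetch` callback are illustrative stand-ins). Because the worker holds no crawl state, everything it learns goes straight back to the supervisor:

```python
import re

# Naive link extractor for illustration; a real worker would use an HTML parser.
LINK_RE = re.compile(r'href="(http[^"]+)"')

def process_job(url, fetch):
    """One stateless unit of work. `fetch(url)` -> (http_status, html).
    Returns the report sent back to the supervisor; the processed HTML
    itself would be written to the DB servers."""
    status, html = fetch(url)
    links = LINK_RE.findall(html) if status == 200 else []
    return {"url": url, "status": status, "links": links}
```

Because `process_job` touches no shared state, any number of such workers can be spun up or killed without coordination, which is what makes the spot-instance strategy safe.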
The actual crawling is done 10 pages at a time from each domain. We resolve and cache the DNS request, keep the TCP connection open, and fetch the 10 pages asynchronously.
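The batching side of this is simple to sketch (Python for illustration; the batch size matches the 10-pages-per-domain figure above, and the function name is mine): group the frontier by domain and take up to 10 URLs per domain, so each batch can reuse one cached DNS lookup and one kept-alive connection.

```python
from urllib.parse import urlparse

BATCH_SIZE = 10  # pages fetched per domain per round, per the text above

def batch_by_domain(urls, batch_size=BATCH_SIZE):
    """Group URLs by domain and cap each group at batch_size, so every
    batch can share one DNS resolution and one keep-alive connection."""
    by_domain = {}
    for url in urls:
        by_domain.setdefault(urlparse(url).netloc, []).append(url)
    return {dom: us[:batch_size] for dom, us in by_domain.items()}
```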
In terms of costs, we are able to crawl about 1-3 million pages daily for just $3 a day (excluding storage costs).
Why ‘Almost’ Distributed?
The actual weak point in our system is the 'Supervisor', which constitutes a single point of failure (to somewhat mitigate this, we use replication and also persist Redis state to disk), and hence our system is not purely distributed.
The primary reason is that there is no distributed version of Redis, and we are addicted to its support for in-memory data structures. We use pretty much all of them – sets, sorted sets, lists and hashes – extensively.
We use priority-search-based graph traversal for our web crawling. It's basically breadth-first search, but we use a priority queue instead of a plain queue to determine which URL to crawl next.
We assign an importance score to each URL which we discover and then crawl them accordingly. We use Redis sorted sets to store the priority associated with each URL and hashes to store the visited status of the discovered URLs. This, of course, comes with a large memory footprint.
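The traversal can be illustrated in a few lines (a sketch in Python; in production the frontier is a Redis sorted set keyed by score and the visited set is a Redis hash, not in-process structures):

```python
import heapq

class Frontier:
    """Priority-first crawl frontier: highest-score URL is crawled next."""

    def __init__(self):
        self._heap = []        # (-score, url); heapq is a min-heap
        self._visited = set()  # stands in for the Redis visited-URL hash

    def add(self, url, score):
        # Only enqueue URLs we have never crawled.
        if url not in self._visited:
            heapq.heappush(self._heap, (-score, url))

    def pop(self):
        # Return the highest-priority unvisited URL, marking it visited.
        while self._heap:
            _, url = heapq.heappop(self._heap)
            if url not in self._visited:
                self._visited.add(url)
                return url
        return None
```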
A possible alternative would be to use Bloom filters. A Bloom filter is a probabilistic data structure used for answering set-membership questions (e.g., has this URL been crawled before?). Due to its probabilistic nature, it can give erroneous results in the form of false positives. You can, however, tune the error rate, allowing for only a small number of false positives. The great benefit is the large amount of memory you can save (much more memory-efficient than Redis hashes). If we start crawling pages in the hundreds of millions, we would definitely have to switch to this data structure. As for the false positives, well, there ain't no harm in occasionally crawling the same page twice.
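For the curious, a minimal Bloom filter looks like this (a toy sketch with arbitrary sizing; a real deployment would size `m` and `k` from the expected URL count and target error rate):

```python
import hashlib

class BloomFilter:
    """k hash positions over an m-bit array. False positives are possible;
    false negatives are not, so a 'seen' URL is never silently dropped --
    the worst case is occasionally re-deciding a fresh URL was crawled."""

    def __init__(self, m_bits=8192, k_hashes=4):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # Derive k positions by salting one strong hash; simple but adequate.
        for i in range(self.k):
            h = hashlib.sha256(("%d:%s" % (i, item)).encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

The memory win is stark: a few megabits can track hundreds of thousands of URLs at a ~1% error rate, versus storing every URL string in a Redis hash.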
Since one of our tasks is building price histories of products, recrawling needs to be done intelligently. We want to recrawl products that change prices frequently (duh) more often than pages whose prices rarely change. Hence a brute-force complete recrawl doesn't really make sense.
We use a power-law distribution for recrawling products (some pages are crawled much more frequently than others), with pages ranked by importance (using signals ranging from previous price-history changes to how many reviews they have). Another challenge is page discovery (e.g., if a new product has just been launched, how quickly can we have it in our system?).
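One simple way to turn an importance rank into a power-law schedule (a sketch with made-up constants, not our production formula): recrawl the page at rank r every `BASE * r**ALPHA` hours, so the top-ranked pages are revisited far more often than the long tail.

```python
BASE_HOURS = 6.0  # assumed recrawl interval for the top-ranked page
ALPHA = 0.8       # assumed power-law exponent; higher = steeper falloff

def recrawl_interval_hours(rank):
    """rank >= 1, where a smaller rank means a more important page.
    Returns hours to wait before recrawling that page."""
    return BASE_HOURS * rank ** ALPHA
```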
I will be writing a post detailing our recrawl strategies sometime in the future.
Finally, a really good post I came across on Hacker News, How To Crawl A Quarter Billion Webpages in 40 Hours, motivated me to write this post.
Building your own distributed web crawler is a really fun challenge, and you get to learn so much about distributed systems, concurrency, scaling, databases, economics, etc. But once again, you may want to evaluate the pros and cons before you decide to roll your own. If what you do is plain ol' vanilla crawling, then you may want to just use an open-source web crawler like Nutch.
We just launched the closed beta of our API. We would be most glad if you could give it a try. Here is the signup link (it comes preloaded with the invitation code).
Don’t hesitate to share it with your developer friends.