The true cost of maintaining a product database
How the 80/20 principle helped us optimize our data management and cost
A multi-part series on delivering structured ecommerce data on demand
Part 1 of a multi-part series
4 weeks ago, the gang at Semantics3 decided to travel to Alaska (in the middle of winter, no less) for a grand, freezing total of 9 days. We did this for 3 reasons:
2. Sometimes the comfort and warmth of San Francisco lull you into a funk. It made sense to take everyone out of their comfort zone and into a completely different environment, so we could reflect a bit more on the work that we do at Semantics3.
3. We also wanted to meditate on our place in the ecommerce universe: as the central, authoritative database on every product sold online, giving everyone a bird's-eye view of the ecommerce industry.
Which got us thinking about how little people actually know about the technicalities of delivering fresh product data via our API.
The Devil is in the Details
We'll start with 2 major pieces of work in ecommerce product data curation: updating all available prices for a particular product, and collecting data from the product URLs of various retailers (which is then tagged to each unique product).
Keeping product prices fresh
The Semantics3 API provides access to a 60-million product database. This is 60 million unique products, which collectively consist of 4 BILLION offers that we track across the vast universe of ecommerce retailers.
That’s an average of 66.7 offers per unique product (a unique product would be the iPhone 6, not an iPhone 6 from Best Buy or Apple).
Tracking this vast universe of price offers is incredibly challenging computationally. The work never ceases. Once you finish a cycle of updates, it’s time to begin again, and sometimes you have to restart even before you’re done.
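As a quick sanity check on that average, here is the arithmetic in Python, using only the two figures quoted above:

```python
# Figures quoted in the article.
unique_products = 60_000_000
tracked_offers = 4_000_000_000

# Average number of retailer offers attached to each unique product.
offers_per_product = tracked_offers / unique_products
print(round(offers_per_product, 1))
```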
The point in all this is that, contrary to what most people think, it is really difficult to maintain fresh pricing on every single item in the database.
This is an issue that even Google grapples with (which is why Google sometimes still shows cached results that no longer match the actual pages).
As with many things in life, a general rule applies here: the 80–20 principle. The idea took hold in management consulting through Joseph Juran, who applied it to quality control and described it as “the vital few and the trivial many”.
This concept was further expanded upon by Richard Koch at L.E.K. Consulting — the idea being that 80% of revenue originates from only 20% of a company’s business activities.
We employ a similar tactic in our efforts. Generally, we discovered that about 20% of products change prices frequently enough to warrant daily checks in our database.
This is a reasonable idea, given that a washing machine would not change price as frequently as, say, a New York Times bestseller (of all things in our database, books change price the most frequently, followed by cellphone cases).
What if we took a smarter approach to updating prices? Rather than checking every product every so often (an immense computational challenge), we developed an approach where our 800 drones respectfully crawl product pages and process data at frequencies tuned to exploit the 80–20 principle.
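To get a feel for why tiered frequencies matter, here is a rough back-of-the-envelope sketch. The 60 million products and the 20% daily share come from the figures above; the weekly cadence for the remaining 80% is purely an assumption for illustration, not a description of the actual schedule.

```python
unique_products = 60_000_000

# Naive approach: check every product once a day.
uniform_checks_per_day = unique_products

# Tiered approach: 20% of products daily, the rest weekly (assumed cadence).
hot_products = unique_products // 5
cold_products = unique_products - hot_products
tiered_checks_per_day = hot_products + cold_products / 7

# Fraction of daily checks avoided by tiering.
reduction = 1 - tiered_checks_per_day / uniform_checks_per_day
print(f"{tiered_checks_per_day:,.0f} checks/day, {reduction:.1%} fewer than uniform")
```

Under that assumption, tiering cuts the daily workload by roughly two thirds, and slower cadences for the long tail would cut it further.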
By looking at multiple signals like average user rating, sales rank, seller’s rank, site ranking and past data on price changes, our predictive algorithms decide when to next update a particular product.
This is in addition to multiple demand signals we take into account, such as specific customer requests for custom recrawls, and high site popularity.
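A minimal sketch of how signal-driven scheduling could look is shown below. Everything in it is hypothetical: the `Product` fields, the thresholds, and the `hours_until_next_check` heuristic are stand-ins chosen for illustration, and it uses a simple hand-written rule rather than the predictive algorithms described above.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    price_changes_last_30_days: int   # past data on price changes
    sales_rank: int                   # 1 = best seller
    recrawl_requested: bool = False   # customer demand signal


def hours_until_next_check(p: Product) -> float:
    """Hypothetical rule: volatile, popular, or requested products are checked sooner."""
    if p.recrawl_requested:
        return 1.0
    # Average hours between observed price changes over a 30-day window,
    # clamped to the range [1 day, 30 days].
    changes = max(p.price_changes_last_30_days, 1)
    interval = max(24.0, min((30 * 24) / changes, 30 * 24.0))
    # Assumed popularity boost: top-100 sellers are checked twice as often.
    if p.sales_rank <= 100:
        interval /= 2
    return interval


def build_schedule(products, now_hours=0.0):
    """Return a min-heap of (due_time_in_hours, product_name)."""
    heap = [(now_hours + hours_until_next_check(p), p.name) for p in products]
    heapq.heapify(heap)
    return heap


catalog = [
    Product("bestseller book", price_changes_last_30_days=30, sales_rank=5),
    Product("washing machine", price_changes_last_30_days=1, sales_rank=4000),
    Product("phone case", price_changes_last_30_days=10, sales_rank=800,
            recrawl_requested=True),
]

schedule = build_schedule(catalog)
# Pop products in the order they come due.
crawl_order = [heapq.heappop(schedule)[1] for _ in range(len(catalog))]
print(crawl_order)
```

The priority queue keeps the next product due for a recrawl at the front, so a crawler only has to pop from the heap and push the product back with its new due time after each update.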
This is how we are designing our brand new API — completely modularized, with specific update cycles on a site-specific and overall database level.
Extracting data from product URLs
Here’s what a typical data processing cycle for a single product URL looks like:
All of this work, just for 1 URL.
A typical ecommerce website has at least 100,000 product URLs; Amazon alone has at least 100 million.
That’s a LOT of data to process, clean up and deliver to customers.
And that’s not even including redundancies, database resilience and monitoring. If one database goes down, there have to be multiple copies so that the service is still up.
All of this doesn’t come cheap.
Data processing is, by far, the most resource-hungry part of our data curation. There are innumerable algorithms to clean, normalize, and structure data. There are incredibly complex heuristics and machine learning processes involved in categorizing products (for example, we can’t afford to miscategorize a women’s belt as pants). All of this demands immense computational power, all of it supplied by Amazon Web Services (AWS).
Which makes us a little bit different from most API companies: we’ve solved the mind-numbing, boring work of organizing product data by using intelligent algorithms to categorize, match and normalize structured data from unstructured HTML sources, making it available via a convenient database API.
But all of this does come at a cost.
Essentially our business model is not in offering a platform, but a convenient way to access a precious resource:
Achieving breadth of coverage while ensuring price freshness is one of the biggest challenges in the price monitoring business. Balancing customers’ need for near real-time pricing and breadth of coverage against an affordable price point is one of our main differentiating factors from other product API services.
One of our major initiatives is offering complimentary site indexing services to customers: we will index sites that are not currently in the database for customers who request it, free of charge (provided they are premium subscribers). We use this as an optimized method of expanding our database to the sites most requested by the market.
We found that this was a more cost-effective way of determining which data sources matter the most — the 80/20 principle at work again.
Through optimizing and leveraging cloud computing, we’ve managed to bring the enormous cost of managing a database (typically this consists of at least 2 engineers’ salaries plus overheads — at least $200,000 a year at a conservative estimate) down to a much more manageable $1499/month.
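For a sense of scale, here is that comparison worked out on an annual basis, using only the two figures quoted above (the in-house number is the conservative estimate, so the real gap may be larger):

```python
# Figures quoted in the article.
in_house_cost_per_year = 200_000   # at least 2 engineers plus overheads
api_cost_per_month = 1_499

api_cost_per_year = api_cost_per_month * 12
annual_difference = in_house_cost_per_year - api_cost_per_year
cost_ratio = in_house_cost_per_year / api_cost_per_year

print(f"API: ${api_cost_per_year:,}/year")
print(f"Difference: ${annual_difference:,}/year ({cost_ratio:.1f}x cheaper)")
```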
In the next few weeks, I’ll go in depth into how we managed to algorithmically solve the challenges of matching identical products from different retailers, categorizing these products, storing large amounts of data in the cloud, and piping it on demand to customers via our API.
Because there’s nothing more useful than having structured data on tap for your apps, analytics and websites.
Published at: March 16, 2015