A multi-part series on delivering structured ecommerce data on demand
In 1999, Bill Gates published a watershed book entitled Business @ the Speed of Thought, in which he laid out a few predictions for the future of information systems. Considering the state of the tech industry back in 1999, many of his predictions were amazingly accurate, anticipating inventions like the smartphone, Facebook and travel websites.
On that list of predictions, item #1 stood out amongst everything else:
Automated price comparison services will be developed, allowing people to see prices across multiple websites, making it effortless to find the cheapest product for all industries.
In fact, price comparison applications have become powerful tools in retail arbitrage, with many people making a livelihood by buying stuff cheap from brick-and-mortar stores and selling it on Amazon Marketplace.
But there is one underlying piece of technology that ties this all together, without which price comparison would be impossible — Product Matching.
Product Matching is the unsung hero of our ecommerce database — without it, you would never be able to draw price comparisons of products between major retailers.
How, for example, would you know that an iPhone 6 from Best Buy is the very same iPhone 6 sold on Apple.com? And how are those price comparison engines able to generate competing offers for the particular product you’re interested in?
The answer lies in the science of Product Matching.
The premise is very simple: given two products with all available data-points from each source, determine if these are actually identical products, one and the same.
In practice, this is a lot more difficult than you might think. Humans, of course, find this task elementary. But consider a more subtle challenge like the following:
This 2-pack toothbrush from Drugstore.com costs $6.99
The single toothbrush from MidwestDental below costs $93.95
Are they the same product?
It's not that easy to tell, is it?
While they may technically be the same product, at the SKU level they are not: a 2-pack unit is treated differently from a single-pack unit in terms of stock keeping.
Technically they are variations of the same product.
They're also priced differently.
One of the challenges in product matching is figuring out which products are identical, and which are variations of the same product.
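The distinction between identical products and variations can be sketched in a few lines. This is only an illustration, not the production matcher: the Listing fields, the relate function and the example brand and model names are all hypothetical, and it assumes brand and model have already been cleaned up enough to compare directly.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    IDENTICAL = "identical"
    VARIATION = "variation"
    DIFFERENT = "different"


@dataclass(frozen=True)
class Listing:
    # Hypothetical fields; real retailer data has many more attributes.
    brand: str
    model: str
    pack_size: int  # number of units sold together under one SKU


def relate(a: Listing, b: Listing) -> Relation:
    """Classify two listings as identical, variations, or different products."""
    same_base = (a.brand.lower(), a.model.lower()) == (b.brand.lower(), b.model.lower())
    if not same_base:
        return Relation.DIFFERENT
    # Same underlying product: the pack size decides whether the SKU matches.
    if a.pack_size == b.pack_size:
        return Relation.IDENTICAL
    return Relation.VARIATION


two_pack = Listing(brand="Acme", model="SoftBrush", pack_size=2)
single = Listing(brand="Acme", model="SoftBrush", pack_size=1)
print(relate(two_pack, single))  # Relation.VARIATION
```

Here the 2-pack and the single toothbrush share a brand and model, so they are reported as variations rather than as an exact match, which keeps their prices from being compared directly.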
The basic immutable rule of product matching is to minimize false positives. All algorithmic end-goals point to this result over time in an iterative fashion.
A false positive is generated when the algorithm indicates that two products are identical when they aren’t. This is a bad outcome, as users don’t want, for example, an iPhone 6 matched with an iPhone 4.
In order to do this, we employ nine different strategies in combination, as a series of “safety valves”. A product is pushed on to the next valve only if it passes the current test (Yes), or if the algorithm lacks sufficient data to make a judgement (Don’t know).
The governing principle behind safety valves is to invalidate dissimilar features in products — a product that fails the test is immediately discarded.
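A minimal sketch of how such a chain of checks might be wired together is shown here. It is an assumption-laden illustration rather than our actual pipeline: the Verdict enum, field_valve and run_valves are hypothetical names, products are modelled as plain dictionaries, and the three example fields stand in for the real set of nine strategies.

```python
from enum import Enum
from typing import Callable, Iterable, Optional


class Verdict(Enum):
    YES = "yes"
    NO = "no"
    UNKNOWN = "don't know"


# A valve compares two products (plain dicts here) and returns a verdict.
Valve = Callable[[dict, dict], Verdict]


def field_valve(field: str) -> Valve:
    """Build a valve that compares one field; missing data yields UNKNOWN."""
    def check(a: dict, b: dict) -> Verdict:
        left, right = a.get(field), b.get(field)
        if left is None or right is None:
            return Verdict.UNKNOWN
        return Verdict.YES if left == right else Verdict.NO
    return check


def run_valves(a: dict, b: dict, valves: Iterable[Valve]) -> Optional[int]:
    """Return how many valves answered UNKNOWN, or None if any valve said NO."""
    unknowns = 0
    for valve in valves:
        verdict = valve(a, b)
        if verdict is Verdict.NO:
            return None  # a failed test discards the candidate immediately
        if verdict is Verdict.UNKNOWN:
            unknowns += 1
    return unknowns


# Hypothetical fields chosen for illustration only.
valves = [field_valve("brand"), field_valve("model"), field_valve("upc")]
candidate = {"brand": "Apple", "model": "iPhone 6"}
reference = {"brand": "Apple", "model": "iPhone 6", "upc": "0000"}
print(run_valves(candidate, reference, valves))  # 1 (the upc check could not decide)
```

Returning the count of undecided checks, rather than a plain pass or fail, keeps the information needed later to judge how much a surviving match can be trusted.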
The product-matching “safety valves” include the following:
Most of these are pretty straightforward checks: it's going to be either a Yes, a No or an “I don't know”.
However, some checks can be subtle — modules like Image matching, keyword similarity checks, brand and feature normalization aren’t straightforward — different retailers would use different formats and nomenclature to describe products. Analyzing these different facets can be tricky.
This is why we also use specialized learning modules that analyze millions of data points across our database to develop heuristics for determining product matches, such as:
→ Figuring out that “XL” refers to “Extra Large” in garments (important in determining size variations).
→ Determining that “GE” usually refers to products from General Electric (useful in brand normalization).
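Once heuristics like these have been learned, applying them can be as simple as a lookup. The sketch below hard-codes two tiny alias tables purely for illustration; in practice such mappings would come from the learning modules, and the names SIZE_ALIASES, BRAND_ALIASES and normalize are hypothetical.

```python
from typing import Mapping

# Hypothetical alias tables; real ones would be learned from data, not hand-written.
SIZE_ALIASES = {"s": "small", "m": "medium", "l": "large", "xl": "extra large"}
BRAND_ALIASES = {"ge": "general electric"}


def normalize(value: str, aliases: Mapping[str, str]) -> str:
    """Lower-case, collapse whitespace, then expand a known abbreviation."""
    key = " ".join(value.lower().split())
    return aliases.get(key, key)


# Two retailers describing the same size and brand in different ways.
print(normalize("XL", SIZE_ALIASES) == normalize("Extra  Large", SIZE_ALIASES))  # True
print(normalize("GE", BRAND_ALIASES) == normalize("General Electric", BRAND_ALIASES))  # True
```

Normalizing both sides to one canonical form before comparing means the later checks can use plain equality instead of having to know every retailer's spelling.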
Once a product passes through all of these safety valves (and ideally only one product out of hundreds of candidates passes through successfully), it is assigned a confidence score based on the number of safety valves that were only able to assign an “I don’t know” tag to the product.
This is a measure of uncertainty in endorsing the product match. If this confidence score is less than 99%, the product match candidate is immediately discarded.
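To make the idea concrete, here is one way the confidence measure described above could be computed. Only the 99% cut-off comes from the text; the linear formula and the per-check penalty of 0.004 are assumptions made for this sketch, and the real scoring may weight each check differently.

```python
MIN_CONFIDENCE = 0.99    # cut-off stated in the text
UNKNOWN_PENALTY = 0.004  # hypothetical cost of each "I don't know" verdict


def confidence(unknown_count: int, penalty: float = UNKNOWN_PENALTY) -> float:
    """Start from full confidence and subtract a penalty per undecided check."""
    return max(0.0, 1.0 - penalty * unknown_count)


def accept(unknown_count: int) -> bool:
    """Keep a candidate only if its confidence reaches the cut-off."""
    return confidence(unknown_count) >= MIN_CONFIDENCE


for unknowns in range(4):
    print(unknowns, accept(unknowns))
```

With these assumed numbers a candidate survives up to two undecided checks and is discarded at three, which shows how a strict cut-off trades missed matches for fewer false positives.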
Thanks to the stringent quality checks in our system, we are able to confidently match up identical products and to indicate which products are variations of each other.
This is extremely useful to our customers who rely on this system to display accurate price comparisons across the industry, as well as offer multiple variations to satisfy demand (e.g. selling both the 16GB and 64GB variations of the iPhone).
Over time, as we gather increasingly large volumes of data across the retail landscape, our learning algorithms are getting better at understanding the nuances of product data formats and structures across retailers. We’re getting better at creating important heuristics that sift through millions of products to draw connections between pricing strategies of different retailers.
Put together, this is creating a much more transparent marketplace for everybody.
And it is moving us ever closer to Gates’ first prediction.
Update, Nov 2017: This article was written in April 2015. Since then, while the fundamentals of how we approach the problem have remained the same, the tools have changed. Our algorithms are now almost entirely deep-learning driven. Here are some more recent articles on the topic of product matching: