So we spent most of last week in a haunted Chicago hotel, scurrying around the Internet Retailer Conference & Exhibition. This is what we learnt.
Part of a multi-part series on data science in ecommerce
So we just got back from beautiful Chicago, which was super nice to us, weather-wise.
And we also stayed at the SUPER CREEPY Congress Plaza Hotel, apparently Chicago’s most haunted hotel (guess that’s why we got a great deal on Orbitz).
But ghosts and ghouls aside, we had an AWESOME time chatting to exhibitors and conference delegates. It gave us so many insights on what the market is looking for in ecommerce data.
There is a need, more than ever, for high quality ecommerce product data
We found that there really aren't that many high quality ecommerce structured data providers out there. Almost every vendor we spoke to at IRCE15 had used some form of product database or API, but came away unimpressed. There were several reasons for this:
- Many ecommerce data companies that offer product APIs tend to underemphasize these highly technical, data-driven products in favor of higher-margin UI applications, analytics and higher abstractions of ecommerce product data
- By favoring fancy dashboards over investment in good data science technology, these companies end up with product APIs that are often poorly built, have staler pricing data, cover fewer retailers, and have lower quality structured data
- Other product API companies choose to pursue a “Mechanical Turk” approach to extracting, cleaning up and manipulating data. Often, this involves using large pools of tech labour in India to manually extract, manipulate and clean up datasets (so-called “data-entry” tasks), instead of using smarter artificial intelligence to automate these tasks. This “smoke-and-mirrors” approach, while giving an impression of sophisticated algorithms and artificial intelligence, almost always leads to poor data quality standards (humans are fundamentally error-prone)
Investing in good data-science intelligence and fully leveraging computational assets pays massive dividends
Semantics3's entire database is hosted on Amazon Web Services. We currently operate 1,500 machines on AWS to power our data extraction, product matching, categorization, and clean-up. In comparison, most platform-based startups operate about 2–3 machines on average (since their database needs aren’t as high)
Our entire data pipeline is managed by 15 highly talented engineers, many of whom graduated from the best engineering colleges in Singapore (National University of Singapore) and the US (Stanford).
In comparison, other product intelligence startups often employ over 30 engineers, and that’s an underestimate. Operating large teams creates more inertia and makes it tempting to favor human grunt work over efficient computer algorithms.
By operating a smaller team, we can afford to hire mathematical geniuses, geeks, hackers and dreamers to create much more sophisticated data algorithms. It also makes us much nimbler and more responsive to market demands: in the last six months, we’ve pushed out 3 new products, completely revamped our website and greatly improved our API user experience.
Product count is a vanity metric — unique product counts and price freshness matter more
Many ecommerce data companies like to parade their product count as vanity metrics. “A billion is better than a million” kinda metrics are often poorly explained or inflated to make database numbers look good.
When we spoke to exhibitors at IRCE15, many mentioned that while higher product counts initially seemed attractive, rooting out duplicates in these databases turned into a nightmare, and they lost faith in these product APIs as a result.
The truth is, unique products matter more than just all products. Many ecommerce databases contain duplicates — this is because they don't invest in developing product matching algorithms, which means they double-count identical products as they import them from multiple retailers.
For example, a silver 16GB iPhone listed by both Best Buy and Walmart is counted as 2 separate products, when it should actually be counted as one unique product with multiple offers.
In our database, we have over 70 million unique products, developed thanks to product matching. If you wanted to count every offer each product has from different retailers, it would add up to a large (but meaningless) 10 billion products.
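To make the difference between offers and unique products concrete, here is a minimal sketch in Python. It is not our production matching algorithm: it assumes every offer already carries a shared identifier (a hypothetical `upc` field), and the records, field names and the `group_offers` helper are made up purely for illustration. Real product matching has to cope with missing or inconsistent identifiers, which is where the hard work lies.

```python
from collections import defaultdict

# Hypothetical offer records. The field names and values are assumptions
# for illustration only, not a real API response.
offers = [
    {"retailer": "Best Buy", "upc": "000000000001",
     "title": "iPhone 16GB Silver", "price": 199.99},
    {"retailer": "Walmart", "upc": "000000000001",
     "title": "Apple iPhone 16 GB (Silver)", "price": 189.00},
    {"retailer": "Walmart", "upc": "000000000002",
     "title": "iPhone 16GB Gold", "price": 199.00},
]


def group_offers(offers):
    """Group retailer offers under one product per shared identifier."""
    products = defaultdict(list)
    for offer in offers:
        # Naive matching: offers with the same identifier are the same product.
        products[offer["upc"]].append(offer)
    return dict(products)


products = group_offers(offers)
offer_count = len(offers)      # what a naive "product count" would report
unique_count = len(products)   # distinct products after grouping

print(f"{offer_count} offers, {unique_count} unique products")
```

In this toy dataset, counting rows would report three "products", while grouping shows two unique products, one of which has offers from two retailers.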
No custom pricing — utilizing a “product first” strategy helps drive better value for customers
Many ecommerce product data and analytics companies do not publicly display their pricing. This was apparently a source of frustration for many exhibitors, especially start-ups who wanted a plug-and-play API that they could test out.
Truth is, not having a pricing page means that most of these APIs have custom pricing, which in turn comes from custom data jobs (i.e. consulting, or human bodies thrown at big data problems).
We decided to be different. That’s why we created clear and transparent pricing so that our customers get what they pay for, at a price point that’s affordable to most companies.
We want to make our data accessible to as many companies as possible.
That’s why we built the Semantics3 API.