How to effectively index an online retailer
The parallels between a zoologist navigating the Amazon jungle and a data scientist analyzing a large retail website are uncanny.
We detail three lessons where the parallels are apparent.
Respect the jungle
A zoologist’s prime concern in the Amazon is to study fauna and not mess with it.
A data scientist’s prime concern while analyzing a retailer’s website is to not interfere with the site in any way. That means respecting robots.txt, a plain-text file served at the root of a site (e.g. /robots.txt) that lists the rules indexing bots must abide by.
At Semantics3, we protect sites by respecting each website’s rules on how frequently it may be pinged. We follow the “politeness rule”: a minimum delay between subsequent requests. This prevents our crawlers from flooding a site with traffic, which would have the same effect as a DDoS (distributed denial-of-service) attack.
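As a rough sketch of what respecting these rules looks like in practice, Python’s standard library ships a robots.txt parser. The bot name, paths, and robots.txt contents below are invented for illustration; this is not Semantics3’s actual crawler.

```python
import time
import urllib.robotparser

# Hypothetical robots.txt for an imaginary retailer.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Crawl-delay: 1
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def allowed_paths(paths, user_agent="ExampleBot", default_delay=1.0):
    """Keep only paths robots.txt allows, pausing between them."""
    delay = rp.crawl_delay(user_agent) or default_delay
    out = []
    for path in paths:
        if rp.can_fetch(user_agent, path):
            out.append(path)   # in a real crawler, fetch the page here
            time.sleep(delay)  # politeness rule: minimum gap between pings
    return out
```

A production crawler would refresh robots.txt periodically and track per-host delays, but the core idea is the same: check every URL against the rules, and never ping faster than the site permits.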
Try to find new species
A zoologist knows of no greater joy than discovering a new species of animal life. To do this, a zoologist needs to know which species have already been identified and which are new. Knowing this saves repeated visits to the jungle.
Our data scientists get a similar joy from finding a new product that was not on the market seconds prior.
The challenge in detecting a new product is to avoid comparing each candidate product against the entire existing database of millions of products.
Our scientists use an intelligent product-ranking algorithm to tackle this issue. The search runs over categories of products rather than individual products, reducing the processing time required. We can find all the new products on any retailer within 24–48 hours (by which time newer products have already been introduced!)
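The category idea can be illustrated with a toy example: index known products by category once, and then a candidate only needs to be compared against its own category’s bucket rather than the whole catalog. The data, field names, and matching-by-title rule below are all invented for illustration; the real algorithm is far more sophisticated.

```python
from collections import defaultdict

# Invented sample of already-known products.
known = [
    {"id": "A1", "category": "shoes",   "title": "Trail Runner 2"},
    {"id": "B7", "category": "watches", "title": "Chrono X"},
]

# Build the per-category index once.
by_category = defaultdict(set)
for p in known:
    by_category[p["category"]].add(p["title"].lower())

def is_new(product):
    """Check the candidate only against its own category bucket."""
    return product["title"].lower() not in by_category[product["category"]]
```

With millions of products spread over thousands of categories, each lookup touches a small bucket instead of the full database, which is what makes a 24–48 hour full sweep feasible.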
Record all your observations
A zoologist will always record their findings in the jungle, even if it is in a damp notebook on a hot and humid day when the ink barely flows from the pen. The challenge is choosing wisely what to record and ensuring that you do not run out of pages in that one notebook!
A data scientist, on the other hand, needs to record data on every web page that has been analyzed. This becomes impossible when you’re analyzing 10 million pages a day.
At Semantics3, we use intelligent product-matching algorithms to ensure that we record information from unique products only. Yet we still find ourselves with a database of 75 million unique products, with 15% changing every month and the total growing at 1% every month.
Even after the matching, we need the storage and computational power of over 1,500 AWS machines, processing over 10 terabytes of data every day. Not surprisingly, computing and storage account for the largest cost bucket in our operations.
While a massive online database-as-an-API isn’t quite your neighbourhood zoo, it performs a similar function: making that massive online mass and diversity of data accessible to you in an easy-to-parse, easy-to-query data structure.