The curious challenge of categorizing women’s clothing
A multi-part series on delivering structured e-commerce data on demand
Categorization can be a bane or a boon to retailers. Categories are the backbone of any product catalog: a neat, organized map of all the products in your inventory that helps users find what they want.
Just like in a library, organizing your products neatly by category is incredibly important as it is a crucial product discovery tool for users. If the search box on your retailer site doesn’t provide satisfactory results, users often turn to the category list to further refine results. If your customers don’t find their product, or even if they get the impression that the product isn’t available, they will head over to another retailer.
In short, if you have poor categorization, you're effectively handing over your revenue to competing retailers.
But that doesn't mean that categorization is an easy task, by any means. In fact, it can be incredibly challenging to algorithmically categorize a product.
Take, for instance, pants.
Pants are ubiquitous. It’s evident that pants are pants just by looking at them.
But what if the product in question also has the word “belt” in its name?
That’s a bit trickier. If you’re running a keyword analysis, the algorithm might pick up “belt” and misclassify the pants as a belt. This is relatively easy to fix for an individual product, but when you’re dealing with similar levels of ambiguity across millions of products, the task becomes a lot harder than you might think.
Of course, to a human, it’s pretty evident that these are pants. But to an algorithm, this presents a conundrum: without contextual knowledge, and given just the name, how should it categorize this product when both “belt” and “pant” turn up in the product title? Remember: if you miscategorize, you’ll miss out on opportunities to sell this product in either category.
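To make the conundrum concrete, here is a minimal sketch of a naive keyword matcher. The keyword lists and the product title are made up for illustration; this is not our production logic.

```python
# Hypothetical keyword lists, for illustration only.
CATEGORY_KEYWORDS = {
    "Belts": ["belt", "belts"],
    "Pants": ["pant", "pants", "trouser", "trousers"],
}


def keyword_matches(title):
    """Return every category with at least one keyword among the title tokens."""
    tokens = title.lower().split()
    return sorted(
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in tokens for keyword in keywords)
    )


# Both categories match, so keywords alone cannot settle the question.
print(keyword_matches("Wide Leg Pants with Belt"))
```

With only the title to go on, the matcher has no principled way to choose between the two candidates.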
The obvious way to handle this problem would be to hand it off to the best computers ever: human beings. This is easier said than done, though. To begin with, categorizing hundreds of thousands of SKUs can take months of manual work, which comes at a high cost; Mechanical Turk, SamaSource and the like offer great solutions if you can afford them.
The other problem is that manual categorization involves tagging a product with one among thousands of category keywords. Human beings can’t actively recall lists this large, so the less popular categories are likely to suffer from poor categorization.
Furthermore, when you attempt to increase the specificity of categories (i.e. use deeper sub-categories), accuracy drops, sometimes exponentially. Knowing when you have reached the limit of accuracy is important as well.
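One way to see why accuracy can fall off so quickly: suppose, purely as an assumption, that each level of the category tree is chosen correctly with the same probability and that mistakes at different levels are independent. The chance of getting the whole path right then shrinks geometrically with depth. The 95% figure below is an illustrative number, not a measurement.

```python
def path_accuracy(per_level_accuracy, depth):
    """Chance of getting every level right, assuming independent errors per level."""
    return per_level_accuracy ** depth


# Illustrative figure only: 95% accuracy at each level of the tree.
for depth in range(1, 6):
    print(depth, round(path_accuracy(0.95, depth), 3))
```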
Unfortunately, this problem remains open-ended. At Semantics3, we adopt a variety of approaches to keep our categorization relatively accurate.
Categorization is hard. Period.
While collecting or extracting data from retailers is pretty elementary, categorizing that product data is not. Product data is notoriously non-standard.
For example, it’s not likely that Nordstrom would conveniently use the same product category taxonomy as Amazon, would it?
That’s a problem: if they did, categorizing products would be a walk in the park.
Another typical problem that crops up involves category breadcrumbs, the navigation aids that e-commerce sites use to help users move between the products they are looking up. Usually these offer a wealth of information for categorization, since they often map to a central category taxonomy. But if non-standard (or erroneous) crumbs are used, they introduce a whole new level of complexity.
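As a rough illustration of how a breadcrumb trail might be turned into a category path, here is a small sketch. The separator list and the set of "noise" crumbs are assumptions made for the example, and real sites vary far more than this.

```python
import re

# Separators commonly seen in breadcrumb trails (an assumption, not exhaustive).
SEPARATORS = re.compile(r"\s*(?:>|\||›|»)\s*")
# Crumbs that carry no category information (hypothetical examples).
NOISE = {"home", "all", "shop"}


def parse_breadcrumb(trail):
    """Split a breadcrumb trail into a list of category names, dropping noise."""
    parts = [part.strip() for part in SEPARATORS.split(trail.strip())]
    return [part for part in parts if part and part.lower() not in NOISE]


print(parse_breadcrumb("Home > Women > Clothing > Pants"))
```

The resulting path can then be compared against a central taxonomy; a path that matches nothing is a hint that the crumbs are non-standard or erroneous.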
Additionally, if e-commerce retailers categorize products wrongly, that basically throws out any useful information that could be used in categorization. Heck, even humans make mistakes!
Applying statistics and machine learning algorithms can help
Since most of the data that we collect is in unstructured or semi-structured HTML, converting the data into a structured form and deciding which category to put a product under is very important and challenging at the same time.
If you look closely at the data, it becomes apparent very quickly that each retailer has a different approach to labeling/categorizing a product.
For example, site A would prefer to categorize an iPhone under Mobile Phones and site B would categorize an iPhone under Cell Phones — or even under Electronics.
In essence, there isn’t an industry-wide standard for categorizing a product.
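A simple way to picture the reconciliation problem is a lookup table from retailer labels to one canonical category. The table below is hypothetical and tiny; the hard part in practice is building and maintaining such a mapping at scale.

```python
# Hypothetical mapping from retailer labels to one canonical category path.
CANONICAL = {
    "mobile phones": "Electronics | Cell Phones",
    "cell phones": "Electronics | Cell Phones",
    "electronics": "Electronics",  # coarser label: the sub-category is unknown
}


def normalize_label(label):
    """Map a retailer's category label to a canonical one, or None if unseen."""
    return CANONICAL.get(label.strip().lower())


print(normalize_label("Mobile Phones"))  # site A's label
print(normalize_label("Cell Phones"))    # site B's label
print(normalize_label("Handsets"))       # unseen label, no mapping yet
```

Note that a label like Electronics can only be mapped to a coarse category, so an iPhone filed there still needs further work to place it precisely.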
One obvious and tedious way to solve this problem is to build out an ontology for all the products in our data universe, including potentially all the products sold in any given e-commerce store.
However, doing this on a large scale is a herculean task and often prone to errors and misjudgments.
There can be some shortcuts though.
In certain categories (fashion, for example), ontologies are usually available right off the shelf. For example, if a product is a legging, it almost certainly belongs under women’s bottoms/pants.
That certainly isn’t the case when someone says Samsung Galaxy.
An obvious approach is to categorize on a statistically determined, case-by-case basis. We strive to categorize as accurately as possible, leveraging algorithms and human effort as appropriate, in a scalable manner.
Which leads to our first lesson in categorization:
Whenever a category has an ontology that is readily available, we use it.
In computer science and information science, an ontology is a formal naming and definition of the types, properties, and interrelationships of the entities that really or fundamentally exist for a particular domain of discourse. It is thus a practical application of philosophical ontology, with a taxonomy.
Here’s a challenge: let’s try categorizing the product “Men’s Brown Chino Trousers”:
A watered-down version of our algorithm would look like this:
- Reduce the product name into its constituent tokens.
- Probe the ontology, and its corresponding decision tree, with each token.
- Rinse and repeat until the best decision (cat_id) is obtained.
The first token, Men’s, is a term variant of (that is, it has an established relationship with) one of the pivotal categories. This gives us confidence that it’s a men’s clothing item.
Okay, that’s good. Can we categorize it deeper into the category tree?
With the next token, Brown, there isn’t a notable relationship in the ontology.
Chino and Trousers do have a relationship: they’re keywords that point towards Pants.
This way, depending on how the decision tree is traversed, we arrive at a final category. In the example above, the product is categorized under Men’s | Pants.
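The walk-through above can be sketched in a few lines. This is a toy version under stated assumptions: the ontology is a hypothetical flat dictionary with two levels ("department" and "type") standing in for a real decision tree, and the names ONTOLOGY, tokenize and categorize are made up for the example.

```python
import re

# Toy ontology (hypothetical): token -> (level in the category tree, category).
ONTOLOGY = {
    "men's": ("department", "Men's"),
    "mens": ("department", "Men's"),
    "women's": ("department", "Women's"),
    "chino": ("type", "Pants"),
    "chinos": ("type", "Pants"),
    "trousers": ("type", "Pants"),
}
LEVELS = ("department", "type")  # order of levels in the category path


def tokenize(name):
    """Lowercase the name, normalize curly apostrophes, and split into tokens."""
    return re.findall(r"[a-z']+", name.lower().replace("’", "'"))


def categorize(name):
    """Probe the ontology with each token and assemble the category path."""
    decision = {}
    for token in tokenize(name):
        match = ONTOLOGY.get(token)
        if match is None:
            continue  # no relationship in the ontology, e.g. "brown"
        level, category = match
        decision.setdefault(level, category)  # keep the first hit per level
    path = [decision[level] for level in LEVELS if level in decision]
    return " | ".join(path) if path else None


print(categorize("Men’s Brown Chino Trousers"))
```

A flat lookup keeps the sketch short; a real decision tree would let earlier decisions (such as the department) constrain which sub-categories later tokens can select.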
While categorizing the Brown Chino, we only took into account the product name, and didn’t consider other data points.
However, in the production version of our algorithm, we look at everything that’s available for the product, from specifications to product descriptions and so on.
Although this technique works pretty well, building out the ontology so that it can effectively categorize more than 60 million unique products can take months. Not to mention the fact that we have to update this entire system if we encounter a strange new product and need to create a new category.
This brings us to the next lesson we learnt in Categorization:
Use a combination of different learning algorithms at different stages to collectively categorize your catalog.
In the next blog post in this series, we will discuss how we at Semantics3 leverage a combination of statistical models and deep neural networks to categorize products.
By combining all three techniques, we’ve managed to build one of the best product categorization engines out there.
Lovingly handcrafted by Abishek Bhat
and Hari Viswanathan
Give our product API a try at www.semantics3.com