Of the thousands of attributes that we handle while curating product catalogs, the hardest and perhaps most important attribute is brand. Consumers often begin their searches with brand names of the products they're looking for ... which is why our customers (marketplaces, retailers, brands and logistics companies) are keen on having high coverage of standardized values for this field.

Unfortunately, many of the data sources from which we build our catalogs often fail to provide brand as an explicit field ... or unwittingly carry an incorrect value in the brand field. In these cases, the challenge falls to us to build datasets that are robust to such issues. Specifically, this entails:

  • extracting brand from unstructured fields like name and description,
  • inferring brand where altogether absent,
  • standardizing the brand string to a unique representation consistent across the catalog.

Over the last 7 years, we've tried to tackle this problem in many different ways - hacks, statistical methods, NLP, heuristics, algorithms, annotation and more. In this article, I'd like to chronicle some of the approaches that've worked for us.

Before we begin though, let's delve deeper into the nuances that make this problem particularly challenging:

  1. Multiple Representations: The same brand can be represented in multiple forms - think General Electric vs GE or Apple vs AAPL.
  2. Multiple Brands?: Is the Huawei Google Nexus 6P a Huawei phone or a Google phone?
  3. Long-Tail: Global brand conglomerates are easy to map out, but in a world of D2C brands, there's no clear way to build a defined universal corpus of brands.
  4. Source Data Reliability: Since there's no defined universe of brands to work with, we have to rely on the very catalog data that we're looking to augment as our source of knowledge. It doesn't help that e-commerce sites often have spelling mistakes - we've seen values like Samsnug, Samnsug, Sammsung in production listings quite often.
  5. Proper Nouns: Who's to say that Samsnug isn't a legitimate brand word? Popular knowledge dictates that it probably isn't, but there are no grammatical constructs that help us determine this, since we're dealing with proper nouns here.
  6. Sparseness: Some brands only sell a single product - it's difficult to differentiate them from actual mistakes.
  7. Similarly Named Brands: Distinct valid brand names often differ by just one character. Sometimes, there isn't even a string differentiation - did you know that there are two different brands called Remington, one a personal care brand and another a firearms brand (remington.com vs remingtonproducts.com)?

We haven't been able to scale all of these hurdles, but we have made several inroads into chipping away at the problem as a whole. Here are some of the more notable approaches we've tried, and the insight underlying each one of them.

1. Co-Occurrence across Fields

For listings that come pre-provided with a structured brand, we track the co-occurrence of the brand phrase with other words in the product listing. When two phrases co-occur significantly more often than chance would suggest, we deem them to be equivalent.

For example, {"brand" : "Samsung Electronics", "name" : "Samsung Personal Security Window"} allows us to build the understanding that Samsung Electronics is equivalent to Samsung. This works because brand names are often referenced multiple times in a product listing, typically in the name and manufacturer fields.

The approach is particularly potent for handling spelling errors - most people who create e-commerce listings make these mistakes due to haste rather than lack of knowledge. This results in several listings that have the same brand name spelt correctly in one field of the listing and incorrectly in another. We exploit co-occurrence in these cases to mine common spelling errors. What's more, since most brands are misspelt the same way, the results of these efforts allow us to even fix listings where all instances of the brand are spelt incorrectly.
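A minimal sketch of the idea (not our production pipeline - the listing format, thresholds and the substring check for short forms are illustrative assumptions):

```python
from collections import Counter, defaultdict

def mine_equivalences(listings, min_support=2, min_ratio=0.8):
    """Toy co-occurrence miner: for each structured brand value, count
    which name tokens appear alongside it, and propose as equivalents
    the tokens that co-occur in a high fraction of that brand's listings."""
    brand_counts = Counter()
    cooccur = defaultdict(Counter)
    for listing in listings:
        brand = listing["brand"].lower()
        brand_counts[brand] += 1
        for token in set(listing["name"].lower().split()):
            cooccur[brand][token] += 1
    equivalences = {}
    for brand, tokens in cooccur.items():
        if brand_counts[brand] < min_support:
            continue
        for token, n in tokens.items():
            # A token present in nearly every listing of this brand, and
            # contained in the brand string itself, is a likely short form.
            if n / brand_counts[brand] >= min_ratio and token in brand:
                equivalences.setdefault(brand, set()).add(token)
    return equivalences

listings = [
    {"brand": "Samsung Electronics", "name": "Samsung Personal Security Window"},
    {"brand": "Samsung Electronics", "name": "Samsung Galaxy S6 Case"},
]
print(mine_equivalences(listings))
# {'samsung electronics': {'samsung'}}
```

For mining misspellings rather than short forms, the substring test would be swapped for an edit-distance check, so that a frequently co-occurring Samsnug still links back to Samsung.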

2. Ngram Probabilities (Language Modelling)

Consider the words Adasfasd and Mokibas. If you were compelled to decide which of the two is a valid brand, you'd probably select the latter, even if you've never seen either word before. This is because Mokibas uses vowels and consonants in a probability distribution that is more representative of the average English word. Using this intuition, we can separate the wheat from the chaff with decent precision ... although brands like BCBGMAXAZRIA can still fall through the cracks.
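This intuition can be captured with a character-level bigram model: train on known-good vocabulary, then score candidates by the average log-probability of their character transitions. A sketch with a deliberately tiny training corpus (a real model would be trained on a large vocabulary):

```python
import math
from collections import Counter

def train_bigram_model(words):
    """Count character bigrams, with ^ and $ as word-boundary markers."""
    bigrams, prev_counts = Counter(), Counter()
    for w in words:
        chars = "^" + w.lower() + "$"
        for a, b in zip(chars, chars[1:]):
            bigrams[(a, b)] += 1
            prev_counts[a] += 1
    return bigrams, prev_counts

def avg_log_prob(word, bigrams, prev_counts, alphabet_size=27):
    """Laplace-smoothed average log-probability of a word's bigrams
    (alphabet_size = 26 letters + boundary marker)."""
    chars = "^" + word.lower() + "$"
    total = 0.0
    for a, b in zip(chars, chars[1:]):
        total += math.log((bigrams[(a, b)] + 1) / (prev_counts[a] + alphabet_size))
    return total / (len(chars) - 1)

corpus = ["token", "kind", "tribe", "mobile", "market", "basket",
          "samsung", "nikon", "canon", "amazon", "silver", "garden"]
bigrams, prevs = train_bigram_model(corpus)
print(avg_log_prob("mokibas", bigrams, prevs) >
      avg_log_prob("adasfasd", bigrams, prevs))  # True - Mokibas looks more word-like
```

Thresholding the score then gives a cheap plausibility filter; the BCBGMAXAZRIA caveat applies, since legitimate but improbable letter sequences score poorly.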

3. Web Search on Bloomberg and the Like

Brand names aren't just words in a file - they represent real corporations with real employees and often, tangible web presences. Bloomberg and LinkedIn search are useful in validating a brand by linking it to a real organization.

Generic web search is useful as well - if a web search of a brand phrase provides top results from a domain name that looks very similar to the input, then the input brand is more likely to be legitimate.
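The domain-similarity check can be sketched as a simple string comparison. This hypothetical helper assumes the search-API call has already happened and we're handed the top result domains:

```python
from difflib import SequenceMatcher

def domain_supports_brand(brand, result_domains, threshold=0.75):
    """Heuristic: does any top-result domain closely resemble the brand?
    (Illustrative sketch; threshold and normalization are assumptions.)"""
    key = "".join(ch for ch in brand.lower() if ch.isalnum())
    for domain in result_domains:
        host = domain.lower().split(".")[0]  # drop the TLD
        if SequenceMatcher(None, key, host).ratio() >= threshold:
            return True
    return False

print(domain_supports_brand("Elizabeth Arden", ["elizabetharden.com", "amazon.com"]))  # True
print(domain_supports_brand("Adasfasd", ["pinterest.com", "ebay.com"]))                # False
```

A production version would also strip "www." prefixes and weight results by rank, but the core signal is just this: brand-like domains near the top of the results raise our confidence in the brand string.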

4. Image Matching

Quality-wise, images are usually more reliable than text, since there's no equivalent of spelling errors in images. If OCR on an image provides text that matches up with the brand field of a listing, then we can be quite sure that the brand name is valid.

This technique works particularly well for categories of products that are displayed with packaging.

When OCR corroborates the text value of a brand, the entry is deemed valid
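The corroboration step itself is a fuzzy match between the OCR output and the listed brand. A sketch, assuming the OCR text has already been tokenized upstream:

```python
from difflib import get_close_matches

def ocr_corroborates(brand, ocr_tokens, cutoff=0.85):
    """Does any token OCR'd from the product image closely match the
    listed brand? A high cutoff tolerates minor OCR noise while still
    rejecting unrelated words. (Illustrative sketch.)"""
    matches = get_close_matches(brand.lower(),
                                [t.lower() for t in ocr_tokens],
                                n=1, cutoff=cutoff)
    return bool(matches)

# e.g. tokens OCR'd from a packshot of a Samsung phone box
print(ocr_corroborates("Samsung", ["SAMSUNG", "Galaxy", "64GB"]))  # True
print(ocr_corroborates("Sony", ["SAMSUNG", "Galaxy", "64GB"]))     # False
```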

5. Named Entity Recognition with Conditional Random Fields and Bi-Directional LSTMs

Product names don't follow natural language conventions. "ASUS 11.6" Laptop - Intel Celeron" and "Sunflowers by Elizabeth Arden, 3 Piece Gift Set for Women" don't use nouns, articles and adjectives the way normal English text does. Therefore, out-of-the-box NER models don't translate well to the domain of e-commerce, not even for basic proper noun detection.

That said, there still is information in the structure of product names that can be captured. For example, electronic products usually display brand names at the very beginning of the title, while perfumes and beauty products often do so at the end. These relationships can be captured to a non-trivial extent through CRFs and neural networks built on Bi-Directional LSTMs.

Our NER testing framework
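The positional signal gets into a CRF through per-token feature dictionaries. This is an illustrative feature extractor, not our exact feature set - in practice these dicts would be fed to a sequence model such as a CRF alongside BIO-style labels:

```python
def token_features(tokens, i):
    """Features for token i of a product title, of the kind a CRF
    consumes. rel_pos captures the positional regularity: electronics
    brands tend to lead the title, beauty brands often trail it."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_upper": tok.isupper(),
        "is_title": tok.istitle(),
        "has_digit": any(c.isdigit() for c in tok),
        "rel_pos": round(i / max(len(tokens) - 1, 1), 2),  # 0.0 = start, 1.0 = end
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

title = "Sunflowers by Elizabeth Arden , 3 Piece Gift Set for Women".split()
feats = [token_features(title, i) for i in range(len(title))]
print(feats[2]["rel_pos"], feats[2]["prev"])  # "Elizabeth" sits early-mid title, after "by"
```

The BiLSTM variant learns analogous positional and contextual cues directly from token embeddings rather than hand-built features.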

6. Inference from the Model Field

When dealing with consumer-generated listings, we run into the issue of popular models that don't carry a brand name. An example of this is the iPhone - creators of iPhone listings may fail to mention Apple at all in their entry, but as curators, we still have to populate the brand field accordingly for purposes of search, SEO and more. We deal with such examples by tapping into expert sources of facts such as Wikipedia, and maintaining a mapping from model keywords to brands.
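The lookup itself is straightforward once the mapping exists. A sketch (the mapping entries here are illustrative; the real table is much larger and curated from sources like Wikipedia):

```python
MODEL_TO_BRAND = {
    "iphone": "Apple",
    "galaxy note": "Samsung",
    "thinkpad": "Lenovo",
}

def infer_brand(name):
    """Fill a missing brand by spotting a known model keyword in the
    product name. Longer keywords are tried first so that multi-word
    models like 'galaxy note' win over any shorter overlapping key."""
    lowered = name.lower()
    for keyword in sorted(MODEL_TO_BRAND, key=len, reverse=True):
        if keyword in lowered:
            return MODEL_TO_BRAND[keyword]
    return None

print(infer_brand("iPhone 6s Plus 64GB Space Grey"))  # Apple
print(infer_brand("Generic USB cable"))               # None
```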

These are just some of the techniques we've used to tackle brand extraction and normalization. We still have a fair bit of room for improvement as far as solving this problem goes, and it continues to be one of the hardest problems we've tried to tackle at Semantics3.

Personally, my key learning in dealing with this problem has been that knowing and applying the latest and greatest algorithms isn't always the most potent skill that a data scientist can bring to the table. In each of these cases, thinking deeply about the problem and having the right intuition helped us go further than playing around with fancy algorithms did. Most of these approaches were the results of interesting conversations followed by quick sessions of coding and experimentation ... with many a failed dataset and experiment born of superficial insights in between.

If you have any thoughts on other methods we could try, do get in touch via govind [at] semantics3 [dot] com.