Following on from our “Google is at it again with GTIN requirements for Product Listing Ads” blog post, we thought we should share some of the challenges that we face when searching for GTINs. So first, let’s back up and look at what a GTIN is.
GTIN (Global Trade Item Number) describes the 14-digit, globally unique code that identifies products. At the moment GTINs are used for barcodes. In North America the UPC, a 12-digit version of the GTIN, is predominantly used for barcodes. In Europe, European Article Numbers (a.k.a. EANs), a 13-digit version, are the more commonly used identifier; these too are a subset of the GTIN.
To sum up, GTIN is the umbrella category under which barcode identifiers such as UPCs and EANs sit.
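To make the "umbrella" idea concrete, here is a minimal sketch of how a shorter code can be expressed in the 14-digit form by left-padding with zeros. The function name `to_gtin14` is our own illustration, not part of any Semantics3 API, and the sketch does not check whether the code is actually assigned to a product.

```python
def to_gtin14(code: str) -> str:
    """Left-pad a GTIN-8, UPC (12), EAN (13) or GTIN-14 to 14 digits.

    Hypothetical helper for illustration only.
    """
    # Drop the spaces and hyphens that often appear in printed barcodes.
    digits = code.replace(" ", "").replace("-", "")
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a numeric code: {code!r}")
    if len(digits) not in (8, 12, 13, 14):
        raise ValueError(f"unexpected length {len(digits)}: {code!r}")
    # Zero-padding on the left keeps the check digit in the last position.
    return digits.zfill(14)


if __name__ == "__main__":
    print(to_gtin14("036000291452"))   # a 12-digit UPC
    print(to_gtin14("4006381333931"))  # a 13-digit EAN
```

Storing every identifier in one padded form means a UPC and the equivalent EAN compare as equal strings.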
“Great, every product has a unique code and so it can be identified!” Wrong!
There are four main reasons why this isn’t always so:
- Human error can happen when entering or extracting a product’s GTIN, or any other unique identifier.
- Bad assignment of GTINs. People may make up their own numbers, so suddenly one “unique” identifier maps to two or more different products.
- Active reselling of GTINs and UPCs by third-party companies, although this is actively discouraged by GS1.
- Lack of global coordination on the issuance and usage of barcodes by GS1 as we detailed here.
Therefore you should never depend solely on GTINs or other unique identifiers, as we have seen that they are not 100% reliable.
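Some of the human error can be caught cheaply, because the last digit of every GTIN is a check digit computed from the others with the standard GS1 mod-10 scheme. The sketch below uses function names of our own choosing. Note that it only catches typos: a made-up or resold number with a correct check digit will still pass, which is exactly why the identifier alone is not enough.

```python
def gtin_check_digit(body: str) -> int:
    """Compute the GS1 check digit for a code without its last digit."""
    total = 0
    # Working from the rightmost digit, weights alternate 3, 1, 3, 1, ...
    for position, char in enumerate(reversed(body)):
        weight = 3 if position % 2 == 0 else 1
        total += int(char) * weight
    return (10 - total % 10) % 10


def is_valid_gtin(code: str) -> bool:
    """Return True if the code has a GTIN length and a correct check digit."""
    if not (code.isascii() and code.isdigit()):
        return False
    if len(code) not in (8, 12, 13, 14):
        return False
    return gtin_check_digit(code[:-1]) == int(code[-1])


if __name__ == "__main__":
    for candidate in ("036000291452", "036000291453", "4006381333931"):
        print(candidate, is_valid_gtin(candidate))
```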
How do we decide that two products are exactly the same?
The biggest challenge is deciding whether two products are exactly the same or variations of each other. At Semantics3 we use a nine-stage pipeline that matches different things at each stage and returns a yes, no, or maybe.
Fields like brand, name, and images are normalized and compared. This is a complex process, as these vary hugely across retailers. Machine learning is also used to analyze millions of data points across our database and develop heuristics for determining product matches. Each unique product discovered is then assigned our own unique Semantics3 ID and stored.
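As a rough illustration of what a single "normalize, compare, answer yes/no/maybe" stage could look like, here is a toy sketch. It is not the Semantics3 pipeline: the function names, the use of `difflib` for string similarity, and the thresholds are all assumptions made for the example, and it ignores images entirely.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return " ".join(text.split())


def compare_products(a: dict, b: dict,
                     yes_at: float = 0.9, maybe_at: float = 0.6) -> str:
    """Return 'yes', 'no' or 'maybe' for two product records.

    Records are plain dicts with 'brand' and 'name' keys (an assumed shape).
    The thresholds are illustrative, not tuned values.
    """
    # A different brand is treated as a definite mismatch in this sketch.
    if normalize(a["brand"]) != normalize(b["brand"]):
        return "no"
    score = SequenceMatcher(
        None, normalize(a["name"]), normalize(b["name"])
    ).ratio()
    if score >= yes_at:
        return "yes"
    if score >= maybe_at:
        return "maybe"
    return "no"


if __name__ == "__main__":
    first = {"brand": "Acme", "name": "Widget Pro 2000 (Black)"}
    second = {"brand": "ACME ", "name": "widget-pro 2000, black"}
    print(compare_products(first, second))
```

The "maybe" bucket is the useful part: records that land there can be passed to a later, more expensive stage rather than being forced into a yes or a no.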
So how do we find unique identifiers for products?
As always, the process starts with a phone call to us. Customers come to us with a wide range of product datasets for which they need GTIN enrichment. With the recent Google Ads GTIN requirements coming into play in only a month’s time, we’ve seen a rise in requests to match GTINs to product metadata.
This challenge is keeping the engineers on their toes, pushing them to keep developing up-to-date solutions for our clients.
Check out “Announcing Expanded UPC/Barcode Look-ups”, a previous post on keeping up with the times.
Currently we can use our API to search by the following to get unique identifiers:
- Product names and other free-text searches can be used, but they are not very accurate because retailers vary in the exact name they give a product. This results in what we call a “fuzzy” match.
- MPNs (Manufacturer Part Numbers) are unique to a factory; however, they again result in a “fuzzy” match, as they are alphanumeric strings rather than numbers. We have had some success when coupling them with the brand name and a category ID, but this has not yet been particularly reliable.
- URLs are great unique identifiers too. However, this only works if we have crawled the site that the URL is from, or if the site is crawlable.
So in order to get unique identifiers for products we need to have distinct and searchable data for them.
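To show why an MPN on its own is fuzzy and why pairing it with brand and category helps, here is a small sketch that builds a composite lookup key. The helper names and the key shape are hypothetical and do not describe our API; the MPN cleanup rule (uppercase, strip everything but letters and digits) is just one plausible choice.

```python
import re


def normalize_mpn(mpn: str) -> str:
    """Uppercase an MPN and strip separators such as '-', '/' and spaces."""
    return re.sub(r"[^A-Z0-9]", "", mpn.upper())


def match_key(brand: str, mpn: str, category_id: int) -> tuple:
    """Build a composite key from brand, cleaned MPN and category ID.

    Hypothetical helper: two records with the same key are candidates
    for being the same product, not a confirmed match.
    """
    return (brand.strip().lower(), normalize_mpn(mpn), category_id)


if __name__ == "__main__":
    # The same part number written two ways by two retailers.
    print(match_key("Acme ", "wx-100/b", 42))
    print(match_key("ACME", "WX 100 B", 42))
```

Two different manufacturers can easily reuse a short string like `WX100B`, so the brand and category in the key do most of the work of keeping unrelated products apart.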
Datasets that are hard to get substantial matches for tend to contain the following:
- Products that are not easily found online.
- White-label goods, i.e. products that only one company sells online. If the only company that sells these products does not know the unique identifiers, then no matter how much we crawl the internet we won’t be able to find them either.
- Non-English language products. The majority of our products are from the US market, with a smattering from the UK and Australian markets. We don’t currently normalize data from sites in other languages.
So to bring this post to a close, finding GTINs is tough and our team here at Semantics3 has put a lot of hard work into making sure that we get you the most accurate data possible.
Currently we are working on new strategies for extracting GTINs for products with very little unique product metadata.
…and continue to watch this space for upcoming blog posts on how to solve your GTIN problems!
We shall be attending the GS1 conference in Washington, June 1st-3rd. Come and find out how we can help you at the GTIN-pocalypse Stand (Booth 42).
Lovingly built in San Francisco, Singapore and Bengaluru by Anna Rogers, Sivamani Varun, and the Semantics3 Team.