In this first of two posts about data quality, I'd like to delve into the challenge of building and maintaining evolving datasets, i.e., datasets that are a function of changing inputs and fuzzy algorithms and are therefore subject to constant modification.

At Semantics3, we deal with many such datasets; we work with changing inputs like product and image URLs which vary depending on when they're crawled, and machine learning algorithms that are regularly subject to data patterns that they might not have seen before. As a result, output datasets can change with time, and from one run to the next.

Run-by-run volatility of this kind is not inherently a bad thing, as long as the aggregate precision and recall of the dataset is kept in check. To do this, we have, over the years, developed a set of approaches that can be broadly divided into two groups - statistical & algorithmic techniques to detect issues, and human-in-the-loop processes to resolve them. In this first part, I'd like to share some of our learnings from the former group.

In any given month, we have billions of data points that undergo some sort of change. At this volume, human review of every data point is infeasible. Therefore, we rely on automated techniques to direct our human reviewers to the pockets of data that are most likely to be problematic.

Below, I'll run through some of the most powerful techniques that can also be generalized across domains.

1. Standard Deviation

For numerical data fields, the easiest approach is to surface values that deviate significantly from the mean. Once your dataset is sufficiently large, many numerical fields begin to approximate a normal distribution. Values that lie more than a chosen number of standard deviations from the mean can then be flagged as potentially faulty.
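A minimal sketch of this check in Python (the three-standard-deviation threshold is illustrative and should be tuned per field):

```python
import statistics

def flag_outliers(values, num_sd=3.0):
    """Flag values more than num_sd standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return []  # all values identical; nothing to flag
    return [v for v in values if abs(v - mean) > num_sd * sd]
```

Note that the mean and standard deviation are themselves distorted by extreme outliers; robust variants swap in the median and median absolute deviation.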

2. One-Class SVMs / Auto-encoders (Anomaly Detection)

Deviation from the mean is easy to set up for numerical fields, but what about text fields? One of our favorite approaches for text is to use one-class SVMs.

Let's say our goal is to detect garbled product names (think words like Anasdhaseas Oisdfsdg). We would train a one-class SVM on ALL the names in our product database, and then filter out the entries that are more than a certain tolerance distance away from our norm.

Why one-class? Since mistakes are statistically sparse, it's difficult to engineer a representative negative class of mistakes in the training dataset. What's more, there are usually just a handful of ways in which a text field can be correctly represented, but many more ways in which it can go wrong. Given this, we aim instead to build a representation of what a normal version of the field looks like, and call out values that differ from it.

One-Class SVMs

3. Frequency Checks

A lot of the attributes that we encounter are theoretically meant to be unique - for example, a UPC should not occur in a dataset of unique products more than once. The same applies to fields like manufacturer part number, model, and to a lesser degree, to name, images and descriptions. For such fields, a powerful approach is to simply count the number of times each value occurs for a given field, and flag values that occur more often than a permitted frequency.
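This check is a few lines with a counter; the threshold of one is what a strictly unique field like UPC would use:

```python
from collections import Counter

def flag_frequent(values, max_allowed=1):
    """Return values that occur more often than permitted,
    e.g. duplicate UPCs in a dataset of unique products."""
    counts = Counter(values)
    return {v for v, n in counts.items() if n > max_allowed}
```

For fields like descriptions, where some repetition is legitimate, `max_allowed` is simply raised.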

4. Negative Correlation between Fields

Consider the example: {"length" : "10 mm", "weight" : "100kg"}. This combination of length and weight is improbable, even though each value in isolation is perfectly plausible. By charting correlations between related numerical fields, records that deviate significantly from the expected relationship can be spotted.
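One way to sketch this: fit a least-squares line between the log-transformed fields, then reuse the standard-deviation idea on the residuals. The field names and the two-sigma threshold are illustrative:

```python
import math

def flag_skewed_pairs(records, num_sd=2.0):
    """Fit log(weight) against log(length) by least squares, then
    flag records whose residual from the fit is an outlier."""
    xs = [math.log(r["length_mm"]) for r in records]
    ys = [math.log(r["weight_g"]) for r in records]
    n = len(records)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    sd = math.sqrt(sum(r * r for r in residuals) / (n - 1))
    return [rec for rec, res in zip(records, residuals)
            if abs(res) > num_sd * sd]
```

Working in log space makes sense for dimensional fields, where weight tends to scale as a power of length rather than linearly.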

5. Clustering

The potency of the techniques mentioned thus far depends on which sub-group of the dataset you've chosen to analyze. The obvious approach is to treat the dataset as a single whole. But there is also value in slicing and dicing the dataset by specific fields - like brand and category - or even by clusters generated through unsupervised methods like K-Means.

A simple example of this: {"brand" : "Apple", "category" : "Phones", "price" : "$1.00"}. In a dataset containing a wide spread of e-commerce products taken as a whole, $1.00 might pass as a valid price, but through lenses such as Phones or Apple, this outlier is easily caught.
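A sketch of the slicing idea, reapplying the standard-deviation check within each (brand, category) slice; the slice keys, field names and thresholds are illustrative:

```python
import statistics
from collections import defaultdict

def flag_by_slice(products, field="price", num_sd=2.0):
    """Apply a standard-deviation check within each (brand, category)
    slice rather than over the whole dataset."""
    slices = defaultdict(list)
    for p in products:
        slices[(p["brand"], p["category"])].append(p)
    flagged = []
    for group in slices.values():
        if len(group) < 3:
            continue  # too few samples to estimate a spread
        vals = [p[field] for p in group]
        mean, sd = statistics.mean(vals), statistics.stdev(vals)
        if sd == 0:
            continue
        flagged += [p for p in group if abs(p[field] - mean) > num_sd * sd]
    return flagged
```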

6. Template Matching

For images, we've curated a list of faulty images (think placeholder logos and HTTP error messages), which we add to as and when new problematic images are detected. When new images are checked for quality, we simply run a template match against this list, and call out the images that match.
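The simplest stand-in for this is an exact fingerprint blocklist, sketched below; a production template match (e.g. OpenCV's matchTemplate, or a perceptual hash) additionally tolerates re-encoding and resizing, which raw byte hashing does not:

```python
import hashlib

def fingerprint(image_bytes):
    """Exact content fingerprint of an image's raw bytes."""
    return hashlib.sha256(image_bytes).hexdigest()

class TemplateBlocklist:
    """Curated list of known-faulty images (placeholder logos,
    HTTP error pages), grown as new problematic images surface."""

    def __init__(self):
        self._hashes = set()

    def add(self, image_bytes):
        self._hashes.add(fingerprint(image_bytes))

    def is_faulty(self, image_bytes):
        return fingerprint(image_bytes) in self._hashes
```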

7. Multi-Modal Translation

Fields often encode data that can be parsed out and translated to a different mode. For example, images can be translated to text through OCR techniques, color detection and even generic labels via ImageNet-like models. If the resultant text disagrees with the rest of the product attributes, then you might have a problematic case at hand.

Similarly, ID fields like GTINs and MPNs may carry interpretable information that can be verified against other fields - GTIN-14 entries that begin with "0978" represent books, models that end with "BLK" are likely to indicate black color products and so on.
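The two heuristics above can be sketched as straightforward string checks; the field names and disagreement messages here are hypothetical:

```python
def check_id_consistency(product):
    """Cross-check interpretable ID fragments against other fields:
    a GTIN-14 starting '0978' (Bookland) suggests a book, and a
    model number ending 'BLK' suggests a black product."""
    issues = []
    gtin = product.get("gtin14", "")
    if gtin.startswith("0978") and product.get("category") != "Books":
        issues.append("GTIN prefix suggests a book, but category disagrees")
    model = product.get("model", "")
    if model.endswith("BLK") and "black" not in product.get("color", "").lower():
        issues.append("Model suffix suggests black, but color disagrees")
    return issues
```

Heuristics like the "BLK" suffix are noisy on their own, which is why disagreements feed human review rather than automatic correction.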

Understanding Bookland

8. Metadata Bounds

Simple metadata checks can go a surprisingly long way - length of a URL, size of an image file, dominant image pixel, number of words in the product name and so on.
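These checks amount to a table of per-field bounds; the fields and limits below are illustrative and would be tuned per category:

```python
# Hypothetical bounds: (minimum, maximum) per metadata field.
BOUNDS = {
    "url_length": (10, 2048),          # characters
    "image_bytes": (1_000, 10_000_000), # file size
    "name_words": (1, 30),             # words in the product name
}

def check_metadata(record):
    """Return the metadata fields whose values fall outside bounds."""
    violations = []
    for field, (lo, hi) in BOUNDS.items():
        value = record.get(field)
        if value is not None and not (lo <= value <= hi):
            violations.append(field)
    return violations
```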

9. Older Model Versions

Algorithms used in production are typically developed iteratively. When a new version returns improved precision or recall metrics, it is likely to be pushed to production as an upgrade. However, improved aggregate metrics don't mean that the correct answers of the newer version are a superset of those of the previous version. The newer model could be right in new ways while regressing on a few older cases. Comparing examples on which a previous version and a newer version disagree can be surprisingly useful in unearthing errors in the newer model.
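The comparison itself is a simple diff of the two versions' outputs; the item IDs and labels here are illustrative:

```python
def diff_model_outputs(old_outputs, new_outputs):
    """Surface items whose label changed between model versions -
    the highest-value candidates for human review."""
    return {
        item_id: (old_outputs[item_id], new_label)
        for item_id, new_label in new_outputs.items()
        if item_id in old_outputs and old_outputs[item_id] != new_label
    }
```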

Of course, in all of these cases, the devil is in the details. There's an art to determining the right set of approaches for the right fields, categories and even customers. But armed with these approaches, we typically find that there are quick wins to be had before humans are introduced into the loop.

In part 2, we'll delve into how our operational teams tackle data quality issues once they've been identified, using more human-centric efforts.