A couple of weeks ago, I posted a two-part series detailing how we do data QA at Semantics3. In the days since, several people have got in touch to discuss the best ways to establish or improve data/product quality initiatives at their own organizations. For some, the impetus was a desire to stem customer churn; for others, it was simply to make customers happier.

Here were some of the common discussion points from these conversations:

  • Should we hire a dedicated QA analyst to solve our data quality issues?
  • What sort of profile should such a hire have?
  • How do you decide what your data QA budget should be?
  • How do you draw a line between automation and manual effort?
  • We don't want to build on heuristics because they don't scale - any alternatives?

The underlying theme in all of these questions was that everyone was looking for a framework to think about quality analysis. So, while my previous posts addressed specific ways in which we do data QA at Semantics3, in this article, I'd like to speak to the philosophy behind them.

1. Measure everything

First, you need a metric that you can optimize. Dig into your customer support tickets, and assign a metric that can quantify the extent of the problem. Design tests to sample and measure the data if that's what it takes.
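As a minimal sketch of what such a sampling test might look like, the snippet below draws a uniform random sample from a dataset and reports the fraction of records that fail a check. Everything named here is an assumption made for illustration: `sample_error_rate`, the `is_correct` callback (which could stand in for a human annotator's verdict or an automated rule) and the `price` field are hypothetical, not part of any Semantics3 tooling.

```python
import random
from typing import Callable, Sequence


def sample_error_rate(
    records: Sequence[dict],
    is_correct: Callable[[dict], bool],
    sample_size: int,
    seed: int = 0,
) -> float:
    """Estimate the error rate of `records` from a uniform random sample.

    `is_correct` is a hypothetical check supplied by the caller.
    """
    if not records:
        raise ValueError("records must not be empty")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(sample_size, len(records))  # never ask for more than we have
    sample = rng.sample(list(records), k)
    errors = sum(1 for record in sample if not is_correct(record))
    return errors / k


# Illustrative data: a price counts as valid if it is present and positive.
catalog = [{"price": 10.0}, {"price": -1.0}, {"price": 5.0}, {"price": None}]


def has_valid_price(record: dict) -> bool:
    return record["price"] is not None and record["price"] > 0


error_rate = sample_error_rate(catalog, has_valid_price, sample_size=4)
```

The point of the sketch is only that the metric is a number computed the same way every time, so that it can be tracked and optimized.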

2. Set realistic targets with your customers

Once you have a metric, figure out what your target is. Depending on who your customers are and how they are using your data, you will find that some errors are more acceptable than others. Ideally, speak to each customer, explain to them how you quantify quality in your offerings, and identify optimal targets that work for them. If the goals that a customer expects are unrealistic or infeasible, say so; you'll be surprised how often customers are willing to adapt either their expectations or their budgets. It's always better to be candid about what you can realistically deliver than to risk reneging on lofty promises.

3. Think statistically

Implicit in the previous step is the fact that all stakeholders need to build the habit of thinking statistically. Datasets are rarely completely free of errors - it helps nobody when a senior executive arbitrarily discovers an error and sends on-the-ground teams scurrying to fix edge-cases. Decisions should be made based on metrics calculated on meaningfully large samples.
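To make "meaningfully large samples" concrete, here is a small sketch that puts a confidence interval around an observed error rate using the Wilson score interval. This is one standard choice of interval, not necessarily what we use internally, and the function name `wilson_interval` and the default `z = 1.96` (roughly 95% confidence) are assumptions made for illustration.

```python
import math
from typing import Tuple


def wilson_interval(errors: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for an error rate observed as `errors` out of `n`.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= errors <= n:
        raise ValueError("errors must be between 0 and n")
    p = errors / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


small_sample = wilson_interval(5, 100)     # 5 errors in 100 records
large_sample = wilson_interval(50, 1000)   # same 5% rate, ten times the sample
```

Both calls describe an observed error rate of 5%, but the interval from 100 records spans roughly 2% to 11%, while the one from 1,000 records is considerably narrower. A single error spotted in isolation says even less, which is why decisions are better anchored to the interval than to the anecdote.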

4. Understand biases in your measurement processes

The metrics that you do build out as part of your zeal to measure everything may be biased by how you choose to evaluate them. What data was available to the annotation team at the time of decision making? How were the questions posed? How was the dataset sampled? Did the annotators lack the cultural context to make their assessments?

5. Don't look for superheroes - build culture

A few years ago, we had an all-hands gathering to determine how best to manage customer churn due to data quality issues. The outcome of this meeting was to appoint a dedicated data quality analyst to think systematically about the issue. Fast forward six months, the problem continued to be as rampant as ever, despite several innovative efforts by the quality analyst.

The issue was that nobody else in the company was incentivized to do anything about the quality problem; Gotham City was on fire and we'd decided that all we needed was Batman. We saw many a building on fire, but didn't call the fire department because we assumed that our superhero would eventually turn up. But it turns out that Batman isn't really a superhero.

Quality has to be a system-wide initiative - anybody who interfaces with pipelines through which data flows, be it algorithms, infrastructure or applications, should actively think about the impact of his or her deployments on quality.

6. Get your smartest generalist engineering minds on it

QA is often viewed as unglamorous, low-ROI work - engineers and managers often think that it's beneath them. But I believe that if quality matters to your product, you need the best minds on it, especially generalist engineers who have the capacity to find patterns of issues in datasets and build systems to tame them.

7. Build process and embrace repetition

As soon as you find that your smartest engineers are running out of ideas for building systems though, hand the reins over to somebody who can think in terms of process and execution. To answer one of the questions raised at the beginning of the article, this is probably the right time to make your first dedicated QA hire.

Ideally, this QA hire should be great at building processes to continuously measure and control the health of your datasets, so that you never have to worry about past errors recurring.

8. Do things that don't scale

I've seen many a startup run away from building out sustainable QA efforts, because of the misconception that QA primarily requires human effort, which is fundamentally not scalable. Instead, teams choose to work on initiatives that fix the fundamental problem - better data sources and improved algorithms. I think Paul Graham's adage "do things that don't scale" applies perfectly here - by all means look for a scalable solution, but don't run away from seemingly unscalable efforts. You'd be surprised to find that ostensibly unscalable approaches can be made to scale with a bit of effort.

9. Enjoy it

Nobody aspires to become a QA analyst, at least at first. But set your biases aside and you will find that it's a rather rewarding experience. It gives you significant scope for creativity, and empowers your colleagues in product, customer success and data science to further their own work. Most importantly, it allows your company to deliver products that make customers happy.


If these principles resonate with your experience, or what you are looking to do, do get in touch at govind [at] semantics3 [dot] com