In this second of two posts about data quality, I'd like to delve into the challenge of building and maintaining evolving datasets, i.e., datasets that are a function of changing inputs and fuzzy algorithms and therefore subject to constant modification.
At Semantics3, we deal with many such datasets; we work with changing inputs like product and image URLs which vary depending on when they're crawled, and machine learning algorithms that are regularly subject to data patterns that they might not have seen before. As a result, output datasets can change with time, and from one run to the next.
Run-by-run volatility of this kind is not inherently a bad thing, as long as the aggregate precision and recall of the dataset are kept in check. To do this, we have, over the years, developed a set of approaches that can be broadly divided into two groups - statistical & algorithmic techniques to detect issues, and human-in-the-loop processes to resolve them. In this second part, I'd like to share some of our learnings from the latter group.
In Part 1, we looked at automated techniques to detect data quality issues. In this post, we'll look into how we resolve these problems, and try to ensure that they don't crop up again.
1. Creating Logical Expressions on Google Sheets
First, we attempt to express the issue as a logical expression, preferably with IF / OR / AND operators alone. Where possible, we also try to express the solution via an expression. For example, given the issue "mobile phone cases are wrongly being categorized as mobile phones", we would try to express the solution as:
IF 'case' in product.name AND product.category == 'Mobile Phones > Phones' THEN product.category = 'Mobile Phones > Cases'
Once this is done, we document the example in Google Sheets in a defined format. For this specific issue, we'd create a sheet with 3 columns: name_word, original_category and new_category.
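To make the sheet format concrete, here is a minimal sketch of how rows with those three columns could be applied to a product. The `Product` class, the `RULES` list and the `apply_rules` function are hypothetical names made up for illustration, not our internal library, and the case-insensitive substring match is an assumption about how `name_word` is compared.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Product:
    name: str
    category: str


# Each rule mirrors one sheet row: name_word, original_category, new_category.
RULES = [
    {
        "name_word": "case",
        "original_category": "Mobile Phones > Phones",
        "new_category": "Mobile Phones > Cases",
    },
]


def apply_rules(product: Product, rules: list[dict]) -> Product:
    """Return the product with its category rewritten by the first matching rule."""
    name = product.name.lower()
    for rule in rules:
        # Assumption: name_word is matched as a case-insensitive substring.
        if (
            rule["name_word"].lower() in name
            and product.category == rule["original_category"]
        ):
            return Product(product.name, rule["new_category"])
    # No rule matched, so the product is returned unchanged.
    return product
```

Keeping each row as plain data means the QA team can add or remove rules on the sheet without anyone touching the matching code.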
2. Issue Type Detection
One-off issues can be symptomatic of a larger malaise. Armed with an example of a specific issue, we often kick off a search to detect further such issues.
The simplest approach is usually to randomly sample a set of products from the infected pocket of the dataset, and send the sample for human verification. Our external annotation agency in India is quite nimble, so a phone call or an email is all it takes. Where possible, we use more sophisticated sampling or detection methods. We also use Google CloudPrep where possible to help draw the QA lead's attention to outliers in related pockets of data.
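The random sampling step can be sketched roughly as follows. The function name, the dict-based product records and the fixed seed are assumptions made for illustration; the predicate that defines the "pocket" would depend on the issue at hand.

```python
import random


def sample_pocket(products, in_pocket, sample_size, seed=0):
    """Draw a reproducible random sample from the products matching `in_pocket`."""
    pocket = [p for p in products if in_pocket(p)]
    # A fixed seed lets the same sample be regenerated for a later audit.
    rng = random.Random(seed)
    # Never ask for more items than the pocket contains.
    return rng.sample(pocket, min(sample_size, len(pocket)))


# Hypothetical data: the pocket is everything categorized as a phone.
products = [
    {"name": f"Item {i}", "category": "Mobile Phones > Phones" if i % 2 else "Laptops"}
    for i in range(100)
]
sample = sample_pocket(
    products,
    in_pocket=lambda p: p["category"] == "Mobile Phones > Phones",
    sample_size=10,
)
```

The resulting sample is what would be handed over for human verification.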
For new issues that are surfaced, we follow the same process of generating logical expressions.
3. Heuristic Testing and Deployment
Once the Google Sheet is populated with all possible expressions that the QA team can hand-curate, we use an internal library to programmatically pull the contents of the sheet and apply the heuristics generated to a wider dataset. This resultant dataset is sent for human verification to determine whether the heuristics extrapolate well enough.
The heuristics that test well are retained, and packaged into a post-processing layer that is integrated into the production data pipeline.
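One way to decide which heuristics "test well" is to compute the human-verified precision of each one and keep those that clear a bar. The sketch below assumes the annotation results arrive as (heuristic_id, is_correct) pairs; the function name and the thresholds of 0.95 and 20 samples are illustrative assumptions, not figures from our pipeline.

```python
from collections import defaultdict


def retained_heuristics(verdicts, min_precision=0.95, min_samples=20):
    """Return ids of heuristics whose human-verified precision clears the bar.

    `verdicts` is an iterable of (heuristic_id, is_correct) pairs, one per
    annotated product that the heuristic changed.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for heuristic_id, is_correct in verdicts:
        total[heuristic_id] += 1
        correct[heuristic_id] += int(is_correct)
    # Heuristics with too few annotated samples are held back, not retained.
    return sorted(
        h
        for h in total
        if total[h] >= min_samples and correct[h] / total[h] >= min_precision
    )
```

Requiring a minimum sample count keeps a heuristic from being retained on the strength of a handful of lucky annotations.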
The intuition here is that problems needn't always be solved the complicated way - if a machine learning model underperforms in a very noticeable way, the most potent solution is often to simply overlay a human expressible fix. It's surprising how often the simplest approaches work well.
4. Involving Data-Science
Concurrently, the data-science team is notified of the use of these heuristics. Efforts to tweak models or integrate representative examples into training datasets are spun off. With enough time and examples, the intuition captured in the heuristics makes its way to the core model itself.
Once the dust has settled on the immediate issue, the issue owner steps back and decides if the issue is tractable and can be resolved as a one-off, or is sufficiently problematic to require constant monitoring. If it's the latter, the issue owner builds a process flowchart on Confluence (our internal documentation tool) to document the sequence of steps required for issue detection and resolution.
Long lasting issues may also require that an altogether different approach be adopted for generating the dataset under consideration. For example, we may choose to tweak the kind of model we use for categorization, or drop a certain NER module that doesn't perform well for a specific domain. Where required, customized pipelines are crafted, documented on Confluence and converted into workflow schemas on JIRA.
5. Automation Packaging
The cycle of generating sample datasets for issue detection, ingesting results of human annotation, syncing heuristics from Google Sheets, packaging the heuristics into a deployable Python library, applying customized post-processing steps to the target dataset, and auto-creating JIRA tickets for human review at regular intervals is programmatically automated via Airflow.
6. Reporting on Metabase
Metrics generated during the course of the process are pushed to a Postgres table, and exposed via graphs on Metabase. If required, we configure alerts for worrying thresholds.
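The kind of check that sits behind a threshold alert can be sketched as below. In-memory SQLite stands in for Postgres so that the example runs on its own, and the `qa_metrics` table, its columns and the threshold values are all hypothetical. In practice the alert itself would be configured in Metabase on top of a similar query.

```python
import sqlite3


def breached_metrics(conn, thresholds):
    """Return (dataset, metric, value) rows whose latest value is below its threshold."""
    breaches = []
    for (dataset, metric), minimum in thresholds.items():
        # Only the most recent run of each metric is compared to its threshold.
        row = conn.execute(
            "SELECT value FROM qa_metrics "
            "WHERE dataset = ? AND metric = ? "
            "ORDER BY run_id DESC LIMIT 1",
            (dataset, metric),
        ).fetchone()
        if row is not None and row[0] < minimum:
            breaches.append((dataset, metric, row[0]))
    return breaches


# Assumption: SQLite is used here only as a stand-in for the Postgres table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE qa_metrics (run_id INTEGER, dataset TEXT, metric TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO qa_metrics VALUES (?, ?, ?, ?)",
    [
        (1, "categories", "precision", 0.97),
        (2, "categories", "precision", 0.91),
        (2, "categories", "recall", 0.88),
    ],
)
alerts = breached_metrics(
    conn,
    {("categories", "precision"): 0.95, ("categories", "recall"): 0.85},
)
```

Here only the precision metric is flagged, because its latest run dropped below the assumed threshold while recall stayed above its own.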
Our QA processes rely heavily on a very specific set of tools:
We <3 Google Sheets at Semantics3.
It empowers non-technical team-members, since the interface is intuitive, and comes baked with powerful formulae. It's perfect for annotation work: we can never get enough of the =image(CELL) function, which renders image URLs directly on Google Sheets.
Developers love it too – programmatically reading and writing to Google Sheets is a matter of a few lines of code.
Process is probably the word most commonly heard at the office. And our go-to tool for documenting processes is Confluence. All of our collective knowledge, from broad learnings to niche perspectives, lives on Confluence. Over the years, we've seen significant business growth with little need for an increase in our team size; process allows us to optimize our days towards high-value activities, and to cut out redundant conversations and repetitive work.
Airflow works well for programmatic workflows that don't require intricate human involvement. Google Sheets works well for simple annotation tasks, but not so much when the annotator seeks more context than is available on the sheet.
Jatayu helps us bridge this gap. It's an in-house tool that empowers non-technical users to better interact with their datasets. With a few clicks, it allows users to trigger parallel processing jobs, submit heuristics to a pipeline, generate Google Sheets populated with results of a sampling exercise, or run search queries to build more context for an annotation decision.
Metabase is another tool that empowers team members who don't write code. Results of QA efforts are pushed into Postgres tables by programmers, often with little processing overhead, after which issue managers write SQL queries to generate metrics and graphs. Key metrics/graphs graduate to dashboards, which provide holistic views that help team-members quickly get an overview of the health of underlying datasets.
The easiest thing to do when you go hunting for issues and their causes is to load your dataset into CloudPrep and play around with the graphs/stats that turn up. For numerical fields in particular, CloudPrep often comes in very handy in highlighting outliers.
For the longest time, our immediate reaction to data quality issues was to assign tickets to the engineer in charge of the problematic module, and wait for an immediate turnaround. This was a recipe for frustration and disaster since it would inevitably trigger friction between our customer operations team (which is answerable to customers), and our data-science team (which would protest that machine learning models and crawlers can't just be instructed to solve specific micro examples). The systematic approaches described above give us the tools needed to resolve stakeholders' issues, and pre-empt the occurrence of further issues of the same type.
Done the right way, statistical and algorithmic approaches, combined with human-in-the-loop processes, allow us to tame the multi-headed beast that is data quality for evolving datasets.