Here's a standard storyline I've seen played out across organizations many times over:

  1. A problem worth applying machine learning to is identified
  2. An investment in building quality datasets is made
  3. The data science team trains state-of-the-art models on the dataset
  4. Metrics are impressive in pockets, but, overall, not good enough for production
  5. Data issues are identified, more annotation is commissioned, and investment in the project continues
  6. Skepticism among executives builds; eventually, budgets run out, and the project is shelved

Lots of wasted resources and, more critically, lost opportunities. Why does this happen, even when really smart people are involved?

Approach

The issue, I think, lies in approach. Most of the content that we practitioners read in the public domain is about innovative improvements to models. So naturally, we assume that the path to progress lies primarily in bettering our own models. This excessive focus on tweaking and tuning models shortchanges innovation at the dataset curation and operational levels of the project. It doesn't help that the ignored elements are the less glamorous aspects of machine learning that most data scientists try to, dare I say, avoid.

To data science teams that find themselves in this situation, I would like to suggest two techniques that are, in my observation, underutilized, and that have helped us in our journey at Semantics3.

Weak Supervision

Building representative datasets manually is costly and time-consuming. So aim to reduce the amount of hand annotation that your project needs through weak supervision.

Weak supervision is the practice of training your model on datasets that have been generated programmatically, through heuristics, statistical techniques and the like, rather than by hand. The idea is that by applying simple techniques like pattern matching, you can generate your initial supervised training dataset without annotators. By training your model on a large albeit somewhat inaccurate dataset, you can effectively tune your randomly initialized weights (or pre-trained weights) to converge towards your ideal model weights.
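As a rough illustration of programmatic labelling, here is a minimal sketch for a made-up task: deciding whether a product title is "electronics". The rules, label names and helper functions are all hypothetical and are not a description of any particular tool. Each rule either votes for a label or abstains, and a simple majority vote turns the votes into a noisy training label.

```python
import re
from collections import Counter

ABSTAIN = None  # a rule returns this when it has no opinion

# Hypothetical heuristics for an illustrative task; real rules would be
# written against your own data.
def lf_mentions_storage(title):
    return "electronics" if re.search(r"\b\d+\s?(gb|tb)\b", title, re.I) else ABSTAIN

def lf_mentions_apparel_size(title):
    return "other" if re.search(r"\bsize\s+(xs|s|m|l|xl)\b", title, re.I) else ABSTAIN

def lf_known_brand(title):
    return "electronics" if re.search(r"\b(sony|samsung)\b", title, re.I) else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_storage, lf_mentions_apparel_size, lf_known_brand]

def weak_label(title):
    """Majority vote over the rules that fire; abstain if none fire or on a tie."""
    votes = Counter()
    for lf in LABELING_FUNCTIONS:
        vote = lf(title)
        if vote is not ABSTAIN:
            votes[vote] += 1
    if not votes:
        return ABSTAIN
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting rules, no clear winner
    return ranked[0][0]

def build_weak_dataset(titles):
    """Keep only the records that received a label."""
    dataset = []
    for title in titles:
        label = weak_label(title)
        if label is not ABSTAIN:
            dataset.append((title, label))
    return dataset
```

Records on which the rules abstain or disagree are simply left out here; they are natural candidates for manual annotation later.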

Once you've gone as far as you can with this approach, you can bring in manually annotated datasets to take you the last mile. The benefit is that the volume of expensively curated data that you will need will be an order of magnitude smaller than what you would otherwise have required.

Since codifying a pattern is almost always cheaper than annotating 1,000 instances of the pattern, this can be your go-to option each time you identify an edge case or a pocket of the dataset that your model underperforms on.
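One way to find such pockets, sketched under the assumption that you have a held-out evaluation set and some way of grouping records (the `slice_fn` argument and the 0.8 threshold are hypothetical choices), is to compute accuracy per slice and list the slices that fall below the threshold:

```python
from collections import defaultdict

def accuracy_by_slice(examples, slice_fn):
    """examples: iterable of (record, true_label, predicted_label) tuples.
    slice_fn maps a record to the name of the pocket it belongs to."""
    hits, totals = defaultdict(int), defaultdict(int)
    for record, truth, predicted in examples:
        key = slice_fn(record)
        totals[key] += 1
        hits[key] += int(truth == predicted)
    return {key: hits[key] / totals[key] for key in totals}

def weak_pockets(examples, slice_fn, threshold=0.8):
    """Names of the slices whose accuracy falls below the threshold."""
    scores = accuracy_by_slice(examples, slice_fn)
    return sorted(key for key, acc in scores.items() if acc < threshold)
```

Each slice this returns is a candidate for a new rule rather than a new round of annotation.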

At Semantics3, we've built tools to cater specifically to this approach, which is one of the reasons why we've been able to keep our costs and turn-around times to a minimum. We've also invested in approaches to synthetically amplify records that do undergo manual annotation, a topic for future discussion.

Active Learning

While weak supervision gives us the first round of 10x savings, active learning gives us the next. Active learning is a semi-supervised approach to machine learning in which the model itself helps decide which records are worth sending for human-in-the-loop input.

Here's the traditional approach to dataset annotation:

Generate an Excel file -> Send it to a labelling team -> Wait for a bit -> Receive an annotated file -> Train your model on the data

This, as my teenage cousin would say, is like so 2015! The way to do this now is to build systems that integrate your model, your dataset and your annotators into a tight feedback loop. Here's how it should work:

  1. Your model selects a set of dataset records to be sent for annotation
  2. An annotator completes the task and sends the results back
  3. Your model trains on this data and updates its weights
  4. The model then selects the next set of records, prioritizing those it performs worst on under the updated weights while sampling across clusters for diversity
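The selection in step 4 could be sketched as follows. This is one simple reading of it, not a prescribed method: it assumes a binary classifier, uses closeness of the predicted probability to 0.5 as a stand-in for "worst performing", and assumes cluster ids have already been computed elsewhere. The function names and the tuple layout are hypothetical.

```python
from collections import defaultdict
from itertools import zip_longest

def uncertainty(prob_positive):
    """1.0 when the model is maximally unsure (p = 0.5), 0.0 when certain."""
    return 1.0 - abs(prob_positive - 0.5) * 2.0

def select_batch(records, batch_size):
    """records: iterable of (record_id, prob_positive, cluster_id) tuples.
    Returns record ids, most uncertain first within each cluster, taken
    round-robin across clusters so that no single cluster dominates."""
    by_cluster = defaultdict(list)
    for record_id, prob, cluster in records:
        by_cluster[cluster].append((uncertainty(prob), record_id))
    # Within each cluster, put the most uncertain records first.
    queues = [sorted(items, reverse=True) for items in by_cluster.values()]
    # Visit clusters in order of their single most uncertain record.
    queues.sort(key=lambda q: q[0], reverse=True)
    picked = []
    for round_items in zip_longest(*queues):
        for item in round_items:
            if item is not None and len(picked) < batch_size:
                picked.append(item[1])
    return picked
```

Note the trade-off in the round-robin: a fairly confident record can be picked ahead of a more uncertain one simply because it comes from a cluster that is not yet represented in the batch.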

By actively involving the model in the loop, you can focus annotation resources on the most representative records, and ensure that your gold training dataset covers a wide spectrum of patterns for your model to learn from.

A crucial additional benefit of active learning: if you have a target metric to achieve for a successful deployment, you can use this approach to estimate the resources needed to get to that target. If active learning is done right, each additional annotation batch will, over time, yield diminishing improvements in precision and recall. By tracking the marginal utility of each batch, you can make an informed decision on whether your efforts are likely to come to fruition.
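To make the marginal-utility idea concrete, here is a deliberately naive sketch. It assumes you record one metric value after each annotation batch, and it projects forward by assuming every future gain shrinks by the same ratio as the last two observed gains. That geometric-decay assumption is mine, for illustration only; the function names are hypothetical too.

```python
def marginal_gains(metric_history):
    """Improvement in the metric contributed by each successive batch."""
    return [b - a for a, b in zip(metric_history, metric_history[1:])]

def batches_to_target(metric_history, target, max_batches=1000):
    """Number of extra batches projected to reach the target, or None if the
    projection never gets there (or there is too little signal to project)."""
    if metric_history and metric_history[-1] >= target:
        return 0
    gains = marginal_gains(metric_history)
    if len(gains) < 2 or gains[-2] <= 0 or gains[-1] <= 0:
        return None
    # Assumption: future gains decay geometrically at the latest observed ratio.
    ratio = min(gains[-1] / gains[-2], 1.0)
    current, gain = metric_history[-1], gains[-1]
    for extra in range(1, max_batches + 1):
        gain *= ratio
        current += gain
        if current >= target:
            return extra
    return None
```

For example, with a history of 0.70, 0.80 and 0.85, the gains halve each batch, so the projection levels off just below 0.90: a target of 0.88 looks reachable in a couple more batches, while a target of 0.95 does not look reachable through annotation alone.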

These techniques can help make a huge difference to not only the success, but also the cost and timely deployment of data science projects. Whether you're a data scientist, a product manager, or a team lead, the next time you hit a wall with a data science project, do look to these techniques for a way forward!