Working on data-science problems can be both exhilarating and frustrating. Exhilarating because the occasional insight that boosts your algorithm’s performance can leave you with a lasting high. Frustrating, because you’ll often find yourself at the dead-end of a one-way street, wondering what went wrong.
In this article, I’d like to recount five key lessons that I’ve learned after one too many walks down dead-end alleys. I’ve framed these as five questions that I’ve learned to ask myself before taking on new problems or approaches:
- Question #1: Never mind a neural network; can a human with no prior knowledge, educated on nothing but a diet of your training dataset, solve the problem?
- Question #2: Is your network looking at your data through the right lens?
- Question #3: Is your network learning the quirks in your training dataset, or is it learning to solve the problem at hand?
- Question #4: Does your network have siblings that can give it a leg-up (through pre-trained weights)?
- Question #5: Is your network incapable or just lazy? If it’s the latter, how do you force it to learn?
Q1: Never mind a neural network; can a human with no prior knowledge, educated on nothing but a diet of your training dataset, solve the problem?
This question is particularly relevant for supervised training problems. The typical premise underlying such problems is that a small high-quality dataset (say N entities) can help your model approximate an underlying function, which can generalize to your entire dataset (1000N entities). The allure of these approaches, of course, is that humans do the hard work on a small amount of data, and machines learn to replicate the work for a wider range of examples.
In the real world though, problems don’t always have an underlying pattern that can be identified. Humans draw on external general knowledge to solve cognitive challenges more often than we realize, which often leads us to falsely expect our algorithms to solve the same challenges without the benefit of the general knowledge that we possess.
Here’s an example. Consider the following names:
1. “Pets First Arkansas Dog Jersey, X-Small, Pink”
2. “Pets First Arizona Dog Jersey, X-Small, Pink”
3. “Pets First AR Dog Jersey, X-Small, Pink”
Two of the three names represent the same product. Can you find the odd one out?
Most Americans wouldn’t have a problem with this, because the fact that AR=Arkansas and AR!=Arizona is common knowledge. Some might not even notice the nuance in the question. Find someone who isn’t familiar with the US though, and they’ll probably get their answers mixed up.
Your freshly minted neural network has little chance of solving a problem like this, unless it’s seen a specific instance of Arkansas equating to AR. There’s no underlying rule here to mimic, since the mapping of abbreviation to state name is linguistically arbitrary.
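One way to hand the network this missing knowledge is to inject it before training, for instance with a lookup table. The sketch below is only an illustration: the `STATE_ABBREVIATIONS` table and the `expand_state_abbreviations` helper are hypothetical names, and the table holds just the two states from the example.

```python
import re

# Illustrative subset only; a real table would cover every state.
STATE_ABBREVIATIONS = {
    "AR": "Arkansas",
    "AZ": "Arizona",
}

def expand_state_abbreviations(name: str) -> str:
    """Replace upper-case two-letter state codes with the full state name."""
    def replace(match: re.Match) -> str:
        token = match.group(0)
        # Unknown two-letter tokens (e.g. "XL") are left untouched.
        return STATE_ABBREVIATIONS.get(token, token)
    return re.sub(r"\b[A-Z]{2}\b", replace, name)

names = [
    "Pets First Arkansas Dog Jersey, X-Small, Pink",
    "Pets First Arizona Dog Jersey, X-Small, Pink",
    "Pets First AR Dog Jersey, X-Small, Pink",
]
normalized = [expand_state_abbreviations(n) for n in names]

# After normalization, names 1 and 3 are identical strings; name 2 is not.
same_product = normalized[0] == normalized[2]
odd_one_out = normalized[1] != normalized[0]
```

The point is not the regular expression but where the knowledge comes from: the table is supplied by a human, because nothing in the training data lets a model derive it.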
Problems like these (some difficult to even grasp in the first place) arise all the time when dealing with real-world challenges. It can be incredibly hard to look past your own conditioning and identify what additional knowledge your network requires to think the way you do. This is why an active effort to step outside of your own conditioned mental frameworks can be quite useful.
Q2: Is your network looking at your data through the right lens?
Assume that state abbreviations are always a function of the first two letters of the name of the state. For example, Arkansas=AR, Massachusetts=MA and so on. In this world, we’ll assume that name clashes don’t occur (sorry Arizona, for this example, we’re going to have to disregard your existence).
Now let’s revisit the product matching problem with a different set of examples:
1. “Pets First Arkansas Dog Jersey, X-Small”
2. “Pets First Arkansas Dog Jersey, Extra-Small”
3. “Pets First AR Dog Jersey, X-Small”
4. “Pets First Arkansas Dog Jersey, Large”
5. “Pets First MA Dog Jersey, Large”
Your goal is to build a network that identifies that candidates 1, 2 and 3 refer to the same product, but candidates 4 and 5 represent different products. This task internally involves understanding equivalence of sizes (X-Small=Extra-Small, but X-Small!=Large) and learning the concept of abbreviations (Arkansas=ARkansas=AR because the first two letters match, but Arkansas!=MA), among other things.
In solving this problem, you may be tempted to use a Word2Vec approach, and build an embedding space that maps X-Small to Extra-Small. Moreover, you might choose to lower-case your inputs because Extra-Small is, after all, equivalent to extra-small.
In applying these fairly standard techniques though, you may obstruct your network’s view of the nuance that you want it to learn. If “AR” is lowercased to “ar”, it’s difficult even for a human being to tell whether the token in question is an abbreviation (is “ar” short for Arkansas, or is it a typo of the word “are”?). Likewise, if you choose to build a word-based embedding space, which effectively involves mapping each word to a unique token number, then you give your network no way to see the constituent characters in “ARkansas”.
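The difference between the two lenses can be made concrete with a small sketch. The helpers `word_ids`, `char_ids` and `looks_like_abbreviation` are hypothetical names invented for this illustration, and the whitespace tokenizer is a deliberately crude stand-in for a real preprocessing pipeline.

```python
def word_ids(name, vocab):
    """Lower-case, split on whitespace, and map each word to an integer id."""
    return [vocab.setdefault(word, len(vocab)) for word in name.lower().split()]

def char_ids(text):
    """Map each character to its Unicode code point, preserving case."""
    return [ord(ch) for ch in text]

def looks_like_abbreviation(token):
    """A crude case-based cue: exactly two upper-case letters."""
    return len(token) == 2 and token.isupper()

vocab = {}
full = word_ids("Pets First Arkansas Dog Jersey", vocab)  # [0, 1, 2, 3, 4]
abbr = word_ids("Pets First AR Dog Jersey", vocab)        # [0, 1, 5, 3, 4]

# Word-level lens: "arkansas" is id 2 and "ar" is id 5.
# Nothing about the integers 2 and 5 says that one abbreviates the other.

# Character-level lens: the abbreviation is visibly a prefix of the state name.
shares_prefix = char_ids("ARKANSAS")[:2] == char_ids("AR")

# Case-preserving lens: the cue survives only if the input is not lower-cased.
cue_before = looks_like_abbreviation("AR")
cue_after = looks_like_abbreviation("AR".lower())
```

Here `shares_prefix` and `cue_before` are true while `cue_after` is false, which is the information a lower-cased, word-level pipeline discards before the network ever sees the input.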
This issue of obscuring your network’s view of the nuances that you want it to learn turns up surprisingly often, rearing its head especially when building models that take into account different types of input signals.
Q3: Is your network learning the quirks in your training dataset, or is it learning to solve the problem at hand?
Let’s say you’re building a binary image classifier that has to detect whether a given document contains computer-printed text or hand-written text. To build a training dataset of computer-printed text, you’ve generated JPEGs of text using a software package on your laptop. For hand-written text samples, you’ve sent the same JPEGs to an annotation firm, and asked them to copy the text out by hand and send you scanned JPEGs of the hand-written pages.
Next up, you run your classifier, and hoorah, it’s converged to 99%+ training and validation accuracy within a few epochs. When tested in real world scenarios though, the classifier performs miserably. Why?
Your network could’ve picked up on simple indicative biases — the scanned hand-written JPEGs could’ve had a slightly off-white background color, whereas the software generated computer-printed JPEGs could’ve had a pure white background. To solve the task at hand, your network might’ve picked up on this subtle hint, without giving pause to the content, context, shape or color of the text itself.
It’s important to remember that a neural network has no programmed understanding of what your larger goal is. All it knows is the data at hand and the target, and given a quick path to “cheat” its way to an easy answer, it will tend to take it.
Thoroughly vetting datasets for such quirks can help save on cost and time.
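One cheap way to vet a dataset is to check whether a trivial statistic that ignores the content entirely can already separate the classes. The sketch below uses synthetic stand-in data: `fake_image`, the pixel values and the off-white range are all assumptions made up to mimic the background-color quirk described above, not measurements from a real dataset.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def fake_image(background, n_pixels=400, n_ink=40):
    """Flat list of grayscale pixels: a uniform background with some dark 'ink'."""
    pixels = [background] * n_pixels
    for i in random.sample(range(n_pixels), n_ink):
        pixels[i] = random.randint(0, 60)
    return pixels

# Assumed quirk: generated images have a pure white background (255),
# scanned hand-written images have a slightly off-white one (240 to 250).
dataset = [(fake_image(255), "printed") for _ in range(50)]
dataset += [(fake_image(random.randint(240, 250)), "handwritten") for _ in range(50)]

def shortcut_classifier(image):
    """Ignore the text entirely; look only at the brightest pixel."""
    return "printed" if max(image) == 255 else "handwritten"

correct = sum(shortcut_classifier(img) == label for img, label in dataset)
accuracy = correct / len(dataset)
```

By construction this one-line rule scores perfectly on the synthetic data. If a rule this simple scores similarly on your real training set, a 99%+ validation accuracy from the network says little about whether it has learned anything about the text itself.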
Q4: Does your network have siblings that can give it a leg-up (through pre-trained weights)?
In domain-specific problems, pre-trained models such as GloVe and Inception may fall short. This forces you to start with bare, randomly initialized neural networks, which means that your experiments might not show their true colors until after days of training. Add to this the potential issue that your training dataset is of low quality, or remains small even after synthetic amplification (rotation, shifting, etc.), and you have a problem on your hands.
In such cases, it helps to look to sister problems for a helping hand. These sister problems should meet two criteria:
- They shouldn’t suffer from the same issues of quality or quantity that your dataset at hand does.
- Their networks should have a set of layers that capture the same concepts that your network requires.
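The second criterion amounts to copying weights for the layers the two networks share and leaving the rest randomly initialized. The following is a framework-agnostic sketch under stated assumptions: weights are plain nested lists keyed by `"layer.parameter"` names, and `transfer_weights`, `shape` and the layer names are hypothetical. A real project would use its deep learning framework’s own checkpoint-loading facilities instead.

```python
import copy

def shape(tensor):
    """Shape of a nested-list 'tensor', e.g. [[0, 0], [0, 0]] -> (2, 2)."""
    dims = []
    while isinstance(tensor, list) and tensor:
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)

def transfer_weights(sibling, target, shared_layers):
    """Copy sibling parameters into target for the named shared layers.

    A parameter is copied only if the target has one with the same name
    and shape. Returns the names of the parameters that were copied.
    """
    copied = []
    for name, weights in sibling.items():
        layer = name.split(".")[0]
        if (layer in shared_layers and name in target
                and shape(target[name]) == shape(weights)):
            target[name] = copy.deepcopy(weights)
            copied.append(name)
    return copied

# Weights from a sister problem that had plenty of clean data (made-up values).
sibling = {
    "char_encoder.weight": [[0.2, -0.1], [0.4, 0.3]],
    "char_encoder.bias": [0.0, 0.1],
    "head.weight": [[0.5, 0.5]],
}
# The new network: same encoder, different task-specific head.
target = {
    "char_encoder.weight": [[0.0, 0.0], [0.0, 0.0]],
    "char_encoder.bias": [0.0, 0.0],
    "head.weight": [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]],
}

copied = transfer_weights(sibling, target, {"char_encoder"})
```

Only the two `char_encoder` parameters are copied; the head keeps its own initialization because it has to learn the new task.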
Q5: Is your network incapable or just lazy? If it’s the latter, how do you force it to learn?
Let’s say you, a layperson, were tasked with the complicated challenge of finding the right price of a group of expensive paintings. You’re given three signals — the age of the painting, the price of the painting from ten years ago and a high-resolution image of the painting. Without any prior training, you’re asked to take a stab at the problem and minimize your errors. What would you do?
Would you sign-up for a two-month course on identifying the intricacies of paintings, or would you come up with a quick-fire equation to approximate the price using just the age and previous price of the painting? Even if you recognize that the ideal price is a combination of all three signals, you might very well resort to the lazy option that leads to sub-optimal but acceptable answers — relying only on the signals that are easily understood and expressed as succinct numbers.
In real-world problems, this issue occurs when you work with models that have multiple inputs of different complexities (e.g. a regression model which accepts single-byte numerical inputs alongside colored images with multiple bytes per pixel). Try to train such models and within a few epochs, you may find your model nestled at a local optimum, refusing to learn further.
In such cases, you may find that the best approach is to systematically drop some inputs to the network and see how the overall metrics change. If a particular signal fails to have any effect on the overall goal, despite a hunch that it should, then you should consider training the model separately on that signal alone. After the model has learnt to take the signal into account, then feed in the rest of the signals incrementally.
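The first step, dropping inputs and watching the metric, can be sketched as a simple ablation loop. Everything here is illustrative: “dropping” an input is implemented as replacing it with its dataset mean, which is one choice among several; `image_score` is a hypothetical number standing in for whatever the image branch of the model produces; and the rows, targets and `lazy_model` are made up.

```python
from statistics import mean

def mean_abs_error(predict, rows, targets):
    """Mean absolute error of predict over the dataset."""
    return mean(abs(predict(row) - t) for row, t in zip(rows, targets))

def ablation_report(predict, rows, targets):
    """Replace one input at a time with its mean; report the change in error."""
    baseline = mean_abs_error(predict, rows, targets)
    report = {}
    for feature in rows[0]:
        fill = mean(row[feature] for row in rows)
        ablated = [{**row, feature: fill} for row in rows]
        report[feature] = mean_abs_error(predict, ablated, targets) - baseline
    return report

# Made-up paintings and prices, for illustration only.
rows = [
    {"age": 10, "previous_price": 1000, "image_score": 0.2},
    {"age": 50, "previous_price": 5000, "image_score": 0.9},
    {"age": 120, "previous_price": 20000, "image_score": 0.5},
]
targets = [1450, 7150, 25100]

def lazy_model(row):
    """A stand-in for a trained model that never learned to use the image."""
    return 1.2 * row["previous_price"] + 5 * row["age"]

report = ablation_report(lazy_model, rows, targets)
```

In this toy setup the error gets worse when `age` or `previous_price` is ablated, but does not move at all when `image_score` is. A zero change for a signal you believe matters is the cue to train on that signal alone before reintroducing the others.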
For the example laid out above, here’s the analog. Assume that you’re forced to price the painting based solely on its image for a few months. With no easy way out, you’d have no choice but to put in the sort of effort that you otherwise might have avoided. Eventually, when you’re given access to the signals that were kept from you, you’d be far better equipped than you would otherwise have been.
In my experience, the right intuition can help avoid months of wasted effort. While the problems described above are context-specific, I hope that the lessons they contain are transferable to a wider range of data science problems. If you relate to any of these problems or have suggestions, do let me know.
This article was originally posted on our Engineering blog.
At Semantics3, we’re working on pioneering ecommerce problems such as product matching, unsupervised extraction, categorization and feature extraction from unstructured text. If you’d like to join us, drop us a note. To get access to our AI APIs for Categorization, Matching or Feature Standardization & Normalization, get in touch.