The Annotator’s Dilemma: When Humans Teach Machines to Fail

What do a machine learning model trained via supervised learning and a lion raised in captivity have in common? … They’re both likely to die in the wild!

Now that might sound like a joke aimed at getting PETA to boycott my blog, but this is no laughing matter. Captive-bred lions are more likely to die in the wild, and so are machine learning models trained with human-annotated data.

According to a 2008 National Geographic article, captive-bred predators often die in the wild because they never learn the natural behaviors necessary for success. Their human captors either never teach the animals the necessary survival skills (e.g., hunting) or inadvertently teach behaviors that are detrimental to their survival (e.g., no fear of humans). It’s not for lack of trying, but for a variety of reasons it is impractical or impossible to expose a captive predator to an environment that completely mirrors its ultimate home.

Why Humans Teach Machines to Fail

Not surprisingly, when humans teach machines, the machines fail for many of the same reasons. In both supervised and semi-supervised machine learning, we train a model using human-annotated data. Unfortunately, human cognitive and sensory capabilities can introduce a variety of consequences that often lead us to teach machines the wrong thing, or to fail to fully expose them to the environment they will find in the wild. While there are likely many areas that impact the quality of human annotations, I’d like to cover five that I believe have the greatest impact on success.

Missing Fundamental and the Transposed Letter Effect (Priming)

Have you ever had someone explain an interesting factoid that sticks with you for life? One such example in my life is the “Missing Fundamental” effect, first introduced to me by my freshman-year music theory professor. The question he posed was, “How are you able to hear the low A on a piano when the vast majority of audio equipment cannot reproduce the corresponding fundamental frequency?” It turns out the low A on a piano has a fundamental frequency of 27.5 Hz, and most run-of-the-mill consumer audio equipment is incapable of producing a frequency that low with any measurable gain. Yet we can hear the low A on a piano recording even with those crappy speakers. The reason is the “Missing Fundamental” effect: in essence, the human brain can infer the missing fundamental frequency from the upper harmonics.

A similar concept is the “Transposed Letter” effect. I’m sure you’ve seen the meme: those images with scrambled letters that tell you you’re a genius if you can read them. Your ability to read those sentences is due to the transposed-letter effect and is related to priming. Basically, even if the letters within the words printed on a page are jumbled, reading them can still activate the same regions of the brain as the original words.

You might be asking what any of this has to do with annotating data and teaching machines. The problem arises when you realize that we humans can correctly identify something even when the data needed to do so is not actually all present. If we annotate data this way, we are assuming the machine has the same capabilities, and that might not be so. Teaching a machine that “can you raed this” and “can you read this” are the same thing may have unintended consequences.

Selection Bias

If you gave me millions of images and asked me to find and label all the images with tomatoes, I would probably quickly scan for anything red and circular. Why? Because that’s my initial vision of a tomato, and scanning the images that way would likely speed up the process of going through them all. But that is me imparting my bias of what a tomato looks like onto the annotated data set. And it’s exactly how a machine never learns that the Green Zebra and Cherokee Purple varieties are indeed tomatoes!

Multisensory Integration

Humans often make use of multiple senses simultaneously to improve our perception of the world. For example, it has been well documented that speech perception, especially in noisy environments, is greatly enhanced when the listener can leverage both visual and auditory cues. However, the vast majority of commercial machine learning models are single modality (reading text, scanning images, or scanning audio). So, if my ability to understand a noisy speech signal is only possible thanks to a corresponding video, it may be dangerous to try to teach a machine what was said, since the machine likely does not have access to the same raw data.

Response Bias

I am ashamed to admit this, but every time I get an election ballot, I feel an almost compulsive need to select a candidate for every position, even when I have little or no knowledge of the office the candidates are running for, their backgrounds, or their policy positions. Usually, I arbitrarily select a candidate based on party affiliation or what college they went to, which is probably only slightly better than selecting the first name on the ballot. My need to select a candidate even though I have no basis for doing so is likely a form of response bias. The problem with response bias is that it generally leads to inaccurate data. If your annotators suffer from response bias, you are likely teaching the machine with inaccurate data.

Zoning Out

Have you ever driven somewhere only to arrive at your destination with no recollection of how you got there? If so, like me, you have experienced zoning out. With repetitive tasks, we tend to start with an implicit trade-off between speed and accuracy, but over time, as the task gets boring, we start to zone out or get distracted while maintaining the same speed, which ultimately leads to errors. Annotating data is a highly repetitive task and therefore has a high probability of generating these types of errors. And when we use error-ridden annotated data to teach our machines, we likely teach them the wrong thing.

How to be a Better Teacher

While the problems above might seem daunting, there are things we can do to help minimize the effects of human behavior on our ability to accurately teach machines.

Provide a Common Context

The missing fundamental and multisensory integration problems are both issues of context. In each case, either historical or current context allows us humans to discern something another species (a.k.a. the machine) may not be able to comprehend. The solution is to make sure humans teach the machine with a shared context, and the easiest way to do that is to limit the annotator to the same modality the machine will operate with. If the task at hand is to teach a machine to recognize speech from audio, then don’t give the annotator access to any associated video content. If the task is to identify sarcasm in written text, don’t provide the annotator with audio recordings of the text being spoken. This ensures the annotator teaches the machine with mutually accessible data.
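In tooling, one way to enforce that shared context is to strip out every field the model won’t see before a task ever reaches the annotator. Here’s a minimal sketch of the idea (the task structure and field names are hypothetical, not from any particular labeling tool):

```python
# Modalities the model will actually receive at inference time.
MODEL_MODALITIES = {"audio"}

def restrict_to_model_modalities(task: dict) -> dict:
    """Drop any payload fields (video, transcripts, metadata) the model
    won't have access to, so annotators label from the same evidence."""
    return {k: v for k, v in task.items() if k in MODEL_MODALITIES or k == "task_id"}

raw_task = {"task_id": 17, "audio": "clip_017.wav", "video": "clip_017.mp4"}
annotator_task = restrict_to_model_modalities(raw_task)
# -> {"task_id": 17, "audio": "clip_017.wav"}
```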

Beyond tooling, you can also train your annotators to interpret data from multiple perspectives, so that their previous experiences don’t cause brain activations the machine might not benefit from. For example, it is very easy to read text in your head with your own internal inflections, which might change the meaning. After all, the slightest change in inflection can turn a benign comment into a sarcastic insult. However, if you train annotators to step back and try to read the text with multiple inflections, you might avoid this problem.

Introduce Randomness

While it might be tempting to let annotators search around for items they think will help teach a machine, doing so can increase the likelihood of selection bias. There may be good reasons to allow searching to speed up data collection for certain classes, but it is also important to make sure a sizeable portion of your data is randomly selected. Set up different jobs for your annotators and ensure some proportion of your labeling effort comes from randomly selected examples, as in the sketch below.
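Here’s a rough sketch of what that mix might look like (the function, proportions, and data structures are hypothetical):

```python
import random

def build_labeling_job(searched_items, unlabeled_pool, job_size=1000,
                       random_fraction=0.5, seed=42):
    """Mix search-selected candidates with uniformly sampled examples.

    Keeping a sizeable random slice guards against the selection bias
    introduced when annotators only surface items that match their
    mental picture of the class.
    """
    rng = random.Random(seed)
    n_random = int(job_size * random_fraction)
    n_searched = job_size - n_random

    job = rng.sample(searched_items, min(n_searched, len(searched_items)))
    job += rng.sample(unlabeled_pool, min(n_random, len(unlabeled_pool)))
    rng.shuffle(job)  # interleave so annotators can't tell which is which
    return job
```

Keeping the random slice sizeable, and shuffling it in with the searched items, is what gives the model a chance to see the green and purple tomatoes nobody thought to search for.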

Reduce Cognitive Load

While we may not be able to prevent boredom and zoning out, we can reduce the complexity of our labeling tools. By reducing cognitive load, we are more likely to minimize mistakes when people do get distracted. Some ways to reduce cognitive load include limiting labeling tasks to single-step processes (i.e., only label one thing at a time) and providing clear, concise instructions that remove ambiguity.
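As an illustration of the single-step idea (the task format below is made up for the example), a compound request that asks for a label, a bounding box, and an occlusion flag on one screen can be decomposed into three one-question tasks:

```python
def split_into_single_step_tasks(item_id):
    """Turn one compound annotation request into single-question tasks.

    Each task asks exactly one thing, which keeps the per-decision
    cognitive load low even when annotators start to zone out.
    """
    return [
        {"item_id": item_id, "question": "Which object is shown?"},
        {"item_id": item_id, "question": "Draw a box around the object."},
        {"item_id": item_id, "question": "Is the object partially hidden?"},
    ]
```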

Be Unsure

Last but not least, allow people to be unsure. If you force people to put things into one of N buckets, they will. By giving people the option of being “unsure,” you minimize how often you’ll get inaccurate data due to people’s compulsion to provide an answer even when no correct answer is obvious.
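Concretely, that can be as simple as adding an explicit label to the schema and routing those items out of the training set. A hypothetical sketch:

```python
LABELS = ["tomato", "not_tomato", "unsure"]  # explicit escape hatch

def split_annotations(annotations):
    """Separate confident labels from 'unsure' ones.

    annotations: list of dicts like {"item_id": ..., "label": ...}
    Confident labels go to training; unsure items can be re-queued
    for adjudication instead of polluting the training set.
    """
    confident = [a for a in annotations if a["label"] != "unsure"]
    unsure = [a for a in annotations if a["label"] == "unsure"]
    return confident, unsure
```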

Final Thoughts

No teacher wants to see their students fail. So it’s important to remember, whether training a lion or a machine learning model, that different species likely learn in different ways. If we cater our teaching to our students, we just might find our machine learning models fat and happy long after we’ve sent them off into the wild.

*I’d like to thank Shane Walker and Phil Lipari, who inspired this post and have helped me successfully teach many machine learning models to survive in the wild.