The Annotator’s Dilemma: When Humans Teach Machines to Fail

What does a machine learning model trained via supervised learning and a lion raised in captivity have in common? … They’re both likely to die in the wild!

Now that might sound like a joke aimed at getting PETA to boycott my blog, but this is no laughing matter. Captive-bred lions are more likely to die in the wild, and so are machine learning models trained with human-annotated data.

According to a 2008 National Geographic article, captive-bred predators often die in the wild because they never learn the natural behaviors necessary for success. Their human captors either never teach the animals the necessary survival skills (e.g., hunting) or inadvertently teach behaviors that are detrimental to their survival (e.g., no fear of humans). It’s not for lack of trying, but for a variety of reasons it is impractical or impossible to expose a captive predator to an environment that completely mirrors its ultimate home.

Why Humans Teach Machines to Fail

Not surprisingly, when humans teach machines, they fail for many of the same reasons. In both supervised and semi-supervised machine learning, we train a model using human-annotated data. Unfortunately, the quirks of human cognition and perception often lead us to teach machines the wrong thing or fail to fully expose them to the environment they will find in the wild. While there are likely many factors that affect the quality of human annotations, I’d like to cover five that I believe have the greatest impact on success.

Missing Fundamental and the Transposed Letter Effect (Priming)

Have you ever had someone explain an interesting factoid that sticks with you for life? One such example in my life is the “Missing Fundamental” effect, first introduced to me by my freshman-year Music Theory professor. The question he posed was “how are you able to hear the low A on a piano when the vast majority of audio equipment cannot reproduce the corresponding fundamental frequency?” It turns out the low A on a piano has a fundamental frequency of 27.5 Hz, and most run-of-the-mill consumer audio equipment is incapable of producing a frequency that low with any measurable gain. Yet we can hear the low A on a piano recording even with those crappy speakers. The reason is the “Missing Fundamental” effect: in essence, the human brain can infer the missing fundamental frequency from the upper harmonics.

A similar concept is the “Transposed Letter” effect. I’m sure you’ve seen the meme: those images with scrambled letters that tell you you’re a genius if you can read them. Your ability to read those sentences is due to the transposed letter effect, which is related to priming. Basically, even if the letters in the words printed on the page are jumbled, reading them can still activate the same region of the brain as the original word.

You might be asking what any of this has to do with annotating data and teaching machines. The problem arises when you realize that we humans can correctly identify something even when not all of the data needed to do so is actually present. If we annotate data this way, we are assuming the machine has the same capabilities, and that might not be so. Teaching a machine that “can you raed this” and “can you read this” are the same thing may have unintended consequences.

Selection Bias

If you gave me millions of images and asked me to find and label all the images with tomatoes, I am probably going to quickly scan for anything red and circular. Why? Because that’s my initial vision of a tomato and scanning the images that way would likely speed up the process of going through them all. But that is me imparting my bias of what a tomato looks like into the annotated data set. And it’s exactly how a machine never learns that the Green Zebra and Cherokee Purple varieties are indeed tomatoes!

Multisensory Integration

Humans often use multiple senses simultaneously to improve our perception of the world. For example, it has been well documented that speech perception, especially in noisy environments, is greatly enhanced when the listener can leverage both visual and auditory cues. However, the vast majority of commercial machine learning models are single modality (reading text, scanning images, listening to audio). So, if I can only understand a noisy speech signal because of the corresponding video, it may be dangerous to try to teach a machine what was said, since the machine likely does not have access to the same raw data.

Response Bias

I am ashamed to admit this, but every time I get an election ballot, I feel an almost compulsive need to select a candidate for every position, even when I have little or no knowledge of the office the candidates are running for, the background of the competing candidates, or their policy positions. Usually, I arbitrarily select a candidate based on their party affiliation or what college they went to, which is probably only slightly better than selecting the first name on the ballot. My need to select a candidate even though I have no basis for doing so is likely a form of response bias. The problem with response bias is that it generally leads to inaccurate data. If your annotators suffer from response bias, you are likely teaching the machine with inaccurate data.

Zoning Out

Have you ever driven somewhere only to get to your destination and have no recollection of how you got there? If so, like me, you have experienced zoning out. With repetitive tasks we tend to start with an implicit speed versus accuracy trade-off in mind, but over time, as the task gets boring, we start to zone out or get distracted while maintaining the same speed, which ultimately leads to errors. Annotating data is a highly repetitive task and therefore has a high probability of generating these types of errors. And when we use error-ridden annotated data to teach our machines, we likely teach them the wrong thing.

How to be a Better Teacher

While the problems above might seem daunting, there are things we can do to minimize the effects of human behavior on our ability to accurately teach machines.

Provide a Common Context

The missing fundamental and multisensory integration problems are both issues with context. In each of these cases, either historical or current context allows us humans to discern something another species (a.k.a. the machine) may not be able to comprehend. The solution is to make sure humans teach the machine with a shared context, and the easiest way to do that is to limit the annotator to the same modality the machine will operate with. If the task at hand is to teach a machine to recognize speech from audio, then don’t provide the annotator access to any associated video content. If the task is to identify sarcasm in written text, don’t provide the annotator with audio recordings of the text being spoken. This will ensure the annotator teaches the machine with mutually accessible data.

Beyond tooling, you can also train your annotators to try to interpret data from multiple perspectives so that their previous experiences don’t cause brain activations that the machine might not benefit from. For example, it is very easy to read text in your head with your own internal inflections that might change the meaning. After all, the slightest change in inflection can turn a benign comment into a sarcastic insult. However, if you train annotators to step back and try to read the text with multiple inflections, you might avoid this problem.

Introduce Randomness

While it might be tempting to let users search around for items they think will help teach a machine, doing so can increase the likelihood of selection bias. There may be good reasons to allow users to search to speed up data collection for certain classes, but it is also important to make sure a sizeable portion of your data is randomly selected. Set up different jobs for your annotators and ensure some proportion of your labeling effort comes from randomly selected examples.
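As a rough illustration of mixing the two kinds of jobs, here is a minimal sketch that fills part of a labeling batch from annotator-searched items and the rest from a uniform random sample of everything else. The function name `build_labeling_batch`, the `random_fraction` parameter, and the 50/50 default are my own assumptions for illustration, not a recommendation for any particular split.

```python
import random


def build_labeling_batch(pool, searched, batch_size, random_fraction=0.5, seed=0):
    """Build a batch that mixes searched items with randomly sampled ones.

    pool            -- all unlabeled items (hypothetical: any hashable values)
    searched        -- items an annotator found by searching, in priority order
    random_fraction -- share of the batch drawn uniformly at random (an assumption)
    """
    rng = random.Random(seed)  # fixed seed so the batch is reproducible
    n_random = int(round(batch_size * random_fraction))
    n_searched = batch_size - n_random

    # Take the searched items first, then sample the rest from what is left.
    chosen = list(searched[:n_searched])
    already_chosen = set(chosen)
    remaining = [item for item in pool if item not in already_chosen]
    chosen += rng.sample(remaining, min(n_random, len(remaining)))
    return chosen


pool = list(range(100))
batch = build_labeling_batch(pool, searched=[1, 2, 3, 4, 5, 6], batch_size=10)
print(batch)
```

Tracking which items came from the random half also lets you evaluate the model on an unbiased slice later.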

Reduce Cognitive Load

While we may not be able to prevent boredom and zoning out, we can reduce complexity in our labeling tools. By reducing the cognitive load, we are more likely to minimize mistakes when people get distracted. Some ways to reduce cognitive load include limiting labeling tasks to single-step processes (i.e., only label one thing at a time) and providing clear and concise instructions that remove ambiguity.

Be Unsure

Last but not least, allow people to be unsure. If you force people to put things in one of N buckets they will. By giving people the option of being “unsure” you minimize how often you’ll get inaccurate data due to people’s compulsion to provide an answer even if no correct answer is obvious.
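One way this might look downstream is sketched below: an “unsure” vote is treated as an abstention when combining several annotators’ labels, and the item is left unlabeled unless enough confident votes agree. The `UNSURE` constant, the `aggregate_labels` helper, and the strict-majority rule are assumptions I made for the example, not a prescribed workflow.

```python
from collections import Counter

UNSURE = "unsure"  # hypothetical extra bucket offered to annotators


def aggregate_labels(votes, min_confident=2):
    """Combine one item's votes, treating "unsure" as an abstention.

    Returns the winning label, or UNSURE when too few annotators were
    confident or no label has a strict majority of the confident votes.
    """
    confident = [v for v in votes if v != UNSURE]
    if len(confident) < min_confident:
        return UNSURE

    top_label, top_count = Counter(confident).most_common(1)[0]
    if top_count * 2 <= len(confident):  # tie or no strict majority
        return UNSURE
    return top_label


print(aggregate_labels(["tomato", "tomato", UNSURE]))  # tomato
print(aggregate_labels(["tomato", UNSURE, UNSURE]))    # unsure
print(aggregate_labels(["tomato", "apple"]))           # unsure
```

Items that come back as “unsure” can be routed to a second pass or dropped, rather than being forced into one of the N buckets.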

Final Thoughts

No teacher wants to see their students fail. So it’s important to remember, whether training a lion or a machine learning model, that different species likely learn in different ways. If we cater our teaching to our students, we just might find our machine learning models fat and happy long after we send them off into the wild.

*I’d like to thank Shane Walker and Phil Lipari who inspired this post and have helped me successfully teach many machine learning models to survive in the wild.

Voting is Just a Precision and Recall Optimization Problem

It’s hard to avoid the constant bickering about the results of our last election. Should mail-in voting be legal? Do we need stricter voter identification laws? Was there fraud in the last election? Did it impact the results? These are just a fraction of the questions circulating around elections and voter integrity these days. Sadly, these questions appear to be highly politicized and it’s unclear if anybody is really interested in asking what an optimal election system looks like.

In a truly fair and accurate representative democracy, a vote not counted is just as costly as one inaccurately counted. More precisely, a single mother with no childcare who doesn’t vote because of 4-hour lines is just as damaging to the system as a vote for a Republican candidate that is intentionally or accidentally recorded for the opposing Democratic candidate.

Therefore, we can conclude an optimal election system really involves optimizing on both axes. How do we make sure everyone who wants to vote gets to vote? And how do we ensure every vote is counted accurately? When viewed this way one can’t help but see the parallels to optimizing a machine learning classifier for precision (when we count votes for a given candidate how often did we get it right) and recall (of all possible votes for that candidate how many did we find).

Back the Truck Up! What is Precision and Recall Anyway

Precision and recall are two metrics often used to measure the performance of a classifier. You might ask “why not just measure accuracy?” and that would be a valid question. Accuracy, defined as everything we classified correctly divided by everything we evaluated, suffers from what is commonly known as the imbalanced class problem.

Suppose we have a classifier (a.k.a. laws and regulations) that takes a known set of voters who intend to vote “democrat” or “not democrat” (actual / input) and outputs the recorded vote (predicted / output).

Let’s assume we evaluate 100 intended voters/votes, 97 of which intend to not vote for the democratic candidate and let’s build the dumbest classifier ever known. We are just going to count every vote as “not democrat”, regardless of whether the ballot was marked for the democratic candidate or not.

| N (number of votes) = 100 | Output (Predicted): Democrat | Output (Predicted): Not a Democrat | Total |
|---|---|---|---|
| Input (Actual): Democrat | TP = 0 | FN = 3 | TOTAL DEMOCRATS = 3 |
| Input (Actual): Not a Democrat | FP = 0 | TN = 97 | TOTAL NOT DEMOCRATS = 97 |
| Total | POSITIVES = 0 | NEGATIVES = 100 | |

To make our calculations a little easier, we can take those numbers and drop them into a table that compares inputs to outputs, also known as a confusion matrix. To simplify some of our future calculations, we can further define some of the cells in the table above:

  • True Positives (TP): Correctly captured an intended vote for the democrats as a vote for the democrats (0)
  • True Negatives (TN): Correctly captured a vote NOT intended for the democrats as a vote not for the democrats (97)
  • False Positives (FP): Incorrectly captured a vote NOT intended for the democrats as a vote for the democrats (0)
  • False Negatives (FN): Incorrectly captured an intended vote for the democrats as a vote not for the democrats (3)

Now we can slightly relabel our accuracy equation and calculate our accuracy with our naïve classifier and the associated values from the table above.
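Written in terms of the four cells, accuracy is (TP + TN) / (TP + TN + FP + FN). A minimal sketch of that calculation, using the counts from the confusion matrix for our naïve classifier (the variable names are just illustrative):

```python
# Counts from the confusion matrix for the "count everything as
# not democrat" classifier.
tp, fn, fp, tn = 0, 3, 0, 97

# Accuracy: everything classified correctly divided by everything evaluated.
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"accuracy = {accuracy:.0%}")  # accuracy = 97%
```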

97% Accuracy! We just created the world’s stupidest classifier and achieved 97% accuracy! And therein lies the rub. The second I expose this classifier to the real world with a more balanced set of inputs across classes we will quickly see our accuracy plummet. Hence, we need a better set of metrics. Ladies and gentlemen, I am delighted to introduce …

  • Precision: Of the votes recorded (predicted) for the Democrats, how many were correct

  • Recall: Of all possible votes for the Democrats, how many did we find
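In terms of the confusion matrix cells, precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below plugs in the counts from our naïve classifier. One caveat: because that classifier never records a single vote for the Democrats, precision is technically 0 / 0 and undefined. Reporting it as 0 in that case is a common convention and an assumption on my part here; the helper names are hypothetical.

```python
def precision(tp, fp):
    """Of the votes recorded for the Democrats, the fraction that were correct."""
    predicted_positive = tp + fp
    # Assumption: report 0.0 when nothing was predicted positive (0 / 0).
    return tp / predicted_positive if predicted_positive else 0.0


def recall(tp, fn):
    """Of all intended votes for the Democrats, the fraction we captured."""
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else 0.0


# Counts from the confusion matrix for the naive classifier.
tp, fn, fp, tn = 0, 3, 0, 97
print(precision(tp, fp))  # 0.0 (0 / 0, reported as 0 by convention)
print(recall(tp, fn))     # 0.0
```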

What becomes blatantly clear from evaluating these two metrics is that our classifier, which appeared to have great accuracy, is terrible. None of the intended votes for the democrats were correctly captured and of all possible intended votes for the democrats, we found none of them. It’s worth noting that the example I’ve presented here is for a binary classifier (democrat, not democrat) but these metrics can easily be adapted to multi-class systems that more accurately reflect our actual candidate choices in the United States.

There’s No Such Thing as 100% Precision and Recall

Gödel’s incompleteness theorem, which loosely states that every non-trivial formal system is either incomplete or inconsistent, likely applies to machine learning and artificial intelligence systems. In other words, since machine learning algorithms are built around our known formal mathematical systems, there will be some truths they can never describe. A consequence of that belief, and something I preach to everyone I work with, is that there is really no such thing as 100% precision and recall. No matter how great your model is or what your test metrics tell you, there will always be edge cases.

So if 100% precision and recall is all but impossible, what do we do? When developing products around machine learning classifiers, we often ask ourselves what is most important to the customer: precision, recall, or both. For example, if I create a facial recognition system that notifies the police if you are a wanted criminal, we probably want to err on the side of precision because arresting innocent individuals would be intolerable. But in other cases, like flagging inappropriate images on a social network for human review, we might want to err on the side of recall, so we capture most if not all such images and allow humans to further refine the set.

It turns out that very often precision and recall can be traded off. Most classifiers emit a confidence score of sorts (often a softmax output), and by just varying the threshold on that output we can trade off precision for recall and vice versa. Another way to think about this is, if I require my classifier to be very confident in its output before I accept the result, I can tip the results in favor of precision. If I loosen my confidence threshold, I can tip it back in favor of recall.
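To make the trade-off concrete, here is a small sketch that scores the same set of predictions at a strict and a loose confidence threshold. The scores and labels are made-up toy data, and `precision_recall_at` is a hypothetical helper, not part of any library.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when scores >= threshold count as positive."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0  # 0.0 by convention for 0 / 0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (invented for illustration): classifier confidence and true label.
scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.85, 0.35):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.85  precision=1.00  recall=0.50
# threshold=0.35  precision=0.67  recall=1.00
```

On this toy data the strict threshold accepts only predictions that turn out to be right but misses half of the true positives, while the loose threshold finds all of them at the cost of two false positives.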

And how might this apply in voting? Well, if I structure my laws and regulations such that every voter must vote in person with 6 forms of ID and the vote is tallied in front of the voter by a 10-person bipartisan evaluation team who must all agree … we will likely have very high precision. After all, we’ve greatly increased the confidence in the vote outcome. But at what expense? We will also likely slow down the voting process and create massive lines which will significantly increase the number of people who might have intended to vote but don’t actually do so, hence decreasing recall.

Remind me again what the hell this has to do with Voting

The conservative-leaning Heritage Foundation makes the following statement on their website:

“It is incumbent upon state governments to safeguard the electoral process and ensure that every voter’s right to cast a ballot is protected.”

I strongly subscribe to that statement and I believe it is critical to the success of any representative democracy. But ensuring that every voter’s right to cast a ballot is protected requires not only that we accurately record the captured votes but also that we ensure every voter who intends to vote is unhindered in doing so.

Maybe we need to move entirely to in-person voting while simultaneously allocating sufficient funds for more polling stations, government-mandated paid time off, and government-provided childcare. Or maybe we need all mail-in ballots but some new process or technology to ensure the accuracy of the votes. Ultimately, I don’t pretend to know the right answer, or if we even have a problem to begin with. What I do know is that if we wish to improve our election systems, we must first start with data on where we stand today and then tweak our laws and regulations to simultaneously optimize for precision and recall.

So, the next time a politician proposes changes to our election system, ask … no, demand that they provide data on the current system and how their proposed changes will impact precision and recall. Because only when we optimize for both of these metrics can we stop worrying about making America great again and start working on making America even greater!