Introducing Task-Oriented Multiparty Conversational AI: Inviting AI to the Party

The term “conversational AI” has been around for some time. There are dozens of definitions all over the internet. But let me refresh your memory with a definition from NVIDIA’s website.

Conversational AI is the application of machine learning to develop language-based apps that allow humans to interact naturally with devices, machines, and computers using speech

https://www.nvidia.com/en-us/glossary/conversational-ai

There’s nothing wrong with that definition except for one small misleading phrase: “… allow humans to interact …”. What that should say is: “… allow a human to interact …”. Why? Because every interaction you’ve ever had with a conversational AI system has been one-on-one.

Sure, you and your kids can sit around the kitchen table blurting out song titles to Alexa (“Alexa, play the Beatles,” “No Alexa, play Travis Scott,” “No Alexa, play Olivia Rodrigo.” …). Alexa may even acknowledge each request, but she isn’t having a conversation with your family. She’s indiscriminately acknowledging and transacting on each request as if they’re coming in one by one, all from the same person.

And that’s where multiparty conversational AI comes into play.

What is Multiparty Conversational AI?

With a few small tweaks, we can transform our previous definition of conversational AI to one that accurately defines multiparty conversational AI.

Multiparty conversational AI is the application of machine learning to develop language-based apps that allow AI agents to interact naturally with groups of humans using speech

While the definitions may appear similar, they are fundamentally different. One implies a human talking to a machine, while our new definition implies a machine being able to interact naturally with a group of humans using speech or language. This is the difference between one-on-one interactions versus an AI agent interacting in a multiparty environment.

Multiparty conversational AI isn’t necessarily new. Researchers have been exploring multiparty dialog and conversational AI for many decades. I personally contributed to early attempts at building multiparty conversational AI into video games with the Kinect camera nearly fifteen years ago.1 But sadly, no one has solved all the technical challenges associated with building these types of systems, and there has been no commercial product of note.

What about the “Task-Oriented” part?

You may have wisely noted that I have not yet mentioned the words “task-oriented” from the title of this post. Conversational AI systems (sometimes also called dialog systems) can be divided into two categories: open-domain and task-oriented.

Open-domain systems can talk about any arbitrary topic. The goal is not necessarily to assist with any particular action, but rather to engage in arbitrary chitchat. Task-oriented systems are instead focused on solving “tasks.” Siri and Alexa are both task-oriented conversational AI systems.

In multiparty systems, tasks become far more complicated. Tasks are usually the output of a conversation in which a consensus is formed that necessitates action. Therefore, any task-oriented multiparty conversational AI system must be capable of participating in forming that consensus, or it will risk taking action before it is appropriate to do so.

Multiparty Conversational AI, What is it Good For?

“Absolutely Everything!” Humans are inherently social creatures. We spend much of our time on this planet interacting with other humans. Some have even argued that humans are a eusocial species (like ants and bees) and that our social interactions are critical to our evolutionary success. Therefore, for any conversational AI system to truly become an integral part of our lives, it must be capable of operating amongst groups of humans.

Nowhere is this more evident than in a corporate work environment. After all, we place employees on teams, they have group conversations on Slack/Teams and email, and we constantly gather small groups of people in scheduled or ad-hoc meetings. Any AI system claiming to improve productivity in a work environment will ultimately need to become a seamless part of these group interactions.

Building Task-Oriented Multiparty Conversational AI Systems

There is a litany of complex problems that need to be solved to reliably build a task-oriented multiparty conversational AI system that would be production-worthy. Below is a list of the most critical areas that need to be addressed.

  • Task detection and dialog segmentation
  • Who’s talking to whom
  • Semantic parsing (multi-turn intent and entity detection)
  • Conversation disentanglement
  • Social graphs and user/organization preferences
  • Executive function
  • Generative dialog

In the next sections, we’ll briefly dive deeper into each of these areas.

Task Detection and Dialog Segmentation

In a single-party system such as Alexa or Siri, task detection is quite simple. You address the system (“Hey Siri …”) and everything you utter is assumed to be a request to complete a task (or a follow-up on a secondary step needed to complete a task). But in multiparty conversations, detecting tasks2 is far more difficult. Let’s look at the two dialog segments below.

Two aspects of these conversations make accurately detecting tasks complex:

  • In the first dialog, our agent, Xena, is an active part of the conversation and is explicitly addressed. However, in the second conversation, our agent passively observed a task assigned to someone else and subsequently proactively offered assistance. That means we need to be able to detect task-oriented statements (often referred to as a type of dialog act) that might not be explicitly addressed to the agent.
  • The second issue is that the information necessary to complete either of these tasks is contained outside the bounds of the statement itself. That means we need to be able to segment the dialog (dialog segmentation) to capture all the utterances that pertain to the specific task.

Beyond the two challenges above, there is also the issue of humans often making vague commitments or hedging on ownership. This presents additional challenges, as any AI system must be able to parse whether a task request is definitive or not and be able to handle vague tasks or uncertain ownership.
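To make this a little more concrete, here is a toy sketch (in Python) of task detection and dialog segmentation. The cue phrases, the fixed context window, and the sample dialog are purely illustrative assumptions; a production system would use trained dialog-act and segmentation models rather than keyword rules.

from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str
    text: str

# Crude stand-in for a task (dialog act) detector: flag utterances that look
# like commitments or requests. A real system would use a trained classifier.
COMMITMENT_CUES = ("i'll", "i will", "can you", "could you", "let's", "we need to")

def is_task_statement(utterance: Utterance) -> bool:
    text = utterance.text.lower()
    return any(cue in text for cue in COMMITMENT_CUES)

def segment_for_task(dialog: list, task_index: int, window: int = 3) -> list:
    """Naive dialog segmentation: keep a fixed window of context around the
    task statement. Real segmentation would detect topic boundaries instead."""
    start = max(0, task_index - window)
    return dialog[start:task_index + window + 1]

dialog = [
    Utterance("User 1", "We still haven't aligned on the roadmap."),
    Utterance("User 2", "Agreed, there are open questions on staffing."),
    Utterance("User 1", "Can you set up a meeting next week to review it?"),
    Utterance("User 2", "Sure, I'll send out an invite."),
]

for i, utt in enumerate(dialog):
    if is_task_statement(utt):
        context = segment_for_task(dialog, i)
        print(f"Task candidate from {utt.speaker}: {utt.text!r}")
        print("  context:", [u.text for u in context])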

Who’s Talking to Whom

To successfully execute the task in a multiparty conversation, we need to know who is making the request and to whom it is assigned. This raises another set of interesting challenges. The first issue is: how do we even know who is speaking in the first place?

In a simple text-based chat in Slack, it is easy to identify each speaker. The same is true of a fully remote Zoom meeting. But what happens when six people are all collocated in a conference room? To solve this problem we need to introduce concepts like blind speaker segmentation and separation and audio fingerprinting.

But even after we’ve solved the upfront problem of identifying who is in the room and speaking at any given time, there are additional problems associated with understanding the “whom.” It is common to refer to people with pronouns, and in a multiparty situation you can’t simply assume “you” refers to the other speaker. Let’s look at a slightly modified version of one of the conversations we presented earlier.

The simple assumption would be that the previous speaker (User 2) is the “whom” in this task statement. But after analyzing the conversation, it is clear that “you” refers to User 1. Identifying the owner, or “whom,” in this case requires concepts like coreference resolution (who does “you” refer to elsewhere in the conversation?) to identify the correct person.
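Here is a deliberately naive sketch of that idea: resolve “you” by walking back through the dialog for the participant who volunteered for the related work, rather than defaulting to the previous speaker. The cue phrases and sample dialog are assumptions for illustration; a real system would rely on a trained coreference model over the full conversation.

# Toy addressee resolution: "you" points at whoever volunteered for the work,
# not necessarily the previous speaker.
dialog = [
    ("User 1", "I can pull together the roadmap doc before Friday."),
    ("User 2", "That would be really helpful."),
    ("User 3", "Can you send it to the whole team when it's ready?"),
]

def volunteered(text: str) -> bool:
    t = text.lower()
    return any(t.startswith(cue + " ") or f" {cue} " in t for cue in ("i can", "i'll", "i will"))

def resolve_you(dialog, current_index):
    """Return the most recent earlier speaker who volunteered to do work;
    fall back to the previous speaker if nobody did."""
    for speaker, text in reversed(dialog[:current_index]):
        if volunteered(text):
            return speaker
    return dialog[current_index - 1][0] if current_index > 0 else None

print(resolve_you(dialog, 2))  # -> "User 1", not the previous speaker "User 2"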

Semantic Parsing

Semantic parsing, also sometimes referred to as intent and entity detection, is an integral part of all task-oriented dialog systems. However, the problem gets far more complex in multiparty conversations. Take the dialog in the previous section. A structured intent and entity JSON block might look something like this:

{
    "intent": "schedule_meeting",
    "entities": {
        "organizer": "User 1",
        "attendees": [
            "User 2",
            "User 3"
        ],
        "title": "next quarter roadmap",
        "time_range": "next week"
    }
}

Note that the details in this JSON block did not all originate from our task-based dialog act. Rather, the information was pulled from multiple utterances across multiple speakers. Successfully achieving this requires a system that is exceptionally good at coreference resolution and discourse parsing.
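One way to picture what the parser has to do is as a merge of partial parses: each utterance contributes a fragment, and the fragments are combined into the structured task above. The hard-coded fragments and merge rules below are stand-ins for what trained semantic parsing, coreference, and discourse models would produce; the sketch only shows the shape of the problem.

import json

# Each utterance yields a partial parse; merging them recovers the full task.
# In a real system these fragments would come from a model, not be hand-written.
partial_parses = [
    {"speaker": "User 1", "fragment": {"intent": "schedule_meeting", "organizer": "User 1"}},
    {"speaker": "User 1", "fragment": {"title": "next quarter roadmap"}},
    {"speaker": "User 2", "fragment": {"attendees": ["User 2"]}},
    {"speaker": "User 3", "fragment": {"attendees": ["User 3"], "time_range": "next week"}},
]

def merge_parses(parses):
    task = {"intent": None, "entities": {"organizer": None, "attendees": [], "title": None, "time_range": None}}
    for parse in parses:
        for key, value in parse["fragment"].items():
            if key == "intent":
                task["intent"] = value
            elif key == "attendees":
                task["entities"]["attendees"].extend(value)
            else:
                task["entities"][key] = value
    return task

print(json.dumps(merge_parses(partial_parses), indent=4))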

Conversation Disentanglement

While some modern chat-based applications (e.g., Slack) have concepts of threading that can help isolate conversations, we can’t guarantee that any given dialog is single-threaded. Meetings are unthreaded, and chat conversations can contain multiple conversations interspersed with each other. That means any multiparty conversational AI system must be able to pull apart these different conversations to transact accurately. Let’s look at another adaptation of a previous conversation:

In this dialog, two of our users have started a separate conversation. This can lead to ambiguity in the last request to our agent. User 3 appears to be referring to the previous meeting we set up, but knowing this requires we separate (or disentangle) these two distinct conversations so we can successfully handle subsequent requests.
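A crude way to see what disentanglement involves is to assign each message to the existing thread it shares the most words with, and start a new thread when the overlap is too low. The overlap measure, threshold, and messages below are illustrative assumptions; real systems learn reply-to structure from speaker, timing, and content features.

# Toy conversation disentanglement via lexical overlap.
def tokenize(text: str) -> set:
    return {word.strip(".,?!") for word in text.lower().split()}

def word_overlap(a: str, b: str) -> int:
    return len(tokenize(a) & tokenize(b))

def disentangle(messages, min_overlap=2):
    threads = []  # each thread is a list of (speaker, text) tuples
    for speaker, text in messages:
        best_thread, best_score = None, 0
        for thread in threads:
            score = max(word_overlap(text, prev) for _, prev in thread[-3:])
            if score > best_score:
                best_thread, best_score = thread, score
        if best_thread is not None and best_score >= min_overlap:
            best_thread.append((speaker, text))
        else:
            threads.append([(speaker, text)])
    return threads

messages = [
    ("User 1", "Can we schedule the roadmap review meeting for next week?"),
    ("User 2", "Did anyone look at the billing bug from yesterday?"),
    ("User 3", "Next week works for the roadmap review meeting."),
    ("User 2", "The billing bug from yesterday is blocking a customer."),
]

for i, thread in enumerate(disentangle(messages), start=1):
    print(f"Thread {i}: {[text for _, text in thread]}")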

Social / Knowledge Graph and User Preferences

While this might not be obvious, when you engage in any multiparty conversation you are relying on a database of information that helps inform how you engage with each participant. That means any successful multiparty conversational AI system needs to be equally aware of this information. At a bare minimum, we need to know how each participant relates to each other and their preferences associated with the supported tasks. For example, if the CEO of the company is part of the conversation you may want to defer to their preferences when executing any task.
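At its simplest, this might look like a small graph of participants, their relationships, and their preferences that the agent consults before acting. The schema, roles, and preferences below are made up purely for illustration.

# A tiny stand-in for a social/knowledge graph with per-user preferences.
people = {
    "User 1": {"role": "CEO", "reports_to": None, "prefers": {"meeting_time": "mornings"}},
    "User 2": {"role": "PM", "reports_to": "User 1", "prefers": {"meeting_time": "afternoons"}},
    "User 3": {"role": "Engineer", "reports_to": "User 2", "prefers": {"meeting_time": "afternoons"}},
}

SENIORITY = {"CEO": 0, "PM": 1, "Engineer": 2}

def preferred_meeting_time(participants):
    """Defer to the most senior participant's preference when preferences conflict."""
    most_senior = min(participants, key=lambda p: SENIORITY[people[p]["role"]])
    return most_senior, people[most_senior]["prefers"]["meeting_time"]

print(preferred_meeting_time(["User 2", "User 3", "User 1"]))  # ('User 1', 'mornings')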

Executive Function

Perhaps most importantly, any task-oriented multiparty conversational AI system must have executive function capabilities. According to the field of neuropsychology, executive function is the set of cognitive capabilities humans use to plan, monitor, and execute goals.

Executive function is critically important in a multiparty conversation because we need to develop a plan for whether we immediately take action on any given request or whether we must seek consensus first. Without these capabilities, an AI system will just blindly execute tasks. As described earlier in this post, this is exactly how your Alexa behaves today. If you and your kids continuously scream out “play <song name x>,” it will just keep changing songs without any attempt to build consensus, and the interaction with the conversational AI system will become dysfunctional. Let’s look at one more dialog interaction.

As you can see in the example above, our agent didn’t just automatically transact on a request to move the meeting to Wednesday. Instead, the agent used its executive function to do a few things:

  • Recognize that the second request did not come from the original requester
  • Preemptively retrieve information about whether the proposal was viable
  • Seek consensus with the group before executing

Achieving this capability requires gathering previously collected data, developing a plan, and then executing against that plan. So for a task-oriented multiparty conversational AI system to correctly operate within a group, it must have executive function capabilities.
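The decision loop itself can be sketched in a few lines. The rules below (act directly for the original requester, surface conflicts, otherwise ask the group) are illustrative assumptions rather than a real planner, but they capture the plan-then-act behavior described above.

# A sketch of the executive-function check: execute, inform, or seek consensus.
def plan_action(task, new_request, calendar_conflicts):
    """Decide what the agent should do with a proposed change to an existing task."""
    if new_request["speaker"] == task["requested_by"]:
        # The original requester is changing their own task: act directly.
        return "execute", new_request["proposal"]
    if calendar_conflicts(new_request["proposal"]):
        # The proposal is not viable; report that instead of executing.
        return "inform", f"{new_request['proposal']} does not work for all attendees"
    # Someone else proposed a change: confirm with the group before acting.
    return "seek_consensus", f"{new_request['speaker']} proposed {new_request['proposal']} - does that work for everyone?"

task = {"intent": "schedule_meeting", "requested_by": "User 1", "proposal": "Tuesday 10am"}
new_request = {"speaker": "User 2", "proposal": "Wednesday 10am"}

action, detail = plan_action(task, new_request, calendar_conflicts=lambda slot: False)
print(action, "->", detail)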

Generative Dialog Engine

Last but not least, any conversational AI system must be able to converse with your users. However, because the number of people in any given conversation and their identities are not predictable, and because our executive function can produce a wide array of responses, no predefined or templated list of responses will suffice. A multiparty system will need to take all our previously generated information and generate responses on demand.

Wait, Don’t Large Language Models (LLMs) Solve This?

With all the hype, you’d think LLMs could solve the problem of task-oriented multiparty conversational AI – and weave straw into gold. But it turns out that, at best, LLMs are just a piece in a much larger puzzle of AI technology needed to solve this problem.

There are basic problems, like the fact that LLMs are purely focused on text and can’t handle some of the speaker identification problems discussed earlier. But even more importantly, there is no evidence that LLMs have native abilities to understand the complexities of social interactions and plan their responses and actions based on that information.

It will require a different set of technologies, perhaps leveraging LLMs in some instances, to fully build a task-oriented multiparty conversational AI system.

So When Can I Invite an AI to Join the Party?

While I can’t say anyone has solved all the challenges discussed in this post, I can say we are very close. My team at Xembly has developed what we believe is the first commercial product capable of participating in multiparty conversations as both a silent observer and an active participant. Our AI agent can join in-person meetings or converse with a group in Slack while also helping complete tasks that arise as a byproduct of these conversations.

We are only just beginning to tackle task-oriented multiparty conversational AI. So we may not be the life of the party, but go ahead and give Xembly and our Xena AI agent a try. The least you can do is send us an invite!

  1. With the Kinect Camera, we hoped to individually identify speakers in a room so each user could independently interact with the game. You can read more details about our work in this space here: 1, 2, 3, 4 ↩︎
  2. Are tasks in multiparty conversations just action items? Yes, since an action item is generally defined as a task that arises out of a group’s discussion. I’ll be writing a larger deep dive into action item detection in a future post. ↩︎

Generative AI – Prolific Copyright Infringer?

Poor man’s copyright: original music mailed to myself (Jason Flaks) via certified mail.

So, you might be wondering, “What makes this guy fit to pen an article about generative AI and copyright infringement?” I mean, I’m no copyright lawyer, nor do I moonlight as one on television. But I do bring a unique viewpoint to the table. After all, I’ve dabbled in the music industry and have been up to my elbows in machine-learning projects for a good chunk of my career. But perhaps my best qualification is my long-standing fascination with copyright law, which started when I was just a kid. That image up top isn’t some AI-generated piece from Midjourney; it’s my earnest attempt at copyrighting my original music over three decades ago using the Poor Man’s Copyright approach.

So, What’s Copyright, and How Do I Get One?

Before we dive into the meaty debate of whether Generative AI infringes on anyone’s copyright, let’s clarify what copyright means. According to the US Copyright Office, copyright is a “type of intellectual property that protects original works of authorship as soon as an author fixes the work in a tangible form of expression.” In simpler terms, copyright asserts your ownership of any intangible creations of human intellect (e.g., music, art, writing, etc).  The moment you fix your creation to a physical form (e.g., MP3 file, canvas, video recording, piece of paper, etc.), you have a copyright.

Amazingly, you don’t need to register for an official copyright to have one. So, why register a copyright? The Supreme Court has decided that to sue for copyright infringement, you must have registered your copyright with the US Copyright Office. However, they’ve also clarified that the registration requirement is separate from the date of creation. This means I could register a copyright for this blog post years from now and still sue for any infringement that occurred before registration, as long as I can prove the date of creation.

How Can Generative AI Infringe on a Copyright?

There’s been a lot of talk about generative AI infringing on copyright protections, but many of these discussions oversimplify the issue. There are actually three different ways generative AI can infringe on your copyright, some of which favor artists and creators and some of which favor the generative AI companies.

  • Theft (copying) of copyrighted material
  • Distribution of copyrighted material
  • Use of copyrighted material in derivative or transformative works

On Theft (Copying) of Copyrighted Material

Let’s get real; while there are numerous court cases establishing the legitimate right to duplicate copyrighted material under the fair use doctrine, the default assumption is and should be that it is illegal to do so.  Therefore, if we can determine that generative AI companies are using copies of content they did not pay for or get permission to use and using that content in a way that falls outside fair use, then we can assume they are stealing copyrighted material.

There is little or no debate that generative AI companies are using copyrighted material. After all, OpenAI basically admits to lifting its training data from content “that is publicly available on the internet” on its website. And as I discussed earlier, just about anything newly written on the internet has an inherent copyright. But beyond the possible scraping of my blog posts, there is sufficient evidence that generative AI companies have ingested copyrighted books, images, and more.

And if you need any more proof, look at the image below, where I attempted to elicit lyrics from Bob Dylan’s “Blowin’ in the Wind” from ChatGPT. ChatGPT both recognized that the lyrics I provided were from the song and quoted a portion of the lyrics I did not provide. It can only do that because it has seen the lyrics before in its training data set.

ChatGPT prompted to generate lyrics from Bob Dylan’s “Blowin’ in the Wind”

If there is no question that copyrighted material was used in the training process, then we only need to assess whether the copying should be considered fair use. There are multiple justifications for fair use in copyright law.  Some are easy to interpret, and others more difficult.  Items like research or scholarly use are reasonably easy to assess, and I can find no fair argument that generative AI companies are using copyrighted material in either capacity.

So, that leaves the last question in fair use: does the copying materially impact the monetization of the content? And I think the answer here again is quite simple: YES! The easiest example I can give is the artwork I regularly use in my blog posts. I’ve traditionally paid for the art I use via services like Dreamstime. If Midjourney or Stable Diffusion trained on this type of art and I subsequently generate my blog post art via their services, I may never pay for art via Dreamstime or other similar services again. And in doing so, those artists have lost a way to monetize their art, and they are not equally compensated by the generative AI companies.

On Distribution of Copyrighted Material

If you’re old like me, you may remember those FBI copyright warnings that regularly made an appearance on DVDs and VHS tapes.

The unauthorized reproduction or distribution of this copyrighted work is illegal …

The issue of whether these systems distribute the content in its original form with little transformation is a big one. This distribution can occur in two ways: to end customers and to data annotators.

To end customers

Generative AI models are basically next-word (pixel, etc.) predictors. They aim to provide the most statistically likely next word based on a previous sequence of words. As a result, these models will, without any special adaptations, spit back exact copies of text, images, etc., especially in low-density areas.  As you can see from the image in the previous section, while OpenAI has been proactively trying to adapt the system not to distribute copyrighted material, I was still able to get it to do so with very little effort on my part.

So while these generative AI systems will continue to try and put mitigations in place to prevent the distribution of copyrighted content, there is little or no debate that they have been doing so all along.  And they are likely to continue doing so as it is impossible to close every hole in the system.

To data annotators

OpenAI and others use reinforcement learning from human feedback (RLHF) to improve their models. RLHF requires that outputs from an original model are shown to human annotators to help build a reward model that leads to better outputs from the generative model. If these human annotators were shown copyrighted material, in an effort to reward the model for not doing so in the future, OpenAI and other generative AI companies would clearly be distributing copyrighted content.

You might ask, “Shouldn’t copyright holders be happy that OpenAI is trying to train their models not to distribute copyrighted content?”  Well, maybe, but if I started traveling the country tomorrow, giving a for-profit seminar on how to detect illegal copies of the Super Bowl, and in these seminars, I played previous Super Bowl recordings to the attendees without the NFL’s permission … I think the NFL would have a problem with that.

On Use of Copyrighted Material in Derivative or Transformative Works

The question of whether the output generated by Generative AI models, when not a direct reproduction, counts as copyright infringement is a murky one. There are many examples where courts have determined that “style” is not copyrightable. There are further questions on whether any output created by generative AI based on copyrighted material is derivative or transformative.   Truth be told, it can likely be either, depending on how the model is prompted.  So it’s actually quite difficult to say for sure if the resulting output from generative AI models is fair use or copyright infringement.

We’re left then with questions about who is really violating copyright in any of these cases. Is it the model or the company that owns it? Or is it the user who prompted the model to generate the content? And does any of it really matter unless that generated content is published?

The Road Ahead

It seems to me the issue of generative AI and copyright has been complicated more than necessary. Generative AI companies must find a way to pay for the content they use to train their models. If they distribute the content, they may need to find a way to pay royalties.  Otherwise, these generative AI companies are profiting off the works of creators without properly compensating them.  And that just isn’t fair.

For artists, don’t let the thought of generative AI copying your style without compensation scare you. These models can’t generate new content and are limited to what they’ve seen in their training set. So, keep making new art, keep pushing boundaries, and if we solve the first problem of content theft and distribution, you’ll continue to be paid for the amazing work you create.

Your Large Language Model – it’s as Dumb as a Rock

© Jason Flaks – Initially generated by DALL-E and edited by Jason Flaks

Unless you’ve been living under a rock lately you likely think we’re entering some sort of AI-pocalypse. The sky is falling and the bots have come calling. There are endless reports of ChatGPT acing college-level exams, becoming self-aware, and even trying to break up people’s marriages! The way  OpenAI and their ChatGPT product have been depicted, it’s a miracle we haven’t all unplugged our devices and shattered our screens. It seems like a sensible way to stop the AI overlords from taking control of our lives.

But never fear! I am here to tell you that large language models (LLMs) and their various compatriots are as dumb as the rocks we all might be tempted to smash them with. Well, ok, they are smart in some ways. But don’t fret—these models are not conscious, sentient, or intelligent at all. Here’s why.

Some Like it Bot: What’s an LLM?

Large Language Models (LLMs) actually do something quite simple. They take a given sequence of words and predict the next likely word to follow. Do that recursively, and add in a little extra noise each time you make a prediction to ensure your results are non-deterministic, and voila! You have yourself a “generative AI” product like ChatGPT.

But what if we take the description of LLMs above and restate it a little more succinctly:

LLMs estimate an unknown word based on extending a known sequence of words.

It may sound fancy—revolutionary, even—but the truth is it’s actually old school. Like, really, really old school—it’s almost the exact definition of extrapolation, a common mathematical technique that’s existed since the time of Archimedes! If you take a step back, Large Language Models are nothing more than a fancy extrapolation algorithm. Last I checked, nobody thinks their standard polynomial extrapolation algorithm is conscious or intelligent. So why exactly do so many believe LLMs are?
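To see how unmagical the core loop is, here is a toy version of “predict the next word, append it, repeat, with a little randomness thrown in.” It uses a bigram lookup table instead of a transformer, and the tiny corpus is made up, so it only shows the shape of the generation loop, not how a real LLM computes its predictions.

import random
from collections import defaultdict

# Build a crude "language model": which words follow which in a tiny corpus.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
next_words = defaultdict(list)
for current, nxt in zip(corpus, corpus[1:]):
    next_words[current].append(nxt)

def generate(prompt_word, length=8, seed=None):
    """Recursively extrapolate: predict a next word, append it, repeat."""
    rng = random.Random(seed)
    out = [prompt_word]
    for _ in range(length):
        candidates = next_words.get(out[-1])
        if not candidates:
            break
        out.append(rng.choice(candidates))  # the injected randomness (non-determinism)
    return " ".join(out)

print(generate("the", seed=1))
print(generate("the", seed=2))  # a different seed yields a different continuation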

Hear Ye, Hear Ye: What’s in an Audio Sample

Sometimes it’s easier to explain a complex topic by comparison. Let’s take a look at one of the most common human languages in existence—music.  Below are a few hundred samples from Bob Dylan’s “Like a Rolling Stone.” 


If I were to take those samples and feed them into an algorithm and then recursively extrapolate out for a few thousand samples, I’d have generated some additional audio content. But there is a lot more information encoded in that generated audio content than just the few thousand samples used to create it.

At the lowest level:

  • Pitch
  • Intensity
  • Timbre

At a higher level:

  • Melody
  • Harmony
  • Rhythm

And at an even higher level:

  • Genre
  • Tempo

So by simply extrapolating samples of audio, we generated all sorts of complex higher-level features of auditory or musical information. But pump the brakes! Did I just create AI Mozart? I don’t think so. It’s more like AI Muzak.
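If you want to try the audio version of this thought experiment yourself, here is a rough numpy sketch: fit a simple linear autoregressive predictor to a short window of synthetic audio and recursively extrapolate new samples from its own output. The signal, model order, and sample counts are arbitrary choices for illustration.

import numpy as np

sr = 8000                      # sample rate in Hz
t = np.arange(512) / sr
# A synthetic "audio" snippet: two sine tones standing in for real samples.
seed_audio = 0.6 * np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 440 * t)

order = 32                     # predict each sample from the previous 32
X = np.array([seed_audio[i:i + order] for i in range(len(seed_audio) - order)])
y = seed_audio[order:]
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares autoregressive fit

history = list(seed_audio[-order:])
generated = []
for _ in range(2000):          # recursively extrapolate new samples
    nxt = float(np.dot(coeffs, history[-order:]))
    generated.append(nxt)
    history.append(nxt)

print(f"extrapolated {len(generated)} samples; last few: {np.round(generated[-3:], 3)}")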

An AI of Many Words: What’s Next? 

It turns out that predicting the next word in a sequence of words will also generate more than just a few lines of text. There’s a lot of information encoded in those lines,  including the structure of how humans speak and write, as well as general information and knowledge we’ve previously logged. Here’s just a small sample of things encoded in a sequence of words:

  • Vocabulary
  • Grammar/Part of Speech (PoS) tagging
  • Coreference resolution (pronoun dereferencing)
  • Named entity detection
  • Text categorization
  • Question and answering
  • Abstract summarization
  • Knowledge base

All of the information above can, in theory, be extracted by simply predicting the next word, much in the same way predicting the next musical sample gives us melody, harmony, rhythm, and more.   And just like our music extrapolation algorithm didn’t produce the next Mozart, ChatGPT isn’t going to create the next Shakespeare (or the next horror movie villain, for that matter).

LLMs: Lacking Little Minds? 

Large Language Models aren’t the harbinger of digital doom, but that doesn’t mean they don’t have some inherent value. As an early adopter of this technology, I know it has a place in this time. It’s integral to the work we do at Xembly, where I’m the co-founder and CTO. However, once you understand that LLMs are just glorified extrapolation algorithms, you gain a better understanding of the limitations of the technology and how best to use it. 

Five Alive: How to Use LLMs So They Don’t Take Over the World


LLMs have huge potential. Just like any other tool, though, in order to extrapolate the most value, you have to use them properly. Here are five areas to consider as you incorporate LLMs into your life and work. 

  • Information must be encoded in text
  • Extrapolation error with distance
  • Must be prompted
  • Limited short-term memory
  • Fixed in time with no long-term memory

Let’s dig a little deeper.

Information Must Be Encoded in Text

Yann LeCun probably said it best:

Humans are multi-modal input devices and many of the things we observe and are aware of that drive our behavior aren’t verbal  (and hence not encoded in text). An example we contend with at Xembly is the prediction of action items from a meeting. It turns out that the statement “I’ll update the row in the spreadsheet” may or may not be a future commitment to do work.  Language is nuanced, influenced by other real-time inputs like body language and hundreds of other human expressions. It’s entirely possible in this example that the task was completed in real-time during the meeting, and the spoken words weren’t an indication of future work at all.

Extrapolation Error with Distance

Like all extrapolation algorithms, the further you get away from your source signal (or prompt, in the case of LLMs), the more likely you are to experience errors. Sometimes a single prediction, such as one that negates an otherwise affirmative statement or incorrectly assigns a gendered pronoun, can cause downstream errors in future predictions. These tiny errors can often lead to convincingly good responses that are factually inaccurate. In some cases, you may find LLMs return highly confident answers that are completely incorrect. These types of errors are referred to as hallucinations.

But both of these examples are really just forms of extrapolation error. The errors will be more pronounced when you make long predictions. This is especially true for content largely unseen by the underlying language model (for example, when trying to do long-form summarization of novel content).

Must Be Prompted

Simply put, if you don’t provide input text an LLM will do nothing. So if you are expecting ChatGPT to act as a sage and give you unsolicited advice, you’ll be waiting a long time. Many of the features Xembly customers rave about are based on our product providing unsolicited guidance. Large Language Models are no help to us here.

Limited Short-Term Memory

LLMs generally only operate on a limited window of text. In the case of ChatGPT, that window is roughly 3000 words. What that means is that new information not already incorporated in the initial LLM training data can very quickly fall out of memory. This is especially problematic for long conversations where new corporate lingo may be introduced at the start of a conversation and never mentioned again. Once whatever buzzword is used falls out of the context window it will no longer contribute to any future prediction, which can be problematic when trying to summarize a conversation.
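Here is a tiny sketch of that failure mode: once early utterances slide out of a fixed-size context window, anything defined only there (a new acronym, a codename) can no longer influence the model’s next prediction. The window size and conversation below are made up so the effect is visible in a few lines; real models measure their windows in thousands of tokens.

WINDOW_WORDS = 50   # real context windows are thousands of tokens; 50 keeps the demo short

conversation = ["Quick note: 'Project Narwhal' is our codename for the Q3 billing rewrite."]
conversation += [f"Filler small talk sentence number {i} about nothing in particular." for i in range(20)]
conversation += ["So, what is the status of Project Narwhal?"]

def visible_context(utterances, window_words=WINDOW_WORDS):
    """Keep only the most recent words that fit in the window, newest last."""
    words = " ".join(utterances).split()
    return " ".join(words[-window_words:])

context = visible_context(conversation)
print("definition still in context:", "billing rewrite" in context)   # False: it fell out of the window
print("question still in context:  ", "Project Narwhal" in context)   # True, but now undefined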

Fixed in Time with no Long-term Memory

Every conversation you have with ChatGPT only exists for that session. Once you close that browser or exit your current conversation, there is no memory of what was said. That means you cannot depend on new words being understood in future conversations unless you reintroduce them within a new context window. So, if you introduce an LLM to anything it hasn’t heard before in a given session, you may find it uses that word correctly in subsequent responses. But if you enter a new session and have any hopes that the word will be used without introducing it in a new prompt, brace yourself—you will be disappointed.

To Use an LLM or Not to Use an LLM

It’s a big question. LLMs are exceedingly powerful, and you should strongly consider using them as part of your NLP stack. I’ve found the greatest value of many of these LLMs is that they potentially replace all the bespoke language models folks have been making for some time. You may not need those custom entity models, intent models, abstract summarization models, etc. It’s quite possible that LLMs can accomplish all of these things at similar or better accuracy, while possibly greatly reducing time to market for products that rely on this type of technology.

There are many items in the LLM plus column, but if you are hoping to have a thought-provoking intelligent conversation with ChatGPT,  I suggest you walk outside and consult your nearest rock. You just might have a more engaging conversation!

The Annotator’s Dilemma: When Humans Teach Machines to Fail

What does a machine learning model trained via supervised learning and a lion raised in captivity have in common? … They’re both likely to die in the wild!

Now that might sound like a joke aimed at getting PETA to boycott my blog, but this is no laughing matter. Captive bred lions are more likely to die in the wild and so are machine learning models trained with human annotated data.

According to a 2008 National Geographic article, captive-bred predators often die in the wild because they never learn the natural behaviors necessary for success. This is because their human captors never teach the animals the necessary survival skills (e.g., hunting) or inadvertently teach behaviors that are detrimental to their survival (e.g., no fear of humans). It’s not for lack of trying, but for a variety of reasons it is impractical or impossible to expose a captive predator to an environment that completely mirrors their ultimate home.

Why Humans Teach Machines to Fail

Not surprisingly, when humans teach machines, they fail for many of the same reasons. In both supervised and semi-supervised machine learning, we train a model using human-annotated data. Unfortunately, human cognitive and sensory capabilities can introduce a variety of consequences that often lead us to teach machines the wrong thing or fail to fully expose them to the environment they will find in the wild. While there are likely many areas that impact the quality of human annotations, I’d like to cover five that I believe have the greatest impact on success.

Missing Fundamental and the Transposed Letter Effect (Priming)

Have you ever had someone explain an interesting factoid that sticks with you for life? One such example in my life is the “Missing Fundamental” effect, first introduced to me by my freshman-year music theory professor. The question he posed was, “How are you able to hear the low A on a piano when the vast majority of audio equipment cannot reproduce the corresponding fundamental frequency?” It turns out the low A on a piano has a fundamental frequency of 27.5 Hz, and most run-of-the-mill consumer audio equipment is incapable of reproducing a frequency that low with any measurable gain. Yet we can hear the low A on a piano recording even with those crappy speakers. The reason why is the “Missing Fundamental” effect: in essence, the human brain can infer the missing fundamental frequency from its upper harmonics.

A similar concept is the “Transposed Letter” effect. I’m sure you’ve seen the memes: those images with scrambled letters that tell you you’re a genius if you can read them. Your ability to read those sentences is due to the transposed letter effect and is related to priming. Basically, even if the letters in the words printed on the page are jumbled, reading them can still activate the same regions of the brain as the original words.

You might be asking what any of this has to do with annotating data and teaching machines. The problem arises when you realize we humans can correctly identify something even when all the data needed to do so is not actually present. If we annotate data this way, we are assuming the machine has the same capabilities, and that might not be so. Teaching a machine that “can you raed this” and “can you read this” are the same thing may have unintended consequences.

Selection Bias

If you gave me millions of images and asked me to find and label all the images with tomatoes, I am probably going to quickly scan for anything red and circular. Why? Because that’s my initial vision of a tomato and scanning the images that way would likely speed up the process of going through them all. But that is me imparting my bias of what a tomato looks like into the annotated data set. And it’s exactly how a machine never learns that the Green Zebra and Cherokee Purple varieties are indeed tomatoes!

Multisensory Integration

Humans often make use of multiple senses simultaneously to improve our perception of the world. For example, it has been well documented that speech perception, especially in noisy environments, is greatly enhanced when the listener can leverage visual and auditory cues. However, the vast majority of commercial machine learning models are single modality (reading text, scanning images, scanning audio). So, if my ability to understand a noisy speech signal is only possible due to a corresponding video it may be dangerous to try and teach a machine what was said since the machine likely does not have access to the same raw data.

Response Bias

I am ashamed to admit this but every time I get an election ballot, I feel an almost compulsive need to select a candidate for every position. Even when I have little or no knowledge of the office the candidates are running for, the background of the competing candidates, and their policy positions. Usually, I arbitrarily select a candidate based on their party affiliation or what college they went to, which is probably only slightly better than selecting the first name on the ballot. My need to select a candidate even though I have no basis for doing so is likely a form of response bias. The problem with response bias is it generally leads to inaccurate data. If your annotators suffer from response bias, you are likely teaching the machine with inaccurate data.

Zoning Out

Have you ever driven somewhere only to arrive at your destination with no recollection of how you got there? If so, like me, you have experienced zoning out. With repetitive tasks, we tend to start with an implied speed-versus-accuracy trade-off, but over time, as the task gets boring, we start to zone out or get distracted while maintaining the same speed, which ultimately leads to errors. Annotating data is a highly repetitive task and therefore has a high probability of generating these types of errors. And when we use error-ridden annotated data to teach our machines, we likely teach them the wrong thing.

How to be a Better Teacher

While the problems above might seem daunting there are things we can do to help minimize the effects of human behavior on our ability to accurately teach machines.

Provide a Common Context

The missing fundamental and multisensory integration problems are both issues with context. In each of these cases either historical or current context allows us humans to discern something another species (a.k.a. the machine) may not be able to comprehend. The solution to this problem is to make sure humans teach the machine with a shared context. The easiest way to fix this problem is to limit the annotator to the same modality the machine will operate with. If the task at hand is to teach a machine to recognize speech from audio, then don’t provide the annotator access to any associated video content. If the task is to identify sarcasm in written text don’t provide the annotator with audio recordings of the text being spoken. This will ensure the annotator teaches the machine with mutually accessible data.

Beyond tooling, you can also train your annotators to try and interpret data from multiple perspectives to ensure their previous experiences don’t cause brain activations that the machine might not benefit from. For example, it is very easy to read text in your head with your own internal inflections that might change the meaning. After all, the slightest change in inflection can turn a benign comment into a sarcastic insult. However, if you train annotators to step back and try to read the text with multiple inflections, you might avoid this problem.

Introduce Randomness

While it might be tempting to let users search around for items they think will help teach a machine, doing so can increase the likelihood of selection bias. There may be good reasons to allow users to search to speed up data collection of certain classes, but it is also important to make sure a sizeable portion of your data is randomly selected. Make sure you set up different jobs for your annotators and ensure some proportion of your labeling effort comes from randomly selected examples.

Reduce Cognitive Load

While we may not be able to prevent boredom and zoning out, we can reduce complexity in our labeling tools. By reducing the cognitive load we are more likely to minimize mistakes when people get distracted. Some ways to reduce cognitive load include limiting labeling tasks to single step processes (i.e., only label one thing at a time) and providing clear and concise instructions that remove ambiguity.

Be Unsure

Last but not least, allow people to be unsure. If you force people to put things in one of N buckets they will. By giving people the option of being “unsure” you minimize how often you’ll get inaccurate data due to people’s compulsion to provide an answer even if no correct answer is obvious.

Final Thoughts

No teacher wants to see their students fail. So it’s important to remember, whether training a lion or a machine learning model, that different species likely learn in different ways. If we cater our teaching to our students, we just might find our machine learning models fat and happy long after we’ve sent them off into the wild.

*I’d like to thank Shane Walker and Phil Lipari who inspired this post and have helped me successfully teach many machine learning models to survive in the wild.

Voting is Just a Precision and Recall Optimization Problem

It’s hard to avoid the constant bickering about the results of our last election. Should mail-in voting be legal? Do we need stricter voter identification laws? Was there fraud in the last election? Did it impact the results? These are just a fraction of the questions circulating around elections and voter integrity these days. Sadly, these questions appear to be highly politicized and it’s unclear if anybody is really interested in asking what an optimal election system looks like.

In a true fair and accurate representative democracy, a vote not counted is just as costly as one inaccurately counted. More precisely, a single mother with no childcare who doesn’t vote because of 4-hour lines is just as damaging to the system as a vote for a republican candidate that is intentionally or accidentally recorded for the opposing Democratic candidate.

Therefore, we can conclude an optimal election system really involves optimizing on both axes. How do we make sure everyone who wants to vote gets to vote? And how do we ensure every vote is counted accurately? When viewed this way one can’t help but see the parallels to optimizing a machine learning classifier for precision (when we count votes for a given candidate how often did we get it right) and recall (of all possible votes for that candidate how many did we find).

Back the Truck Up! What Are Precision and Recall Anyway?

Precision and recall are two metrics often used to measure the accuracy of a classifier. You might ask, “Why not just measure accuracy?” and that would be a valid question. Accuracy, defined as everything we classified correctly divided by everything we evaluated, suffers from what is commonly known as the imbalanced class problem.

Suppose we have a classifier (a.k.a. laws and regulations) that takes a known set of voters who intend to vote “democrat” or “not democrat” (actual/input) and then outputs the recorded vote (predicted/output).

Let’s assume we evaluate 100 intended voters/votes, 97 of which intend to not vote for the democratic candidate, and let’s build the dumbest classifier ever known. We are just going to count every vote as “not democrat,” regardless of whether the ballot was marked for the democratic candidate or not.

N (number of votes) = 100       | Output (Predicted): Democrat | Output (Predicted): Not a Democrat | Row totals
Input (Actual): Democrat        | TP = 0                       | FN = 3                             | Total Democrats = 3
Input (Actual): Not a Democrat  | FP = 0                       | TN = 97                            | Total Not Democrats = 97
Column totals                   | Positives = 0                | Negatives = 100                    |

The table above, which compares inputs (intended votes) to outputs (recorded votes), is also known as a confusion matrix. To simplify some of our future calculations, we can further define some of the cells in the table:

  • True Positives (TP): Correctly captured an intended vote for the democrats as a vote for the democrats (0)
  • True Negatives (TN): Correctly captured a vote NOT intended for the democrats as a vote, not for the democrats (97)
  • False Positives (FP): Incorrectly captured a vote NOT intended for the democrats as a vote for the democrats (0)
  • False Negatives (FN): Incorrectly captured an intended vote for the democrats as a vote not for the democrats (3)

Now we can restate our accuracy equation in terms of the cells above, Accuracy = (TP + TN) / (TP + TN + FP + FN), and calculate it for our naïve classifier using the values from the table: (0 + 97) / (0 + 97 + 0 + 3) = 0.97.

97% Accuracy! We just created the world’s stupidest classifier and achieved 97% accuracy! And therein lies the rub. The second I expose this classifier to the real world with a more balanced set of inputs across classes we will quickly see our accuracy plummet. Hence, we need a better set of metrics. Ladies and gentlemen, I am delighted to introduce …

  • Precision: Of the votes recorded (predicted) for the Democrats, how many were correct? Precision = TP / (TP + FP)

  • Recall: Of all possible votes for the Democrats, how many did we find? Recall = TP / (TP + FN)

What becomes blatantly clear from evaluating these two metrics is that our classifier, which appeared to have great accuracy, is terrible. Precision is 0 / (0 + 0), which is undefined (and conventionally treated as 0), and recall is 0 / (0 + 3) = 0. None of the intended votes for the democrats were correctly captured, and of all possible intended votes for the democrats, we found none of them. It’s worth noting that the example I’ve presented here is for a binary classifier (democrat, not democrat), but these metrics can easily be adapted to multi-class systems that more accurately reflect our actual candidate choices in the United States.
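If you’d like to verify those numbers yourself, here is a short snippet that computes accuracy, precision, and recall straight from the confusion-matrix counts above (treating the undefined 0/0 precision as 0, which is one common convention):

# Counts from the "always predict Not a Democrat" classifier above.
TP, FN, FP, TN = 0, 3, 0, 97

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP) if (TP + FP) else 0.0
recall = TP / (TP + FN) if (TP + FN) else 0.0

print(f"accuracy:  {accuracy:.2%}")   # 97.00% -- looks great
print(f"precision: {precision:.2%}")  # 0.00%  -- no recorded democrat vote was correct
print(f"recall:    {recall:.2%}")     # 0.00%  -- no intended democrat vote was found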

There’s No Such Thing as 100% Precision and Recall

Gödel’s incompleteness theorem, which loosely states that every non-trivial formal system is either incomplete or inconsistent, likely applies to machine learning and artificial intelligence systems. In other words, since machine learning algorithms are built around our known formal mathematical systems, there will be some truths they can never describe. A consequence of that belief, and something I preach to everyone I work with, is that there is really no such thing as 100% precision and recall, no matter how great your model is and what your test metrics tell you. There will always be edge cases.

So if 100% precision and recall is all but impossible, what do we do? When developing products around machine learning classifiers, we often ask ourselves what is most important to the customer: precision, recall, or both. For example, if I create a facial recognition system that notifies the police if you are a wanted criminal, we probably want to err on the side of precision, because arresting innocent individuals would be intolerable. But in other cases, like flagging inappropriate images on a social network for human review, we might want to err on the side of recall, so we capture most if not all offending images and allow humans to further refine the set.

It turns out that very often precision and recall can be traded off. Most classifiers emit a confidence score of sorts (such as a softmax output), and by simply varying the threshold on that output we can trade off precision for recall and vice versa. Another way to think about this is: if I require my classifier to be very confident in its output before I accept the result, I can tip the results in favor of precision. If I loosen my confidence threshold, I can tip it back in favor of recall.
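Here is a small sketch of that trade-off. The confidence scores and true labels are made up purely for illustration, but sweeping a threshold over them shows the pattern: a high threshold buys precision at the cost of recall, and a lower threshold does the reverse.

# Each pair is (classifier confidence that the ballot is a democrat vote, true label).
votes = [(0.95, 1), (0.90, 1), (0.85, 1), (0.70, 0), (0.65, 1),
         (0.55, 0), (0.45, 1), (0.35, 0), (0.25, 0), (0.10, 0)]

def precision_recall_at(threshold, data):
    predicted_pos = [label for score, label in data if score >= threshold]
    tp = sum(predicted_pos)
    fp = len(predicted_pos) - tp
    fn = sum(label for score, label in data if score < threshold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.9, 0.6, 0.3):
    p, r = precision_recall_at(threshold, votes)
    print(f"threshold {threshold:.1f}: precision {p:.2f}, recall {r:.2f}")
# threshold 0.9: precision 1.00, recall 0.40  (strict: accurate but misses votes)
# threshold 0.3: precision 0.62, recall 1.00  (loose: finds every vote, makes more mistakes)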

And how might this apply in voting? Well, if I structure my laws and regulations such that every voter must vote in person with 6 forms of ID and the vote is tallied in front of the voter by a 10-person bipartisan evaluation team who must all agree … we will likely have very high precision. After all, we’ve greatly increased the confidence in the vote outcome. But at what expense? We will also likely slow down the voting process and create massive lines which will significantly increase the number of people who might have intended to vote but don’t actually do so, hence decreasing recall.

Remind me again what the hell this has to do with Voting

The conservative-leaning Heritage Foundation makes the following statement on their website:

“It is incumbent upon state governments to safeguard the electoral process and ensure that every voter’s right to cast a ballot is protected.”

I strongly subscribe to that statement and I believe it is critical to the success of any representative democracy. But ensuring that every voter’s right to cast a ballot is protected, requires not only that we accurately record the captured votes, but also ensure that every voter who intends to vote is unhindered in doing so.

Maybe we need to move entirely to in-person voting while simultaneously allocating sufficient funds for more polling stations, government-mandated paid time off, and government-provided childcare. Or maybe we need all mail-in ballots but some new process or technology to ensure the accuracy of the votes. Ultimately, I don’t pretend to know the right answer, or if we even have a problem, to begin with. What I do know is that if we wish to improve our election systems we must first start with data on where we stand today and then tweak our laws and regulations to simultaneously optimize for precision and recall.

So, the next time a politician proposes changes to our election system ask … no demand, they provide data on the current system and how their proposed changes will impact precision and recall. Because only when we optimize for both these metrics can we stop worrying about making America great again and start working on making America even greater!