Introducing Task-Oriented Multiparty Conversational AI: Inviting AI to the Party

The term “conversational AI” has been around for some time. There are dozens of definitions all over the internet. But let me refresh your memory with a definition from NVIDIA’s website.

Conversational AI is the application of machine learning to develop language-based apps that allow humans to interact naturally with devices, machines, and computers using speech

There’s nothing wrong with that definition except for one small misleading phrase: “… allow humans to interact …”. What that should say is: “… allow a human to interact …”. Why? Because every interaction you’ve ever had with a conversational AI system has been one-on-one.

Sure, you and your kids can sit around the kitchen table blurting out song titles to Alexa (“Alexa, play the Beatles,” “No Alexa, play Travis Scott,” “No Alexa, play Olivia Rodrigo.” …). Alexa may even acknowledge each request, but she isn’t having a conversation with your family. She’s indiscriminately acknowledging and transacting on each request as if they’re coming in one by one, all from the same person.

And that’s where multiparty conversational AI comes into play.

What is Multiparty Conversational AI?

With a few small tweaks, we can transform our previous definition of conversational AI to one that accurately defines multiparty conversational AI.

Multiparty conversational AI is the application of machine learning to develop language-based apps that allow AI agents to interact naturally with groups of humans using speech

While the definitions may appear similar, they are fundamentally different. One implies a human talking to a machine, while our new definition implies a machine being able to interact naturally with a group of humans using speech or language. This is the difference between one-on-one interactions versus an AI agent interacting in a multiparty environment.

Multiparty conversational AI isn’t necessarily new. Researchers have been exploring multiparty dialog and conversational AI for many decades. I personally contributed to early attempts at building multiparty conversational AI into video games with the Kinect camera nearly fifteen years ago.1 But sadly no one has been able to solve all the technical challenges associated with building these types of systems and there has been no commercial product of note.

What about the “Task-Oriented” part?

You may have wisely noted that I have not mentioned the words “task-oriented” contained in the title of this post. Conversational AI (sometimes also called dialog systems) can be divided into two categories, open-domain and task-oriented.

Open-domain systems can talk about any arbitrary topic. The goal is not necessarily to assist any particular action, but rather engage in arbitrary chitchat. Task-oriented systems are instead focused on solving “tasks”. Siri and Alexa are both task-oriented conversational AI systems.

In multiparty systems, tasks become far more complicated. Tasks are usually the output of a conversation where a consensus is formed that necessitates action. Therefore any task-oriented multiparty conversational AI system must be capable of participating in forming a consensus or it will risk taking action before it is necessary to do so.

Multiparty Conversational AI, What is it Good For?

“Absolutely Everything!” Humans are inherently social creatures. We spend much of our time on this planet interacting with other humans. Some have even argued that humans are a eusocial species (like ants and bees) and that our social interactions are critical to our evolutionary success. Therefore, for any conversational AI system to truly become an integral part of our lives, it must be capable of operating amongst groups of humans.

Nowhere is this more evident than in a corporate work environment. After all, we place employees on teams, they have group conversations on Slack/Teams and email, and we constantly gather small groups of people in scheduled or ad-hoc meetings. Any AI system claiming to improve productivity in a work environment will ultimately need to become a seamless part of these group interactions.

Building Task-Oriented Multiparty Conversational AI Systems

There is a litany of complex problems that need to be solved to reliably build a task-oriented multiparty conversational AI system that would be production-worthy. Below is a list of the most critical areas that need to be addressed.

  • Task detection and dialog segmentation
  • Who’s talking to whom
  • Semantic parsing (multi-turn intent and entity detection)
  • Conversation disentanglement
  • Social graphs and user/organization preferences
  • Executive function
  • Generative dialog

In the next sections, we’ll briefly dive deeper into each of these areas.

Task Detection and Dialog Segmentation

In a single-party system such as Alexa or Siri, task detection is quite simple. You address the system (“Hey Siri …”) and everything you utter is assumed to be a request to complete a task (or follow up on a secondary step needed to complete a task). But in multiparty conversations, detecting tasks2 is far more difficult. Let’s look at two dialog segments below.

Two aspects of these conversations make accurately detecting tasks complex:

  • In the first dialog, our agent, Xena, is an active part of the conversation, and the agent is explicitly addressed. However, in the second conversation, our agent passively observed a task assigned to someone else and subsequently proactively offered assistance. That means we need to be able to detect task-oriented statements (often referred to as a type of dialog act) that might not be explicitly addressed to the agent.
  • The second issue is that the information necessary to complete either of these tasks is contained outside the bounds of the statement itself. That means we need to be able to segment the dialog (dialog segmentation) to capture all the utterances that pertain to the specific task.

Beyond the two challenges above there is also the issue of humans often making vague commitments or hedging on ownership. This presents additional challenges as any AI system must be able to parse whether a task request is definitive or not and be able to handle vague tasks or uncertain ownership.

Who’s Talking to Whom

To successfully execute the task in a multiparty conversation we need to know who is making the request and to whom it is assigned. This raises another set of interesting challenges. The first issue is, how do we even know who is speaking in the first place?

In a simple text-based chat in Slack, it is easy to identify each speaker. The same is true of a fully remote Zoom meeting. But what happens when six people are all collocated in a conference room? To solve this problem we need to introduce concepts like blind speaker segmentation and separation and audio fingerprinting.

But even after we’ve solved the upfront problem of identifying who is in the room and speaking at any given time there are additional problems associated with understanding the “whom”. It is common to refer to people with pronouns and in a multiparty situation you can’t just simply assume “you” is the other speaker. Let’s look at a slightly modified version of one of the conversations we presented earlier.

The simple assumption would be that the previous speaker (User 2) is the “whom” in this task statement. But after analyzing the conversation it is clear that “you” refers to User 1. Identifying the owner or “whom” in this case requires concepts like coreference resolution (who does “you” refer to elsewhere in the conversation) to correctly identify the correct person.

Semantic Parsing

Semantic parsing, also sometimes referred to as intent and entity detection, is an integral part of all task-oriented dialog systems. However, the problem gets far more complex in multiparty conversations. Take the dialog in the previous section. A structured intent and entity JSON block might look something like this:

    {
        "intent": "schedule_meeting",
        "entities": {
            "organizer": "User 1",
            "attendees": [
                "User 2",
                "User 3"
            ],
            "title": "next quarter roadmap",
            "time_range": "next week"
        }
    }

Note that not all of the details in this JSON block originated from our task-based dialog act. Rather, the information was pulled from multiple utterances across multiple speakers. Successfully achieving this requires a system that is exceptionally good at coreference resolution and discourse parsing.
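One way to picture this accumulation is as slot filling over a dialog segment. The sketch below is illustrative only: the `TaskFrame` class and the per-utterance extractions in `partials` are hypothetical stand-ins for whatever a real parser would emit, and it assumes coreference has already been resolved.

```python
from dataclasses import dataclass, field


@dataclass
class TaskFrame:
    """Accumulates one intent and its entities across several utterances."""
    intent: str = ""
    entities: dict = field(default_factory=dict)

    def merge(self, partial: dict) -> None:
        for key, value in partial.items():
            if key == "intent":
                # Keep the first intent that was detected.
                self.intent = self.intent or value
            elif isinstance(value, list):
                # List-valued slots (e.g. attendees) grow without duplicates.
                slot = self.entities.setdefault(key, [])
                for item in value:
                    if item not in slot:
                        slot.append(item)
            else:
                # Scalar slots are filled once and not overwritten.
                self.entities.setdefault(key, value)


# Hypothetical parser output, one dict per utterance in the segment.
partials = [
    {"title": "next quarter roadmap", "attendees": ["User 2"]},
    {"attendees": ["User 3"], "time_range": "next week"},
    {"intent": "schedule_meeting", "organizer": "User 1"},
]

frame = TaskFrame()
for partial in partials:
    frame.merge(partial)
```

Note that the intent arrives last here, which mirrors the point above: the task statement itself carries only part of the information needed to act.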

Conversation Disentanglement

While some modern chat-based applications (e.g. Slack) have concepts of threading that can help isolate conversations, we can’t guarantee that any given dialog is single-threaded. Meetings are nonthreaded and chat conversations can contain multiple conversations that are interspersed with each other. That means any multiparty conversational AI system must be able to pull apart these different conversations to transact accurately. Let’s look at another adaptation of a previous conversation:

In this dialog, two of our users have started a separate conversation. This can lead to ambiguity in the last request to our agent. User 3 appears to be referring to the previous meeting we set up, but knowing this requires we separate (or disentangle) these two distinct conversations so we can successfully handle subsequent requests.

Social / Knowledge Graph and User Preferences

While this might not be obvious, when you engage in any multiparty conversation you are relying on a database of information that helps inform how you engage with each participant. That means any successful multiparty conversational AI system needs to be equally aware of this information. At a bare minimum, we need to know how each participant relates to each other and their preferences associated with the supported tasks. For example, if the CEO of the company is part of the conversation you may want to defer to their preferences when executing any task.

Executive Function

Perhaps most importantly, any task-oriented multiparty conversational AI system must have executive function capabilities. According to the field of neuropsychology, executive function is the set of cognitive capabilities humans use to plan, monitor, and execute goals.

Executive function is critically important in a multiparty conversation because we need to develop a plan for whether we immediately take action on any given request or if we must seek consensus first. Without these capabilities, an AI system will just blindly execute tasks. As described earlier in this post this is exactly how your Alexa behaves today. If you and your kids continuously scream out “play <song name x>” it will just keep changing songs without any attempt to build consensus and the interaction with the conversational AI system will become dysfunctional. Let’s look at one more dialog interaction.

As you can see in the example above, our agent didn’t just automatically transact on a request to move the meeting to Wednesday. Instead, the agent used its executive function to do a few things:

  • Recognize that the second requester was not the originator of the request
  • Preemptively pull back information about whether the proposal was viable
  • Seek consensus with the group before executing

Achieving this capability requires the gathering of data previously collected, developing a plan, and then executing against that plan. So for a task-oriented multiparty conversational AI system to correctly operate within a group, it must have executive function capabilities.
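As a rough illustration of that plan-then-act loop, here is a toy decision rule. Everything in it is an assumption made for the example: the `ChangeRequest` fields, the action names, and the policy itself are hypothetical, and a real system would need far richer state than three fields.

```python
from dataclasses import dataclass


@dataclass
class ChangeRequest:
    requester: str    # who is asking for the change
    task_owner: str   # who originated the task being changed
    is_viable: bool   # e.g. the outcome of an availability lookup


def plan_action(request: ChangeRequest) -> str:
    """Pick the agent's next step instead of blindly executing the request."""
    if request.requester == request.task_owner:
        # The originator is changing their own task: act on it directly.
        return "execute"
    if not request.is_viable:
        # Someone else proposed a change that cannot work: report back.
        return "report_conflict"
    # Someone else proposed a workable change: ask the group first.
    return "seek_consensus"


decision = plan_action(ChangeRequest("User 3", "User 1", True))
```

With these inputs `decision` is `"seek_consensus"`, which corresponds to the behavior described in the bullets above.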

Generative Dialog Engine

Last but not least, any conversational AI system must be able to converse with your users. However, because the number of people in any given conversation and their identities are not predictable and our executive functions can cause a wide array of responses, no predefined or templated list will suffice for generating responses. A multiparty system will need to take all our previously generated information and generate responses on demand.

Wait, Don’t Large Language Models (LLMs) Solve This?

With all the hype, you’d think LLMs could solve the problem of task-oriented multiparty conversational AI – and weave straw into gold. But it turns out that, at best, LLMs are just a piece in a much larger puzzle of AI technology needed to solve this problem.

There are basic problems like the fact that LLMs are purely focused on text and can’t handle some of the speaker identification problems discussed earlier. But even more importantly there is no evidence that LLMs have native abilities to understand the complexities of social interactions and plan their responses and actions based on that information.

It will require a different set of technologies, perhaps leveraging LLMs in some instances, to fully build a task-oriented multiparty conversational AI system.

So When Can I Invite an AI to Join the Party?

While I can’t say anyone has solved all the challenges discussed in this post, I can say we are very close. My team at Xembly has developed what we believe is the first commercial product capable of participating in multiparty conversations as both a silent observer and an active participant. Our AI agent can join in-person meetings or converse with a group in Slack while also helping complete tasks that arise as a byproduct of these conversations.

We are only just beginning to tackle task-oriented multiparty conversational AI. So we may not be the life of the party, but go ahead and give Xembly and our Xena AI agent a try. The least you can do is send us an invite!

  1. With the Kinect Camera, we hoped to individually identify speakers in a room so each user could independently interact with the game. You can read more details about our work in this space here: 1, 2, 3, 4 ↩︎
  2. Are tasks in multiparty conversations just action items? Yes, since an action item is generally defined as a task that arises out of a group’s discussion. I’ll be writing a larger deep dive into action item detection in a future post. ↩︎

End-to-End Speech Recognition: Part 1 – Neural Networks for Executives (I Mean Dummies)

When I originally contemplated the subject of my next blog post, I thought it might be interesting to provide a thorough explanation of the latest and greatest speech recognition algorithms, often referred to as End-to-End Speech Recognition, Deep Speech, or Connectionist Temporal Classification (CTC).   However, as I began to research the topic I quickly discovered that my basic knowledge of neural networks was woefully lacking.  Several weeks of reading and a few hundred lines of code later, I realized before I could teach a fellow plebe like myself about end-to-end speech recognition,  I probably needed to introduce the fundamentals first.

With that in mind, what was intended to be a single entry will likely turn into multiple blog posts covering an overview of end-to-end speech recognition and some fundamentals of deep learning that make it possible.  In this first post I’d like to provide a brief introduction to end-to-end speech recognition and then give a more detailed tutorial about one of the basic components of deep learning, a multilayer perceptron, also known as a feed forward neural network.  I’ll then walk you through how I brought all this information together while building a very basic end-to-end speech recognition system.

End-to-End Speech Recognition

So what is end-to-end speech recognition anyway? At its most basic level, an end-to-end speech recognition solution aims to train a machine to convert speech to text by directly piping raw audio input with associated labeled text through a deep learning algorithm. The resulting model is then able to recognize speech with no further algorithmic components.


And why is this any better than traditional speech recognition systems? Traditional speech recognition systems use a much more complicated architecture that includes feature generation, acoustic modeling, language modeling, and a variety of other algorithmic techniques in order to be accurate and effective. This in turn makes the training, testing, and code complexity far more difficult than it would be with an end-to-end system.


In other words an end-to-end solution greatly reduces the complexity in building a speech recognition system.   And if that alone doesn’t convince you of the value an end-to-end recognizer brings to the table, several research teams, most notably the folks at Baidu, have shown that they can achieve superior accuracy results over traditional speech recognition systems.

To validate the possibilities of an end-to-end speech recognition system I decided to build my own.  However, I quickly found that building such a system required advanced knowledge of deep learning techniques.   This is because the current end-to-end systems generally rely on more complex neural network algorithms like Recurrent Neural Networks (RNNs) and something called the connectionist temporal loss function that are difficult to understand if you don’t have a solid understanding of basic neural networks.   So I opted to take a simpler approach and see if I could build a very simple end-to-end recognizer using basic deep learning techniques.   Specifically a feed forward neural network or multi layer perceptron.

Neural Network Fundamentals

Before I dive into the details, let me provide a quick tutorial on the feed forward neural network. The underlying element of a neural network is called a perceptron or an artificial neuron. Much like a biological neuron, a perceptron takes a series of inputs, performs a function on those inputs, and produces an output that can be passed to other neurons.


The simplest function is just a sum of weighted inputs.  However this function is a linear relationship and the world is rarely linear so we apply something called an activation function to help impart nonlinearity.   There are actually numerous activation functions used in neural networks, some linear and some not, but the Sigmoid and TanH functions are two you will commonly see in the relevant literature.
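A minimal sketch of a single neuron makes this concrete. The input, weight, and bias values below are arbitrary numbers chosen for illustration, and the function names are my own rather than anything from the project code.

```python
import numpy as np


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def perceptron(inputs, weights, bias, activation):
    """Weighted sum of the inputs followed by a nonlinear activation."""
    return activation(np.dot(inputs, weights) + bias)


x = np.array([0.5, -1.0, 2.0])   # three example inputs
w = np.array([0.4, 0.3, -0.2])   # one weight per input
b = 0.1                          # bias term

# Weighted sum: 0.2 - 0.3 - 0.4 + 0.1 = -0.4
out_sigmoid = perceptron(x, w, b, sigmoid)   # lands between 0 and 0.5
out_tanh = perceptron(x, w, b, np.tanh)      # lands between -1 and 0
```

The same weighted sum produces different outputs depending on the activation: Sigmoid maps it into (0, 1) while TanH maps it into (-1, 1).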


Now that we know what a neuron is, a neural network is really just a collection of multiple interconnected neurons.   Neurons are grouped and connected in “layers”.   The simplest neural network is a single layer network that connects one or more inputs to one or more outputs.   There is no calculation on the input layer, only the output layer.


Neural networks can grow in complexity by adding additional layers which are commonly referred to as “hidden layers”. In theory a network can contain an infinite number of layers with an infinite number of neurons, although this is neither practical nor necessary.


The only remaining question then is how do we know what weights will give us the outputs we are looking for? A simple feed forward neural network uses a technique called forward and back propagation to train the network and find the optimal weights. There are dozens of books and blog posts devoted to the subject of how the forward and back propagation algorithms work, but for the sake of this blog post I’ll provide an introductory explanation along with pointers to additional information.

The main idea requires randomly initializing our weights and pushing the inputs “forward” through the network so we can make an output prediction.   We then use a cost or loss function to calculate how far our prediction was from the expected result.

Our ultimate goal is to reduce our error or cost to the lowest point possible (sometimes referred to as the global minimum). To do this we use an algorithm called gradient descent. Gradient descent works by computing the partial derivative of the cost function with respect to each weight and then nudging each weight in the opposite direction.


In other words, we’re looking for the direction (+/-) and slope of our cost function to tell us how much to adjust our weights, and in which direction, in order to get to zero cost (or close to it). If the gradient is 0 we have reached a minimum. While I won’t go into the details, thanks to the chain rule in calculus we can start at the output layer, compute the gradient, and “back” propagate it to the next layer and all the way back to our inputs. Along the way we are calculating how much we need to adjust our weights to get closer to that zero cost.
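To see the idea on the smallest possible problem, here is a sketch of gradient descent on a single weight with a squared-error cost. The training pair, starting weight, and learning rate are arbitrary values picked for the example, not taken from the recognizer.

```python
def cost(w, x, y):
    """Squared error of a one-weight linear neuron."""
    return (w * x - y) ** 2


def gradient(w, x, y):
    """Derivative of the cost with respect to w, via the chain rule."""
    return 2.0 * x * (w * x - y)


x, y = 2.0, 6.0        # one training example; the ideal weight is 3.0
w = 0.0                # starting weight
learning_rate = 0.1

for step in range(20):
    # Move against the gradient: a positive slope decreases w, and vice versa.
    w = w - learning_rate * gradient(w, x, y)

final_cost = cost(w, x, y)
```

Each update shrinks the distance to the ideal weight by a constant factor, so after twenty steps `w` is extremely close to 3.0 and the cost is close to zero. Back propagation applies this same update to every weight in every layer.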


When training a neural network we continue to forward and back propagate until we have minimized the error. While I have grossly oversimplified the explanation for forward and back propagation, this is fundamentally how neural networks work. I have provided links to more detailed descriptions at the end of this post.

Putting it All Together

Now that we have some basic knowledge of end-to-end speech recognition systems and neural networks, we’re ready to make a simple end-to-end speech recognizer.  To build this recognizer I used python and the numpy library to help with the matrix math.

However, before we start we need a simple speech data set, preferably one consisting of utterances with only single words. This would eliminate the need to deal with time alignment (i.e. which text goes with which audio segment in time). Luckily I found a great freely available dataset consisting of people speaking single digits 0 – 9 with fifty utterances per digit per person. This data set met the criteria of being a single word while also being large enough to train a neural network.

With labeled audio data in hand, the next step is reading in the audio data and the associated labels. For this I used the python librosa library. Librosa provides easy to use out-of-the-box functions for computing the Short Time Fourier Transform (STFT), which is necessary to get the frequency spectrum of our audio signal (i.e. our input signal). Librosa additionally provides handy functions for computing other audio features like Mel Frequency Cepstral Coefficients (MFCC), which can also be a useful audio input feature (note my code provides an alternative implementation that uses MFCCs instead of the raw spectrum).

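import os
import sys

from librosa import feature, load, magphase, stft

# file_list, output_type, nfft and nmfcc are defined elsewhere in the project.
for files in file_list:
    relative_path = 'recordings/' + files[0]
    file_name = os.path.join(os.path.dirname(__file__), relative_path)
    y, sr = load(file_name, sr=None)  # sr=None keeps the file's native sample rate
    # The hop length is derived from the in-memory size of the audio array.
    filesize = sys.getsizeof(y)
    if output_type == 'spectrum':
        spectrum = stft(y, n_fft=nfft, hop_length=int(filesize / 2))
        mag, phase = magphase(spectrum)
    # Recent librosa releases require keyword arguments for y and sr.
    mfcc = feature.mfcc(y=y, sr=sr, n_mfcc=nmfcc, hop_length=int(filesize / 2))
    mfcc = mfcc[1:nmfcc]  # drop the first coefficient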

Beyond the audio, we also need to store the associated digit spoken in each audio file. When training a multiclass classifier (in our case our classes are 0 – 9) it’s common to use something called “one hot” vectors to represent the output. This is just a vector where all the classes are represented by 0 except for the one element representing the actual output class, which is set to 1. So in our case we have a 10 element vector and if the audio file is someone saying “one” the vector would look like [0 1 0 0 0 0 0 0 0 0].

class digits:
    zero    = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    one     = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    two     = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
    three   = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
    four    = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
    five    = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
    six     = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
    seven   = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
    eight   = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
    nine    = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
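The same vectors can also be generated rather than written out by hand. This is a small sketch using numpy; the `one_hot` and `decode` helpers are names I made up for the example and are not part of the project code.

```python
import numpy as np


def one_hot(digit, num_classes=10):
    """Return a vector of zeros with a single 1 at position `digit`."""
    vector = np.zeros(num_classes, dtype=int)
    vector[digit] = 1
    return vector


def decode(vector):
    """Map a one-hot (or network output) vector back to its digit."""
    return int(np.argmax(vector))


label_for_one = one_hot(1)
predicted_digit = decode(np.array([0.1, 0.05, 0.7, 0.0, 0.0, 0.05, 0.0, 0.1, 0.0, 0.0]))
```

`decode` is also how a network prediction is usually read: the output neuron with the largest activation names the recognized digit, which is 2 for the example vector above.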

With our inputs and outputs squared away it’s time to define our network. The variables that make up your network are also known as hyper-parameters. For my end-to-end recognizer I selected the following hyper-parameters: (*Note that selecting hyper-parameters is half art and half science and your choices will be critical to the success of your network.  I have provided additional resources below)

  • Number of layers: 3 (input, output and one hidden layer)
  • Nodes in hidden layer: 2048 (1x our frequency bins)
  • Activation functions: TanH (Hidden), Sigmoid (Output)
  • Weight initialization algorithm: Xavier or Glorot
  • Learning rate = 0.001
  • Rate decay = 0.0001

input_layer = layers.Layer(inputs=training_inputs.shape[0], neurons=training_inputs.shape[1] + 1)
if mode == 'E2E':
    # NOTE: the next two constructor calls are cut off in the source; their trailing arguments are missing.
    hidden_layer = layers.Layer(inputs=training_inputs.shape[1] + 1, neurons=2048,
    output_layer = layers.Layer(inputs=2048, neurons=training_outputs.shape[1],
    output_layer.Initialize_Synaptic_Weights()

nnet = NeuralNetwork(layer1=input_layer, layer2=hidden_layer, layer3=output_layer, learning_rate=0.001,
                     learning_rate_decay=0.0001, momentum=0.5)
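For readers wondering what Xavier (Glorot) initialization does, here is a sketch of the uniform variant described in the Glorot and Bengio paper listed in the resources. It is not the project's `Initialize_Synaptic_Weights` implementation, which is not shown in this post; the `glorot_uniform` name and the fixed seed are assumptions for the example.

```python
import numpy as np


def glorot_uniform(fan_in, fan_out, seed=0):
    """Draw weights uniformly from [-limit, limit], limit = sqrt(6 / (fan_in + fan_out))."""
    rng = np.random.default_rng(seed)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


# Shape matching a 2048-node hidden layer feeding 10 output classes.
weights = glorot_uniform(2048, 10)
limit = np.sqrt(6.0 / (2048 + 10))
```

Scaling the range by the layer sizes keeps the initial activations from saturating TanH or Sigmoid units, which is why the choice of initializer matters for a network this wide.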

So now that we have our inputs and outputs, and we’ve defined our network, all we need to do is train using our forward and back propagation functions. Per my earlier description the forward propagation algorithm is quite simple and is really just summing the weighted inputs and applying the activation functions. Using matrix math this can be written in three or four simple lines of code.

def Feed_Forward(self, inputs):
    self.l1_inputs[:,0:self.layer1.neurons-1] = inputs
    self.l2_hidden = self.layer2.activation(dot(self.l1_inputs, self.layer2.synaptic_weights))
    self.l3_output = self.layer3.activation(dot(self.l2_hidden, self.layer3.synaptic_weights))
    return  self.l3_output

The forward propagation algorithm gives us our predicted output.  Using that predicted output we can perform our back propagation.  Much like my earlier explanation we need to perform a series of steps for each layer.   Specifically we need to calculate the error, calculate the gradient, and adjust our weights based on the previous two calculations.

def Back_Propogate(self, outputs):
    # Output layer: error and delta.
    output_deltas = numpy.zeros((self.layer1.inputs, self.layer3.neurons))
    l3_output_error = -(outputs - self.l3_output)
    if self.layer3.activation_derivative == activationfunctions.Sigmoid_Activation_Derivative:
        output_deltas = self.layer3.activation_derivative(self.l3_output) * l3_output_error
    elif self.layer3.activation_derivative == activationfunctions.softmax_derivative:
        output_deltas = l3_output_error
    elif self.layer3.activation_derivative == activationfunctions.Oland_Et_Al_Derivative:
        output_deltas = self.layer3.activation_derivative(self.l3_output) - outputs
    # NOTE: the right-hand sides of l2_hidden_error, adjustment1 and adjustment2 are missing
    # in the source. The expressions below are the standard backpropagation forms,
    # reconstructed from the array shapes; treat them as an assumption.
    hidden_deltas = numpy.zeros((self.layer1.inputs, self.layer2.neurons))
    l2_hidden_error = dot(output_deltas, self.layer3.synaptic_weights.T)
    hidden_deltas = self.layer2.activation_derivative(self.l2_hidden) * l2_hidden_error
    # Weight updates: gradient of the cost with respect to each weight matrix.
    adjustment1 = dot(self.l2_hidden.T, output_deltas)
    self.layer3.synaptic_weights = self.layer3.synaptic_weights - (adjustment1 * self.learning_rate) #+ self.l3_output_adjustment * self.momentum
    self.l3_output_adjustment = adjustment1
    adjustment2 = dot(self.l1_inputs.T, hidden_deltas)
    self.layer2.synaptic_weights = self.layer2.synaptic_weights - (adjustment2 * self.learning_rate) #+ self.l2_hidden_adjustment * self.momentum
    self.l2_hidden_adjustment = adjustment2
    return l3_output_error  # assumed: Train() expects an error value back

To bring it all together we just need to iterate over our forward and back propagation algorithms until we have stopped learning or have reduced our cost or error to its lowest possible point.

def Train(self, inputs, outputs, iterations):
    for iteration in range(iterations):
        error = 0.0
        # Shuffling is turned off: arange() yields the identity ordering.
        randomize = numpy.arange(len(inputs))
        inputs = inputs[randomize]
        outputs = outputs[randomize]
        # Forward pass first so Back_Propogate sees the current predictions
        # (this call is missing from the listing in the source).
        self.Feed_Forward(inputs)
        error = self.Back_Propogate(outputs)
        error = numpy.average(error)
        if iteration % 10 == 0:
            print('error %-.5f' % error)
        # learning rate decay: equivalent to dividing by (1 + learning_rate_decay)
        self.learning_rate = self.learning_rate * (
            self.learning_rate / (self.learning_rate + (self.learning_rate * self.learning_rate_decay)))
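The decay expression at the end of the training loop is easier to reason about once simplified: multiplying the rate by rate / (rate + rate * decay) is the same as dividing it by (1 + decay) on every iteration. The sketch below checks that numerically; the `decayed_learning_rate` helper and the 1000-iteration horizon are assumptions for the example, while the rate and decay values are the hyper-parameters listed earlier.

```python
def decayed_learning_rate(initial_rate, decay, iterations):
    """Apply the training loop's decay update `iterations` times."""
    rate = initial_rate
    for _ in range(iterations):
        rate = rate * (rate / (rate + rate * decay))
    return rate


initial_rate = 0.001
decay = 0.0001

looped = decayed_learning_rate(initial_rate, decay, 1000)
closed_form = initial_rate / (1.0 + decay) ** 1000
```

Both values agree to floating-point precision, and with these settings the rate falls by roughly ten percent over 1000 iterations, so the decay is gentle.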

That’s it!  While there is a lot more glue code and learning that went into this implementation what I have presented here represents the fundamental building blocks of a basic end-to-end speech recognition system.  I have made the full project available on GitHub and you can evaluate the code yourself in order to fully comprehend all the details.  I’ve also provided a bevy of resources below that helped get me to this point and can do the same for you.

Final Thoughts

You might be asking why a senior leader in my position would spend the time required to go through this exercise.  There are some general principles I like to follow and I think anybody managing a research oriented (or really any engineering) team should consider as well.  Specifically:

  • ABL – Always Be Learning:  If you want to innovate you need to be up to speed on the latest technology trends.
  • Earn your team’s respect:  The best way to earn the respect of your technical team is to get into the trenches.  Show them that you understand their job and all the pain that comes with it.  In other words write code (any code), test it, check it in, and push it to production.
  • Lead by example: If you want your team to “innovate for the masses”, it’s best to demonstrate the behaviors you are looking for.

Hopefully this post has given you a basic understanding of end-to-end speech recognition systems and neural networks. If you’re really brave perhaps you’ve learned how to build your own simple end-to-end recognizer. But if you take nothing else away from this article I hope it’s that you’ll invest your time improving your own technical skills and getting in the trenches to earn your team’s respect.

In an upcoming post I’ll dig deeper into end-to-end speech recognition algorithms and how they work. Specifically we’ll cover recurrent neural networks and the connectionist temporal classification algorithms that truly allow these systems to be superior to traditional speech recognition systems. In the meantime I hope you get a chance to “wreck a nice beach”!

  1. “How to build a simple neural network in 9 lines of Python code” – Milo Spencer-Harper
  2. “How to build a multi-layered neural network in Python” – Milo Spencer-Harper
  3. “Understanding and coding Neural Networks from Scratch in Python and R” – Sunil Ray
  4. “How to Compute the Derivative of Sigmoid Function (fully worked example)” – Jeremy (no last name)
  5. “Practical Recommendations for Gradient-Based Training of Deep Architectures” – Yoshua Bengio
  6. “How to train your Deep Neural Network” – Rishabh Shukla
  7. “Understanding the difficulty of training deep feedforward neural networks” – Xavier Glorot and Yoshua Bengio
  8. “Deep Learning Basics: Neural Networks, Backpropagation, and Stochastic Gradient Descent” – Alex Minnaar
  9. “Speech Recognition: You down with CTC” – Karl N.
  10. “Deep Speech: Scaling up end-to-end speech recognition” – Andrew Y. Ng et al.
  11. “Connectionist Temporal Classification: Labeling Unsegmented Sequence Data with Recurrent Neural Networks” – Alex Graves et al.

Microsoft’s 5.1% Word Error Rate (WER) Announcement is Complete and Utter Bullshit


I apologize! That title was actually generated by Microsoft’s speech recognition system incorrectly transcribing “Microsoft’s 5.1% Word Error Rate (WER) Announcement is Completely Misleading”.   Okay, that was snarky, but I promise Microsoft compelled me to write that.  You see in the course of editing my previous post Microsoft had to go and put out a press release announcing “Microsoft Researchers Achieve new Conversational Speech Recognition Milestone”.  Their announcement flies in the face of my previous post and therefore I had no choice but to attempt an epic takedown.

Before I try to dismantle Microsoft’s irrational claim I would like to state that the researchers at Microsoft (some of whom I have crossed paths with while working on the Xbox Kinect and HoloLens) have done some solid research with potential implications on how we build production speech recognition systems. I have no issues with the technical nature of the research paper underpinning the press release, but I do take issue with the marketing and PR spin applied on top of it. So without further ado “LET’S GET READY TO RUMBLE”.

There are two primary issues with the announcement made by Microsoft:

  1. Does Microsoft’s testing provide conclusive evidence that the 5.1% WER results will generalize?
  2. Are the tactics used viable from a cost/compute/timeliness perspective in a production system?

Let’s tackle each of these issues independently.

Will the Results Generalize

In my previous post I discussed why large data sets were critical for training truly accurate conversational speech recognition systems.   While I do take issue with the data size used to train the Microsoft speech recognition system, the larger issue is with the test set used to validate the word error rate.

In Andrew Ng’s seminal talk on the “nuts and bolts of machine learning”, he goes into great detail on the different data sets required for training, testing and validating machine learning algorithms. I encourage anybody interested in the optimal process for training and testing machine learning and AI algorithms to watch this seriously awesome video. In terms of Microsoft’s research I want to focus on the relatively small size of their test corpus, its overlap with the training data, and the fact that the chosen corpus appears cherry-picked.

Corpus Size

The test set Microsoft selected for calculating the reported 5.1% WER is the 2000 NIST CTS SWITCHBOARD corpus. While I was unable to find the specific number of hours of conversation in this test corpus, I was able to confirm that the 1998 and 2001 NIST CTS data sets contained 3 and 5 hours of conversation respectively. We can therefore assume the 2000 set is similar in duration. Given the overall size of the conversational speech domain explained in my previous post, a test set of this size is hardly sufficient for making any broad claims about meeting or beating human transcription accuracy.

Training Data Overlap

As you dig into the details of the NIST corpus a dirty little secret is quickly revealed.  Let me start by quoting directly from the source:

“Of the forty speakers in these conversations thirty-six appear in conversations of the published Switchboard Corpus.”

Let me translate that for you. Thirty-six of the forty speakers in the test corpus are the same speakers used in Microsoft’s training corpus. I’ll also remind you that the Switchboard corpus only has 543 speakers to begin with. This raises a foundational question about whether the test data is really distinct from the training set. You see, almost all modern speech recognition systems use something called i-vectors to help achieve speaker independence (sometimes called speaker adaptation). Since the same speakers, on the same devices, in the same environments exist in both the training and test corpora, there will invariably be a correlation between the i-vectors generated from the two data sets.

Per the diagram below, a truly honest measure of WER would require that the test data be truly distinct from the training set. In other words, it should pull from a data set that includes different speakers, different content, and different acoustic environments. What is clear from the Microsoft paper is that this didn’t happen, which calls into question whether the published results will truly generalize. It also greatly diminishes the validity of any claim about a new “milestone” being achieved in conversational speech recognition.
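To make the overlap problem concrete, here is a minimal Python sketch of the kind of check I would want to see before trusting a test set. The speaker IDs are made up for illustration (they are not real Switchboard identifiers); the counts simply mirror the 36-of-40 overlap and the 543 speakers mentioned above.

```python
from typing import Iterable, Set


def speaker_overlap(train_speakers: Iterable[str], test_speakers: Iterable[str]) -> Set[str]:
    """Return the speaker IDs that appear in both the training and test sets."""
    return set(train_speakers) & set(test_speakers)


def overlap_fraction(train_speakers: Iterable[str], test_speakers: Iterable[str]) -> float:
    """Fraction of distinct test speakers that were also seen in training."""
    test = set(test_speakers)
    if not test:
        return 0.0
    return len(speaker_overlap(train_speakers, test)) / len(test)


# Hypothetical IDs: 543 training speakers, 40 test speakers, 36 of them shared.
train = [f"spk{i:03d}" for i in range(543)]
test = [f"spk{i:03d}" for i in range(36)] + [f"new{i}" for i in range(4)]

print(f"{overlap_fraction(train, test):.0%} of test speakers also appear in training")
```

For a test set that is truly distinct by speaker, that fraction should be zero. The same idea extends to content and acoustic environments, although those are harder to reduce to a simple ID comparison.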


It’s worth noting that the full 2000 NIST CTS corpus actually contains a total of 40 conversations. Twenty of those conversations are from the Switchboard corpus and twenty are from a different corpus called “Call Home”. This raises the question of why Microsoft only validated against the Switchboard portion of the corpus. While I can’t say for sure what their intent was, my best guess is that if they had used the Call Home data the results would not have met the desired goal of meeting or beating “human accuracy”.

Taken together, the small corpus, the overlapping data, and the cherry-picked data set make it hard not to ask: did Microsoft really achieve a “new conversational speech recognition milestone”?

Is it Production Ready

EBTKS.  For those not familiar with texting slang, that stands for “Everything But the Kitchen Sink”,  and it’s really the best description of the system Microsoft used for this research.  This calls into question the production viability of their proposed solution.

Ensemble Models

At the acoustic model (AM) and language model (LM) layers Microsoft is using an ensemble model technique. This technique requires training multiple models and processing each utterance through every model. A separate algorithm is used to combine the outputs of the different models. In essence this equates to running multiple recognizers at once for every audio utterance. It already requires an enormous number of machines to transcribe phone calls in real time at scale. Microsoft appears to be running 4 distinct AMs and multiple LMs, which will have serious performance impacts. This raises questions about the number of machines and associated costs required to run a system like the one used in Microsoft’s paper.
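As a rough illustration of why ensembles multiply the compute bill, here is a minimal Python sketch of one simple way to combine models: averaging the probabilities each model assigns to candidate words. This is an assumed stand-in, not the combination algorithm from Microsoft’s paper, and the three sets of model scores are invented for the example. The point is that every model has to run on every utterance before anything can be combined.

```python
from typing import Dict, List


def average_posteriors(model_outputs: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the probability each model assigns to each candidate label."""
    if not model_outputs:
        return {}
    labels = set().union(*model_outputs)  # every label proposed by any model
    n = len(model_outputs)
    return {
        label: sum(output.get(label, 0.0) for output in model_outputs) / n
        for label in labels
    }


def ensemble_decision(model_outputs: List[Dict[str, float]]) -> str:
    """Pick the label with the highest averaged probability."""
    combined = average_posteriors(model_outputs)
    return max(combined, key=combined.get)


# Invented scores from three hypothetical models for the same word slot.
outputs = [
    {"claim": 0.6, "clam": 0.4},
    {"claim": 0.3, "clam": 0.7},
    {"claim": 0.8, "clam": 0.2},
]
print(ensemble_decision(outputs))
```

Each entry in `outputs` represents a full recognizer pass, so the cost of producing that list grows linearly with the number of models in the ensemble.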


On top of the ensemble modeling Microsoft is also using language model rescoring. In order to rescore you usually have an initial language model produce an N-best lattice, which is basically the top N paths predicted by the language model. This lattice needs to be stored or held in memory in order for the rescoring to take place. In Microsoft’s case they are generating a 500-best lattice. While not crazy, holding a 500-best lattice in memory in a scaled production speech recognition system would not be ideal unless it provided significant accuracy gains. According to the paper the gains from rescoring were minimal at best.
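For readers who have not seen rescoring before, the idea can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions: the hypotheses, the scores, the interpolation weight, and the toy second-pass scorer are all made up, and a real system would rescore with a much larger language model over a proper lattice.

```python
from typing import Callable, List, Tuple

Hypothesis = Tuple[str, float]  # (transcript, first-pass log score)


def rescore_nbest(
    nbest: List[Hypothesis],
    second_pass_lm: Callable[[str], float],
    lm_weight: float = 0.5,
) -> List[Hypothesis]:
    """Re-rank an N-best list by interpolating first-pass and second-pass log scores."""
    rescored = [
        (text, (1.0 - lm_weight) * score + lm_weight * second_pass_lm(text))
        for text, score in nbest
    ]
    # Highest combined score first.
    return sorted(rescored, key=lambda hyp: hyp[1], reverse=True)


# Toy 2-best list; the whole list must stay in memory until rescoring finishes.
nbest = [
    ("completely miss leading", -4.0),
    ("completely misleading", -4.5),
]

# Hypothetical second-pass scores standing in for a stronger language model.
toy_scores = {"completely miss leading": -9.0, "completely misleading": -3.0}


def toy_lm(text: str) -> float:
    return toy_scores.get(text, -20.0)


best_text, best_score = rescore_nbest(nbest, toy_lm)[0]
print(best_text, best_score)
```

With a 500-best list the loop is the same, but the memory held per utterance and the number of second-pass scoring calls both grow by the size of the list.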

In Conclusion

So where does that leave us? Microsoft has done some great research on advancing speech recognition algorithms, research that I greatly appreciate and hope to review further. However, for Microsoft to even imply that they achieved some epic milestone in matching human transcription accuracy is downright preposterous.

In the words of renowned Johns Hopkins speech recognition researcher Daniel Povey:

“… this whole competition between IBM and Microsoft on Switchboard is just a pissing contest, in which they both try to add in more data and bigger system combinations to beat the other one’s number. It doesn’t really indicate any special progress.”

“Blinded by the Light, Revved up like a ???”


I probably sang that Manfred Mann song a thousand times in my teen years and I was pretty sure the last word in that lyric was a feminine hygiene product until Google came along and taught me otherwise.   It turns out the lyrics to Blinded by the Light are very difficult to understand and so is conversational speech.

For my first substantive blog post on this site I’d like to continue on a theme we have been covering over at Marchex around the complexity in building automatic speech recognition (ASR) systems that can accurately understand unbounded conversational speech.   In this post I intend to dive a little deeper into WHY conversational ASR systems are so difficult to build, possible solutions to improve them, and the bounty for those who finally succeed.

There are really three primary issues that are limiting current systems from accurately recognizing conversational speech: Data, Data, and Data. More specifically: Required Data Size, Lack of Publicly Available Data Sets, and Cost and Complexity of Acquiring the Required Data.

Required Data Size

There is no strict answer for how much data is needed to solve a given machine learning problem, but one oft-cited rule is the “rule of 10”.   The rule of 10 states that you need roughly 10 times as many examples as you have parameters.   While there are multiple parts of an ASR system including an acoustic model (AM) and a language model (LM), for now I am going to focus on the LM.   One parameter used in an LM is called an n-gram, specifically in most cases a trigram.   A trigram is basically the probabilities of any 3 words being seen next to each other.   So if we take the rule of 10 that would imply we need 10 times the number of 3 word combinations required for our task.
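If trigrams are new to you, this minimal Python sketch shows how they fall out of spoken utterances. The two example utterances are invented, and real language model toolkits add smoothing and sentence-boundary markers that I am leaving out here.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Trigram = Tuple[str, ...]


def trigrams(utterance: str) -> List[Trigram]:
    """Return every run of three consecutive words in an utterance."""
    words = utterance.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def trigram_counts(utterances: Iterable[str]) -> Counter:
    """Count how often each trigram occurs across a collection of utterances."""
    counts: Counter = Counter()
    for utterance in utterances:
        counts.update(trigrams(utterance))
    return counts


# Invented utterances, including the kind of stutter people produce when speaking.
counts = trigram_counts(["I I I want to", "I want to go"])
for trigram, count in counts.most_common():
    print(" ".join(trigram), count)
```

Even these two short utterances produce four distinct trigrams, and only one of them (“i want to”) is seen more than once. That sparsity is what drives the data requirement.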

This is where the problem arises. You see, we humans write beautifully but we speak like idiots. Grammar goes out the window when people talk, we stutter, words are often repeated over and over while people search for their next thought, and honestly some folks downright make up words that don’t even exist. Taken together, that means one can expect to see almost ANY combination of 3 words in the wild. Everything from “a a a” to “zebra zebra zebra”. So if you don’t mind rewinding your brain to high school math and combinatorics, that means the number of 3 word combinations is:

| Number of Words in US English | ^3


| ~500,000 | ^3 = 125 QUADRILLION (i.e. a really #$%&’ing big number)

If we apply the rule of 10 we would need 1.25 QUINTILLION (i.e. an even bigger #$%&’ing number) utterances (basically a spoken sentence) containing examples of these trigrams. Let me put this in perspective for you. A single spoken utterance saved in a text file is roughly 50 bytes in size. So in order to store 1.25 QUINTILLION utterances I would need 50 * 1.25 QUINTILLION bytes of storage. Or … 62,500 Petabytes! For reference, 20 years of internet archiving only consumed 23 petabytes as of 2015. And if that doesn’t frame it for you, think about it this way. The average utterance duration is roughly 1.5 seconds. If I were to string 1.25 QUINTILLION recorded utterances together it would take approximately 60 billion years (about 60 million millennia) to play it back!
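If you want to check my math, the back-of-the-envelope numbers can be reproduced with a few lines of Python. The inputs are the rough estimates from this post (500,000 words, 50 bytes and 1.5 seconds per utterance); the year length of 365.25 days and the decimal petabyte (10^15 bytes) are my own assumptions for the conversion.

```python
VOCAB_SIZE = 500_000            # rough number of words in US English (estimate above)
RULE_OF_10 = 10                 # examples needed per parameter
BYTES_PER_UTTERANCE = 50        # rough size of one utterance stored as text
SECONDS_PER_UTTERANCE = 1.5     # rough average utterance duration
BYTES_PER_PETABYTE = 10 ** 15   # assumption: decimal petabytes
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumption: average year length

trigram_combinations = VOCAB_SIZE ** 3
utterances_needed = RULE_OF_10 * trigram_combinations
storage_petabytes = utterances_needed * BYTES_PER_UTTERANCE / BYTES_PER_PETABYTE
playback_years = utterances_needed * SECONDS_PER_UTTERANCE / SECONDS_PER_YEAR

print(f"trigram combinations: {trigram_combinations:.3e}")
print(f"utterances needed:    {utterances_needed:.3e}")
print(f"storage (petabytes):  {storage_petabytes:,.0f}")
print(f"playback (years):     {playback_years:.3e}")
```

Under these assumptions the storage comes out at 62,500 petabytes and the playback time lands in the tens of billions of years. Change any of the inputs by a factor of two or even ten and the conclusion stays the same.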

So what’s the point?   The point is that the data size required to cover all possible examples of spoken US English is almost inconceivable.  Is the rule of 10 an exact science?  No.  Does it matter?  No, because even if this estimate is wrong by 1/2 or 3/4 it is still huge.   Ultimately the data size needed to properly train a conversational ASR system is gargantuan.

Lack of Publicly Available Data Sets

Okay, so we need a lot of data. Can’t we just buy it? No! Most publicly available data sets are shockingly small compared to the size of the domain I described above. As my fellow Marchex coworkers reported in our recently published research paper, the size of 2 of the most commonly used data sets, Fisher English and Switchboard, is prohibitively small.

[Table: Switchboard vs. Fisher English vs. Marchex data set sizes]

Data Acquisition Cost and Complexity

Alright, if you can’t buy it, why don’t companies just go and collect the data themselves? Well, it turns out collecting 62,500 petabytes, or roughly 60 billion years’ worth of people conversing, is no simple task. There are two primary problems: collecting that amount of audio data and labeling it.

Audio Data

Where could someone acquire that quantity of audio data? Well, there are countless hours of TV and radio interviews out there, but the dialog is generally scripted and edited, so it is not reflective of true conversational speech. On top of that, in most cases companies do not have the legal rights to the data and acquiring those rights would be prohibitively expensive.

Amazon, Apple, Microsoft, Google, and other companies are all collecting mountains of data from various voice assistants (Alexa, Siri, etc.) and voicemail messages.   However all that speech data is mostly unidirectional and non-conversational (“Alexa tell me the weather” is not really conversational).

That leaves one obvious channel for acquiring conversational speech and that is phone calls.    So why don’t companies just collect call recordings at scale? The answer is simple:  WIRETAPPING.

In the US, wiretapping is governed by federal and state statutes aimed at ensuring your communications are private, and there are criminal and civil penalties for those who violate the law. What makes wiretapping laws particularly problematic is that they vary by state, specifically around who must consent to being recorded.

So why does this matter?   Well because 12 states require bidirectional consent and phone networks are open (nobody can guarantee they control both sides of the call).  While any company can update their “terms of service” to notify you that you are being recorded, they would have no easy way to guarantee that the other party has consented.   Unless they start playing that pesky message “this call might be recorded or monitored” in front of every call, including your weekly call with your mother!  This puts scalable call recording for consumer oriented phone services mostly out of reach since the risk of violating a criminal law is too high (I think it is safe to say Mark Zuckerberg has no interest in going to jail).

In fact, just ask Google, which has dealt with an ongoing wiretapping case because it was scanning the emails of Gmail users to place targeted ads. The argument is incredibly similar in that Google was not just reading the email of Gmail users but also that of any Yahoo, Hotmail, etc. user who sent a Gmail user an email. In the 12 states requiring bidirectional consent the non-Gmail users never consented, which has potentially caused Google to violate the law.


Even if by some miracle we could collect that amount of audio data, how would we label it? In general ASR systems (and all other supervised machine learning systems) require accurately labeled data (sometimes called “ground truth”). In the speech recognition world that generally involves hand transcribing data. And if it would take roughly 60 billion years to play back that much speech, imagine how long it would take to hand transcribe it. Simply put, it is not feasible in our lifetimes at any reasonable cost.

What’s the Solution

It turns out almost all companies record phone calls. Recordings from any one company would have highly biased content, but in aggregate consumer-to-business recorded phone calls are an amazing source of conversational speech at scale. Because you need a wide cross section of content to ensure subject matter diversity, companies that provide platform call recording solutions and have legal access to the aggregate data are really the best sources of this content.

But what about the labeling?  Well the only reasonable solution for labeling that content is using unsupervised or semi-supervised automated solutions for labeling the data.   This is where Marchex has invested and you can read more details about our semi-supervised approach in our research paper.   I hope to cover this topic in detail in a future post.

Why Does Any of this Matter

You might be asking if highly accurate conversational speech recognition is really necessary. Or you might be thinking “My Alexa already works awesome”. But if you are a sci-fi nerd like me you’re anxiously awaiting the day that you can set foot on the holodeck and have a real conversation with an AI entity or crack open a beer with a fully conversant robot like Data from Star Trek. For that to happen we need to truly understand conversational speech. We need to understand it so machines can properly decipher what humans are saying and we need to understand it so machines can generate speech that mimics human dialog.

Highly accurate conversational speech recognition is necessary for us to fulfill the promised vision of artificial intelligence.  Who knows maybe in a few years a holographic Manfred Mann and I will be doing a duet in my own personal holodeck.   Can you hear it?   “Blinded by the light, revved up like a deuce …”