Introducing Task-Oriented Multiparty Conversational AI: Inviting AI to the Party

The term “conversational AI” has been around for some time. There are dozens of definitions all over the internet. But let me refresh your memory with a definition from NVIDIA’s website.

Conversational AI is the application of machine learning to develop language-based apps that allow humans to interact naturally with devices, machines, and computers using speech

There’s nothing wrong with that definition except for one small misleading phrase: “… allow humans to interact …”. What that should say is: “… allow a human to interact …”. Why? Because every interaction you’ve ever had with a conversational AI system has been one-on-one.

Sure, you and your kids can sit around the kitchen table blurting out song titles to Alexa (“Alexa, play the Beatles,” “No Alexa, play Travis Scott,” “No Alexa, play Olivia Rodrigo.” …). Alexa may even acknowledge each request, but she isn’t having a conversation with your family. She’s indiscriminately acknowledging and transacting on each request as if they’re coming in one by one, all from the same person.

And that’s where multiparty conversational AI comes into play.

What is Multiparty Conversational AI?

With a few small tweaks, we can transform our previous definition of conversational AI to one that accurately defines multiparty conversational AI.

Multiparty conversational AI is the application of machine learning to develop language-based apps that allow AI agents to interact naturally with groups of humans using speech

While the definitions may appear similar, they are fundamentally different. One implies a human talking to a machine, while our new definition implies a machine being able to interact naturally with a group of humans using speech or language. This is the difference between one-on-one interactions versus an AI agent interacting in a multiparty environment.

Multiparty conversational AI isn’t necessarily new. Researchers have been exploring multiparty dialog and conversational AI for many decades. I personally contributed to early attempts at building multiparty conversational AI into video games with the Kinect camera nearly fifteen years ago.1 But sadly no one has been able to solve all the technical challenges associated with building these types of systems and there has been no commercial product of note.

What about the “Task-Oriented” part?

You may have wisely noted that I have not mentioned the words “task-oriented” contained in the title of this post. Conversational AI (sometimes also called dialog systems) can be divided into two categories, open-domain and task-oriented.

Open-domain systems can talk about any arbitrary topic. The goal is not necessarily to assist any particular action, but rather engage in arbitrary chitchat. Task-oriented systems are instead focused on solving “tasks”. Siri and Alexa are both task-oriented conversational AI systems.

In multiparty systems, tasks become far more complicated. Tasks are usually the output of a conversation where a consensus is formed that necessitates action. Therefore, any task-oriented multiparty conversational AI system must be capable of participating in forming a consensus, or it will risk taking action before it is necessary to do so.

Multiparty Conversational AI, What is it Good For?

“Absolutely Everything!” Humans are inherently social creatures. We spend much of our time on this planet interacting with other humans. Some have even argued that humans are a eusocial species (like ants and bees) and that our social interactions are critical to our evolutionary success. Therefore, for any conversational AI system to truly become an integral part of our lives, it must be capable of operating amongst groups of humans.

Nowhere is this more evident than in a corporate work environment. After all, we place employees on teams, they have group conversations on Slack/Teams and email, and we constantly gather small groups of people in scheduled or ad-hoc meetings. Any AI system claiming to improve productivity in a work environment will ultimately need to become a seamless part of these group interactions.

Building Task-Oriented Multiparty Conversational AI Systems

There is a litany of complex problems that need to be solved to reliably build a task-oriented multiparty conversational AI system that would be production-worthy. Below is a list of the most critical areas that need to be addressed.

  • Task detection and dialog segmentation
  • Who’s talking to whom
  • Semantic parsing (multi-turn intent and entity detection)
  • Conversation disentanglement
  • Social graphs and user/organization preferences
  • Executive function
  • Generative dialog

In the next sections, we’ll briefly dive deeper into each of these areas.

Task Detection and Dialog Segmentation

In a single-party system such as Alexa or Siri, task detection is quite simple. You address the system (“Hey Siri …”) and everything you utter is assumed to be a request to complete a task (or a follow-up on a secondary step needed to complete a task). But in multiparty conversations, detecting tasks2 is far more difficult. Let’s look at two dialog segments below.

Two aspects of these conversations make accurately detecting tasks complex:

  • In the first dialog, our agent, Xena, is an active part of the conversation, and the agent is explicitly addressed. However, in the second conversation, our agent passively observed a task assigned to someone else and subsequently proactively offered assistance. That means we need to be able to detect task-oriented statements (often referred to as a type of dialog act) that might not be explicitly addressed to the agent.
  • The second issue is that the information necessary to complete either of these tasks is contained outside the bounds of the statement itself. That means we need to be able to segment the dialog (dialog segmentation) to capture all the utterances that pertain to the specific task.

Beyond the two challenges above, there is also the issue of humans often making vague commitments or hedging on ownership. This presents additional challenges, as any AI system must be able to determine whether a task request is definitive and must be able to handle vague tasks or uncertain ownership.

Who’s Talking to Whom

To successfully execute the task in a multiparty conversation we need to know who is making the request and to whom it is assigned. This raises another set of interesting challenges. The first issue is, how do we even know who is speaking in the first place?

In a simple text-based chat in Slack, it is easy to identify each speaker. The same is true of a fully remote Zoom meeting. But what happens when six people are all collocated in a conference room? To solve this problem we need to introduce concepts like blind speaker segmentation and separation and audio fingerprinting.

But even after we’ve solved the upfront problem of identifying who is in the room and speaking at any given time there are additional problems associated with understanding the “whom”. It is common to refer to people with pronouns and in a multiparty situation you can’t just simply assume “you” is the other speaker. Let’s look at a slightly modified version of one of the conversations we presented earlier.

The simple assumption would be that the previous speaker (User 2) is the “whom” in this task statement. But after analyzing the conversation it is clear that “you” refers to User 1. Identifying the owner or “whom” in this case requires concepts like coreference resolution (who does “you” refer to elsewhere in the conversation) to identify the correct person.

Semantic Parsing

Semantic parsing, also sometimes referred to as intent and entity detection, is an integral part of all task-oriented dialog systems. However, the problem gets far more complex in multiparty conversations. Take the dialog in the previous section. A structured intent and entity JSON block might look something like this:

    {
        "intent": "schedule_meeting",
        "entities": {
            "organizer": "User 1",
            "attendees": [
                "User 2",
                "User 3"
            ],
            "title": "next quarter roadmap",
            "time_range": "next week"
        }
    }

Note that not all of the details in this JSON block originated from our task-based dialog act. Rather, the information was pulled from multiple utterances across multiple speakers. Successfully achieving this requires a system that is exceptionally good at coreference resolution and discourse parsing.
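To illustrate how a task frame like the JSON above could be assembled from several utterances, here is a minimal Python sketch. It assumes a hypothetical upstream parser has already produced a small dictionary of slots for each utterance; the function name `merge_task_frame` and the per-utterance slot values are made up for illustration and are not part of any real system.

```python
from typing import Any


def merge_task_frame(intent: str, partial_slots: list[dict[str, Any]]) -> dict[str, Any]:
    """Merge slot values found in separate utterances into one task frame."""
    entities: dict[str, Any] = {}
    for slots in partial_slots:
        for name, value in slots.items():
            if isinstance(value, list):
                # List-valued slots (e.g. attendees) accumulate without duplicates.
                merged = entities.setdefault(name, [])
                for item in value:
                    if item not in merged:
                        merged.append(item)
            else:
                # Scalar slots: the most recent mention wins (an assumed policy).
                entities[name] = value
    return {"intent": intent, "entities": entities}


# Hypothetical parser output, one dict per utterance in the dialog.
utterance_slots = [
    {"title": "next quarter roadmap"},
    {"attendees": ["User 2"], "time_range": "next week"},
    {"attendees": ["User 3"]},
    {"organizer": "User 1"},
]
frame = merge_task_frame("schedule_meeting", utterance_slots)
```

The hard part, of course, is producing those per-utterance slots in the first place. That is where coreference resolution and discourse parsing come in, and this sketch does not attempt either.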

Conversation Disentanglement

While some modern chat-based applications (e.g. Slack) have concepts of threading that can help isolate conversations, we can’t guarantee that any given dialog is single-threaded. Meetings are nonthreaded and chat conversations can contain multiple conversations that are interspersed with each other. That means any multiparty conversational AI system must be able to pull apart these different conversations to transact accurately. Let’s look at another adaptation of a previous conversation:

In this dialog, two of our users have started a separate conversation. This can lead to ambiguity in the last request to our agent. User 3 appears to be referring to the previous meeting we set up, but knowing this requires we separate (or disentangle) these two distinct conversations so we can successfully handle subsequent requests.
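As a rough illustration of what disentanglement means mechanically, the Python sketch below assigns each utterance to a thread using a greedy word-overlap heuristic. This is a toy stand-in, not how a production system would do it: the stopword list, the `threshold` value, and the example utterances are all assumptions chosen for the demonstration.

```python
import re

# A tiny, hand-picked stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "to", "for", "on", "is", "it", "we", "can",
             "you", "i", "and", "of", "that", "at", "in"}


def content_words(text: str) -> set[str]:
    """Lowercase the text and keep only non-stopword tokens."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}


def disentangle(utterances: list[str], threshold: float = 0.2) -> list[int]:
    """Greedily label each utterance with a thread id based on Jaccard overlap."""
    thread_vocab: list[set[str]] = []
    labels: list[int] = []
    for text in utterances:
        words = content_words(text)
        best_thread, best_score = -1, 0.0
        for idx, vocab in enumerate(thread_vocab):
            union = words | vocab
            score = len(words & vocab) / len(union) if union else 0.0
            if score > best_score:
                best_thread, best_score = idx, score
        if best_thread >= 0 and best_score >= threshold:
            # Close enough to an existing thread: join it and grow its vocabulary.
            thread_vocab[best_thread] |= words
            labels.append(best_thread)
        else:
            # No thread is similar enough: start a new one.
            thread_vocab.append(set(words))
            labels.append(len(thread_vocab) - 1)
    return labels


chat = [
    "Can we schedule the roadmap meeting next week?",
    "Did anyone try the new lunch place?",
    "Yes, the lunch place was great",
    "Move the roadmap meeting to Wednesday",
]
thread_ids = disentangle(chat)
```

Real conversations rarely repeat keywords this conveniently, which is why disentanglement usually also leans on speaker identity, timing, and reply structure rather than lexical overlap alone.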

Social / Knowledge Graph and User Preferences

While this might not be obvious, when you engage in any multiparty conversation you are relying on a database of information that helps inform how you engage with each participant. That means any successful multiparty conversational AI system needs to be equally aware of this information. At a bare minimum, we need to know how the participants relate to one another and what their preferences are for the supported tasks. For example, if the CEO of the company is part of the conversation, you may want to defer to their preferences when executing any task.

Executive Function

Perhaps most importantly, any task-oriented multiparty conversational AI system must have executive function capabilities. According to the field of neuropsychology, executive function is the set of cognitive capabilities humans use to plan, monitor, and execute goals.

Executive function is critically important in a multiparty conversation because we need to develop a plan for whether we immediately take action on any given request or if we must seek consensus first. Without these capabilities, an AI system will just blindly execute tasks. As described earlier in this post this is exactly how your Alexa behaves today. If you and your kids continuously scream out “play <song name x>” it will just keep changing songs without any attempt to build consensus and the interaction with the conversational AI system will become dysfunctional. Let’s look at one more dialog interaction.

As you can see in the example above, our agent didn’t just automatically transact on a request to move the meeting to Wednesday. Instead, the agent used its executive function to do a few things:

  • Recognize that the second requester was not the originator of the request
  • Preemptively pull back information about whether the proposal was viable
  • Seek consensus with the group before executing

Achieving this capability requires the gathering of data previously collected, developing a plan, and then executing against that plan. So for a task-oriented multiparty conversational AI system to correctly operate within a group, it must have executive function capabilities.
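The three steps listed above can be sketched as a small decision function. This is a deliberately simplified Python illustration under my own assumptions: `ChangeRequest`, `Action`, and `plan_action` are hypothetical names, and the `is_viable` flag stands in for whatever lookup (for example, a calendar availability check) the agent performs before responding.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    EXECUTE = "execute"
    SEEK_CONSENSUS = "seek_consensus"
    DECLINE = "decline"


@dataclass
class ChangeRequest:
    requester: str    # who is asking for the change
    originator: str   # who made the original request
    is_viable: bool   # result of a pre-check, e.g. everyone is free at the new time


def plan_action(request: ChangeRequest) -> Action:
    """Decide whether to act immediately, ask the group, or push back."""
    if not request.is_viable:
        # The proposal cannot work, so report that instead of acting.
        return Action.DECLINE
    if request.requester == request.originator:
        # Assumed policy: the originator may change their own request directly.
        return Action.EXECUTE
    # Someone else is changing the task: check with the group first.
    return Action.SEEK_CONSENSUS


decision = plan_action(ChangeRequest(requester="User 3", originator="User 1", is_viable=True))
```

A real agent would need a much richer policy (seniority, preferences, how consensus is actually reached), but the shape is the same: gather data, form a plan, then act on it.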

Generative Dialog Engine

Last but not least, any conversational AI system must be able to converse with its users. However, because the number of people in any given conversation and their identities are not predictable, and our executive functions can cause a wide array of responses, no predefined or templated list will suffice for generating responses. A multiparty system will need to take all our previously generated information and generate responses on demand.

Wait, Don’t Large Language Models (LLMs) Solve This?

With all the hype, you’d think LLMs could solve the problem of task-oriented multiparty conversational AI – and weave straw into gold. But it turns out that, at best, LLMs are just a piece in a much larger puzzle of AI technology needed to solve this problem.

There are basic problems like the fact that LLMs are purely focused on text and can’t handle some of the speaker identification problems discussed earlier. But even more importantly there is no evidence that LLMs have native abilities to understand the complexities of social interactions and plan their responses and actions based on that information.

It will require a different set of technologies, perhaps leveraging LLMs in some instances, to fully build a task-oriented multiparty conversational AI system.

So When Can I Invite an AI to Join the Party?

While I can’t say anyone has solved all the challenges discussed in this post, I can say we are very close. My team at Xembly has developed what we believe is the first commercial product capable of participating in multiparty conversations as both a silent observer and an active participant. Our AI agent can join in-person meetings or converse with a group in Slack while also helping complete tasks that arise as a byproduct of these conversations.

We are only just beginning to tackle task-oriented multiparty conversational AI. So we may not be the life of the party, but go ahead and give Xembly and our Xena AI agent a try. The least you can do is send us an invite!

  1. With the Kinect Camera, we hoped to individually identify speakers in a room so each user could independently interact with the game. You can read more details about our work in this space here: 1, 2, 3, 4
  2. Are tasks in multiparty conversations just action items? Yes, since an action item is generally defined as a task that arises out of a group’s discussion. I’ll be writing a larger deep dive into action item detection in a future post.

Your Large Language Model – it’s as Dumb as a Rock

© Jason Flaks - Initially generated by DALL-E and edited by Jason Flaks

Unless you’ve been living under a rock lately, you likely think we’re entering some sort of AI-pocalypse. The sky is falling and the bots have come calling. There are endless reports of ChatGPT acing college-level exams, becoming self-aware, and even trying to break up people’s marriages! The way OpenAI and their ChatGPT product have been depicted, it’s a miracle we haven’t all unplugged our devices and shattered our screens. It seems like a sensible way to stop the AI overlords from taking control of our lives.

But never fear! I am here to tell you that large language models (LLMs) and their various compatriots are as dumb as the rocks we all might be tempted to smash them with. Well, ok, they are smart in some ways. But don’t fret—these models are not conscious, sentient, or intelligent at all. Here’s why.

Some Like it Bot: What’s an LLM?

Large Language Models (LLMs) actually do something quite simple. They take a given sequence of words and predict the next likely word to follow. Do that recursively, and add in a little extra noise each time you make a prediction to ensure your results are non-deterministic, and voila! You have yourself a “generative AI” product like ChatGPT.

But what if we take the description of LLMs above and restate it a little more succinctly:

LLMs estimate an unknown word based on extending a known sequence of words.

It may sound fancy—revolutionary, even—but the truth is it’s actually old school. Like, really, really old school—it’s almost the exact definition of extrapolation, a common mathematical technique that’s existed since the time of Archimedes! If you take a step back, Large Language Models are nothing more than a fancy extrapolation algorithm.  Last I checked nobody thinks their standard polynomial extrapolation algorithm is conscious or intelligent. So why exactly do so many believe LLMs are?
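For comparison, here is about the simplest extrapolation algorithm there is, sketched in Python: continue the slope of the last two known values, append the estimate, and repeat. The function names are my own. Note how closely the loop mirrors the word-by-word process described above, even though an LLM's "estimate the next value" step is a learned model rather than a straight line.

```python
def extrapolate_next(values: list[float]) -> float:
    """Linear extrapolation: continue the slope of the last two points."""
    if len(values) < 2:
        raise ValueError("need at least two known values")
    return values[-1] + (values[-1] - values[-2])


def extend(values: list[float], steps: int) -> list[float]:
    """Extend a known sequence by feeding each estimate back in as input."""
    result = list(values)
    for _ in range(steps):
        result.append(extrapolate_next(result))
    return result


extended = extend([1.0, 3.0], steps=3)  # [1.0, 3.0, 5.0, 7.0, 9.0]
```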

Hear Ye, Hear Ye: What’s in an Audio Sample

Sometimes it’s easier to explain a complex topic by comparison. Let’s take a look at one of the most common human languages in existence—music.  Below are a few hundred samples from Bob Dylan’s “Like a Rolling Stone.” 

If I were to take those samples and feed them into an algorithm and then recursively extrapolate out for a few thousand samples, I’d have generated some additional audio content. But there is a lot more information encoded in that generated audio content than just the few thousand samples used to create it.

At the lowest level:

  • Pitch
  • Intensity
  • Timbre

At a higher level:

  • Melody
  • Harmony
  • Rhythm

And at an even higher level:

  • Genre
  • Tempo

So by simply extrapolating samples of audio, we generated all sorts of complex higher-level features of auditory or musical information. But pump the brakes! Did I just create AI Mozart? I don’t think so. It’s more like AI Muzak.

An AI of Many Words: What’s Next? 

It turns out that predicting the next word in a sequence of words will also generate more than just a few lines of text. There’s a lot of information encoded in those lines, including the structure of how humans speak and write, as well as general information and knowledge we’ve previously logged. Here’s just a small sample of things encoded in a sequence of words:

  • Vocabulary
  • Grammar/Part of Speech (PoS) tagging
  • Coreference resolution (pronoun dereferencing)
  • Named entity detection
  • Text categorization
  • Question and answering
  • Abstract summarization
  • Knowledge base

All of the information above can, in theory, be extracted by simply predicting the next word, much in the same way predicting the next musical sample gives us melody, harmony, rhythm, and more. And just like our music extrapolation algorithm didn’t produce the next Mozart, ChatGPT isn’t going to create the next Shakespeare (or the next horror movie villain, for that matter).

LLMs: Lacking Little Minds? 

Large Language Models aren’t the harbinger of digital doom, but that doesn’t mean they don’t have some inherent value. As an early adopter of this technology, I know it has a place in this time. It’s integral to the work we do at Xembly, where I’m the co-founder and CTO. However, once you understand that LLMs are just glorified extrapolation algorithms, you gain a better understanding of the limitations of the technology and how best to use it. 

Five Alive: How to Use LLMs So They Don’t Take Over the World

LLMs have huge potential. Just like any other tool, though, in order to extrapolate the most value, you have to use them properly. Here are five areas to consider as you incorporate LLMs into your life and work. 

  • Information must be encoded in text
  • Extrapolation error with distance
  • Must be prompted
  • Limited short-term memory
  • Fixed in time with no long-term memory

Let’s dig a little deeper.

Information Must Be Encoded in Text

Yann LeCun probably said it best:

Humans are multi-modal input devices, and many of the things we observe and are aware of that drive our behavior aren’t verbal (and hence not encoded in text). An example we contend with at Xembly is the prediction of action items from a meeting. It turns out that the statement “I’ll update the row in the spreadsheet” may or may not be a future commitment to do work. Language is nuanced, influenced by other real-time inputs like body language and hundreds of other human expressions. It’s entirely possible in this example that the task was completed in real-time during the meeting, and the spoken words weren’t an indication of future work at all.

Extrapolation Error with Distance

Like all extrapolation algorithms, the further you get away from your source signal (or prompt in the case of LLMs), the more likely you will experience errors. Sometimes a single prediction that negates an otherwise affirmative statement, or an incorrectly assigned gendered pronoun, can cause downstream errors in future predictions. These tiny errors can often lead to convincingly good responses that are factually inaccurate. In some cases, you may find LLMs return highly confident answers that are completely incorrect. These types of errors are referred to as hallucinations.

But both of these examples are really just forms of extrapolation error. The errors will be more pronounced when you make long predictions. This is especially true for content largely unseen by the underlying language model (for example, when trying to do long-form summarization of novel content).
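A back-of-the-envelope calculation shows why length matters so much. The Python sketch below makes a strong simplifying assumption that is mine, not a measured property of any model: every predicted word is independently correct with the same probability. Real errors are correlated, so treat the numbers as intuition only.

```python
def error_free_probability(per_word_accuracy: float, length: int) -> float:
    """Chance that `length` consecutive predictions are all correct,
    assuming (unrealistically) that each one is independent."""
    return per_word_accuracy ** length


# Even at an assumed 99% per-word accuracy, long outputs are rarely flawless.
for n in (10, 100, 1000):
    print(n, round(error_free_probability(0.99, n), 4))
```

Under that assumption, a 10-word continuation is error-free about 90% of the time, a 100-word continuation a little over a third of the time, and a 1000-word continuation almost never.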

Must Be Prompted

Simply put, if you don’t provide input text an LLM will do nothing. So if you are expecting ChatGPT to act as a sage and give you unsolicited advice, you’ll be waiting a long time. Many of the features Xembly customers rave about are based on our product providing unsolicited guidance. Large Language Models are no help to us here.

Limited Short-Term Memory

LLMs generally only operate on a limited window of text. In the case of ChatGPT, that window is roughly 3000 words. What that means is that new information not already incorporated in the initial LLM training data can very quickly fall out of memory. This is especially problematic for long conversations where new corporate lingo may be introduced at the start of a conversation and never mentioned again. Once whatever buzzword is used falls out of the context window it will no longer contribute to any future prediction, which can be problematic when trying to summarize a conversation.
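Here is a minimal Python sketch of what "falling out of the window" looks like. It counts whole words for simplicity, whereas real models count tokens, and the window size, function name, and example conversation are made up for illustration.

```python
def visible_context(words: list[str], window_size: int) -> list[str]:
    """Return the most recent `window_size` words; everything older is dropped."""
    if window_size <= 0:
        return []
    return words[-window_size:]


# A buzzword is defined once at the start, then the conversation moves on.
conversation = "Falcon is our codename for the Q3 launch".split() + ["filler"] * 20

early = visible_context(conversation[:8], window_size=10)
late = visible_context(conversation, window_size=10)

print("Falcon" in early)  # True: the definition is still in view
print("Falcon" in late)   # False: it has scrolled out of the window
```

Once the definition is gone from `late`, nothing in the input can tell the model what "Falcon" meant, which is exactly the summarization problem described above.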

Fixed in Time with no Long-term Memory

Every conversation you have with ChatGPT only exists for that session. Once you close that browser or exit your current conversation, there is no memory of what was said. That means you cannot depend on new words being understood in future conversations unless you reintroduce them within a new context window. So, if you introduce an LLM to anything it hasn’t heard before in a given session, you may find it uses that word correctly in subsequent responses. But if you enter a new session and have any hopes that the word will be used without introducing it in a new prompt, brace yourself—you will be disappointed.

To Use an LLM or Not to Use an LLM

It’s a big question. LLMs are exceedingly powerful, and you should strongly consider using them as part of your NLP stack. I’ve found the greatest value of many of these LLMs is that they potentially replace all the bespoke language models folks have been making for some time. You may not need these custom entity models, intent models, abstract summarization models, etc. It’s quite possible that LLMs can accomplish all of these things at similar or better accuracy, while possibly greatly reducing time to market for products that rely on this type of technology.

There are many items in the LLM plus column, but if you are hoping to have a thought-provoking intelligent conversation with ChatGPT, I suggest you walk outside and consult your nearest rock. You just might have a more engaging conversation!

“If you start me up I’ll never stop …” Until We Successfully Exit

“Hey, our fledgling startup is on a path to being the next *INSERT BIG TECH COMPANY NAME HERE* and we think you’re a great fit for our CTO role.” Find me a technical leader who hasn’t been enticed by those words and you’ll have found a liar. So, what happens when one succumbs to the temptation and joins an early-stage startup? Well, if you have been wondering where I’ve been for the past couple of years, I was fighting the good fight at a small, early-stage NLP/Machine Learning based risk intelligence startup. And while I’m not retired or sailing around the world in my new 500-foot yacht, we were able to successfully exit the company with a net positive outcome for all involved. My hope with this post is that I can share some of my acquired wisdom, and perhaps steer the next willing victim down a similar path of success.

If I could sum up my key learnings in a few bullet points, it would boil down to this:

  • If you don’t believe … don’t join
  • Be prepared to contribute in any way possible
  • Find the product and focus on building it
  • Pick the race you have enough fuel for and win it

What I’d like to do in the rest of this post is break down each one of these items a little further.

If you don’t believe … don’t join

Maybe this goes without saying, but if you don’t believe in the vision, the people, and the product you shouldn’t join the startup approaching you. The CTO title is alluring, and it is easy to fool yourself into taking a job for the wrong reasons. But the startup experience is an emotional slog of ups and downs and it will be nearly impossible to weather the ride if you don’t wake up every day with an unyielding conviction for what you’re doing. As I’ll explain later in this post, you don’t need to believe you’re working for the next Facebook, but you do need to believe you are building a compelling product that has real value for you, your coworkers, your investors, and your customers.

Be prepared to contribute in any way possible

For the first few months on the job I used to go into our tiny office and empty all the trash bins because, if I didn’t, that small office with 5 engineers started to smell! It didn’t take long for someone to call out that I was appropriately titled, CTO (a.k.a. Chief Trash Officer). You might be asking why anybody would take a CTO job to wind up being the corporate custodian, but that is what was needed on some days.

While I have steadfastly maintained my technical chops throughout my career, I hadn’t really written a lick of production code for nearly two decades prior to this job. But with limited resources, it became clear I also needed to contribute to the code base and so I dusted off those deeply buried skills and contributed where I could. When you join a startup with that CTO title, it is easy to convince yourself that you’ll build a huge team, be swimming in resources, and have an opportunity to direct the band versus playing in it. But you’ll quickly find that in the early stages of a startup, the success of the company will depend on your willingness to drop your ego and contribute wherever you can.

Find the product and focus on building it

Great salespeople can sell you the Brooklyn Bridge. And if you’re just lucky enough, you might have a George C. Parker in your ranks. But the problem with great salespeople is they will do almost anything to close the sale and that comes with a real risk that they’ll sell custom work. If that happens over an extended period of time, you will be unable to focus on the core product offering and you’ll quickly find you’re the CTO of a work-for-hire / consulting company.

Startups face real financial pressures that often drive counterproductive behaviors. That often means doing anything necessary to drive growth in revenue, customers, or usage. But as illustrated in the graph below, high product variance will often ultimately lead to stagnant growth.

That’s because with every new feature comes a perpetual support cost. And if you keep building one-off features, and can’t fundraise fast enough, that cost will eventually come at the expense of delivering your true market-wide value proposition. If you allow this to happen, you’ll wind up with a company that generates some amount of revenue or usage but has no real value.

Companies that find true product/market fit should see product variance gradually decrease over time and this should allow the company to grow. Your growth trajectory may be linear when you need it to be exponential, but no per customer feature work will fix that problem and you may need to consider pivoting. If pivoting isn’t an option, it may be time to look for an exit.

As the CTO, a critical part of your job is to help the company find its product/market fit and then relentlessly focus on it. You need to hold the line against distractions and ensure the vast majority of resources are spent on features that align with the core value proposition. If you’ve truly found a product offering that is valued by a given market segment, and you can keep your resources focused on building it, growth will follow.

Pick the race you have enough fuel for and win it

I am an avid runner, and one of the great lessons of long-distance running is that if you deplete your glycogen store, you’ll be unable to finish the race no matter how hard you trained. In other words, you can’t win the race if you have insufficient fuel. This is also very true of startups. If you’re SpaceX or Magic Leap, you’re running an ultra-marathon and you need a tremendous amount of capital in order to have sufficient time and resources to realize the value. But fundraising is hard, and even if you have an amazing product and top-notch talent, there can be significant barriers to acquiring sufficient capital.

The mistake some startups make is that they continue to run an ultra-marathon when they only have fuel for a 5k and that can lead to a premature or unnecessary failure. If funding becomes an issue, start looking for how your product might offer value to another firm. Start allocating resources towards making the product attractive for an acquisition. Aim to win a smaller race and seek more fuel on the next go around.

Final Thoughts

Taking on a CTO role at an early-stage startup can be a great opportunity and lead to enormous success, but before you take the leap make sure you know what you’re getting into. Along the way don’t forget to stop and smell the roses. In the words of fellow Seattle native Macklemore, “Someday soon, your whole life’s gonna change. You’ll miss the magic of these good old days.”

Final Final Thoughts

No startup CTO is successful without support from an army of people. So I’d like to offer some gratitude to the following folks:

  • Greg Adams, Chris Hurst: Thanks for giving me an opportunity and treating me like a cofounder from day one.
  • Shane Walker, Cody Jones, Phil LiPari, Pavel Khlustikov, David Ulrich, Julie Bauer, Jason Scott, Carrie Birmingham, Rich Gridlestone, Bill Rick, Zach Pryde, Amy Well, David Debusk, Mikhail Zaydman, Jean-Roux Bezuidenhout, Sergey Kurilkn (and others I may have forgotten): Thanks for being one of the greatest teams I’ve ever worked with.
  • Brandon Shelton, Linda Fingerle, Wayne Boulais, Armando Pauker, Matt Abrams, Matthew Mills: Thank you for being outstanding board members, mentors, and investors.
  • Ziad Ismail, Pete Christothoulou, Kirby Winfield: Thank you for the career advice during my first venture into the startup world.

*Note: You can read more about Stabilitas, OnSolve, and our acquisition at the links below: