The Astonishing Reasons Why Your LLM is a Bad Notetaker

We’ve all been there. You gather your team for a meeting and make a bunch of decisions that lead to a series of follow-up action items. Then a week goes by, you meet again, nobody remembers what on earth you decided, and none of the action items have been closed out. And right there you just wasted hours of your time, the team’s time, and most importantly the company’s time. That, my friends, is why we take meeting notes!

Capturing meeting notes, and more importantly the action items that result from them, is critical for a high-functioning team. But there is a downside. Taking notes while simultaneously participating in a meeting is difficult, and you usually wind up focusing on one task or the other. So the prospect of a large language model (LLM) being able to take a meeting transcript and produce an accurate list of action items is insanely attractive. Too bad you’ll find it doesn’t do a particularly good job.

What is an Action Item

Before we dive into the details of why LLMs struggle to capture action items, it’s worth defining what an action item is. A quick search on Google will find you a dozen or so similar definitions. For the sake of this post, I prefer the definition given by our friends at Asana:

An action item is a task that is created from a meeting with other stakeholders to move a project towards its goal

https://asana.com/resources/action-items

That’s a pretty good definition. But truthfully if you interview 100 people and ask them what the action items are for a given meeting you’ll get 100 different answers. Why? It turns out that the complete list of action items from any given meeting is wildly dependent on the purpose of the meeting, the nature of a given project, the type of work done by an organization, and sometimes the simple subjectivity of the notetaker.

Anyone who has worked on machine learning projects knows you can’t teach a machine to learn random human behavior. There has to be structure in what you are trying to teach the machine, even if we humans can’t fully articulate it. So at Xembly, we’ve adopted a slightly more precise definition that draws clear lines between what is and is not an action item.

A commitment, made by a meeting attendee or on behalf of another person, to do future work related to the agenda of the meeting or business of the organization.

Why the different definition? Action items from a 1:1 meeting may not be project-based. They may cover multiple projects or self-improvement tasks. A commitment to walk the dog after the meeting may be irrelevant for an engineering standup but critically important for a veterinary office. The definition above gives us the best chance of getting 100 people to agree on the action items from a given meeting.

Why LLMs Struggle with Action Item Detection

There are a host of reasons why LLMs fail to accurately capture the action items from a meeting with sufficient precision and recall. However, they fall into a few key areas:

  • Information isn’t encoded in the text
  • Lack of social awareness
  • Difficulty in doing abstract calculations

Let’s dive into each of these individually.

Information isn’t encoded in the text

I’ve discussed this issue in an earlier blog post, but to reiterate: an LLM is just predicting the next word or token based on previous words or tokens. If the information necessary to predict the next word isn’t contained in the earlier text (or encoded in the base model), the LLM will likely not give you a high-value output. There is a variety of information that may not be contained in the text of a given conversation, but let’s focus on two in particular: outside context and visual information.

Outside Context

Let’s assume a manager passes by an employee earlier in the day to discuss a possible new project. Subsequently, in a later 1:1 meeting, the manager says “Remember that project we discussed, I want you to do it”. This is a situation where the context of the project is not contained in the text, so there is no way for the LLM to turn this into a precise action item. The LLM will either struggle to classify this as an action item or at best return a vague and ambiguous action item that isn’t of much use.

But missing context isn’t limited to nonspecific references. A lack of a corporate “world model” can have all sorts of implications. For example “walking the dog” may be an action item if you work at a vet, or just a passing comment in a standup meeting. Sarcasm may also be difficult to discern without a larger working knowledge of what is going on in an organization.

Visual Information

It is very common to have working meetings. In those meetings, some proportion of the tasks will be acted upon during the meeting itself, while others are commitments to future work. That isn’t always obvious unless you have access to the associated visual information. For example, someone saying “I’m going to update that row in the spreadsheet” may or may not be doing so right at that moment. The text alone is often insufficient to tell whether a meeting participant has already taken action on a task or is making a future commitment to do the work.

Social Awareness

We humans are funny creatures. For a host of reasons I won’t get into here, we often like to be non-committal. That means we will often hedge our commitments so we can’t be held accountable. That ultimately has meaningful impacts on any model identifying action items. There are two techniques humans tend to use to avoid accountability that LLMs struggle with: the royal we and vague/ambiguous tasks.

The Royal We

The best way of avoiding ownership of a given task is to suggest that “we” should do the task. Because “we” can mean everybody or anybody, and therein lies the problem. Sometimes “we” really does mean everyone. If I say “We all need to complete our expense reports by the end of the week”, that likely means everyone in the meeting owns the task. However, if I say “We will get back to you next week”, “we” really means “I”, but I’ve hedged on ownership. This makes it incredibly difficult for an LLM to understand whether these types of tasks are actually action items and, if they are, who the owner should be.

Vague and Ambiguous Tasks

The other way humans hedge on accountability is to provide vague or ambiguous task descriptions. For example “I’m going to do something about that” or “I should really look into that”. The problem with the first example is that the task is very unclear. The problem with the second example is that I used the hedge word “should”. In both these cases, it is unclear from the text if they are relevant action items. That means LLMs generally have to guess and usually do so with 50/50 accuracy at best.
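
To make this concrete, here is a minimal, purely illustrative sketch of how a first pass might flag hedged wording and ambiguous ownership before a downstream model (or a human) decides whether a statement is a real action item. The hedge-word list and the rules are assumptions for illustration, not a description of Xembly’s production approach.

import re

# Illustrative hedge cues; a real system would learn these rather than hard-code them.
HEDGE_WORDS = {"should", "might", "maybe", "probably", "try to", "look into", "something"}
AMBIGUOUS_OWNERS = {"we", "someone", "somebody"}

def flag_hedges(utterance: str) -> dict:
    """Return simple signals that a stated commitment may be hedged or unowned."""
    tokens = re.findall(r"[a-z']+", utterance.lower())
    text = " ".join(tokens)
    hedge_cues = [w for w in HEDGE_WORDS if w in text]
    ambiguous_owner = any(w in tokens for w in AMBIGUOUS_OWNERS)
    return {"utterance": utterance, "hedge_cues": hedge_cues, "ambiguous_owner": ambiguous_owner}

for line in ["I should really look into that.",
             "We will get back to you next week.",
             "I'll send the report to finance on Friday."]:
    print(flag_hedges(line))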

Abstract Calculations

Last but not least, LLMs do poorly at abstract calculations. The thing about action items is that they often have a due date, but those dates are usually stated in relative terms (e.g. “I’ll send you that on Tuesday”). Converting a relative date like “next Tuesday” to April 2nd, 2024 requires abstract calculation, and ultimately this is not something LLMs excel at. As I’ve commented in the past, LLMs struggle to even understand a leap year, so how can they accurately provide due dates?
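
As a small illustration of what that calculation involves, resolving “next Tuesday” means anchoring the phrase to the date of the meeting and doing calendar arithmetic, something a deterministic routine handles trivially but a generative model has to get right token by token. A minimal sketch (the anchoring rule is an assumption; a real system also has to deal with time zones, holidays, and ambiguous phrasing):

from datetime import date, timedelta

def next_weekday(anchor: date, weekday: int) -> date:
    """Resolve 'next <weekday>' relative to an anchor date (Monday=0 ... Sunday=6)."""
    days_ahead = (weekday - anchor.weekday()) % 7
    if days_ahead == 0:  # "next Tuesday" said on a Tuesday means a week out
        days_ahead = 7
    return anchor + timedelta(days=days_ahead)

meeting_date = date(2024, 3, 28)      # the meeting where the commitment was made
print(next_weekday(meeting_date, 1))  # -> 2024-04-02, the following Tuesday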

Summarizing the Action Items from this Post

Well, if an LLM isn’t good enough on its own to capture action items for meeting notes, what should you do? At Xembly we’ve found that you need to augment an LLM with additional models to truly get close to 100% precision and recall when identifying action items.

Specifically, we’ve found it necessary to be more permissive in what we call action items and subsequently use ranking models for ordering them by likelihood. This gives the end user the ability to quickly make those 50/50 judgment calls with just a click of a button. We have also built dedicated models for due date and owner detection that perform far more accurately than what you will get out of the box with an LLM. Finally, wherever possible we’ve tried to connect our evaluation to data sources (knowledge graphs/world models) that extend beyond the conversation.
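
As a rough sketch of that “be permissive, then rank” flow (the likelihood scores would come from a dedicated ranking model in practice; the threshold and example data are illustrative assumptions):

from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    owner: str | None
    likelihood: float  # supplied by a ranking model in practice

def triage(candidates: list[Candidate], threshold: float = 0.5) -> dict:
    """Keep high-likelihood items automatically; surface the rest for a one-click decision."""
    ranked = sorted(candidates, key=lambda c: c.likelihood, reverse=True)
    return {
        "auto_accepted": [c for c in ranked if c.likelihood >= threshold],
        "needs_review": [c for c in ranked if c.likelihood < threshold],
    }

items = [
    Candidate("Send the Q3 roadmap to the team", "Priya", 0.92),
    Candidate("We should look into that", None, 0.35),
]
print(triage(items))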

Ultimately, LLMs can be an incredibly helpful tool for building a notetaking solution. But you’ll have a few action items on your plate to augment the technology if you want sufficient accuracy to delight your users.

Introducing Task-Oriented Multiparty Conversational AI: Inviting AI to the Party

The term “conversational AI” has been around for some time. There are dozens of definitions all over the internet. But let me refresh your memory with a definition from NVIDIA’s website.

Conversational AI is the application of machine learning to develop language-based apps that allow humans to interact naturally with devices, machines, and computers using speech

https://www.nvidia.com/en-us/glossary/conversational-ai

There’s nothing wrong with that definition except for one small misleading phrase: “… allow humans to interact …”. What that should say is: “… allow a human to interact …”. Why? Because every interaction you’ve ever had with a conversational AI system has been one-on-one.

Sure, you and your kids can sit around the kitchen table blurting out song titles to Alexa (“Alexa, play the Beatles,” “No Alexa, play Travis Scott,” “No Alexa, play Olivia Rodrigo.” …). Alexa may even acknowledge each request, but she isn’t having a conversation with your family. She’s indiscriminately acknowledging and transacting on each request as if they’re coming in one by one, all from the same person.

And that’s where multiparty conversational AI comes into play.

What is Multiparty Conversational AI

With a few small tweaks, we can transform our previous definition of conversational AI to one that accurately defines multiparty conversational AI.

Multiparty conversational AI is the application of machine learning to develop language-based apps that allow AI agents to interact naturally with groups of humans using speech

While the definitions may appear similar, they are fundamentally different. One implies a human talking to a machine, while our new definition implies a machine being able to interact naturally with a group of humans using speech or language. This is the difference between one-on-one interactions versus an AI agent interacting in a multiparty environment.

Multiparty conversational AI isn’t necessarily new. Researchers have been exploring multiparty dialog and conversational AI for many decades. I personally contributed to early attempts at building multiparty conversational AI into video games with the Kinect camera nearly fifteen years ago.1 But sadly no one has been able to solve all the technical challenges associated with building these types of systems and there has been no commercial product of note.

What about the “Task-Oriented” part?

You may have wisely noted that I have not yet mentioned the words “task-oriented” from the title of this post. Conversational AI (sometimes also called dialog systems) can be divided into two categories: open-domain and task-oriented.

Open-domain systems can talk about any arbitrary topic. The goal is not necessarily to assist with any particular action, but rather to engage in arbitrary chitchat. Task-oriented systems are instead focused on solving “tasks”. Siri and Alexa are both task-oriented conversational AI systems.

In multiparty systems, tasks become far more complicated. Tasks are usually the output of a conversation where a consensus is formed that necessitates action. Therefore any task-oriented multiparty conversational AI system must be capable of participating in forming that consensus, or it will risk taking action before it is appropriate to do so.

Multiparty Conversational AI, What is it Good For?

“Absolutely Everything!” Humans are inherently social creatures. We spend much of our time on this planet interacting with other humans. Some have even argued that humans are a eusocial species (like ants and bees) and that our social interactions are critical to our evolutionary success. Therefore, for any conversational AI system, to truly become an integral part of our lives, it must be capable of operating amongst groups of humans.

Nowhere is this more evident than in a corporate work environment. After all, we place employees on teams, they have group conversations on Slack/Teams and email, and we constantly gather small groups of people in scheduled or ad-hoc meetings. Any AI system claiming to improve productivity in a work environment will ultimately need to become a seamless part of these group interactions.

Building Task-Oriented Multiparty Conversational AI Systems

There is a litany of complex problems that need to be solved to reliably build a task-oriented multiparty conversational AI system that would be production-worthy. Below is a list of the most critical areas that need to be addressed.

  • Task detection and dialog segmentation
  • Who’s talking to whom
  • Semantic parsing (multi-turn intent and entity detection)
  • Conversation disentanglement
  • Social graphs and user/organization preferences
  • Executive function
  • Generative dialog

In the next sections, we’ll briefly dive deeper into each of these areas.

Task Detection and Dialog Segmentation

In a single-party system such as Alexa or Siri, task detection is quite simple. You address the system (“Hey Siri …”) and everything you utter is assumed to be a request to complete a task (or follow up on a secondary step needed to complete a task). But in multiparty conversations, detecting tasks2 is far more difficult. Let’s look at the two dialog segments below.

Two aspects of these conversations make accurately detecting tasks complex:

  • In the first dialog, our agent, Xena, is an active part of the conversation and is explicitly addressed. However, in the second conversation, our agent passively observed a task assigned to someone else and then proactively offered assistance. That means we need to be able to detect task-oriented statements (often referred to as a type of dialog act) that might not be explicitly addressed to the agent.
  • The second issue is that the information necessary to complete either of these tasks is contained outside the bounds of the statement itself. That means we need to be able to segment the dialog (dialog segmentation) to capture all the utterances that pertain to the specific task.

Beyond the two challenges above there is also the issue of humans often making vague commitments or hedging on ownership. This presents additional challenges as any AI system must be able to parse whether a task request is definitive or not and be able to handle vague tasks or uncertain ownership.
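
To make the detection-plus-segmentation step concrete, here is a minimal sketch over a toy dialog. The cue-phrase check is a placeholder standing in for a trained dialog-act model, and the fixed window stands in for real dialog segmentation:

from dataclasses import dataclass

@dataclass
class Utterance:
    speaker: str
    text: str

# Placeholder dialog-act detector; in practice this is a trained model, not a keyword match.
def is_task_statement(u: Utterance) -> bool:
    cues = ("can you", "i'll", "i will", "please", "let's schedule")
    return any(c in u.text.lower() for c in cues)

def segment_around(dialog: list[Utterance], idx: int, window: int = 2) -> list[Utterance]:
    """Grab the neighboring utterances that likely carry the task's missing details."""
    lo, hi = max(0, idx - window), min(len(dialog), idx + window + 1)
    return dialog[lo:hi]

dialog = [
    Utterance("User 1", "We still need to review next quarter's roadmap."),
    Utterance("User 2", "Sometime next week works for me."),
    Utterance("User 1", "Xena, can you schedule that for the three of us?"),
]
for i, u in enumerate(dialog):
    if is_task_statement(u):
        print(u.speaker, "->", [s.text for s in segment_around(dialog, i)])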

Who’s Talking to Whom

To successfully execute a task in a multiparty conversation, we need to know who is making the request and to whom it is assigned. This raises another set of interesting challenges. The first issue is: how do we even know who is speaking in the first place?

In a simple text-based chat in Slack, it is easy to identify each speaker. The same is true of a fully remote Zoom meeting. But what happens when six people are all collocated in a conference room? To solve this problem we need to introduce concepts like blind speaker segmentation and separation and audio fingerprinting.

But even after we’ve solved the upfront problem of identifying who is in the room and speaking at any given time, there are additional problems associated with understanding the “whom”. It is common to refer to people with pronouns, and in a multiparty situation you can’t simply assume “you” refers to the previous speaker. Let’s look at a slightly modified version of one of the conversations we presented earlier.

The simple assumption would be that the previous speaker (User 2) is the “whom” in this task statement. But after analyzing the conversation, it is clear that “you” refers to User 1. Identifying the owner or “whom” in this case requires techniques like coreference resolution (working out who “you” refers to elsewhere in the conversation) to identify the correct person.
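
As a toy illustration of why the previous-speaker assumption fails, here is a crude heuristic resolver over a hypothetical exchange. Production systems use trained coreference models over the full conversation; the rule below is only meant to show the shape of the problem:

def resolve_you(dialog: list[tuple[str, str]], idx: int) -> str:
    """Resolve "you" in dialog[idx], where each item is a (speaker, text) pair.

    Heuristic: prefer the participant the current speaker last addressed by name;
    fall back to the previous speaker only if no one was named.
    """
    speaker, _ = dialog[idx]
    participants = {s for s, _ in dialog}
    for prior_speaker, prior_text in reversed(dialog[:idx]):
        if prior_speaker == speaker:
            named = [p for p in participants if p != speaker and p in prior_text]
            if named:
                return named[0]
    return dialog[idx - 1][0]  # the naive previous-speaker assumption

dialog = [
    ("User 3", "User 1, did you finish the budget review?"),
    ("User 1", "Not yet, it's about half done."),
    ("User 2", "Just a reminder that the deadline is Friday."),
    ("User 3", "OK, can you send it to me by Thursday?"),
]
print(resolve_you(dialog, 3))  # -> "User 1", not the previous speaker (User 2)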

Semantic Parsing

Semantic parsing, also sometimes referred to as intent and entity detection, is an integral part of all task-oriented dialog systems. However, the problem gets far more complex in multiparty conversations. Take the dialog in the previous section. A structured intent and entity JSON block might look something like this:

{
    "intent": "schedule_meeting",
    "entities": {
        "organizer": "User 1",
        "attendees": [
            "User 2",
            "User 3"
        ],
        "title": "next quarter roadmap",
        "time_range": "next week"
    }
}

Note that the details in this JSON block did not all originate from our task-based dialog act. Rather, the information was pulled from multiple utterances across multiple speakers. Successfully achieving this requires a system that is exceptionally good at coreference resolution and discourse parsing.

Conversation Disentanglement

While some modern chat-based applications (e.g. Slack) have concepts of threading that can help isolate conversations, we can’t guarantee that any given dialog is single-threaded. Meetings are nonthreaded and chat conversations can contain multiple conversations that are interspersed with each other. That means any multiparty conversational AI system must be able to pull apart these different conversations to transact accurately. Let’s look at another adaptation of a previous conversation:

In this dialog, two of our users have started a separate conversation. This can lead to ambiguity in the last request to our agent. User 3 appears to be referring to the previous meeting we set up, but knowing this requires we separate (or disentangle) these two distinct conversations so we can successfully handle subsequent requests.
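
As a rough illustration of what disentanglement involves, here is a toy pass that attaches each incoming message to the thread it overlaps with most. Word overlap is a crude stand-in; real disentanglement models also lean on speaker identity, timing, and reply structure:

def assign_to_thread(message: str, threads: list[list[str]], min_overlap: int = 2) -> int:
    """Attach a message to the thread sharing the most words, or start a new one."""
    words = set(message.lower().split())
    best, best_score = -1, 0
    for i, thread in enumerate(threads):
        score = len(words & set(" ".join(thread).lower().split()))
        if score >= min_overlap and score > best_score:
            best, best_score = i, score
    if best == -1:
        threads.append([message])
        return len(threads) - 1
    threads[best].append(message)
    return best

threads: list[list[str]] = []
for msg in ["Let's schedule the roadmap review next week",
            "Did anyone see the outage alert this morning?",
            "Wednesday works for the roadmap review"]:
    print(assign_to_thread(msg, threads), msg)  # -> threads 0, 1, 0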

Social / Knowledge Graph and User Preferences

While this might not be obvious, when you engage in any multiparty conversation you are relying on a database of information that helps inform how you engage with each participant. That means any successful multiparty conversational AI system needs to be equally aware of this information. At a bare minimum, we need to know how each participant relates to each other and their preferences associated with the supported tasks. For example, if the CEO of the company is part of the conversation you may want to defer to their preferences when executing any task.
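
A minimal sketch of the kind of lookup this implies. The graph shape and the “defer to the most senior participant” policy are illustrative assumptions; a real system would back this with an org chart, calendars, and learned preferences:

# A toy participant graph keyed by attendee.
PEOPLE = {
    "User 1": {"role": "engineer", "reports_to": "User 3", "prefers": "mornings"},
    "User 2": {"role": "engineer", "reports_to": "User 3", "prefers": "afternoons"},
    "User 3": {"role": "ceo", "reports_to": None, "prefers": "afternoons"},
}
SENIORITY = {"engineer": 1, "manager": 2, "ceo": 3}

def scheduling_preference(attendees: list[str]) -> str:
    """Defer to the most senior attendee's stated preference (an illustrative policy)."""
    most_senior = max(attendees, key=lambda p: SENIORITY[PEOPLE[p]["role"]])
    return PEOPLE[most_senior]["prefers"]

print(scheduling_preference(["User 1", "User 2", "User 3"]))  # -> "afternoons"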

Executive Function

Perhaps most importantly, any task-oriented multiparty conversational AI system must have executive function capabilities. According to the field of neuropsychology, executive function is the set of cognitive capabilities humans use to plan, monitor, and execute goals.

Executive function is critically important in a multiparty conversation because we need to develop a plan for whether we immediately take action on any given request or must seek consensus first. Without these capabilities, an AI system will just blindly execute tasks. As described earlier in this post, this is exactly how your Alexa behaves today. If you and your kids continuously scream out “play <song name x>”, it will just keep changing songs without any attempt to build consensus, and the interaction with the conversational AI system will become dysfunctional. Let’s look at one more dialog interaction.

As you can see in the example above, our agent didn’t just automatically transact on the request to move the meeting to Wednesday. Instead, the agent used its executive function to do a few things:

  • Recognize that the second request did not come from the original requester
  • Preemptively pull back information about whether the proposal was viable
  • Seek consensus with the group before executing

Achieving this capability requires gathering previously collected data, developing a plan, and then executing against that plan. So for a task-oriented multiparty conversational AI system to correctly operate within a group, it must have executive function capabilities.
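
A highly simplified sketch of that plan-then-act loop, following the three steps above. The policy and the feasibility check are assumptions for illustration; the planner in a real system is far richer:

from dataclasses import dataclass

@dataclass
class ChangeRequest:
    requester: str
    originator: str       # who asked for the original task
    attendees: list[str]
    proposal: str         # e.g. "move the meeting to Wednesday"

def feasible(proposal: str, attendees: list[str]) -> bool:
    """Stand-in for a real availability check against attendees' calendars."""
    return True

def handle(request: ChangeRequest) -> str:
    # 1. The change didn't come from the original requester, so don't act blindly.
    if request.requester != request.originator:
        # 2. Pre-check whether the proposal is even viable.
        if not feasible(request.proposal, request.attendees):
            return f"'{request.proposal}' doesn't work for everyone; suggesting alternatives."
        # 3. Seek consensus before executing.
        return (f"@{request.originator}, {request.requester} proposed "
                f"'{request.proposal}'. OK to proceed?")
    return f"Executing '{request.proposal}'."

print(handle(ChangeRequest("User 2", "User 1", ["User 1", "User 2", "User 3"],
                           "move the meeting to Wednesday")))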

Generative Dialog Engine

Last but not least, any conversational AI system must be able to converse with your users. However, because the number of people in any given conversation and their identities are not predictable, and our executive function can produce a wide array of outcomes, no predefined or templated list of responses will suffice. A multiparty system will need to take all of the previously gathered information and generate responses on demand.

Wait, Don’t Large Language Models (LLMs) Solve This

With all the hype, you’d think LLMs could solve the problem of task-oriented multiparty conversational AI – and weave straw into gold. But it turns out that, at best, LLMs are just a piece in a much larger puzzle of AI technology needed to solve this problem.

There are basic problems, like the fact that LLMs are purely focused on text and can’t handle some of the speaker identification problems discussed earlier. But even more importantly, there is no evidence that LLMs have native abilities to understand the complexities of social interactions and to plan their responses and actions based on that information.

It will require a different set of technologies, perhaps leveraging LLMs in some instances, to fully build a task-oriented multiparty conversational AI system.

So When Can I Invite an AI to Join the Party

While I can’t say anyone has solved all the challenges discussed in this post, I can say we are very close. My team at Xembly has developed what we believe is the first commercial product capable of participating in multiparty conversations as both a silent observer and an active participant. Our AI agent can join in-person meetings or converse with a group in Slack while also helping complete tasks that arise as a byproduct of these conversations.

We are only just beginning to tackle task-oriented multiparty conversational AI. So we may not be the life of the party, but go ahead and give Xembly and our Xena AI agent a try. The least you can do is send us an invite!

  1. With the Kinect Camera, we hoped to individually identify speakers in a room so each user could independently interact with the game. You can read more details about our work in this space here: 1, 2, 3, 4 ↩︎
  2. Are tasks in multiparty conversations just action items? Yes, since an action item is generally defined as a task that arises out of a group’s discussion. I’ll be writing a larger deep dive into action item detection in a future post. ↩︎