We’ve all been there. You gather your team for a meeting and make a bunch of decisions that lead to a series of follow-up action items. Then a week goes by, you meet again, and nobody remembers what on earth you decided, and none of the action items have been closed out. Right there you’ve wasted hours of your time, the team’s time, and most importantly the company’s time. That, my friends, is why we take meeting notes!
Capturing meeting notes and, more importantly, the action items that result from them is critical for a high-functioning team. But there is a downside. Taking notes while simultaneously participating in a meeting is difficult, and you usually wind up focusing on one task or the other. So the prospect of a large language model (LLM) taking a meeting transcript and producing an accurate list of action items is insanely attractive. Too bad you’ll find it doesn’t do a great job on its own.
What Is an Action Item?
Before we dive into the details of why LLMs struggle to capture action items, it’s worth defining what an action item is. A quick search on Google will find you a dozen or so similar definitions. For the sake of this post, I prefer the definition given by our friends at Asana:
That’s a pretty good definition. But truthfully, if you interview 100 people and ask them what the action items are for a given meeting, you’ll get 100 different answers. Why? It turns out that the complete list of action items from any given meeting is wildly dependent on the purpose of the meeting, the nature of the project, the type of work the organization does, and sometimes the simple subjectivity of the notetaker.
Anyone who has worked on machine learning projects knows you can’t teach a machine to learn random human behavior. There has to be structure in what you are trying to teach the machine, even if we humans can’t fully articulate it. So at Xembly, we’ve adopted a slightly more precise definition that draws clear lines between what is and is not an action item.
Why the different definition? Action items from a 1:1 meeting may not be project-based. They may cover multiple projects or self-improvement tasks. A commitment to walk the dog after the meeting may be irrelevant for an engineering standup but critically important for a veterinary office. The definition above gives us the best chance of getting 100 people to agree on the action items from a given meeting.
Why LLMs Struggle with Action Item Detection
There are a host of reasons why LLMs fail to capture the action items from a meeting with sufficient precision and recall. However, they fall into a few key areas:
- Information isn’t encoded in the text
- Lack of social awareness
- Difficulty in doing abstract calculations
Let’s dive into each of these individually.
Information isn’t encoded in the text
I’ve discussed this issue in an earlier blog post, but to reiterate: an LLM is just predicting the next word or token based on the previous words or tokens. If the information necessary to predict the next word isn’t contained in the earlier text (or encoded in the base model), the LLM is unlikely to give you a high-value output. There are many kinds of information that may not be contained in the text of a given conversation, but let’s focus on two in particular: outside context and visual information.
Outside Context
Let’s assume a manager stops by an employee’s desk earlier in the day to discuss a possible new project. Then, in a later 1:1 meeting, the manager says, “Remember that project we discussed? I want you to do it.” The context of the project is not contained in the text, so there is no way for the LLM to know what the task actually is. The LLM will either struggle to classify this as an action item or, at best, return a vague and ambiguous action item that isn’t of much use.
But missing context isn’t limited to nonspecific references. The lack of a corporate “world model” can have all sorts of implications. For example, “walking the dog” may be an action item if you work at a vet or just a passing comment in a standup meeting. Sarcasm may also be difficult to discern without a larger working knowledge of what is going on in an organization.
Visual Information
It is very common to have working meetings. In those meetings, some proportion of the tasks will be acted upon during the meeting itself, while others are commitments to future work. The difference isn’t always obvious from the transcript. For example, someone saying “I’m going to update that row in the spreadsheet” may or may not be doing so at that very moment. The text alone is often insufficient to tell whether a participant has already completed a task; you frequently need the associated visual information (a shared screen, for example) to confirm whether a given item is a future commitment to do work.
Social Awareness
We humans are funny creatures. For a host of reasons I won’t get into here, we often like to be non-committal. That means we will often hedge our commitments so we can’t be held accountable, and that has meaningful impacts on any model identifying action items. There are two accountability-dodging techniques in particular that LLMs struggle with: the royal we and vague or ambiguous tasks.
The Royal We
The best way of avoiding ownership of a given task is to suggest that “we” should do it, because “we” can mean everybody, anybody, or just me, and therein lies the problem. Sometimes “we” really does mean everyone: if I say “We all need to complete our expense reports by the end of the week,” everyone in the meeting likely owns that task. However, if I say “We will get back to you next week,” “we” really means “I”, but I’ve hedged on ownership. This makes it incredibly difficult for an LLM to decide whether these tasks are actually action items and, if they are, who the owner should be.
Vague and Ambiguous Tasks
The other way humans hedge on accountability is to provide vague or ambiguous task descriptions, for example “I’m going to do something about that” or “I should really look into that.” The problem with the first example is that the task itself is unclear. The problem with the second is the hedge word “should.” In both cases, it is unclear from the text alone whether these are real action items, so LLMs generally have to guess, and they usually do so with 50/50 accuracy at best.
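One way to cope is to treat hedged phrasing as a signal rather than a verdict. The sketch below is purely illustrative (the phrase list and function name are made up, not Xembly’s actual approach): flag hedged utterances so something downstream, whether a ranking model or a human reviewer, makes the final call.

```python
# Illustrative only: a crude hedge signal that a downstream ranker or a human
# reviewer can weigh, instead of forcing a hard yes/no on ambiguous phrasing.
HEDGE_PHRASES = ["should really", "might", "maybe", "at some point",
                 "do something about", "look into"]

def is_hedged(utterance: str) -> bool:
    """True if the utterance contains a known hedge phrase."""
    text = utterance.lower()
    return any(phrase in text for phrase in HEDGE_PHRASES)

print(is_hedged("I should really look into that"))      # True  -> flag for review
print(is_hedged("I'll send the revised budget today"))  # False -> likely a real commitment
```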
Abstract Calculations
Last but not least, LLMs do poorly at abstract calculations. Action items often have a due date, but those dates are usually relative (e.g. “I’ll send you that on Tuesday”). Converting a relative date like “next Tuesday” to April 2nd, 2024 requires date arithmetic, and that is not something LLMs excel at. As I’ve commented in the past, LLMs struggle to even understand a leap year, so how can they accurately provide due dates?
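This is exactly the kind of problem ordinary code solves reliably. As a minimal sketch (the helper name and the meeting date are invented for illustration), resolving “next Tuesday” deterministically takes a few lines once you know when the meeting happened:

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]

def resolve_next_weekday(weekday_name: str, meeting_date: date) -> date:
    """Return the next occurrence of the named weekday strictly after meeting_date."""
    target = WEEKDAYS.index(weekday_name.lower())
    days_ahead = (target - meeting_date.weekday()) % 7
    if days_ahead == 0:  # saying "Tuesday" on a Tuesday means a week out, not today
        days_ahead = 7
    return meeting_date + timedelta(days=days_ahead)

# "I'll send you that on Tuesday," said in a meeting on Thursday, March 28th, 2024
print(resolve_next_weekday("Tuesday", date(2024, 3, 28)))  # -> 2024-04-02
```

Offloading the date math to deterministic code and letting the LLM handle only the language is far more reliable than asking the model to do the arithmetic itself.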
Summarizing the Action Items from this Post
So if an LLM isn’t good enough on its own to capture action items for meeting notes, what should you do? At Xembly we’ve found that you need to augment an LLM with additional models to truly get close to 100% precision and recall when identifying action items.
Specifically, we’ve found it necessary to be more permissive about what we call an action item and then use ranking models to order candidates by likelihood. This gives the end user the ability to quickly make those 50/50 judgment calls with just the click of a button. We have also built dedicated models for due date and owner detection that perform far more accurately than what you get out of the box with an LLM. Finally, wherever possible, we’ve tried to connect our evaluation to data sources (knowledge graphs and world models) that extend beyond the conversation itself.
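As a rough sketch of that permissive-detection-plus-ranking idea (the class, names, and scores below are invented for illustration, not Xembly’s production models): extract every plausible commitment, score it, and present the list sorted so a user can confirm or dismiss the borderline ones in a click.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CandidateActionItem:
    text: str             # the utterance that might be a commitment
    owner: Optional[str]  # owner resolved by a dedicated model, if any
    score: float          # likelihood from a ranking model (0.0 - 1.0)

def rank_candidates(candidates: list[CandidateActionItem]) -> list[CandidateActionItem]:
    """Order permissively extracted candidates so the most likely action items surface first."""
    return sorted(candidates, key=lambda c: c.score, reverse=True)

meeting_candidates = [
    CandidateActionItem("I'll send the Q2 budget to finance by Friday", "Priya", 0.92),
    CandidateActionItem("We should really look into that", None, 0.41),
    CandidateActionItem("I'm going to walk the dog after this", None, 0.08),
]

for item in rank_candidates(meeting_candidates):
    print(f"{item.score:.2f}  {item.owner or 'unassigned'}: {item.text}")
```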
Ultimately, LLMs can be an incredibly helpful tool for building a notetaking solution. But you’ll have a few action items on your plate to augment the technology if you want enough accuracy to delight your users.