How to Increase Team Productivity with “Swing”

I wasn’t looking for anything special when I first picked up a copy of The Boys in the Boat. I was in an airport, probably more concerned about finding a decent cup of coffee than a life-changing read. But what started as a random grab for entertainment turned into a transformative experience that solidified my approach to managing teams.

Daniel James Brown’s epic telling of the 1936 U.S. eight-man Olympic rowing team has a lot to offer. The book paints vivid pictures of Seattle—my current hometown—and the Hudson Valley in New York, where I spent my childhood. It’s an underdog story that rivals my favorite movie, Rocky. But what truly resonated with me was the introduction to the concept of “swing.”

What Is “Swing”?

While I could describe “swing” in my own words, it’s best to hear it from the source:

“There is a thing that sometimes happens in rowing that is hard to achieve and hard to define… It’s called ‘swing.’ It only happens when all eight oarsmen are rowing in such perfect unison that no single action by any one is out of sync with those of all the others… Each minute action—each subtle turning of wrists—must be mirrored exactly by each oarsman, from one end of the boat to the other. Only then will the boat continue to run, unchecked, fluidly and gracefully between pulls of the oars. Only then will it feel as if the boat is a part of each of them, moving as if on its own… Rowing then becomes a kind of perfect language. Poetry, that’s what a good swing feels like.”

Daniel James Brown, The Boys in the Boat

What struck me most about this description of “swing” was how much it mirrored experiences I’ve had—not in a rowboat, mind you (I can barely row an inflatable raft across a pool)—but as a musician playing in bands and as a leader of product and engineering teams.

Swing on Product and Engineering Teams

Just like in a racing shell, “swing” on a product and engineering team occurs when everyone operates in complete synchronicity. Instead of propelling a boat forward, you are propelling a product forward at warp speed. When you find it, you’ll know because you’ll be delivering high-value, high-quality software at an astonishing pace, and it will feel like no cycle of engineering time is wasted. It’s a euphoric experience, and once you’ve tasted it, you’ll forever be chasing that high.

How to Get Your Team to Swing

If you’re thinking “swing” sounds as elusive as finding a unicorn in your backyard, fear not. Here’s your cheat sheet for achieving “swing” with your team:

  • Hire for Skill AND Team Fit
  • Build Trust
  • Align on a Common Vision
  • Shuffle the Pieces
  • Change the Team

Hire for Skill AND Team Fit

Building a product, like rowing, is a team sport. While skills are essential, team fit is equally crucial. In my career, I’ve seen many interviews that barely touch on team fit. Ensure you evaluate candidates not just for their technical skills but also for how well they align with your team’s working style. Conduct team interviews, perhaps over lunch, to observe larger team dynamics.

Build Trust

To achieve “swing,” everyone must do their job without stepping on each other’s toes. This requires near-blind trust among team members. Building trust isn’t easy, but providing opportunities for the team to socialize and avoiding favoritism can help. Deliver praise and feedback equally to maintain trust over time.

Align on a Common Vision

On a crew team, the coxswain sets the pace and strategy. On a product and engineering team, that’s your role. Ensure your team aligns on a common vision. Clearly articulate the vision and roadmap, and reiterate them frequently until they become the team’s mantra. If the team starts veering off track, use the vision as a compass to steer them back.

Shuffle the Pieces

Just as rowers might need to switch positions to find the optimal configuration, you may need to reassign roles within your team. Align tasks with each member’s strengths, whether it’s frontend, backend, DevOps, or project management, to maximize efficiency.

Change the Team

If the team can’t achieve “swing,” some members might not be a fit. Don’t be afraid to make changes, even if it means letting someone go or moving them to another team. These tough conversations are crucial for achieving “swing.”

Go for Gold

Getting a product and engineering team to “swing” is an incredible feeling. If you’ve never experienced it, you’ll know when you find it. Follow the guidance in this post, and experience your team delivering software at a pace that would make even Usain Bolt jealous. And when that happens, don’t be surprised when company executives, other teams, and customers line up to hand you your Olympic gold.

How to Use Nondeterministic LLMs for Building Robust Deterministic Applications

If you’ve followed my previous posts, you might be under the impression that I’m against large language models (LLMs). However, in the words of my mother, “I’m just kvetching!” The truth is, I’ve been a devoted fan of LLMs for quite some time and have successfully integrated them into multiple customer-facing products. So, why the seemingly negative outlook?

The Challenges of LLMs

LLMs will undoubtedly accelerate your ability to deploy conversational AI and natural-language products and features. But as the old adage goes, “there is no such thing as a free lunch.”

No matter how passionate you are about LLMs, anyone who attempts to construct a production-grade product around them will inevitably learn that it comes at a cost.

Here are some challenges that come with LLMs:

  • Nondeterministic behavior
  • Lack of strong typing
  • Struggles with abstract calculations
  • Poor uptime
  • Performance issues
  • High cost

My aim here is to delve into each of these areas and offer some advice on how you can navigate around them.

Navigating LLM Pitfalls


Nondeterministic Behavior

Like it or not, commercially available LLMs all exhibit nondeterministic behavior, even with a temperature of zero. Why? There have been various theories, ranging from the temperature of zero not being a true zero, to floating point rounding errors that become magnified due to the autoregressive (i.e., recursive) nature of generative AI models. Regardless of the cause, it happens, and it’s something you’ll need to manage.

LLMs are increasingly taking the place of more traditional intent and entity systems like Rasa or spaCy. These systems have historically relied on discriminative machine learning models (e.g., classification models) for intent and entity detection. In such a system, your intent classes and entity key/value pairs remain fixed, and the only errors you need to deal with are misclassifications.

However, the predictability of intent or entity classes, or even the associated values attributed to a particular entity, is not guaranteed when using an off-the-shelf LLM. For instance, even with specific prompts, we’ve noticed a variety of responses for intents like sharing a summary.

share_meeting_intents = {"share_summary", "share_meeting_summary"}

The same variability applies to entities. Even more concerning, we’ve found that LLMs can’t guarantee consistency in entity values. Even if you instruct the LLM to return the associated input text exclusively, the model might attempt to produce more normalized data. For instance, a month like “June” could be converted to “6/1.”

To navigate these issues, treat your LLM solution like any other ML project and test it accordingly. This means having a substantial amount of ground truth data and comparing LLM responses to expected results. Because of the variability in responses, you’ll likely need to accommodate multiple valid alternatives for all of your expected classes. Post-normalization evaluation of entity values might be necessary to avoid grappling with every possible variation.
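As a concrete sketch of that evaluation approach, the harness below compares each LLM response against a set of acceptable intent labels and normalizes entity values before comparison. The intent names extend the set shown earlier; the normalizer and helper names are illustrative, not from any specific framework:

```python
# Sketch of an evaluation harness that tolerates LLM response variability.
# The month normalizer and helper names below are illustrative assumptions.

SHARE_MEETING_INTENTS = {"share_summary", "share_meeting_summary"}

def normalize_month(value: str) -> str:
    """Map month variants ("June", "Jun", "6/1") onto one canonical form."""
    aliases = {"june": "6", "jun": "6", "6/1": "6", "6": "6"}
    return aliases.get(value.strip().lower(), value.strip().lower())

def intent_matches(predicted: str, accepted: set) -> bool:
    # Accept any member of the set of known-valid intent variants.
    return predicted in accepted

def entity_matches(predicted: str, expected: str) -> bool:
    # Compare entities post-normalization so "June" and "6/1" both pass.
    return normalize_month(predicted) == normalize_month(expected)

def evaluate(cases, accepted_intents):
    """cases: list of (predicted_intent, predicted_month, expected_month)."""
    correct = sum(
        intent_matches(pi, accepted_intents) and entity_matches(pm, em)
        for pi, pm, em in cases
    )
    return correct / len(cases)
```

With ground truth in hand, you can run this over generated test cases and track the score release over release.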

You might need to perform significant testing with generated test cases before launching to understand the range of potential responses for any given system. But even with thorough advanced testing, robust monitoring tools will likely be necessary to swiftly assess what’s happening in real-time with your customers. As you uncover failure cases, ensure that your code is structured in a way that allows you to quickly accommodate these new scenarios.

Tools like Label Studio can be helpful for building labeling solutions and labeling data. However, due to the high degree of variability, you might need to devise your own testing framework.

Label Studio example with multiple intents and normalized date / time entities.

Lack of Strong Typing

LLMs output text, meaning none of the data can be considered strongly typed. For example, if a user says, “send an email to Jason”, we don’t definitively know who “Jason” is. Your system may be able to suggest possible “Jasons” for the user to confirm, but you’ll need to store and associate that strongly typed identifier (email address, ID, etc.) with the conversation.

There are several ways to address this problem. One approach is to pass back an ID to the LLM, ensuring it’s not regurgitated to the user. However, depending on how you’re using LLMs, this may not always be feasible.  We also know that LLMs can hallucinate and there is always a chance that the ID you get back is not the one you started with.

An alternative is to maintain strongly typed context data associated with the conversation in a separate document store, like MongoDB. This can help prevent repeated confirmation requests from the user.
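A minimal in-memory sketch of that idea is shown below: a per-conversation store that keeps the strongly typed identifier out of the LLM loop entirely. In production this might be backed by a document store such as MongoDB; the class and method names are illustrative.

```python
# Minimal sketch of a per-conversation context store. Once the user confirms
# which "Jason" they meant, the strongly typed ID lives here, not in the LLM.

class ConversationContext:
    def __init__(self):
        # conversation_id -> {mention -> strongly typed identifier}
        self._store = {}

    def resolve(self, conversation_id: str, mention: str):
        """Return the confirmed ID for a mention, if one was already confirmed."""
        return self._store.get(conversation_id, {}).get(mention.lower())

    def confirm(self, conversation_id: str, mention: str, entity_id: str) -> None:
        """Record the user's confirmation so we never have to ask again."""
        self._store.setdefault(conversation_id, {})[mention.lower()] = entity_id
```

On each turn, check the store before prompting the user for confirmation; only unresolved mentions trigger a clarifying question.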

Abstract Calculations

While recent models have shown improved functionality with tasks like math, LLMs still struggle with abstract calculations. It seems that the models are merely predicting next tokens without having learned to generalize any mathematical abilities.

For this reason, I suggest using LLMs to collect entity data but resort to a more reliable system for any abstract calculations. For instance, if you are doing any date / time calculations, it’s best not to rely on the LLM. Consider leveraging dedicated date / time NLP solutions like SUTime, Duckling, or build your own in Python.
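To illustrate the build-your-own route, here is a hand-rolled resolver for one simple relative date pattern, “next Tuesday.” It handles only this one pattern and is purely a sketch; a real system would use something like SUTime or Duckling for broader coverage.

```python
# Resolve "next <weekday>" deterministically instead of asking the LLM
# to do date arithmetic. Illustrative only: handles a single pattern.
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday",
            "saturday", "sunday"]

def next_weekday(reference: date, weekday_name: str) -> date:
    """Return the next occurrence of the named weekday strictly after `reference`."""
    target = WEEKDAYS.index(weekday_name.lower())
    # Days until the target weekday, always at least 1 (never "today").
    days_ahead = (target - reference.weekday() - 1) % 7 + 1
    return reference + timedelta(days=days_ahead)
```

The LLM can still extract the raw entity text (“next Tuesday”); the conversion to a concrete date happens in code you can unit test.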

Poor Uptime

Not to pick on OpenAI, but they have been known to have less than satisfactory uptime, which wouldn’t pass muster in most traditional SaaS businesses. OpenAI’s API currently reports two 9’s of uptime for the past 90 days. 

OpenAI API uptime report.

That means you need to prepare for 7+ hours of downtime every month. Addressing this requires a blend of good conversational UI and traditional software practices. Make sure you have sufficient retry logic built into your system. You will also want to clearly communicate to your users when the system is incapable of responding.
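A sketch of that retry logic is shown below: exponential backoff with jitter around the LLM call. The `call_llm` callable and the bare exception handling are placeholders for whatever client and error types you actually use.

```python
# Retry a flaky LLM call with exponential backoff plus jitter.
# `call_llm` is a placeholder for your real client call.
import random
import time

def call_with_retries(call_llm, max_attempts: int = 4, base_delay: float = 1.0):
    for attempt in range(max_attempts):
        try:
            return call_llm()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the failure to the UI layer
            # Back off 1s, 2s, 4s, ... with a little jitter to avoid thundering herds.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

When the final attempt fails, let the exception propagate so the conversational UI can tell the user the system can’t respond right now.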


Performance Issues

LLMs can be slow, especially for long responses. Products like OpenAI offer streaming output, which can significantly reduce the perceived latency from the end user’s perspective. Latency can also be reduced by minimizing the number of output tokens as much as possible.

If you send data to other systems to do calculations or lookups performance can degrade even further. Users expect very low latency in conversational AI systems, so while this may seem obvious, it is critical that you optimize every component in your stack to be as fast as possible.

If latency is still an issue, you will likely need to structure your UI to mitigate the impact on the end user. Consider immediately replying to let the user know you are working on a task, so you set expectations properly.


High Cost

Finally, LLM pricing can be prohibitively expensive. Our LLM costs have become our largest expense beyond traditional cloud costs. The best way to manage costs is to use a federated approach.

Consider each challenge you’re addressing and question whether an LLM is truly the optimal solution. You might reap greater benefits from developing your own model for specific issues. This could be more cost-effective and potentially offer better performance. And remember, not every situation warrants the use of the most expensive LLM.
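One way to picture the federated approach is a simple router that sends each task type to the cheapest model that can handle it. The model names and per-token prices below are entirely made up for illustration:

```python
# Illustrative "federated" routing: use the cheapest adequate model per task.
# Model names and relative costs are hypothetical.
MODEL_COSTS = {"small-local": 0.0, "mid-tier": 0.5, "frontier": 5.0}

ROUTES = {
    "intent_detection": "small-local",   # a fine-tuned classifier is enough here
    "entity_extraction": "mid-tier",
    "free_form_reply": "frontier",       # only pay for the big model when needed
}

def route(task: str) -> str:
    """Pick the configured model for a task; default to the most capable one."""
    return ROUTES.get(task, "frontier")
```

The point is not the specific table but the habit: every call site should justify why it needs the most expensive model.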

LLMs and Your Application

LLMs can significantly speed up your development. They can also enable you to tackle features that were previously beyond the capabilities of your team. However, LLMs are not a cure-all. If you want to build robust applications, you’ll need to invest a bit of extra effort. Happy building!

The Astonishing Reasons Why Your LLM is a Bad Notetaker

We’ve all been there. You gather your team for a meeting, you make a bunch of decisions that lead to a series of follow-up action items. Then a week goes by, you meet again and nobody remembers what on earth you decided and none of the action items have been closed out. And right there you just wasted hours of your time, the team’s time, and most importantly the company’s time. That, my friends, is why we take meeting notes!

Capturing meeting notes and more importantly, the action items that result from them, is critical for a high-functioning team. But there is a downside. Taking notes while simultaneously participating in a meeting is difficult and you usually wind up focusing on one task or the other. So the prospect of a large language model (LLM) being able to take a meeting transcript and produce an accurate list of action items is insanely attractive. Too bad you’ll find it doesn’t do an excellent job.

What Is an Action Item?

Before we dive into the details of why LLMs struggle to capture action items, it’s worth defining what an action item is. A quick search on Google will find you a dozen or so similar definitions. For the sake of this post, I prefer the definition given by our friends at Asana:

An action item is a task that is created from a meeting with other stakeholders to move a project towards its goal

That’s a pretty good definition. But truthfully if you interview 100 people and ask them what the action items are for a given meeting you’ll get 100 different answers. Why? It turns out that the complete list of action items from any given meeting is wildly dependent on the purpose of the meeting, the nature of a given project, the type of work done by an organization, and sometimes the simple subjectivity of the notetaker.

Anyone who has worked on machine learning projects knows you can’t teach a machine to learn random human behavior. There has to be structure in what you are trying to teach the machine, even if we humans can’t fully articulate it. So at Xembly, we’ve adopted a slightly more precise definition that draws clear lines between what is and is not an action item.

A commitment, made by a meeting attendee or on behalf of another person to do future work related to the agenda of the meeting or business of the organization.

Why the different definition? Action items from a 1:1 meeting may not be project-based. It may cover multiple projects or self-improvement tasks. A commitment to walk the dog after the meeting may be irrelevant for an engineering standup but critically important for a veterinary office. The definition above gives us the best chance of getting 100 people to agree on the action items from a given meeting.

Why LLMs Struggle with Action Item Detection

There are a host of reasons why LLMs fail to accurately capture the action items from a meeting with sufficient precision and recall. However, they fall into a few key areas:

  • Information isn’t encoded in the text
  • Lack of social awareness
  • Difficulty in doing abstract calculations

Let’s dive into each of these individually.

Information isn’t encoded in the text

I’ve discussed this issue in an earlier blog post, but to reiterate an LLM is just predicting the next word or token based on previous words or tokens. If the information necessary to predict the next word isn’t contained in the earlier text (or encoded in the base model) the LLM will likely not give you a high-value output. There is a variety of information that may not be contained in the text of a given conversation, but let’s focus on two in particular, outside context and visual information.

Outside Context

Let’s assume a manager passes by an employee earlier in the day to discuss a possible new project. Subsequently, in a later 1:1 meeting, the manager says, “Remember that project we discussed? I want you to do it.” This is a situation where the context of the project is not contained in the text, so there is no way for the LLM to know that this is a concrete action item. The LLM will either struggle to classify this as an action item or, at best, return a vague and ambiguous action item that isn’t of much use.

But missing context isn’t limited to nonspecific references. A lack of a corporate “world model” can have all sorts of implications. For example “walking the dog” may be an action item if you work at a vet, or just a passing comment in a standup meeting. Sarcasm may also be difficult to discern without a larger working knowledge of what is going on in an organization.

Visual Information

It is very common to have working meetings. In those meetings, some proportion of the tasks will be acted upon during the meeting itself, while others are commitments to future work. That isn’t always obvious unless you have access to the associated visual information. For example, someone saying “I’m going to update that row in the spreadsheet” may or may not be doing so right at that moment. The text alone is often insufficient to identify that a meeting participant has already taken action on a task; you often need additional visual information to confirm whether a given item is a future commitment to do work.

Social Awareness

We humans are funny creatures. For a host of reasons I won’t get into here, we often like to be non-committal. That means we will often hedge our commitments so we can’t be held accountable. That ultimately has meaningful impacts on any model identifying action items. There are two techniques humans tend to use to avoid accountability that LLMs struggle with, the royal we and vague/ambiguous tasks.

The Royal We

The best way of avoiding ownership of a given task is to suggest that “we” should do the task. Because “we” means everybody or anyone and therein lies the problem. Sometimes “we” really is the royal we. If I say “We all need to complete our expense reports by the end of the week”, that likely really means everyone in the meeting owns the task. However, if I say “We will get back to you next week”, we means “I” but I hedged on ownership. This makes it incredibly difficult for an LLM to understand if these types of tasks are actually action items and if they are, who the owner should be.

Vague and Ambiguous Tasks

The other way humans hedge on accountability is to provide vague or ambiguous task descriptions. For example “I’m going to do something about that” or “I should really look into that”. The problem with the first example is that the task is very unclear. The problem with the second example is that I used the hedge word “should”. In both these cases, it is unclear from the text if they are relevant action items. That means LLMs generally have to guess and usually do so with 50/50 accuracy at best.
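A toy heuristic makes the problem concrete: flag hedged or vague commitments so they can be ranked or surfaced for human review rather than trusted outright. The word lists below are illustrative, not a production lexicon:

```python
# Toy detector for hedged commitments. The word and phrase lists are
# illustrative assumptions, not a complete production lexicon.
HEDGE_WORDS = {"should", "might", "maybe", "probably", "someday"}
VAGUE_PHRASES = {"look into", "do something about", "think about"}

def is_hedged(utterance: str) -> bool:
    """Flag utterances whose commitment is softened or whose task is vague."""
    text = utterance.lower()
    return any(word in text.split() for word in HEDGE_WORDS) or any(
        phrase in text for phrase in VAGUE_PHRASES
    )
```

Even a crude filter like this illustrates why hedged statements deserve different handling than firm commitments like “I’ll send it Tuesday.”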

Abstract Calculations

Last but not least, LLMs do poorly at abstract calculations. Action items often have a due date, but those dates are usually relative (e.g., “I’ll send you that next Tuesday”). Converting a relative date like “next Tuesday” to April 2nd, 2024 requires abstract calculation, which is not something LLMs excel at. As I’ve commented in the past, LLMs struggle to even understand a leap year, so how can they accurately provide due dates?

Summarizing the Action Items from this Post

Well if an LLM isn’t good enough on its own for capturing action items for meeting notes what should you do? At Xembly we’ve found that you need to augment an LLM with additional models to truly get close to 100% precision and recall when identifying action items.

Specifically, we’ve found it necessary to be more permissive in what we call action items and subsequently use ranking models for ordering them by likelihood. This gives the end user the ability to quickly make those 50/50 judgment calls with just a click of a button. We have also built dedicated models for due date and owner detection that perform far more accurately than what you will get out of the box with an LLM. Finally, wherever possible we’ve tried to connect our evaluation to data sources (knowledge graphs/world models) that extend beyond the conversation.

Ultimately, LLMs can be an incredibly helpful tool for building a notetaking solution. But you’ll have a few action items on your plate to augment the technology if you want sufficient accuracy to delight your users.

Introducing Task-Oriented Multiparty Conversational AI: Inviting AI to the Party

The term “conversational AI” has been around for some time. There are dozens of definitions all over the internet. But let me refresh your memory with a definition from NVIDIA’s website.

Conversational AI is the application of machine learning to develop language-based apps that allow humans to interact naturally with devices, machines, and computers using speech

There’s nothing wrong with that definition except for one small misleading phrase: “… allow humans to interact …”. What that should say is: “… allow a human to interact …”. Why? Because every interaction you’ve ever had with a conversational AI system has been one-on-one.

Sure, you and your kids can sit around the kitchen table blurting out song titles to Alexa (“Alexa, play the Beatles,” “No Alexa, play Travis Scott,” “No Alexa, play Olivia Rodrigo.” …). Alexa may even acknowledge each request, but she isn’t having a conversation with your family. She’s indiscriminately acknowledging and transacting on each request as if they’re coming in one by one, all from the same person.

And that’s where multiparty conversational AI comes into play.

What is Multiparty Conversational AI

With a few small tweaks, we can transform our previous definition of conversational AI to one that accurately defines multiparty conversational AI.

Multiparty conversational AI is the application of machine learning to develop language-based apps that allow AI agents to interact naturally with groups of humans using speech

While the definitions may appear similar, they are fundamentally different. One implies a human talking to a machine, while our new definition implies a machine being able to interact naturally with a group of humans using speech or language. This is the difference between one-on-one interactions versus an AI agent interacting in a multiparty environment.

Multiparty conversational AI isn’t necessarily new. Researchers have been exploring multiparty dialog and conversational AI for decades. I personally contributed to early attempts at building multiparty conversational AI into video games with the Kinect camera nearly fifteen years ago. But sadly, no one has solved all the technical challenges associated with building these types of systems, and there has been no commercial product of note.

What about the “Task-Oriented” part?

You may have wisely noted that I have not mentioned the words “task-oriented” contained in the title of this post. Conversational AI (sometimes also called dialog systems) can be divided into two categories, open-domain and task-oriented.

Open-domain systems can talk about any arbitrary topic. The goal is not necessarily to assist any particular action, but rather engage in arbitrary chitchat. Task-oriented systems are instead focused on solving “tasks”. Siri and Alexa are both task-oriented conversational AI systems.

In multiparty systems, tasks become far more complicated. Tasks are usually the output of a conversation in which a consensus is formed that necessitates action. Therefore, any task-oriented multiparty conversational AI system must be capable of participating in forming that consensus, or it will risk taking action before it is appropriate to do so.

Multiparty Conversational AI, What is it Good For?

“Absolutely Everything!” Humans are inherently social creatures. We spend much of our time on this planet interacting with other humans. Some have even argued that humans are a eusocial species (like ants and bees) and that our social interactions are critical to our evolutionary success. Therefore, for any conversational AI system, to truly become an integral part of our lives, it must be capable of operating amongst groups of humans.

Nowhere is this more evident than in a corporate work environment. After all, we place employees on teams, they have group conversations on Slack/Teams and email, and we constantly gather small groups of people in scheduled or ad-hoc meetings. Any AI system claiming to improve productivity in a work environment will ultimately need to become a seamless part of these group interactions.

Building Task-Oriented Multiparty Conversational AI Systems

There is a litany of complex problems that need to be solved to reliably build a task-oriented multiparty conversational AI system that would be production-worthy. Below is a list of the most critical areas that need to be addressed.

  • Task detection and dialog segmentation
  • Who’s talking to whom
  • Semantic parsing (multi-turn intent and entity detection)
  • Conversation disentanglement
  • Social graphs and user/organization preferences
  • Executive function
  • Generative dialog

In the next sections, we’ll briefly dive deeper into each of these areas.

Task Detection and Dialog Segmentation

In a single-party system such as Alexa or Siri, task detection is quite simple. You address the system (“Hey Siri …”), and everything you utter is assumed to be a request to complete a task (or a follow-up step needed to complete one). But in multiparty conversations, detecting tasks is far more difficult. Let’s look at the two dialog segments below.

Two aspects of these conversations make accurately detecting tasks complex:

  • In the first dialog, our agent Xena, is an active part of the conversation, and the agent is explicitly addressed. However, in the second conversation, our agent passively observed a task assigned to someone else and subsequently proactively offered assistance. That means we need to be able to detect task-oriented statements (often referred to as a type of dialog act) that might not be explicitly addressed to the agent.
  • The second issue is that the information necessary to complete either of these tasks is contained outside the bounds of the statement itself. That means we need to be able to segment the dialog (dialog segmentation) to capture all the utterances that pertain to the specific task.

Beyond the two challenges above there is also the issue of humans often making vague commitments or hedging on ownership. This presents additional challenges as any AI system must be able to parse whether a task request is definitive or not and be able to handle vague tasks or uncertain ownership.

Who’s Talking to Whom

To successfully execute the task in a multiparty conversation we need to know who is making the request and to whom it is assigned. This raises another set of interesting challenges. The first issue is, how do we even know who is speaking in the first place?

In a simple text-based chat in Slack, it is easy to identify each speaker. The same is true of a fully remote Zoom meeting. But what happens when six people are all collocated in a conference room? To solve this problem we need to introduce concepts like blind speaker segmentation and separation and audio fingerprinting.

But even after we’ve solved the upfront problem of identifying who is in the room and speaking at any given time there are additional problems associated with understanding the “whom”. It is common to refer to people with pronouns and in a multiparty situation you can’t just simply assume “you” is the other speaker. Let’s look at a slightly modified version of one of the conversations we presented earlier.

The simple assumption would be that the previous speaker (User 2) is the “whom” in this task statement. But after analyzing the conversation it is clear that “you” refers to User 1. Identifying the owner or “whom” in this case requires concepts like coreference resolution (who does “you” refer to elsewhere in the conversation) to correctly identify the correct person.

Semantic Parsing

Semantic parsing, also sometimes referred to as intent and entity detection, is an integral part of all task-oriented dialog systems. However, the problem gets far more complex in multiparty conversations. Take the dialog in the previous section. A structured intent and entity JSON block might look something like this:

    "intent": "schedule_meeting",
    "entities": {
        "organizer": "User 1",
        "attendees": [
            "User 2",
            "User 3"
        "title": "next quarter roadmap",
        "time_range": "next week"

Note that all of the details in this JSON block did not originate from our task-based dialog act. Rather the information was pulled from multiple utterances across multiple speakers. Successfully achieving this requires a system that is exceptionally good at coreference resolution and discourse parsing.

Conversation Disentanglement

While some modern chat-based applications (e.g. Slack) have concepts of threading that can help isolate conversations, we can’t guarantee that any given dialog is single-threaded. Meetings are nonthreaded and chat conversations can contain multiple conversations that are interspersed with each other. That means any multiparty conversational AI system must be able to pull apart these different conversations to transact accurately. Let’s look at another adaptation of a previous conversation:

In this dialog, two of our users have started a separate conversation. This can lead to ambiguity in the last request to our agent. User 3 appears to be referring to the previous meeting we set up, but knowing this requires we separate (or disentangle) these two distinct conversations so we can successfully handle subsequent requests.

Social / Knowledge Graph and User Preferences

While this might not be obvious, when you engage in any multiparty conversation you are relying on a database of information that helps inform how you engage with each participant. That means any successful multiparty conversational AI system needs to be equally aware of this information. At a bare minimum, we need to know how each participant relates to each other and their preferences associated with the supported tasks. For example, if the CEO of the company is part of the conversation you may want to defer to their preferences when executing any task.

Executive Function

Perhaps most importantly, any task-oriented multiparty conversational AI system must have executive function capabilities. According to the field of neuropsychology, executive function is the set of cognitive capabilities humans use to plan, monitor, and execute goals.

Executive function is critically important in a multiparty conversation because we need to develop a plan for whether we immediately take action on any given request or if we must seek consensus first. Without these capabilities, an AI system will just blindly execute tasks. As described earlier in this post this is exactly how your Alexa behaves today. If you and your kids continuously scream out “play <song name x>” it will just keep changing songs without any attempt to build consensus and the interaction with the conversational AI system will become dysfunctional. Let’s look at one more dialog interaction.

As you can see in the example above our agent just didn’t automatically transact on a request to move the meeting to Wednesday. Instead, the agent used its executive function to do a few things:

  • Recognize that the second request was not the request originator
  • Preemptively pull back information about whether the proposal was viable
  • Seek consensus with the group before executing

Achieving this capability requires gathering previously collected data, developing a plan, and then executing against that plan. So for a task-oriented multiparty conversational AI system to operate correctly within a group, it must have executive function capabilities.
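Those three steps can be sketched as a simple decision procedure (hypothetical names throughout; a real system would back `is_viable` with calendar lookups rather than the stand-in below):

```python
# Hypothetical sketch of the executive-function gate described above: act
# immediately only when the requester originated the task; otherwise check
# viability and seek group consensus first. All names are illustrative.

from dataclasses import dataclass

@dataclass
class Request:
    speaker: str      # who made this request
    originator: str   # who created the task being modified
    proposal: str

def is_viable(proposal: str) -> bool:
    # Stand-in for a real check, e.g. querying calendars for conflicts.
    return proposal != "unavailable-slot"

def plan_action(req: Request) -> str:
    if req.speaker == req.originator:
        return "execute"                    # originator may act directly
    if not is_viable(req.proposal):
        return "reject-with-explanation"    # no point polling the group
    return "seek-consensus"                 # viable, but not theirs to decide

print(plan_action(Request("user3", "user1", "move-to-wednesday")))
```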

Generative Dialog Engine

Last but not least, any conversational AI system must be able to converse with your users. However, because the number of people in any given conversation and their identities are not predictable and our executive functions can cause a wide array of responses, no predefined or templated list will suffice for generating responses. A multiparty system will need to take all our previously generated information and generate responses on demand.

Wait, Don’t Large Language Models (LLMs) Solve This

With all the hype, you’d think LLMs could solve the problem of task-oriented multiparty conversational AI – and weave straw into gold. But it turns out that, at best, LLMs are just a piece in a much larger puzzle of AI technology needed to solve this problem.

There are basic problems, like the fact that LLMs are purely focused on text and can’t handle some of the speaker identification problems discussed earlier. But even more importantly, there is no evidence that LLMs have a native ability to understand the complexities of social interactions and to plan their responses and actions based on that information.

It will require a different set of technologies, perhaps leveraging LLMs in some instances, to fully build a task-oriented multiparty conversational AI system.

So When Can I Invite an AI to Join the Party?

While I can’t say anyone has solved all the challenges discussed in this post, I can say we are very close. My team at Xembly has developed what we believe is the first commercial product capable of participating in multiparty conversations as both a silent observer and an active participant. Our AI agent can join in-person meetings or converse with a group in Slack while also helping complete tasks that arise as a byproduct of these conversations.

We are only just beginning to tackle task-oriented multiparty conversational AI. So we may not be the life of the party, but go ahead and give Xembly and our Xena AI agent a try. The least you can do is send us an invite!

  1. With the Kinect Camera, we hoped to individually identify speakers in a room so each user could independently interact with the game. You can read more details about our work in this space here: 1, 2, 3, 4 ↩︎
  2. Are tasks in multiparty conversations just action items? Yes, since an action item is generally defined as a task that arises out of a group’s discussion. I’ll be writing a larger deep dive into action item detection in a future post. ↩︎

Generative AI – Prolific Copyright Infringer?

Poor man’s copyright: original music mailed to myself (Jason Flaks) via certified mail.

So, you might be wondering, “What makes this guy fit to pen an article about generative AI and copyright infringement?” I mean, I’m no copyright lawyer, nor do I moonlight as one on television. But I do bring a unique viewpoint to the table. After all, I’ve dabbled in the music industry and have been up to my elbows in machine-learning projects for a good chunk of my career. But perhaps my best qualification is my long-standing fascination with copyright law, which started when I was just a kid. The image up top isn’t some AI-generated piece from Midjourney; it’s my earnest attempt at copyrighting my original music over three decades ago using the poor man’s copyright approach.

So, What’s Copyright, and How Do I Get One?

Before we dive into the meaty debate of whether generative AI infringes on anyone’s copyright, let’s clarify what copyright means. According to the US Copyright Office, copyright is a “type of intellectual property that protects original works of authorship as soon as an author fixes the work in a tangible form of expression.” In simpler terms, copyright asserts your ownership of intangible creations of human intellect (e.g., music, art, writing, etc.). The moment you fix your creation in a physical form (e.g., an MP3 file, canvas, video recording, piece of paper, etc.), you have a copyright.

Amazingly, you don’t need to register for an official copyright to have one. So, why register at all? The Supreme Court has decided that to sue for copyright infringement, you must have registered your copyright with the US Copyright Office. However, the courts have also made clear that the registration date is separate from the date of creation. This means I could register a copyright for this blog post years from now and still sue for any prior infringement, as long as I can prove the date of creation.

How Can Generative AI Infringe on a Copyright?

There’s been a lot of talk about generative AI infringing on copyright protections, but many of these discussions oversimplify the issue. There are actually three different ways Generative AI can infringe on your copyright, some favoring artists/creators and some more favorable to the generative AI companies.

  • Theft (copying) of copyrighted material
  • Distribution of copyrighted material
  • Use of copyrighted material in derivative or transformative works

On Theft (Copying) of Copyrighted Material

Let’s get real; while there are numerous court cases establishing the legitimate right to duplicate copyrighted material under the fair use doctrine, the default assumption is and should be that it is illegal to do so.  Therefore, if we can determine that generative AI companies are using copies of content they did not pay for or get permission to use and using that content in a way that falls outside fair use, then we can assume they are stealing copyrighted material.

There is little to no debate that generative AI companies are using copyrighted material. After all, OpenAI basically admits on its website to lifting its training data from content “that is publicly available on the internet.” And as I discussed earlier, just about anything newly written on the internet has an inherent copyright. But beyond the possible scraping of my blog posts, there is substantial evidence that generative AI companies have ingested copyrighted books, images, and more.

And if you need more proof, look at the image below, where I attempted to elicit lyrics from Bob Dylan’s “Blowin’ in the Wind” from ChatGPT. ChatGPT both recognized that the lyrics I provided were from the song and was able to quote a portion of the lyrics I did not provide. It can only do that because it has seen the lyrics before in its training data set.

ChatGPT prompted to generate lyrics from Bob Dylan’s “Blowin’ in the Wind”

If there is no question that copyrighted material was used in the training process, then we only need to assess whether the copying should be considered fair use. There are multiple justifications for fair use in copyright law; some are easy to interpret, others more difficult. Uses like research or scholarship are reasonably easy to assess, and I can find no credible argument that generative AI companies are using copyrighted material in either capacity.

So, that leaves the last question in fair use: does the copying materially impact the monetization of the content? And I think the answer here is again quite simple: YES! The easiest example I can give is the artwork I regularly use in my blog posts. I’ve traditionally paid for that art via services like Dreamstime. If Midjourney or Stable Diffusion trained on this type of art and I subsequently generate my blog post art via their services, I may never pay for art via Dreamstime or similar services again. And in doing so, those artists have lost a way to monetize their art, and they are not compensated in kind by the generative AI companies.

On Distribution of Copyrighted Material

If you’re old like me, you may remember those FBI copyright warnings that regularly made an appearance on DVDs and VHS tapes.

The unauthorized reproduction or distribution of this copyrighted work is illegal …

The issue of whether these systems distribute the content in its original form with little transformation is a big one. This distribution can occur in two ways: to end customers and to data annotators.

To end customers

Generative AI models are basically next-word (or next-pixel, etc.) predictors. They aim to provide the most statistically likely next word based on a previous sequence of words. As a result, without special adaptations, these models will spit back exact copies of text, images, etc., especially for content from sparsely represented regions of the training data. As you can see from the image in the previous section, while OpenAI has been proactively trying to adapt the system not to distribute copyrighted material, I was still able to get it to do so with very little effort on my part.

So while these generative AI companies will continue to put mitigations in place to prevent the distribution of copyrighted content, there is little to no debate that they have been distributing it all along. And they are likely to continue doing so, as it is impossible to close every hole in the system.

To data annotators

OpenAI and others use reinforcement learning from human feedback (RLHF) to improve their models. RLHF requires that outputs from an original model be shown to human annotators to help build a reward model that leads to better outputs from the generative model. If those human annotators were shown copyrighted material in order to teach the model not to reproduce it in the future, OpenAI and other generative AI companies would clearly be distributing copyrighted content.

You might ask, “Shouldn’t copyright holders be happy that OpenAI is trying to train their models not to distribute copyrighted content?”  Well, maybe, but if I started traveling the country tomorrow, giving a for-profit seminar on how to detect illegal copies of the Super Bowl, and in these seminars, I played previous Super Bowl recordings to the attendees without the NFL’s permission … I think the NFL would have a problem with that.

On Use of Copyrighted Material in Derivative or Transformative Works

The question of whether output generated by generative AI models, when not a direct reproduction, counts as copyright infringement is a murky one. There are many examples where courts have determined that “style” is not copyrightable. There are further questions about whether output created by generative AI based on copyrighted material is derivative or transformative. Truth be told, it can likely be either, depending on how the model is prompted. So it’s actually quite difficult to say for sure whether the resulting output from generative AI models is fair use or copyright infringement.

We’re left then with questions about who is really violating copyright in any of these cases. Is it the model or the company that owns it? Or Is it the user who prompted the model to generate the content? And does any of it really matter unless that generated content is published?

The Road Ahead

It seems to me the issue of generative AI and copyright has been complicated more than necessary. Generative AI companies must find a way to pay for the content they use to train their models. If they distribute the content, they may need to find a way to pay royalties.  Otherwise, these generative AI companies are profiting off the works of creators without properly compensating them.  And that just isn’t fair.

For artists, don’t let the thought of generative AI copying your style without compensation scare you. These models can’t generate new content and are limited to what they’ve seen in their training set. So, keep making new art, keep pushing boundaries, and if we solve the first problem of content theft and distribution, you’ll continue to be paid for the amazing work you create.

Your Large Language Model – it’s as Dumb as a Rock

© Jason Flaks - initially generated by DALL-E and edited by Jason Flaks

Unless you’ve been living under a rock lately you likely think we’re entering some sort of AI-pocalypse. The sky is falling and the bots have come calling. There are endless reports of ChatGPT acing college-level exams, becoming self-aware, and even trying to break up people’s marriages! The way  OpenAI and their ChatGPT product have been depicted, it’s a miracle we haven’t all unplugged our devices and shattered our screens. It seems like a sensible way to stop the AI overlords from taking control of our lives.

But never fear! I am here to tell you that large language models (LLMs) and their various compatriots are as dumb as the rocks we all might be tempted to smash them with. Well, ok, they are smart in some ways. But don’t fret—these models are not conscious, sentient, or intelligent at all. Here’s why.

Some Like it Bot: What’s an LLM?

Large Language Models (LLMs) actually do something quite simple. They take a given sequence of words and predict the next likely word to follow. Do that recursively, and add in a little extra noise each time you make a prediction to ensure your results are non-deterministic, and voila! You have yourself a “generative AI” product like ChatGPT.

But what if we take the description of LLMs above and restate it a little more succinctly:

LLMs estimate an unknown word based on extending a known sequence of words.

It may sound fancy—revolutionary, even—but the truth is it’s actually old school. Like, really, really old school—it’s almost the exact definition of extrapolation, a common mathematical technique that has existed since the time of Archimedes! If you take a step back, Large Language Models are nothing more than fancy extrapolation algorithms. Last I checked, nobody thinks their standard polynomial extrapolation algorithm is conscious or intelligent. So why exactly do so many believe LLMs are?
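To make the analogy concrete, here is a toy sketch (all data invented for illustration) of recursive next-word prediction from counted statistics, with a little sampling noise. A real LLM replaces the bigram table with a neural network, but the generation loop has the same shape:

```python
# Toy "LLM": count which word follows which, then recursively sample the
# next word from those counts. Corpus and seed are purely illustrative.

import random
from collections import Counter, defaultdict

corpus = "the boat runs fast and the boat feels light and the crew rows".split()

# Build bigram counts: for each word, how often each next word follows it.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def extend(seed, n, rng):
    """Recursively extrapolate n words from the seed with noisy sampling."""
    words = [seed]
    for _ in range(n):
        options = bigrams.get(words[-1])
        if not options:
            break  # dead end: the last word never had a successor
        choices, weights = zip(*options.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return words

print(" ".join(extend("the", 5, random.Random(0))))
```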

Hear Ye, Hear Ye: What’s in an Audio Sample

Sometimes it’s easier to explain a complex topic by comparison. Let’s take a look at one of the most common human languages in existence—music.  Below are a few hundred samples from Bob Dylan’s “Like a Rolling Stone.” 

If I were to take those samples, feed them into an algorithm, and then recursively extrapolate out a few thousand samples, I’d have generated some additional audio content. But there is a lot more information encoded in that generated audio than just the few thousand raw samples used to create it.

At the lowest level:

  • Pitch
  • Intensity
  • Timbre

At a higher level:

  • Melody
  • Harmony
  • Rhythm

And at an even higher level:

  • Genre
  • Tempo

So by simply extrapolating samples of audio, we generated all sorts of complex higher-level features of auditory or musical information. But pump the brakes! Did I just create AI Mozart? I don’t think so. It’s more like AI Muzak.
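For the curious, the point can be sketched with nothing but the standard library (sample rate and frequency below are my own arbitrary choices): a pure tone satisfies the exact recursion x[n] = 2·cos(w)·x[n-1] - x[n-2], so a two-tap linear predictor seeded with just two samples regenerates the tone, and hence its pitch, for as long as you care to extrapolate.

```python
# A pure sinusoid obeys x[n] = 2*cos(w)*x[n-1] - x[n-2] exactly, so
# extrapolating samples with a two-tap linear predictor reproduces the
# tone -- and therefore its pitch -- indefinitely.

import math

sample_rate = 8000
freq = 440.0                          # A4, an arbitrary example pitch
w = 2 * math.pi * freq / sample_rate

# "Seed" samples, standing in for the recorded audio we start from.
x = [math.sin(w * n) for n in range(2)]

# Recursively extrapolate a few thousand more samples.
coeff = 2 * math.cos(w)
for n in range(2, 4000):
    x.append(coeff * x[-1] - x[-2])

# The extrapolated tail still matches the true 440 Hz tone closely.
err = max(abs(x[n] - math.sin(w * n)) for n in range(3900, 4000))
print(f"max error over last 100 samples: {err:.2e}")
```

Higher-level structure like melody or harmony would of course need a richer predictor than two taps, which is exactly the gap a learned model fills.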

An AI of Many Words: What’s Next? 

It turns out that predicting the next word in a sequence of words will also generate more than just a few lines of text. There’s a lot of information encoded in those lines,  including the structure of how humans speak and write, as well as general information and knowledge we’ve previously logged. Here’s just a small sample of things encoded in a sequence of words:

  • Vocabulary
  • Grammar/Part of Speech (PoS) tagging
  • Coreference resolution (pronoun dereferencing)
  • Named entity detection
  • Text categorization
  • Question and answering
  • Abstract summarization
  • Knowledge base

All of the information above can, in theory, be extracted by simply predicting the next word, much in the same way predicting the next musical sample gives us melody, harmony, rhythm, and more.   And just like our music extrapolation algorithm didn’t produce the next Mozart, ChatGPT isn’t going to create the next Shakespeare (or the next horror movie villain, for that matter).

LLMs: Lacking Little Minds? 

Large Language Models aren’t the harbinger of digital doom, but that doesn’t mean they don’t have some inherent value. As an early adopter of this technology, I know it has a place in this time. It’s integral to the work we do at Xembly, where I’m the co-founder and CTO. However, once you understand that LLMs are just glorified extrapolation algorithms, you gain a better understanding of the limitations of the technology and how best to use it. 

Five Alive: How to Use LLMs So They Don’t Take Over the World

LLMs have huge potential. Just like any other tool, though, in order to extrapolate the most value, you have to use them properly. Here are five areas to consider as you incorporate LLMs into your life and work. 

  • Information must be encoded in text
  • Extrapolation error with distance
  • Must be prompted
  • Limited short-term memory
  • Fixed in time with no long-term memory

Let’s dig a little deeper.

Information Must Be Encoded in Text

Yann LeCun probably said it best:

Humans are multi-modal input devices, and many of the things we observe that drive our behavior aren’t verbal (and hence aren’t encoded in text). An example we contend with at Xembly is predicting action items from a meeting. It turns out that the statement “I’ll update the row in the spreadsheet” may or may not be a commitment to future work. Language is nuanced, influenced by other real-time inputs like body language and hundreds of other human expressions. It’s entirely possible in this example that the task was completed in real time during the meeting, and the spoken words weren’t an indication of future work at all.

Extrapolation Error with Distance

Like all extrapolation algorithms, the further you get from your source signal (or prompt, in the case of LLMs), the more likely you are to see errors. A single prediction that negates an otherwise affirmative statement, or that assigns the wrong gendered pronoun, can cause downstream errors in future predictions. These tiny errors often lead to convincingly fluent responses that are factually inaccurate. In some cases, you may find LLMs return highly confident answers that are completely incorrect. These types of errors are referred to as hallucinations.

But both of these examples are really just forms of extrapolation error. The errors will be more pronounced when you make long predictions. This is especially true for content largely unseen by the underlying language model (for example, when trying to do long-form summarization of novel content).
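A classic polynomial example (illustrative, stdlib-only) shows the same effect: fit a polynomial to sin(x) on [0, 1], and the error stays tiny near the data but grows rapidly as you extrapolate farther out.

```python
# Interpolate sin(x) through five points on [0, 1], then evaluate the
# polynomial farther and farther past the data. Near the samples the fit
# is excellent; far away, the error explodes.

import math

xs = [0.0, 0.25, 0.5, 0.75, 1.0]     # "known" samples
ys = [math.sin(x) for x in xs]

def lagrange(x):
    """Evaluate the interpolating polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

for x in (1.0, 2.0, 4.0, 8.0):
    print(f"x={x}: error={abs(lagrange(x) - math.sin(x)):.3g}")
```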

Must Be Prompted

Simply put, if you don’t provide input text, an LLM will do nothing. So if you are expecting ChatGPT to act as a sage and give you unsolicited advice, you’ll be waiting a long time. Many of the features Xembly customers rave about are based on our product providing unsolicited guidance, and large language models are no help to us there.

Limited Short-Term Memory

LLMs generally operate on a limited window of text. In the case of ChatGPT, that window is roughly 3,000 words. This means that new information not already incorporated in the LLM’s training data can very quickly fall out of memory. That’s especially problematic for long conversations, where new corporate lingo may be introduced at the start of a conversation and never mentioned again. Once that buzzword falls out of the context window, it no longer contributes to any future prediction, which is a problem when trying to summarize the conversation.
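Here is a toy sketch of that failure mode (the window is measured in words for simplicity and the 12-word size is purely illustrative; real models count tokens):

```python
# Once a term introduced early in a conversation scrolls out of a
# fixed-size context window, the model no longer "sees" it.

WINDOW = 12  # tiny illustrative window; ChatGPT-era models held ~3,000 words

conversation = (
    "Project Zephyr is our codename for the migration . "
    "The team met today . We discussed timelines . We assigned owners . "
    "Summarize the plan for it ."
).split()

def visible_context(words, window=WINDOW):
    """Return only the most recent `window` words, like a rolling context."""
    return words[-window:]

ctx = visible_context(conversation)
print("Zephyr" in ctx)   # the codename has fallen out of the window: False
```

The final request (“Summarize the plan for it”) depends on a codename the model can no longer see, which is exactly the summarization problem described above.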

Fixed in Time with no Long-term Memory

Every conversation you have with ChatGPT exists only for that session. Once you close the browser or exit the conversation, there is no memory of what was said. That means you cannot depend on new words being understood in future conversations unless you reintroduce them within a new context window. If you introduce an LLM to a term it hasn’t seen before, it may use that term correctly in subsequent responses within the same session. But if you start a new session hoping the term will be understood without reintroducing it in a prompt, brace yourself—you will be disappointed.

To Use an LLM or Not to Use an LLM

It’s a big question. LLMs are exceedingly powerful, and you should strongly consider using them as part of your NLP stack. I’ve found the greatest value of many of these LLMs is that they potentially replace all the bespoke language models folks have been making for some time.  You may not need these custom entity modes, intent models, abstract summarization models, etc. It’s quite possible that LLMs can accomplish all of these things at similar or better accuracy, while possibly greatly reducing time to market for products that rely on this type of technology.  

There are many items in the LLM plus column, but if you are hoping to have a thought-provoking intelligent conversation with ChatGPT,  I suggest you walk outside and consult your nearest rock. You just might have a more engaging conversation!

The Annotator’s Dilemma: When Humans Teach Machines to Fail

What does a machine learning model trained via supervised learning and a lion raised in captivity have in common? … They’re both likely to die in the wild!

Now that might sound like a joke aimed at getting PETA to boycott my blog, but this is no laughing matter. Captive-bred lions are more likely to die in the wild, and so are machine learning models trained with human-annotated data.

According to a 2008 National Geographic article, captive-bred predators often die in the wild because they never learn the natural behaviors necessary for success. Their human captors either never teach them the necessary survival skills (e.g., hunting) or inadvertently teach behaviors that are detrimental to survival (e.g., no fear of humans). It’s not for lack of trying, but for a variety of reasons it is impractical or impossible to expose a captive predator to an environment that completely mirrors its ultimate home.

Why Humans Teach Machines to Fail

Not surprisingly, when humans teach machines, the machines fail for many of the same reasons. In both supervised and semi-supervised machine learning, we train a model using human-annotated data. Unfortunately, human cognitive and sensory quirks can lead us to teach machines the wrong thing, or to fail to fully expose them to the environment they will find in the wild. While many factors affect the quality of human annotations, I’d like to cover five that I believe have the greatest impact on success.

Missing Fundamental and the Transposed Letter Effect (Priming)

Have you ever had someone explain an interesting factoid that sticks with you for life? One such example in my life is the “missing fundamental” effect, first introduced to me by my freshman-year music theory professor. The question he posed was, “How are you able to hear the low A on a piano when the vast majority of audio equipment cannot reproduce the corresponding fundamental frequency?” It turns out the low A on a piano has a fundamental frequency of 27.5 Hz, and most run-of-the-mill consumer audio equipment is incapable of reproducing a frequency that low with any measurable gain. Yet we can hear the low A on a piano recording even with those crappy speakers. The reason is the missing fundamental effect: the human brain can infer the missing fundamental frequency from its upper harmonics.

A similar concept is the “transposed letter” effect. I’m sure you’ve seen the meme: images with scrambled letters that tell you you’re a genius if you can read them. Your ability to read those sentences is due to the transposed letter effect and is related to priming. Basically, even if the letters in a printed word are jumbled, reading it can still activate the same region of the brain as the original word.

You might be asking what any of this has to do with annotating data and teaching machines. The problem arises when you realize we humans can correctly identify something even when all the data needed to do so is not actually present. If we annotate data this way, we assume the machine has the same capabilities, and that may not be so. Teaching a machine that “can you raed this” and “can you read this” are the same thing may have unintended consequences.

Selection Bias

If you gave me millions of images and asked me to find and label all the images with tomatoes, I am probably going to quickly scan for anything red and circular. Why? Because that’s my initial vision of a tomato and scanning the images that way would likely speed up the process of going through them all. But that is me imparting my bias of what a tomato looks like into the annotated data set. And it’s exactly how a machine never learns that the Green Zebra and Cherokee Purple varieties are indeed tomatoes!

Multisensory Integration

Humans often make use of multiple senses simultaneously to improve our perception of the world. For example, it has been well documented that speech perception, especially in noisy environments, is greatly enhanced when the listener can leverage both visual and auditory cues. However, the vast majority of commercial machine learning models are single-modality (reading text, scanning images, or scanning audio). So, if my ability to understand a noisy speech signal is only possible because of a corresponding video, it may be dangerous to teach a machine what was said, since the machine likely does not have access to the same raw data.

Response Bias

I am ashamed to admit this but every time I get an election ballot, I feel an almost compulsive need to select a candidate for every position. Even when I have little or no knowledge of the office the candidates are running for, the background of the competing candidates, and their policy positions. Usually, I arbitrarily select a candidate based on their party affiliation or what college they went to, which is probably only slightly better than selecting the first name on the ballot. My need to select a candidate even though I have no basis for doing so is likely a form of response bias. The problem with response bias is it generally leads to inaccurate data. If your annotators suffer from response bias, you are likely teaching the machine with inaccurate data.

Zoning Out

Have you ever driven somewhere only to arrive at your destination with no recollection of how you got there? If so, like me, you have experienced zoning out. With repetitive tasks, we tend to start with an implicit speed-versus-accuracy tradeoff, but over time, as the task gets boring, we zone out or get distracted while maintaining the same speed, which ultimately leads to errors. Annotating data is a highly repetitive task and therefore has a high probability of generating these types of errors. And when we use error-ridden annotated data to teach our machines, we likely teach them the wrong thing.

How to be a Better Teacher

While the problems above might seem daunting there are things we can do to help minimize the effects of human behavior on our ability to accurately teach machines.

Provide a Common Context

The missing fundamental and multisensory integration problems are both issues with context. In each of these cases either historical or current context allows us humans to discern something another species (a.k.a. the machine) may not be able to comprehend. The solution to this problem is to make sure humans teach the machine with a shared context. The easiest way to fix this problem is to limit the annotator to the same modality the machine will operate with. If the task at hand is to teach a machine to recognize speech from audio, then don’t provide the annotator access to any associated video content. If the task is to identify sarcasm in written text don’t provide the annotator with audio recordings of the text being spoken. This will ensure the annotator teaches the machine with mutually accessible data.

Beyond tooling, you can also train your annotators to interpret data from multiple perspectives, so their prior experiences don’t trigger brain activations the machine can’t benefit from. For example, it is very easy to read text in your head with your own internal inflections, which might change the meaning; the slightest change in inflection can turn a benign comment into a sarcastic insult. If you train annotators to step back and read the text with multiple inflections, you can avoid this problem.

Introduce Randomness

While it might be tempting to let annotators search for items they think will help teach the machine, doing so can increase the likelihood of selection bias. There may be good reasons to allow searching to speed up data collection for certain classes, but it is also important to ensure a sizeable portion of your data is randomly selected. Set up different jobs for your annotators and make sure some proportion of your labeling effort comes from randomly selected examples.
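One way to operationalize this (a sketch only; the 30% split and all names are my own invention, not a prescription):

```python
# Reserve a fixed fraction of each labeling batch for uniformly random
# examples, so searched/curated items can't dominate the dataset.

import random

def build_annotation_batch(curated, pool, batch_size, random_fraction=0.3,
                           seed=0):
    """Mix curated examples with uniformly sampled ones from the full pool."""
    rng = random.Random(seed)
    n_random = int(batch_size * random_fraction)
    n_curated = batch_size - n_random
    batch = curated[:n_curated] + rng.sample(pool, n_random)
    rng.shuffle(batch)  # so annotators can't tell which items were searched
    return batch

pool = [f"img_{i}" for i in range(1000)]
curated = [f"img_{i}" for i in range(0, 1000, 100)]  # "red round" search hits
batch = build_annotation_batch(curated, pool, batch_size=10)
print(len(batch))
```

The random slice is what gives the Green Zebra and Cherokee Purple tomatoes a chance of ever appearing in the labeled set.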

Reduce Cognitive Load

While we may not be able to prevent boredom and zoning out, we can reduce complexity in our labeling tools. By reducing the cognitive load we are more likely to minimize mistakes when people get distracted. Some ways to reduce cognitive load include limiting labeling tasks to single step processes (i.e., only label one thing at a time) and providing clear and concise instructions that remove ambiguity.

Be Unsure

Last but not least, allow people to be unsure. If you force people to put things into one of N buckets, they will. By giving people the option of “unsure,” you minimize how often you get inaccurate data from people’s compulsion to provide an answer even when no correct answer is obvious.

Final Thoughts

No teacher wants to see their students fail. So it’s important to remember, whether training a lion or a machine learning model, that different species learn in different ways. If we cater our teaching to our students, we just might find our machine learning models fat and happy long after we’ve sent them off into the wild.

*I’d like to thank Shane Walker and Phil Lipari who inspired this post and have helped me successfully teach many machine learning models to survive in the wild.

Teachers Keep on Teaching – ‘til I Reach my Highest Ground

Forgive me Stevie Wonder for slightly reordering your lyrics, but I think you’d agree that it’s hard to reach your highest ground without the help of teachers. Being a teacher is often a thankless job. Nobody gets rich or famous for being a teacher, yet the contribution teachers make to society is invaluable. So, in honor of Teacher Appreciation Week, I’d like to take the opportunity to thank some of the teachers who helped me reach my highest ground.

The Teachers Who Made Me What I Am Today

Erik Lawrence

My saxophone teacher from the age of 9 until I was 18, Erik is largely responsible for my love of music. Seeing that my career has centered around music and technology, Erik can take much of the credit for planting the seed and nurturing the music branch of that tree. When I began to take an interest in the piano and composing, Erik was quick to urge my parents to get me piano lessons. And when my parents wanted to buy me a new saxophone as a graduation present, it was Erik who took me to the best New York area music stores to find the perfect sax. Beyond music, Erik was an extremely positive role model throughout my formative years. He taught me to treat others with respect, be accountable for my mistakes, and value my time and the time of others. For all of the above and much more, I am grateful for the impact Erik had on my life.

David Snider

By the age of 13, I had taught myself to play piano, I was beginning to compose my own music, and I was slowly collecting a variety of audio electronics. I owned a Yamaha SY55 synthesizer with an onboard sequencer, a Tascam four-track recorder, and boxes of random audio cables. Recognizing my newfound passions, Erik Lawrence convinced my parents to get me piano lessons and introduced me to David Snider. If Erik was my music guru, then David was my technology guru. Upon realizing my knack for creating music and tinkering with anything music-electronics-related, David managed to convince my parents to buy me my first computer (an Apple Mac) and my first audio software package (Mark of the Unicorn’s MOTU Performer). He took my meandering teenage hobbies and turned them into a focused passion that would ultimately drive a large part of my career. David brought much more than technology to my life. He taught me how to play jazz piano, something I still do to this day. In fact, 30+ years later I can still play the song Misty exactly the way he taught it to me. While David may not remember this, he called me in the early days of my freshman year of music school to wish me luck and give me advice on avoiding some of the pitfalls of a musician’s life. It seemed inconsequential at the time, but the fact that he cared enough to do that is remarkable.

Eileen M Curley

In the first quarter of my freshman year of high school, I had a failing grade in math. This was not acceptable in the Flaks household, so my mother reached out to the teacher to see if she had any advice on how I could improve my grade. Ms. Curley selflessly gave her own free time to provide me with extra help. It quickly became apparent that my problem with math was unrelated to my aptitude and purely a function of not paying attention and not doing the work. In no time at all, I went from an F to an A. That year I scored a 99 on the New York State standardized math exam (the Regents), losing only 1 point for carelessly not carrying a negative sign down to my final answer on one question. When Ms. Curley received the results from that exam, she took time out of her day to call my house directly and excitedly tell my mother how well I did. My time with Ms. Curley was a turning point in my life. Little did she know that I would ultimately go on to be a math major in college, leading to a career in math and computer science.

William (Bill) Garbinsky (a.k.a. Mr. G)

William Garbinsky was a musician first and a high school music teacher second. He loved music, and he gave innumerable hours during and after school to help students like me become better musicians. He taught the concert band, wind ensemble, marching band, and jazz band, and I was a member of all of them. He gave band nerds like me a place to call home and surrounded us with a like-minded peer group that made us all feel like we were part of something bigger. Mr. G even took time out of his day to teach AP Music History and Music Theory classes to the small cohort of students who were interested. Thanks to those Advanced Placement college credits, I had some free time on my schedule when I entered college, which I promptly filled with math classes. Sadly, Mr. G passed away some years ago, but I hope he knows what a difference he made in my life and the lives of countless others.

James (Jim) McElwaine

I was lucky enough to be accepted into the Conservatory of Music at Purchase College. I was even more fortunate to study under James McElwaine. Professor McElwaine was a physics student before going full bore into music. So, when he stumbled upon a kid in his music program who was taking calculus classes as electives, he embraced it and pushed me to pursue it further. Beyond encouraging me to explore the math program, Jim recognized my passion for everything audio-electronics-related, and he opened every door he could, including getting me jobs running live sound for campus events and running the conservatory’s recording studios; he even got me my first real paid gig as a recording engineer. Professor McElwaine’s willingness to embrace and encourage my odd trajectory through music school played a huge role in my ability to progress into a master’s program that ultimately allowed me to go from using pro-audio equipment to building it.

Martin (Marty) Lewinter

If my music professor was a physicist, then surely I needed a math professor who was also a musician. Lucky for me, the head of the math program, Martin Lewinter, also happened to be a seasoned musician. Professor Lewinter taught that very first calculus class I took as an elective. After witnessing my interest in math, Marty encouraged me to take on a second degree. Before long I was pursuing two simultaneous bachelor’s degrees with a focus in music composition and math/computer science. Professor Lewinter gave hours of his time toward helping me as the math curriculum got progressively harder, and he continued to push me to excel in both the math and the music programs. When I started applying to graduate schools with a heavier engineering focus, I picked up some textbooks to review independently. After struggling over some of the math equations, I asked Professor Lewinter for help. I still remember our conversation, where I showed him an equation in a book and he had to explain to me that engineers use j for imaginary numbers, not i, so as not to be confused with the variable for current. It was a simple thing that just might have prevented my first year in graduate school from turning into a complete disaster!

Ken Pohlmann

In my junior and senior years of college, I started to dive deeper into the underlying math behind the audio tools I was using. I happened to be reading a book called Principles of Digital Audio and found a note about the author, who was a professor of “music engineering” at the University of Miami. Music engineering sounded like an awfully good way to combine four grueling years of math and music education, so I sent Professor Pohlmann an email asking if he’d consider accepting a student without an undergraduate electrical engineering degree. Ken was kind enough to respond; he recommended I take an extra year to get some basic engineering credits, and he pointed me toward some textbooks that might give me an early head start. Well, I did buy the books, but I otherwise ignored him and applied to the program anyway. I still remember being overjoyed at receiving an acceptance letter in which Ken told me that he thought my math background would carry me through the curriculum. With Professor Pohlmann and the University of Miami Music Engineering program, I stumbled into a small world of like-minded folks who had a passion for math and music. Professor Pohlmann took a hodgepodge of academic pursuits I had haphazardly pieced together and combined them into one coherent subject that would ultimately lead to my eventual career as an engineer, manager, and executive on countless audio projects.

Will Pirkle

How many teachers have fed you information that you can directly correlate to your current and future earnings? Not many, but that is exactly what Will Pirkle did for me and many others. Professor Pirkle was able to perfectly blend theory and practice and teach me how to effectively turn everything I had learned into real software that did amazing things with an audio signal. Will took all the ethereal subject matter I had learned over the years and made it into something I could feel and touch. It’s that skill set, along with my own willingness to pester anybody for something I want, that led me to my first full-time job with a music software company called Opcode (ironically, a competitor of MOTU), bringing me full circle back to some of my earlier education. Will’s teaching has stood the test of time, and I still find a use for some of what he taught. And whenever anybody asks for advice about the audio/music engineering space, I regurgitate much of the knowledge Professor Pirkle imparted to me. Without a doubt, I can say that my employability and financial well-being are directly tied to everything I learned from Professor Pirkle.

To All the Teachers

While the eight teachers above had the most profound effect on my life, there are many other teachers who contributed to my success, and I’d like to offer my thanks to all of them. And to all the teachers out there who feel unappreciated, please remember that somewhere out there, in that sea of children, is a kid who just needs a little extra push to find out who they are and be the best version of themselves. Keep fighting for those kids, because I am living proof of the impact you can have.

One Final Note of Gratitude

Since Mother’s Day is fast approaching, I’d be remiss if I didn’t thank the greatest teacher of them all: my mother, Susan Flaks. My mother was there for every step of the journey described in this post. Whether it was teaching me my first notes on the piano, driving me to private music lessons, paying for that first computer, pushing me to get extra help when I needed it, paying for college, or just supporting me through my entire education, she was the root of all my academic and professional success. My mother was more than just an amazing parent; she was also a teacher for more decades than she would care for me to publicly comment on, and I know she had a positive influence on numerous students who, like me, went on to be happy, healthy, and well-rounded adults who have made a positive contribution to their communities and the world.

Voting is Just a Precision and Recall Optimization Problem

It’s hard to avoid the constant bickering about the results of our last election. Should mail-in voting be legal? Do we need stricter voter identification laws? Was there fraud in the last election? Did it impact the results? These are just a fraction of the questions circulating around elections and voter integrity these days. Sadly, these questions have become highly politicized, and it’s unclear if anybody is really interested in asking what an optimal election system looks like.

In a truly fair and accurate representative democracy, a vote not counted is just as costly as one inaccurately counted. More concretely, a single mother with no childcare who doesn’t vote because of 4-hour lines is just as damaging to the system as a vote for a Republican candidate that is intentionally or accidentally recorded for the opposing Democratic candidate.

Therefore, we can conclude that an optimal election system involves optimizing on both axes: how do we make sure everyone who wants to vote gets to vote, and how do we ensure every vote is counted accurately? When viewed this way, one can’t help but see the parallels to optimizing a machine learning classifier for precision (when we count votes for a given candidate, how often did we get it right?) and recall (of all possible votes for that candidate, how many did we find?).

Back the Truck Up! What Are Precision and Recall Anyway?

Precision and recall are two metrics often used to measure the accuracy of a classifier. You might ask, “why not just measure accuracy?” and that would be a valid question. Accuracy, defined as everything we classified correctly divided by everything we evaluated, suffers from what is commonly known as the imbalanced class problem.

Suppose we have a classifier (a.k.a. laws and regulations) that can take a known set of voters who intend to vote “democrat” or “not democrat” (actual/input) and then outputs the recorded vote (predicted/output).

Let’s assume we evaluate 100 intended voters/votes, 97 of which intend not to vote for the democratic candidate, and let’s build the dumbest classifier ever known: we are just going to count every vote as “not democrat,” regardless of whether the ballot was marked for the democratic candidate or not.

N (number of votes) = 100             Output (Predicted) Value
                                      Democrat    Not a Democrat
Input (Actual)   Democrat             TP = 0      FN = 3            Total Democrats = 3
Value            Not a Democrat       FP = 0      TN = 97           Total Not Democrats = 97

The table above, which compares inputs (actual) to outputs (predicted), is known as a confusion matrix. To simplify some of our future calculations, we can further define the cells of the table:

  • True Positives (TP): Correctly captured an intended vote for the democrats as a vote for the democrats (0)
  • True Negatives (TN): Correctly captured a vote NOT intended for the democrats as a vote not for the democrats (97)
  • False Positives (FP): Incorrectly captured a vote NOT intended for the democrats as a vote for the democrats (0)
  • False Negatives (FN): Incorrectly captured an intended vote for the democrats as a vote not for the democrats (3)

Now we can slightly relabel our accuracy equation and calculate the accuracy of our naïve classifier using the values from the table above:

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (0 + 97) / 100 = 97%

97% accuracy! We just created the world’s stupidest classifier and achieved 97% accuracy! And therein lies the rub. The second I expose this classifier to the real world with a more balanced set of inputs across classes, we will quickly see our accuracy plummet. Hence, we need a better set of metrics. Ladies and gentlemen, I am delighted to introduce …

  • Precision: Of the votes recorded (predicted) for the Democrats, how many were correct: Precision = TP / (TP + FP) = 0 / (0 + 0), which is undefined here and conventionally treated as 0

  • Recall: Of all possible votes for the Democrats, how many did we find: Recall = TP / (TP + FN) = 0 / (0 + 3) = 0

What becomes blatantly clear from evaluating these two metrics is that our classifier, which appeared to have great accuracy, is terrible: none of the intended votes for the democrats were correctly captured, and of all possible intended votes for the democrats, we found none. It’s worth noting that the example I’ve presented here is for a binary classifier (democrat, not democrat), but these metrics can easily be adapted to multi-class systems that more accurately reflect our actual candidate choices in the United States.
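To make all of this concrete, here is a minimal Python sketch of the example above. The names and structure are my own (not from any particular library), and it uses the common convention of treating precision and recall as 0 when their denominators are 0:

```python
from collections import Counter

# 100 intended votes: 3 for the democratic candidate, 97 not
intended = ["democrat"] * 3 + ["not democrat"] * 97

# The world's dumbest classifier: record every ballot as "not democrat"
def naive_classifier(ballot):
    return "not democrat"

recorded = [naive_classifier(vote) for vote in intended]

# Tally the four confusion matrix cells
cells = Counter()
for actual, predicted in zip(intended, recorded):
    if actual == "democrat":
        cells["TP" if predicted == "democrat" else "FN"] += 1
    else:
        cells["FP" if predicted == "democrat" else "TN"] += 1

tp, tn, fp, fn = cells["TP"], cells["TN"], cells["FP"], cells["FN"]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) if (tp + fp) else 0.0  # convention: 0 when undefined
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(accuracy)   # 0.97 — looks great!
print(precision)  # 0.0 — no recorded democratic vote was correct
print(recall)     # 0.0 — we found none of the 3 intended democratic votes
```

Despite the impressive-looking 97% accuracy, precision and recall immediately expose the classifier as useless.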

There’s No Such Thing as 100% Precision and Recall

Gödel’s incompleteness theorem, which loosely states that every non-trivial formal system is either incomplete or inconsistent, likely applies to machine learning and artificial intelligence systems. In other words, since machine learning algorithms are built around our known formal mathematical systems, there will be some truths they can never describe. A consequence of that belief, and something I preach to everyone I work with, is that there is really no such thing as 100% precision and recall. No matter how great your model is or what your test metrics tell you, there will always be edge cases.

So if 100% precision and recall is all but impossible, what do we do? When developing products around machine learning classifiers, we often ask ourselves what is most important to the customer: precision, recall, or both. For example, if I create a facial recognition system that notifies the police if you are a wanted criminal, we probably want to err on the side of precision, because arresting innocent individuals would be intolerable. But in other cases, like flagging inappropriate images on a social network for human review, we might want to err on the side of recall, so we capture most if not all such images and allow humans to further refine the set.

It turns out that very often precision and recall can be traded off. Most classifiers emit a confidence score of sorts (for example, a softmax output), and by just varying the threshold on that output we can trade off precision for recall and vice versa. Another way to think about this is: if I require my classifier to be very confident in its output before I accept the result, I can tip the results in favor of precision. If I loosen my confidence threshold, I can tip them back in favor of recall.
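Here’s a toy sketch of that trade-off in Python. The confidence scores and labels below are invented purely for illustration; the point is only that raising the threshold favors precision while lowering it favors recall:

```python
# (classifier confidence that the vote is "democrat", actual label)
# These scores are made up for illustration only.
scored_votes = [
    (0.95, "democrat"),
    (0.90, "democrat"),
    (0.80, "not democrat"),
    (0.60, "democrat"),
    (0.40, "not democrat"),
    (0.30, "democrat"),
]

def precision_recall_at(threshold, scored):
    # Predict "democrat" only when confidence meets the threshold
    tp = sum(1 for s, y in scored if s >= threshold and y == "democrat")
    fp = sum(1 for s, y in scored if s >= threshold and y == "not democrat")
    fn = sum(1 for s, y in scored if s < threshold and y == "democrat")
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r

# Strict threshold: demand high confidence -> higher precision, lower recall
print(precision_recall_at(0.85, scored_votes))  # (1.0, 0.5)

# Loose threshold: accept low-confidence calls -> lower precision, higher recall
print(precision_recall_at(0.25, scored_votes))  # ≈ (0.67, 1.0)
```

Sweeping the threshold from high to low traces out the full precision/recall curve for the classifier.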

And how might this apply in voting? Well, if I structure my laws and regulations such that every voter must vote in person with 6 forms of ID and the vote is tallied in front of the voter by a 10-person bipartisan evaluation team who must all agree … we will likely have very high precision. After all, we’ve greatly increased the confidence in the vote outcome. But at what expense? We will also likely slow down the voting process and create massive lines which will significantly increase the number of people who might have intended to vote but don’t actually do so, hence decreasing recall.

Remind me again what the hell this has to do with Voting

The conservative-leaning Heritage Foundation makes the following statement on their website:

“It is incumbent upon state governments to safeguard the electoral process and ensure that every voter’s right to cast a ballot is protected.”

I strongly subscribe to that statement, and I believe it is critical to the success of any representative democracy. But ensuring that every voter’s right to cast a ballot is protected requires not only that we accurately record the captured votes, but also that every voter who intends to vote is unhindered in doing so.

Maybe we need to move entirely to in-person voting while simultaneously allocating sufficient funds for more polling stations, government-mandated paid time off, and government-provided childcare. Or maybe we need all mail-in ballots but some new process or technology to ensure the accuracy of the votes. Ultimately, I don’t pretend to know the right answer, or if we even have a problem to begin with. What I do know is that if we wish to improve our election systems, we must first start with data on where we stand today and then tweak our laws and regulations to simultaneously optimize for precision and recall.

So, the next time a politician proposes changes to our election system, ask … no, demand that they provide data on the current system and how their proposed changes will impact precision and recall. Because only when we optimize for both of these metrics can we stop worrying about making America great again and start working on making America even greater!

“If you start me up I’ll never stop …” Until We Successfully Exit

“Hey, our fledgling startup is on a path to being the next *INSERT BIG TECH COMPANY NAME HERE* and we think you’re a great fit for our CTO role.” Find me a technical leader who hasn’t been enticed by those words and you’ll have found a liar. So, what happens when one succumbs to the temptation and joins an early-stage startup? Well, if you’ve been wondering where I’ve been for the past couple of years, I was fighting the good fight at a small, early-stage NLP/machine learning risk intelligence startup. And while I’m not retired or sailing around the world on my new 500-foot yacht, we were able to successfully exit the company with a net positive outcome for all involved. My hope with this post is that I can share some of my acquired wisdom and perhaps steer the next willing victim down a similar path of success.

If I could sum up my key learnings in a few bullet points, it would boil down to this:

  • If you don’t believe … don’t join
  • Be prepared to contribute in any way possible
  • Find the product and focus on building it
  • Pick the race you have enough fuel for and win it

What I’d like to do in the rest of this post is break down each one of these items a little further.

If you don’t believe … don’t join

Maybe this goes without saying, but if you don’t believe in the vision, the people, and the product, you shouldn’t join the startup approaching you. The CTO title is alluring, and it is easy to fool yourself into taking a job for the wrong reasons. But the startup experience is an emotional slog of ups and downs, and it will be nearly impossible to weather the ride if you don’t wake up every day with an unyielding conviction for what you’re doing. As I’ll explain later in this post, you don’t need to believe you’re working for the next Facebook, but you do need to believe you are building a compelling product that has real value for you, your coworkers, your investors, and your customers.

Be prepared to contribute in any way possible

For the first few months on the job, I went into our tiny office and emptied all the trash bins because, if I didn’t, that small office with 5 engineers started to smell! It didn’t take long for someone to call out that I was appropriately titled CTO (a.k.a. Chief Trash Officer). You might be asking why anybody would take a CTO job only to wind up being the corporate custodian, but that is what was needed on some days.

While I have steadfastly maintained my technical chops throughout my career, I hadn’t really written a lick of production code for nearly two decades prior to this job. But with limited resources, it became clear I also needed to contribute to the code base and so I dusted off those deeply buried skills and contributed where I could. When you join a startup with that CTO title, it is easy to convince yourself that you’ll build a huge team, be swimming in resources, and have an opportunity to direct the band versus playing in it. But you’ll quickly find that in the early stages of a startup, the success of the company will depend on your willingness to drop your ego and contribute wherever you can.

Find the product and focus on building it

Great salespeople can sell you the Brooklyn Bridge. And if you’re just lucky enough, you might have a George C. Parker in your ranks. But the problem with great salespeople is that they will do almost anything to close the sale, and that comes with a real risk that they’ll sell custom work. If that happens over an extended period of time, you will be unable to focus on the core product offering, and you’ll quickly find you’re the CTO of a work-for-hire / consulting company.

Startups face real financial pressures that often drive counterproductive behaviors. That often means doing anything necessary to drive growth in revenue, customers, or usage. But high product variance will often ultimately lead to stagnant growth.

That’s because with every new feature comes a perpetual support cost. And if you keep building one-off features, and can’t fundraise fast enough, that cost will eventually come at the expense of delivering your true market-wide value proposition. If you allow this to happen, you’ll wind up with a company that generates some amount of revenue or usage but has no real value.

Companies that find true product/market fit should see product variance gradually decrease over time, and this should allow the company to grow. Your growth trajectory may be linear when you need it to be exponential, but no per-customer feature work will fix that problem, and you may need to consider pivoting. If pivoting isn’t an option, it may be time to look for an exit.

As the CTO, a critical part of your job is to help the company find its product/market fit and then relentlessly focus on it. You need to hold the line against distractions and ensure the vast majority of resources are spent on features that align with the core value proposition. If you’ve truly found a product offering that is valued by a given market segment, and you can keep your resources focused on building it, growth will follow.

Pick the race you have enough fuel for and win it

I am an avid runner, and one of the great lessons of long-distance running is that if you deplete your glycogen stores, you’ll be unable to finish the race no matter how hard you trained. In other words, you can’t win the race if you have insufficient fuel. This is also very true of startups. If you’re SpaceX or Magic Leap, you’re running an ultra-marathon, and you need a tremendous amount of capital in order to have sufficient time and resources to realize the value. But fundraising is hard, and even if you have an amazing product and top-notch talent, there can be significant barriers to acquiring sufficient capital.

The mistake some startups make is that they continue to run an ultra-marathon when they only have fuel for a 5k and that can lead to a premature or unnecessary failure. If funding becomes an issue, start looking for how your product might offer value to another firm. Start allocating resources towards making the product attractive for an acquisition. Aim to win a smaller race and seek more fuel on the next go around.

Final Thoughts

Taking on a CTO role at an early stage startup can be a great opportunity and lead to enormous success, but before you take the leap make sure you know what you’re getting into. Along the way don’t forget to stop and smell the roses. In the words of fellow Seattle native Macklemore, “Someday soon, your whole life’s gonna change. You’ll miss the magic of these good old days”.

Final Final Thoughts

No startup CTO is successful without support from an army of people. So I’d like to offer some gratitude to the following folks:

  • Greg Adams, Chris Hurst: Thanks for giving me an opportunity and treating me like a cofounder from day one.
  • Shane Walker, Cody Jones, Phil LiPari, Pavel Khlustikov, David Ulrich, Julie Bauer, Jason Scott, Carrie Birmingham, Rich Gridlestone, Bill Rick, Zach Pryde, Amy Well, David Debusk, Mikhail Zaydman, Jean-Roux Bezuidenhout, Sergey Kurilkn (and others I may have forgotten): Thanks for being one of the greatest teams I’ve ever worked with.
  • Brandon Shelton, Linda Fingerle, Wayne Boulais, Armando Pauker, Matt Abrams, Matthew Mills: Thank you for being outstanding board members, mentors, and investors.
  • Ziad Ismail, Pete Christothoulou, Kirby Winfield: Thank you for the career advice during my first venture into the startup world.

*Note: You can read more about Stabilitas, OnSolve, and our acquisition at the links below: