When Buttons Triumph Over Words: Find Out Why

Picture it, Redmond, 2008. A young, naive program manager at Microsoft builds a prototype of one of the world’s first far-field, open-microphone, conversational AI assistants. Excited by his invention, he invites his engineering manager to witness the masterpiece in action. He confidently says, “Hey Xbox, channel up,” and the Xbox responds, “Got it. I’ve changed the channel.” The program manager, brimming with pride, looks to his engineering manager for applause. The engineering manager, unimpressed, replies, “That’s stupid. You’re holding the remote—why don’t you just press the channel-up button?”

That program manager was me, and that engineering manager… well, he knows who he is.

If you get the reference, great. If not, do yourself a favor and binge-watch The Golden Girls. But I’m not here to sell you on 1980s sitcoms. The point of this mostly true story is to illustrate a realization I had long ago: not every product or feature is improved by a conversational AI experience (whether it’s a speech or text interface).

With the advent of Large Language Models (LLMs), there’s been a rush to add copilots and conversational AI solutions to almost every product and feature, often without considering whether these features actually improve the product. In this post, I’ll explore why conversational AI isn’t always the best experience and suggest ways to evaluate when and where you should leverage this technology.

Clicking Fast, Speaking Slow

Humans are far faster at clicking buttons than at speaking or typing. In raw numbers, the average person types about 40 words per minute, speaks about 150 words per minute, and can click a mouse button about 384 times per minute.

Measure	Average rate
Typed words per minute	40
Spoken words per minute	150
Clicks per minute	384

Speaking and typing rates versus button clicks

This means we can perform far more button clicks in the same time it takes to speak or type a given set of words. For example, if a task requires 10 words, I can click a mouse button roughly 25 times in the 4 seconds it takes to speak those words, and 96 times in the 15 seconds it takes to type them.

Below is a chart that shows how many button clicks are possible in the same time it takes to speak or type a given set of words.
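
The numbers behind a chart like this can be reproduced from the three average rates listed earlier. Here is a minimal Python sketch; the function names are hypothetical, chosen only for illustration, and it assumes the average rates hold steadily with no pauses between words or clicks. It reports unrounded values, so expect small differences from rounded figures.

```python
# Average rates quoted in the post (per minute).
TYPED_WPM = 40
SPOKEN_WPM = 150
CLICKS_PER_MINUTE = 384


def seconds_for_words(words: int, words_per_minute: float) -> float:
    """Time in seconds to produce `words` at the given rate."""
    return words / words_per_minute * 60


def clicks_in(seconds: float, clicks_per_minute: float = CLICKS_PER_MINUTE) -> float:
    """Number of clicks possible in `seconds` at the given click rate."""
    return seconds * clicks_per_minute / 60


def click_equivalents(words: int) -> dict:
    """Clicks that fit in the time needed to speak or type `words`."""
    return {
        "spoken": clicks_in(seconds_for_words(words, SPOKEN_WPM)),
        "typed": clicks_in(seconds_for_words(words, TYPED_WPM)),
    }


if __name__ == "__main__":
    # One row per sentence length, similar to the points on the chart.
    for n in (5, 10, 20, 50):
        eq = click_equivalents(n)
        print(f"{n:>3} words: {eq['spoken']:.1f} clicks (spoken), "
              f"{eq['typed']:.1f} clicks (typed)")
```

Swapping in rates measured from your own users, rather than population averages, would likely give a more honest comparison.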

So if clicking buttons is so much faster, why does anybody build conversational interfaces?

The Real World

Here’s the rub: if a task only requires clicking a mouse button repeatedly with no other interaction, a conversational interface would be hard to justify. However, few interactions are that simple. We often have to search, type, walk, and perform other actions as part of a larger workflow that may also involve button pushing. Asking Alexa to turn on the lights when you’re standing in front of the switch is silly, but if you’re 12 feet away in bed, it makes sense.

Consider a practical use case I’m familiar with from our work at Xembly: scheduling meetings. Creating a meeting with one person, where any available time slot will suffice and default options for the meeting title, duration, conference provider, etc., are all acceptable, requires roughly five words:

Schedule a meeting with @jason

These five words take 7.5 seconds to type, but I can click a mouse 48 times in the same timeframe. Realistically, this task is far more complicated in a tool like Google Calendar. To schedule a simple one-off meeting with one person, I have to click on an open spot, move to the guest section, start typing a few characters, find the attendee I wish to invite, select that person, click save, and then send. With all the mouse movement between clicks, it took me ~8 seconds to complete this task (and I was rushing). So, words actually do beat traditional UI in this example.

You might be thinking, “Didn’t he just disprove his entire point?” Not necessarily. My experiment made me question why my friends at Google haven’t improved the UX for Google Calendar. Why isn’t there a list of my most frequent meeting attendees on the main calendar page with a one-click button to schedule a meeting? With a few small UI tweaks, highly accurate AI for selecting optimal time slots, and data-driven default meeting parameters, booking a meeting could be reduced to a single button click, which would be far more efficient than typing or speaking any words.

Not every interaction can be fixed with simple UI tweaks. A contrary example is scheduling multiple recurring 1:1s. If you need to set up recurring 1:1 meetings with each person on a team of 8, it’s unlikely you can reduce the number of clickable interactions to something that will beat words. Using Xembly, this task could be accomplished with the following 7-word sentence:

Schedule individual weekly recurring meetings with @here
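
To put a rough number on that claim, here is a small Python sketch using the typing and speaking rates quoted earlier. The helper names are hypothetical, and the idea that a traditional UI would create the 8 meetings one at a time is my assumption for illustration, not a description of any particular calendar product.

```python
TYPED_WPM = 40      # average typing rate quoted earlier in the post
SPOKEN_WPM = 150    # average speaking rate quoted earlier in the post


def seconds_to_convey(sentence: str, words_per_minute: float) -> float:
    """Estimate seconds to type or speak `sentence` (whitespace-separated words)."""
    return len(sentence.split()) / words_per_minute * 60


def per_meeting_budget(sentence: str, meetings: int, words_per_minute: float) -> float:
    """Seconds a click-based UI could spend per meeting and still tie the sentence."""
    return seconds_to_convey(sentence, words_per_minute) / meetings


command = "Schedule individual weekly recurring meetings with @here"
typed = seconds_to_convey(command, TYPED_WPM)     # 7 words at 40 wpm
spoken = seconds_to_convey(command, SPOKEN_WPM)   # 7 words at 150 wpm
budget = per_meeting_budget(command, 8, TYPED_WPM)

print(f"typed: {typed:.1f} s, spoken: {spoken:.1f} s")
print(f"UI budget per meeting (8 people, typed): {budget:.2f} s")
```

Under these assumptions the sentence takes about 10.5 seconds to type, so a click-based flow would need to create each recurring meeting in roughly 1.3 seconds to keep up.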

To Speak or Not To Speak

So if you are a product manager, product designer, or startup founder, how do you decide when to build a conversational AI/Copilot component? While there isn’t one easy answer, here are some suggestions:

Measure how long it would take to complete a task using traditional UI elements.
Find the average number of words required to convey the task conversationally and use the graph above to estimate the average time needed to convey those words.
Measure the time and words necessary to complete any follow-up actions.
If the traditional UX is slower, ask yourself whether it can be optimized, and if so, consider prioritizing those UX updates over adding unnecessary conversational UI elements.
Always opt for the most efficient path for your user.
If conversation is the fastest path for the user, make conversational AI a primary feature.
If there are some viable use cases for conversational AI but they aren’t primary, find a way to offer the feature without sacrificing the traditional experience.
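
The suggestions above can be sketched as a simple comparison. This is only one possible reading of the checklist, written in Python: the `TaskMeasurement` class, the `recommend` function, and the returned labels are all hypothetical names, and the word rates are the averages quoted earlier rather than measurements from your own users.

```python
from dataclasses import dataclass

TYPED_WPM = 40      # average typing rate quoted earlier in the post
SPOKEN_WPM = 150    # average speaking rate quoted earlier in the post


@dataclass
class TaskMeasurement:
    name: str
    ui_seconds: float      # measured time to finish the task with traditional UI
    words: int             # average words to express the task, follow-ups included
    spoken: bool = False   # True for a voice interface, False for typed input


def conversational_seconds(task: TaskMeasurement) -> float:
    """Estimated time to convey the task in words at the average rate."""
    wpm = SPOKEN_WPM if task.spoken else TYPED_WPM
    return task.words / wpm * 60


def recommend(task: TaskMeasurement, ui_can_be_optimized: bool) -> str:
    """Pick the most efficient path for the user, following the checklist."""
    if task.ui_seconds <= conversational_seconds(task):
        return "keep the traditional UI"
    if ui_can_be_optimized:
        return "optimize the traditional UI first"
    return "make conversational AI a primary feature"


# The one-off meeting example: ~8 s in the calendar UI versus 5 typed words.
one_off = TaskMeasurement("one-off meeting", ui_seconds=8.0, words=5)
print(recommend(one_off, ui_can_be_optimized=True))
```

For the one-off meeting example, typing wins by a small margin, but because the UI looks easy to improve, the sketch points toward fixing the UI before adding a conversational layer.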

Closing Thoughts

Conversational AI can be a powerful interaction paradigm, but there’s been an overwhelming tendency to believe it is the solution to every problem. However, the evidence doesn’t support this.

Don’t immediately jump to a conversational AI-based solution. Measure how long it takes users to complete tasks within your experience and use the data to make the best decision for the user. As the old adage goes, “A picture is worth a thousand words”—the same may be true of a good old-fashioned button.