I probably sang that Manfred Mann song a thousand times in my teen years and I was pretty sure the last word in that lyric was a feminine hygiene product until Google came along and taught me otherwise. It turns out the lyrics to Blinded by the Light are very difficult to understand and so is conversational speech.
For my first substantive blog post on this site I’d like to continue on a theme we have been covering over at Marchex around the complexity in building automatic speech recognition (ASR) systems that can accurately understand unbounded conversational speech. In this post I intend to dive a little deeper into WHY conversational ASR systems are so difficult to build, possible solutions to improve them, and the bounty for those who finally succeed.
There are really three primary issues that keep current systems from accurately recognizing conversational speech: Data, Data, and Data. More specifically: Required Data Size, Lack of Publicly Available Data Sets, and the Cost and Complexity of Acquiring the Required Data.
Required Data Size
There is no strict answer for how much data is needed to solve a given machine learning problem, but one oft-cited rule is the “rule of 10”. The rule of 10 states that you need roughly 10 times as many examples as you have parameters. While there are multiple parts of an ASR system, including an acoustic model (AM) and a language model (LM), for now I am going to focus on the LM. One kind of parameter used in an LM is called an n-gram, in most cases specifically a trigram. A trigram is basically the probability of a particular sequence of 3 words being seen next to each other. So if we take the rule of 10, that would imply we need 10 times the number of 3 word combinations required for our task.
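To make the trigram idea concrete, here is a minimal sketch of counting trigrams and estimating the probability of a third word given the two before it. The sample utterances and the function names are made up for illustration; this is a plain maximum-likelihood count, not a description of how any production LM (Marchex's included) is built.

```python
from collections import Counter


def trigram_counts(utterances):
    """Count every run of three consecutive words across the utterances."""
    counts = Counter()
    for utterance in utterances:
        words = utterance.lower().split()
        # Zipping three staggered views of the word list yields each trigram.
        counts.update(zip(words, words[1:], words[2:]))
    return counts


def trigram_probability(counts, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2); 0.0 if the context is unseen."""
    context_total = sum(c for (a, b, _), c in counts.items() if (a, b) == (w1, w2))
    return counts[(w1, w2, w3)] / context_total if context_total else 0.0


# Hypothetical utterances, purely for illustration.
utterances = [
    "i would like to book a room",
    "i would like to cancel",
    "i would love to help",
]
counts = trigram_counts(utterances)
print(counts[("i", "would", "like")])
print(trigram_probability(counts, "i", "would", "like"))
```

Any trigram that never appears in the training utterances gets a count of zero here, which is exactly why the amount of training data matters so much.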
This is where the problem arises. You see, we humans write beautifully but we speak like idiots. Grammar goes out the window when people talk: we stutter, words are often repeated over and over while people search for their next thought, and honestly some folks downright make up words that don’t even exist. Taken together that means one can expect to see almost ANY combination of 3 words in the wild. Everything from “a a a” to “zebra zebra zebra”. So if you don’t mind rewinding your brain to high school math and combinatorics, that means the number of 3 word combinations is:
(Number of Words in US English)^3
or
(~500,000)^3 = 125 QUADRILLION (i.e. a really #$%&’ing big number)
If we apply the rule of 10 we would need 1.25 QUINTILLION (i.e. an even bigger #$%&’ing number) utterances (an utterance is basically a spoken sentence) containing examples of these trigrams. Let me put this in perspective for you. A single spoken utterance saved in a text file is roughly 50 bytes in size. So in order to store 1.25 QUINTILLION utterances I would need 50 * 1.25 QUINTILLION bytes of storage. Or … 62,500 Petabytes! For reference, 20 years of internet archiving had only consumed 23 petabytes as of 2015. And if that doesn’t frame it for you, think about it this way. The average utterance duration is roughly 1.5 seconds. If I were to string 1.25 QUINTILLION recorded utterances together it would take approximately 60 million millennia (roughly 59 billion years) to play it back!
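The back-of-the-envelope numbers above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the vocabulary size, bytes per utterance, and seconds per utterance are the rough figures quoted in this post, and the constant names are my own. Note that the playback time works out to about 59 billion years, which is roughly 60 million millennia.

```python
VOCABULARY_SIZE = 500_000      # rough count of US English words used in this post
RULE_OF_10 = 10                # examples needed per parameter
BYTES_PER_UTTERANCE = 50       # rough size of one utterance stored as text
SECONDS_PER_UTTERANCE = 1.5    # rough average utterance duration
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Every ordered choice of three words, repeats allowed.
possible_trigrams = VOCABULARY_SIZE ** 3
required_utterances = RULE_OF_10 * possible_trigrams

# 1 petabyte = 10**15 bytes.
storage_petabytes = required_utterances * BYTES_PER_UTTERANCE / 10**15
playback_years = required_utterances * SECONDS_PER_UTTERANCE / SECONDS_PER_YEAR

print(f"possible trigrams:   {possible_trigrams:.3e}")
print(f"required utterances: {required_utterances:.3e}")
print(f"storage (PB):        {storage_petabytes:,.0f}")
print(f"playback (years):    {playback_years:.3e}")
```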
So what’s the point? The point is that the data size required to cover all possible examples of spoken US English is almost inconceivable. Is the rule of 10 an exact science? No. Does it matter? No, because even if this estimate is wrong by 1/2 or 3/4 it is still huge. Ultimately the data size needed to properly train a conversational ASR system is gargantuan.
Lack of Publicly Available Data Sets
Okay, so we need a lot of data. Can’t we just buy it? No! Most publicly available data sets are shockingly small compared to the size of the domain I described above. As my fellow Marchex coworkers reported in our recently published research paper, two of the most commonly used data sets, Fisher English and Switchboard, are prohibitively small.
| | Switchboard | Fisher English | Marchex |
|---|---|---|---|
| Hours | 309 | 2,000 | 5,000+ |
| Speakers | 543 | 20,407 | 605,000 |
| Utterances | 391,000 | 1,600,000 | 11,700,000 |
| Conversations | 2,400 | 16,000+ | 288,000 |
| Words | 3,000,000 | 18,000,000 | 79,500,000 |
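To get a feel for how small these corpora are relative to the trigram space, here is a rough sketch that computes an upper bound on the fraction of possible trigrams each data set could contain. It assumes the ~500,000 word vocabulary from earlier in this post and uses the word counts from the table; the function name is made up for illustration. A corpus of N words has at most N - 2 trigram positions, so this bound is generous (real coverage is lower because many trigrams repeat and utterance boundaries break up the sequence).

```python
VOCABULARY_SIZE = 500_000                 # rough figure used earlier in this post
POSSIBLE_TRIGRAMS = VOCABULARY_SIZE ** 3

# Word counts taken from the table above.
corpus_words = {
    "Switchboard": 3_000_000,
    "Fisher English": 18_000_000,
    "Marchex": 79_500_000,
}


def max_trigram_coverage(word_count, possible=POSSIBLE_TRIGRAMS):
    """Upper bound on the fraction of possible trigrams a corpus could contain.

    A run of N words holds at most N - 2 trigrams, and each could be distinct.
    """
    return max(word_count - 2, 0) / possible


for name, words in corpus_words.items():
    print(f"{name}: at most {max_trigram_coverage(words):.2e} of possible trigrams")
```

Even under this generous bound, the largest data set in the table covers less than one billionth of the possible trigrams.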
Data Acquisition Cost and Complexity
Alright, if you can’t buy it, why don’t companies just go and collect the data themselves? Well, it turns out collecting 62,500 petabytes, or 60 million millennia’s worth, of people conversing is no simple task. There are two primary problems: collecting that amount of audio data and labeling it.
Audio Data
Where could someone acquire that quantity of audio data? Well, there are countless hours of TV and radio interviews out there, but the dialog is generally scripted and edited, so it is not reflective of true conversational speech. On top of that, in most cases companies do not have the legal rights to the data, and acquiring those rights would be prohibitively expensive.
Amazon, Apple, Microsoft, Google, and other companies are all collecting mountains of data from various voice assistants (Alexa, Siri, etc.) and voicemail messages. However all that speech data is mostly unidirectional and non-conversational (“Alexa tell me the weather” is not really conversational).
That leaves one obvious channel for acquiring conversational speech and that is phone calls. So why don’t companies just collect call recordings at scale? The answer is simple: WIRETAPPING.
In the US, wiretapping is governed by both federal and state statutes aimed at ensuring your communications are private, and there are criminal and civil penalties for those who violate the law. What makes wiretapping laws particularly problematic is that the law varies by state, specifically around who must consent to being recorded.
So why does this matter? Well because 12 states require bidirectional consent and phone networks are open (nobody can guarantee they control both sides of the call). While any company can update their “terms of service” to notify you that you are being recorded, they would have no easy way to guarantee that the other party has consented. Unless they start playing that pesky message “this call might be recorded or monitored” in front of every call, including your weekly call with your mother! This puts scalable call recording for consumer oriented phone services mostly out of reach since the risk of violating a criminal law is too high (I think it is safe to say Mark Zuckerberg has no interest in going to jail).
In fact, just ask Google, which has dealt with an ongoing wiretapping case because it was scanning the emails of Gmail users to place targeted ads. The argument is incredibly similar, in that Google was not just reading the email of Gmail users but also that of any Yahoo, Hotmail, etc. user who sent a Gmail user an email. In the 12 states requiring bidirectional consent, the non-Gmail users never consented, which has potentially caused Google to violate the law.
Labeling
Even if by some miracle we could collect that amount of audio data, how would we label it? In general ASR systems (like other supervised machine learning systems) require accurately labeled data (sometimes called “ground truth”). In the speech recognition world that generally involves hand transcribing data. And if it would take 60 million millennia just to play back that much speech, imagine how long it would take to hand transcribe it. Simply put, it is not feasible in our lifetimes at any reasonable cost.
What’s the Solution
It turns out almost all companies record phone calls. Recordings from any one company would have highly biased content but in aggregate consumer to business recorded phone calls are an amazing source of conversational speech at scale. Because you need a wide cross section of content to ensure subject matter diversity, companies who provide platform call recording solutions and have legal access to the aggregate data are really the best sources of this content.
But what about the labeling? Well the only reasonable solution for labeling that content is using unsupervised or semi-supervised automated solutions for labeling the data. This is where Marchex has invested and you can read more details about our semi-supervised approach in our research paper. I hope to cover this topic in detail in a future post.
Why Does Any of this Matter
You might be asking if highly accurate conversational speech recognition is really necessary. Or you might be thinking “My Alexa already works awesome.” But if you are a sci-fi nerd like me, you’re anxiously awaiting the day that you can set foot on the holodeck and have a real conversation with an AI entity, or crack open a beer with a fully conversant robot like Data from Star Trek. For that to happen we need to truly understand conversational speech. We need to understand it so machines can properly decipher what humans are saying, and we need to understand it so machines can generate speech that mimics human dialog.
Highly accurate conversational speech recognition is necessary for us to fulfill the promised vision of artificial intelligence. Who knows maybe in a few years a holographic Manfred Mann and I will be doing a duet in my own personal holodeck. Can you hear it? “Blinded by the light, revved up like a deuce …”
Speech recognition technology has improved a lot in the last 10 years. Google, Amazon, FB and dictation apps are collecting tons of data continuously. I think the accuracy rate will increase much faster in the following years (I believe we don’t have to wait 10 years to see that).
Great article! Enjoyed to read it.
Switchboard was pretty cutting edge back in the day 😉 Now that all of the big players are collecting tremendous amounts of training data, while they aren’t publicly available – do you really think the corpora for shared results are relevant?
In any case, true Star Trek speech understanding requires more than increased recognition rates – at least outside of highly restrictive domains. I hope that Marchex is moving in that direction as well.
@JackUnverferth I think corpora for shared results are interesting from an isolated test perspective, but the results do not necessarily generalize and therefore should be used carefully.
Totally agree about needing more than speech recognition accuracy .. although it winds up being the source data for many higher level NLP type algorithms.
You know of course that Bruce Springsteen wrote and recorded “Blinded by the Light” but Manfred Mann’s version sold better.