Speech Wrecko – Page 2 – Musings on speech recognition, audio signal processing, natural language processing, artificial intelligence, and managing teams that build those technologies

October 10, 2017 (updated November 13, 2017)

Creating a Management Philosophy

Just in case you couldn’t infer this from my previous posts, some folks consider me opinionated and occasionally dogmatic. What else would you expect from a born-and-raised New Yorker who grew up in a household where arguing your point was considered a cultural birthright?

Unfortunately, while having strong opinions and ideas can be a positive, I’ve found throughout my career that those ideas don’t always resonate with coworkers, even when they are sound. When I started to lead and manage larger teams this increasingly became a handicap, and I soon realized I needed a better way to get my thoughts across.

Enter Pete Carroll. That’s right, the same Pete Carroll who led the Seattle Seahawks to two Super Bowls and USC to two college national championships. Not too long ago a good friend and neighbor happened to tell me about a management training class he attended based on material from Pete Carroll and his “Win Forever” philosophy. It sounded very compelling and I immediately purchased Pete’s book based on the same concepts.

It was immediately clear from reading this book that Pete Carroll had faced similar challenges earlier in his career.

“But while I had a sense inside me of what we needed, I hadn’t articulated it very well. I didn’t have the details worked out in my own mind so that I could lay them out clearly and convincingly to anybody else”

In short, this book preaches a simple strategy for dealing with an inability to convey your ideas: write them down, iterate on them, and formulate them into a single cohesive vision. By doing so you change the conversation from “hey, here is my opinion” to “hey, I have a strategy for winning and here it is”. Or in Pete’s words:

“by December I finally had a clear, organized template of my core values, my philosophy, and – most importantly – my overarching vision for what I wanted to stand for as a person, a coach, and a competitor”

Armed with a singular vision and philosophy you have a solid foundation to convey your thoughts. And suddenly you have transformed disparate ideas into a recipe for success. The implications of documenting your philosophy are huge and by doing so you will:

Set clear expectations for your employees
Set expectations for executives, higher level managers and peers for how you operate and how it will benefit them
Have a recipe for success that you can continually improve and iterate on

With this information in mind I decided to take the years of ideas I had accumulated and jot them down. I refined them and wove them into an overarching vision. And when I thought about what I was ultimately trying to achieve it became clear that I was always trying to deliver truly innovative software to as many people as possible. And so my Innovate for the Masses™ philosophy was born. I present it to you below unedited. It is a continual work in progress but something that has served me well so far.

	Philosophy: Innovate for the Masses™

	Innovate for the Masses™
	Create unique and defensible value: Products should deliver something truly special that cannot be found in other solutions and simultaneously provide a defensible moat.
	Build best in class solutions: Products should be fully functional and should not cut corners. We should do everything required to delight customers, nothing more and nothing less.
	Support all customers: Products should be accessible by all existing and future customers. One off solutions are never okay.
	Enterprise class reliability and scalability: Products should be robust with 4 9’s reliability and the ability to scale to all customer demands.
	How we do it
	Ruthless prioritization: We question the necessity and value of every feature or piece of code. We only work on things that deliver essential value to the customer.
	Avoid premature optimization: We only build exactly what the customer needs. No more and no less.
	Read between the lines: We listen to our customers but don’t just cater to their demands. We find the commonality amongst all our customers’ requirements and build a truly unique and defensible product that surprises, delights, and addresses their needs.
	Communicate like crazy: We are one team with one vision and one goal. Everybody must constantly be talking to innovate and build cohesive products.
	Right spot right time: We believe every team member plays an important role in the team’s success whether that is in a leading role or a supporting role.
	Work harder than anyone else: We will win by outworking all of our competitors.
	Don’t chase the competition: We don’t chase every move our competitors make. We pay attention but follow our vision and goals and methodically work towards delivering on them without being distracted.
	Expectations for our people
	Be insanely passionate: Our employees exude passion. We are a passion first organization.
	Get a lot done / execute like crazy: Our employees are insanely productive. They get more done than anyone else.
	Care: Our employees give a shit. They care about the product, team, company, and customer like something they hold dear.
	Have a sense of humor: Our employees laugh. At themselves and each other. We believe you should leave work every day having smiled so much it hurts.
	Don’t whine or complain: Our employees don’t whine or complain; they express their opinions and try to instigate change in the direction they want to see. If a decision doesn’t go their way they disagree and commit.
	Don’t play politics: Our employees don’t play politics. They lay it all out on the table and do their job to the best of their abilities … that is what they get rewarded for.
	Dare to disagree: Our employees disagree loudly and proudly. Good disagreement is central to progress. Different opinions are valued and we seek out constructive conflict.

Having a vision and philosophy is not all rainbows and unicorns. Creating a philosophy and broadcasting it to your coworkers is the equivalent of driving a giant metal stake deep into the ground. You may find throughout the course of your career that sometimes people don’t agree with your strategy, and when they don’t you only have three options: change your strategy, change their minds, or move on. Or again to quote Pete:

“Coach Seifert was specifically adamant that I not change who I was or my mentality. He said clearly “Pete, you’ve got to do it the way you know how.” After my experience in New York, I wondered if I shouldn’t try to be more political, but the advice I got from the two mentors was uncompromising – and some of the best I ever received.”

In closing, if you are anything like me or Pete Carroll I strongly encourage you to write down your great ideas and formulate them into a cohesive philosophy. It will be well worth your while.

**For reference, Pete Carroll explicitly calls out John Wooden for influencing his strategies and techniques. I highly encourage people to also read John Wooden’s book “Wooden on Leadership”.

September 18, 2017 (updated October 8, 2017)

Managing Research Projects in an Agile Development Environment

Anyone who has worked in an agile organization has found that certain projects don’t quite fit the agile mold. Nowhere is this more apparent than with research-oriented projects. After all, if there is complete uncertainty in the scope and outcome of a project, as would be the case in a research project, how do you create user stories and estimate story points? And if you can’t create stories and estimate the associated costs, how can you hold your team accountable, communicate status to the rest of the organization, and make cost / benefit tradeoffs? Simple! You can’t.

I’ve personally dealt with this issue after hiring several researchers to work on an agile software product team. Initially, I struggled to interleave our research projects with our other production work so I started looking for a solution. The answer to my problem came after reviewing the agile literature and the scientific method and concluding that research projects really just represent an extreme of what the agile process is ultimately trying to solve. Below I will walk you through how I arrived at this solution and details on how you can apply similar tactics in your own research organization.

AGILE PROCESS

Early in my career at Microsoft someone handed me a copy of Steve McConnell’s book Code Complete.

At the time my greatest takeaway from that book was the concept of the “Cone of Uncertainty”. The “Cone of Uncertainty” states that the uncertainty of a given project decreases as time progresses and more details are fleshed out.

Historically the “Cone of Uncertainty” was dealt with by creating detailed upfront plans and using waterfall project management approaches. The trouble with those methodologies is that they’re extremely resistant to scope change. Largely because scope change reintroduces uncertainty.

The Agile Manifesto attempts to eliminate the “Cone of Uncertainty” problem by following the principle of “Responding to change over following a plan”. Most agile methodologies use some form of iterative development to reduce uncertainty, the idea being that if you’re working on smaller, well-defined chunks of a larger project, uncertainty is removed and the project can slowly adapt to changing requirements. As Mike Cohn wrote in an article titled “The Certainty of Uncertainty”:

“The best way to deal with uncertainty is to iterate. To reduce uncertainty about what the product should be, work in short iterations and show (or, ideally give) working software to users every few weeks. Uncertainty about how to develop the product is similarly reduced by iterating. For example, missing tasks can be added to plans, inadequate designs can be corrected sooner rather than later, bad estimates can be amended, and so on.”

If I take the above information together I can conclude two things. First, the agile method attempts to reduce or eliminate uncertainty by making every project a function of smaller work items iterated over time. Or framed in mathematical notation:

Where: T = max iterations, M = backlog, N = user stories belonging to M
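One way to write this down, as a sketch under assumed notation rather than the original formula: let $N_t$ be the set of user stories taken from the backlog $M$ in iteration $t$, and let $c(n)$ be the effort of story $n$ (both $N_t$ and $c(n)$ are symbols introduced here for illustration).

```latex
\[
  \text{Project} \;=\; \sum_{t=1}^{T} \; \sum_{n \in N_t} c(n),
  \qquad N_t \subseteq M, \quad \bigcup_{t=1}^{T} N_t = M
\]
```

With a finite backlog $M$ and a finite $T$, the total is bounded, which is what makes it possible to estimate.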

Secondly, if a research project is really just a project with maximum uncertainty then the same framework should apply. Only there would be an unbounded number of work items over an unbounded amount of time. Or framed in mathematical notation:
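A hedged way to express the unbounded case, keeping $T$ (iterations) and $M$ (backlog) as defined earlier and introducing $N_t$ (the stories from $M$ worked in iteration $t$) and $c(n)$ (the effort of story $n$) as assumed notation:

```latex
\[
  \text{Research project} \;=\; \lim_{T \to \infty} \; \sum_{t=1}^{T} \; \sum_{n \in N_t} c(n),
  \qquad N_t \subseteq M, \quad |M| \to \infty
\]
```

Nothing in this expression stops the sum from growing, so the bounds have to come from outside the math, in the form of explicit limits on $T$ and $M$.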

According to this logic a research project should actually work within an agile framework. We just need to figure out how to construct M (i.e. backlog) and how to bound M and T (i.e. number of iterations).

SCIENTIFIC METHOD

So what are reasonable user stories for a research project and why are they potentially infinite? It occurred to me that research in general follows the scientific method and that the scientific method may be a good framework for story generation.

In essence the scientific method can be boiled down to three phases: a research phase, an iterative hypothesis testing phase, and a communicate or productize phase. The unbounded component of research is that many hypotheses end in failure leading to another hypothesis that must be tested and this can potentially go on ad nauseam. This provided me a compelling framework for how to break research into user stories.

The first story in any research project correlates to the first phase in the scientific method. This story should be a time-bounded spike that frames the initial question, covers any background research, and has the acceptance criterion of generating the required stories for the next phase of the project, hypothesis testing.

The next set of stories are all part of the hypothesis testing phase. These stories include any development work required to test the hypothesis, any data collection required, running the tests, and analyzing the results. If the hypothesis proves false the team should circle back to the background research phase and continue on with the process.

The final phase in this framework is only relevant when a hypothesis is proven to be true. This phase contains multiple stories including any communication or publishing of results, IP protection, and a handoff to whomever might be building the final product (which might be the same team). The final handoff story should also be a spike and the acceptance criteria should include the user stories required for the production deployment.
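The three phases can be sketched as a small story generator. This is only an illustration: the `Story` class, the `plan_research_stories` function, and the specific story titles are hypothetical names chosen here, and the `test_hypothesis` callback stands in for the real experiment.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Story:
    phase: str  # "research", "hypothesis", or "productize"
    title: str


def plan_research_stories(
    hypotheses: List[str],
    test_hypothesis: Callable[[str], bool],
) -> List[Story]:
    """Return the stories in the order they would be worked.

    Each failed hypothesis loops back to a new research spike; the first
    hypothesis that holds triggers the productize phase and ends the loop.
    """
    stories: List[Story] = []
    for hypothesis in hypotheses:
        # Phase 1: a time-bounded spike whose output is the hypothesis stories.
        stories.append(Story("research", f"Spike: background research for {hypothesis}"))

        # Phase 2: everything needed to test the hypothesis.
        for step in ("Build test tooling", "Collect data", "Run tests", "Analyze results"):
            stories.append(Story("hypothesis", f"{step} for {hypothesis}"))

        if test_hypothesis(hypothesis):
            # Phase 3: only reached when a hypothesis holds.
            for step in ("Publish results", "Protect IP", "Spike: production handoff"):
                stories.append(Story("productize", f"{step} for {hypothesis}"))
            break
    return stories


# Example: the first hypothesis fails, the second one holds.
example_stories = plan_research_stories(
    ["hypothesis A", "hypothesis B"],
    test_hypothesis=lambda h: h == "hypothesis B",
)
```

Passing a finite list of hypotheses is already a crude bound on the project: when the list runs out without a success, the plan simply ends.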

BOUNDING AN UNBOUNDED PROJECT

Now how do you go about making sure research stories don’t go on forever? How do you bound T and M? And how do you communicate the cost / value trade offs with management?

I have found that the previously described framework only works if you apply the following guidelines in conjunction. Specifically:

For any research project to be considered we must have enough information for the project to pass the “sniff test” (i.e. is it possible in a reasonable amount of time and does it make business sense?).
The initial estimate for a research project is based on the expected number of hypothesis iterations, and the cost must be in line with the expected project value (i.e. if the research is perceived to have large value it may be worth iterating for a long time).
If the number of hypothesis iterations exceeds the original estimate, the cost/benefit analysis must be revisited, and the project should be canceled if the cost has exceeded the expected value.
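The last two guidelines amount to a simple decision rule. The sketch below is one possible encoding, assuming a flat cost per hypothesis iteration; the function names, parameters, and return labels are hypothetical.

```python
def initial_estimate_is_acceptable(
    estimated_iterations: int,
    cost_per_iteration: float,
    expected_value: float,
) -> bool:
    """The up-front estimate must be in line with the expected project value."""
    return estimated_iterations * cost_per_iteration <= expected_value


def review_research_project(
    iterations_done: int,
    estimated_iterations: int,
    cost_per_iteration: float,
    expected_value: float,
) -> str:
    """Return "continue", "revisit", or "cancel" for a running research project."""
    if iterations_done <= estimated_iterations:
        # Still within the original estimate, so nothing to re-decide yet.
        return "continue"

    # Past the original estimate: the cost/benefit analysis must be revisited.
    cost_so_far = iterations_done * cost_per_iteration
    if cost_so_far > expected_value:
        return "cancel"
    return "revisit"
```

For example, with an estimate of 5 iterations at a cost of 10 each and an expected value of 100, iteration 3 returns "continue", iteration 6 returns "revisit", and iteration 11 returns "cancel".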

CONCLUSION

What I have presented here is a process by which you can take an unbounded research project and place a structure around it that will work in companies using an agile development methodology. Besides allowing research projects to function in an agile organization this framework also provides a method for bounding research problems and communicating the cost / benefit trade offs to management and other relevant parties. For those who have faced similar issues integrating research oriented projects into an agile culture I hope this methodology provides some ideas on how you can better integrate research into your processes.

September 5, 2017 (updated September 6, 2017)

Microsoft’s 5.1% Word Error Rate (WER) Announcement is Complete and Utter Bullshit

I apologize! That title was actually generated by Microsoft’s speech recognition system incorrectly transcribing “Microsoft’s 5.1% Word Error Rate (WER) Announcement is Completely Misleading”. Okay, that was snarky, but I promise Microsoft compelled me to write that. You see in the course of editing my previous post Microsoft had to go and put out a press release announcing “Microsoft Researchers Achieve new Conversational Speech Recognition Milestone”. Their announcement flies in the face of my previous post and therefore I had no choice but to attempt an epic takedown.

Before I try to dismantle Microsoft’s irrational claim I would like to state that the researchers at Microsoft (some of whom I have crossed paths with while working on the Xbox Kinect and HoloLens) have done some solid research with potential implications for how we build production speech recognition systems. I have no issues with the technical nature of the research paper underpinning the press release, but I do take issue with the marketing and PR spin applied on top of it. So without further ado, “LET’S GET READY TO RUMBLE”.

There are two primary issues with the announcement made by Microsoft:

Does Microsoft’s testing provide conclusive evidence that the 5.1% WER results will generalize?
Are the tactics used viable from a cost/compute/timeliness perspective in a production system?

Let’s tackle each of these issues independently.

Will the Results Generalize

In my previous post I discussed why large data sets were critical for training truly accurate conversational speech recognition systems. While I do take issue with the data size used to train the Microsoft speech recognition system, the larger issue is with the test set used to validate the word error rate.
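For readers who have not computed it before, word error rate is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the recognizer output, divided by the number of reference words. Here is a minimal sketch; the function name is my own, it splits on whitespace only, and real scoring tools also normalize text before aligning.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because the denominator is the number of reference words, a small test set means each individual error moves the reported percentage more.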

In Andrew Ng’s seminal talk on the “nuts and bolts of machine learning”, he goes into great detail on the different data sets required for training, testing and validating machine learning algorithms. I encourage anybody interested in the optimal process for training and testing machine learning / AI-like algorithms to watch this seriously awesome video. In terms of Microsoft’s research I want to focus on the relatively small size of their test corpus, its overlap with the training data, and the fact that the chosen corpus appears cherry-picked.

Corpus Size

The test set Microsoft selected for calculating the reported 5.1% WER is the 2000 NIST CTS SWITCHBOARD corpus. While I was unable to find the specific number of hours of conversation in this test corpus, I was able to confirm that the 1998 and 2001 NIST CTS data sets contained 3 and 5 hours of conversation respectively. We can therefore assume the 2000 set is similar in duration. Considering the overall size of the conversational speech domain explained in my previous post, a test set of this size is hardly sufficient for making any broad claims about meeting or beating human transcription accuracy.

Training Data Overlap

As you dig into the details of the NIST corpus a dirty little secret is quickly revealed. Let me start by quoting directly from the source:

“Of the forty speakers in these conversations thirty-six appear in conversations of the published Switchboard Corpus.”

Let me translate that for you. Thirty-six of the speakers in the test corpus are the same speakers used in Microsoft’s training corpus. I’ll also remind you that the Switchboard corpus only has 543 speakers to begin with. This raises a foundational question about whether the test data is really distinct relative to the training set. You see, almost all modern speech recognition systems use something called i-vectors to help achieve speaker independence (sometimes called speaker adaptation). Since the same speakers, on the same devices, in the same environments exist in both the training and test corpora, there will invariably be a correlation between the i-vectors generated by the two data sets.

Per the diagram below, a truly honest measure of WER would require that the test data be truly distinct from the training set. In other words, it should pull from a data set that includes different speakers, different content, and different acoustic environments. What is clear from the Microsoft paper is that this didn’t happen, which calls into question whether the published results will truly generalize. It also greatly diminishes the validity of any claim about a new “milestone” being achieved in conversational speech recognition.
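Checking for this kind of leakage is cheap if you have speaker IDs for both sets. The sketch below uses made-up speaker IDs purely to mirror the 36-of-40 figure quoted above; the function names are hypothetical and nothing here reads the actual corpora.

```python
from typing import Iterable, Set


def speaker_overlap(train_speakers: Iterable[str], test_speakers: Iterable[str]) -> Set[str]:
    """Speakers that appear in both the training and the test set."""
    return set(train_speakers) & set(test_speakers)


def overlap_fraction(train_speakers: Iterable[str], test_speakers: Iterable[str]) -> float:
    """Fraction of test speakers that were also seen during training."""
    test_set = set(test_speakers)
    if not test_set:
        raise ValueError("test set has no speakers")
    return len(speaker_overlap(train_speakers, test_set)) / len(test_set)


# Made-up IDs: 543 training speakers, 40 test speakers, 36 of them shared.
train_ids = {f"train_spk_{i:04d}" for i in range(543)}
test_ids = {f"train_spk_{i:04d}" for i in range(36)} | {f"unseen_spk_{i}" for i in range(4)}
shared_fraction = overlap_fraction(train_ids, test_ids)  # 36 / 40
```

A speaker-disjoint test set would report a fraction of zero; here 90% of the test speakers were heard during training.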

Cherry-picking

It’s worth noting that the full 2000 NIST CTS corpus actually contains a total of 40 conversations. Twenty of those conversations are from the Switchboard corpus and twenty are from a different corpus called “Call Home”. This raises the question of why Microsoft only validated against the Switchboard portion of the corpus. While I can’t say for sure what their intent was, my best guess is that if they had used the Call Home data the results would not have led to the desired goal of meeting or beating “human accuracy”.

Taken together, the small corpus, the overlapping data, and the cherry-picked data set leave you asking: did Microsoft really achieve a “new conversational speech recognition milestone”?

Is it Production Ready

EBTKS. For those not familiar with texting slang, that stands for “Everything But the Kitchen Sink”, and it’s really the best description of the system Microsoft used for this research. This calls into question the production viability of their proposed solution.

Ensemble Models

At the acoustic model (AM) and language model (LM) layer Microsoft is using an ensemble model technique. This technique requires training multiple models and processing each utterance through every model. A separate algorithm is used to combine the outputs of the different models. In essence this equates to trying to run multiple recognizers at once for every audio utterance. It currently requires an enormous number of machines to transcribe phone calls in real time at scale. Microsoft appears to be running 4 distinct AMs and multiple LMs, which will have serious performance impacts. This raises questions about the number of machines and associated costs required to run a system like the one used in Microsoft’s paper.

Language Model Rescoring

On top of the ensemble modeling Microsoft is also using language model rescoring. In order to rescore you usually have an initial language model produce an N-best lattice, which is basically the top N paths predicted by the language model. This lattice needs to be stored or held in memory in order for the rescoring to take place. In Microsoft’s case they are generating a 500-best lattice. While not crazy, holding a 500-best lattice in memory in a scaled production speech recognition system would not be ideal unless it provided significant accuracy gains. According to the paper the gains from rescoring were minimal at best.
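To make the mechanics concrete, here is a generic sketch of N-best rescoring. It is not the system from Microsoft’s paper: the `Hypothesis` class, the `rescore_nbest` function, the toy scores, and the toy second-pass language model are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    text: str
    acoustic_score: float  # log-probability from the acoustic model
    lm_score: float        # log-probability from a language model


def rescore_nbest(
    nbest: List[Hypothesis],
    rescoring_lm: Callable[[str], float],
    lm_weight: float = 1.0,
) -> List[Hypothesis]:
    """Replace first-pass LM scores with second-pass scores and re-rank.

    The whole N-best list has to be kept around until this step runs, which
    is the memory cost discussed above.
    """
    rescored = [
        Hypothesis(h.text, h.acoustic_score, rescoring_lm(h.text)) for h in nbest
    ]
    return sorted(
        rescored,
        key=lambda h: h.acoustic_score + lm_weight * h.lm_score,
        reverse=True,
    )


def toy_second_pass_lm(text: str) -> float:
    """Stand-in for a stronger language model (hypothetical scores)."""
    return -2.0 if text == "recognize speech" else -8.0


first_pass = [
    Hypothesis("wreck a nice beach", acoustic_score=-10.0, lm_score=-5.0),
    Hypothesis("recognize speech", acoustic_score=-10.5, lm_score=-6.0),
]
second_pass = rescore_nbest(first_pass, toy_second_pass_lm)
```

In this toy case the second-pass model flips the ranking. The production question is whether flips like this happen often enough to justify carrying 500 hypotheses per utterance.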

In Conclusion

So where does that leave us? Microsoft has done some great research on advancing speech recognition algorithms. Research that I greatly appreciate and hope to review further. However, for Microsoft to even imply that they achieved some epic milestone in matching human transcription accuracy is downright preposterous.

In the words of renowned Johns Hopkins speech recognition researcher Daniel Povey:

“… … this whole competition between IBM and Microsoft on Switchboard is just a pissing contest, in which they both try to add in more data and bigger system combinations to beat the other one’s number. It doesn’t really indicate any special progress.”

August 23, 2017 (updated August 25, 2017)

“Blinded by the Light, Revved up like a ???”

I probably sang that Manfred Mann song a thousand times in my teen years and I was pretty sure the last word in that lyric was a feminine hygiene product until Google came along and taught me otherwise. It turns out the lyrics to Blinded by the Light are very difficult to understand and so is conversational speech.

For my first substantive blog post on this site I’d like to continue on a theme we have been covering over at Marchex around the complexity in building automatic speech recognition (ASR) systems that can accurately understand unbounded conversational speech. In this post I intend to dive a little deeper into WHY conversational ASR systems are so difficult to build, possible solutions to improve them, and the bounty for those who finally succeed.

There are really three primary issues that are limiting current systems from accurately recognizing conversational speech: Data, Data, and Data. More specifically: Required Data Size, Lack of Publicly Available Data Sets, and Cost and Complexity of Acquiring the Required Data.

Required Data Size

There is no strict answer for how much data is needed to solve a given machine learning problem, but one oft-cited rule is the “rule of 10”. The rule of 10 states that you need roughly 10 times as many examples as you have parameters. While there are multiple parts of an ASR system including an acoustic model (AM) and a language model (LM), for now I am going to focus on the LM. One parameter used in an LM is called an n-gram, specifically in most cases a trigram. A trigram parameter is basically the probability of a given sequence of 3 words being seen next to each other. So if we take the rule of 10, that would imply we need 10 times as many examples as there are 3-word combinations required for our task.
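If trigrams are new to you, counting them takes only a few lines. This is a bare-bones sketch (the function name is my own, with whitespace tokenization and no smoothing or sentence boundary markers), not how a production language model is built.

```python
from collections import Counter
from typing import Iterable


def count_trigrams(utterances: Iterable[str]) -> Counter:
    """Count every run of 3 consecutive words across the utterances."""
    counts: Counter = Counter()
    for utterance in utterances:
        words = utterance.lower().split()
        # zip stops at the shortest input, so this yields each 3-word window.
        counts.update(zip(words, words[1:], words[2:]))
    return counts


# A disfluent utterance next to a clean one.
trigram_counts = count_trigrams(["I I I want to to go", "I want to go"])
```

The disfluent utterance contributes trigrams such as ("i", "i", "i") and ("want", "to", "to") that would rarely show up in written text.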

This is where the problem arises. You see, we humans write beautifully but we speak like idiots. Grammar goes out the window when people talk, we stutter, words are often repeated over and over while people search for their next thought, and honestly some folks downright make up words that don’t even exist. Taken together that means one can expect to see almost ANY combination of 3 words in the wild. Everything from “a a a” to “zebra zebra zebra”. So if you don’t mind rewinding your brain to high school math and combinatorics, that means the number of 3-word combinations is:

(Number of Words in US English)^3

(~500,000)^3 = 125 QUADRILLION (i.e. a really #$%&’ing big number)

If we apply the rule of 10 we would need 1.25 QUINTILLION (i.e. an even bigger #$%&’ing number) utterances (basically spoken sentences) containing examples of these trigrams. Let me put this in perspective for you. A single spoken utterance saved in a text file is roughly 50 bytes in size. So in order to store 1.25 QUINTILLION utterances I would need 50 * 1.25 QUINTILLION bytes of storage. Or … 62,500 Petabytes! For reference 20 years of internet archiving only consumed 23 petabytes as of 2015. And if that doesn’t frame it for you think about it this way. The average utterance duration is roughly 1.5 seconds. If I were to string 1.25 QUINTILLION recorded utterances together it would take approximately 60 billion years (1.5 seconds * 1.25 QUINTILLION, or roughly 60 million millennia) to play it back!
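The back-of-the-envelope arithmetic can be reproduced in a few lines. The inputs are the rough figures from the text (500,000 words, 50 bytes and 1.5 seconds per utterance); the constant names and the choice of 365.25 days per year are my own.

```python
VOCABULARY_SIZE = 500_000       # approximate US English vocabulary
RULE_OF_10 = 10                 # examples needed per parameter
BYTES_PER_UTTERANCE = 50        # one utterance stored as text
SECONDS_PER_UTTERANCE = 1.5     # average spoken utterance
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60
BYTES_PER_PETABYTE = 10**15

possible_trigrams = VOCABULARY_SIZE**3                  # 125 quadrillion
required_utterances = RULE_OF_10 * possible_trigrams    # 1.25 quintillion

storage_petabytes = required_utterances * BYTES_PER_UTTERANCE / BYTES_PER_PETABYTE
playback_years = required_utterances * SECONDS_PER_UTTERANCE / SECONDS_PER_YEAR

print(f"possible trigrams:   {possible_trigrams:.3e}")
print(f"required utterances: {required_utterances:.3e}")
print(f"text storage:        {storage_petabytes:,.0f} PB")
print(f"playback time:       {playback_years:.2e} years")
```

Changing any single input by a factor of two or even ten still leaves totals far beyond anything that could be collected.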

So what’s the point? The point is that the data size required to cover all possible examples of spoken US English is almost inconceivable. Is the rule of 10 an exact science? No. Does it matter? No, because even if this estimate is wrong by 1/2 or 3/4 it is still huge. Ultimately the data size needed to properly train a conversational ASR system is gargantuan.

Lack of Publicly Available Data Sets

Okay, so we need a lot of data. Can’t we just buy it? No! Most publicly available data sets are shockingly small compared to the size of the domain I described above. As my fellow Marchex coworkers reported in our recently published research paper, the size of 2 of the most commonly used data sets, Fisher English and Switchboard, is prohibitively small.


	Switchboard	Fisher English	Marchex
Hours	309	2,000	5,000+
Speakers	543	20,407	605,000
Utterances	391,000	1,600,000	11,700,000
Conversations	2,400	16,000+	288,000
Words	3,000,000	18,000,000	79,500,000
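To put those word counts next to the trigram space, here is a rough sketch. It assumes a 500,000-word vocabulary and uses the simplification that a corpus of W words contains at most W trigram occurrences, so the results are generous upper bounds on coverage; the variable names are my own.

```python
POSSIBLE_TRIGRAMS = 500_000**3  # 125 quadrillion, assuming a 500,000-word vocabulary

# Word counts taken from the table above.
corpus_words = {
    "Switchboard": 3_000_000,
    "Fisher English": 18_000_000,
    "Marchex": 79_500_000,
}

# Upper bound on the share of possible trigrams each corpus could contain.
max_trigram_coverage = {
    name: words / POSSIBLE_TRIGRAMS for name, words in corpus_words.items()
}

for name, coverage in max_trigram_coverage.items():
    print(f"{name}: at most {coverage:.2e} of all possible trigrams")
```

Even the largest corpus in the table tops out at well under one billionth of the space.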

Data Acquisition Cost and Complexity

Alright, if you can’t buy it why don’t companies just go and collect the data themselves? Well, it turns out collecting 62,500 petabytes, or roughly 60 billion years’ worth of people conversing, is no simple task. There are two primary problems: collecting that amount of audio data and labeling it.

Audio Data

Where could someone acquire that quantity of audio data? Well, there are countless hours of TV and Radio interviews out there but the dialog is generally scripted and edited so not reflective of true conversational speech. On top of that in most cases companies do not have the legal rights to the data and acquiring those rights would be prohibitively expensive.

Amazon, Apple, Microsoft, Google, and other companies are all collecting mountains of data from various voice assistants (Alexa, Siri, etc.) and voicemail messages. However all that speech data is mostly unidirectional and non-conversational (“Alexa tell me the weather” is not really conversational).

That leaves one obvious channel for acquiring conversational speech and that is phone calls. So why don’t companies just collect call recordings at scale? The answer is simple: WIRETAPPING.

In the US, wiretapping is governed by federal and state statutes aimed at ensuring your communications are private, and there are criminal and civil penalties for those who violate the law. What makes wiretapping laws particularly problematic is that the law varies by state, specifically around who must consent to being recorded.

So why does this matter? Well because 12 states require bidirectional consent and phone networks are open (nobody can guarantee they control both sides of the call). While any company can update their “terms of service” to notify you that you are being recorded, they would have no easy way to guarantee that the other party has consented. Unless they start playing that pesky message “this call might be recorded or monitored” in front of every call, including your weekly call with your mother! This puts scalable call recording for consumer oriented phone services mostly out of reach since the risk of violating a criminal law is too high (I think it is safe to say Mark Zuckerberg has no interest in going to jail).

In fact, just ask Google, which has dealt with an ongoing wiretapping case because they were scanning the emails of Gmail users to place targeted ads. The argument is in fact incredibly similar in that Google was not just reading the email of Gmail users but also that of any Yahoo, Hotmail, etc. user who sent a Gmail user an email. In the 12 states requiring bidirectional consent the non-Gmail users never consented, which has potentially caused Google to violate the law.

Labeling

Even if by some miracle we could collect that amount of audio data, how would we label it? In general ASR systems (and all other supervised machine learning systems) require accurately labeled data (sometimes called “ground truth”). In the speech recognition world that generally involves hand transcribing data. And if it would take roughly 60 billion years to play back that much speech, imagine how long it would take to hand transcribe it. Simply put, it is not feasible in our lifetimes at any reasonable cost.

What’s the Solution

It turns out almost all companies record phone calls. Recordings from any one company would have highly biased content but in aggregate consumer to business recorded phone calls are an amazing source of conversational speech at scale. Because you need a wide cross section of content to ensure subject matter diversity, companies who provide platform call recording solutions and have legal access to the aggregate data are really the best sources of this content.

But what about the labeling? Well the only reasonable solution for labeling that content is using unsupervised or semi-supervised automated solutions for labeling the data. This is where Marchex has invested and you can read more details about our semi-supervised approach in our research paper. I hope to cover this topic in detail in a future post.
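To give a flavor of what semi-supervised labeling can look like in general, here is a generic confidence-filtering sketch. It is not a description of the Marchex approach (the paper covers that); the function name, the `transcribe` callback, and the 0.9 threshold are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Tuple


def select_pseudo_labels(
    audio_ids: Iterable[str],
    transcribe: Callable[[str], Tuple[str, float]],
    min_confidence: float = 0.9,
) -> List[Tuple[str, str]]:
    """Keep machine transcripts that an existing recognizer is confident about.

    `transcribe` returns (text, confidence in [0, 1]) for an audio ID. The
    kept pairs can serve as extra training data without hand transcription.
    """
    selected: List[Tuple[str, str]] = []
    for audio_id in audio_ids:
        text, confidence = transcribe(audio_id)
        if confidence >= min_confidence:
            selected.append((audio_id, text))
    return selected


# Toy recognizer output standing in for a real ASR system.
toy_results = {
    "call_001": ("thanks for calling how can i help", 0.97),
    "call_002": ("uh i i was um calling about the", 0.62),
    "call_003": ("what time do you close today", 0.93),
}
pseudo_labeled = select_pseudo_labels(toy_results, lambda audio_id: toy_results[audio_id])
```

The obvious tradeoff is that the hardest, most disfluent audio is the most likely to be filtered out, so the threshold needs tuning against a small hand-transcribed set.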

Why Does Any of this Matter

You might be asking if highly accurate conversational speech recognition is really necessary. Or you might be thinking “My Alexa already works awesome”. But if you are a sci-fi nerd like me you’re anxiously awaiting the day that you can set foot on the holodeck and have a real conversation with an AI entity or crack open a beer with a fully conversant robot like Data from Star Trek. For that to happen we need to truly understand conversational speech. We need to understand it so machines can properly decipher what humans are saying and we need to understand it so machines can generate speech that mimics human dialog.

Highly accurate conversational speech recognition is necessary for us to fulfill the promised vision of artificial intelligence. Who knows maybe in a few years a holographic Manfred Mann and I will be doing a duet in my own personal holodeck. Can you hear it? “Blinded by the light, revved up like a deuce …”

July 6, 2017

Hello World!

I’m back! After a long hiatus from blogging I’ve decided to give it a go again. I’d like to use this blog as a forum for sharing my thoughts on advanced technologies like speech recognition, audio signal processing, natural language processing, and artificial intelligence. Additionally I will be covering my theories on how to manage teams building these types of technologies and how we bring these products to market. I hope you find the content engaging and stay tuned for my first official post in the coming weeks!
