Teachers Keep on Teaching – ‘til I Reach my Highest Ground

Forgive me Stevie Wonder for slightly reordering your lyrics, but I think you’d agree that it’s hard to reach your highest ground without the help of teachers. Being a teacher is often a thankless job. Nobody gets rich or famous for being a teacher, yet the contribution teachers make to society is invaluable. So, in honor of Teacher Appreciation Week, I’d like to take the opportunity to thank some of the teachers who helped me reach my highest ground.

The Teachers Who Made Me What I Am Today

Erik Lawrence


My saxophone teacher from the age of 9 until I was 18, Erik is largely responsible for my love of music. Seeing that my career has centered around music and technology, Erik can take much of the credit for planting the seed and nurturing the music branch of that tree. When I began to take an interest in the piano and composing, Erik was quick to urge my parents to get me piano lessons. And when my parents wanted to buy me a new saxophone as a graduation present, it was Erik who took me to the best New York area music stores to find the perfect sax. Beyond music, Erik was an extremely positive role model throughout my formative years. He taught me to treat others with respect, be accountable for my mistakes, and value my time and the time of others. For all of the above and much more I am grateful for the impact Erik had on my life.

David Snider


By the age of 13, I had taught myself to play piano, I was beginning to compose my own music, and I was slowly collecting a variety of audio electronics. I owned a Yamaha SY55 synthesizer with an onboard sequencer, a Tascam four-track recorder, and boxes of random audio cables. Recognizing my newfound passions, Erik Lawrence convinced my parents to get me piano lessons and introduced me to David Snider. If Erik was my music guru then David was my technology guru. Upon realizing my knack for creating music and tinkering with anything music electronics-related, David managed to convince my parents to buy me my first computer (an Apple Mac) and my first audio software package (Mark of the Unicorn’s (MOTU) Performer). He took my meandering teenage hobbies and turned them into a focused passion that would ultimately drive a large part of my career. David brought much more than technology to my life. He taught me how to play jazz piano, something I still do to this day. In fact, 30+ years later I can still play the song Misty exactly the way he taught it to me. While David may not remember this, he called me in the early days of my freshman year of music school to wish me luck and give me some advice on avoiding some of the pitfalls of a musician’s life. It seemed inconsequential at the time, but the fact that he cared enough to do that is remarkable.

Eileen M Curley

In the first quarter of my freshman year of high school, I had a failing grade in math. This was not acceptable in the Flaks household, so my mother reached out to the teacher to see if she had any advice on how I could improve my grade. Ms. Curley selflessly gave her own free time to provide me with extra help. It quickly became apparent that my problem with math was unrelated to my aptitude and purely a function of not paying attention and not doing the work. In no time at all, I went from an F to an A. That year I scored a 99 on the New York State standardized math exam (the Regents), losing only 1 point for carelessly not carrying a negative sign down to my final answer on one question. When Ms. Curley received the results from that exam, she took time out of her day to call my house directly and excitedly tell my mother how well I did. My time with Ms. Curley was a turning point in my life. Little did she know that I would ultimately go on to be a math major in college, leading to a career in math and computer science.

William (Bill) Garbinsky (a.k.a. Mr. G)

William Garbinsky was a musician first and a high school music teacher second. He loved music and he gave innumerable hours during and after school to help students like me become better musicians. He taught the concert band, wind ensemble, marching band, and jazz band and I was a member of all of them. He gave band nerds like me a place to call home and surrounded us with a like-minded peer group that made us all feel like we were part of something bigger. Mr. G even took time out of his day to teach AP Music History and Music Theory classes to the small cohort of students who were interested. Thanks to those Advanced Placement college credits I had some free time on my schedule when I entered college, which I promptly filled with math classes. Sadly, Mr. G passed away some years ago, but I hope he knows what a difference he made in my life and the lives of countless others.

James (Jim) McElwaine


I was lucky enough to be accepted into the Conservatory of Music at Purchase College. I was even more fortunate to study under James McElwaine. Professor McElwaine was a physics student before going full bore into music. So, when he stumbled upon a kid in his music program who was taking calculus classes as electives, he embraced it and pushed me to pursue it further. Beyond encouraging me to explore the math program, Jim recognized my passion for everything audio electronics related, and he opened every door he could, including getting me jobs running live sound for campus events and running the conservatory’s recording studios. He even got me my first real paid gig as a recording engineer. Professor McElwaine’s willingness to embrace and encourage my odd trajectory through music school played a huge role in my ability to progress into a master’s program that ultimately allowed me to go from using pro-audio equipment to building it.

Martin (Marty) Lewinter


If my music professor was a physicist then surely, I needed a math professor who was also a musician. Lucky for me the head of the math program Martin Lewinter also happened to be a seasoned musician. Professor Lewinter taught that very first calculus class I took as an elective. After witnessing my interest in math, Marty encouraged me to take on a second degree. Before long I was pursuing two simultaneous bachelor’s degrees with a focus in music composition and math/computer science. Professor Lewinter gave hours of his time towards helping me as the math curriculum progressively got harder and he continued to push me to excel in both the math and the music program. When I started applying to graduate schools with a heavier engineering focus, I picked up some textbooks to independently review. After struggling over some of the math equations I asked Professor Lewinter for some help. I still remember our conversation, where I showed him an equation in a book and he had to explain to me that engineers used j for imaginary numbers, not i, so as not to be confused with the variable for current. It was a simple thing that just might have prevented my first year in graduate school from turning into a complete disaster!

Ken Pohlmann


In my junior and senior years of college, I started to dive deeper into the underlying math behind the audio tools I was using. I happened to be reading a book called Principles of Digital Audio and found a note about the author, who was a professor of “music engineering” at the University of Miami. Music engineering sounded like an awfully good way to combine four grueling years of math and music education, so I sent Professor Pohlmann an email asking if he’d consider accepting a student without an undergraduate electrical engineering degree. Ken was kind enough to respond. He recommended I take an extra year to get some basic engineering credits, and he pointed me towards some textbooks that might give me an early head start. Well, I did buy the books, but I otherwise ignored him and applied to the program anyway. I still remember being overjoyed at receiving an acceptance letter where Ken told me that he thought my math background would carry me through the curriculum. With Professor Pohlmann and the University of Miami Music Engineering program, I stumbled into a small world of like-minded folks who had a passion for math and music. Professor Pohlmann took a hodgepodge of academic pursuits I haphazardly pieced together and combined them into one coherent subject that would ultimately lead to my final career as an engineer, manager, and executive on countless audio projects.

Will Pirkle


How many teachers have fed you information that you can directly correlate to your current and future earnings? Not many, but that is exactly what Will Pirkle did for me and many others. Professor Pirkle was able to perfectly blend theory and practice and teach me how to effectively turn everything I had learned into real software that did amazing things with an audio signal. Will took all the ethereal subject matter I had learned over the years and made it into something I could feel and touch. It’s that skill set, along with my own willingness to pester anybody for something I want, that led me to my first full-time job with a music software company called Opcode (ironically a competitor of MOTU), bringing me full circle back to some of my earlier education. Will’s teaching has stood the test of time and I still find a use for some of what he taught. And whenever anybody asks for advice about the audio/music engineering space I regurgitate much of the knowledge Professor Pirkle imparted to me. Without a doubt, I can say that my employability and financial wellbeing are directly tied to everything I learned from Professor Pirkle.

To All the Teachers

While the eight teachers above had the most profound effect on my life there are many other teachers who contributed to my success and I’d like to offer my thanks to all of them. And to all the teachers out there who feel unappreciated, please remember that somewhere out there, in that sea of children, is a kid who just needs a little extra push to find out who they are and be the best version of themselves. Keep fighting for those kids because I am living proof of the impact you can have.

One Final Note of Gratitude

Since Mother’s Day is fast approaching, I’d be remiss if I didn’t thank the greatest teacher of them all, my Mother, Susan Flaks. My mother was there for every step of the journey described in this post. Whether that was teaching me my first notes on the piano, driving me to private music lessons, paying for that first computer, pushing me to get extra help when I needed it, paying for college, or just supporting me through my entire education, she was the root of all my academic and professional success. My mother was more than just an amazing parent, she was also a teacher for more decades than she would care for me to publicly comment on, and I know she had a positive influence on numerous students who like me, went on to be happy, healthy, and well-rounded adults, who have made a positive contribution to their communities and the world.

Voting is Just a Precision and Recall Optimization Problem

It’s hard to avoid the constant bickering about the results of our last election. Should mail-in voting be legal? Do we need stricter voter identification laws? Was there fraud in the last election? Did it impact the results? These are just a fraction of the questions circulating around elections and voter integrity these days. Sadly, these questions appear to be highly politicized and it’s unclear if anybody is really interested in asking what an optimal election system looks like.

In a truly fair and accurate representative democracy, a vote not counted is just as costly as one inaccurately counted. More precisely, a single mother with no childcare who doesn’t vote because of 4-hour lines is just as damaging to the system as a vote for a Republican candidate that is intentionally or accidentally recorded for the opposing Democratic candidate.

Therefore, we can conclude an optimal election system really involves optimizing on both axes. How do we make sure everyone who wants to vote gets to vote? And how do we ensure every vote is counted accurately? When viewed this way one can’t help but see the parallels to optimizing a machine learning classifier for precision (when we count votes for a given candidate how often did we get it right) and recall (of all possible votes for that candidate how many did we find).

Back the Truck Up! What is Precision and Recall Anyway

Precision and Recall are two metrics often used to measure the accuracy of a classifier. You might ask “why not just measure accuracy?” and that would be a valid question. Accuracy, defined as everything we classified correctly divided by everything we evaluated, suffers from what is commonly known as the imbalanced class problem.

Suppose we have a classifier (a.k.a. laws and regulations) that takes a known set of voters who intend to vote “democrat” or “not democrat” (actual / input) and outputs the recorded vote (predicted / output).

Let’s assume we evaluate 100 intended voters/votes, 97 of whom intend to not vote for the democratic candidate, and let’s build the dumbest classifier ever known. We are just going to count every vote as “not democrat”, regardless of whether the ballot was marked for the democratic candidate or not.

N (number of votes) = 100

                                    Output (Predicted) Value
                                    Democrat    Not a Democrat
Input (Actual) Value
  Democrat                          TP = 0      FN = 3            TOTAL DEMOCRATS = 3
  Not a Democrat                    FP = 0      TN = 97           TOTAL NOT DEMOCRATS = 97

To make our calculations a little easier we can take those numbers and drop them into a table that compares inputs to outputs, also known as a confusion matrix. To simplify some of our future calculations we can further define some of the cells in the table above:

  • True Positives (TP): Correctly captured an intended vote for the democrats as a vote for the democrats (0)
  • True Negatives (TN): Correctly captured a vote NOT intended for the democrats as a vote not for the democrats (97)
  • False Positives (FP): Incorrectly captured a vote NOT intended for the democrats as a vote for the democrats (0)
  • False Negatives (FN): Incorrectly captured an intended vote for the democrats as a vote not for the democrats (3)

Now we can slightly relabel our accuracy equation and calculate our accuracy with our naïve classifier and the associated values from the table above.
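As a minimal sketch, the accuracy calculation can be written in Python in terms of the four confusion matrix cells. The function name is my own label for illustration; the counts come straight from the table above.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all evaluated votes that were recorded correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total


# Counts for the naive classifier that records every vote as "not democrat".
naive_accuracy = accuracy(tp=0, tn=97, fp=0, fn=3)
print(f"{naive_accuracy:.0%}")  # prints 97%
```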

97% Accuracy! We just created the world’s stupidest classifier and achieved 97% accuracy! And therein lies the rub. The second I expose this classifier to the real world with a more balanced set of inputs across classes we will quickly see our accuracy plummet. Hence, we need a better set of metrics. Ladies and gentlemen, I am delighted to introduce …

  • Precision: Of the votes recorded (predicted) for the Democrats, how many were correct

  • Recall: Of all possible votes for the Democrats, how many did we find
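The two definitions can be sketched in Python as follows. One caveat: because the naive classifier never records a vote for the Democrats, its precision is strictly 0/0 and therefore undefined. Reporting that case as 0.0 is a convention I am assuming here for illustration, not something the definitions themselves dictate.

```python
def precision(tp: int, fp: int) -> float:
    """Of the votes recorded for the Democrats, the fraction that were correct."""
    predicted_positive = tp + fp
    # Assumed convention: report 0.0 when nothing was predicted positive (0/0).
    return tp / predicted_positive if predicted_positive else 0.0


def recall(tp: int, fn: int) -> float:
    """Of all intended votes for the Democrats, the fraction we captured."""
    actual_positive = tp + fn
    # Same assumed convention when there are no actual positives.
    return tp / actual_positive if actual_positive else 0.0


# Naive classifier from the confusion matrix: TP = 0, FP = 0, FN = 3.
print(precision(tp=0, fp=0))  # prints 0.0 (by the convention above)
print(recall(tp=0, fn=3))     # prints 0.0
```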

What becomes blatantly clear from evaluating these two metrics is that our classifier, which appeared to have great accuracy, is terrible. None of the intended votes for the democrats were correctly captured and of all possible intended votes for the democrats, we found none of them. It’s worth noting that the example I’ve presented here is for a binary classifier (democrat, not democrat) but these metrics can easily be adapted to multi-class systems that more accurately reflect our actual candidate choices in the United States.

There’s No Such Thing as 100% Precision and Recall

Gödel’s incompleteness theorem, which loosely states that every non-trivial formal system is either incomplete or inconsistent, likely applies to machine learning and artificial intelligence systems. In other words, since machine learning algorithms are built around our known formal mathematical systems there will be some truths they can never describe. A consequence of that belief, and something I preach to everyone I work with, is that there is really no such thing as 100% precision and recall. No matter how great your model is and what your test metrics tell you, there will always be edge cases.

So if 100% precision and recall is all but impossible, what do we do? When developing products around machine learning classifiers, we often ask ourselves what is most important to the customer: precision, recall, or both. For example, if I create a facial recognition system that notifies the police if you are a wanted criminal, we probably want to err on the side of precision because arresting innocent individuals would be intolerable. But in other cases, like flagging inappropriate images on a social network for human review, we might want to err on the side of recall, so we capture most if not all images and allow humans to further refine the set.

It turns out that very often precision and recall can be traded off. Most classifiers emit a confidence score of sorts (often a softmax output) and by just varying the threshold on that output we can trade off precision for recall and vice versa. Another way to think about this is, if I require my classifier to be very confident in its output before I accept the result, I can tip the results in favor of precision. If I loosen my confidence threshold, I can tip it back in favor of recall.
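To illustrate the threshold idea, here is a small Python sketch. The scores and labels are made up purely for demonstration, and the helper name is hypothetical; it just shows how moving the confidence threshold shifts the balance between the two metrics.

```python
# Hypothetical (confidence score, actual label) pairs; 1 = positive class.
SCORED = [(0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1),
          (0.60, 0), (0.40, 1), (0.30, 0), (0.10, 0)]


def precision_recall(scored, threshold):
    """Accept a prediction as positive only when its score >= threshold."""
    tp = sum(1 for s, y in scored if s >= threshold and y == 1)
    fp = sum(1 for s, y in scored if s >= threshold and y == 0)
    fn = sum(1 for s, y in scored if s < threshold and y == 1)
    p = tp / (tp + fp) if tp + fp else 0.0  # assumed convention for 0/0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r


strict = precision_recall(SCORED, threshold=0.85)  # high precision, low recall
loose = precision_recall(SCORED, threshold=0.35)   # lower precision, high recall
print(strict, loose)
```

With this toy data the strict threshold accepts only the two most confident predictions (both correct) but misses half of the positives, while the loose threshold finds every positive at the cost of two false positives.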

And how might this apply in voting? Well, if I structure my laws and regulations such that every voter must vote in person with 6 forms of ID and the vote is tallied in front of the voter by a 10-person bipartisan evaluation team who must all agree … we will likely have very high precision. After all, we’ve greatly increased the confidence in the vote outcome. But at what expense? We will also likely slow down the voting process and create massive lines which will significantly increase the number of people who might have intended to vote but don’t actually do so, hence decreasing recall.

Remind me again what the hell this has to do with Voting

The conservative-leaning Heritage Foundation makes the following statement on their website:

“It is incumbent upon state governments to safeguard the electoral process and ensure that every voter’s right to cast a ballot is protected.”

I strongly subscribe to that statement and I believe it is critical to the success of any representative democracy. But ensuring that every voter’s right to cast a ballot is protected, requires not only that we accurately record the captured votes, but also ensure that every voter who intends to vote is unhindered in doing so.

Maybe we need to move entirely to in-person voting while simultaneously allocating sufficient funds for more polling stations, government-mandated paid time off, and government-provided childcare. Or maybe we need all mail-in ballots but some new process or technology to ensure the accuracy of the votes. Ultimately, I don’t pretend to know the right answer, or if we even have a problem to begin with. What I do know is that if we wish to improve our election systems we must first start with data on where we stand today and then tweak our laws and regulations to simultaneously optimize for precision and recall.

So, the next time a politician proposes changes to our election system ask … no, demand that they provide data on the current system and how their proposed changes will impact precision and recall. Because only when we optimize for both these metrics can we stop worrying about making America great again and start working on making America even greater!

“If you start me up I’ll never stop …” Until We Successfully Exit

“Hey, our fledgling startup is on path to being the next *INSERT BIG TECH COMPANY NAME HERE* and we think you’re a great fit for our CTO role”. Find me a technical leader who hasn’t been enticed by those words and you’ll have found a liar. So, what happens when one succumbs to the temptation and joins an early-stage startup? Well, if you have been wondering where I’ve been for the past couple of years, I was fighting the good fight at a small, early-stage NLP/Machine Learning based risk intelligence startup. And while I’m not retired or sailing around the world in my new 500-foot yacht, we were able to successfully exit the company with a net positive outcome for all involved. My hope with this post is that I can share some of my acquired wisdom, and perhaps steer the next willing victim down a similar path of success.

If I could sum up my key learnings in a few bullet points, it would boil down to this:

  • If you don’t believe … don’t join
  • Be prepared to contribute in any way possible
  • Find the product and focus on building it
  • Pick the race you have enough fuel for and win it

What I’d like to do in the rest of this post is break down each one of these items a little further.

If you don’t believe … don’t join

Maybe this goes without saying, but if you don’t believe in the vision, the people, and the product you shouldn’t join the startup approaching you. The CTO title is alluring, and it is easy to fool yourself into taking a job for the wrong reasons. But the startup experience is an emotional slog of ups and downs and it will be nearly impossible to weather the ride if you don’t wake up every day with an unyielding conviction for what you’re doing. As I’ll explain later in this post, you don’t need to believe you’re working for the next Facebook, but you do need to believe you are building a compelling product that has real value for you, your coworkers, your investors, and your customers.

Be prepared to contribute in any way possible

For the first few months on the job I used to go into our tiny office and empty all the trash bins because, if I didn’t, that small office with 5 engineers started to smell! It didn’t take long for someone to call out that I was appropriately titled, CTO (a.k.a. Chief Trash Officer). You might be asking why anybody would take a CTO job to wind up being the corporate custodian, but that is what was needed on some days.

While I have steadfastly maintained my technical chops throughout my career, I hadn’t really written a lick of production code for nearly two decades prior to this job. But with limited resources, it became clear I also needed to contribute to the code base and so I dusted off those deeply buried skills and contributed where I could. When you join a startup with that CTO title, it is easy to convince yourself that you’ll build a huge team, be swimming in resources, and have an opportunity to direct the band versus playing in it. But you’ll quickly find that in the early stages of a startup, the success of the company will depend on your willingness to drop your ego and contribute wherever you can.

Find the product and focus on building it

Great salespeople can sell you the Brooklyn Bridge. And if you’re just lucky enough, you might have a George C. Parker in your ranks. But the problem with great salespeople is they will do almost anything to close the sale and that comes with a real risk that they’ll sell custom work. If that happens over an extended period of time, you will be unable to focus on the core product offering and you’ll quickly find you’re the CTO of a work-for-hire / consulting company.

Startups face real financial pressures that often drive counterproductive behaviors. That often means doing anything necessary to drive growth in revenue, customers, or usage. But as illustrated in the graph below, high product variance will often ultimately lead to stagnant growth.

That’s because with every new feature comes a perpetual support cost. And if you keep building one-off features, and can’t fundraise fast enough, that cost will eventually come at the expense of delivering your true market-wide value proposition. If you allow this to happen, you’ll wind up with a company that generates some amount of revenue or usage but has no real value.

Companies that find true product/market fit should see product variance gradually decrease over time and this should allow the company to grow. Your growth trajectory may be linear when you need it to be exponential, but no amount of per-customer feature work will fix that problem and you may need to consider pivoting. If pivoting isn’t an option, it may be time to look for an exit.

As the CTO, a critical part of your job is to help the company find its product/market fit and then relentlessly focus on it. You need to hold the line against distractions and ensure the vast majority of resources are spent on features that align with the core value proposition. If you’ve truly found a product offering that is valued by a given market segment, and you can keep your resources focused on building it, growth will follow.

Pick the race you have enough fuel for and win it

I am an avid runner, and one of the great lessons of long-distance running is that if you deplete your glycogen store, you’ll be unable to finish the race no matter how hard you trained. In other words, you can’t win the race if you have insufficient fuel. This is also very true of startups. If you’re SpaceX or Magic Leap, you’re running an ultra-marathon and you need a tremendous amount of capital in order to have sufficient time and resources to realize the value. But fundraising is hard, and even if you have an amazing product and top-notch talent, there can be significant barriers to acquiring sufficient capital.

The mistake some startups make is that they continue to run an ultra-marathon when they only have fuel for a 5k and that can lead to a premature or unnecessary failure. If funding becomes an issue, start looking for how your product might offer value to another firm. Start allocating resources towards making the product attractive for an acquisition. Aim to win a smaller race and seek more fuel on the next go around.

Final Thoughts

Taking on a CTO role at an early stage startup can be a great opportunity and lead to enormous success, but before you take the leap make sure you know what you’re getting into. Along the way don’t forget to stop and smell the roses. In the words of fellow Seattle native Macklemore, “Someday soon, your whole life’s gonna change. You’ll miss the magic of these good old days”.

Final Final Thoughts

No startup CTO is successful without support from an army of people. So I’d like to offer some gratitude to the following folks:

  • Greg Adams, Christ Hurst: Thanks for giving me an opportunity and treating me like a cofounder from day one.
  • Shane Walker, Cody Jones, Phil LiPari, Pavel Khlustikov, David Ulrich, Julie Bauer, Jason Scott, Carrie Birmingham, Rich Gridlestone, Bill Rick, Zach Pryde, Amy Well, David Debusk, Mikhail Zaydman, Jean-Roux, Bezuidenhout, Sergey Kurilkn (and others I may have forgotten): Thanks for being one of the greatest teams I’ve ever worked with.
  • Brandon Shelton, Linda Fingerle, Wayne Boulais, Armando Pauker, Matt Abrams, Matthew Mills: Thank you for being outstanding board members, mentors, and investors.
  • Ziad Ismail, Pete Christothoulou, Kirby Winfield: Thank you for the career advice during my first venture into the startup world.

*Note: You can read more about Stabilitas, OnSolve, and our acquisition at the links below:



Pair Programming or Bare(ly) Programming


“Sorry we don’t have enough resources, we only have four pairs” – As an engineering leader no other statement has made me cringe more. After all, four pairs is a healthy-sized team of eight developers.

Throughout my career I have run across CTOs, VPs, directors, development managers, teams, and individual developers who swear by pair programming with near religious devotion. Personally, I’ve maintained a healthy dose of skepticism when it comes to pairing as an overarching development philosophy.

As an engineering leader my job is to build products that delight customers in the most efficient way possible. Anecdotally, pairing consistently costs more and hence seems irresponsible to use exclusively as a development technique. But admittedly anecdotal evidence is insufficient, so I decided to dig through the research and see if I could find more empirical evidence to support my claim.


Pair programming is an agile software development methodology where two programmers work on the same task using one computer and keyboard. One programmer is called the driver and operates the keyboard and does the primary coding work. The other developer, often called the navigator, is responsible for observing the driver and providing guidance in order to speed up problem solving, improve design, and minimize defects.

The potential negative impact of pair programming is immediately clear to most people. By applying two resources to a task you are effectively doubling the cost. So unless there’s an equal or greater improvement in other project variables, pair programming would be nearly impossible to justify. Viewed through a project management lens, we have three variables: cost (including resources), time, and quality/scope. If we double our cost, we’d expect to see an equivalent decrease in time to deliver or an increase in quality or scope (or some combination of each).


In mathematical terms let’s assume the value of any given project X is equal to a weighted linear combination of cost, time and quality/scope. 


When pairing our cost is automatically going to double since we’ve applied two resources for a task that in theory can be completed by one.


In order for our project value to remain equal or be better we need our other variables to proportionally change in the right direction. For example, if our project now takes 50% less time we could argue we net out even. Or if our scope or quality doubles, we would similarly be in a good position.
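One rough way to make the break-even point concrete is to measure cost in developer-hours. The sketch below assumes cost is simply headcount times duration, which is my own simplification and not necessarily the weighted model described above; the function names and the 10-hour task are hypothetical.

```python
def developer_hours(developers: int, duration_hours: float) -> float:
    """Labor cost of a task, assumed here to be headcount times duration."""
    return developers * duration_hours


def break_even_duration(solo_duration_hours: float, developers: int = 2) -> float:
    """Duration at which a team costs the same developer-hours as one developer."""
    return solo_duration_hours / developers


solo_cost = developer_hours(1, 10.0)            # one developer, 10-hour task
pair_same_speed = developer_hours(2, 10.0)      # a pair that is no faster
pair_break_even = developer_hours(2, break_even_duration(10.0))

print(solo_cost, pair_same_speed, pair_break_even)  # prints 10.0 20.0 10.0
```

Under this simplification a pair only nets out even on labor cost when it finishes in half the time; any slower and the extra cost has to be justified by scope or quality.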


However, in my experience I’ve not seen pair programming live up to these expectations. Instead I’ve seen tasks or user stories take the same amount of time and produce similar results at nearly double the cost. But you shouldn’t take my word for it. Let’s review the literature and see what the experts have to say.


There are actually a fair number of research papers that attempt to prove or disprove the efficacy of pair programming.  That said, in my survey of the literature I found most of the research to be ill designed for comparison to real world corporate product development organizations.  Specific issues include:

  • Developer Skills:  Most of the studies rely on university students that shouldn’t be compared to seasoned professional developers.
  • Non Production Environments:  The majority of the software used for evaluation is very far removed from real product development environments.
  • Organization Realities: Finally there is little or no accounting for organizational churn that happens in a real for-profit company

In spite of these issues, it’s worth exploring these various research studies and the insights they provide on the impacts of pair programming.

Many of the research papers evaluate the impact of pair programming on effort, which in at least one paper is defined as two times the duration or time required to complete a given task [1]. Specifically, effort increases ranging from 15% all the way to 100% have been observed [2]. In one of the more well conducted studies an effort increase of 84% was seen [1]. Since a pair’s effort is just twice its duration, while a single developer’s effort equals their duration, we can actually do some math to figure out how much faster pairs complete a task versus a single developer.
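The arithmetic can be sketched in a few lines of Python. It assumes the definition cited above (a pair’s effort is twice its duration, and a solo developer’s effort equals their duration); the function name is my own.

```python
def pair_duration_ratio(effort_increase: float) -> float:
    """Pair duration as a fraction of solo duration.

    effort_pair = 2 * duration_pair
    effort_solo = duration_solo
    effort_pair = (1 + effort_increase) * effort_solo
    => duration_pair / duration_solo = (1 + effort_increase) / 2
    """
    return (1.0 + effort_increase) / 2.0


# Effort increases reported in the studies cited above: 15%, 84%, and 100%.
for increase in (0.15, 0.84, 1.00):
    ratio = pair_duration_ratio(increase)
    print(f"{increase:.0%} more effort -> pair duration is {ratio:.3f}x solo")
```

For the 84% figure the ratio works out to about 0.92, meaning the pair finishes only around 8% sooner than a single developer would.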


Or by using our earlier project management equation, with a little rounding we can assume our pairing time weight would be roughly 9/10 the weight required for a single developer.

This is nowhere near the factor of 1/2 or less we said we needed to make pair programming cost efficient. Well, if the research doesn’t support a sufficient decrease in time to completion, perhaps there’s research indicating that a given project’s scope or quality will increase enough to offset the difference.

Unfortunately, once again the results are at best inconclusive, and in many cases support an actual decrease in scope and minimal or near zero increase in quality. For example, in [2] a 29% decrease in productivity was reported for pair programming teams when measured as a function of completed use cases.

Regarding quality, even one of the more optimistic papers reported only a 10% – 20% increase in quality (measured as test cases passed) [3]. According to [2], there was only an 8% improvement in quality when measuring actual defects. While these improvements are non-trivial, when combined with the time and scope metrics they remain insufficient to offset the associated costs.

    Cherry Picking

    “But aren’t you just cherry picking the worst examples to justify your case?” you might ask.  Not really, because even in the most optimistic research studies initial results were usually much worse and only improved over time.  For example, in [3] the increase in effort dropped from 60% to 15% over time.  Most of the research attributes these reductions in effort to “pair jelling”.  In other words, as the pairs get to know each other they become more efficient.

    The problem with these studies is that they assume that once a pair jells the gain will hold.  However in any real for-profit organization there is potential for high variability in projects and staff which means pair jelling is unlikely to be a one off cost.  It is more likely a continuing cost to the business over time.

    Several studies also point out that the value of pair programming decreases with simpler tasks [4].  Therefore one must consider the ratio of simple to complex tasks in any given development cycle in order to understand the long term impacts of pair programming.  When I evaluated my own teams, I found multiple iterations where 75% of work items were smaller changes that could easily be tackled by a single developer in the same timeframe.

    Finally, one paper [5] attempted to justify pair programming by evaluating Net Present Value (NPV).   In this paper an argument is made that even if it costs more to pair program, faster time to market warrants the cost.  I take issue with this calculation since it does not factor in the opportunity cost of having those extra resources not work on a different higher priority project.  

    For example if we take the reported 84% increase in effort and assume we finish our project in 9/10 the time of a single developer, we must ask ourselves what happens when a key customer asks for a critical bug fix?   I can tell that customer to wait until I finish my current project or I can split my pair and work on both at the same time at the small cost of a 1/10 increase in duration.  By splitting my pair I’ve delighted my key customer as quickly as possible at a trivial cost. Clearly you need to factor in the opportunity cost of not delighting that customer when evaluating the value of pair programming.

    To Pair or Not to Pair

    So should you pair or not pair?  There are a lot of reasons a team might use pair programming.  In some cases the cost / benefit tradeoff may be worthwhile.  Pairing can be very effective at educating new team members, improving the skills of junior team members, cross training, and reducing the cost of complex tasks.  If you take anything away from this post let it be:

  • Challenge the Efficacy of Pair Programming: If your team or engineering manager wants to exclusively use pair programming, don’t blindly accept it.  Collect the data to validate whether it is really cost effective.
  • Pair when it makes Sense:  Use pairing selectively when it makes sense, including educating new team members, improving the skills of junior team members, cross training, and reducing the cost of complex tasks.
  • Factor in Opportunity Costs: Make sure you consider the opportunity costs of projects not being worked on when pairing.

    In short, don’t allow yourself to be swayed by a dogmatic insistence that pair programming is better.  As a leader your job is to challenge your team to delight customers in the most cost effective way possible.  Pairing should only be used if it definitively contributes to that cause.


    [1] Arisholm, Erik, et al. “Evaluating pair programming with respect to system complexity and programmer expertise.” IEEE Transactions on Software Engineering 33.2 (2007). – Summary available at https://pdfs.semanticscholar.org/9787/c9663cad3a1c21550f2e5e365e70fd01d3aa.pdf

    [2] Vanhanen, Jari, and Casper Lassenius. “Effects of pair programming at the development team level: an experiment.” Empirical Software Engineering, 2005. 2005 International Symposium on. IEEE, 2005. https://pdfs.semanticscholar.org/40dd/fa666bf367cfffaae421dbd3c6170a3e3dc3.pdf

    [3] Cockburn, Alistair, and Laurie Williams. “The costs and benefits of pair programming.” Extreme programming examined (2000): 223-247. http://www.cs.pomona.edu/~markk/cs121.f07/supp/williams_prpgm.pdf

    [4] Lui, Kim, and Keith Chan. “When does a pair outperform two individuals?.” Extreme programming and agile processes in software engineering (2003): 1011-1011. ftp://nozdr.ru/biblio/kolxo3/Cs/CsLn/E/Extreme%20Programming%20and%20Agile%20Processes%20in%20Software%20Engineering,%204%20conf.,%20XP%202003(LNCS2675,%20Springer,%202003)(ISBN%203540402152)(479s)_CsLn_.pdf#page=240

    [5] Padberg, Frank, and Matthias M. Muller. “Analyzing the cost and benefit of pair programming.” Software Metrics Symposium, 2003. Proceedings. Ninth International. IEEE, 2003. http://wwwipd.ira.uka.de/Tichy/uploads/publikationen/32/metrics03.pdf

    End-to-End Speech Recognition: Part 1 – Neural Networks for Executives (I Mean Dummies)

    When I originally contemplated the subject of my next blog post, I thought it might be interesting to provide a thorough explanation of the latest and greatest speech recognition algorithms, often referred to as End-to-End Speech Recognition, Deep Speech, or Connectionist Temporal Classification (CTC).   However, as I began to research the topic I quickly discovered that my basic knowledge of neural networks was woefully lacking.  Several weeks of reading and a few hundred lines of code later, I realized before I could teach a fellow plebe like myself about end-to-end speech recognition,  I probably needed to introduce the fundamentals first.

    With that in mind, what was intended to be a single entry will likely turn into multiple blog posts covering an overview of end-to-end speech recognition and some fundamentals of deep learning that make it possible.  In this first post I’d like to provide a brief introduction to end-to-end speech recognition and then give a more detailed tutorial about one of the basic components of deep learning, a multilayer perceptron, also known as a feed forward neural network.  I’ll then walk you through how I brought all this information together while building a very basic end-to-end speech recognition system.

    End-to-End Speech Recognition

    So what is end-to-end speech recognition anyway?  At its most basic level, an end-to-end speech recognition solution aims to train a machine to convert speech to text by directly piping raw audio input with associated labeled text through a deep learning algorithm.  The resulting model is then able to recognize speech with no further algorithmic components.


    And why is this any better than traditional speech recognition systems?  Traditional speech recognition systems use a much more complicated architecture that includes feature generation, acoustic modeling, language modeling, and a variety of other algorithmic techniques in order to be accurate and effective.  This in turn makes training, testing, and maintaining the code far more difficult than it would be with an end-to-end system.


    In other words an end-to-end solution greatly reduces the complexity in building a speech recognition system.   And if that alone doesn’t convince you of the value an end-to-end recognizer brings to the table, several research teams, most notably the folks at Baidu, have shown that they can achieve superior accuracy results over traditional speech recognition systems.

    To validate the possibilities of an end-to-end speech recognition system I decided to build my own.  However, I quickly found that building such a system required advanced knowledge of deep learning techniques.  This is because current end-to-end systems generally rely on more complex neural network algorithms like Recurrent Neural Networks (RNNs) and something called the Connectionist Temporal Classification (CTC) loss function, which are difficult to understand if you don’t have a solid understanding of basic neural networks.  So I opted to take a simpler approach and see if I could build a very simple end-to-end recognizer using a basic deep learning technique: a feed forward neural network, or multilayer perceptron.

    Neural Network Fundamentals

    Before I dive into the details, let me provide a quick tutorial on the feed forward neural network.  The underlying element of a neural network is called a perceptron or an artificial neuron.  Much like a biological neuron, a perceptron takes a series of inputs, performs a function on those inputs, and produces an output that can be passed to other neurons.


    The simplest function is just a sum of weighted inputs.  However this function is a linear relationship and the world is rarely linear so we apply something called an activation function to help impart nonlinearity.   There are actually numerous activation functions used in neural networks, some linear and some not, but the Sigmoid and TanH functions are two you will commonly see in the relevant literature.
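    To make the weighted sum and activation concrete, here is a minimal NumPy sketch of a single perceptron.  The input values, weights, and function names are made up for illustration and are not taken from my project code.

```python
import numpy as np


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def perceptron(inputs, weights, activation=sigmoid):
    """Weighted sum of the inputs followed by a nonlinear activation."""
    return activation(np.dot(inputs, weights))


# Illustrative values only: three inputs and three hand-picked weights.
x = np.array([1.0, 0.5, -1.0])
w = np.array([0.2, -0.4, 0.1])
print(perceptron(x, w))            # Sigmoid output, between 0 and 1
print(perceptron(x, w, np.tanh))   # TanH output, between -1 and 1
```

    Swapping the activation argument is all it takes to move between Sigmoid and TanH; the weighted sum itself stays the same.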


    Now that we know what a neuron is, a neural network is really just a collection of multiple interconnected neurons.   Neurons are grouped and connected in “layers”.   The simplest neural network is a single layer network that connects one or more inputs to one or more outputs.   There is no calculation on the input layer, only the output layer.


    Neural networks can grow in complexity by adding additional layers, which are commonly referred to as “hidden layers”.  In theory a network can contain an infinite number of layers with an infinite number of neurons, although this is neither practical nor necessary.
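    In matrix form each layer is just a weight matrix, so a network with one hidden layer can be sketched in a few lines.  The layer sizes and random weights below are hypothetical and chosen only to show how the shapes line up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 inputs, one hidden layer of 5 neurons, 3 outputs.
n_inputs, n_hidden, n_outputs = 4, 5, 3
hidden_weights = rng.normal(size=(n_inputs, n_hidden))
output_weights = rng.normal(size=(n_hidden, n_outputs))


def forward(batch):
    """Push a batch of inputs through the hidden layer and then the output layer."""
    hidden = np.tanh(batch @ hidden_weights)                    # hidden layer activations
    return 1.0 / (1.0 + np.exp(-(hidden @ output_weights)))    # sigmoid output layer


batch = rng.normal(size=(2, n_inputs))   # two example input rows
outputs = forward(batch)
print(outputs.shape)
```

    Adding another hidden layer would simply mean one more weight matrix and one more activation between the two shown here.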


    The only remaining question then is how do we know what weights will give us the outputs we are looking for?  A simple feed forward neural network uses a technique called forward and back propagation to train the network and find the optimal weights.  There are dozens of books and blog posts devoted to the subject of how the forward and back propagation algorithms work, but for the sake of this blog post I’ll provide an introductory explanation along with pointers to additional information.

    The main idea requires randomly initializing our weights and pushing the inputs “forward” through the network so we can make an output prediction.   We then use a cost or loss function to calculate how far our prediction was from the expected result.
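    As an illustration of what a cost function does, here is a small sketch using mean squared error.  That choice is an assumption for the example; it is one common option and not necessarily the exact cost used in my project.

```python
import numpy as np


def mean_squared_error(predicted, expected):
    """Average squared distance between predictions and targets."""
    predicted = np.asarray(predicted, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return np.mean((predicted - expected) ** 2)


expected = [0.0, 1.0, 0.0]
close_guess = mean_squared_error([0.1, 0.8, 0.3], expected)   # near the target: small cost
poor_guess = mean_squared_error([0.9, 0.1, 0.7], expected)    # far from the target: larger cost
print(close_guess, poor_guess)
```

    The further the prediction is from the expected result, the larger the cost, which gives training a single number to drive down.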

    Our ultimate goal is to reduce our error or cost to the lowest point possible (sometimes referred to as the global minimum).  To do this we use an algorithm called gradient descent.  Gradient descent works by finding the partial derivative of the cost function with respect to each weight and then adjusting each weight in the direction that lowers the cost.


    In other words we’re looking for the direction (+/-) and slope of our cost function to tell us how much to adjust our weights, and in which direction, in order to get to zero cost (or close to it).  If the gradient is 0 we have reached a minimum.  While I won’t go into the details, thanks to the chain rule in calculus we can start at the output layer, perform the gradient descent algorithm, and “back” propagate it to the next layer and all the way back to our inputs.  Along the way we are calculating how much we need to adjust our weights to get closer to that zero cost.
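    Gradient descent is easiest to see on a toy problem with a single weight.  The cost function, starting point, and learning rate in this sketch are invented for illustration; a real network does the same thing for every weight at once.

```python
def cost(w):
    """Toy cost function with its minimum at w = 3."""
    return (w - 3.0) ** 2


def gradient(w):
    """Derivative of the toy cost with respect to w."""
    return 2.0 * (w - 3.0)


w = 0.0               # arbitrary starting weight
learning_rate = 0.1   # illustrative step size
for step in range(100):
    w -= learning_rate * gradient(w)   # step against the slope

print(w, cost(w))
```

    The sign of the gradient picks the direction of each step and its size scales the step, so the updates shrink as the weight approaches the minimum.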


    When training a neural network we continue to forward and back propagate until we have minimized the error.  While I have grossly oversimplified the explanation for forward and back propagation, this is fundamentally how neural networks work.  I have provided links to more detailed descriptions at the end of this post.
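    Putting forward propagation, the chain rule, and gradient descent together, a complete training loop for a tiny network might look like the sketch below.  Everything here (the XOR data, layer sizes, seed, and learning rate) is a hypothetical toy setup, not my recognizer, and constant factors in the gradient are folded into the learning rate.

```python
import numpy as np

rng = np.random.default_rng(42)

# XOR inputs; the third column is a constant bias input.
inputs = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
targets = np.array([[0], [1], [1], [0]], dtype=float)

w_hidden = rng.normal(scale=0.5, size=(3, 4))   # input -> hidden weights
w_output = rng.normal(scale=0.5, size=(4, 1))   # hidden -> output weights
learning_rate = 0.1
losses = []

for iteration in range(5000):
    # Forward propagation: TanH hidden layer, Sigmoid output layer.
    hidden = np.tanh(inputs @ w_hidden)
    predicted = 1.0 / (1.0 + np.exp(-(hidden @ w_output)))
    losses.append(np.mean((predicted - targets) ** 2))

    # Back propagation: apply the chain rule from the output layer backwards.
    output_delta = (predicted - targets) * predicted * (1.0 - predicted)
    hidden_delta = (output_delta @ w_output.T) * (1.0 - hidden ** 2)

    # Gradient descent step on both weight matrices.
    w_output -= learning_rate * (hidden.T @ output_delta)
    w_hidden -= learning_rate * (inputs.T @ hidden_delta)

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

    The only thing worth checking here is that the recorded loss ends lower than it started; how close it gets to zero depends on the seed and the number of iterations.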

    Putting it All Together

    Now that we have some basic knowledge of end-to-end speech recognition systems and neural networks, we’re ready to make a simple end-to-end speech recognizer.  To build this recognizer I used Python and the NumPy library to help with the matrix math.

    However, before we start we need a simple speech data set, preferably one consisting of single-word utterances.  This eliminates the need to deal with time alignment (i.e. which text goes with which audio segment in time).  Luckily I found a great freely available dataset consisting of people speaking single digits 0 – 9, with fifty utterances per digit per person.  This data set met the criterion of single-word utterances while also being large enough to train a neural network.

    With labeled audio data in hand, the next step is reading in the audio data and the associated labels.  For this I used the Python librosa library.  Librosa provides easy to use out-of-the-box functions for computing the Short Time Fourier Transform (STFT), which is necessary to get the frequency spectrum of our audio signal (i.e. our input signal).  Librosa additionally provides handy functions for computing other audio features like Mel Frequency Cepstral Coefficients (MFCCs), which can also be a useful audio input feature (note that my code provides an alternative implementation that uses MFCCs instead of the raw spectrum).
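    For intuition about what a single STFT frame contains, here is a NumPy-only sketch that computes the magnitude spectrum of one windowed frame.  It is a simplified stand-in for illustration, not librosa’s implementation, and the synthetic 440 Hz tone and the function name are assumptions.

```python
import numpy as np


def magnitude_spectrum(signal, n_fft=2048):
    """Magnitude spectrum of the first n_fft samples, using a Hann window."""
    frame = np.zeros(n_fft)
    usable = min(len(signal), n_fft)
    frame[:usable] = signal[:usable]          # zero-pad short signals
    windowed = frame * np.hanning(n_fft)      # taper the edges to reduce leakage
    return np.abs(np.fft.rfft(windowed))      # n_fft // 2 + 1 frequency bins


# Synthetic test tone standing in for a recording: 440 Hz sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)

spectrum = magnitude_spectrum(tone)
peak_hz = np.argmax(spectrum) * sample_rate / 2048
print(len(spectrum), peak_hz)
```

    The strongest bin lands within one bin width (about 3.9 Hz here) of the tone’s frequency, which is the kind of frequency information the network receives as input.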

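    # Excerpt from the feature extraction loop. Assumes `import os, sys` and
    # `from librosa import load, stft, magphase, feature` earlier in the file.
    for files in file_list:
        relative_path = 'recordings/' + files[0]
        file_name = os.path.join(os.path.dirname(__file__), relative_path)
        y, sr = load(file_name, sr=None)  # sr=None keeps the file's native sample rate
        filesize = sys.getsizeof(y)
        if output_type == 'spectrum':
            # Frequency spectrum input: split the complex STFT into magnitude and phase
            spectrum = stft(y, n_fft=nfft, hop_length=int(filesize / 2))
            mag, phase = magphase(spectrum)
        # Alternative input feature: MFCCs, skipping the first coefficient
        mfcc = feature.mfcc(y=y, sr=sr, n_mfcc=nmfcc, hop_length=int(filesize / 2))
        mfcc = mfcc[1:nmfcc]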

    Beyond the audio, we also need to store the associated digit spoken in each audio file.  When training a multiclass classifier (in our case the classes are 0 – 9) it’s common to use something called “one hot” vectors to represent the output.  This is just a vector where every class is represented by 0 except for the one element representing the actual output class, which is set to 1.  So in our case we have a 10 element vector, and if the audio file is someone saying “one” the vector would look like [0 1 0 0 0 0 0 0 0 0].

    class digits:
        zero    = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        one     = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
        two     = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
        three   = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
        four    = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
        five    = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
        six     = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
        seven   = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
        eight   = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
        nine    = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
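    The same vectors can also be generated rather than typed out.  The helper names below are hypothetical and are not part of my project; this is just a compact way to encode a digit and decode a prediction with NumPy.

```python
import numpy as np


def one_hot(digit, num_classes=10):
    """Return a vector of zeros with a single 1 at the index of the class."""
    return np.eye(num_classes, dtype=int)[digit]


def decode(vector):
    """Recover the class index from a one-hot (or predicted) vector."""
    return int(np.argmax(vector))


print(one_hot(1))            # [0 1 0 0 0 0 0 0 0 0]
print(decode(one_hot(7)))    # 7
```

    The decode step is also how a trained network’s output is read: the index of the largest output value is the predicted digit.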

    With our inputs and outputs squared away it’s time to define our network. The variables that make up your network are also known as hyper-parameters. For my end-to-end recognizer I selected the following hyper-parameters: (*Note that selecting hyper-parameters is half art and half science and your choices will be critical to the success of your network.  I have provided additional resources below)

    • Number of layers: 3 (input, output and one hidden layer)
    • Nodes in hidden layer: 2048 (1x our frequency bins)
    • Activation functions: TanH (Hidden), Sigmoid (Output)
    • Weight initialization algorithm: Xavier (also known as Glorot)
    • Learning rate = 0.001
    • Rate decay = 0.0001

    input_layer = layers.Layer(inputs=training_inputs.shape[0], neurons=training_inputs.shape[1] + 1)
    if mode == 'E2E':
        hidden_layer = layers.Layer(inputs=training_inputs.shape[1] + 1, neurons=2048,
                                    )  # remaining arguments are cut off in this excerpt
        output_layer = layers.Layer(inputs=2048, neurons=training_outputs.shape[1],
                                    )  # remaining arguments are cut off in this excerpt
        output_layer.Initialize_Synaptic_Weights()

    if mode == 'E2E':
        nnet = NeuralNetwork(layer1=input_layer, layer2=hidden_layer, layer3=output_layer, learning_rate=0.001,
                             learning_rate_decay=0.0001, momentum=0.5)
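    For readers curious about the Xavier / Glorot initialization mentioned in the hyper-parameters, the uniform variant from the Glorot and Bengio paper can be sketched as follows.  The function name is my own for this example, and the project’s Initialize_Synaptic_Weights method may differ in its details.

```python
import numpy as np


def glorot_uniform(fan_in, fan_out, rng=None):
    """Draw weights uniformly from [-limit, limit], limit = sqrt(6 / (fan_in + fan_out))."""
    if rng is None:
        rng = np.random.default_rng()
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))


# Hypothetical layer sizes: 2048 hidden neurons feeding 10 output classes.
weights = glorot_uniform(2048, 10, rng=np.random.default_rng(0))
print(weights.shape, float(np.abs(weights).max()))
```

    Scaling the range by the layer sizes keeps the weighted sums from starting out so large that the TanH and Sigmoid activations saturate.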

    So now that we have our inputs and outputs, and we’ve defined our network, all we need to do is train using our forward and back propagation functions. Per my earlier description the forward propagation algorithm is quite simple and is really just summing the weighted inputs and applying the activation functions. Using matrix math this can be written in three or four simple lines of code.

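    def Feed_Forward(self, inputs):
        # `dot` is assumed to be numpy.dot, imported elsewhere in the module
        # Copy the inputs into every column of the input layer except the last one
        self.l1_inputs[:,0:self.layer1.neurons-1] = inputs
        # Hidden layer: weighted sum of the inputs, then the activation function
        self.l2_hidden = self.layer2.activation(dot(self.l1_inputs, self.layer2.synaptic_weights))
        # Output layer: weighted sum of the hidden activations, then the activation function
        self.l3_output = self.layer3.activation(dot(self.l2_hidden, self.layer3.synaptic_weights))
        return self.l3_output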

    The forward propagation algorithm gives us our predicted output.  Using that predicted output we can perform our back propagation.  Much like my earlier explanation we need to perform a series of steps for each layer.   Specifically we need to calculate the error, calculate the gradient, and adjust our weights based on the previous two calculations.

    def Back_Propogate(self, outputs):
        # Output layer: error between the prediction and the expected output
        output_deltas = numpy.zeros((self.layer1.inputs, self.layer3.neurons))
        l3_output_error = -(outputs - self.l3_output)
        # Output layer: the delta depends on which output activation function is in use
        if self.layer3.activation_derivative == activationfunctions.Sigmoid_Activation_Derivative:
            output_deltas = self.layer3.activation_derivative(self.l3_output) * l3_output_error
        elif self.layer3.activation_derivative == activationfunctions.softmax_derivative:
            output_deltas = l3_output_error
        elif self.layer3.activation_derivative == activationfunctions.Oland_Et_Al_Derivative:
            output_deltas = self.layer3.activation_derivative(self.l3_output) - outputs
        # Hidden layer: propagate the output deltas back through the output weights
        hidden_deltas = numpy.zeros((self.layer1.inputs, self.layer2.neurons))
        l2_hidden_error = output_deltas.dot(self.layer3.synaptic_weights.T)
        hidden_deltas = self.layer2.activation_derivative(self.l2_hidden) * l2_hidden_error
        # Adjust the output layer weights (the momentum term is commented out)
        adjustment1 = self.l2_hidden.T.dot(output_deltas)
        self.layer3.synaptic_weights = self.layer3.synaptic_weights - (adjustment1 * self.learning_rate) #+ self.l3_output_adjustment * self.momentum
        self.l3_output_adjustment = adjustment1
        # Adjust the hidden layer weights (the momentum term is commented out)
        adjustment2 = self.l1_inputs.T.dot(hidden_deltas)
        self.layer2.synaptic_weights = self.layer2.synaptic_weights - (adjustment2 * self.learning_rate) #+ self.l2_hidden_adjustment * self.momentum
        self.l2_hidden_adjustment = adjustment2
        # Assumption: this excerpt has no return statement, but Train() averages the
        # value returned here, so the output error is returned.
        return l3_output_error

    To bring it all together we just need to iterate over our forward and back propagation algorithms until we have stopped learning or have reduced our cost or error to its lowest possible point.

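    def Train(self, inputs, outputs, iterations):
        for iteration in range(iterations):
            error = 0.0
            # random.shuffle(patterns)
            # turn off random
            randomize = numpy.arange(len(inputs))  # identity order, so no shuffling takes place
            inputs = inputs[randomize]
            outputs = outputs[randomize]
            # Assumption: the forward pass is missing from this excerpt; Back_Propogate
            # relies on the activations that Feed_Forward stores on self.
            self.Feed_Forward(inputs)
            error = self.Back_Propogate(outputs)
            error = numpy.average(error)
            if iteration % 10 == 0:
                print('error %-.5f' % error)
            # learning rate decay: algebraically the same as dividing by (1 + learning_rate_decay)
            self.learning_rate = self.learning_rate * (
                self.learning_rate / (self.learning_rate + (self.learning_rate * self.learning_rate_decay)))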
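    The learning rate decay at the end of Train() is easier to reason about once simplified.  This short sketch (the helper function name is mine) checks that the update rule is equivalent to dividing the rate by (1 + decay) on every iteration, using the learning rate and rate decay values listed earlier.

```python
def decayed_learning_rate(initial_rate, decay, iterations):
    """Learning rate after repeatedly dividing by (1 + decay)."""
    return initial_rate / (1.0 + decay) ** iterations


rate = 0.001
for _ in range(1000):
    rate = rate * (rate / (rate + rate * 0.0001))   # update rule from Train()

closed_form = decayed_learning_rate(0.001, 0.0001, 1000)
print(rate, closed_form)
```

    With a decay of 0.0001 the rate shrinks slowly, losing roughly 10% of its starting value over 1,000 iterations.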

    That’s it!  While there is a lot more glue code and learning that went into this implementation what I have presented here represents the fundamental building blocks of a basic end-to-end speech recognition system.  I have made the full project available on GitHub and you can evaluate the code yourself in order to fully comprehend all the details.  I’ve also provided a bevy of resources below that helped get me to this point and can do the same for you.

    Final Thoughts

    You might be asking why a senior leader in my position would spend the time required to go through this exercise.  There are some general principles I like to follow and I think anybody managing a research oriented (or really any engineering) team should consider as well.  Specifically:

    • ABL – Always Be Learning:  If you want to innovate you need to be up to speed on the latest technology trends.
    • Earn your team’s respect:  The best way to earn the respect of your technical team is to get into the trenches.  Show them that you understand their job and all the pain that comes with it.  In other words write code (any code), test it, check it in, and push it to production.
    • Lead by example: If you want your team to “innovate for the masses”, it’s best to demonstrate the behaviors you are looking for.

    Hopefully this post has given you a basic understanding of end-to-end speech recognition systems and neural networks.  If you’re really brave perhaps you’ve learned how to build your own simple end-to-end recognizer.  But if you take nothing else away from this article I hope it’s that you’ll invest your time improving your own technical skills and getting in the trenches to earn your team’s respect.

    In an upcoming post I’ll dig deeper into end-to-end speech recognition algorithms and how they work.  Specifically we’ll cover recurrent neural networks and the connectionist temporal classification algorithms that truly allow these systems to be superior to traditional speech recognition systems.  In the meantime I hope you get a chance to “wreck a nice beach”!

    1. “How to build a simple neural network in 9 lines of Python code” – Milo Spencer-Harper
    2. “How to build a multi-layered neural network in Python” – Milo Spencer-Harper
    3. “Understanding and coding Neural Networks from Scratch in Python and R” – Sunil Ray
    4. “How to Compute the Derivative of  Sigmoid Function (fully worked example)” – Jeremy (no last name)
    5. “Practical recommendation for Gradient-Based Training of Deep Architectures” – Yoshua Bengio
    6. “How to train your Deep Neural Network” – Rishabh Shukla
    7. “Understanding the difficulty of training deep feedforward neural networks” – Xavier Glorot and Yoshua Bengio
    8. “Deep Learning Basics: Neural Networks, Backpropagation, and Stochastic Gradient Descent” – Alex Minnaar
    9. “Speech Recognition: You down with CTC” – Karl N.
    10. “Deep Speech: Scaling up end-to-end speech recognition” – Awni Hannun et al.
    11. “Connectionist Temporal Classification: Labeling Unsegmented Sequence Data with Recurrent Neural Networks” – Alex Graves et al.

    Creating a Management Philosophy

    Just in case you couldn’t infer this from my previous posts, some folks consider me opinionated and occasionally dogmatic.  What else would you expect from a born and raised New Yorker who grew up in a household where arguing your point was considered a cultural birthright!

    Unfortunately, while having strong opinions and ideas can be a positive, I’ve found throughout my career that those ideas don’t always resonate with coworkers.  Even when those ideas are sound.  When I started to lead and manage larger teams this increasingly became a handicap and I soon realized I needed a better way to get my thoughts across.

    Enter Pete Carroll.  That’s right, the same Pete Carroll who led the Seattle Seahawks to two Super Bowls and USC to two college national championships.  Not too long ago a good friend and neighbor happened to tell me about a management training class he attended based on material from Pete Carroll and his “Win Forever” philosophy.  It sounded very compelling and I immediately purchased Pete’s book based on the same concepts.

    It was immediately clear from reading this book that Pete Carroll had faced similar challenges earlier in his career.

    “But while I had a sense inside me of what we needed, I hadn’t articulated it very well.  I didn’t have the details worked out in my own mind so that I could lay them out clearly and convincingly to anybody else”

    In short, this book preaches a simple strategy for dealing with an inability to convey your ideas: write them down, iterate on them, and formulate them into a single cohesive vision.  By doing so you change the conversation from “hey, here is my opinion” to “hey, I have a strategy for winning and here it is”.  Or in Pete’s words:

    “by December I finally had a clear, organized template of my core values, my philosophy, and – most importantly – my overarching vision for what I wanted to stand for as a person, a coach, and a competitor”

    Armed with a singular vision and philosophy you have a solid foundation to convey your thoughts.  And suddenly you have transformed disparate ideas into a recipe for success.  The implications of documenting your philosophy are huge and by doing so you will:

    • Set clear expectations for your employees
    • Set expectations for executives, higher level managers and peers for how you operate and how it will benefit them
    • Have a recipe for success that you can continually improve and iterate on

    With this information in mind I decided to take the years of ideas I had accumulated and jot them down.  I refined them and wove them into an overarching vision.  And when I thought about what I was ultimately trying to achieve it became clear that I was always trying to deliver truly innovative software to as many people as possible.  And so my Innovate for the Masses™ philosophy was born.  I present it to you below unedited.  It is a continual work in progress but something that has served me well so far.

    Philosophy: Innovate for the Masses™

    Innovate for the Masses™
      • Create unique and defensible value: Products should deliver something truly special that cannot be found in other solutions and simultaneously provide a defensible moat.
      • Build best in class solutions: Products should be fully functional and should not cut corners. We should do everything required to delight customers, nothing more and nothing less.
      • Support all customers: Products should be accessible by all existing and future customers. One off solutions are never okay.
      • Enterprise class reliability and scalability: Product should be robust with 4 9’s reliability and the ability to scale to all customer demands.
    How we do it
    • Ruthless prioritization: We question the necessity and value of every feature or piece of code. We only work on things that deliver essential value to the customer.
    • Avoid premature optimization: We only build exactly what the customer needs. No more and no less.
    • Read between the lines: We listen to our customers but don’t just cater to their demands. We find the commonality amongst all our customers’ requirements and build a truly unique and defensible product that surprises, delights, and addresses their needs.
    • Communicate like crazy: We are one team with one vision and one goal. Everybody must constantly be talking to innovate and build cohesive products.
    • Right spot right time: We believe every team member plays an important role in the team’s success whether that is in a leading role or a supporting role.
    • Work harder than anyone else: We will win by out working all of our competitors.
    • Don’t chase the competition: We don’t chase every move our competitors make. We pay attention but follow our vision and goals and methodically work towards delivering on them without being distracted.
    Expectations for our people
    • Be insanely passionate: Our employees exude passion. We are a passion first organization.
    • Get a lot done / execute like crazy: Our employees are insanely productive. They get more done than anyone else.
    • Care: Our employees give a shit. They care about the product, team, company, and customer like something they hold dear.
    • Have a sense of humor: Our employees laugh. At themselves and each other. We believe you should leave work every day having smiled so much it hurts.
    • Don’t whine or complain: Our employees don’t whine or complain; they express their opinions and try to instigate change in the direction they want to see. If a decision doesn’t go their way they disagree and commit.
    • Don’t play politics: Our employees don’t play politics. They lay it all out on the table and do their job to the best of their abilities … that is what they get rewarded for.
    • Dare to disagree: Our employees disagree loudly and proudly. Good disagreement is central to progress. Different opinions are valued and we seek out constructive conflict.

    Having a vision and philosophy is not all rainbows and unicorns.  Creating a philosophy and broadcasting it to your coworkers is the equivalent of driving a giant metal stake deep into the ground.  You may find throughout the course of your career that sometimes people don’t agree with your strategy, and when they don’t you only have three options: change your strategy, change their minds, or move on.  Or again to quote Pete:

    “Coach Seifert was specifically adamant that I not change who I was or my mentality. He said clearly “Pete, you’ve got to do it the way you know how.” After my experience in New York, I wondered if I shouldn’t try to be more political, but the advice I got from the two mentors was uncompromising – and some of the best I ever received.”


    In closing, if you are anything like me or Pete Carroll I strongly encourage you to write down your great ideas and formulate them into a cohesive philosophy.  It will be well worth your while.

    **For reference, Pete Carroll explicitly calls out John Wooden for influencing his strategies and techniques.  I highly encourage people to also read John Wooden’s book “Wooden on Leadership”.

    Managing Research Projects in an Agile Development Environment

    Anyone who has worked in an agile organization has found that certain projects don’t quite fit the agile mold.  Nowhere is this more apparent than with research oriented projects.  After all, if there is complete uncertainty in the scope and outcome of a project, as would be the case in a research project, how do you create user stories and estimate story points?  And if you can’t create stories and estimate the associated costs, how can you hold your team accountable, communicate status to the rest of the organization, and make cost / benefit tradeoffs?  Simple!  You can’t.

    I’ve personally dealt with this issue after hiring several researchers to work on an agile software product team.  Initially, I struggled to interleave our research projects with our other production work so I started looking for a solution.  The answer to my problem came after reviewing the agile literature and the scientific method and concluding that research projects really just represent an extreme of what the agile process is ultimately trying to solve. Below I will walk you through how I arrived at this solution and details on how you can apply similar tactics in your own research organization.


    Early in my career at Microsoft someone handed me a copy of Steve McConnell’s book Code Complete.

    At the time my greatest takeaway from that book was the concept of the “Cone of Uncertainty”.  The “Cone of Uncertainty” states that the uncertainty of a given project decreases as time progresses and more details are fleshed out.


    Historically the “Cone of Uncertainty” was dealt with by creating detailed upfront plans and using waterfall project management approaches.  The trouble with those methodologies is that they’re extremely resistant to scope change.   Largely because scope change reintroduces uncertainty.

    The agile manifesto attempts to eliminate the “Cone of Uncertainty” problem by following the principle of “Responding to change over following a plan”. Most agile methodologies use some form of iterative development to reduce uncertainty, the idea being that if you’re working on smaller, well-defined chunks of a larger project, uncertainty is removed and the project can slowly adapt to changing requirements. Mike Cohn wrote in an article titled “The Certainty of Uncertainty”:

    “The best way to deal with uncertainty is to iterate. To reduce uncertainty about what the product should be, work in short iterations and show (or, ideally give) working software to users every few weeks. Uncertainty about how to develop the product is similarly reduced by iterating. For example, missing tasks can be added to plans, inadequate designs can be corrected sooner rather than later, bad estimates can be amended, and so on.”

    If I take the above information together I can conclude two things.  First, the agile method attempts to reduce or eliminate uncertainty by making every project a function of smaller work items iterated over time.  Or framed in mathematical notation:


    Where: T = max iterations, M = backlog, N = user stories belonging to M

    Secondly, if a research project is really just a project with maximum uncertainty then the same framework should apply.   Only there would be an unbounded number of work items over an unbounded amount of time.   Or framed in mathematical notation:


    According to this logic a research project should actually work within an agile framework.   We just need to figure out how to construct M (i.e. backlog) and how to bound M and T (i.e. number of iterations).


    So what are reasonable user stories for a research project and why are they potentially infinite?  It occurred to me that research in general follows the scientific method and that the scientific method may be a good framework for story generation.


    In essence the scientific method can be boiled down to three phases: a research phase, an iterative hypothesis testing phase, and a communicate or productize phase.  The unbounded component of research is that many hypotheses end in failure leading to another hypothesis that must be tested and this can potentially go on ad nauseam.  This provided me a compelling framework for how to break research into user stories.


    The first story in any research project correlates to the first phase in the scientific method. This story should be a time-bounded spike that frames the initial question, covers any background research, and has as its acceptance criterion the generation of the stories required for the next phase of the project, hypothesis testing.

    The next set of stories are all part of the hypothesis testing phase.  These stories include any development work required to test the hypothesis, any data collection required, running the tests, and analyzing the results.   If the hypothesis proves false the team should circle back to the background research phase and continue on with the process.

    The final phase in this framework is only relevant when a hypothesis is proven to be true. This phase contains multiple stories including any communication or publishing of results, IP protection, and a handoff to whoever might be building the final product (which might be the same team). The final handoff story should also be a spike, and the acceptance criteria should include the user stories required for the production deployment.


    Now how do you go about making sure research stories don’t go on forever? How do you bound T and M? And how do you communicate the cost / value trade-offs with management?

    I have found that the previously described framework only works if you apply the following guidelines in conjunction. Specifically:

    1. For any research project to be considered we must have enough information for the project to pass the “sniff test” (i.e. Is it possible in a reasonable amount of time and does it make business sense).
    2. The initial estimate for a research project is based on the expected number of hypothesis iterations, and the cost must be in line with the expected project value (i.e. if the research is perceived to have large value it may be worth iterating for a long time).
    3. If the number of hypothesis iterations exceeds the original estimate, the cost/benefit analysis must be revisited, and the project should be canceled if the cost has exceeded the expected value.
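
    The guidelines above can be sketched as a simple check run at the end of each hypothesis iteration. This is only an illustration: the class name, field names, and dollar figures are hypothetical and not part of any particular agile tool.

```python
from dataclasses import dataclass


@dataclass
class ResearchProject:
    """Hypothetical model of a research project's budget (illustrative only)."""
    expected_value: float        # estimated business value if the research succeeds
    cost_per_iteration: float    # cost of one hypothesis-testing iteration
    estimated_iterations: int    # iterations assumed in the initial estimate
    completed_iterations: int = 0

    @property
    def cost_so_far(self) -> float:
        return self.completed_iterations * self.cost_per_iteration

    def review(self) -> str:
        """Apply guideline 3: revisit once the estimate is exceeded, cancel if cost > value."""
        if self.completed_iterations <= self.estimated_iterations:
            return "continue"
        if self.cost_so_far > self.expected_value:
            return "cancel"
        return "revisit cost/benefit"


project = ResearchProject(expected_value=100_000, cost_per_iteration=20_000,
                          estimated_iterations=3)
for done in (2, 4, 6):
    project.completed_iterations = done
    print(done, project.review())
```

    In this sketch the project continues while it is within its estimated iterations, triggers a cost/benefit review once it runs over, and is canceled only when the accumulated cost passes the expected value.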

    What I have presented here is a process by which you can take an unbounded research project and place a structure around it that will work in companies using an agile development methodology. Besides allowing research projects to function in an agile organization, this framework also provides a method for bounding research problems and communicating the cost / benefit trade-offs to management and other relevant parties. For those who have faced similar issues integrating research-oriented projects into an agile culture, I hope this methodology provides some ideas on how you can better integrate research into your processes.

    Microsoft’s 5.1% Word Error Rate (WER) Announcement is Complete and Utter Bullshit


    I apologize! That title was actually generated by Microsoft’s speech recognition system incorrectly transcribing “Microsoft’s 5.1% Word Error Rate (WER) Announcement is Completely Misleading”.   Okay, that was snarky, but I promise Microsoft compelled me to write that.  You see in the course of editing my previous post Microsoft had to go and put out a press release announcing “Microsoft Researchers Achieve new Conversational Speech Recognition Milestone”.  Their announcement flies in the face of my previous post and therefore I had no choice but to attempt an epic takedown.

    Before I try to dismantle Microsoft’s irrational claim I would like to state that the researchers at Microsoft (some of whom I have crossed paths with while working on the Xbox Kinect and HoloLens) have done some solid research with potential implications for how we build production speech recognition systems. I have no issues with the technical nature of the research paper underpinning the press release, but I do take issue with the marketing and PR spin applied on top of it. So without further ado “LET’S GET READY TO RUMBLE”.

    There are two primary issues with the announcement made by Microsoft:

    1. Does Microsoft’s testing provide conclusive evidence that the 5.1% WER results will generalize?
    2. Are the tactics used viable from a cost/compute/timeliness perspective in a production system?

    Let’s tackle each of these issues independently.

    Will the Results Generalize

    In my previous post I discussed why large data sets were critical for training truly accurate conversational speech recognition systems.   While I do take issue with the data size used to train the Microsoft speech recognition system, the larger issue is with the test set used to validate the word error rate.

    In Andrew Ng’s seminal talk on the “nuts and bolts of machine learning”, he goes into great detail on the different data sets required for training, testing and validating machine learning algorithms. I encourage anybody interested in the optimal process for training and testing machine learning / AI-like algorithms to watch this seriously awesome video. In terms of Microsoft’s research I want to focus on the relatively small size of their test corpus, its overlap with the training data, and the fact that the chosen corpus appears cherry-picked.

    Corpus Size

    The test set Microsoft selected for calculating the reported 5.1% WER is the 2000 NIST CTS SWITCHBOARD corpus. While I was unable to find the specific number of hours of conversation in this test corpus, I was able to confirm that the 1998 and 2001 NIST CTS data sets contained 3 and 5 hours of conversation respectively. We can therefore assume the 2000 set is similar in duration. When considering the overall size of the conversational speech domain explained in my previous post, a test set of this size is hardly sufficient for making any broad claims about meeting or beating human transcription accuracy.

    Training Data Overlap

    As you dig into the details of the NIST corpus a dirty little secret is quickly revealed.  Let me start by quoting directly from the source:

    “Of the forty speakers in these conversations thirty-six appear in conversations of the published Switchboard Corpus.”

    Let me translate that for you. Thirty-six of the speakers in the test corpus are the same speakers used in Microsoft’s training corpus. I’ll also remind you that the Switchboard corpus only has 543 speakers to begin with. This raises a foundational question about whether the test data is really distinct relative to the training set. You see, almost all modern speech recognition systems use something called i-vectors to help achieve speaker independence (sometimes called speaker adaptation). Since the same speakers, on the same devices, in the same environments exist in both the training and test corpora, there will invariably be a correlation between the i-vectors generated by the two data sets.
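
    A speaker overlap like this is easy to check for before reporting a number. The sketch below is a minimal illustration using made-up speaker IDs; the function name and ID format are hypothetical, and only the counts (543 training speakers, 36 of 40 test speakers shared) come from the figures quoted above.

```python
def speaker_overlap(train_speakers, test_speakers):
    """Return the shared speaker IDs and the fraction of test speakers also seen in training."""
    train = set(train_speakers)
    test = set(test_speakers)
    shared = train & test
    fraction = len(shared) / len(test) if test else 0.0
    return shared, fraction


# Made-up IDs that mirror the counts quoted above.
train_ids = [f"spk{i:04d}" for i in range(543)]
test_ids = [f"spk{i:04d}" for i in range(36)] + [f"new{i:02d}" for i in range(4)]

shared, fraction = speaker_overlap(train_ids, test_ids)
print(len(shared), fraction)
```

    A truly held-out test set would report zero shared speakers here.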

    Per the diagram below, a truly honest measure of WER would require that the test data be truly distinct from the training set. In other words, it should pull from a data set that includes different speakers, different content, and different acoustic environments. What is clear from the Microsoft paper is that this didn’t happen, which calls into question whether the published results will truly generalize. It also greatly diminishes the validity of any claim about a new “milestone” being achieved in conversational speech recognition.


    It’s worth noting that the full 2000 NIST CTS corpus actually contains a total of 40 conversations. Twenty of those conversations are from the Switchboard corpus and twenty are from a different corpus called “Call Home”. This raises the question of why Microsoft only validated against the Switchboard portion of the corpus. While I can’t say for sure what their intent was, my best guess is that if they had used the Call Home data the results would not have led to the desired goal of meeting or beating “human accuracy”.

    Taken all together, given the small corpus, the overlapping data, and a cherry-picked data set, you can’t help but ask: did Microsoft really achieve a “new conversational speech recognition milestone”?

    Is it Production Ready

    EBTKS. For those not familiar with texting slang, that stands for “Everything But the Kitchen Sink”, and it’s really the best description of the system Microsoft used for this research. This calls into question the production viability of their proposed solution.

    Ensemble Models

    At the acoustic model (AM) and language model (LM) layer Microsoft is using an ensemble model technique. This technique requires training multiple models and processing each utterance through every model. A separate algorithm is used to combine the outputs of the different models. In essence this equates to trying to run multiple recognizers at once for every audio utterance. It currently requires an enormous number of machines to transcribe phone calls in real time at scale. Microsoft appears to be running 4 distinct AMs and multiple LMs, which will have serious performance impacts. This raises questions about the number of machines and associated costs required to run a system like the one used in Microsoft’s paper.


    On top of the ensemble modeling Microsoft is also using language model rescoring. In order to rescore you usually have an initial language model produce an N-best lattice, which is basically the top N paths predicted by the language model. This lattice needs to be stored or held in memory in order for the rescoring to take place. In Microsoft’s case they are generating a 500-best lattice. While not crazy, holding a 500-best lattice in memory in a scaled production speech recognition system would not be ideal unless it provided significant accuracy gains. According to the paper the gains from rescoring were minimal at best.
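
    To make the rescoring step concrete, here is a minimal sketch of N-best rescoring over a flat list of hypotheses. It is not Microsoft’s implementation: the linear interpolation, the `lm_weight` parameter, the toy scoring function, and the scores themselves are all assumptions chosen for illustration.

```python
def rescore_nbest(nbest, rescoring_lm, lm_weight=0.5):
    """Re-rank first-pass hypotheses with a second language model.

    nbest: list of (hypothesis, first_pass_log_score) pairs, all held in memory.
    rescoring_lm: callable returning a log score for a hypothesis string.
    """
    rescored = [
        (hyp, (1 - lm_weight) * score + lm_weight * rescoring_lm(hyp))
        for hyp, score in nbest
    ]
    # The best hypothesis is the one with the highest combined log score.
    return max(rescored, key=lambda pair: pair[1])


def toy_lm(hypothesis):
    """Stand-in for a stronger language model (made-up log scores)."""
    return -1.0 if hypothesis == "recognize speech" else -6.0


# A 2-best list where the first pass slightly prefers the wrong hypothesis.
nbest = [("wreck a nice beach", -3.5), ("recognize speech", -4.0)]
best_hypothesis, best_score = rescore_nbest(nbest, toy_lm)
print(best_hypothesis, best_score)
```

    Note that the whole N-best list has to exist before rescoring can start, which is where the memory cost of a 500-best list comes from.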

    In Conclusion

    So where does that leave us?  Microsoft has done some great research on advancing speech recognition algorithms.   Research that I greatly appreciate and hope to review further.   However for Microsoft to even imply that they achieved some epic milestone in matching human transcription accuracy is downright preposterous.

    In the words of renowned Johns Hopkins speech recognition researcher Daniel Povey:

    “… … this whole competition between IBM and Microsoft on Switchboard is just a pissing contest, in which they both try to add in more data and bigger system combinations to beat the other one’s number.  It doesn’t really indicate any special progress.”

    “Blinded by the Light, Revved up like a ???”

    I probably sang that Manfred Mann song a thousand times in my teen years and I was pretty sure the last word in that lyric was a feminine hygiene product until Google came along and taught me otherwise.   It turns out the lyrics to Blinded by the Light are very difficult to understand and so is conversational speech.

    For my first substantive blog post on this site I’d like to continue on a theme we have been covering over at Marchex around the complexity in building automatic speech recognition (ASR) systems that can accurately understand unbounded conversational speech.   In this post I intend to dive a little deeper into WHY conversational ASR systems are so difficult to build, possible solutions to improve them, and the bounty for those who finally succeed.

    There are really three primary issues that are limiting current systems from accurately recognizing conversational speech: Data, Data, and Data. More specifically: Required Data Size, Lack of Publicly Available Data Sets, and Cost and Complexity of Acquiring the Required Data.

    Required Data Size

    There is no strict answer for how much data is needed to solve a given machine learning problem, but one oft-cited rule is the “rule of 10”. The rule of 10 states that you need roughly 10 times as many examples as you have parameters. While there are multiple parts of an ASR system, including an acoustic model (AM) and a language model (LM), for now I am going to focus on the LM. One parameter used in an LM is called an n-gram, specifically in most cases a trigram. A trigram is basically the probability of any 3 words being seen next to each other. So if we take the rule of 10, that would imply we need 10 times the number of 3-word combinations required for our task.
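
    As a rough illustration of what a trigram parameter is, the sketch below counts trigrams in a handful of made-up utterances and estimates the probability of a third word given the two before it. The function names and sample sentences are hypothetical, and a real language model would add smoothing for unseen trigrams.

```python
from collections import Counter


def count_trigrams(utterances):
    """Count every run of 3 adjacent words across the utterances."""
    counts = Counter()
    for utterance in utterances:
        words = utterance.lower().split()
        counts.update(zip(words, words[1:], words[2:]))
    return counts


def trigram_probability(counts, w1, w2, w3):
    """Estimate P(w3 | w1 w2) from raw counts (no smoothing)."""
    context_total = sum(c for (a, b, _), c in counts.items() if (a, b) == (w1, w2))
    if context_total == 0:
        return 0.0
    return counts[(w1, w2, w3)] / context_total


utterances = ["i want to go", "i want to stay", "i want a refund"]
counts = count_trigrams(utterances)
print(counts[("i", "want", "to")])
print(trigram_probability(counts, "i", "want", "to"))
```

    Every distinct 3-word sequence is its own parameter, which is why the parameter count grows with the cube of the vocabulary size.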

    This is where the problem arises. You see, we humans write beautifully but we speak like idiots. Grammar goes out the window when people talk, we stutter, words are often repeated over and over while people search for their next thought, and honestly some folks downright make up words that don’t even exist. Taken together that means one can expect to see almost ANY combination of 3 words in the wild. Everything from “a a a” to “zebra zebra zebra”. So if you don’t mind rewinding your brain to high school math and combinatorics, that means the number of 3-word combinations is:

    (Number of Words in US English)^3

    = (~500,000)^3 = 125 QUADRILLION (i.e. a really #$%&’ing big number)

    If we apply the rule of 10 we would need 1.25 QUINTILLION (i.e. an even bigger #$%&’ing number) utterances (basically a spoken sentence) containing examples of these trigrams. Let me put this in perspective for you. A single spoken utterance saved in a text file is roughly 50 bytes in size. So in order to store 1.25 QUINTILLION utterances I would need 50 * 1.25 QUINTILLION bytes of storage. Or … 62,500 Petabytes! For reference, 20 years of internet archiving only consumed 23 petabytes as of 2015. And if that doesn’t frame it for you, think about it this way. The average utterance duration is roughly 1.5 seconds. If I were to string 1.25 QUINTILLION recorded utterances together it would take approximately 60 million millennia (roughly 59 billion years) to play it back!
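
    The arithmetic above is easy to check with a few lines of Python. The inputs (a 500,000-word vocabulary, 50 bytes and 1.5 seconds per utterance) are the rough figures from the text; the variable names and the 365.25-day year are my own choices for this sketch.

```python
VOCAB_SIZE = 500_000              # approximate number of words in US English
BYTES_PER_UTTERANCE = 50          # rough size of one transcribed utterance
SECONDS_PER_UTTERANCE = 1.5       # rough average utterance duration
SECONDS_PER_YEAR = 365.25 * 24 * 3600

trigram_count = VOCAB_SIZE ** 3                 # every possible 3-word sequence
utterances_needed = 10 * trigram_count          # rule of 10
storage_petabytes = utterances_needed * BYTES_PER_UTTERANCE / 10**15
playback_years = utterances_needed * SECONDS_PER_UTTERANCE / SECONDS_PER_YEAR

print(f"trigrams:   {trigram_count:.3e}")
print(f"utterances: {utterances_needed:.3e}")
print(f"storage:    {storage_petabytes:,.0f} PB")
print(f"playback:   {playback_years:.2e} years")
```

    The playback figure comes out on the order of tens of billions of years, so the exact per-utterance assumptions hardly matter to the conclusion.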

    So what’s the point?   The point is that the data size required to cover all possible examples of spoken US English is almost inconceivable.  Is the rule of 10 an exact science?  No.  Does it matter?  No, because even if this estimate is wrong by 1/2 or 3/4 it is still huge.   Ultimately the data size needed to properly train a conversational ASR system is gargantuan.

    Lack of Publicly Available Data Sets

    Okay, so we need a lot of data. Can’t we just buy it? No! Most publicly available data sets are shockingly small compared to the size of the domain I described above. As my fellow Marchex coworkers reported in our recently published research paper, the size of 2 of the most commonly used data sets, Fisher English and Switchboard, is prohibitively small.

    [Table comparing the sizes of the Switchboard, Fisher English, and Marchex data sets; values not available]

    Data Acquisition Cost and Complexity

    Alright, if you can’t buy it why don’t companies just go and collect the data themselves? Well, it turns out collecting 62,500 petabytes or 60 million millennia’s worth of people conversing is no simple task. There are two primary problems: collecting that amount of audio data and labeling it.

    Audio Data

    Where could someone acquire that quantity of audio data?   Well, there are countless hours of TV and Radio interviews out there but the dialog is generally scripted and edited so not reflective of true conversational speech.  On top of that in most cases companies do not have the legal rights to the data and acquiring those rights would be prohibitively expensive.

    Amazon, Apple, Microsoft, Google, and other companies are all collecting mountains of data from various voice assistants (Alexa, Siri, etc.) and voicemail messages.   However all that speech data is mostly unidirectional and non-conversational (“Alexa tell me the weather” is not really conversational).

    That leaves one obvious channel for acquiring conversational speech and that is phone calls.    So why don’t companies just collect call recordings at scale? The answer is simple:  WIRETAPPING.

    In the US, wiretapping is governed by federal and state statutes aimed at ensuring your communications are private, and there are criminal and civil penalties for those who violate the law. What makes wiretapping laws particularly problematic is that they vary by state, specifically around who must consent to being recorded.

    So why does this matter?   Well because 12 states require bidirectional consent and phone networks are open (nobody can guarantee they control both sides of the call).  While any company can update their “terms of service” to notify you that you are being recorded, they would have no easy way to guarantee that the other party has consented.   Unless they start playing that pesky message “this call might be recorded or monitored” in front of every call, including your weekly call with your mother!  This puts scalable call recording for consumer oriented phone services mostly out of reach since the risk of violating a criminal law is too high (I think it is safe to say Mark Zuckerberg has no interest in going to jail).

    In fact, just ask Google, who has dealt with an ongoing wiretapping case because they were scanning the emails of Gmail users to place targeted ads. The argument is in fact incredibly similar in that Google was not just reading the email of Gmail users but also that of any Yahoo, Hotmail, etc. user who sent a Gmail user an email. In the 12 states requiring bidirectional consent the non-Gmail users never consented, which has potentially caused Google to violate the law.


    Even if by some miracle we could collect that amount of audio data, how would we label it? In general ASR systems (and all other supervised machine learning systems) require accurately labeled data (sometimes called “ground truth”). In the speech recognition world that generally involves hand transcribing data. And if it would take 60 million millennia to play back that much speech, imagine how long it would take to hand transcribe it. Simply put, it is not feasible in our lifetimes at any reasonable cost.

    What’s the Solution

    It turns out almost all companies record phone calls. Recordings from any one company would have highly biased content, but in aggregate consumer-to-business recorded phone calls are an amazing source of conversational speech at scale. Because you need a wide cross section of content to ensure subject matter diversity, companies who provide platform call recording solutions and have legal access to the aggregate data are really the best sources of this content.

    But what about the labeling?  Well the only reasonable solution for labeling that content is using unsupervised or semi-supervised automated solutions for labeling the data.   This is where Marchex has invested and you can read more details about our semi-supervised approach in our research paper.   I hope to cover this topic in detail in a future post.

    Why Does Any of this Matter

    You might be asking if highly accurate conversational speech recognition is really necessary. Or you might be thinking “My Alexa already works awesome”. But if you are a sci-fi nerd like me, you’re anxiously awaiting the day that you can set foot on the holodeck and have a real conversation with an AI entity or crack open a beer with a fully conversant robot like Data from Star Trek. For that to happen we need to truly understand conversational speech. We need to understand it so machines can properly decipher what humans are saying, and we need to understand it so machines can generate speech that mimics human dialog.

    Highly accurate conversational speech recognition is necessary for us to fulfill the promised vision of artificial intelligence.  Who knows maybe in a few years a holographic Manfred Mann and I will be doing a duet in my own personal holodeck.   Can you hear it?   “Blinded by the light, revved up like a deuce …”