ASRU 2013 Day 4: What is Wrong with ASR?

December 2013

The final day was a short day with only one talk followed by a panel discussion.

“OUCH: Outing Unfortunate Characteristics of HMMs (Used for Speech Recognition)”, presented by Jordan Cohen and Steve Wegmann

Steve described his research into the types of errors made by standard ASR systems. He divided the errors made by HMMs into two types: those caused by the statistical independence assumption and those caused by model mismatch. HMMs assume that each frame is conditionally independent of every other frame given the current state. We know this assumption is false, but it works reasonably well in practice and is convenient both mathematically and computationally. Using a clever sampling approach, Steve demonstrated that nearly all the errors are caused by this independence assumption. Moreover, discriminative training methods are essentially ways to compensate for it.

These findings hold only in matched conditions, where the training and test conditions are the same. In a mismatched setting, for instance when different microphones are used for training and testing, a large portion of the errors is caused by the mismatch itself.
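To make the independence assumption concrete: once the state alignment is fixed, an HMM scores the acoustics as a sum of per-frame log-likelihoods, so the order of the frames within a state carries no information. The sketch below is my own toy illustration, not the sampling procedure from the talk. It assumes one-dimensional features and a single Gaussian per state, and every name and number in it is hypothetical.

```python
import math
import random


def gaussian_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def emission_loglik(frames, states, params):
    """Emission score under the HMM assumption: each frame depends only on
    its own state, so the total is a plain sum of per-frame terms."""
    return sum(gaussian_logpdf(x, *params[s]) for x, s in zip(frames, states))


# Hypothetical two-state model: state -> (mean, variance).
params = {0: (0.0, 1.0), 1: (3.0, 0.5)}

# A fixed alignment and a smooth (clearly time-correlated) feature trajectory.
states = [0, 0, 0, 1, 1, 1]
frames = [0.1, 0.3, 0.5, 2.8, 3.0, 3.3]

# Shuffle the frames inside each state segment, destroying the trajectory.
rng = random.Random(0)
segment0, segment1 = frames[:3], frames[3:]
rng.shuffle(segment0)
rng.shuffle(segment1)
shuffled = segment0 + segment1

original_score = emission_loglik(frames, states, params)
shuffled_score = emission_loglik(shuffled, states, params)

# The two scores agree (up to floating-point rounding): the model cannot
# tell a smooth trajectory from a scrambled one.
print(original_score, shuffled_score)
```

Real speech frames are strongly correlated over time, which is exactly the structure this kind of scoring ignores.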

Jordan presented results from a survey of speech recognition experts. The goal of the survey was to discover what experts believe is wrong with current approaches to ASR. In short, the respondents blamed everything to varying degrees, with language and acoustic models leading the pack. There was also some discussion about the brittleness of our models: whenever conditions change, error rates increase. I found one comment particularly interesting: “If your model is a mixture of 64 Gaussians, then you have no idea what the true model is.” Jordan ended his presentation by calling for researchers to investigate new, more accurate models, though he had no thoughts on what those models might look like.

Final Panel Discussion

The panel discussion began by following up on the comments made by Steve and Jordan during their talk. Jordan pointed out that HMMs as a technology are over 40 years old, yet they still form the basic structure of our current state-of-the-art models. Several people echoed this, arguing that we should be moving away from HMMs. Others countered that many alternative models have been proposed and tried, but none has surpassed the performance of an HMM.

Everyone agrees that an elegant model that solves ASR would be great. The disagreement is over whether this is feasible or necessary. Some believe that after another 25-50 years of incremental improvements, ASR will be “solved” for all practical purposes. The solution may not be elegant or bring about any scientific understanding, but it would solve the engineering problem. Would a gigantic Hybrid DNN model trained on a million hours of speech in all conceivable conditions, using a trillion-word language model, solve the problem?

Hynek Hermansky made several good points. His vision of future ASR, and I have seen this echoed by others, is a set of parallel systems working together. Instead of a single elegant model, we have many non-optimal systems combined in an intelligent manner to solve the problem. He also encouraged researchers to be more scientific in their work. Most papers in the field are of the “we tried approach X and obtained a small, but statistically significant improvement over approach Y” type. We should be asking questions that teach us something interesting no matter what the answer is. I agree, but performing that type of research is very difficult.

One issue that was not discussed is funding for research into new models. At the moment there is very little funding for basic research in ASR (I imagine all areas of science and engineering have similar issues). The large grants that I see are for specific applications or near-term goals. Nobody gets funding for a project that might not bear fruit for 10 or 15 years, if ever.

Work on limited-resource languages was also tied back into this discussion. Aren Jansen made the point that thinking about solutions to zero-resource tasks may deliver models and features useful to LVCSR. Much of the work on zero-resource tasks requires relaxing constraints normally set by experts, such as the acoustic units, pronunciations, and overall model structure. By learning these things in an unsupervised fashion, we may discover solutions that are better than the ones used in state-of-the-art systems. Aren also mentioned that thinking about alternative metrics can help break research out of a rut. This is particularly relevant given the recent resurgence in keyword spotting (evaluated with the ATWV metric) thanks to the Babel project.
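For readers who have not met ATWV: as I understand the NIST keyword search definition, the term-weighted value is one minus the average over keywords of the miss probability plus a heavily weighted false-alarm probability, and the "actual" variant is computed at the system's own decision threshold. The sketch below is a minimal illustration under those assumptions. The function name and the input format are hypothetical, and beta = 999.9 is the conventional weight as I recall it.

```python
def term_weighted_value(counts, speech_seconds, beta=999.9):
    """Average term-weighted value over keywords.

    counts maps keyword -> (n_true, n_correct, n_false_alarm), all counted
    at a fixed decision threshold. speech_seconds is the amount of audio
    searched; non-target trials are taken as one per second of speech
    minus the true occurrences.
    """
    values = []
    for keyword, (n_true, n_correct, n_false_alarm) in counts.items():
        if n_true == 0:
            continue  # keywords that never occur in the reference are skipped
        p_miss = 1.0 - n_correct / n_true
        p_false_alarm = n_false_alarm / (speech_seconds - n_true)
        values.append(1.0 - (p_miss + beta * p_false_alarm))
    if not values:
        raise ValueError("no keyword has reference occurrences")
    return sum(values) / len(values)


# Toy example with made-up counts: keyword -> (true, correct, false alarms).
toy_counts = {"hello": (10, 10, 0), "world": (10, 5, 0)}
toy_score = term_weighted_value(toy_counts, speech_seconds=36000.0)
print(toy_score)
```

A perfect system scores 1, a system that returns nothing scores 0, and because of the large weight a handful of false alarms can push the score below zero. That is a very different incentive from word error rate.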

Joe Olive brought the conference full circle by bringing up neural networks again. He said that, from a very high level, neural networks are just abstractions of HMMs. I think his point was that if HMMs are not the solution to ASR, then Tandem and Hybrid approaches, even with deep neural networks, are not the solution either.

Comments? Send me an email.