
ASRU 2013 Day 1: Neural Networks

The structure of this ASRU conference is different from other conferences I have attended. All accepted papers were presented as posters, and all posters were hung in a single room at the same time. Each day focused mainly on keynote-style lectures, with time to view posters interspersed between the talks. Below are my thoughts on the three talks of the first day.

“Physiological Basis of Speech Processing: From Rodents to Humans” presented by Christoph Schreiner.

This was an interesting talk presenting recent work on how animals process acoustic signals. Since it was given from the perspective of a bioengineer working with the actual wetware of living creatures, it was more difficult to follow than most of the other talks in the general ASR area. One aspect that I did like was that he focused more on the actual processing in the cortex. Most of the previous work I have seen focuses on the processing and filtering happening in the cochlea.

I was previously aware that we have neurons that respond to particular frequencies. I always assumed those frequencies fell within the range of normal human hearing. Professor Schreiner mentioned that many of these neurons actually respond to frequencies below 20 Hz, the lower bound of human hearing. These neurons would be capturing information about the spectral envelope.

He made two comments about filterbanks. Modulation filterbanks are not static in the brain; would non-static filterbanks also make sense for ASR features? The recently popular STFB Gabor filters are mostly inseparable in time and frequency, while most cortical neurons are separable. Does this mean these Gabor filters are not as biologically inspired as I had previously thought?
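To make the separability point concrete, here is a small numpy sketch (my own construction, not from the talk) that builds a spectro-temporal Gabor patch and measures how close it is to rank one. A filter is separable in time and frequency exactly when its matrix factors as an outer product of a temporal function and a spectral function.

```python
import numpy as np

def gabor_2d(n_t=32, n_f=32, omega_t=0.25, omega_f=0.25, sigma=6.0):
    """Spectro-temporal Gabor patch: Gaussian envelope times a plane-wave carrier.

    omega_t, omega_f are temporal and spectral modulation frequencies
    (cycles per sample). A carrier tilted in the time-frequency plane
    (both omegas nonzero) gives an inseparable filter.
    """
    t = np.arange(n_t) - n_t / 2
    f = np.arange(n_f) - n_f / 2
    T, F = np.meshgrid(t, f, indexing="ij")
    envelope = np.exp(-(T**2 + F**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * (omega_t * T + omega_f * F))
    return envelope * carrier

def separability(filt):
    """Fraction of energy in the first singular value; 1.0 means rank one,
    i.e. the filter factors into (temporal function) x (spectral function)."""
    s = np.linalg.svd(filt, compute_uv=False)
    return s[0]**2 / np.sum(s**2)

# A purely temporal modulation is separable; a tilted carrier is not.
print(separability(gabor_2d(omega_t=0.25, omega_f=0.0)))   # ~1.0
print(separability(gabor_2d(omega_t=0.25, omega_f=0.25)))  # well below 1.0
```

A carrier aligned with one axis gives a separable filter; a tilted carrier, the kind that captures joint spectro-temporal modulation, does not.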

I was also intrigued by his comment regarding the difficulty individuals with hearing loss have understanding speech in noisy environments. My understanding is that most (all?) work is focused at the cochlear level; the main idea is to clean the speech signal prior to any processing done in the cortex. He seemed to imply that the real solution was instead to “train” the cortical units to recognize speech in these noisy environments. According to his hypothesis, humans have difficulty recognizing speech in those environments because it is a situation the brain never experienced during early development. We would just need a study where a group of young children have their hearing damaged in a manner similar to that of the elderly. After several years of development, would they be able to recognize speech better than adults with similar hearing? Obviously this type of study would be unethical, but I wonder if there are similar studies on children with natural hearing loss.

Overall, this talk gave me some new information; I just do not know what to do with it.

“Multilayer Perceptrons for Speech Recognition: There and Back Again” presented by Nelson Morgan.

I always enjoy hearing Morgan talk. He is one of the more entertaining and informative speakers in the field (plus, he is my academic grandfather). Now that neural networks are in vogue again, Morgan gave a nice historical overview of their development and use in speech recognition. I think he wanted to emphasize that while we are seeing large gains with deep neural networks now, most of the ideas are 10, 20, or even 30 years old. He did not want to diminish the work of researchers today, but he did say that the ideas were all there; someone just needed to implement them. It seems to me that it was also an issue of computational power and data. Hybrid and tandem acoustic models were always a good idea.

Morgan also made some other interesting points. Both width and depth are important for DNNs (Morgan kept referring to deep neural nets as MLPs, but I will stick with the DNN term; an MLP with more hidden layers is still an MLP), but how deep and how wide do they need to be? He ran some experiments where he fixed the total number of parameters and then trained DNNs with increasing numbers of hidden layers. Since the total number of parameters was fixed, setting the depth implicitly sets the width. Depending on the task, different numbers of hidden layers (between 2 and 7) were optimal. His takeaway point was that you should not necessarily assume an a priori optimal structure; you should determine it based on the data and the task.
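As a back-of-the-envelope illustration of that fixed-budget setup (my own sketch with made-up dimensions, not Morgan's actual configuration), the uniform hidden width implied by a weight budget and a chosen depth falls out of a quadratic:

```python
import math

def hidden_width(total_params, n_hidden_layers, d_in, d_out):
    """Uniform hidden-layer width that roughly matches a weight budget.

    Weight count (ignoring biases) for n_hidden_layers of width h:
        d_in*h + (n_hidden_layers - 1)*h**2 + h*d_out
    Solve the quadratic a*h**2 + b*h - total_params = 0 for h.
    """
    a = n_hidden_layers - 1
    b = d_in + d_out
    if a == 0:  # single hidden layer: the count is linear in h
        return total_params / b
    return (-b + math.sqrt(b**2 + 4 * a * total_params)) / (2 * a)

# Example: ~10M weights, 440-dim input (e.g. stacked frames), 2000 output states.
# These dimensions are illustrative, not Morgan's.
for depth in range(2, 8):
    print(depth, round(hidden_width(10_000_000, depth, 440, 2000)))
```

Deeper networks under the same budget are necessarily narrower, which is exactly the trade-off his experiments swept.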

His other main point was that features are still important. Much of the recent work has suggested that features are not important because the extra hidden layers will learn good representations anyway. Morgan quoted Hinton as saying, “I do think features are important; I just think they should be learned.” While the same kinds of features previously used in ASR may not work for DNNs, you cannot just work straight from the waveform. Hermansky also had a very interesting quote: “the goal of the front-end is to destroy information.” Based on other comments throughout the day, it appears the importance of hand-derived, perceptually-motivated features is a point of contention among researchers.

He ended his talk with some comments about the biological plausibility of DNNs. Their ASR performance has improved, but they are essentially the same McCulloch-Pitts models from 70 years ago. The improvements come from the increased number of parameters, not increased complexity of the individual units. Also, the learning does not seem to be biologically plausible. Morgan still hopes to see advances stemming from improved biological understanding.

This prompted a question from the audience, “why so much concern about replicating the human approach to speech recognition, humans are not perfect?” I liked Morgan’s response. While a pure engineering approach would be welcome, the potential design space is infinite. Humans are the only example of a good speech recognition system that we have. At the very least, focusing on biology allows us to reduce the search space.

“Context-Dependent Deep Neural Networks for Large Vocabulary Speech Recognition: From Discovery to Practical Systems” presented by Frank Seide.

Seide was asked to talk because his results in a 2010 paper, showing 30-40% relative error reduction compared to standard GMM/HMM approaches, likely led to the explosion in DNN work we see now. He focused on presenting practical information of interest to a researcher wanting to work with DNNs, something I very much appreciated.

It took 60 days to train the original models for that paper. That is a huge amount of computation and probably illustrates better than anything else why DNNs have only recently become popular. After presenting thorough evidence from the literature that the gains seen by DNN systems are genuine, Seide attempted to provide some intuition behind the models. The input layer provides landmarks, features used by the later layers. Each hidden-layer unit represents a class/non-class membership; the units act like soft logical operators. The final output layer is a linear classifier.
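A toy numpy illustration of the “soft logical operator” intuition (my construction, not Seide's): with large weights, a sigmoid unit over binary landmark detections behaves like a smoothed AND or OR, depending on its bias.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two binary "landmark" inputs; weights are large so the unit saturates.
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
w = np.array([10.0, 10.0])

soft_and = sigmoid(inputs @ w - 15.0)  # fires only when both landmarks are present
soft_or  = sigmoid(inputs @ w - 5.0)   # fires when at least one is present

for x, a, o in zip(inputs, soft_and, soft_or):
    print(x, f"AND~{a:.2f}", f"OR~{o:.2f}")
```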

His discussion of training answered one of my main questions. Generative pre-training is not really necessary anymore; you can simply do greedy layer-wise discriminative pre-training. In fact, you do not need to train any of the layers fully or even use triphone states as targets (monophones work just as well). He had a long discussion of sequence training, but the main takeaway seemed to be that you can do similar things to what you do with GMM/HMM systems, except that you need to be careful about silence.
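Here is roughly how I understood the greedy layer-wise discriminative pre-training recipe, as a PyTorch sketch (the framework, layer sizes, learning rate, and step counts are my assumptions, not details from the talk): train a one-hidden-layer network on the frame labels, then repeatedly add a hidden layer under a freshly initialized output layer and train again, without running any stage to convergence.

```python
import torch
import torch.nn as nn

def pretrain_discriminative(frames, labels, d_in, d_hidden, n_states,
                            n_layers=6, steps_per_stage=100):
    """Greedy layer-wise discriminative pre-training (sketch, not Seide's exact recipe).

    frames: (N, d_in) float tensor of acoustic features
    labels: (N,) long tensor of frame-level state targets
    """
    hidden = [nn.Sequential(nn.Linear(d_in, d_hidden), nn.Sigmoid())]
    model = None
    for depth in range(1, n_layers + 1):
        output = nn.Linear(d_hidden, n_states)        # fresh output layer each stage (assumption)
        model = nn.Sequential(*hidden, output)
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(steps_per_stage):              # deliberately not run to convergence
            opt.zero_grad()
            loss = loss_fn(model(frames), labels)
            loss.backward()
            opt.step()
        if depth < n_layers:                          # grow the stack for the next stage
            hidden.append(nn.Sequential(nn.Linear(d_hidden, d_hidden), nn.Sigmoid()))
    return model                                      # fine-tune end to end after this
```

Full end-to-end fine-tuning, and any sequence training, would come after this frame-level stage.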

There were a few more interesting nuggets. One surprising piece of information was that nobody really knows how to parallelize the training well; even with hundreds of machines, only speedups of 3x to 5x are possible. DNN adaptation appears to be useful only for smaller models; larger models seem to handle this implicitly already. If you are interested in alternative architectures, convolutional neural networks seem much more promising than rectified linear units.

Of course, the discussion eventually came back to features. He showed best performance using mean-normalized log filterbank features (without deltas). Seide even said, “we have undone 30 years of feature research.” Some members of the audience did not like that comment.
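For reference, that feature pipeline is short enough to write down. Here is a sketch using librosa (the window, hop, and filterbank sizes below are my guesses, not the settings from the talk): log mel filterbank energies with the per-utterance mean of each band removed, and no deltas appended.

```python
import numpy as np
import librosa

def mean_normalized_log_fbank(wav_path, n_mels=40, n_fft=400, hop_length=160):
    """Mean-normalized log mel filterbank features (no deltas).

    n_fft=400 and hop_length=160 correspond to 25 ms windows with a 10 ms
    shift at 16 kHz; these settings are illustrative, not from the talk.
    """
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_fbank = np.log(mel + 1e-10)                       # avoid log(0)
    log_fbank -= log_fbank.mean(axis=1, keepdims=True)    # per-band mean over the utterance
    return log_fbank.T                                    # (frames, n_mels)
```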

Comments? Send me an email.