
Interesting Papers from ASRU 2013

Below is a list of papers that I found particularly interesting. This is just personal preference, and the presence or absence of a paper on the list does not necessarily reflect on its quality.

Learning a Subword Vocabulary based on Unigram Likelihood by Matti Varjokallio, Mikko Kurimo, and Sami Virpioja: A subword vocabulary can be useful for several reasons. In agglutinative and morphologically complex languages, the effective vocabulary size can be in the millions. Even with a large amount of training data, building models over word units can be difficult. When dealing with a limited amount of training data, subword units can be useful in handling OOV terms, even in languages with a more limited vocabulary. The optimal method for building a subword unit vocabulary is an open question, and is likely data and language dependent. In this paper, Varjokallio et al. present a new method for discovering subword units. They use an iterative procedure that starts with a large number of hypothesized subword units. At each step, the subword units are scored by an n-gram character model. Based on these scores, the original words in the training data are segmented into their most likely sequence of subword units. At each iteration, subword units are removed until the desired number of subword units remains. LVCSR experiments showed a very slight improvement over using Morfessor. While the CER and WER results were not exciting, I suspect this could be an interesting approach to apply to keyword spotting tasks.
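
The paper's actual likelihood criterion and pruning schedule are more involved, but the toy Python sketch below (my own simplification, with a plain unigram score standing in for the n-gram character model) shows the general shape of the loop: start from every substring as a candidate unit, segment the training words with the current scores, and repeatedly drop the least useful units until the inventory reaches the target size.

```python
import math
from collections import Counter

def best_segmentation(word, logprob):
    # Viterbi-style search for the highest-scoring split of `word`
    # into units that have a score in `logprob`.
    best = [None] * (len(word) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logprob and best[start] is not None:
                score = best[start][0] + logprob[piece]
                if best[end] is None or score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1][1]

def prune_subwords(words, target_size):
    # Candidate inventory: every substring of every word; single characters
    # are never removed, so every word stays segmentable.
    substrings = [w[i:j] for w in words
                  for i in range(len(w)) for j in range(i + 1, len(w) + 1)]
    units = set(substrings)
    chars = {c for w in words for c in w}
    counts = Counter(substrings)  # initial scores from raw substring frequencies
    while len(units) > target_size:
        total = sum(counts[u] for u in units) + len(units)
        logprob = {u: math.log((counts[u] + 1) / total) for u in units}
        # Re-segment the data with the current scores and re-count the units.
        counts = Counter()
        for w in words:
            counts.update(best_segmentation(w, logprob))
        # Drop the least-used multi-character units.
        removable = sorted(units - chars, key=lambda u: counts[u])
        if not removable:
            break
        n = min(len(units) - target_size, max(1, len(units) // 10))
        units -= set(removable[:n])
    return units
```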

Speaker Adaptation of Neural Network Acoustic Models using I-Vectors by George Saon, Hagen Soltau, David Nahamoo, and Michael Picheny: Not much to add beyond what the title says. Assuming you have a standard DNN/HMM setup and a way to generate i-vectors, this approach is a simple way to get a nice improvement in WER. During training and decoding, the speaker-specific i-vector is appended to the frame-based features. I thought this was a clever way of attaching a compact representation of the speaker to each feature vector. Results were similar to using VTLN and FMLLR for speaker-adapted features, and a further improvement was seen by using the speaker-adapted features with the i-vector approach.
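
Mechanically, the adaptation input is just feature concatenation; here is a minimal numpy sketch (the dimensions are illustrative, not the paper's exact setup):

```python
import numpy as np

def append_ivector(frames, ivector):
    """Append a speaker-level i-vector to every frame-level feature vector.

    frames:  (num_frames, feat_dim) acoustic features for one utterance
    ivector: (ivec_dim,) i-vector for the utterance's speaker
    returns: (num_frames, feat_dim + ivec_dim) network input
    """
    tiled = np.tile(ivector, (frames.shape[0], 1))  # repeat the i-vector per frame
    return np.concatenate([frames, tiled], axis=1)

# e.g. 40-dim speaker-adapted features plus a 100-dim i-vector -> 140-dim DNN input
network_input = append_ivector(np.random.randn(300, 40), np.random.randn(100))
```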

Neighbour Selection and Adaptation for Rapid Speaker-Dependent ASR by Udhyakumar Nallasamy, Mark Fuhs, Monika Woszczyna, Florian Metze, and Tanja Schultz: When a large amount of speaker data exists, a speaker-dependent model greatly outperforms a speaker-independent one. In most cases, we do not have this speaker-dependent data, so other adaptation methods are used. Nallasamy et al. propose an alternative approach. They assume a small amount of speaker data exists (5 minutes) and perform adaptation by selecting similar speakers from the training set. The improvement in WER was impressive, and they also presented a thorough analysis. For instance, they showed that having an hour of speaker data for performing data selection was no better than having 5 minutes.
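
The specific selection criterion is the paper's own contribution; purely as an illustration of the general idea, here is a sketch that ranks training speakers by similarity to the target speaker using fixed-length speaker representations (the use of i-vectors and cosine similarity here is my stand-in assumption, not necessarily what the authors used):

```python
import numpy as np

def select_neighbours(target_rep, train_speaker_reps, num_neighbours):
    # Rank training speakers by similarity to the target speaker and return
    # the closest ones; their data is then used to adapt the acoustic model.
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(train_speaker_reps.items(),
                    key=lambda kv: cosine(target_rep, kv[1]),
                    reverse=True)
    return [speaker for speaker, _ in ranked[:num_neighbours]]
```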

The TAO of ATWV: Probing the Mysteries of Keyword Search Performance by Steven Wegmann, Arlo Faria, Adam Janin, Korbinian Riedhammer, and Nelson Morgan: I have recently begun working on keyword spotting, so this paper was of particular interest. It discusses the performance metric (ATWV) in detail and analyzes potential avenues for improving performance with respect to that metric. In general, there are three potential ways to improve performance: increase the total number of hits, improve the accuracy of scoring individual hits, and use better keyword-specific thresholds. They show that the reason system combination provides large gains is that it simply increases the total number of hits. In addition, they show that large gains are still available by improving the scores of hits and the keyword-specific thresholds.
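
For reference, ATWV itself is straightforward to compute once per-keyword hit, miss, and false-alarm counts are available at the system's own decision thresholds. Below is a sketch using the usual NIST definition (beta = 999.9, with the number of possible false-alarm trials per keyword approximated by the audio duration in seconds); the three avenues above correspond directly to the miss term, the false-alarm term, and where the threshold is placed:

```python
def atwv(keyword_stats, audio_seconds, beta=999.9):
    """Actual Term-Weighted Value averaged over keywords.

    keyword_stats: dict mapping keyword -> (n_true, n_correct, n_false_alarm),
                   counted at the system's chosen decision threshold.
    audio_seconds: total duration of the searched audio.
    """
    twvs = []
    for n_true, n_correct, n_fa in keyword_stats.values():
        if n_true == 0:
            continue  # keywords with no true occurrences do not contribute
        p_miss = 1.0 - n_correct / n_true
        p_fa = n_fa / (audio_seconds - n_true)
        twvs.append(1.0 - p_miss - beta * p_fa)
    return sum(twvs) / len(twvs)
```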

Learning Filter Banks Within a Deep Neural Network Framework by Tara Sainath, Brian Kingsbury, Abdel-Rahman Mohamed, and Bhuvana Ramabhadran: The Mel-scale filter bank (and other similar variations) has long been used in speech recognition. Sainath et al. propose to learn the filter banks instead of simply using a predefined set of filters based on perceptual experiments. Since the introduction of DNNs, there has been work on unpacking the standard feature calculation pipeline and learning its stages automatically. This work follows in that vein. They show an improvement in WER when using the learned filters. They also provide an impressive amount of analysis. I did get the impression that correctly handling and normalizing the power spectral features is tricky. Applying this work would probably require a careful reading of the paper.
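
As a rough picture of what "learning the filter bank" means in practice, here is a numpy sketch of the forward pass of such a layer; the log-domain parameterization (to keep the filters positive) and the normalization details are my assumptions rather than the paper's exact recipe, and in a real system the weights would be initialized from a Mel filter bank and updated by backpropagation together with the rest of the network:

```python
import numpy as np

def filterbank_layer_forward(power_spectrum, log_filter_weights):
    # power_spectrum:     (num_frames, num_fft_bins) normalized power spectrum
    # log_filter_weights: (num_fft_bins, num_filters) learnable parameters
    filters = np.exp(log_filter_weights)  # exponentiate so filters stay positive
    energies = power_spectrum @ filters   # filter-bank energies per frame
    return np.log(energies + 1e-10)       # usual log compression before the DNN
```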

Porting Concepts from DNNs Back to GMMs by Kris Demuynck and Fabian Triefenbach: This paper was also voted best poster by the conference attendees. The authors hypothesize that there is nothing unique about DNNs that makes them outperform GMM systems; the aspects of DNNs that lead to strong performance can be applied to other systems. In this work, they show that by going “deep and wide” the performance of a GMM system can be comparable to that of a DNN. One advantage of staying in the GMM space is that the decades of work on GMM adaptation can still be applied. I think this is a very interesting avenue of research: identify the important aspects of DNNs and apply them to other models.

Comments? Send me an email.