
Sequence-to-Sequence Models for Low-Resource ASR

Sequence-to-sequence (seq2seq) models of the autoregressive variety have proven very powerful over the last few years. Starting with work like Listen, Attend and Spell, these models quickly surpassed the performance of hybrid models on large industry datasets. Since then, performance has steadily improved on smaller datasets like LibriSpeech and Switchboard, though those are still large by academic standards.

Despite claims to the contrary, these models still struggle on low-resource languages with limited amounts of transcribed data, especially in more difficult domains. Note that while CTC-based models are sometimes referred to as sequence-to-sequence models, I am talking specifically about autoregressive models. When a model is not conditioned on its previous predictions, as in CTC, it is relatively easy to add information through lexicons and language models, just as in traditional hybrid models.
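To make the distinction concrete, here is a minimal toy sketch in NumPy (not any of the models discussed here): with framewise, conditionally independent outputs, an external lexicon or language-model score can simply be added to each frame's scores before picking a label, while an autoregressive decoder must feed its previous prediction back in at every step. The decoder_step function and the external LM scores below are hypothetical stand-ins.

```python
import numpy as np

np.random.seed(0)

# Toy acoustic scores: T frames over a vocabulary of V labels (log-probs).
T, V = 6, 5
acoustic_logp = np.log(np.random.dirichlet(np.ones(V), size=T))

# CTC-style: outputs are conditionally independent given the audio, so an
# external lexicon/LM score can be combined by simple per-label addition.
# (Real CTC decoding would also collapse repeats and remove blanks.)
external_lm_logp = np.log(np.array([0.4, 0.3, 0.15, 0.1, 0.05]))
framewise = (acoustic_logp + 0.5 * external_lm_logp).argmax(axis=1)

# Autoregressive seq2seq: each step is conditioned on the previous output
# token, so the model carries its own implicit language model.
def decoder_step(prev_token, frame_logp):
    """Hypothetical decoder step: biases the scores using the last token."""
    bias = np.zeros(V)
    bias[prev_token] += 0.1
    return frame_logp + bias

tokens = [0]                                     # 0 plays the role of <sos>
for t in range(T):
    tokens.append(int(decoder_step(tokens[-1], acoustic_logp[t]).argmax()))

print("framewise (CTC-style) labels:", framewise)
print("autoregressive labels:       ", tokens[1:])
```

The point is structural: in the framewise case the external score enters through a simple addition, whereas combining an external LM with an autoregressive decoder typically requires techniques like shallow fusion and is complicated by the model's implicit internal language model.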

We have been working to close the gap between hybrid and seq2seq models on more difficult, low-resource tasks. Our first approach used hybrid models to provide supervision to the seq2seq models. More recent work removes the need for an initial hybrid model by leveraging unsupervised representation learning for pretraining.
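One generic way a hybrid system can supervise a seq2seq model is pseudo-labeling: the hybrid system transcribes untranscribed audio and its hypotheses become training targets. The sketch below only illustrates that pattern; hybrid_decode, Seq2SeqModel, and train_step are hypothetical stand-ins, not the systems or recipe behind the work mentioned above.

```python
import random

random.seed(0)

def hybrid_decode(audio_path):
    """Stand-in for a trained hybrid ASR decoder returning a 1-best transcript."""
    return " ".join(random.choice(["the", "cat", "sat"]) for _ in range(3))

class Seq2SeqModel:
    """Stand-in autoregressive seq2seq model; a real one would compute a
    cross-entropy loss with teacher forcing and update its parameters."""
    def __init__(self):
        self.num_updates = 0

    def train_step(self, audio_path, transcript):
        self.num_updates += 1

# Untranscribed audio for the low-resource condition.
unlabeled_audio = [f"utt{i:03d}.wav" for i in range(5)]

# 1) Let the hybrid system transcribe the audio (hypotheses could also be
#    filtered by confidence before being used as targets).
pseudo_corpus = [(path, hybrid_decode(path)) for path in unlabeled_audio]

# 2) Train the seq2seq model on the pseudo-labeled pairs.
model = Seq2SeqModel()
for path, transcript in pseudo_corpus:
    model.train_step(path, transcript)

print(f"trained on {model.num_updates} pseudo-labeled utterances")
```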

Comments? Send me an email.