
Training Autoregressive Speech Recognition Models with Limited In-Domain Supervision


Our recent paper demonstrates state-of-the-art results on conversational speech recognition with limited in-domain training data. Just 10 minutes of supervised data is enough to build a usable model. With three hours of in-domain supervision, our autoregressive sequence-to-sequence models match our previous best models trained on 50+ hours of transcribed data.

For the last few years we have been working towards replacing our hybrid ASR systems with all-neural sequence-to-sequence models. For large, relatively easy datasets (e.g., LibriSpeech) we could have done this years ago. But our focus has been on low-resource conversational speech, a more difficult task for sequence-to-sequence models, especially the autoregressive variety. Our first success was to match the performance of hybrid models by using them for supervision. But this result felt less than satisfactory: we still needed the hybrid models to bootstrap our training. Later we were able to entirely remove our reliance on the hybrid model through self-supervised training. After several years of effort, we had matched the performance of our existing systems with a newer technology. Matched, but not exceeded, not surpassed. We were ready to move beyond sufficient.

With our latest result, we have surpassed our hybrid model performance in two dimensions: our autoregressive sequence-to-sequence models now achieve lower word error rate (WER) with less data. We took advantage of several resources to achieve this result. The most important was the XLS-R model from Meta [1]. The fine-tuned XLS-R model, combined with an external language model, provided initial supervision to bootstrap the training process. Read speech from sources like CommonVoice [2] was also important. Because autoregressive models have an integrated language model, they tend to be data hungry; the out-of-domain read speech helped stabilize the training process. The read speech was also critical for fine-tuning the XLS-R model with limited amounts of in-domain supervised data.
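
To make the bootstrapping step concrete, here is a minimal sketch of pseudo-labeling with a fine-tuned XLS-R model, using the Hugging Face transformers and torchaudio libraries. The checkpoint name and file list are placeholders, and the greedy CTC decode stands in for the external-language-model fusion described above; treat this as an illustration of the idea, not our actual pipeline.

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Placeholder: substitute a real XLS-R checkpoint fine-tuned for ASR.
CHECKPOINT = "your-org/xls-r-finetuned-asr"

processor = Wav2Vec2Processor.from_pretrained(CHECKPOINT)
model = Wav2Vec2ForCTC.from_pretrained(CHECKPOINT).eval()


def pseudo_label(wav_path: str) -> str:
    """Transcribe one utterance with greedy CTC decoding."""
    waveform, sample_rate = torchaudio.load(wav_path)
    if sample_rate != 16_000:
        # XLS-R expects 16 kHz input.
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
    inputs = processor(
        waveform.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]


# Transcribe the unlabeled in-domain audio; the transcripts become
# training targets for the autoregressive sequence-to-sequence model.
with open("pseudo_labels.tsv", "w") as out:
    for path in ["utt_0001.wav", "utt_0002.wav"]:  # placeholder file list
        out.write(f"{path}\t{pseudo_label(path)}\n")
```

Once written out, the pseudo-label file plays the same role as a transcribed training set when training the autoregressive model.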

Building on our previous work, we have demonstrated that a small, targeted autoregressive model can outperform the billion-parameter XLS-R model with limited amounts of supervision. Even in the low-resource conversational speech setting, hybrid models are no longer competitive with our autoregressive sequence-to-sequence models. We plan to continue improving performance and reducing our reliance on out-of-domain read speech.

References

  1. XLS-R: Self-supervised speech processing for 128 languages. Meta AI. https://ai.facebook.com/blog/xls-r-self-supervised-speech-processing-for-128-languages/

  2. Mozilla Common Voice. https://commonvoice.mozilla.org/en

Comments? Send me an email.
© 2023 William Hartmann