
Training Autoregressive Speech Recognition Models with Limited In-Domain Supervision


Our recent paper demonstrates state-of-the-art results on conversational speech recognition with limited in-domain training data. Just 10 minutes of supervised data is enough to build a usable model. With three hours of in-domain supervision, our autoregressive sequence-to-sequence models match our previous best models trained on 50+ hours of transcribed data.

For the last few years we have been working towards replacing our hybrid ASR systems with all-neural sequence-to-sequence models. For large, relatively easy datasets (e.g., LibriSpeech) we could have done this years ago. But our focus has been on low-resource conversational speech, a more difficult task for sequence-to-sequence models, especially the autoregressive variety. Our first success was to match the performance of hybrid models by using them for supervision. But this result felt less than satisfactory: we still needed the hybrid models to bootstrap our training. Later we were able to entirely remove our reliance on the hybrid model through self-supervised training. After several years of effort, we had matched the performance of our existing systems with a newer technology. Matched, but not exceeded, not surpassed. We were ready to move beyond sufficient.

With our latest result, we have surpassed our hybrid model performance in two dimensions: our autoregressive sequence-to-sequence models now achieve lower word error rate (WER) with less data. We took advantage of several resources to achieve this result. The most important was the XLS-R model from Meta [1]. The fine-tuned XLS-R model, combined with an external language model, provided initial supervision to bootstrap the training process. Read speech from sources like CommonVoice [2] was also important. Because autoregressive models have an integrated language model, they tend to be data hungry; the out-of-domain read speech helped stabilize the training process. The read speech was also critical for fine-tuning the XLS-R model with limited amounts of in-domain supervised data.
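
To make the bootstrapping step concrete, here is a minimal sketch of pseudo-labeling with a fine-tuned XLS-R model, using the Hugging Face transformers and torchaudio libraries. The checkpoint name and file list are placeholders, and the greedy CTC decode stands in for the external-language-model fusion described above; treat this as an illustration of the idea, not our actual pipeline.

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Placeholder: substitute a real XLS-R checkpoint fine-tuned for ASR.
CHECKPOINT = "your-org/xls-r-finetuned-asr"

processor = Wav2Vec2Processor.from_pretrained(CHECKPOINT)
model = Wav2Vec2ForCTC.from_pretrained(CHECKPOINT).eval()


def pseudo_label(wav_path: str) -> str:
    """Transcribe one utterance with greedy CTC decoding."""
    waveform, sample_rate = torchaudio.load(wav_path)
    if sample_rate != 16_000:
        # XLS-R expects 16 kHz input.
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16_000)
    inputs = processor(
        waveform.squeeze(0).numpy(), sampling_rate=16_000, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]


# Transcribe the unlabeled in-domain audio; the transcripts become
# training targets for the autoregressive sequence-to-sequence model.
with open("pseudo_labels.tsv", "w") as out:
    for path in ["utt_0001.wav", "utt_0002.wav"]:  # placeholder file list
        out.write(f"{path}\t{pseudo_label(path)}\n")
```

Once written out, the pseudo-label file plays the same role as a transcribed training set when training the autoregressive model.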

Building on our previous work, we have demonstrated that a small, targeted autoregressive model can outperform the billion-parameter XLS-R model with limited amounts of supervision. Even in the low-resource conversational speech setting, hybrid models are no longer competitive with our autoregressive sequence-to-sequence models. We plan to continue improving performance and reducing our reliance on out-of-domain read speech.

References

  1. XLS-R: Self-supervised speech processing for 128 languages. Meta AI. https://ai.facebook.com/blog/xls-r-self-supervised-speech-processing-for-128-languages/

  2. Mozilla Common Voice. https://commonvoice.mozilla.org/en

Comments? Send me an email.
© 2023 William Hartmann