
Overcoming Domain Mismatch in Low Resource Sequence-to-Sequence ASR Models using Hybrid Generated Pseudotranscripts


While sequence-to-sequence (seq2seq) models have clearly improved the state-of-the-art on high-resource tasks, hybrid models can still outperform them on difficult, low-resource tasks. Some papers do show improvements over hybrid models, but upon closer inspection, the hybrid baselines are often impoverished in some way. Hybrid models have the advantage that the acoustic and language models are trained independently, so the language model can be trained on text data without associated audio. Note that CTC-based models are sometimes referred to as seq2seq models, but they too can easily take advantage of additional text data through lexicons and language models; here, by seq2seq we mean autoregressive models specifically. We demonstrate that hybrid models can be used to dramatically improve the performance of seq2seq models in low-resource scenarios.

The work was performed in the context of the IARPA MATERIAL program. For a set of five languages, we had transcribed conversational speech and untranscribed broadcast news. The goal was to create a model that performed well in the broadcast domain even though we had no transcribed data in that domain.

Our initial seq2seq models, trained using only the supervised data, performed poorly, with an average WER of 66.7%. Even with standard semi-supervised techniques, we were only able to bring the WER down to 60.7%. The issue was that, even with language model fusion, the seq2seq model was unable to take advantage of the vast amount of additional text data available. This limited the quality of the pseudotranscripts generated by the model and, in turn, the improvement from semi-supervised training.
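For context, "fusion" here means combining the seq2seq decoder's scores with an external language model during decoding. A minimal sketch of shallow fusion; the `lm_weight` value is illustrative, not a number from the paper:

```python
import torch

def fused_step_scores(s2s_logprobs: torch.Tensor,
                      lm_logprobs: torch.Tensor,
                      lm_weight: float = 0.3) -> torch.Tensor:
    """Shallow fusion: add the weighted external LM score to the
    seq2seq decoder score at each decoding step. Both inputs are
    log-probabilities over the same output vocabulary."""
    return s2s_logprobs + lm_weight * lm_logprobs
```

Even with this extra LM signal at decode time, the seq2seq model itself never sees the text-only data during training, which is exactly the gap the hybrid model fills.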

Since hybrid models are better able to take advantage of the additional text data, we built hybrid models solely for the purpose of generating pseudotranscripts. The hybrid-generated pseudotranscripts were then used as supervision for the seq2seq model. Because the initial WER of the hybrid model was much lower, the resulting improvement from semi-supervised training was much greater. The improvements to the seq2seq model come from two factors. The first is that the pseudotranscribed audio provides additional acoustic data for the model to learn from. The second is that the pseudotranscripts implicitly transfer knowledge from the external language model into the seq2seq model.
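A sketch of the overall recipe; the model objects and their `decode` and `train` methods are hypothetical placeholders rather than any particular toolkit's API:

```python
def build_pseudolabeled_corpus(hybrid_model, untranscribed_audio):
    """Decode the untranscribed broadcast audio with the hybrid model
    (acoustic model + external LM) and keep the hypotheses as
    pseudotranscripts."""
    return [(utt, hybrid_model.decode(utt)) for utt in untranscribed_audio]

def train_with_pseudotranscripts(seq2seq_model, supervised_data,
                                 hybrid_model, untranscribed_audio):
    """Pool the transcribed conversational speech with the pseudolabeled
    broadcast data and train the seq2seq model on the combination."""
    pseudo_data = build_pseudolabeled_corpus(hybrid_model, untranscribed_audio)
    seq2seq_model.train(supervised_data + pseudo_data)
    return seq2seq_model
```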

By using the hybrid model for supervision, we were able to reduce the WER for the seq2seq model from 66.7% to 29.0%, a dramatic improvement.
