
Combining Unsupervised and Text Augmented Semi-Supervised Learning for Low Resourced Autoregressive Speech Recognition

[Header image generated from craiyon.com]

We have long had success using semi-supervised training for domain adaptation of ASR models: even limited amounts of in-domain untranscribed data would produce large improvements. However, that experience was with hybrid models, and we have not observed the same effect with sequence-to-sequence models. By combining our semi-supervised approach with unsupervised representation learning (also referred to as self-supervised learning), we were able to produce much larger gains in performance.
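As a rough illustration of the semi-supervised recipe, the sketch below pseudo-labels untranscribed in-domain audio with a seed model and keeps only the confident hypotheses. This is a minimal sketch, not the actual pipeline from the paper: `seed_model`, its `transcribe` method, and the confidence threshold are all hypothetical stand-ins.

```python
# Minimal sketch of semi-supervised pseudo-labeling, under assumed APIs:
# `seed_model.transcribe` is a hypothetical method returning a hypothesis
# string and a confidence score; the 0.9 threshold is illustrative.

def pseudo_label(seed_model, untranscribed_audio, min_confidence=0.9):
    """Decode untranscribed in-domain audio and keep confident hypotheses."""
    pseudo_corpus = []
    for utterance in untranscribed_audio:
        hypothesis, confidence = seed_model.transcribe(utterance)
        if confidence >= min_confidence:
            pseudo_corpus.append((utterance, hypothesis))
    return pseudo_corpus

# The model is then retrained on the original transcribed data
# plus the confidently pseudo-labeled in-domain pairs.
```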

Unsupervised learning has become very successful in NLP, vision, and speech in the last few years. Along with this success, there has been a proliferation of approaches and techniques. Our work was inspired by HuBERT, which clusters the untranscribed data and treats the cluster assignments as pseudolabels.
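The core of that idea fits in a few lines. The sketch below clusters frame-level features with k-means and uses the cluster IDs as discrete prediction targets; the feature dimension, cluster count, and random data are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def make_pseudolabels(frame_features: np.ndarray, n_clusters: int = 100):
    """Cluster frame-level features; each frame's cluster ID becomes its
    pseudolabel for masked-prediction training, as in HuBERT."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(frame_features)
    return kmeans.labels_  # shape: (num_frames,)

# Illustration: 10k frames of 39-dim features -> one of 100 discrete targets.
frames = np.random.randn(10_000, 39)
targets = make_pseudolabels(frames)
```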

While the HuBERT approach generates the first set of clusters from MFCC features, we were not working on a purely unsupervised task. We initially trained a Conformer-based encoder on out-of-domain data. The representations from this out-of-domain encoder were then used as features for clustering.
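In place of MFCCs, the clustering input then looks roughly like the sketch below. Here `conformer_encoder` is a stand-in for the out-of-domain supervised model, and the layer choice and batching details are assumptions rather than the paper's exact setup.

```python
import torch

@torch.no_grad()
def extract_features(conformer_encoder: torch.nn.Module, utterances):
    """Run each waveform through the out-of-domain encoder and pool the
    frame-level hidden states into one matrix for k-means clustering."""
    conformer_encoder.eval()
    feats = []
    for waveform in utterances:               # each: (1, num_samples)
        hidden = conformer_encoder(waveform)  # (1, num_frames, hidden_dim)
        feats.append(hidden.squeeze(0))
    return torch.cat(feats, dim=0).cpu().numpy()

# These features replace MFCCs as the input to `make_pseudolabels` above.
```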

The work was performed in the context of the IARPA MATERIAL program. For a set of three languages we had transcribed conversational speech and untranscribed broadcast news. The goal was to create a model that performed well in both domains even though we had no transcribed broadcast data.

The key contributions of our work are as follows:

- We combine text-augmented semi-supervised training with unsupervised representation learning to adapt an autoregressive ASR model to a new domain.
- Rather than bootstrapping the HuBERT-style clusters from MFCC features, we derive them from the representations of an encoder trained on out-of-domain data.
- We demonstrate the combined approach on three low-resourced languages from the IARPA MATERIAL program, where the target broadcast domain has no transcribed data.

Comments? Send me an email.