
Combining Unsupervised and Text Augmented Semi-Supervised Learning for Low Resourced Autoregressive Speech Recognition

[Header image generated from craiyon.com]

We have long had success using semi-supervised training for domain adaptation of ASR models: even limited amounts of in-domain untranscribed data would produce large improvements. However, that experience was with hybrid models, and we have not observed the same effect with sequence-to-sequence models. By combining our semi-supervised approach with unsupervised representation learning (also referred to as self-supervised learning), we were able to produce much larger gains in performance.
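As a rough illustration of the semi-supervised recipe, the sketch below pseudo-labels untranscribed in-domain audio with a seed model and keeps only the confident hypotheses. This is a minimal sketch, not the actual pipeline from the paper: `seed_model`, its `transcribe` method, and the confidence threshold are all hypothetical stand-ins.

```python
# Minimal sketch of semi-supervised pseudo-labeling, under assumed APIs:
# `seed_model.transcribe` is a hypothetical method returning a hypothesis
# string and a confidence score; the 0.9 threshold is illustrative.

def pseudo_label(seed_model, untranscribed_audio, min_confidence=0.9):
    """Decode untranscribed in-domain audio and keep confident hypotheses."""
    pseudo_corpus = []
    for utterance in untranscribed_audio:
        hypothesis, confidence = seed_model.transcribe(utterance)
        if confidence >= min_confidence:
            pseudo_corpus.append((utterance, hypothesis))
    return pseudo_corpus

# The model is then retrained on the original transcribed data
# plus the confidently pseudo-labeled in-domain pairs.
```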

Unsupervised learning has become very successful in NLP, vision, and speech in the last few years. Along with this success, there has been a proliferation of approaches and techniques. Our work was inspired by HuBERT, which clusters the untranscribed data and treats the cluster assignments as pseudolabels.
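The core of that idea fits in a few lines. The sketch below clusters frame-level features with k-means and uses the cluster IDs as discrete prediction targets; the feature dimension, cluster count, and random data are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def make_pseudolabels(frame_features: np.ndarray, n_clusters: int = 100):
    """Cluster frame-level features; each frame's cluster ID becomes its
    pseudolabel for masked-prediction training, as in HuBERT."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10).fit(frame_features)
    return kmeans.labels_  # shape: (num_frames,)

# Illustration: 10k frames of 39-dim features -> one of 100 discrete targets.
frames = np.random.randn(10_000, 39)
targets = make_pseudolabels(frames)
```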

While the HuBERT approach generates the first set of clusters from MFCC features, we were not working on a purely unsupervised task. We initially trained a Conformer-based encoder on out-of-domain data. The representations from this out-of-domain encoder were then used as features for clustering.
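In place of MFCCs, the clustering input then looks roughly like the sketch below. Here `conformer_encoder` is a stand-in for the out-of-domain supervised model, and the layer choice and batching details are assumptions rather than the paper's exact setup.

```python
import torch

@torch.no_grad()
def extract_features(conformer_encoder: torch.nn.Module, utterances):
    """Run each waveform through the out-of-domain encoder and pool the
    frame-level hidden states into one matrix for k-means clustering."""
    conformer_encoder.eval()
    feats = []
    for waveform in utterances:               # each: (1, num_samples)
        hidden = conformer_encoder(waveform)  # (1, num_frames, hidden_dim)
        feats.append(hidden.squeeze(0))
    return torch.cat(feats, dim=0).cpu().numpy()

# These features replace MFCCs as the input to `make_pseudolabels` above.
```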

The work was performed in the context of the IARPA MATERIAL program. For a set of three languages we had transcribed conversational speech and untranscribed broadcast news. The goal was to create a model that performed well in both domains even though we had no transcribed broadcast data.

The key contributions of our work are as follows:

- We combine text-augmented semi-supervised training with unsupervised representation learning to adapt an autoregressive ASR model to a new domain.
- Rather than bootstrapping the HuBERT-style clusters from MFCC features, we derive them from the representations of an encoder trained on out-of-domain data.
- We demonstrate the combined approach on three low-resourced languages from the IARPA MATERIAL program, where the target broadcast domain has no transcribed data.

Comments? Send me an email.