
Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering


Abstract

Fine-tuning pretrained ASR models for specific domains is challenging for small organizations with limited labeled data and computational resources. Here, we explore different data selection pipelines and propose a robust approach that improves ASR adaptation by filtering pseudo-labels generated using Whisper (encoder-decoder) and Zipformer (transducer) models. Our approach integrates multiple selection strategies -- including word error rate (WER) prediction, named entity recognition (NER), and character error rate (CER) analysis -- to extract high-quality training segments. We evaluate our method on Whisper and Zipformer using a 7500-hour baseline, comparing it to a CER-based approach relying on hypotheses from three ASR systems. Fine-tuning on 7500 hours of pseudo-labeled call center data achieves 12.3% WER, while our filtering reduces the dataset to 100 hours (1.4%) with similar performance; a similar trend is observed on Fisher English.
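One stage of the selection described above, CER analysis across ASR hypotheses, can be illustrated with a minimal sketch. It assumes each audio segment carries two pseudo-label hypotheses (for example, one from Whisper and one from Zipformer) and keeps only segments where the two agree closely at the character level. The names `Segment`, `select_segments`, and the `max_cer` threshold of 0.1 are hypothetical and are not taken from the paper; the WER-prediction and NER stages are not shown.

```python
from dataclasses import dataclass


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance using a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of `hypothesis` measured against `reference`."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


@dataclass
class Segment:
    segment_id: str
    hyp_a: str  # hypothesis from the first ASR system (assumed, e.g. Whisper)
    hyp_b: str  # hypothesis from the second ASR system (assumed, e.g. Zipformer)


def select_segments(segments, max_cer=0.1):
    """Keep segments whose two hypotheses disagree by at most `max_cer`.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return [s for s in segments if cer(s.hyp_a, s.hyp_b) <= max_cer]


segments = [
    Segment("s1", "please hold the line", "please hold the line"),
    Segment("s2", "your account number", "you are a count member"),
]
kept = select_segments(segments)
print([s.segment_id for s in kept])
```

Treating cross-system agreement as a proxy for label quality means no reference transcripts are needed, which matches the pseudo-label setting of the abstract; in practice the threshold would have to be tuned on held-out data.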

Authors: Pradeep Rangappa, Andres Carofilis, Jeena Prakash, Shashi Kumar, Sergio Burdisso, Srikanth Madikeri, Esau Villatoro-Tello, Bidisha Sharma, Petr Motlicek, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke


Subject classification: Computing technology, computer technology

Recommended citation: Pradeep Rangappa, Andres Carofilis, Jeena Prakash, Shashi Kumar, Sergio Burdisso, Srikanth Madikeri, Esau Villatoro-Tello, Bidisha Sharma, Petr Motlicek, Kadri Hacioglu, Shankar Venkatesan, Saurabh Vyas, Andreas Stolcke. Efficient Data Selection for Domain Adaptation of ASR Using Pseudo-Labels and Multi-Stage Filtering [EB/OL]. (2025-06-04) [2025-07-09]. https://arxiv.org/abs/2506.03681.

