Whale: Large-Scale multilingual ASR model with w2v-BERT and E-Branchformer with large speech data
This paper reports on the development of a large-scale speech recognition model, Whale. Like models such as Whisper and OWSM, Whale leverages both a large model size and a diverse, extensive dataset. Its architecture integrates the w2v-BERT self-supervised model, an encoder-decoder backbone built on E-Branchformer, and a joint CTC-attention decoding strategy. The training corpus comprises varied speech data drawn not only from public corpora but also from in-house data, enhancing the model's robustness to different speaking styles and acoustic conditions. In evaluations on multiple benchmarks, Whale achieves performance comparable to existing models. In particular, it attains a word error rate of 2.4% on the LibriSpeech test-clean set and a character error rate of 3.4% on the CSJ eval3 set, outperforming Whisper large-v3 and OWSM v3.1.
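For readers unfamiliar with the joint CTC-attention decoding mentioned above, the standard hybrid CTC/attention formulation is sketched below; this is the commonly used scoring rule rather than a detail stated in the abstract, and the interpolation weight used by Whale is an assumption left unspecified here. Decoding selects the hypothesis that maximizes an interpolation of the two branch log-probabilities:

\[
\hat{Y} = \arg\max_{Y} \left\{ \lambda \log p_{\mathrm{CTC}}(Y \mid X) + (1 - \lambda) \log p_{\mathrm{att}}(Y \mid X) \right\},
\]

where X is the input speech, Y a candidate transcription, and \(\lambda \in [0, 1]\) weights the CTC branch against the attention decoder during beam search.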
Yosuke Kashiwagi, Hayato Futami, Emiru Tsunoo, Satoshi Asakawa
Computing Technology, Computer Technology
Yosuke Kashiwagi, Hayato Futami, Emiru Tsunoo, Satoshi Asakawa. Whale: Large-Scale multilingual ASR model with w2v-BERT and E-Branchformer with large speech data [EB/OL]. (2025-06-02) [2025-07-21]. https://arxiv.org/abs/2506.01439.