首页|The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages

The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages

来源：

英文摘要

The development of high-performing, robust, and reliable speech technologies depends on large, high-quality datasets. However, African languages -- including our focus, Igbo, Hausa, and Yoruba -- remain under-represented due to insufficient data. Popular voice-enabled technologies do not support any of the 2000+ African languages, limiting accessibility for circa one billion people. While previous dataset efforts exist for the target languages, they lack the scale and diversity needed for robust speech models. To bridge this gap, we introduce the NaijaVoices dataset, a 1,800-hour speech-text dataset with 5,000+ speakers. We outline our unique data collection approach, analyze its acoustic diversity, and demonstrate its impact through finetuning experiments on automatic speech recognition, averagely achieving 75.86% (Whisper), 52.06% (MMS), and 42.33% (XLSR) WER improvements. These results highlight NaijaVoices' potential to advance multilingual speech processing for African languages.

作者：Chris Emezue、NaijaVoices Community、Busayo Awobade、Abraham Owodunni、Handel Emezue、Gloria Monica Tobechukwu Emezue、Nefertiti Nneoma Emezue、Sewade Ogun、Bunmi Akinremi、David Ifeoluwa Adelani、Chris Pal

作者单位：

学科分类：非洲诸语言

推荐引用：Chris Emezue,NaijaVoices Community,Busayo Awobade,Abraham Owodunni,Handel Emezue,Gloria Monica Tobechukwu Emezue,Nefertiti Nneoma Emezue,Sewade Ogun,Bunmi Akinremi,David Ifeoluwa Adelani,Chris Pal.The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages[EB/OL].(2025-05-26)[2025-06-21].https://arxiv.org/abs/2505.20564.点此复制

The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages

The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages

评论