RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations
We introduce RASMALAI, a large-scale speech dataset with rich text descriptions, designed to advance controllable and expressive text-to-speech (TTS) synthesis for 23 Indian languages and English. It comprises 13,000 hours of speech and 24 million text-description annotations with fine-grained attributes such as speaker identity, accent, emotion, style, and background conditions. Using RASMALAI, we develop IndicParlerTTS, the first open-source, text-description-guided TTS model for Indian languages. Systematic evaluation demonstrates its ability to generate high-quality speech for named speakers, reliably follow text descriptions, and accurately synthesize specified attributes. It also transfers expressive characteristics effectively both within and across languages. IndicParlerTTS consistently achieves strong performance across these evaluations, setting a new standard for controllable multilingual expressive speech synthesis in Indian languages.
Ashwin Sankar, Yoach Lacombe, Sherry Thomas, Praveen Srinivasa Varadhan, Sanchit Gandhi, Mitesh M Khapra
Austroasiatic languages; computing technology, computer technology
Ashwin Sankar, Yoach Lacombe, Sherry Thomas, Praveen Srinivasa Varadhan, Sanchit Gandhi, Mitesh M Khapra. RASMALAI: Resources for Adaptive Speech Modeling in Indian Languages with Accents and Intonations [EB/OL]. (2025-05-24) [2025-07-17]. https://arxiv.org/abs/2505.18609