Prosody Labeling with Phoneme-BERT and Speech Foundation Models
This paper proposes a model for automatic prosodic label annotation, where the predicted labels can be used to train a prosody-controllable text-to-speech model. The proposed model utilizes not only rich acoustic features extracted by a self-supervised-learning (SSL)-based model or a Whisper encoder, but also linguistic features obtained from phoneme-input pretrained linguistic foundation models such as PnG BERT and PL-BERT. The concatenation of acoustic and linguistic features is used to predict phoneme-level prosodic labels. Experimental evaluation on Japanese prosodic labels, including pitch accents and phrase break indices, showed that combining speech and linguistic foundation models improves prediction accuracy over using either type of input alone. Specifically, we achieved 89.8% prediction accuracy for accent labels, 93.2% for high-low pitch accents, and 94.3% for break indices.
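The following minimal sketch (PyTorch) illustrates the concatenation step described in the abstract: phoneme-level acoustic features from a speech foundation model are concatenated with phoneme-level linguistic features from a phoneme-input BERT and passed to a classifier over prosodic labels. The feature dimensions, the GRU encoder, and the number of label classes are assumptions made for illustration; the paper's actual architecture is not specified in the abstract.

import torch
import torch.nn as nn

class ProsodyLabeler(nn.Module):
    # Hypothetical sketch: concatenate phoneme-aligned acoustic and
    # linguistic features, then classify a prosodic label per phoneme.
    def __init__(self, acoustic_dim=1024, linguistic_dim=768,
                 hidden_dim=256, num_labels=16):
        super().__init__()
        self.proj = nn.Linear(acoustic_dim + linguistic_dim, hidden_dim)
        self.encoder = nn.GRU(hidden_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, acoustic_feats, linguistic_feats):
        # acoustic_feats:   (batch, num_phonemes, acoustic_dim),
        #   e.g. SSL / Whisper-encoder frames pooled to phoneme level
        # linguistic_feats: (batch, num_phonemes, linguistic_dim),
        #   e.g. PnG BERT / PL-BERT phoneme embeddings
        x = torch.cat([acoustic_feats, linguistic_feats], dim=-1)
        x = torch.relu(self.proj(x))
        x, _ = self.encoder(x)
        return self.classifier(x)  # (batch, num_phonemes, num_labels)

# Usage with random tensors standing in for real phoneme-level features:
model = ProsodyLabeler()
logits = model(torch.randn(2, 30, 1024), torch.randn(2, 30, 768))
predicted_labels = logits.argmax(dim=-1)  # phoneme-level prosodic labels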
Tomoki Koriyama
Computational Linguistics Technology; Computer Technology
Tomoki Koriyama. Prosody Labeling with Phoneme-BERT and Speech Foundation Models [EB/OL]. (2025-07-05) [2025-07-16]. https://arxiv.org/abs/2507.03912.