Prosody Labeling with Phoneme-BERT and Speech Foundation Models
This paper proposes a model for automatic prosodic label annotation, where the predicted labels can be used to train a prosody-controllable text-to-speech model. The proposed model utilizes not only rich acoustic features extracted by a self-supervised-learning (SSL)-based model or a Whisper encoder, but also linguistic features obtained from phoneme-input pretrained linguistic foundation models such as PnG BERT and PL-BERT. The concatenation of acoustic and linguistic features is used to predict phoneme-level prosodic labels. Experimental evaluation on Japanese prosodic labels, including pitch accents and phrase break indices, showed that combining speech and linguistic foundation models improves prediction accuracy over using either type of input alone. Specifically, we achieved 89.8% prediction accuracy for accent labels, 93.2% for high-low pitch accents, and 94.3% for break indices.
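The following minimal sketch (PyTorch) illustrates the concatenation step described in the abstract: phoneme-level acoustic features from a speech foundation model are concatenated with phoneme-level linguistic features from a phoneme-input BERT and passed to a classifier over prosodic labels. The feature dimensions, the GRU encoder, and the number of label classes are assumptions made for illustration; the paper's actual architecture is not specified in the abstract.

import torch
import torch.nn as nn

class ProsodyLabeler(nn.Module):
    # Hypothetical sketch: concatenate phoneme-aligned acoustic and
    # linguistic features, then classify a prosodic label per phoneme.
    def __init__(self, acoustic_dim=1024, linguistic_dim=768,
                 hidden_dim=256, num_labels=16):
        super().__init__()
        self.proj = nn.Linear(acoustic_dim + linguistic_dim, hidden_dim)
        self.encoder = nn.GRU(hidden_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, acoustic_feats, linguistic_feats):
        # acoustic_feats:   (batch, num_phonemes, acoustic_dim),
        #   e.g. SSL / Whisper-encoder frames pooled to phoneme level
        # linguistic_feats: (batch, num_phonemes, linguistic_dim),
        #   e.g. PnG BERT / PL-BERT phoneme embeddings
        x = torch.cat([acoustic_feats, linguistic_feats], dim=-1)
        x = torch.relu(self.proj(x))
        x, _ = self.encoder(x)
        return self.classifier(x)  # (batch, num_phonemes, num_labels)

# Usage with random tensors standing in for real phoneme-level features:
model = ProsodyLabeler()
logits = model(torch.randn(2, 30, 1024), torch.randn(2, 30, 768))
predicted_labels = logits.argmax(dim=-1)  # phoneme-level prosodic labels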
Tomoki Koriyama
Computational Linguistics Technology; Computer Technology
Tomoki Koriyama. Prosody Labeling with Phoneme-BERT and Speech Foundation Models [EB/OL]. (2025-07-05) [2025-07-16]. https://arxiv.org/abs/2507.03912.