Improved Dysarthric Speech to Text Conversion via TTS Personalization
We present a case study on developing a customized speech-to-text system for a Hungarian speaker with severe dysarthria. State-of-the-art automatic speech recognition (ASR) models struggle with zero-shot transcription of dysarthric speech, yielding high error rates. To improve performance with limited real dysarthric data, we fine-tune an ASR model using synthetic speech generated via a personalized text-to-speech (TTS) system. We introduce a method for generating synthetic dysarthric speech with controlled severity by leveraging premorbid recordings of the same speaker and speaker embedding interpolation, enabling ASR fine-tuning on a continuum of impairment levels. Fine-tuning on both real and synthetic dysarthric speech reduces the character error rate (CER) from 36-51% (zero-shot) to 7.3%. Our monolingual FastConformer_Hu ASR model significantly outperforms Whisper-turbo when fine-tuned on the same data, and the inclusion of synthetic speech contributes an 18% relative CER reduction. These results highlight the potential of personalized ASR systems for improving accessibility for individuals with severe speech impairments.
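The severity-control idea in the abstract can be illustrated with a minimal sketch: interpolating between a premorbid (healthy) speaker embedding and a current (dysarthric) one to condition a TTS model at graded impairment levels. This is an assumption-laden illustration, not the paper's implementation; the function name, the linear interpolation, and the L2 renormalization (common for unit-norm speaker encoders) are all choices made here for clarity.

```python
import numpy as np

def interpolate_speaker_embedding(emb_healthy, emb_dysarthric, alpha):
    """Blend a premorbid and a dysarthric speaker embedding.

    alpha = 0.0 -> fully healthy voice, alpha = 1.0 -> fully dysarthric.
    The result is L2-renormalized, since many TTS speaker encoders
    expect unit-norm embeddings (an assumption in this sketch).
    """
    emb = (1.0 - alpha) * emb_healthy + alpha * emb_dysarthric
    norm = np.linalg.norm(emb)
    return emb / norm if norm > 0 else emb

# Sweeping alpha over a grid yields conditioning vectors along a
# severity continuum, from which synthetic training utterances at
# intermediate impairment levels could be generated.
severity_grid = [0.0, 0.25, 0.5, 0.75, 1.0]
```

In this setup, fine-tuning data would be synthesized at each grid point, exposing the ASR model to a range of impairment severities rather than only the two endpoints.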
Péter Mihajlik, Éva Székely, Piroska Barta, Máté Soma Kádár, Gergely Dobsinszki, László Tóth
Computing and Computer Technology; Linguistics; Uralic Languages (Finno-Ugric)
Péter Mihajlik, Éva Székely, Piroska Barta, Máté Soma Kádár, Gergely Dobsinszki, László Tóth. Improved Dysarthric Speech to Text Conversion via TTS Personalization [EB/OL]. (2025-08-08) [2025-08-24]. https://arxiv.org/abs/2508.06391