
The role of audio-visual integration in the time course of phonetic encoding in self-supervised speech models

Source: arXiv

Abstract

Human speech perception is multimodal. In natural speech, lip movements can precede the corresponding voicing by a non-negligible gap of 100-300 ms, especially for specific consonants, affecting the time course of neural phonetic encoding in human listeners. However, it remains unexplored whether self-supervised learning models, which have been used to simulate audio-visual integration in humans, can capture this asynchrony between audio and visual cues. We compared AV-HuBERT, an audio-visual model, with audio-only HuBERT, using linear classifiers to track their phonetic decodability over time. We found that phoneme information becomes available in AV-HuBERT embeddings only about 20 ms earlier than in HuBERT, likely due to AV-HuBERT's lower temporal resolution and feature concatenation process. This suggests that AV-HuBERT does not adequately capture the temporal dynamics of audio-visual speech, limiting its suitability as a model of multimodal speech perception.
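For illustration, the sketch below shows one common way to implement the kind of linear-probing analysis the abstract describes: a logistic-regression classifier is trained on frame-level model embeddings at each time offset around phoneme onset, and its accuracy is tracked over offsets. This is not the authors' code; the array shapes, offsets, and random placeholder data stand in for actual HuBERT or AV-HuBERT hidden states aligned to phoneme labels.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Placeholder sizes: 500 phoneme tokens, 15 time offsets around onset,
# 256-dimensional embeddings, 40 phoneme classes.
n_tokens, n_offsets, dim, n_phonemes = 500, 15, 256, 40

# embeddings[i, t] = model frame embedding for token i at offset t
# (e.g. offsets spanning roughly -100 ms ... +180 ms around phoneme onset).
embeddings = rng.normal(size=(n_tokens, n_offsets, dim)).astype(np.float32)
labels = rng.integers(0, n_phonemes, size=n_tokens)

accuracy_over_time = []
for t in range(n_offsets):
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings[:, t, :], labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000)  # one linear classifier per offset
    probe.fit(X_train, y_train)
    accuracy_over_time.append(probe.score(X_test, y_test))

# The offset at which accuracy first rises above chance indicates when
# phoneme identity becomes linearly decodable from the embeddings; comparing
# these curves between models gives the timing difference reported above.
print(accuracy_over_time)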

Yi Wang, Oli Danyi Liu, Peter Bell

Subjects: Computational Linguistics; Computer Science

Yi Wang, Oli Danyi Liu, Peter Bell. The role of audio-visual integration in the time course of phonetic encoding in self-supervised speech models [EB/OL]. (2025-06-25) [2025-08-02]. https://arxiv.org/abs/2506.20361.
