
Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides

Source: arXiv

Abstract

Incorporating visual modalities to assist Automatic Speech Recognition (ASR) tasks has led to significant improvements. However, existing Audio-Visual Speech Recognition (AVSR) datasets and methods typically rely solely on lip-reading information or contextual video of the speaker, neglecting the potential of combining these different, valuable visual cues within the speaking context. In this paper, we release a multimodal Chinese AVSR dataset, Chinese-LiPS, comprising 100 hours of speech, video, and corresponding manual transcriptions, with the visual modality encompassing both lip-reading information and the presentation slides used by the speaker. Based on Chinese-LiPS, we develop a simple yet effective pipeline, LiPS-AVSR, which leverages both lip-reading and presentation-slide information as visual modalities for AVSR tasks. Experiments show that lip-reading and presentation-slide information improve ASR performance by approximately 8% and 25%, respectively, with a combined improvement of about 35%. The dataset is available at https://kiri0824.github.io/Chinese-LiPS/
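The abstract does not specify how LiPS-AVSR fuses the slide modality with the audio stream, so the sketch below is only an illustration of why slide content can help ASR, not the authors' pipeline: it OCRs a slide image and passes the recovered text to Whisper as a decoding prompt, biasing recognition toward on-slide terminology. The file names `slide.png` and `speech.wav` are hypothetical placeholders.

import whisper            # pip install openai-whisper
import pytesseract        # pip install pytesseract (needs Tesseract with chi_sim data)
from PIL import Image

# NOTE: illustrative sketch, not the LiPS-AVSR method from the paper.

# 1. Recover the text shown on the speaker's slide (Simplified Chinese OCR).
slide_text = pytesseract.image_to_string(Image.open("slide.png"), lang="chi_sim")

# 2. Transcribe the speech, conditioning the decoder on the slide text.
#    `initial_prompt` nudges Whisper toward the domain terms it contains.
model = whisper.load_model("small")
result = model.transcribe("speech.wav", language="zh",
                          initial_prompt=slide_text.strip())

print(result["text"])

Prompt biasing of this kind is a lightweight way to exploit slide text without retraining; a dedicated AVSR model would instead fuse visual features (lip regions, slide frames) with the acoustic encoder.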

Jinghua Zhao, Yuhang Jia, Shiyao Wang, Jiaming Zhou, Hui Wang, Yong Qin

Subject: Computing Technology; Computer Technology

Jinghua Zhao, Yuhang Jia, Shiyao Wang, Jiaming Zhou, Hui Wang, Yong Qin. Chinese-LiPS: A Chinese audio-visual speech recognition dataset with Lip-reading and Presentation Slides [EB/OL]. (2025-04-21) [2025-05-09]. https://arxiv.org/abs/2504.15066