AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations
Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint representations of both modalities. In this paper, we introduce AV-data2vec, which addresses these challenges and builds audio-visual representations by predicting contextualized target representations, an approach that has been successful in the uni-modal case. The model uses a shared transformer encoder for both audio and video and can combine both modalities to improve speech recognition. Results on LRS3 show that AV-data2vec consistently outperforms existing methods under all settings with the same amount of data and model size.
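The sketch below illustrates the training scheme the abstract describes: a shared transformer encoder consumes fused audio and video features, a masked student regresses contextualized targets produced by an exponential-moving-average (EMA) teacher on the unmasked input. It is a minimal, simplified sketch; all module names, feature dimensions, the additive fusion, and the single-layer targets (the actual data2vec-style recipe averages several teacher layers) are assumptions for illustration, not the paper's exact configuration.

```python
import copy
import torch
import torch.nn as nn

class AVData2VecSketch(nn.Module):
    """Illustrative sketch of data2vec-style audio-visual pre-training.
    Names, dimensions, and the fusion scheme are assumptions, not the
    published AV-data2vec architecture."""

    def __init__(self, dim=768, n_layers=12, n_heads=12, ema_decay=0.999):
        super().__init__()
        self.audio_proj = nn.Linear(80, dim)    # e.g. log-mel filterbank frames
        self.video_proj = nn.Linear(512, dim)   # e.g. lip-region visual features
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.student = nn.TransformerEncoder(layer, n_layers)   # shared encoder for both modalities
        self.teacher = copy.deepcopy(self.student)               # EMA copy that produces targets
        for p in self.teacher.parameters():
            p.requires_grad = False
        self.ema_decay = ema_decay
        self.mask_emb = nn.Parameter(torch.zeros(dim))

    def fuse(self, audio, video):
        # Simple additive fusion of frame-aligned audio and video features
        return self.audio_proj(audio) + self.video_proj(video)

    @torch.no_grad()
    def update_teacher(self):
        # EMA update of the teacher weights from the student after each step
        for t, s in zip(self.teacher.parameters(), self.student.parameters()):
            t.mul_(self.ema_decay).add_(s, alpha=1 - self.ema_decay)

    def forward(self, audio, video, mask):
        x = self.fuse(audio, video)                               # (B, T, dim)
        with torch.no_grad():
            targets = self.teacher(x)                             # contextualized targets from unmasked input
        # Student sees the masked input and regresses the teacher targets at masked positions
        x_masked = torch.where(mask.unsqueeze(-1), self.mask_emb.expand_as(x), x)
        preds = self.student(x_masked)
        return nn.functional.mse_loss(preds[mask], targets[mask])
```

As a usage note, one training step would compute the loss on a batch of paired audio/video frames with a random boolean mask, back-propagate through the student only, and then call update_teacher() to move the EMA weights.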
Wei-Ning Hsu, Jiachen Lian, Alexei Baevski, Michael Auli
Subjects: Computing and Computer Technology; Communication; Wireless Communication
Wei-Ning Hsu, Jiachen Lian, Alexei Baevski, Michael Auli. AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations [EB/OL]. (2023-02-09) [2025-06-08]. https://arxiv.org/abs/2302.06419