
Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities

Source: arXiv
Abstract

This paper proposes a single-stage training approach that semantically aligns three modalities (audio, visual, and text) using a contrastive learning framework. Contrastive training has gained prominence for multimodal alignment, utilizing large-scale unlabeled data to learn shared representations. Existing deep learning approaches to trimodal alignment involve two stages that separately align the visual-text and audio-text modalities. This approach suffers from mismatched data distributions, resulting in suboptimal alignment. Leveraging the AVCaps dataset, which provides audio, visual, and audio-visual captions for video clips, our method jointly optimizes the representations of all modalities using contrastive training. Our results demonstrate that the single-stage approach outperforms the two-stage method, achieving a two-fold improvement in audio-based visual retrieval, highlighting the advantages of unified multimodal representation learning.
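
To make the single-stage idea concrete, below is a minimal sketch (not taken from the paper) of a joint trimodal contrastive objective: all three pairwise alignments are optimized in one loss rather than in separate stages. The symmetric InfoNCE formulation, the equal weighting of the three pairs, the temperature value, and the random stand-ins for encoder outputs are all assumptions for illustration; the paper's exact loss and hyperparameters may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of paired embeddings."""
    a = F.normalize(a, dim=-1)                 # unit-length embeddings
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature           # (batch, batch) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)  # matching pairs lie on the diagonal
    # Contrast in both directions (a -> b and b -> a) and average.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def trimodal_loss(audio_emb: torch.Tensor,
                  visual_emb: torch.Tensor,
                  text_emb: torch.Tensor) -> torch.Tensor:
    """Single-stage objective: jointly align all three modality pairs at once."""
    return (info_nce(audio_emb, text_emb)
            + info_nce(visual_emb, text_emb)
            + info_nce(audio_emb, visual_emb)) / 3.0

# Toy usage: random tensors stand in for per-modality encoder outputs.
batch, dim = 8, 512
loss = trimodal_loss(torch.randn(batch, dim),
                     torch.randn(batch, dim),
                     torch.randn(batch, dim))
print(loss.item())
```

Because all three pairwise losses share one optimization step, the embeddings are trained against a single data distribution, which is the property the single-stage approach uses to avoid the distribution mismatch of separately trained stages.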

Parthasaarathy Sudarsanam, Irene Martín-Morató, Tuomas Virtanen

Computing Technology, Computer Technology

Parthasaarathy Sudarsanam, Irene Martín-Morató, Tuomas Virtanen. Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities [EB/OL]. (2025-05-20) [2025-07-16]. https://arxiv.org/abs/2505.14562.
