Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities
This paper proposes a single-stage training approach that semantically aligns three modalities (audio, visual, and text) using a contrastive learning framework. Contrastive training has gained prominence for multimodal alignment, as it leverages large-scale unlabeled data to learn shared representations. Existing deep learning approaches for trimodal alignment use two stages that separately align the visual-text and audio-text modalities, which leads to mismatched data distributions and suboptimal alignment. Leveraging the AVCaps dataset, which provides audio, visual, and audio-visual captions for video clips, our method jointly optimizes the representations of all three modalities with contrastive training. Our results demonstrate that the single-stage approach outperforms the two-stage method, achieving a two-fold improvement in audio-based visual retrieval and highlighting the advantages of unified multimodal representation learning.
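The abstract does not spell out the exact objective, but single-stage trimodal contrastive training is commonly realized as a CLIP-style symmetric InfoNCE loss summed over the modality pairs and optimized jointly. The sketch below illustrates one plausible form under that assumption; the function names, temperature value, and pairing scheme are illustrative, not the paper's stated implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings.

    Matching items share the same batch index; all other items in the
    batch serve as negatives.
    """
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)    # diagonal entries are positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def trimodal_contrastive_loss(audio_emb, visual_emb, text_emb, temperature=0.07):
    """Single-stage objective (assumed form): pairwise contrastive losses over
    the three modality pairs, summed and optimized in one training run rather
    than in separate visual-text and audio-text stages."""
    return (info_nce(audio_emb, text_emb, temperature) +
            info_nce(visual_emb, text_emb, temperature) +
            info_nce(audio_emb, visual_emb, temperature))
```

Summing the pairwise terms lets a single optimization step pull all three encoders toward one shared embedding space, which is the contrast with the two-stage pipeline criticized above.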
Parthasaarathy Sudarsanam, Irene Martín-Morató, Tuomas Virtanen
Computing Technology, Computer Technology
Parthasaarathy Sudarsanam, Irene Martín-Morató, Tuomas Virtanen. Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities [EB/OL]. (2025-05-20) [2025-07-16]. https://arxiv.org/abs/2505.14562.