
Audio-Visual Segmentation

Source: arXiv
Abstract

We propose to explore a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos. Two settings are studied with this benchmark: 1) semi-supervised audio-visual segmentation with a single sound source and 2) fully-supervised audio-visual segmentation with multiple sound sources. To deal with the AVS problem, we propose a novel method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage the audio-visual mapping during training. Quantitative and qualitative experiments on the AVSBench compare our approach to several existing methods from related tasks, demonstrating that the proposed method is promising for building a bridge between the audio and pixel-wise visual semantics. Code is available at https://github.com/OpenNLPLab/AVSBench.
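The core idea in the abstract, injecting audio semantics as guidance for pixel-level segmentation, can be pictured as cross-modal attention in which each visual pixel queries the audio features. The sketch below is a minimal, hypothetical PyTorch illustration of that pattern; the class, dimensions, and variable names are assumptions of this note, not the authors' actual interaction module, whose implementation is in the linked GitHub repository.

```python
import torch
import torch.nn as nn


class AudioVisualInteraction(nn.Module):
    """Illustrative cross-modal attention: audio features act as keys/values,
    visual pixels as queries, so audio semantics can guide segmentation.
    A hypothetical sketch, not the module proposed in the paper."""

    def __init__(self, vis_dim: int = 256, aud_dim: int = 128):
        super().__init__()
        self.to_q = nn.Conv2d(vis_dim, vis_dim, kernel_size=1)  # per-pixel queries
        self.to_k = nn.Linear(aud_dim, vis_dim)                 # audio keys
        self.to_v = nn.Linear(aud_dim, vis_dim)                 # audio values
        self.scale = vis_dim ** -0.5

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis: (B, C, H, W) frame features; aud: (B, T, D) audio features over T clips
        b, c, h, w = vis.shape
        q = self.to_q(vis).flatten(2).transpose(1, 2)           # (B, H*W, C)
        k, v = self.to_k(aud), self.to_v(aud)                   # (B, T, C) each
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, H*W, T)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)    # audio-informed pixels
        return vis + out                                        # residual fusion


# Usage sketch: fuse audio into a feature map, then predict a per-pixel mask.
fusion = AudioVisualInteraction()
vis = torch.randn(2, 256, 28, 28)   # backbone features for one frame
aud = torch.randn(2, 5, 128)        # e.g. one audio embedding per clip
mask_head = nn.Conv2d(256, 1, kernel_size=1)
mask_logits = mask_head(fusion(vis, aud))  # (2, 1, 28, 28) sounding-object mask
```

The residual connection in this sketch keeps the visual backbone features intact while letting the audio signal reweight them, consistent with the abstract's framing of audio as guidance for the visual segmentation process rather than a replacement for it.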

Authors: Meng Wang, Jiayi Zhang, Jianyuan Wang, Lingpeng Kong, Stan Birchfield, Jing Zhang, Yiran Zhong, Jinxing Zhou, Dan Guo, Weixuan Sun

Subject: Television and Broadcast Communications

Meng Wang, Jiayi Zhang, Jianyuan Wang, Lingpeng Kong, Stan Birchfield, Jing Zhang, Yiran Zhong, Jinxing Zhou, Dan Guo, Weixuan Sun. Audio-Visual Segmentation [EB/OL]. (2022-07-11) [2025-05-04]. https://arxiv.org/abs/2207.05042.
