
Epic-Sounds: A Large-scale Dataset of Actions That Sound


Source: arXiv
English Abstract

We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of egocentric videos. We propose an annotation pipeline in which annotators temporally label distinguishable audio segments and describe the action that could have caused each sound. We identify actions that can be discriminated purely from audio by grouping these free-form descriptions of audio into classes. For actions that involve objects colliding, we collect human annotations of the materials of these objects (e.g. a glass object being placed on a wooden surface), which we verify from video, discarding ambiguities. Overall, EPIC-SOUNDS includes 78.4k categorised segments of audible events and actions, distributed across 44 classes, as well as 39.2k non-categorised segments. We train and evaluate state-of-the-art audio recognition and detection models on our dataset, for both audio-only and audio-visual methods. We also conduct analyses of: the temporal overlap between audio events, the temporal and label correlations between the audio and visual modalities, the ambiguities in annotating materials from audio-only input, the importance of audio-only labels, and the limitations of current models in understanding actions that sound.
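The abstract refers to per-class segment statistics and to the temporal overlap between audio events. The sketch below is a minimal illustration of how such statistics could be computed from a segment annotation table; the column names (video_id, start_sec, stop_sec, class) and the example rows are assumptions made for illustration, not the official EPIC-SOUNDS annotation schema.

```python
import pandas as pd

# Hypothetical annotation layout: one row per labelled audio segment with a
# video id, start/stop times in seconds, and a class label. The columns and
# values below are illustrative assumptions, not the released schema.
df = pd.DataFrame({
    "video_id": ["P01_01", "P01_01", "P01_01"],
    "start_sec": [0.0, 1.5, 4.0],
    "stop_sec": [2.0, 3.0, 5.0],
    "class": ["cut / chop", "water", "metal-only collision"],
})

# Per-class segment counts, analogous to the dataset's 44-class statistics.
print(df["class"].value_counts())

def overlap(a_start, a_stop, b_start, b_stop):
    """Length (seconds) of the intersection of two time intervals."""
    return max(0.0, min(a_stop, b_stop) - max(a_start, b_start))

# Pairwise temporal overlap between segments of the same video, mirroring the
# kind of overlap analysis described in the abstract.
for vid, group in df.groupby("video_id"):
    rows = group.reset_index(drop=True)
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            ov = overlap(rows.loc[i, "start_sec"], rows.loc[i, "stop_sec"],
                         rows.loc[j, "start_sec"], rows.loc[j, "stop_sec"])
            if ov > 0:
                print(vid, rows.loc[i, "class"], rows.loc[j, "class"], ov)
```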

Dima Damen, Andrew Zisserman, Evangelos Kazakos, Jaesung Huh, Jacob Chalk

Computing Technology, Computer Technology

Dima Damen, Andrew Zisserman, Evangelos Kazakos, Jaesung Huh, Jacob Chalk. Epic-Sounds: A Large-scale Dataset of Actions That Sound [EB/OL]. (2025-07-16) [2025-08-04]. https://arxiv.org/abs/2302.00646.
