
Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features

Source: arXiv

Abstract

Detecting actions in untrimmed videos should not be limited to a small, closed set of classes. We present a simple yet effective strategy for open-vocabulary temporal action detection using pretrained image-text co-embeddings. Despite being trained on static images rather than videos, these image-text co-embeddings enable open-vocabulary performance competitive with fully supervised models. We show that performance can be further improved by ensembling the image-text features with features encoding local motion, such as optical-flow-based features, or with other modalities, such as audio. In addition, we propose a more reasonable open-vocabulary evaluation setting for the ActivityNet dataset, where the category splits are based on similarity rather than random assignment.
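
The abstract describes scoring untrimmed video against arbitrary class names with pretrained image-text co-embeddings and fusing those scores with a motion (or audio) stream. The sketch below illustrates that general recipe only; it is not the authors' implementation. The embedding shapes, the late-fusion weight `alpha`, and the detection threshold are illustrative assumptions, and the precomputed embeddings stand in for whatever image-text model and flow/audio classifier one would actually use.

```python
# A minimal sketch (not the paper's code) of open-vocabulary per-frame scoring
# with image-text co-embeddings, plus late fusion with a second score stream.
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit length so the dot product is cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def open_vocab_scores(frame_embeds, text_embeds):
    """Cosine similarity between per-frame image embeddings (T, D)
    and class-name text embeddings (C, D); returns a (T, C) score matrix."""
    return l2_normalize(frame_embeds) @ l2_normalize(text_embeds).T


def ensemble(image_text_scores, motion_scores, alpha=0.5):
    """Late fusion of the image-text stream with a motion/audio stream.
    The weight alpha is an illustrative assumption, not a value from the paper."""
    return alpha * image_text_scores + (1.0 - alpha) * motion_scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, C, D = 300, 5, 512                    # frames, vocabulary size, embedding dim
    frame_embeds = rng.normal(size=(T, D))   # stand-in for CLIP-style frame features
    text_embeds = rng.normal(size=(C, D))    # stand-in for class-name text features
    motion_scores = rng.random(size=(T, C))  # stand-in for flow/audio classifier scores

    scores = ensemble(open_vocab_scores(frame_embeds, text_embeds), motion_scores)

    # Thresholding the fused per-frame scores and grouping consecutive frames above
    # the threshold would yield candidate temporal segments per class name.
    detections = scores > 0.6                # threshold chosen arbitrarily here
    print(detections.shape)                  # (T, C) boolean mask over frames/classes
```

In this kind of late-fusion setup the open vocabulary lives entirely in the text embeddings, so new action classes can be added at inference time by embedding their names, while the motion or audio stream only needs to produce per-frame scores for the same class list.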

Bryan Seybold, Xiuye Gu, Vighnesh Birodkar, Sudheendra Vijayanarasimhan, David A. Ross, Vivek Rathod, Austin Myers

Subject: Computing Technology, Computer Technology

Bryan Seybold, Xiuye Gu, Vighnesh Birodkar, Sudheendra Vijayanarasimhan, David A. Ross, Vivek Rathod, Austin Myers. Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features [EB/OL]. (2022-12-20) [2025-06-30]. https://arxiv.org/abs/2212.10596.
