
Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features

Source: arXiv

Abstract

Detecting actions in untrimmed videos should not be limited to a small, closed set of classes. We present a simple yet effective strategy for open-vocabulary temporal action detection using pretrained image-text co-embeddings. Despite being trained on static images rather than videos, these image-text co-embeddings enable open-vocabulary performance competitive with fully supervised models. We show that performance can be further improved by ensembling the image-text features with features encoding local motion, such as optical-flow-based features, or with other modalities, such as audio. In addition, we propose a more reasonable open-vocabulary evaluation setting for the ActivityNet dataset, where the category splits are based on similarity rather than random assignment.
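
The abstract describes scoring untrimmed video against arbitrary class names with pretrained image-text co-embeddings and fusing those scores with a motion (or audio) stream. The sketch below illustrates that general recipe only; it is not the authors' implementation. The embedding shapes, the late-fusion weight `alpha`, and the detection threshold are illustrative assumptions, and the precomputed embeddings stand in for whatever image-text model and flow/audio classifier one would actually use.

```python
# A minimal sketch (not the paper's code) of open-vocabulary per-frame scoring
# with image-text co-embeddings, plus late fusion with a second score stream.
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize vectors to unit length so the dot product is cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def open_vocab_scores(frame_embeds, text_embeds):
    """Cosine similarity between per-frame image embeddings (T, D)
    and class-name text embeddings (C, D); returns a (T, C) score matrix."""
    return l2_normalize(frame_embeds) @ l2_normalize(text_embeds).T


def ensemble(image_text_scores, motion_scores, alpha=0.5):
    """Late fusion of the image-text stream with a motion/audio stream.
    The weight alpha is an illustrative assumption, not a value from the paper."""
    return alpha * image_text_scores + (1.0 - alpha) * motion_scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, C, D = 300, 5, 512                    # frames, vocabulary size, embedding dim
    frame_embeds = rng.normal(size=(T, D))   # stand-in for CLIP-style frame features
    text_embeds = rng.normal(size=(C, D))    # stand-in for class-name text features
    motion_scores = rng.random(size=(T, C))  # stand-in for flow/audio classifier scores

    scores = ensemble(open_vocab_scores(frame_embeds, text_embeds), motion_scores)

    # Thresholding the fused per-frame scores and grouping consecutive frames above
    # the threshold would yield candidate temporal segments per class name.
    detections = scores > 0.6                # threshold chosen arbitrarily here
    print(detections.shape)                  # (T, C) boolean mask over frames/classes
```

In this kind of late-fusion setup the open vocabulary lives entirely in the text embeddings, so new action classes can be added at inference time by embedding their names, while the motion or audio stream only needs to produce per-frame scores for the same class list.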

Bryan Seybold, Xiuye Gu, Vighnesh Birodkar, Sudheendra Vijayanarasimhan, David A. Ross, Vivek Rathod, Austin Myers

Subject: Computing Technology, Computer Technology

Bryan Seybold, Xiuye Gu, Vighnesh Birodkar, Sudheendra Vijayanarasimhan, David A. Ross, Vivek Rathod, Austin Myers. Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features [EB/OL]. (2022-12-20) [2025-06-30]. https://arxiv.org/abs/2212.10596.
