
Advancing Video Self-Supervised Learning via Image Foundation Models


Source: arXiv
Abstract

In the past decade, image foundation models (IFMs) have achieved unprecedented progress. However, the potential of directly using IFMs for video self-supervised representation learning has largely been overlooked. In this study, we propose an advancing video self-supervised learning (AdViSe) approach, aimed at significantly reducing the training overhead of video representation models using pre-trained IFMs. Specifically, we first introduce temporal modeling modules (ResNet3D) to IFMs, constructing a video representation model. We then employ a video self-supervised learning approach, playback rate perception, to train temporal modules while freezing the IFM components. Experiments on UCF101 demonstrate that AdViSe achieves performance comparable to state-of-the-art methods while reducing training time by $3.4\times$ and GPU memory usage by $8.2\times$. This study offers fresh insights into low-cost video self-supervised learning based on pre-trained IFMs. Code is available at https://github.com/JingwWu/advise-video-ssl.
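
As a rough illustration of the playback rate perception pretext task mentioned in the abstract, the sketch below samples a clip at a randomly chosen frame stride and uses the index of that stride as the classification label. The rate set `PLAYBACK_RATES`, the function name `sample_clip_indices`, and the sampling scheme are assumptions made for illustration; they are not taken from the paper or its repository. In AdViSe the resulting clips would be passed through the frozen IFM and the trainable temporal modules, which is not shown here.

```python
import random

# Hypothetical candidate playback rates (frame strides).
# The abstract does not state which rates are used.
PLAYBACK_RATES = (1, 2, 4, 8)


def sample_clip_indices(num_frames, clip_len, rates=PLAYBACK_RATES, rng=random):
    """Return (frame_indices, label) for one playback-rate pretext sample.

    frame_indices: positions of the frames to read from the video.
    label: index into `rates` of the stride that was used; a model
           trained on this task predicts the label from the clip.
    """
    # Keep only the rates whose clip span fits inside the video.
    valid = [i for i, r in enumerate(rates)
             if (clip_len - 1) * r + 1 <= num_frames]
    if not valid:
        raise ValueError("video too short for any playback rate")

    label = rng.choice(valid)
    rate = rates[label]

    # Number of source frames covered by the clip at this stride.
    span = (clip_len - 1) * rate + 1
    start = rng.randrange(num_frames - span + 1)

    indices = [start + k * rate for k in range(clip_len)]
    return indices, label
```

For example, calling `sample_clip_indices(64, 8)` yields eight frame indices with a constant stride drawn from the assumed rate set, plus the label identifying that stride.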

Jingwei Wu, Zhewei Huang, Chang Liu

10.1016/j.patrec.2025.03.015

Computing technology, computer technology

Jingwei Wu, Zhewei Huang, Chang Liu. Advancing Video Self-Supervised Learning via Image Foundation Models[EB/OL]. (2025-05-25)[2025-06-27]. https://arxiv.org/abs/2505.19218.
