TRIM: A Self-Supervised Video Summarization Framework Maximizing Temporal Relative Information and Representativeness

Source: arXiv
English Abstract

The increasing ubiquity of video content and the corresponding demand for efficient access to meaningful information have made video summarization and video highlight detection a vital research area. However, many state-of-the-art methods depend heavily either on supervised annotations or on attention-based models, which are computationally expensive and brittle under distribution shifts, hindering cross-domain applicability across datasets. We introduce a pioneering self-supervised video summarization model that captures both spatial and temporal dependencies without the overhead of attention, RNNs, or transformers. Our framework integrates a novel set of Markov process-driven loss metrics and a two-stage self-supervised learning paradigm that ensures both performance and efficiency. Our approach achieves state-of-the-art performance on the SumMe and TVSum datasets, outperforming all existing unsupervised methods. It also rivals the best supervised models, demonstrating the potential of efficient, annotation-free architectures. This paves the way for more generalizable video summarization techniques and challenges the prevailing reliance on complex architectures.
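
The abstract names two objectives, a Markov process-driven temporal term and representativeness, without detailing their form. The PyTorch sketch below is a minimal, hypothetical reading of those two terms: the function names, the exact formulations, and the random stand-in inputs are assumptions for illustration, not the losses defined in the paper.

# Illustrative only: hypothetical surrogates for the two objectives in the
# paper's title; the actual TRIM losses are specified in the paper itself.
import torch
import torch.nn.functional as F

def temporal_relative_information(feats):
    # Treat softmax-normalized frame-to-frame similarities as a Markov
    # transition matrix and reward chains whose next step is predictable.
    # feats: (T, D) per-frame embeddings.
    T = feats.size(0)
    trans = F.softmax(feats @ feats.t(), dim=-1)           # row-stochastic (T, T)
    succ = trans[torch.arange(T - 1), torch.arange(1, T)]  # P(frame t -> t+1)
    return -torch.log(succ + 1e-8).mean()

def representativeness(scores, feats):
    # Frames weighted by their selection score should summarize all frames.
    w = F.softmax(scores, dim=0)               # (T,) selection weights
    summary = (w.unsqueeze(1) * feats).sum(0)  # score-weighted summary vector
    return (feats - summary).pow(2).sum(1).mean()

# Usage with random stand-ins; in practice feats would come from a frozen
# backbone and scores from an attention-free scorer (e.g., 1D convolutions).
feats = torch.randn(120, 256)
scores = torch.randn(120, requires_grad=True)
loss = temporal_relative_information(feats) + representativeness(scores, feats)
loss.backward()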

Pritam Mishra, Coloma Ballester, Dimosthenis Karatzas

Computing Technology, Computer Technology

Pritam Mishra, Coloma Ballester, Dimosthenis Karatzas. TRIM: A Self-Supervised Video Summarization Framework Maximizing Temporal Relative Information and Representativeness [EB/OL]. (2025-06-25) [2025-07-16]. https://arxiv.org/abs/2506.20588.
