
KFFocus: Highlighting Keyframes for Enhanced Video Understanding


Source: arXiv
Abstract

Recently, with the emergence of large language models, multimodal LLMs have demonstrated exceptional capabilities in image and video modalities. Despite advancements in video comprehension, the substantial computational demands of long video sequences lead current video LLMs (Vid-LLMs) to employ compression strategies at both the inter-frame level (e.g., uniform sampling of video frames) and intra-frame level (e.g., condensing all visual tokens of each frame into a limited number). However, this approach often neglects the uneven temporal distribution of critical information across frames, risking the omission of keyframes that contain essential temporal and semantic details. To tackle these challenges, we propose KFFocus, a method designed to efficiently compress video tokens and emphasize the informative context present within video frames. We substitute uniform sampling with a refined approach inspired by classic video compression principles to identify and capture keyframes based on their temporal redundancy. By assigning varying condensation ratios to frames based on their contextual relevance, KFFocus efficiently reduces token redundancy while preserving informative content details. Additionally, we introduce a spatiotemporal modeling module that encodes both the temporal relationships between video frames and the spatial structure within each frame, thus providing Vid-LLMs with a nuanced understanding of spatial-temporal dynamics. Extensive experiments on widely recognized video understanding benchmarks, especially long video scenarios, demonstrate that KFFocus significantly outperforms existing methods, achieving substantial computational efficiency and enhanced accuracy.
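The abstract does not give implementation details, so the following is only a minimal sketch of the general idea: pick keyframes by how much each frame differs from its predecessor, then give keyframes a larger token budget than the remaining frames. Simple frame differencing stands in for the paper's compression-inspired selection, and every name and number here (`frame_differences`, `select_keyframes`, `token_budgets`, the budgets of 64 and 4) is a hypothetical choice for illustration, not the authors' method.

```python
from typing import List, Sequence


def frame_differences(frames: Sequence[Sequence[float]]) -> List[float]:
    """Mean absolute pixel difference between each frame and its predecessor.

    The first frame has no predecessor, so it is scored as infinity and is
    therefore always ranked first (an assumption of this sketch).
    """
    if not frames:
        return []
    diffs = [float("inf")]
    for prev, cur in zip(frames, frames[1:]):
        diffs.append(sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur))
    return diffs


def select_keyframes(frames: Sequence[Sequence[float]], num_keyframes: int) -> List[int]:
    """Return indices of the frames that change most, in temporal order."""
    diffs = frame_differences(frames)
    ranked = sorted(range(len(frames)), key=lambda i: diffs[i], reverse=True)
    return sorted(ranked[:num_keyframes])


def token_budgets(num_frames: int, keyframes: Sequence[int],
                  key_tokens: int = 64, other_tokens: int = 4) -> List[int]:
    """Assign a larger visual-token budget to keyframes (hypothetical values)."""
    keyset = set(keyframes)
    return [key_tokens if i in keyset else other_tokens for i in range(num_frames)]


# Toy video: five 2-pixel grayscale frames with one scene change at index 2.
frames = [[0, 0], [0, 0], [10, 10], [10, 10], [10, 11]]
keys = select_keyframes(frames, num_keyframes=2)
budgets = token_budgets(len(frames), keys)
```

In this toy input the scene change at index 2 and the initial frame are selected, so low-redundancy frames keep more tokens while near-duplicate frames are condensed more heavily.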

Ming Nie, Chunwei Wang, Hang Xu, Li Zhang

Computing technology, computer technology

Ming Nie, Chunwei Wang, Hang Xu, Li Zhang. KFFocus: Highlighting Keyframes for Enhanced Video Understanding [EB/OL]. (2025-08-12) [2025-08-24]. https://arxiv.org/abs/2508.08989.
