SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing
Varun Biyyala, Jialu Li, Youshan Zhang, Bharat Chanderprakash Kathuria. SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing [EB/OL]. (2025-01-13) [2025-09-18]. https://arxiv.org/abs/2501.07554.
Video editing models have advanced significantly, but evaluating their
performance remains challenging. Traditional metrics, such as CLIP text and
image scores, often fall short: text scores are limited by inadequate training
data and hierarchical dependencies, while image scores fail to assess temporal
consistency. We present SST-EM (Semantic, Spatial, and Temporal Evaluation
Metric), a novel evaluation framework that leverages modern Vision-Language
Models (VLMs), Object Detection, and Temporal Consistency checks. SST-EM
comprises four components: (1) semantic extraction from frames using a VLM, (2)
primary object tracking with Object Detection, (3) focused object refinement
via an LLM agent, and (4) temporal consistency assessment using a Vision
Transformer (ViT). These components are integrated into a unified metric with
weights derived from human evaluations and regression analysis. The name SST-EM
reflects its focus on Semantic, Spatial, and Temporal aspects of video
evaluation. SST-EM provides a comprehensive evaluation of semantic fidelity and
temporal smoothness in video editing. The source code is available in the
GitHub repository: https://github.com/custommetrics-sst/SST_CustomEvaluationMetrics.git
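The unified metric described above combines the four component scores into a single value using weights fitted to human evaluations via regression. A minimal sketch of that aggregation step is shown below; the weight values and function name here are illustrative placeholders, not the coefficients fitted in the paper.

```python
# Hypothetical sketch of SST-EM's final aggregation: a weighted sum of the
# four component scores (semantic fidelity, object tracking, LLM-refined
# object score, ViT temporal consistency). The weights below are placeholder
# values for illustration; the paper derives its weights from human
# evaluations and regression analysis.

def sst_em_score(semantic, spatial, refined, temporal,
                 weights=(0.3, 0.2, 0.2, 0.3)):
    """Combine four component scores (each in [0, 1]) into one metric."""
    components = (semantic, spatial, refined, temporal)
    if not all(0.0 <= c <= 1.0 for c in components):
        raise ValueError("component scores must lie in [0, 1]")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * c for w, c in zip(weights, components))

# Example: strong semantic and temporal scores, weaker spatial scores.
print(round(sst_em_score(0.9, 0.8, 0.85, 0.95), 3))  # -> 0.885
```

With weights summing to one and components bounded in [0, 1], the combined score is itself bounded in [0, 1], which keeps it directly comparable across edited videos.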