Multimodal Lengthy Videos Retrieval Framework and Evaluation Metric

Abstract

Precise video retrieval requires multi-modal correlations to handle unseen vocabulary and scenes, becoming more complex for lengthy videos where models must perform effectively without prior training on a specific dataset. We introduce a unified framework that combines a visual matching stream and an aural matching stream with a unique subtitles-based video segmentation approach. Additionally, the aural stream includes a complementary audio-based two-stage retrieval mechanism that enhances performance on long-duration videos. Considering the complex nature of retrieval from lengthy videos and its corresponding evaluation, we introduce a new retrieval evaluation method specifically designed for long-video retrieval to support further research. We conducted experiments on the YouCook2 benchmark, showing promising retrieval performance.

Authors: Mohamed Eltahir, Osamah Sarraj, Mohammed Bremoo, Mohammed Khurd, Abdulrahman Alfrihidi, Taha Alshatiri, Mohammad Almatrafi, Tanveer Hussain

Subject classification: Computing technology, computer technology

Recommended citation: Mohamed Eltahir, Osamah Sarraj, Mohammed Bremoo, Mohammed Khurd, Abdulrahman Alfrihidi, Taha Alshatiri, Mohammad Almatrafi, Tanveer Hussain. Multimodal Lengthy Videos Retrieval Framework and Evaluation Metric [EB/OL]. (2025-04-06) [2025-05-25]. https://arxiv.org/abs/2504.04572.
