Smart Routing for Multimodal Video Retrieval: When to Search What
We introduce ModaRoute, an LLM-based routing system that dynamically selects which modalities to search for each query in multimodal video retrieval. Dense text captions can achieve 75.9% Recall@5, but they require expensive offline processing and miss critical visual information, which is present in the 34% of clips containing scene text not captured by ASR. By analyzing query intent and predicting information needs, ModaRoute reduces computational overhead by 41% while achieving 60.9% Recall@5. Our approach uses GPT-4.1 to route queries across ASR (speech), OCR (on-screen text), and visual indices, averaging 1.78 modalities per query versus 3.0 for exhaustive search over all three. Evaluation on 1.8M video clips demonstrates that intelligent routing is a practical way to scale multimodal retrieval systems, reducing infrastructure costs while maintaining competitive effectiveness for real-world deployment.
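The routing idea and the cost arithmetic in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `keyword_router` is a hypothetical cue-word heuristic standing in for the GPT-4.1 router (the abstract does not give the actual prompt or decision logic), and `overhead_reduction` assumes cost scales linearly with the number of indices searched. Under that assumption, averaging 1.78 of 3 modalities gives a saving of about 41%, which is consistent with the reported figure.

```python
import re
from typing import Iterable, Set

# The three indices named in the abstract.
MODALITIES = ("asr", "ocr", "visual")

# Hypothetical cue words; the real system uses an LLM, not keyword matching.
CUES = {
    "asr": {"says", "said", "talks", "mentions"},
    "ocr": {"sign", "caption", "logo", "written"},
    "visual": {"wearing", "shows", "scene", "color"},
}


def keyword_router(query: str) -> Set[str]:
    """Pick the modalities to search for a query (stand-in for the LLM router)."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    chosen = {m for m, cues in CUES.items() if tokens & cues}
    # With no clear signal, fall back to exhaustive search over all indices.
    return chosen or set(MODALITIES)


def average_modalities(queries: Iterable[str]) -> float:
    """Mean number of indices searched per query under the router."""
    counts = [len(keyword_router(q)) for q in queries]
    return sum(counts) / len(counts)


def overhead_reduction(avg_modalities: float, total: int = len(MODALITIES)) -> float:
    """Fraction of index searches avoided, assuming cost is linear in indices searched."""
    return 1.0 - avg_modalities / total


if __name__ == "__main__":
    print(sorted(keyword_router("the host says welcome")))
    print(f"{overhead_reduction(1.78):.1%}")
```

The fallback to all three indices reflects a conservative design choice: when the router cannot tell which modality holds the answer, it pays the full cost rather than risk missing the clip.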
Kevin Dela Rosa
Computing technology, computer technology
Kevin Dela Rosa. Smart Routing for Multimodal Video Retrieval: When to Search What [EB/OL]. (2025-07-12) [2025-08-10]. https://arxiv.org/abs/2507.13374.