
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models

Source: arXiv
English Abstract

Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data. Recent approaches that extend image-based LMMs to videos either lack grounding capabilities (e.g., VideoChat, Video-ChatGPT, Video-LLaMA) or do not utilize audio signals for better video understanding (e.g., Video-ChatGPT). Addressing these gaps, we propose PG-Video-LLaVA, the first LMM with pixel-level grounding capability, which integrates audio cues by transcribing them into text to enrich video-context understanding. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially localize objects in videos following user instructions. We evaluate PG-Video-LLaVA on video-based generative and question-answering benchmarks and introduce new benchmarks specifically designed to measure prompt-based object grounding performance in videos. Further, we propose using Vicuna instead of GPT-3.5 (as used in Video-ChatGPT) for video-based conversation benchmarking, ensuring reproducibility of results, which is a concern given the proprietary nature of GPT-3.5. Our framework builds on the SoTA image-based LLaVA model and extends its advantages to the video domain, delivering promising gains on video-based conversation and grounding tasks. Project Page: https://github.com/mbzuai-oryx/Video-LLaVA
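The abstract describes a pipeline: transcribe the video's audio into text, combine it with the user instruction when querying the video LMM, then ground the objects mentioned in the response using a grounding module and an off-the-shelf tracker. The sketch below is a minimal illustration of that flow under stated assumptions, not the authors' implementation; all helper functions (transcribe_audio, video_lmm_respond, extract_referred_objects, ground_and_track) are hypothetical placeholders.

```python
# Minimal sketch of the pipeline outlined in the abstract (not the authors' code).
# Every helper below is a hypothetical placeholder standing in for:
#   - an ASR model that transcribes the audio track,
#   - the video LMM that answers the user instruction,
#   - a phrase extractor plus an off-the-shelf detector/tracker for pixel grounding.

from typing import Dict, List


def transcribe_audio(video_path: str) -> str:
    """Placeholder: run speech recognition on the video's audio track."""
    return "transcript of the audio"


def video_lmm_respond(video_path: str, transcript: str, instruction: str) -> str:
    """Placeholder: query the video LMM with visual tokens, transcript, and instruction."""
    return "The person on the left picks up a red cup."


def extract_referred_objects(response: str) -> List[str]:
    """Placeholder: pull noun phrases from the response to ground spatially."""
    return ["person", "red cup"]


def ground_and_track(video_path: str, phrases: List[str]) -> Dict[str, list]:
    """Placeholder: detect each phrase per frame and link detections with a tracker."""
    return {phrase: [] for phrase in phrases}  # phrase -> per-frame boxes/masks


def answer_and_ground(video_path: str, instruction: str) -> Dict[str, object]:
    transcript = transcribe_audio(video_path)               # audio cues as text
    response = video_lmm_respond(video_path, transcript, instruction)
    phrases = extract_referred_objects(response)
    tracks = ground_and_track(video_path, phrases)          # spatio-temporal grounding
    return {"response": response, "tracks": tracks}


if __name__ == "__main__":
    print(answer_and_ground("sample.mp4", "What does the person do?"))
```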

Muhammad Maaz, Shehan Munasinghe, Salman Khan, Mubarak Shah, Fahad Khan, Hanoona Abdul Rasheed, Rusiru Thushara

Computing Technology, Computer Technology

Muhammad Maaz, Shehan Munasinghe, Salman Khan, Mubarak Shah, Fahad Khan, Hanoona Abdul Rasheed, Rusiru Thushara. PG-Video-LLaVA: Pixel Grounding Large Video-Language Models [EB/OL]. (2023-11-22) [2025-05-17]. https://arxiv.org/abs/2311.13435
