How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering?
We investigate whether off-the-shelf Multimodal Large Language Models (MLLMs) can tackle Online Episodic-Memory Video Question Answering (OEM-VQA) without additional training. Our pipeline converts a streaming egocentric video into a lightweight textual memory, only a few kilobytes per minute, via an MLLM descriptor module, and answers multiple-choice questions by querying this memory with an LLM reasoner module. On the QAEgo4D-Closed benchmark, our best configuration attains 56.0% accuracy with 3.6 kB of storage per minute, matching the performance of dedicated state-of-the-art systems while being 10^4 to 10^5 times more memory-efficient. Extensive ablations provide insight into the role of each component and design choice, and highlight directions for improvement in future research.
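To make the two-module pipeline concrete, the sketch below outlines how a descriptor MLLM could turn streaming video chunks into a timestamped textual memory and how an LLM reasoner could answer multiple-choice questions over it. The `mllm.generate`/`llm.generate` interfaces, prompts, and chunking are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the descriptor + reasoner pipeline described in the abstract.
# Model interfaces and prompts are hypothetical placeholders.

from dataclasses import dataclass, field
from typing import List


@dataclass
class TextualMemory:
    """Lightweight episodic memory: timestamped natural-language descriptions."""
    entries: List[str] = field(default_factory=list)

    def append(self, timestamp_s: float, description: str) -> None:
        self.entries.append(f"[{timestamp_s:.0f}s] {description}")

    def as_prompt(self) -> str:
        return "\n".join(self.entries)


def describe_chunk(mllm, frames) -> str:
    """Descriptor module: an off-the-shelf MLLM summarises a short video chunk
    into a few sentences of text (hypothetical `mllm.generate` interface)."""
    return mllm.generate(
        images=frames,
        prompt="Describe the actions, objects, and locations visible in these frames.",
    )


def answer_question(llm, memory: TextualMemory, question: str, options: List[str]) -> str:
    """Reasoner module: an LLM answers a multiple-choice question using only
    the textual memory (hypothetical `llm.generate` interface)."""
    choices = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    prompt = (
        f"Video log:\n{memory.as_prompt()}\n\n"
        f"Question: {question}\nOptions:\n{choices}\n"
        "Answer with the letter of the correct option."
    )
    return llm.generate(prompt=prompt)
```

Because the memory is plain text accumulated online, storage stays in the kilobytes-per-minute range regardless of video length, which is the source of the memory efficiency reported above.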
Giuseppe Lando, Rosario Forte, Giovanni Maria Farinella, Antonino Furnari
Computing technology, computer technology
Giuseppe Lando, Rosario Forte, Giovanni Maria Farinella, Antonino Furnari. How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? [EB/OL]. (2025-06-19) [2025-07-02]. https://arxiv.org/abs/2506.16450