National Preprint Platform

From Pixels to Graphs: using Scene and Knowledge Graphs for HD-EPIC VQA Challenge

Source: arXiv

Abstract

This report presents SceneNet and KnowledgeNet, our approaches developed for the HD-EPIC VQA Challenge 2025. SceneNet leverages scene graphs generated with a multi-modal large language model (MLLM) to capture fine-grained object interactions, spatial relationships, and temporally grounded events. In parallel, KnowledgeNet incorporates ConceptNet's external commonsense knowledge to introduce high-level semantic connections between entities, enabling reasoning beyond directly observable visual evidence. Each method demonstrates distinct strengths across the seven categories of the HD-EPIC benchmark, and their combination within our framework yields an overall accuracy of 44.21% on the challenge, highlighting the effectiveness of graph-based reasoning for complex egocentric VQA tasks.
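To make the scene-graph idea in the abstract concrete, here is a minimal sketch of a temporally grounded scene graph: objects as nodes, timestamped relations as edges. The schema, class, and method names are illustrative assumptions for this sketch, not the report's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Toy scene graph: objects as nodes, time-stamped relations as edges.
    (Hypothetical schema for illustration, not the report's actual code.)"""
    nodes: set = field(default_factory=set)
    # Each edge: (subject, relation, object, t_start, t_end) in seconds.
    edges: list = field(default_factory=list)

    def add_relation(self, subj, rel, obj, t_start, t_end):
        """Record a relation between two objects over a time interval."""
        self.nodes.update({subj, obj})
        self.edges.append((subj, rel, obj, t_start, t_end))

    def relations_at(self, t):
        """Return the relations that hold at time t (temporal grounding)."""
        return [(s, r, o) for s, r, o, t0, t1 in self.edges if t0 <= t <= t1]

# Example: a short egocentric kitchen interaction.
g = SceneGraph()
g.add_relation("knife", "on", "cutting_board", 0.0, 2.0)
g.add_relation("hand", "holds", "knife", 2.0, 5.0)
print(g.relations_at(3.0))  # [('hand', 'holds', 'knife')]
```

Querying `relations_at(t)` is the kind of temporally grounded lookup that lets a VQA system answer "what was the hand holding at second 3?" from graph structure rather than raw pixels.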

Agnese Taluzzi, Davide Gesualdi, Riccardo Santambrogio, Chiara Plizzari, Francesca Palermo, Simone Mentasti, Matteo Matteucci

Subject: Computing Technology; Computer Technology

Agnese Taluzzi, Davide Gesualdi, Riccardo Santambrogio, Chiara Plizzari, Francesca Palermo, Simone Mentasti, Matteo Matteucci. From Pixels to Graphs: using Scene and Knowledge Graphs for HD-EPIC VQA Challenge [EB/OL]. (2025-06-10) [2025-07-16]. https://arxiv.org/abs/2506.08553.
