National Preprint Platform

From Pixels to Graphs: using Scene and Knowledge Graphs for HD-EPIC VQA Challenge

Source: arXiv

Abstract

This report presents SceneNet and KnowledgeNet, our approaches developed for the HD-EPIC VQA Challenge 2025. SceneNet leverages scene graphs generated with a multi-modal large language model (MLLM) to capture fine-grained object interactions, spatial relationships, and temporally grounded events. In parallel, KnowledgeNet incorporates ConceptNet's external commonsense knowledge to introduce high-level semantic connections between entities, enabling reasoning beyond directly observable visual evidence. Each method demonstrates distinct strengths across the seven categories of the HD-EPIC benchmark, and their combination within our framework yields an overall accuracy of 44.21% on the challenge, highlighting the effectiveness of graph-based reasoning for complex egocentric VQA tasks.
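To make the scene-graph idea in the abstract concrete, here is a minimal sketch of a temporally grounded scene graph: objects as nodes, timestamped relations as edges. The schema, class, and method names are illustrative assumptions for this sketch, not the report's actual implementation.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    """Toy scene graph: objects as nodes, time-stamped relations as edges.
    (Hypothetical schema for illustration, not the report's actual code.)"""
    nodes: set = field(default_factory=set)
    # Each edge: (subject, relation, object, t_start, t_end) in seconds.
    edges: list = field(default_factory=list)

    def add_relation(self, subj, rel, obj, t_start, t_end):
        """Record a relation between two objects over a time interval."""
        self.nodes.update({subj, obj})
        self.edges.append((subj, rel, obj, t_start, t_end))

    def relations_at(self, t):
        """Return the relations that hold at time t (temporal grounding)."""
        return [(s, r, o) for s, r, o, t0, t1 in self.edges if t0 <= t <= t1]

# Example: a short egocentric kitchen interaction.
g = SceneGraph()
g.add_relation("knife", "on", "cutting_board", 0.0, 2.0)
g.add_relation("hand", "holds", "knife", 2.0, 5.0)
print(g.relations_at(3.0))  # [('hand', 'holds', 'knife')]
```

Querying `relations_at(t)` is the kind of temporally grounded lookup that lets a VQA system answer "what was the hand holding at second 3?" from graph structure rather than raw pixels.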

Agnese Taluzzi, Davide Gesualdi, Riccardo Santambrogio, Chiara Plizzari, Francesca Palermo, Simone Mentasti, Matteo Matteucci

Subject: Computing Technology; Computer Technology

Agnese Taluzzi, Davide Gesualdi, Riccardo Santambrogio, Chiara Plizzari, Francesca Palermo, Simone Mentasti, Matteo Matteucci. From Pixels to Graphs: using Scene and Knowledge Graphs for HD-EPIC VQA Challenge [EB/OL]. (2025-06-10) [2025-07-16]. https://arxiv.org/abs/2506.08553.
