
Visual Entity Linking via Multi-modal Learning

Abstract

Existing visual scene understanding methods mainly focus on identifying coarse-grained concepts about the visual objects and their relationships, largely neglecting fine-grained scene understanding. In fact, many data-driven applications on the Web (e.g., news reading and e-shopping) require accurately recognizing much finer-grained concepts, i.e., entities, and properly linking them to a knowledge graph (KG), which can take their performance to the next level. In light of this, in this paper we identify a new research task: visual entity linking for fine-grained scene understanding. To accomplish the task, we first extract features of candidate entities from different modalities, i.e., visual features, textual features, and KG features. Then, we design a deep modal-attention neural network-based learning-to-rank method that aggregates all features and maps visual objects to entities in the KG. Extensive experimental results on a newly constructed dataset show that our proposed method is effective, significantly improving accuracy from 66.46% to 83.16% compared with baselines.
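The abstract describes the method only at a high level. The following minimal PyTorch sketch illustrates one way a modal-attention fusion with pairwise learning-to-rank scoring could be wired up; it is an illustrative assumption, not the paper's implementation. The class name ModalAttentionRanker, the feature dimensions, and the margin ranking loss are placeholders chosen for the example.

```python
# Minimal sketch (assumed design, not the authors' code): a modal-attention
# network that fuses the visual, textual, and KG features of a candidate
# entity, scores it against a detected visual object, and is trained with a
# pairwise margin ranking loss. All dimensions and layer choices are
# illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalAttentionRanker(nn.Module):
    def __init__(self, obj_dim=2048, vis_dim=2048, txt_dim=768, kg_dim=200, hid=256):
        super().__init__()
        # Project each modality of the candidate entity into a shared space.
        self.vis_proj = nn.Linear(vis_dim, hid)
        self.txt_proj = nn.Linear(txt_dim, hid)
        self.kg_proj = nn.Linear(kg_dim, hid)
        # Project the detected visual object (the query) into the same space.
        self.obj_proj = nn.Linear(obj_dim, hid)
        # Attention scorer: weights each modality conditioned on the query object.
        self.attn = nn.Linear(2 * hid, 1)
        # Final scoring layer over the fused entity representation and the query.
        self.score = nn.Linear(2 * hid, 1)

    def forward(self, obj_feat, ent_vis, ent_txt, ent_kg):
        q = torch.tanh(self.obj_proj(obj_feat))                     # (B, hid)
        mods = torch.stack([torch.tanh(self.vis_proj(ent_vis)),
                            torch.tanh(self.txt_proj(ent_txt)),
                            torch.tanh(self.kg_proj(ent_kg))], 1)   # (B, 3, hid)
        # Modal attention: score each modality against the query object.
        q_exp = q.unsqueeze(1).expand_as(mods)                      # (B, 3, hid)
        alpha = F.softmax(self.attn(torch.cat([mods, q_exp], -1)).squeeze(-1), dim=1)
        fused = (alpha.unsqueeze(-1) * mods).sum(1)                 # (B, hid)
        return self.score(torch.cat([fused, q], -1)).squeeze(-1)    # (B,)

# Pairwise learning-to-rank: the ground-truth entity should outscore a negative.
def ranking_loss(model, obj, pos, neg, margin=1.0):
    s_pos = model(obj, *pos)   # pos/neg are (vis, txt, kg) feature triples
    s_neg = model(obj, *neg)
    return F.margin_ranking_loss(s_pos, s_neg, torch.ones_like(s_pos), margin=margin)
```

Candidate entities would be ranked at inference time by scoring each one against the detected object and linking the object to the top-scoring KG entity.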

Qiushuo Zheng, Meng Wang, Guilin Qi, Hao Wen

DOI: 10.12074/202211.00385V1

Computing Technology, Computer Technology

Knowledge graph; Multi-modal learning; Entity linking; Learning to rank; Knowledge graph representation

Qiushuo Zheng, Meng Wang, Guilin Qi, Hao Wen. Visual Entity Linking via Multi-modal Learning[EB/OL]. (2022-11-28)[2025-08-02]. https://chinaxiv.org/abs/202211.00385.
