Visual Grounding Methods for Efficient Interaction with Desktop Graphical User Interfaces

Abstract

Most visual grounding solutions primarily focus on realistic images. However, applications involving synthetic images, such as Graphical User Interfaces (GUIs), remain limited. This restricts the development of autonomous computer vision-powered artificial intelligence (AI) agents for automatic application interaction. Enabling AI to effectively understand and interact with GUIs is crucial to advancing automation in software testing, accessibility, and human-computer interaction. In this work, we explore Instruction Visual Grounding (IVG), a multi-modal approach to object identification within a GUI. More precisely, given a natural language instruction and a GUI screen, IVG locates the coordinates of the element on the screen where the instruction should be executed. We propose two main methods: (1) IVGocr, which combines a Large Language Model (LLM), an object detection model, and an Optical Character Recognition (OCR) module; and (2) IVGdirect, which uses a multimodal architecture for end-to-end grounding. For each method, we introduce a dedicated dataset. In addition, we propose the Central Point Validation (CPV) metric, a relaxed variant of the classical Central Proximity Score (CPS) metric. Our final test dataset is publicly released to support future research.
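
The abstract names the CPV metric but does not define it. As a rough illustration only, the sketch below assumes that a prediction counts as correct when the center of the predicted bounding box falls inside the ground-truth bounding box, and that the score is the fraction of correct predictions. The `Box` type and the function names `central_point_hit` and `cpv_score` are hypothetical and are not taken from the paper.

```python
from typing import NamedTuple, Sequence, Tuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates (hypothetical helper type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center point of a box."""
    return ((box.x_min + box.x_max) / 2, (box.y_min + box.y_max) / 2)


def central_point_hit(pred: Box, truth: Box) -> bool:
    """Assumed CPV criterion: the predicted box's center lies inside the ground-truth box."""
    cx, cy = center(pred)
    return truth.x_min <= cx <= truth.x_max and truth.y_min <= cy <= truth.y_max


def cpv_score(preds: Sequence[Box], truths: Sequence[Box]) -> float:
    """Fraction of predictions whose center point lands inside the matching ground-truth box."""
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not preds:
        return 0.0
    hits = sum(central_point_hit(p, t) for p, t in zip(preds, truths))
    return hits / len(preds)


# Example: the first prediction's center (20, 20) is inside its target; the second misses.
predictions = [Box(10, 10, 30, 30), Box(0, 0, 10, 10)]
ground_truth = [Box(15, 15, 40, 40), Box(20, 20, 30, 30)]
score = cpv_score(predictions, ground_truth)
```

A point-in-box check like this suits GUI interaction because a click only needs to land somewhere on the target element, so exact box overlap matters less than where the action point falls.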

Authors: El Hassane Ettifouri, Jessica López Espejel, Laura Minkova, Tassnim Dardouri, Walid Dahhane

Subject classification: Computing technology, computer technology

Recommended citation: El Hassane Ettifouri, Jessica López Espejel, Laura Minkova, Tassnim Dardouri, Walid Dahhane. Visual Grounding Methods for Efficient Interaction with Desktop Graphical User Interfaces [EB/OL]. (2025-07-18) [2025-08-04]. https://arxiv.org/abs/2407.01558.
