Visual Grounding Methods for Efficient Interaction with Desktop Graphical User Interfaces
Most visual grounding solutions focus on realistic images, while applications to synthetic images such as Graphical User Interfaces (GUIs) remain limited. This gap restricts the development of autonomous, computer vision-powered artificial intelligence (AI) agents that interact with applications automatically. Enabling AI to understand and interact with GUIs effectively is crucial to advancing automation in software testing, accessibility, and human-computer interaction. In this work, we explore Instruction Visual Grounding (IVG), a multimodal approach to object identification within a GUI. More precisely, given a natural language instruction and a GUI screen, IVG locates the coordinates of the on-screen element where the instruction should be executed. We propose two main methods: (1) IVGocr, which combines a Large Language Model (LLM), an object detection model, and an Optical Character Recognition (OCR) module; and (2) IVGdirect, which uses a multimodal architecture for end-to-end grounding. For each method, we introduce a dedicated dataset. In addition, we propose the Central Point Validation (CPV) metric, a relaxed variant of the classical Central Proximity Score (CPS) metric. Our final test dataset is publicly released to support future research.
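The abstract names the CPV metric without defining it. As a minimal sketch only, the Python below assumes that a prediction counts as correct when the center of the predicted bounding box falls inside the ground-truth box of the target element. That rule, the (x_min, y_min, x_max, y_max) box layout, and the names Box, center_point_hit, and hit_rate are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Iterable, NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates (hypothetical layout)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a box."""
    return ((box.x_min + box.x_max) / 2.0, (box.y_min + box.y_max) / 2.0)


def center_point_hit(predicted: Box, target: Box) -> bool:
    """Assumed CPV-style rule: True if the center of the predicted box
    lies inside (or on the border of) the ground-truth box."""
    cx, cy = center(predicted)
    return (target.x_min <= cx <= target.x_max
            and target.y_min <= cy <= target.y_max)


def hit_rate(pairs: Iterable[Tuple[Box, Box]]) -> float:
    """Fraction of (predicted, target) pairs that satisfy the rule above.
    Returns 0.0 for an empty input."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(center_point_hit(p, t) for p, t in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical ground-truth box for a button on a GUI screenshot.
    button = Box(100, 30, 200, 70)
    # One prediction overlapping the button, one far away from it.
    predictions = [Box(90, 40, 130, 60), Box(0, 0, 20, 20)]
    print(hit_rate((p, button) for p in predictions))
```

Under this assumed rule, a click issued at the center of the predicted box would land on the intended element whenever the check passes, which is why a center-based test can be more forgiving than requiring tight box overlap.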
El Hassane Ettifouri, Jessica López Espejel, Laura Minkova, Tassnim Dardouri, Walid Dahhane
Computing technology, computer technology
El Hassane Ettifouri, Jessica López Espejel, Laura Minkova, Tassnim Dardouri, Walid Dahhane. Visual Grounding Methods for Efficient Interaction with Desktop Graphical User Interfaces [EB/OL]. (2025-07-18) [2025-08-04]. https://arxiv.org/abs/2407.01558.