InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction
This paper introduces \textsc{InfantAgent-Next}, a generalist agent capable of interacting with computers in a multimodal manner, encompassing text, images, audio, and video. Unlike existing approaches that either build intricate workflows around a single large model or provide only workflow-level modularity, our agent integrates tool-based and pure-vision agents within a highly modular architecture, enabling different models to collaboratively solve decoupled tasks step by step. Our generality is demonstrated by the ability to evaluate not only on pure-vision real-world benchmarks (e.g., OSWorld) but also on more general or tool-intensive benchmarks (e.g., GAIA and SWE-Bench). Specifically, we achieve $\mathbf{7.27\%}$ accuracy on OSWorld, higher than Claude-Computer-Use. Code and evaluation scripts are open-sourced at https://github.com/bin123apple/InfantAgent.
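The abstract's central idea, a modular architecture in which a coordinator routes decoupled sub-tasks to either a tool-based agent or a pure-vision agent, can be illustrated with a minimal sketch. This is a hypothetical illustration, not the InfantAgent codebase's actual API: the Step, Coordinator, tool_agent, and vision_agent names are assumptions introduced here solely to show how per-modality routing might work.

    # Minimal sketch of the modular tool/vision agent split described in the
    # abstract. All class and function names are illustrative assumptions;
    # they are not taken from the InfantAgent codebase.
    from dataclasses import dataclass
    from typing import Callable, Dict, List

    @dataclass
    class Step:
        """One decoupled sub-task produced by a planner."""
        kind: str         # e.g. "tool" (shell/file edits) or "vision" (GUI actions)
        instruction: str  # natural-language description of the sub-task

    class Coordinator:
        """Routes each planned step to the agent registered for its modality."""

        def __init__(self) -> None:
            self.agents: Dict[str, Callable[[str], str]] = {}

        def register(self, kind: str, agent: Callable[[str], str]) -> None:
            self.agents[kind] = agent

        def run(self, plan: List[Step]) -> List[str]:
            # Solve the task step by step, letting different models collaborate.
            return [self.agents[step.kind](step.instruction) for step in plan]

    def tool_agent(instruction: str) -> str:
        return f"[tool] executed: {instruction}"

    def vision_agent(instruction: str) -> str:
        return f"[vision] acted on screen per: {instruction}"

    if __name__ == "__main__":
        coordinator = Coordinator()
        coordinator.register("tool", tool_agent)
        coordinator.register("vision", vision_agent)
        results = coordinator.run([
            Step("vision", "open the settings window"),
            Step("tool", "run the test suite in the repo"),
        ])
        print("\n".join(results))

Registering agents by modality keeps the coordinator agnostic to which underlying model serves each step, which is the kind of decoupling the abstract attributes to the architecture.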
Bin Lei, Weitai Kang, Zijian Zhang, Winson Chen, Xi Xie, Shan Zuo, Mimi Xie, Ali Payani, Mingyi Hong, Yan Yan, Caiwen Ding
Computing Technology, Computer Technology
Bin Lei, Weitai Kang, Zijian Zhang, Winson Chen, Xi Xie, Shan Zuo, Mimi Xie, Ali Payani, Mingyi Hong, Yan Yan, Caiwen Ding. InfantAgent-Next: A Multimodal Generalist Agent for Automated Computer Interaction [EB/OL]. (2025-05-16) [2025-06-21]. https://arxiv.org/abs/2505.10887.