Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts


Abstract

Recently, agents based on multimodal large language models (MLLMs) have achieved remarkable progress across various domains. However, building a generalist agent with capabilities such as perception, planning, action, grounding, and reflection in open-world environments like Minecraft remains challenging because of three obstacles: insufficient domain-specific data, interference among heterogeneous tasks, and visual diversity in open-world settings. In this paper, we address these challenges through three key contributions. 1) We propose a knowledge-enhanced data generation pipeline to provide scalable and high-quality training data for agent development. 2) To mitigate interference among heterogeneous tasks, we introduce a Mixture-of-Experts (MoE) architecture with task-level routing. 3) We develop a Multimodal Reasoning-Augmented Reinforcement Learning approach to enhance the agent's reasoning ability for visual diversity in Minecraft. Built upon these innovations, we present Optimus-3, a general-purpose agent for Minecraft. Extensive experimental results demonstrate that Optimus-3 surpasses both generalist multimodal large language models and existing state-of-the-art agents across a wide range of tasks in the Minecraft environment. Project page: https://cybertronagent.github.io/Optimus-3.github.io/
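
The idea of task-level routing mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class name `TaskRoutedMoE`, the helper `make_expert`, the use of one linear expert per task, and the task names (taken from the capability list in the abstract) are all assumptions made for illustration. The point it shows is only that the expert is chosen once per input from the task label, so every token of that input passes through the same expert, in contrast to token-level routing where each token may be sent to a different expert.

```python
import random

# Task labels borrowed from the capabilities listed in the abstract (illustrative only).
TASKS = ["perception", "planning", "action", "grounding", "reflection"]


def make_expert(dim, seed):
    """Build a toy linear expert: a fixed dim x dim weight matrix (hypothetical helper)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]

    def expert(token):
        # Plain matrix-vector product: one output value per weight row.
        return [sum(w * x for w, x in zip(row, token)) for row in weights]

    return expert


class TaskRoutedMoE:
    """Toy MoE layer with one expert per task (assumed design, not from the paper)."""

    def __init__(self, dim):
        self.experts = {task: make_expert(dim, seed=i) for i, task in enumerate(TASKS)}

    def forward(self, tokens, task):
        # Task-level routing: the task label alone selects the expert,
        # and that single expert processes every token of the input.
        if task not in self.experts:
            raise KeyError(f"no expert registered for task {task!r}")
        expert = self.experts[task]
        return [expert(token) for token in tokens]


moe = TaskRoutedMoE(dim=4)
tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
planning_out = moe.forward(tokens, "planning")
action_out = moe.forward(tokens, "action")
```

Because each task has its own parameters in this sketch, updating one expert cannot overwrite what another has learned, which is one plausible reading of how such routing reduces interference among heterogeneous tasks.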

Authors: Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Weili Guan, Dongmei Jiang, Liqiang Nie


Subject classification: Computing technology, computer technology

Recommended citation: Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Weili Guan, Dongmei Jiang, Liqiang Nie. Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts [EB/OL]. (2025-06-12) [2025-08-02]. https://arxiv.org/abs/2506.10357
