National Preprint Platform

Offline Trajectory Optimization for Offline Reinforcement Learning

Source: arXiv
Abstract

Offline reinforcement learning (RL) aims to learn policies without online exploration. To enlarge the training data, model-based offline RL learns a dynamics model that is used as a virtual environment to generate simulated data and enhance policy learning. However, existing data augmentation methods for offline RL suffer from (i) trivial improvement from short-horizon simulation; and (ii) the lack of evaluation and correction for generated data, leading to low-quality augmentation. In this paper, we propose offline trajectory optimization for offline reinforcement learning (OTTO). The key motivation is to conduct long-horizon simulation and then utilize model uncertainty to evaluate and correct the augmented data. Specifically, we propose an ensemble of Transformers, a.k.a. World Transformers, to predict environment state dynamics and the reward function. Three strategies are proposed that use World Transformers to generate long-horizon simulated trajectories by perturbing the actions in the offline data. Then, an uncertainty-based World Evaluator is introduced to first evaluate the confidence of the generated trajectories and then correct low-confidence data. Finally, we jointly use the original data and the corrected augmented data to train an offline RL algorithm. OTTO serves as a plug-in module and can be integrated with existing model-free offline RL methods. Experiments on various benchmarks show that OTTO effectively improves the performance of representative offline RL algorithms, including in complex environments with sparse rewards such as AntMaze. Code is available at https://github.com/ZiqiZhao1/OTTO.
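The uncertainty-based evaluation and correction step described in the abstract can be illustrated with a minimal sketch. Here, ensemble disagreement (standard deviation across members) serves as the confidence proxy and low-confidence steps fall back to the ensemble mean; the function names and the threshold rule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def step_uncertainty(ensemble_preds: np.ndarray) -> np.ndarray:
    """Per-step uncertainty of one simulated trajectory.

    ensemble_preds: (n_models, horizon, state_dim) next-state predictions
    from an ensemble of world models. Disagreement (std) across members
    is averaged over state dimensions to give one score per step.
    """
    return ensemble_preds.std(axis=0).mean(axis=-1)  # shape: (horizon,)

def evaluate_and_correct(ensemble_preds: np.ndarray, threshold: float):
    """Flag low-confidence steps and use the ensemble mean as the corrected prediction."""
    unc = step_uncertainty(ensemble_preds)
    corrected = ensemble_preds.mean(axis=0)  # (horizon, state_dim)
    confident = unc <= threshold             # boolean mask per step
    return corrected, confident
```

In a full pipeline, trajectories whose steps are mostly low-confidence could then be truncated or discarded before the remaining augmented data is mixed with the original offline dataset for policy training.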

Ziqi Zhao, Zhaochun Ren, Liu Yang, Yunsen Liang, Fajie Yuan, Pengjie Ren, Zhumin Chen, Jun Ma, Xin Xin

Subject: Computing Technology, Computer Science

Ziqi Zhao, Zhaochun Ren, Liu Yang, Yunsen Liang, Fajie Yuan, Pengjie Ren, Zhumin Chen, Jun Ma, Xin Xin. Offline Trajectory Optimization for Offline Reinforcement Learning[EB/OL]. (2025-07-10)[2025-07-16]. https://arxiv.org/abs/2404.10393.