
PPO-BR: Dual-Signal Entropy-Reward Adaptation for Trust Region Policy Optimization

Source: arXiv
Abstract

Although Proximal Policy Optimization (PPO) dominates policy gradient methods, from robotic control to game AI, its static trust region forces a brittle trade-off: aggressive clipping stifles early exploration, while late-stage updates destabilize convergence. PPO-BR addresses this by fusing exploration and convergence signals into a single bounded trust region, a theoretically grounded adaptation that outperforms five SOTA baselines with less than 2% overhead. Within one adaptive mechanism, it bridges a gap in phase-aware learning and targets deployment in safety-critical systems such as robotic surgery. PPO-BR adapts the clip range through two complementary rules: (1) entropy-driven expansion (epsilon up) for exploration in high-uncertainty states, and (2) reward-guided contraction (epsilon down) for convergence stability. On six diverse benchmarks (MuJoCo, Atari, sparse-reward), PPO-BR achieves 29.1% faster convergence (p < 0.001), 2.3x lower reward variance than PPO, and less than 1.8% runtime overhead, with only a five-line code change. Its simplicity and theoretical guarantees make it ready to deploy in safety-critical domains, from surgical robotics to autonomous drones. In contrast to recent methods such as Group Relative Policy Optimization (GRPO), PPO-BR offers a unified entropy-reward mechanism applicable to both language models and general reinforcement learning environments.
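The abstract describes the mechanism as a small modification to PPO's clipping: widen the clip range epsilon when policy entropy is high, shrink it when reward progress stabilizes, and keep it within fixed bounds. A minimal Python sketch of that idea follows; the specific update rule, the weights ALPHA and BETA, and the bounds EPS_MIN and EPS_MAX are illustrative assumptions, not the paper's exact formulation.

import numpy as np

# Illustrative sketch of a dual-signal adaptive clip range in the spirit of PPO-BR.
# All constants and the update rule below are assumptions for illustration only.
EPS_BASE = 0.2                 # standard PPO clip range
EPS_MIN, EPS_MAX = 0.05, 0.3   # assumed bounds on the trust region
ALPHA, BETA = 0.5, 0.5         # assumed weights for the entropy and reward signals

def adaptive_clip_range(policy_entropy, max_entropy, reward_trend):
    """Fuse an exploration signal (normalized entropy in [0, 1]) and a
    convergence signal (smoothed reward improvement in [0, 1]) into one
    bounded clip range: high entropy expands epsilon, stable reward
    progress contracts it."""
    entropy_ratio = policy_entropy / max(max_entropy, 1e-8)
    eps = EPS_BASE * (1.0 + ALPHA * entropy_ratio - BETA * reward_trend)
    return float(np.clip(eps, EPS_MIN, EPS_MAX))

def ppo_clip_loss(ratio, advantage, eps):
    """Standard PPO clipped surrogate loss with the adaptive epsilon plugged in."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -np.minimum(unclipped, clipped).mean()

In a PPO training loop, adaptive_clip_range would be called once per update from the batch's mean policy entropy and a moving average of recent returns, which is consistent with the abstract's claim of a few-line code change.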

Ben Rahman

Computing Technology, Computer Technology

Ben Rahman. PPO-BR: Dual-Signal Entropy-Reward Adaptation for Trust Region Policy Optimization [EB/OL]. (2025-05-23) [2025-06-07]. https://arxiv.org/abs/2505.17714.
