
Average Reward Reinforcement Learning for Omega-Regular and Mean-Payoff Objectives

Source: arXiv
Abstract

Recent advances in reinforcement learning (RL) have renewed focus on the design of reward functions that shape agent behavior. Manually designing reward functions is tedious and error-prone. A principled alternative is to specify behaviors in a formal language that can be automatically translated into rewards. Omega-regular languages are a natural choice for this purpose, given their established role in formal verification and synthesis. However, existing methods using omega-regular specifications typically rely on discounted reward RL in episodic settings, with periodic resets. This setup misaligns with the semantics of omega-regular specifications, which describe properties over infinite behavior traces. In such cases, the average reward criterion and the continuing setting -- where the agent interacts with the environment over a single, uninterrupted lifetime -- are more appropriate. To address the challenges of infinite-horizon, continuing tasks, we focus on absolute liveness specifications -- a subclass of omega-regular languages that cannot be violated by any finite behavior prefix, making them well-suited to the continuing setting. We present the first model-free RL framework that translates absolute liveness specifications to average-reward objectives. Our approach enables learning in communicating MDPs without episodic resetting. We also introduce a reward structure for lexicographic multi-objective optimization, aiming to maximize an external average-reward objective among the policies that also maximize the satisfaction probability of a given omega-regular specification. Our method guarantees convergence in unknown communicating MDPs and supports on-the-fly reductions that do not require full knowledge of the environment, thus enabling model-free RL. Empirical results show that our average-reward approach in the continuing setting outperforms discount-based methods across benchmarks.
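As background for the average-reward, continuing setting the abstract contrasts with discounted, episodic RL, the sketch below shows generic tabular differential Q-learning on a tiny two-state communicating MDP. This is purely illustrative: the MDP, step sizes, and function names are assumptions, not the authors' construction, and no omega-regular specification machinery is modeled here.

```python
import random

# Hypothetical two-state communicating MDP (illustration only).
# Action 0 stays in the current state; action 1 switches states.
# Reward 1 is earned only when taking action 1 in state 1, so the
# gain-optimal policy alternates between the states (average reward 0.5).
def step(state, action):
    if action == 1:
        return 1 - state, 1.0 if state == 1 else 0.0
    return state, 0.0

def differential_q_learning(steps=50000, alpha=0.1, eta=0.1, eps=0.1, seed=0):
    """Average-reward (differential) Q-learning in the continuing setting:
    one uninterrupted stream of experience, no episodic resets, no discount."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
    r_bar = 0.0  # running estimate of the average reward (gain)
    s = 0
    for _ in range(steps):
        # Epsilon-greedy action selection on the differential Q-values.
        if rng.random() < eps:
            a = rng.choice((0, 1))
        else:
            a = max((0, 1), key=lambda b: q[(s, b)])
        s2, r = step(s, a)
        # TD error subtracts the average-reward estimate instead of
        # discounting the bootstrap target.
        delta = r - r_bar + max(q[(s2, 0)], q[(s2, 1)]) - q[(s, a)]
        q[(s, a)] += alpha * delta
        r_bar += eta * alpha * delta
        s = s2
    return q, r_bar
```

On this deterministic MDP the average-reward estimate settles near the optimal gain of 0.5, and the learned Q-values prefer the alternating action in both states.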

Milad Kazemi, Mateo Perez, Fabio Somenzi, Sadegh Soudjani, Ashutosh Trivedi, Alvaro Velasquez

Subject: Computing Technology; Computer Technology

Milad Kazemi, Mateo Perez, Fabio Somenzi, Sadegh Soudjani, Ashutosh Trivedi, Alvaro Velasquez. Average Reward Reinforcement Learning for Omega-Regular and Mean-Payoff Objectives [EB/OL]. (2025-05-21) [2025-06-03]. https://arxiv.org/abs/2505.15693.