
Efficient $Q$-Learning and Actor-Critic Methods for Robust Average Reward Reinforcement Learning

Source: arXiv
Abstract (English)

We present the first $Q$-learning and actor-critic algorithms for robust average-reward Markov Decision Processes (MDPs) with non-asymptotic convergence guarantees under contamination, total-variation (TV) distance, and Wasserstein distance uncertainty sets. We show that the robust $Q$ Bellman operator is a strict contraction with respect to a carefully constructed semi-norm in which constant functions are quotiented out. This property supports a stochastic approximation update that learns the optimal robust $Q$ function in $\tilde{\mathcal{O}}(\epsilon^{-2})$ samples. We also show that the same idea can be used for robust $Q$ function estimation, which can in turn be used for critic estimation. Coupling this with the theory of robust policy mirror descent updates, we present a natural actor-critic algorithm that attains an $\epsilon$-optimal robust policy in $\tilde{\mathcal{O}}(\epsilon^{-3})$ samples. These results advance the theory of distributionally robust reinforcement learning in the average-reward setting.
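To give the flavor of the stochastic-approximation idea in the abstract, here is a minimal, hypothetical sketch (not the authors' algorithm): tabular robust Q-learning for an average-reward MDP under a contamination uncertainty set, where the worst-case next-state value is $(1-\delta)\,\mathbb{E}_{p_0}[V] + \delta \min_s V(s)$ and a reference entry $Q(s_0, a_0)$ is subtracted, RVI-style, to pin down the free constant that the semi-norm quotients out. The MDP instance, step-size schedule, and all names are illustrative assumptions.

```python
# Illustrative sketch only: robust average-reward Q-learning under a
# contamination uncertainty set {(1 - delta) * p0 + delta * q : q arbitrary}.
# Worst-case expected next-state value = (1 - delta) * E_{p0}[V] + delta * min V.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small MDP: nS states, nA actions, nominal kernel P0, rewards R.
nS, nA, delta = 5, 3, 0.1
P0 = rng.random((nS, nA, nS))
P0 /= P0.sum(axis=-1, keepdims=True)
R = rng.random((nS, nA))

def robust_q_learning(n_steps=200_000, alpha0=0.5):
    """Stochastic-approximation update toward a robust average-reward Q table."""
    Q = np.zeros((nS, nA))
    s = 0
    for t in range(1, n_steps + 1):
        a = rng.integers(nA)                    # uniform exploration
        s_next = rng.choice(nS, p=P0[s, a])     # sample from the nominal kernel
        V = Q.max(axis=1)                       # greedy value estimate
        # Worst-case next-state value under the contamination set.
        robust_v = (1.0 - delta) * V[s_next] + delta * V.min()
        # Subtracting Q[0, 0] removes the constant shift (quotient/span idea).
        target = R[s, a] + robust_v - Q[0, 0]
        alpha = alpha0 / (1.0 + 0.001 * t)      # diminishing step size
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
    gain_estimate = Q[0, 0]                     # rough proxy for the robust gain
    return Q, gain_estimate

if __name__ == "__main__":
    Q, rho = robust_q_learning()
    print("estimated robust average reward:", round(rho, 4))
    print("greedy policy:", Q.argmax(axis=1))
```

The reference-entry subtraction mirrors the abstract's point that the robust Bellman operator only contracts once constant functions are quotiented out; without it, the iterates can drift by an arbitrary additive constant.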

Yang Xu, Swetha Ganesh, Vaneet Aggarwal

Subject area: Fundamental Theory of Automation

Yang Xu, Swetha Ganesh, Vaneet Aggarwal. Efficient $Q$-Learning and Actor-Critic Methods for Robust Average Reward Reinforcement Learning [EB/OL]. (2025-06-08) [2025-06-28]. https://arxiv.org/abs/2506.07040.
