
A Finite-Sample Analysis of an Actor-Critic Algorithm for Mean-Variance Optimization in a Discounted MDP

Source: arXiv
Abstract

Motivated by applications in risk-sensitive reinforcement learning, we study mean-variance optimization in a discounted reward Markov Decision Process (MDP). Specifically, we analyze a Temporal Difference (TD) learning algorithm with linear function approximation (LFA) for policy evaluation. We derive finite-sample bounds that hold (i) in the mean-squared sense and (ii) with high probability under tail iterate averaging, both with and without regularization. Our bounds exhibit an exponentially decaying dependence on the initial error and a convergence rate of $O(1/t)$ after $t$ iterations. Moreover, for the regularized TD variant, our bound holds for a universal step size. Next, we integrate a Simultaneous Perturbation Stochastic Approximation (SPSA)-based actor update with an LFA critic and establish an $O(n^{-1/4})$ convergence guarantee, where $n$ denotes the number of iterations of the SPSA-based actor-critic algorithm. These results establish finite-sample theoretical guarantees for risk-sensitive actor-critic methods in reinforcement learning, with a focus on variance as a risk measure.
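The structure the abstract describes can be pictured with a minimal sketch. Assuming a small random MDP, a softmax policy, and standard SPSA gain sequences (all illustrative placeholders, not the paper's setup), the critic runs TD with LFA on both the value $V$ and the square-value $U$, via the standard recursion $U(s) = \mathbb{E}[r^2 + 2\gamma r V(s') + \gamma^2 U(s')]$, with tail iterate averaging; the actor then takes SPSA steps on a mean-variance objective $J(w) = V(s_0) - \lambda\,\mathrm{Var}(s_0)$. The paper's exact updates, step sizes, and regularization may differ.

```python
# Sketch of a mean-variance actor-critic: tail-averaged TD-LFA critic for
# (V, U) plus an SPSA actor step. MDP, features, gains, and the risk weight
# `lam` are illustrative, not taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
nS, nA, d, gamma, lam = 5, 2, 3, 0.9, 0.1          # small random MDP
P = rng.dirichlet(np.ones(nS), size=(nS, nA))      # transition kernel P[s, a]
R = rng.random((nS, nA))                           # rewards in [0, 1]
Phi = rng.standard_normal((nS, d))                 # state features

def policy(w, s):
    """Softmax policy with linear scores in the state features."""
    logits = Phi[s] @ w.reshape(d, nA)
    p = np.exp(logits - logits.max())
    return p / p.sum()

def td_critic(w, T=5000, alpha=0.5):
    """TD with LFA for the value V and square-value U of the policy `w`.
    Uses U(s) = E[r^2 + 2*gamma*r*V(s') + gamma^2*U(s')] and returns the
    tail average of the iterates over the last T/2 steps."""
    th_v, th_u = np.zeros(d), np.zeros(d)
    avg_v, avg_u, s = np.zeros(d), np.zeros(d), 0
    for t in range(1, T + 1):
        a = rng.choice(nA, p=policy(w, s))
        s2 = rng.choice(nS, p=P[s, a])
        r = R[s, a]
        v, v2 = Phi[s] @ th_v, Phi[s2] @ th_v
        u, u2 = Phi[s] @ th_u, Phi[s2] @ th_u
        step = alpha / t                           # O(1/t) step size
        th_v += step * (r + gamma * v2 - v) * Phi[s]
        th_u += step * (r**2 + 2*gamma*r*v2 + gamma**2 * u2 - u) * Phi[s]
        if t > T // 2:                             # tail iterate averaging
            k = t - T // 2
            avg_v += (th_v - avg_v) / k
            avg_u += (th_u - avg_u) / k
        s = s2
    return avg_v, avg_u

def mv_objective(w, s0=0):
    """Mean-variance objective J(w) = V(s0) - lam * (U(s0) - V(s0)^2)."""
    th_v, th_u = td_critic(w)
    v = Phi[s0] @ th_v
    return v - lam * (Phi[s0] @ th_u - v**2)

w = np.zeros(d * nA)
for n in range(1, 51):                             # SPSA actor updates
    delta = rng.choice([-1.0, 1.0], size=w.shape)  # Rademacher perturbation
    c_n, a_n = 0.5 / n**0.25, 0.1 / n              # SPSA gain sequences
    grad = (mv_objective(w + c_n * delta) -
            mv_objective(w - c_n * delta)) / (2 * c_n) * delta
    w += a_n * grad                                # gradient ascent on J(w)
```

Since the Rademacher perturbations satisfy $1/\Delta_i = \Delta_i$, multiplying the two-point difference by `delta` gives the usual SPSA gradient estimate; the average over the last half of the critic iterates is one common form of the tail iterate averaging referenced in the bounds.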

Krishna Jagannathan, L. A. Prashanth, Tejaram Sangadi

Subjects: Computing Technology, Computer Technology; Fundamental Theory of Automation; Automation Technology, Automation Equipment

Krishna Jagannathan, L. A. Prashanth, Tejaram Sangadi. A Finite-Sample Analysis of an Actor-Critic Algorithm for Mean-Variance Optimization in a Discounted MDP [EB/OL]. (2024-06-12) [2025-05-09]. https://arxiv.org/abs/2406.07892.
