On-line Policy Iteration with Policy Switching for Markov Decision Processes
Motivated by Bertsekas' recent study of policy iteration (PI) for solving infinite-horizon discounted Markov decision processes (MDPs) in an on-line setting, we develop an off-line PI algorithm integrated with a multi-policy improvement method called policy switching, and then adapt its asynchronous variant into an on-line PI algorithm that generates a sequence of policies over time. The current policy is updated into the next policy by switching the action only at the current state, while ensuring the monotonicity of the value functions of the policies in the sequence. Depending on the state-transition structure of the MDP, the sequence converges in finite time to an optimal policy for an associated local MDP. When the MDP is communicating, the sequence converges to an optimal policy for the original MDP.
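The abstract describes, but does not spell out, the single-state improvement step. Below is a minimal sketch of such an on-line update for a finite discounted MDP, assuming full knowledge of the transition kernel P[a, s, s'] and rewards R[a, s] and exact policy evaluation; the names (evaluate_policy, online_pi_step) and the greedy tie-handling are illustrative assumptions, not the paper's algorithm. At each visited state the policy is re-evaluated and its action is switched only there, which preserves monotone improvement of the value functions by the standard policy improvement argument.

import numpy as np

def evaluate_policy(P, R, policy, gamma):
    # Exact evaluation of a deterministic policy: solve (I - gamma * P_pi) V = R_pi.
    n = P.shape[1]                          # number of states; P has shape (A, S, S)
    P_pi = P[policy, np.arange(n), :]       # (S, S) transition matrix under the policy
    R_pi = R[policy, np.arange(n)]          # (S,) one-step rewards under the policy
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

def online_pi_step(P, R, policy, s, gamma):
    # Switch the action only at the current state s if some action improves Q^pi(s, .).
    V = evaluate_policy(P, R, policy, gamma)
    Q_s = R[:, s] + gamma * P[:, s, :] @ V  # Q^pi(s, a) for every action a
    best = int(np.argmax(Q_s))
    new_policy = policy.copy()
    if Q_s[best] > Q_s[policy[s]]:          # switch only on strict improvement
        new_policy[s] = best                # single-state (asynchronous) update
    return new_policy

# Toy run: follow the current policy on-line, updating at each visited state.
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(A, S)) # P[a, s, s'] transition probabilities
R = rng.standard_normal((A, S))            # R[a, s] one-step rewards
policy = np.zeros(S, dtype=int)
s = 0
for t in range(30):
    policy = online_pi_step(P, R, policy, s, gamma)
    s = rng.choice(S, p=P[policy[s], s])   # next state under the (updated) policy

Because only the visited state's action changes per step, each update is an asynchronous PI step, and which states ever get improved depends on the trajectory; this is consistent with the abstract's claim that convergence is to an optimal policy of a local MDP in general, and of the original MDP when it is communicating.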
Hyeong Soo Chang
Computing technology; computer technology
Hyeong Soo Chang. On-line Policy Iteration with Policy Switching for Markov Decision Processes [EB/OL]. (2021-12-03) [2025-08-02]. https://arxiv.org/abs/2112.02177.