Model-free Posterior Sampling via Learning Rate Randomization
In this paper, we introduce Randomized Q-learning (RandQL), a novel randomized model-free algorithm for regret minimization in episodic Markov Decision Processes (MDPs). To the best of our knowledge, RandQL is the first tractable model-free posterior sampling-based algorithm. We analyze the performance of RandQL in both tabular and non-tabular metric space settings. In tabular MDPs, RandQL achieves a regret bound of order $\widetilde{O}(\sqrt{H^{5}SAT})$, where $H$ is the planning horizon, $S$ is the number of states, $A$ is the number of actions, and $T$ is the number of episodes. For a metric state-action space, RandQL enjoys a regret bound of order $\widetilde{O}(H^{5/2} T^{(d_z+1)/(d_z+2)})$, where $d_z$ denotes the zooming dimension. Notably, RandQL achieves optimistic exploration without using bonuses, relying instead on a novel idea of learning rate randomization. Our empirical study shows that RandQL outperforms existing approaches on baseline exploration environments.
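To make the core idea concrete, below is a minimal sketch of learning-rate-randomized Q-learning for a tabular episodic MDP. The ensemble size J, the Beta(H, n) distribution of the random learning rates, the max-aggregation over the ensemble, and the `env` interface (reset()/step() returning next state, reward, done) are all illustrative assumptions, not the paper's exact algorithm or tuning.

```python
import numpy as np

def randql_sketch(env, H, S, A, T, J=10, seed=0):
    """Illustrative sketch of Q-learning with randomized learning rates.

    H: horizon, S: #states, A: #actions, T: #episodes, J: ensemble size.
    """
    rng = np.random.default_rng(seed)
    # Ensemble of temporary Q-estimates, optimistically initialized at H.
    q_ens = np.full((J, H, S, A), float(H))
    # Policy Q-values used for action selection.
    q_pol = np.full((H, S, A), float(H))
    counts = np.zeros((H, S, A), dtype=int)

    for _ in range(T):
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(q_pol[h, s]))
            s_next, r, _ = env.step(a)
            n = counts[h, s, a] = counts[h, s, a] + 1
            # Value estimate of the next step (zero at the last step).
            v_next = 0.0 if h == H - 1 else float(np.max(q_pol[h + 1, s_next]))
            # Randomized learning rates: one Beta(H, n) draw per ensemble
            # member replaces the usual deterministic rate (H + 1) / (H + n),
            # so the update itself injects the exploration noise -- no bonus.
            w = rng.beta(H, n, size=J)
            q_ens[:, h, s, a] = (1.0 - w) * q_ens[:, h, s, a] + w * (r + v_next)
            # Aggregate the ensemble (here: a clipped max) into the policy
            # Q-values that drive optimistic, bonus-free exploration.
            q_pol[h, s, a] = min(float(H - h), float(np.max(q_ens[:, h, s, a])))
            s = s_next
    return q_pol
```

The Beta(H, n) draw has mean H / (H + n), close to the standard (H + 1) / (H + n) step size, so each ensemble member tracks the usual Q-learning target while its randomness plays the role of posterior sampling.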
Daniil Tiapkin, Alexey Naumov, Remi Munos, Eric Moulines, Denis Belomestny, Daniele Calandriello, Pierre Perrault, Michal Valko, Pierre Menard
Subject areas: Fundamental Theory of Automation; Computing Technology and Computer Technology
Daniil Tiapkin, Alexey Naumov, Remi Munos, Eric Moulines, Denis Belomestny, Daniele Calandriello, Pierre Perrault, Michal Valko, Pierre Menard. Model-free Posterior Sampling via Learning Rate Randomization [EB/OL]. (2025-07-07) [2025-07-16]. https://arxiv.org/abs/2310.18186.