
Extended UCB Policies for Frequentist Multi-armed Bandit Problems

Source: arXiv
Abstract

The multi-armed bandit (MAB) problem is a widely studied model for sequential decision making in operations research and reinforcement learning. This paper mainly considers the classical MAB model with heavy-tailed reward distributions. We introduce the extended robust UCB policy, an extension of the pioneering UCB policies proposed by Bubeck et al. [5] and Lattimore [22]. These earlier UCB policies impose strict conditions on the reward distributions that can be hard to guarantee in practical scenarios. Our extended robust UCB generalizes Lattimore's seminal work (for moments of orders $p=4$ and $q=2$) to arbitrarily chosen $p>q>1$, as long as the two moments satisfy a known controlled relationship, while still achieving the optimal regret growth order $O(\log T)$, thereby broadening the applicability of UCB policies to heavy-tailed reward distributions. Furthermore, we achieve a near-optimal regret order without any knowledge of the reward distributions, as long as their $p$-th moments exist for some $p>1$.
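The extended robust UCB policy itself is developed in the full paper; as a rough illustration of the robust-UCB idea the abstract builds on (a truncated-mean estimator in the spirit of Bubeck et al. [5]), here is a minimal Python sketch. The truncation rule, confidence width, constants, and toy Pareto arms are illustrative assumptions, not the paper's exact policy; the parameter `eps` plays the role of $p-1$ for a bounded $p$-th moment with $p\in(1,2]$.

```python
# Illustrative sketch only: a truncated-mean UCB loop for heavy-tailed rewards,
# in the spirit of Bubeck et al.'s robust UCB. Constants and the truncation
# rule are placeholder assumptions, not the paper's extended robust UCB policy.

import math
import random


def truncated_mean(samples, u, eps, t):
    """Empirical mean with a sample-index-dependent truncation level."""
    total = 0.0
    for s, x in enumerate(samples, start=1):
        level = (u * s / math.log(t * t)) ** (1.0 / (1.0 + eps))
        if abs(x) <= level:
            total += x
    return total / len(samples)


def robust_ucb(arms, horizon, u=1.0, eps=1.0):
    """Pull each arm once, then repeatedly pull the arm with the largest index."""
    k = len(arms)
    history = [[arms[i]()] for i in range(k)]  # one initial pull per arm
    for t in range(k + 1, horizon + 1):
        indices = []
        for i in range(k):
            n = len(history[i])
            mean = truncated_mean(history[i], u, eps, t)
            # Confidence width shrinks at the heavy-tailed rate (log t / n)^(eps/(1+eps)).
            width = 4.0 * u ** (1.0 / (1.0 + eps)) * (
                math.log(t * t) / n) ** (eps / (1.0 + eps))
            indices.append(mean + width)
        best = max(range(k), key=lambda i: indices[i])
        history[best].append(arms[best]())
    return history


if __name__ == "__main__":
    random.seed(0)
    # Two toy heavy-tailed arms (shifted Pareto with shape 3); means 0.0 and 0.3.
    arms = [lambda: random.paretovariate(3.0) - 1.5,
            lambda: random.paretovariate(3.0) - 1.2]
    hist = robust_ucb(arms, horizon=2000)
    print("pull counts:", [len(h) for h in hist])
```

The key design choice in this style of policy is that the truncation level grows with the sample index, so the estimator remains well behaved when rewards have only a finite low-order moment, and the confidence width decays at the correspondingly slower heavy-tailed rate rather than the sub-Gaussian $\sqrt{\log t / n}$ rate.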

Keqin Liu, Tianshuo Zheng, Zhi-Hua Zhou

Subject: Computing Technology; Computer Technology

Keqin Liu, Tianshuo Zheng, Zhi-Hua Zhou. Extended UCB Policies for Frequentist Multi-armed Bandit Problems [EB/OL]. (2025-06-30) [2025-07-20]. https://arxiv.org/abs/1112.1768.
