
Extended UCB Policies for Frequentist Multi-armed Bandit Problems

Source: arXiv
Abstract

The multi-armed bandit (MAB) problem is a widely studied model for sequential decision making in operations research and reinforcement learning. This paper mainly considers the classical MAB model with heavy-tailed reward distributions. We introduce the extended robust UCB policy, an extension of the pioneering UCB policies proposed by Bubeck et al. [5] and Lattimore [22]. These earlier UCB policies impose strict conditions on the reward distributions that can be hard to guarantee in practical scenarios. Our extended robust UCB generalizes Lattimore's seminal work (for moments of orders $p=4$ and $q=2$) to arbitrarily chosen $p>q>1$, as long as the two moments satisfy a known controlled relationship, while still achieving the optimal regret growth order $O(\log T)$, thereby broadening the applicability of UCB policies to heavy-tailed reward distributions. Furthermore, we achieve a near-optimal regret order without any knowledge of the reward distributions, as long as their $p$-th moments exist for some $p>1$.
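The extended robust UCB policy itself is developed in the full paper; as a rough illustration of the robust-UCB idea the abstract builds on (a truncated-mean estimator in the spirit of Bubeck et al. [5]), here is a minimal Python sketch. The truncation rule, confidence width, constants, and toy Pareto arms are illustrative assumptions, not the paper's exact policy; the parameter `eps` plays the role of $p-1$ for a bounded $p$-th moment with $p\in(1,2]$.

```python
# Illustrative sketch only: a truncated-mean UCB loop for heavy-tailed rewards,
# in the spirit of Bubeck et al.'s robust UCB. Constants and the truncation
# rule are placeholder assumptions, not the paper's extended robust UCB policy.

import math
import random


def truncated_mean(samples, u, eps, t):
    """Empirical mean with a sample-index-dependent truncation level."""
    total = 0.0
    for s, x in enumerate(samples, start=1):
        level = (u * s / math.log(t * t)) ** (1.0 / (1.0 + eps))
        if abs(x) <= level:
            total += x
    return total / len(samples)


def robust_ucb(arms, horizon, u=1.0, eps=1.0):
    """Pull each arm once, then repeatedly pull the arm with the largest index."""
    k = len(arms)
    history = [[arms[i]()] for i in range(k)]  # one initial pull per arm
    for t in range(k + 1, horizon + 1):
        indices = []
        for i in range(k):
            n = len(history[i])
            mean = truncated_mean(history[i], u, eps, t)
            # Confidence width shrinks at the heavy-tailed rate (log t / n)^(eps/(1+eps)).
            width = 4.0 * u ** (1.0 / (1.0 + eps)) * (
                math.log(t * t) / n) ** (eps / (1.0 + eps))
            indices.append(mean + width)
        best = max(range(k), key=lambda i: indices[i])
        history[best].append(arms[best]())
    return history


if __name__ == "__main__":
    random.seed(0)
    # Two toy heavy-tailed arms (shifted Pareto with shape 3); means 0.0 and 0.3.
    arms = [lambda: random.paretovariate(3.0) - 1.5,
            lambda: random.paretovariate(3.0) - 1.2]
    hist = robust_ucb(arms, horizon=2000)
    print("pull counts:", [len(h) for h in hist])
```

The key design choice in this style of policy is that the truncation level grows with the sample index, so the estimator remains well behaved when rewards have only a finite low-order moment, and the confidence width decays at the correspondingly slower heavy-tailed rate rather than the sub-Gaussian $\sqrt{\log t / n}$ rate.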

Keqin Liu, Tianshuo Zheng, Zhi-Hua Zhou

Subject: Computing Technology; Computer Technology

Keqin Liu, Tianshuo Zheng, Zhi-Hua Zhou. Extended UCB Policies for Frequentist Multi-armed Bandit Problems [EB/OL]. (2025-06-30) [2025-07-20]. https://arxiv.org/abs/1112.1768.
