Contextual Bandits with Stage-wise Constraints
Contextual Bandits with Stage-wise Constraints
We study contextual bandits in the presence of a stage-wise constraint when the constraint must be satisfied both with high probability and in expectation. We start with the linear case where both the reward function and the stage-wise constraint (cost function) are linear. In each of the high probability and in expectation settings, we propose an upper-confidence bound algorithm for the problem and prove a $T$-round regret bound for it. We also prove a lower-bound for this constrained problem, show how our algorithms and analyses can be extended to multiple constraints, and provide simulations to validate our theoretical results. In the high probability setting, we describe the minimum requirements for the action set for our algorithm to be tractable. In the setting that the constraint is in expectation, we specialize our results to multi-armed bandits and propose a computationally efficient algorithm for this setting with regret analysis. Finally, we extend our results to the case where the reward and cost functions are both non-linear. We propose an algorithm for this case and prove a regret bound for it that characterize the function class complexity by the eluder dimension.
Mohammad Ghavamzadeh、Aldo Pacchiano、Peter Bartlett
计算技术、计算机技术
Mohammad Ghavamzadeh,Aldo Pacchiano,Peter Bartlett.Contextual Bandits with Stage-wise Constraints[EB/OL].(2025-08-21)[2025-09-02].https://arxiv.org/abs/2401.08016.点此复制
评论