PERRY: Policy Evaluation with Confidence Intervals using Auxiliary Data
Off-policy evaluation (OPE) methods aim to estimate the value of a new reinforcement learning (RL) policy prior to deployment. Recent advances have shown that leveraging auxiliary datasets, such as those synthesized by generative models, can improve the accuracy of these value estimates. Unfortunately, such auxiliary datasets may also be biased, and existing methods that use data augmentation for OPE in RL lack principled uncertainty quantification. In high-stakes settings like healthcare, reliable uncertainty estimates are important for comparing policy value estimates. In this work, we propose two approaches to construct valid confidence intervals for OPE when using data augmentation. The first provides a confidence interval over the policy performance conditioned on a particular initial state, $V^\pi(s_0)$; such intervals are particularly important for human-centered applications. To do so, we introduce a new conformal prediction method for MDPs with high-dimensional states. Second, we consider the more common task of estimating the average policy performance over many initial states; here, we draw on ideas from doubly robust estimation and prediction-powered inference. Across simulators spanning robotics, healthcare, and inventory management, as well as a real healthcare dataset from MIMIC-IV, we find that our methods can use augmented data and still consistently produce intervals that cover the ground-truth values, unlike previously proposed methods.
Aishwarya Mandyam, Jason Meng, Ge Gao, Jiankai Sun, Mac Schwager, Barbara E. Engelhardt, Emma Brunskill
Medical research methods
Aishwarya Mandyam, Jason Meng, Ge Gao, Jiankai Sun, Mac Schwager, Barbara E. Engelhardt, Emma Brunskill. PERRY: Policy Evaluation with Confidence Intervals using Auxiliary Data [online]. (2025-07-26) [accessed 2025-08-18]. https://arxiv.org/abs/2507.20068
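
For context, here is a minimal sketch of the two standard interval constructions the abstract alludes to (split conformal prediction and prediction-powered inference). These are the textbook forms, not necessarily the paper's exact estimators; the predictor $\hat f$, the calibration scores $s_i$, and the miscoverage level $\alpha$ are notational assumptions. Given $n$ calibration pairs $(x_i, y_i)$, split conformal prediction sets
$$
\hat q = \text{the } \lceil (n+1)(1-\alpha) \rceil / n \text{ empirical quantile of } s_i = |y_i - \hat f(x_i)|,
\qquad
C_\alpha(x) = \big[\hat f(x) - \hat q,\ \hat f(x) + \hat q\big],
$$
which guarantees $\Pr\{y_{n+1} \in C_\alpha(x_{n+1})\} \ge 1-\alpha$ under exchangeability. For an average-value target, prediction-powered inference combines $N$ auxiliary (e.g., synthetic) predictions with a bias correction from the $n$ labeled points:
$$
\hat\theta^{\mathrm{PP}} = \frac{1}{N}\sum_{j=1}^{N} \hat f(\tilde x_j) \;-\; \frac{1}{n}\sum_{i=1}^{n}\big(\hat f(x_i) - y_i\big),
$$
with a confidence interval built from the variances of the two independent sample means. The abstract's contribution is adapting such constructions to OPE with potentially biased augmented data.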