
VRAIL: Vectorized Reward-based Attribution for Interpretable Learning

Source: arXiv
Abstract

We propose VRAIL (Vectorized Reward-based Attribution for Interpretable Learning), a bi-level framework for value-based reinforcement learning (RL) that learns interpretable weight representations from state features. VRAIL consists of two stages: a deep learning (DL) stage that fits an estimated value function using state features, and an RL stage that uses this estimate to shape learning via potential-based reward transformations. The estimator is modeled in either linear or quadratic form, allowing attribution of importance to individual features and their interactions. Empirical results on the Taxi-v3 environment demonstrate that VRAIL improves training stability and convergence compared to standard DQN, without requiring environment modifications. Further analysis shows that VRAIL uncovers semantically meaningful subgoals, such as passenger possession, highlighting its ability to produce human-interpretable behavior. Our findings suggest that VRAIL serves as a general, model-agnostic framework for reward shaping that enhances both learning and interpretability.
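
To make the two-stage mechanism concrete, the sketch below illustrates potential-based shaping with a fitted linear value estimate serving as the potential. This is a rough illustration under stated assumptions, not the authors' implementation: the feature map, discount factor, learning rate, and the names LinearEstimator and shaped_reward are all hypothetical.

```python
# Minimal sketch of VRAIL-style potential-based reward shaping
# (illustrative only; not the paper's code).
import numpy as np

GAMMA = 0.99  # discount factor (assumed)

class LinearEstimator:
    """DL-stage estimator: V_hat(s) = w . phi(s) over state features.

    The learned weights w attribute importance to individual features;
    the paper's quadratic variant, V_hat(s) = phi(s)^T W phi(s), would
    additionally attribute importance to feature interactions.
    """
    def __init__(self, n_features: int):
        self.w = np.zeros(n_features)

    def predict(self, phi_s: np.ndarray) -> float:
        return float(self.w @ phi_s)

    def fit_step(self, phi_s: np.ndarray, target: float, lr: float = 1e-2):
        # One gradient step on the squared error (prediction - target)^2 / 2.
        error = self.predict(phi_s) - target
        self.w -= lr * error * phi_s

def shaped_reward(r: float, phi_s: np.ndarray, phi_s_next: np.ndarray,
                  estimator: LinearEstimator, done: bool) -> float:
    """RL-stage shaping: r' = r + gamma * Phi(s') - Phi(s)."""
    phi_next_value = 0.0 if done else estimator.predict(phi_s_next)
    return r + GAMMA * phi_next_value - estimator.predict(phi_s)
```

Because the potential depends only on states, the shaping terms telescope along any trajectory and leave the optimal policy unchanged, which is consistent with the abstract's claim that no environment modification is required.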

Youjin Jang, Jeongjin Han, Jina Kim

Computing Technology; Computer Technology

Youjin Jang, Jeongjin Han, Jina Kim. VRAIL: Vectorized Reward-based Attribution for Interpretable Learning [EB/OL]. (2025-06-25) [2025-06-29]. https://arxiv.org/abs/2506.16014.
