
Logit Dynamics in Softmax Policy Gradient Methods


Source: arXiv

Abstract

We analyze the logit dynamics of softmax policy gradient methods. We derive the exact formula for the L2 norm of the logit update vector: $$ \|\Delta \mathbf{z}\|_2 \propto \sqrt{1-2P_c + C(P)} $$ This equation shows that update magnitudes are determined by the chosen action's probability ($P_c$) and the policy's collision probability ($C(P)$), a measure of concentration inversely related to entropy. Our analysis reveals an inherent self-regulation mechanism in which learning vigor is automatically modulated by policy confidence, providing a foundational insight into the stability and convergence of these methods.
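The norm identity above can be checked numerically. The sketch below assumes the logit update direction is the standard REINFORCE gradient of $\log \pi_c$ with respect to the logits, i.e. $\Delta \mathbf{z} \propto \mathbf{e}_c - P$ (where $\mathbf{e}_c$ is the one-hot vector for the chosen action); the variable names are illustrative, not from the paper.

```python
import numpy as np

# Minimal numerical sketch of the abstract's identity:
#   ||Delta z||_2 = sqrt(1 - 2*P_c + C(P)),
# assuming Delta z is proportional to the gradient of log pi_c
# with respect to the logits, which for softmax is (e_c - P).

rng = np.random.default_rng(0)
logits = rng.normal(size=5)

# Softmax policy over 5 actions (shift by max for numerical stability).
P = np.exp(logits - logits.max())
P /= P.sum()

c = 2                      # chosen action (arbitrary for the check)
e_c = np.zeros_like(P)
e_c[c] = 1.0

delta_z = e_c - P          # gradient of log pi_c w.r.t. the logits
collision = np.sum(P**2)   # collision probability C(P) = sum_a P_a^2

predicted = np.sqrt(1.0 - 2.0 * P[c] + collision)
actual = np.linalg.norm(delta_z)
assert np.isclose(actual, predicted)
```

Expanding $\|\mathbf{e}_c - P\|_2^2 = (1-P_c)^2 + \sum_{a \neq c} P_a^2 = 1 - 2P_c + \sum_a P_a^2$ recovers the formula directly, with $C(P) = \sum_a P_a^2$.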

Yingru Li

Subject: Computing Technology; Computer Technology

Yingru Li. Logit Dynamics in Softmax Policy Gradient Methods [EB/OL]. (2025-06-15) [2025-07-21]. https://arxiv.org/abs/2506.12912.
