On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm
As the default optimizer for training large language models, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not yet theoretically well understood. This paper establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(x^k)\|_1\right]\leq O(\frac{\sqrt{d}C}{K^{1/4}})$ for AdamW measured by the $\ell_1$ norm, where $K$ is the number of iterations, $d$ is the model dimension, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, $E\left[\|\nabla f(x)\|_1\right]\geq\sqrt{\frac{2d}{\pi}}E\left[\|\nabla f(x)\|_2\right]$ when each element of $\nabla f(x)$ is drawn from the standard Gaussian distribution $\mathcal N(0,1)$. Empirically, our experiments on real-world deep learning tasks show that $\|\nabla f(x)\|_1=\varTheta(\sqrt{d})\|\nabla f(x)\|_2$. Both results support viewing our convergence rate as analogous to the optimal $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(x^k)\|_2\right]\leq O(\frac{C}{K^{1/4}})$ convergence rate of SGD.
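A short sketch of the reasoning behind the Gaussian comparison (assuming, as stated above, that the coordinates of $\nabla f(x)$ are i.i.d. $\mathcal N(0,1)$; this is an illustrative derivation, not the paper's full argument):
$$
E\left[\|\nabla f(x)\|_1\right]=\sum_{i=1}^d E\left[|g_i|\right]=d\sqrt{\tfrac{2}{\pi}},\qquad
E\left[\|\nabla f(x)\|_2\right]\leq\sqrt{E\left[\|\nabla f(x)\|_2^2\right]}=\sqrt{d},
$$
where the second bound uses Jensen's inequality. Combining the two gives $E\left[\|\nabla f(x)\|_1\right]=\sqrt{\tfrac{2d}{\pi}}\cdot\sqrt{d}\geq\sqrt{\tfrac{2d}{\pi}}\,E\left[\|\nabla f(x)\|_2\right]$, which is the stated $\sqrt{d}$-scale gap between the two norms.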
Huan Li, Yiming Dong, Zhouchen Lin
Computing Technology; Computer Technology
Huan Li, Yiming Dong, Zhouchen Lin. On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm [EB/OL]. (2025-05-17) [2025-06-19]. https://arxiv.org/abs/2505.11840.