On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm
As the default optimizer for training large language models, AdamW has achieved remarkable success in deep learning. However, its convergence behavior is not yet theoretically well understood. This paper establishes the convergence rate $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(x^k)\|_1\right]\leq O(\frac{\sqrt{d}C}{K^{1/4}})$ for AdamW measured by the $\ell_1$ norm, where $K$ is the number of iterations, $d$ is the model dimension, and $C$ matches the constant in the optimal convergence rate of SGD. Theoretically, $E\left[\|\nabla f(x)\|_1\right]\geq\sqrt{\frac{2d}{\pi}}E\left[\|\nabla f(x)\|_2\right]$ when each element of $\nabla f(x)$ is drawn from the standard Gaussian distribution $\mathcal N(0,1)$. Empirically, our experiments on real-world deep learning tasks show that $\|\nabla f(x)\|_1=\varTheta(\sqrt{d})\|\nabla f(x)\|_2$. Both results support viewing our convergence rate as analogous to the optimal $\frac{1}{K}\sum_{k=1}^K E\left[\|\nabla f(x^k)\|_2\right]\leq O(\frac{C}{K^{1/4}})$ convergence rate of SGD.
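A short sketch of the reasoning behind the Gaussian comparison (assuming, as stated above, that the coordinates of $\nabla f(x)$ are i.i.d. $\mathcal N(0,1)$; this is an illustrative derivation, not the paper's full argument):
$$
E\left[\|\nabla f(x)\|_1\right]=\sum_{i=1}^d E\left[|g_i|\right]=d\sqrt{\tfrac{2}{\pi}},\qquad
E\left[\|\nabla f(x)\|_2\right]\leq\sqrt{E\left[\|\nabla f(x)\|_2^2\right]}=\sqrt{d},
$$
where the second bound uses Jensen's inequality. Combining the two gives $E\left[\|\nabla f(x)\|_1\right]=\sqrt{\tfrac{2d}{\pi}}\cdot\sqrt{d}\geq\sqrt{\tfrac{2d}{\pi}}\,E\left[\|\nabla f(x)\|_2\right]$, which is the stated $\sqrt{d}$-scale gap between the two norms.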
Huan Li, Yiming Dong, Zhouchen Lin
Computing Technology; Computer Technology
Huan Li, Yiming Dong, Zhouchen Lin. On the $O(\frac{\sqrt{d}}{K^{1/4}})$ Convergence Rate of AdamW Measured by $\ell_1$ Norm [EB/OL]. (2025-05-17) [2025-06-19]. https://arxiv.org/abs/2505.11840.