Grams: Gradient Descent with Adaptive Momentum Scaling

Abstract

We introduce $\mathbf{G}$radient Descent with $\mathbf{A}$daptive $\mathbf{M}$omentum $\mathbf{S}$caling ($\mathbf{Grams}$), a novel optimization algorithm that decouples the direction and magnitude of parameter updates in deep learning. Unlike traditional optimizers that directly integrate momentum into updates, Grams separates the update direction, derived from current gradients, from momentum, which is used solely for adaptive magnitude scaling. This approach enables Grams to achieve improved loss descent compared to state-of-the-art cautious and momentum-based optimizers. We theoretically demonstrate that Grams descends faster than other state-of-the-art optimizers and establish a global convergence guarantee for Grams. We also validate its effectiveness through extensive empirical evaluations. The results demonstrate Grams' superior performance, including faster convergence and better generalization, compared to widely used optimizers such as Adam, Lion, and their cautious variants. Our results highlight Grams' potential as a transformative approach for efficiently training and fine-tuning large language models. Code is available at https://github.com/Gunale0926/Grams.
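The decoupling described in the abstract can be illustrated with a minimal sketch. It is not taken from the authors' repository: it assumes Adam-style bias-corrected moment estimates, takes the sign of the current gradient as the update direction, and uses the absolute value of the Adam-style step as the magnitude. The function name `grams_like_step` and all default hyperparameters are hypothetical and chosen only for illustration.

```python
import math


def grams_like_step(params, grads, m, v, step,
                    lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative update on flat lists of floats.

    Assumption (not confirmed by the abstract): the magnitude is the
    absolute value of an Adam-style step, and the direction is the sign
    of the current gradient. `step` counts from 1.
    Returns the new (params, m, v).
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Adam-style first and second moment estimates.
        m_i = beta1 * m_i + (1.0 - beta1) * g
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        # Bias correction, as in Adam.
        m_hat = m_i / (1.0 - beta1 ** step)
        v_hat = v_i / (1.0 - beta2 ** step)
        # Momentum contributes only a non-negative magnitude.
        magnitude = abs(m_hat / (math.sqrt(v_hat) + eps))
        # Direction comes from the current gradient alone (0 if g == 0).
        direction = (g > 0) - (g < 0)
        new_params.append(p - lr * direction * magnitude)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


def minimize_quadratic(x0=3.0, steps=200, lr=0.1):
    """Run the sketch on f(x) = x^2, whose gradient is 2x."""
    params, m, v = [x0], [0.0], [0.0]
    for t in range(1, steps + 1):
        grads = [2.0 * params[0]]
        params, m, v = grams_like_step(params, grads, m, v, t, lr=lr)
    return params[0]
```

Under these assumptions, a parameter always moves against its current gradient, even when the accumulated momentum points the other way. That is the behaviour that separates this sketch from a plain Adam step, where the sign of the momentum decides the direction.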

Authors: Xiaoyu Li, Yang Cao, Zhao Song

Subject classification: Computing technology, computer technology

Recommended citation: Xiaoyu Li, Yang Cao, Zhao Song. Grams: Gradient Descent with Adaptive Momentum Scaling [EB/OL]. (2024-12-22) [2025-05-22]. https://arxiv.org/abs/2412.17107.
