On the Convergence of Gradient Descent on Learning Transformers with Residual Connections
Transformer models have emerged as fundamental tools across various scientific and engineering disciplines, owing to their outstanding performance in diverse applications. Despite this empirical success, the theoretical foundations of Transformers remain relatively underdeveloped, particularly in understanding their training dynamics. Existing research predominantly examines isolated components, such as self-attention mechanisms and feedforward networks, without thoroughly investigating the interdependencies between these components, especially when residual connections are present. In this paper, we aim to bridge this gap by analyzing the convergence behavior of a structurally complete yet single-layer Transformer, comprising self-attention, a feedforward network, and residual connections. We demonstrate that, under appropriate initialization, gradient descent exhibits a linear convergence rate, where the convergence speed is determined by the minimum and maximum singular values of the output matrix from the attention layer. Moreover, our analysis reveals that residual connections serve to ameliorate the ill-conditioning of this output matrix, an issue stemming from the low-rank structure imposed by the softmax operation, thereby improving optimization stability. We also extend our theoretical findings to a multi-layer Transformer architecture, confirming the linear convergence rate of gradient descent under suitable initialization. Empirical results corroborate our theoretical insights, illustrating the beneficial role of residual connections in promoting convergence stability.
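The abstract attributes the linear convergence rate to the extreme singular values of the attention layer's output matrix and credits residual connections with mitigating the ill-conditioning caused by the low-rank structure of the softmax. The following minimal NumPy sketch is an illustration of that conditioning effect under illustrative assumptions (random Gaussian token embeddings, a single softmax attention head, dimensions n = d = 16, and a logit scaling chosen to make the attention weights nearly uniform); it is not the paper's construction or analysis.

```python
import numpy as np

# Minimal sketch (illustrative assumptions, not the paper's construction):
# compare the conditioning of a softmax-attention output with and without
# a residual connection.
rng = np.random.default_rng(0)
n, d = 16, 16                                   # sequence length, embedding dim
X = rng.standard_normal((n, d))                 # token embeddings
WQ, WK, WV = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def softmax_rows(S):
    S = S - S.max(axis=1, keepdims=True)        # subtract row max for stability
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

# Scaling the logits by d (rather than sqrt(d)) pushes the softmax toward a
# nearly uniform, hence nearly rank-one, attention matrix: a simple stand-in
# for the low-rank effect of the softmax mentioned in the abstract.
A = softmax_rows((X @ WQ) @ (X @ WK).T / d)
attn_out = A @ (X @ WV)                         # attention output, no residual
res_out = X + attn_out                          # attention output + residual

for name, M in [("no residual", attn_out), ("with residual", res_out)]:
    s = np.linalg.svd(M, compute_uv=False)
    print(f"{name:13s}  sigma_min={s.min():.3e}  sigma_max={s.max():.3e}  "
          f"cond={s.max() / s.min():.3e}")
```

Under these assumptions, adding the identity copy of X typically raises the minimum singular value of the layer output, which is the quantity the abstract identifies as governing the convergence speed of gradient descent.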
Zhen Qin, Jinxin Zhou, Zhihui Zhu
Computing technology; computer technology
Zhen Qin, Jinxin Zhou, Zhihui Zhu. On the Convergence of Gradient Descent on Learning Transformers with Residual Connections [EB/OL]. (2025-06-05) [2025-06-14]. https://arxiv.org/abs/2506.05249.