Grid-like Error-Correcting Codes for Matrix Multiplication with Better Correcting Capability

Source: arXiv
Abstract

Matrix multiplication over the real field constitutes a foundational operation in the training of deep learning models, serving as a computational cornerstone for both forward and backward propagation processes. However, the presence of silent data corruption (SDC) in large-scale distributed training environments poses a significant threat to model convergence and predictive accuracy, particularly when such errors manifest during matrix multiplication. Due to their transient and non-intrusive nature, these errors often evade detection, allowing them to propagate and accumulate over time, ultimately leading to substantial degradation in model performance. In this paper, we introduce a novel error-correcting coding framework specifically tailored for matrix multiplication operations. Our proposed framework is designed to detect and correct multiple computational errors that may arise during the execution of matrix products. By leveraging a grid-based structural encoding scheme, our approach enhances error localization and correction capabilities across all participating matrices, thereby significantly improving the fault tolerance of the computation. Experimental results demonstrate that our method achieves deterministic correction of up to two erroneous symbols distributed across three matrices with 100% reliability, while incurring only a 24% overhead in computational time on GPU architectures. Furthermore, we provide a rigorous theoretical analysis of the error-correction properties inherent to our coding scheme, establishing its correctness and robustness under well-defined fault models.
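The abstract does not spell out the grid construction, so the following is only a rough illustration of the general idea of checksum-coded matrix multiplication. It uses the classic row/column checksum approach from algorithm-based fault tolerance, not the authors' scheme, and it corrects a single corrupted data entry of the product rather than two symbols across three matrices. All function names are hypothetical, and integer entries are assumed so that checksum comparisons are exact; real-valued data would need a tolerance.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def encode_rows(a):
    """Append a checksum row (the column sums) to A."""
    return a + [[sum(col) for col in zip(*a)]]


def encode_cols(b):
    """Append a checksum column (the row sums) to B."""
    return [row + [sum(row)] for row in b]


def correct_single_error(c_full):
    """Locate and repair one corrupted data entry of the encoded product.

    c_full has shape (m+1) x (n+1); its last row and last column hold
    checksums. Returns the repaired (row, col), or None if all checks pass.
    Assumption: at most one data entry is wrong and the checksums are intact.
    """
    m, n = len(c_full) - 1, len(c_full[0]) - 1
    # Residual = sum of the data entries minus the stored checksum.
    row_res = [sum(c_full[i][:n]) - c_full[i][n] for i in range(m)]
    col_res = [sum(c_full[i][j] for i in range(m)) - c_full[m][j]
               for j in range(n)]
    bad_rows = [i for i, r in enumerate(row_res) if r != 0]
    bad_cols = [j for j, r in enumerate(col_res) if r != 0]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("error pattern is outside what this sketch handles")
    i, j = bad_rows[0], bad_cols[0]
    c_full[i][j] -= row_res[i]  # the residual equals the injected error
    return (i, j)


# Usage: encode, multiply, inject a fault, then repair it.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C_full = matmul(encode_rows(A), encode_cols(B))
C_full[1][0] += 7                      # simulate a silent data corruption
repaired_at = correct_single_error(C_full)
```

The intersection of the one failing row check and the one failing column check pinpoints the faulty entry, which is the kind of localization a grid-shaped code generalizes to more errors.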

Hao Shi, Zhengyi Jiang, Zhongyi Huang, Bo Bai, Gong Zhang, Hanxu Hou

Computing technology, computer technology

Hao Shi, Zhengyi Jiang, Zhongyi Huang, Bo Bai, Gong Zhang, Hanxu Hou. Grid-like Error-Correcting Codes for Matrix Multiplication with Better Correcting Capability [EB/OL]. (2025-08-06) [2025-08-23]. https://arxiv.org/abs/2508.04355