
DecDEC: A Systems Approach to Advancing Low-Bit LLM Quantization


Source: arXiv

Abstract

Quantization of Large Language Models (LLMs) has recently gained popularity, particularly for on-device settings with limited hardware resources. While efficient, quantization inevitably degrades model quality, especially in aggressive low-bit settings such as 3-bit and 4-bit precision. In this paper, we propose DecDEC, an inference scheme that improves the quality of low-bit LLMs while preserving the key benefits of quantization: GPU memory savings and latency reduction. DecDEC stores the residual matrix -- the difference between full-precision and quantized weights -- in CPU, and dynamically fetches the residuals for only a small portion of the weights. This portion corresponds to the salient channels, marked by activation outliers, with the fetched residuals helping to correct quantization errors in these channels. Salient channels are identified dynamically at each decoding step by analyzing the input activations -- this enables adaptation to the dynamic nature of activation distribution, thus maximizing the effectiveness of error compensation. We demonstrate the effectiveness of DecDEC by augmenting state-of-the-art quantization methods. For example, DecDEC reduces the perplexity of a 3-bit Llama-3-8B-Instruct model from 10.15 to 9.12 -- outperforming its 3.5-bit counterpart -- while adding less than 0.0003% to GPU memory usage and incurring only a 1.7% inference slowdown on NVIDIA RTX 4050 Mobile.

Jake Hyun, Hojoon Kim, Jae W. Lee, Yeonhong Park

Subject areas: computing technology; computer technology

Jake Hyun, Hojoon Kim, Jae W. Lee, Yeonhong Park. DecDEC: A Systems Approach to Advancing Low-Bit LLM Quantization [EB/OL]. (2025-06-24) [2025-07-09]. https://arxiv.org/abs/2412.20185.
