
Achieving binary weight and activation for LLMs using Post-Training Quantization

Source: arXiv

English Abstract

Quantizing large language models (LLMs) to 1-bit precision significantly reduces computational costs, but existing quantization techniques suffer from noticeable performance degradation when weight and activation precisions fall below 4 bits (W4A4). In this paper, we propose a post-training quantization framework with a W(1+1)A(1*4) configuration, where weights are quantized to 1 bit with an additional 1 bit for fine-grained grouping, and activations are quantized to 1 bit with a 4-fold increase in the number of channels. For weight quantization, we propose Hessian-aware fine-grained grouping together with an EM-based quantization scheme. For activation quantization, we equivalently decompose INT4-quantized activations into a 4 * INT1 format and simultaneously smooth the scaling factors based on quantization errors, which further reduces activation quantization error. Our method surpasses state-of-the-art (SOTA) LLM quantization baselines on W2A4 across multiple tasks, pushing the boundaries of existing LLM quantization methods toward fully binarized models.
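The 4 * INT1 activation decomposition described in the abstract rests on an exact bit-plane identity: any 4-bit code q can be written as q = sum_{k=0}^{3} 2^k b_k with b_k in {0, 1}, so an INT4 matmul splits into four binary-activation matmuls with rescaled outputs. The sketch below (Python/NumPy) illustrates only this equivalence, assuming unsigned asymmetric quantization with a per-tensor scale and zero_point; the function and variable names are illustrative, and the paper's Hessian-aware grouping, EM-based weight quantization, and scaling-factor smoothing are not reproduced here.

```python
import numpy as np

def decompose_int4_to_binary(q_int4):
    """Rewrite unsigned INT4 codes (0..15) as four binary (INT1) bit planes,
    so that q = sum_k 2**k * b_k with b_k in {0, 1}.
    Returns an array of shape (4, *q_int4.shape), LSB plane first."""
    return np.stack([(q_int4 >> k) & 1 for k in range(4)], axis=0)

rng = np.random.default_rng(0)
q = rng.integers(0, 16, size=(2, 8))            # toy INT4 activation codes
b = decompose_int4_to_binary(q)
assert np.array_equal(sum((1 << k) * b[k] for k in range(4)), q)

# A linear layer applied to the dequantized activation
#     x_hat = scale * (q - zero_point),   y = x_hat @ W.T
# then splits into four binary-activation matmuls plus a constant correction:
#     y = sum_k scale * 2**k * (b_k @ W.T) - scale * zero_point * (1 @ W.T)
scale, zero_point = 0.1, 8                      # assumed per-tensor parameters
W = rng.standard_normal((4, 8)).astype(np.float32)
y_ref = (scale * (q.astype(np.float32) - zero_point)) @ W.T
y_bin = sum(scale * (1 << k) * (b[k].astype(np.float32) @ W.T) for k in range(4))
y_bin -= scale * zero_point * (np.ones_like(q, dtype=np.float32) @ W.T)
assert np.allclose(y_ref, y_bin, atol=1e-5)
```

Under this rewriting, each INT4 channel becomes four binary channels whose scaling factors differ by powers of two, which is one way to read the "4-fold increase in the number of channels" in the W(1+1)A(1*4) configuration.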

Siqing Song, Chuang Wang, Ruiqi Wang, Yi Yang, Xuyao Zhang

Subject: Computing Technology, Computer Technology

Siqing Song, Chuang Wang, Ruiqi Wang, Yi Yang, Xuyao Zhang. Achieving binary weight and activation for LLMs using Post-Training Quantization [EB/OL]. (2025-04-07) [2025-04-30]. https://arxiv.org/abs/2504.05352.
