NestQuant: Nested Lattice Quantization for Matrix Products and LLMs
Post-training quantization (PTQ) has emerged as a critical technique for efficient deployment of large language models (LLMs). This work proposes NestQuant, a novel PTQ scheme for weights and activations that is based on self-similar nested lattices. Recent works have mathematically shown such quantizers to be information-theoretically optimal for low-precision matrix multiplication. We implement a practical low-complexity version of NestQuant based on the Gosset lattice, making it a drop-in quantizer for any matrix multiplication step (e.g., in self-attention, MLP, etc.). For example, NestQuant quantizes the weights, KV-cache, and activations of Llama-3-8B to 4 bits, achieving a perplexity of 6.6 on WikiText-2. This is a more than 55% reduction in the perplexity gap relative to the unquantized model (perplexity 6.14), compared to the state-of-the-art Meta SpinQuant (perplexity 7.3), OstQuant (7.3), and QuaRot (8.2). Comparisons on larger models (up to 70B) and on various LLM evaluation benchmarks confirm the uniform superiority of NestQuant.
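To make the ingredients named in the abstract concrete, below is a minimal NumPy sketch of a self-similar nested quantizer built on the Gosset lattice E8, using the standard D8-based nearest-point search. This is an illustrative toy, not the paper's implementation; the nesting ratio `q` and scale `beta` are hypothetical parameters chosen for the example.

```python
import numpy as np

def closest_point_Dn(x):
    """Nearest point in D_n = {v in Z^n : sum(v) is even}."""
    f = np.rint(x)
    if int(f.sum()) % 2 != 0:
        # restore even parity by re-rounding the coordinate with the
        # largest rounding error toward its second-nearest integer
        idx = int(np.argmax(np.abs(x - f)))
        f[idx] += 1.0 if x[idx] >= f[idx] else -1.0
    return f

def closest_point_E8(x):
    """Nearest point in the Gosset lattice E8 = D8 ∪ (D8 + (1/2, ..., 1/2))."""
    c0 = closest_point_Dn(x)
    c1 = closest_point_Dn(x - 0.5) + 0.5
    return c0 if np.sum((x - c0) ** 2) <= np.sum((x - c1) ** 2) else c1

def nested_quantize(x, q=16, beta=0.25):
    """Toy self-similar nested-lattice quantizer for an 8-dim block x:
    quantize to the fine lattice beta*E8, then reduce modulo the coarse
    lattice (q*beta)*E8, keeping a coset representative.
    Rate is roughly log2(q) bits per dimension (q and beta are illustrative)."""
    y = closest_point_E8(x / beta)       # fine-lattice codeword
    y -= q * closest_point_E8(y / q)     # modulo the self-similar coarse lattice
    return beta * y

# Example: quantize one 8-wide block of a weight row
rng = np.random.default_rng(0)
w = rng.standard_normal(8)
print(w)
print(nested_quantize(w))
```

In an actual matrix-multiplication pipeline, both operands would be quantized block-by-block in this fashion and the inner products computed over the lattice codewords; handling overload (inputs falling outside the coarse Voronoi cell) and choosing the scales is where the practical scheme differs from this sketch.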
Semyon Savkin, Eitan Porat, Or Ordentlich, Yury Polyanskiy
Computing technology; computer technology
Semyon Savkin, Eitan Porat, Or Ordentlich, Yury Polyanskiy. NestQuant: Nested Lattice Quantization for Matrix Products and LLMs [EB/OL]. (2025-07-26) [2025-08-16]. https://arxiv.org/abs/2502.09720