
NestQuant: Nested Lattice Quantization for Matrix Products and LLMs


Source: arXiv
English Abstract

Post-training quantization (PTQ) has emerged as a critical technique for the efficient deployment of large language models (LLMs). This work proposes NestQuant, a novel PTQ scheme for weights and activations that is based on self-similar nested lattices. Recent works have mathematically shown such quantizers to be information-theoretically optimal for low-precision matrix multiplication. We implement a practical low-complexity version of NestQuant based on the Gosset lattice, making it a drop-in quantizer for any matrix multiplication step (e.g., in self-attention, MLP, etc.). For example, NestQuant quantizes the weights, KV-cache, and activations of Llama-3-8B to 4 bits, achieving a perplexity of 6.6 on wikitext2. This represents a more than 55% reduction in the perplexity gap with respect to the unquantized model (perplexity 6.14), compared to the state-of-the-art Meta's SpinQuant (perplexity 7.3), OstQuant (7.3), and QuaRot (8.2). Comparisons on bigger models (up to 70B) and on various LLM evaluation benchmarks confirm the uniform superiority of NestQuant.
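The abstract does not spell out the encoder, but the following minimal Python/NumPy sketch illustrates the two ingredients it names: nearest-point rounding to the Gosset (E8) lattice and self-similar nesting, in which a scaled copy of the same lattice serves as the coarse codebook. The 8-dimensional blocking, the nesting ratio q, the scale parameter, and the modulo reduction are illustrative assumptions for exposition, not the paper's actual NestQuant implementation.

import numpy as np

def nearest_D8(x):
    # Nearest point of D8 = {v in Z^8 : sum(v) even}: round every coordinate,
    # then, if the parity is odd, re-round the worst coordinate the other way.
    f = np.round(x)
    if int(f.sum()) % 2 != 0:
        i = np.argmax(np.abs(x - f))
        f[i] += 1.0 if x[i] >= f[i] else -1.0
    return f

def nearest_E8(x):
    # Gosset lattice E8 = D8 ∪ (D8 + 1/2): pick the closer of the two cosets.
    c0 = nearest_D8(x)
    c1 = nearest_D8(x - 0.5) + 0.5
    return c0 if np.sum((x - c0) ** 2) <= np.sum((x - c1) ** 2) else c1

def nested_e8_quantize(w, q=16, scale=1.0):
    # Toy nested-lattice quantizer (hypothetical parameters): each 8-dim block
    # is mapped to the nearest point of scale*E8 (fine lattice) and then reduced
    # modulo q*scale*E8 (coarse lattice), giving q**8 codewords per block,
    # i.e. log2(q) bits per dimension (q=16 -> 4 bits).
    w = np.asarray(w, dtype=np.float64)
    assert w.size % 8 == 0, "length must be a multiple of the lattice dimension"
    blocks = w.reshape(-1, 8)
    out = np.empty_like(blocks)
    for k, b in enumerate(blocks):
        fine = scale * nearest_E8(b / scale)
        coarse = q * scale * nearest_E8(fine / (q * scale))
        out[k] = fine - coarse  # representative inside the coarse Voronoi region
    return out.reshape(w.shape)

In a real deployment the per-block scale would be calibrated from the data and outlier/overload handling would be needed; the sketch only shows that both quantization and the nesting reduction reuse the same cheap E8 nearest-point routine, which is why such a quantizer can be dropped into any matrix multiplication step.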

Semyon Savkin, Eitan Porat, Or Ordentlich, Yury Polyanskiy

Computing Technology; Computer Technology

Semyon Savkin, Eitan Porat, Or Ordentlich, Yury Polyanskiy. NestQuant: Nested Lattice Quantization for Matrix Products and LLMs [EB/OL]. (2025-07-26) [2025-08-16]. https://arxiv.org/abs/2502.09720.
