National Preprint Platform

UltraSketchLLM: Saliency-Driven Sketching for Ultra-Low Bit LLM Compression

Source: arXiv

Abstract

The rapid growth of large language models (LLMs) has outpaced the memory constraints of edge devices, necessitating extreme weight compression beyond the 1-bit limit. While quantization reduces model size, it is fundamentally limited to 1 bit per weight. Existing multiple-to-one compression methods either rely on mapping tables (inducing memory overhead) or incur severe accuracy degradation due to random weight grouping. We introduce UltraSketchLLM, an index-free, sketch-based framework that achieves ultra-low bit compression (down to 0.5 bits per weight) while preserving model performance. UltraSketchLLM leverages data sketching, a sub-linear representation technique from streaming applications, to map multiple weights to single values with bounded error. Our approach integrates an underestimate AbsMaxMin sketch to minimize relative errors for small weights, importance-aware space allocation to prioritize salient weights, and a straight-through estimator for compression-aware finetuning. Experiments on Llama-3.2-1B demonstrate up to 0.5-bit compression with competitive perplexity, alongside tolerable latency overhead. UltraSketchLLM offers a practical solution for deploying LLMs in resource-constrained environments.
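To make the core idea concrete, the following is a minimal, hypothetical sketch of index-free many-to-one weight compression in the spirit described above — not the paper's actual AbsMaxMin sketch. Each weight's bucket is recomputed from its index by a fixed hash (so no mapping table is stored), and each bucket keeps the value of smallest absolute magnitude among the weights hashed to it, an underestimate rule meant to keep relative error low for small weights. The function names and the single-hash design are illustrative assumptions.

```python
import numpy as np

def sketch_compress(weights, n_buckets, seed=0):
    """Map many weights to few shared buckets (index-free: the bucket
    of weight i is recomputed from i and the seed, so no per-weight
    mapping table is stored). Illustrative underestimate rule: each
    bucket keeps the value of minimum absolute magnitude among the
    weights hashed into it."""
    flat = np.asarray(weights).ravel()
    rng = np.random.default_rng(seed)
    # Fixed pseudo-random hash: bucket id for every weight index.
    bucket_of = rng.integers(0, n_buckets, size=flat.size)
    buckets = np.full(n_buckets, np.inf)
    for i, w in enumerate(flat):
        b = bucket_of[i]
        if abs(w) < abs(buckets[b]):
            buckets[b] = w
    buckets[np.isinf(buckets)] = 0.0  # buckets no weight mapped to
    return buckets, bucket_of

def sketch_decompress(buckets, bucket_of, shape):
    """Reconstruct: every weight reads back its bucket's shared value."""
    return buckets[bucket_of].reshape(shape)
```

With `n_buckets` set to, say, 1/16 of the weight count at 8 bits per bucket, the effective storage is 0.5 bits per weight; by construction the reconstruction never overestimates any weight's magnitude. The paper's actual method further adds importance-aware space allocation and compression-aware fine-tuning, which this toy example omits.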

Sunan Zou, Ziyun Zhang, Xueting Sun, Guojie Luo

Subject: Computing Technology; Computer Technology

Sunan Zou, Ziyun Zhang, Xueting Sun, Guojie Luo. UltraSketchLLM: Saliency-Driven Sketching for Ultra-Low Bit LLM Compression [EB/OL]. (2025-06-08) [2025-07-01]. https://arxiv.org/abs/2506.17255.
