
ELUTQ: Optimizing Quantization Accuracy under LUT-Based Computation for Edge LLMs

Xin Nie, Liang Dong, Haicheng Zhang, Jiawang Xiao, G. Sun

Abstract

Weight quantization effectively reduces memory consumption and enables the deployment of Large Language Models on edge devices, yet existing hardware-friendly methods often rely on uniform quantization, which fits weight distributions poorly and incurs high dequantization overhead at low bit-widths. In this paper, we propose ELUTQ, an efficient quantization framework featuring a novel quantization format termed Hierarchical Linear Quantization (HLQ). HLQ is designed to better capture the statistical characteristics of weights and to eliminate dequantization overhead via bit-serial LUT-based GEMM operations. HLQ significantly improves model accuracy at low bit-widths and achieves performance comparable to quantization-aware training (QAT) methods without any weight retraining. Moreover, ELUTQ integrates an optimized quantization pipeline that quantizes LLaMA-3.1-70B with only 64 GB of CPU memory and 48 GB of VRAM, lowering the hardware requirements for large-scale model quantization. To enable efficient deployment on edge devices, ELUTQ provides high-performance kernels that support end-to-end inference. Our 2-bit LLaMA-3.1-8B achieves a 1.5x speedup over AWQ on an RTX 3090. Code is available at https://github.com/Nkniexin/ELUTQ.

Citation

Xin Nie, Liang Dong, Haicheng Zhang, Jiawang Xiao, G. Sun. ELUTQ: Optimizing Quantization Accuracy under LUT-Based Computation for Edge LLMs [EB/OL]. (2026-02-01) [2026-04-05]. https://arxiv.org/abs/2510.19482.

Subject Classification

Computing technology; computer technology

First published: 2026-02-01