
ITERA-LLM: Boosting Sub-8-Bit Large Language Model Inference via Iterative Tensor Decomposition

Source: arXiv

Abstract

Recent advancements in Large Language Models (LLMs) have demonstrated impressive capabilities as their scale expands to billions of parameters. Deploying these large-scale models on resource-constrained platforms presents significant challenges, with post-training fixed-point quantization often used as a model compression technique. However, quantization-only methods typically lead to significant accuracy degradation in LLMs when precision falls below 8 bits. This paper addresses this challenge through a software-hardware co-design framework, ITERA-LLM, which integrates sub-8-bit quantization with SVD-based iterative low-rank tensor decomposition for error compensation, leading to higher compression ratios and reduced computational complexity. The proposed approach is complemented by a hardware-aware Design Space Exploration (DSE) process that optimizes accuracy, latency, and resource utilization, tailoring the configuration to the specific requirements of the targeted LLM. Our results show that ITERA-LLM reduces linear layer latency by up to 41.1% compared to a quantization-only baseline while maintaining similar model accuracy.
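The abstract outlines the core idea, sub-8-bit quantization whose residual error is compensated by an iteratively refined truncated SVD, but does not give the algorithm itself. Below is a minimal NumPy sketch of one plausible reading; the function names (`quantize`, `itera_compress`), the uniform symmetric quantizer, and the alternating refinement schedule are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def quantize(w, bits):
    # Uniform symmetric fixed-point quantization; the paper's exact
    # quantizer is not specified in the abstract (assumption).
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.max(np.abs(w)) / qmax, 1e-12)
    return scale * np.clip(np.round(w / scale), -qmax, qmax)

def itera_compress(w, bits=4, rank=8, iters=3):
    # Alternate between (1) quantizing the part of the weight that the
    # low-rank term does not cover and (2) refitting a truncated-SVD
    # low-rank term to the remaining quantization error.
    low = np.zeros_like(w)
    for _ in range(iters):
        w_q = quantize(w - low, bits)                  # sub-8-bit dense part
        u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
        a = u[:, :rank] * s[:rank]                     # (m, rank) factor
        b = vt[:rank, :]                               # (rank, n) factor
        low = a @ b                                    # error compensation
    return w_q, a, b

# Toy usage: a linear layer y = x @ w becomes x @ w_q + (x @ a) @ b.
w = np.random.randn(512, 512).astype(np.float32)
w_q, a, b = itera_compress(w, bits=4, rank=16, iters=3)
x = np.random.randn(1, 512).astype(np.float32)
y_approx = x @ w_q + (x @ a) @ b
print(np.linalg.norm(x @ w - y_approx) / np.linalg.norm(x @ w))
```

Under this reading, the dense matmul runs at sub-8-bit precision while the rank-r correction adds only two skinny matmuls, which is consistent with the linear layer latency reduction the abstract reports.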

Keran Zheng, Yinting Huang, Zhewen Yu, Christos-Savvas Bouganis

Subjects: Computing Technology; Computer Technology

Keran Zheng, Yinting Huang, Zhewen Yu, Christos-Savvas Bouganis. ITERA-LLM: Boosting Sub-8-Bit Large Language Model Inference via Iterative Tensor Decomposition [EB/OL]. (2025-05-13) [2025-06-14]. https://arxiv.org/abs/2505.08981.
