
Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism

Source: arXiv
Abstract

In-situ LLM inference on end-user devices has gained significant interest due to its privacy benefits and reduced dependence on external infrastructure. However, because the decoding process is memory-bandwidth-bound, the diverse processing units in modern end-user devices cannot be fully exploited, resulting in slow LLM inference. This paper presents Ghidorah, an LLM inference system for end-user devices with a unified memory architecture. The key idea of Ghidorah can be summarized in two steps: 1) leveraging speculative decoding to enhance parallelism, and 2) distributing workloads across multiple heterogeneous processing units to maximize utilization of the available compute. Ghidorah comprises the hetero-core model parallelism (HCMP) architecture and the architecture-aware profiling (ARCA) approach. The HCMP architecture guides partitioning by leveraging the unified memory design of end-user devices and adapting to the hybrid computational demands of speculative decoding. The ARCA approach determines the optimal speculative strategy and partitioning strategy, balancing the acceptance rate against parallel capability to maximize speedup. Additionally, we optimize sparse computation on ARM CPUs. Experimental results show that Ghidorah achieves up to a 7.6x speedup in the dominant LLM decoding phase over sequential decoding on the NVIDIA Jetson NX.
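The speculative-decoding loop that Ghidorah builds on can be sketched as follows: a cheap draft model autoregressively proposes k candidate tokens, and the target model verifies all of them in one pass, keeping the longest agreeing prefix. This is a minimal toy sketch, not Ghidorah's implementation; the `draft_next` and `target_next` functions are stand-ins for real models, and in a real system the verification step is a single batched forward pass of the target model.

```python
def draft_next(ctx):
    # Toy draft model: usually agrees with the target, but drifts
    # whenever the true next token is a multiple of 4.
    t = ctx[-1] + 1
    return t + 1 if t % 4 == 0 else t

def target_next(ctx):
    # Toy target model: deterministic "ground truth" next token.
    return ctx[-1] + 1

def speculative_decode(prompt, steps, k=4):
    """Generate `steps` tokens, proposing k draft tokens per verification pass."""
    out = list(prompt)
    while len(out) - len(prompt) < steps:
        # 1) Draft phase: autoregressively propose k candidate tokens.
        drafts, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            drafts.append(t)
            ctx.append(t)
        # 2) Verification phase: the target checks every candidate;
        #    accept the agreeing prefix, correct the first mismatch, stop.
        accepted, ctx = [], list(out)
        for t in drafts:
            expect = target_next(ctx)
            if t == expect:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(expect)
                break
        out.extend(accepted)
    return out[len(prompt):][:steps]
```

When the draft agrees with the target, each verification pass emits several tokens instead of one, which is what turns the memory-bandwidth-bound decode loop into the parallelizable workload that HCMP splits across heterogeneous cores.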

Jinhui Wei, Ye Huang, Yuhui Zhou, Jiazhi Jiang, Jiangsu Du, Yutong Lu

Subjects: Computing Technology; Computer Technology

Jinhui Wei, Ye Huang, Yuhui Zhou, Jiazhi Jiang, Jiangsu Du, Yutong Lu. Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism [EB/OL]. (2025-05-29) [2025-07-16]. https://arxiv.org/abs/2505.23219.
