
MNN-LLM: A Generic Inference Engine for Fast Large Language Model Deployment on Mobile Devices

Source: arXiv
Abstract

Large language models (LLMs) have demonstrated exceptional performance across a variety of tasks. However, their substantial scale leads to significant computational resource consumption during inference, resulting in high costs. Consequently, edge device inference presents a promising solution. The primary challenges of edge inference include memory usage and inference speed. This paper introduces MNN-LLM, a framework specifically designed to accelerate the deployment of large language models on mobile devices. MNN-LLM addresses the runtime characteristics of LLMs through model quantization and DRAM-Flash hybrid storage, effectively reducing memory usage. It rearranges weights and inputs based on mobile CPU instruction sets and GPU characteristics, while employing strategies such as multicore load balancing, mixed-precision floating-point operations, and geometric computations to enhance performance. Notably, MNN-LLM achieves up to an 8.6x speedup compared to current mainstream LLM-specific frameworks.
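The abstract names model quantization as one of the levers MNN-LLM uses to cut memory. As a minimal sketch of the general idea only, not MNN-LLM's actual kernels, the C++ snippet below applies per-block symmetric int8 quantization to a weight block; the block size, rounding choice, and the helper name `quantize_block` are illustrative assumptions.

```cpp
// Sketch: per-block symmetric int8 weight quantization.
// Assumption-laden illustration; MNN-LLM's real scheme (block size,
// bit width, memory layout) may differ.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Quantize one block of fp32 weights to int8 with a single fp32 scale.
// Storage drops roughly 4x versus fp32, plus one scale per block.
static void quantize_block(const float* w, int n, int8_t* q, float* scale) {
    float max_abs = 0.f;
    for (int i = 0; i < n; ++i) max_abs = std::fmax(max_abs, std::fabs(w[i]));
    *scale = max_abs > 0.f ? max_abs / 127.f : 1.f;
    for (int i = 0; i < n; ++i) {
        q[i] = static_cast<int8_t>(std::lround(w[i] / *scale));
    }
}

int main() {
    const int kBlock = 8;  // hypothetical block size
    std::vector<float> weights = {0.12f, -0.56f, 0.98f, -0.33f,
                                  0.07f, 0.44f, -0.91f, 0.25f};
    std::vector<int8_t> quant(kBlock);
    float scale = 0.f;
    quantize_block(weights.data(), kBlock, quant.data(), &scale);

    // Dequantize to inspect reconstruction error.
    for (int i = 0; i < kBlock; ++i) {
        float deq = quant[i] * scale;
        std::printf("w=%+.3f  q=%4d  deq=%+.3f\n", weights[i], quant[i], deq);
    }
    return 0;
}
```

Quantized storage of this kind is only one part of the memory story; per the abstract, MNN-LLM combines it with DRAM-Flash hybrid storage and with weight/input rearrangement matched to mobile CPU instruction sets and GPU characteristics.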

Zhaode Wang, Jingbang Yang, Xinyu Qian, Shiwen Xing, Xiaotang Jiang, Chengfei Lv, Shengyu Zhang

DOI: 10.1145/3700410.3702126

Computing Technology, Computer Technology

Zhaode Wang, Jingbang Yang, Xinyu Qian, Shiwen Xing, Xiaotang Jiang, Chengfei Lv, Shengyu Zhang. MNN-LLM: A Generic Inference Engine for Fast Large Language Model Deployment on Mobile Devices [EB/OL]. (2025-06-12) [2025-06-30]. https://arxiv.org/abs/2506.10443.
