
SlimCaching: Edge Caching of Mixture-of-Experts for Distributed Inference

Source: arXiv
English Abstract

Mixture-of-Experts (MoE) models improve the scalability of large language models (LLMs) by activating only a small subset of relevant experts per input. However, the sheer number of expert networks in an MoE model introduces a significant storage burden for an edge device. To address this challenge, we consider a scenario where experts are dispersed within an edge network for distributed inference. Based on the popular Top-$K$ expert selection strategy, we formulate a latency minimization problem by optimizing expert caching on edge servers under storage constraints. When $K=1$, the problem reduces to a monotone submodular maximization problem with knapsack constraints, for which we design a greedy-based algorithm with a $(1 - 1/e)$-approximation guarantee. For the general case where $K\geq1$, expert co-activation within the same MoE layer introduces non-submodularity, causing greedy methods to be ineffective. To tackle this issue, we propose a successive greedy decomposition method to decompose the original problem into a series of subproblems, with each being solved by a dynamic programming approach. Furthermore, we design an accelerated algorithm based on the max-convolution technique to obtain the approximate solution with a provable guarantee in polynomial time. Simulation results on various MoE models demonstrate that our method significantly reduces inference latency compared to existing baselines.
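The abstract describes the $K=1$ case as a monotone submodular maximization problem with knapsack constraints solved by a greedy algorithm. As a rough illustration only, the sketch below shows a generic cost-benefit greedy for placing experts on storage-constrained edge servers; all names (greedy_expert_caching, gain, size, capacity) are illustrative assumptions and not the paper's implementation, and the full $(1-1/e)$ guarantee mentioned in the abstract is typically obtained by combining such a greedy with additional steps (e.g., partial enumeration), which are omitted here.

```python
# Hypothetical sketch (not the paper's code): cost-benefit greedy for caching
# experts on edge servers under per-server storage budgets, loosely matching
# the K=1 setting in the abstract. The marginal-gain function `gain` is
# supplied by the caller and assumed monotone submodular in the cached set.

from itertools import product

def greedy_expert_caching(experts, servers, capacity, size, gain):
    """
    experts  : list of expert ids
    servers  : list of server ids
    capacity : dict server -> remaining storage budget
    size     : dict expert -> storage cost of caching that expert (positive)
    gain     : function(cached_set, (expert, server)) -> marginal latency
               reduction from caching `expert` on `server` given `cached_set`
    Returns a set of (expert, server) placements chosen greedily by
    marginal gain per unit of storage.
    """
    cached = set()
    remaining = dict(capacity)
    candidates = set(product(experts, servers))
    while candidates:
        # Pick the feasible placement with the best gain-to-size ratio.
        best, best_ratio = None, 0.0
        for e, s in candidates:
            if size[e] > remaining[s]:
                continue  # placement would exceed this server's budget
            ratio = gain(cached, (e, s)) / size[e]
            if ratio > best_ratio:
                best, best_ratio = (e, s), ratio
        if best is None or best_ratio <= 0.0:
            break  # no feasible placement with positive marginal gain
        e, s = best
        cached.add(best)
        remaining[s] -= size[e]
        candidates.discard(best)
    return cached
```

The cost-benefit ratio (marginal latency reduction per unit of storage) is the natural greedy criterion under knapsack-style constraints; for $K \geq 1$, the abstract notes that expert co-activation breaks submodularity, which is why the paper instead decomposes the problem and solves the subproblems by dynamic programming with a max-convolution acceleration.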

Qian Chen, Xianhao Chen, Kaibin Huang

Subject: Computing Technology; Computer Technology

Qian Chen, Xianhao Chen, Kaibin Huang. SlimCaching: Edge Caching of Mixture-of-Experts for Distributed Inference [EB/OL]. (2025-07-09) [2025-07-21]. https://arxiv.org/abs/2507.06567
