
The Gaussian Attractor in Mixture-of-Experts: Why Soft Routers Collapse and How Mosaic Tiles Solve It
Evidence from Vision MoE with Shared Encoder on Continual Learning

Zhang Qingjun (张庆君)¹

Author information

  • 1. Wuxi Taihu University

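Abstract (translated from the Chinese)

[Objective] Expert collapse, the convergence of experts toward homogeneous representations, is the core pathology of Mixture-of-Experts (MoE) models. Despite a large body of mitigation work, no unified theory explains why collapse occurs. This paper aims to establish such a theory and to propose an architecture-level solution.

[Methods] We integrate three previously independent lines of research: (1) an 18-month experimental campaign covering more than 200 MoE configurations on tasks including CIFAR-100 continual learning and text generation; (2) phase-transition dynamics in lifecycle-managed MoE systems; (3) the recent proof by Klindt, LeCun and Balestriero (2026) that the Gaussian is the unique identifiable latent distribution under alignment objectives. We build a theoretical framework that treats the cross-entropy loss as an alignment objective and validate it with 39 experimental conditions (3 random seeds). The core experiments are a six-quadrant data-dependency matrix (cross-dataset, partial-class, overlapping-class, and fine-grained tasks), the Router Freedom Theorem (an exhaustive comparison of 8 strategies, with router sizes from 16 to 4112 parameters), and three iterations of the Mosaic Tile architecture.

[Results] Key findings: (1) The expert protocol (soft routing + AR loss + freezing) fails in all five non-degenerate conditions and succeeds only in the degenerate condition of same-domain, complete-class, non-overlapping tasks (Full=64.5%, 4-seed mean). (2) Any non-zero trainable degree of freedom in the soft router (whether frozen, trainable, re-initialized per task, or given a weak AR term) is pulled toward collapse by the cross-entropy alignment force within 120 epochs; the Router Freedom Theorem establishes that zero-parameter hard routing is the only way out. (3) The Mosaic Tile architecture (20 small Linear(256,5) tiles replacing 4 large Linear(256,100) experts, with hard routing in Phase 2) reaches 82.1-93.0% Full accuracy in the 5/10/20/33-task configurations (3 seeds, 39 data points), with almost no forgetting (<3pp), zero AR loss, and zero expert deaths. (4) The overlapping-class experiment (Tile v3=83.2% vs. Expert=16.8%, a 66pp gap) provides a definitive controlled comparison. (5) The random mutually exclusive grouping experiment shows that tile exclusivity, not class contiguity, is the necessary and sufficient condition for zero forgetting (Full=76.9%, only -1.0pp T0 forgetting).

[Limitations] Validation is limited to CIFAR-100 image classification and task-incremental continual learning; language and multimodal domains have not been tested. Hard routing depends on a task-boundary oracle, which is known by definition in continual learning but needs an extension for single-task settings.

[Conclusions] Anti-Gaussian operators fall into a clear hierarchy of strength: architectural output-space isolation > boundary-condition freezing > training-time gradient perturbation. When experts do not share output dimensions, the Gaussian attractor has no gradient pathway to follow. This moves anti-collapse defense from training-time intervention to structural design and opens a new direction for MoE stability research.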
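The tile layout quoted in the abstract (20 Linear(256,5) tiles in place of 4 Linear(256,100) experts) can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: it assumes each Linear layer carries a bias vector and that tile t covers the contiguous class block [5t, 5t+5), neither of which the abstract states explicitly. The helper names `linear_params`, `tile_for_class`, and `tile_class_range` are hypothetical.

```python
# Illustrative sketch, not the author's code. Assumptions: every Linear layer
# has a bias, and tile t owns the contiguous class block [5t, 5t + 5).
NUM_CLASSES = 100   # CIFAR-100
FEATURE_DIM = 256   # input width of Linear(256, .)
TILE_WIDTH = 5      # output width of one tile, Linear(256, 5)
NUM_EXPERTS = 4     # large experts, each Linear(256, 100)
NUM_TILES = NUM_CLASSES // TILE_WIDTH


def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Number of parameters in one dense layer."""
    return in_features * out_features + (out_features if bias else 0)


def tile_for_class(label: int) -> int:
    """Hard routing by class range: map a class label to its tile index."""
    if not 0 <= label < NUM_CLASSES:
        raise ValueError(f"label {label} outside [0, {NUM_CLASSES})")
    return label // TILE_WIDTH


def tile_class_range(tile: int) -> range:
    """Class labels owned by a tile; ranges of different tiles never overlap."""
    if not 0 <= tile < NUM_TILES:
        raise ValueError(f"tile {tile} outside [0, {NUM_TILES})")
    return range(tile * TILE_WIDTH, (tile + 1) * TILE_WIDTH)


expert_head_params = NUM_EXPERTS * linear_params(FEATURE_DIM, NUM_CLASSES)
tile_head_params = NUM_TILES * linear_params(FEATURE_DIM, TILE_WIDTH)

if __name__ == "__main__":
    print("tiles:", NUM_TILES)
    print("expert head parameters:", expert_head_params)
    print("tile head parameters:", tile_head_params)
    print("class 42 -> tile", tile_for_class(42))
```

Under these assumptions the tiled head has a quarter of the parameters of the four-expert head, because the 20 tiles together cover the 100 outputs exactly once, while each of the 4 experts covers all 100.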

Abstract

[Objective] Expert collapse, the convergence of experts to homogeneous representations, is the central pathology of Mixture-of-Experts (MoE) architectures. Despite extensive empirical mitigation, no unified account of why collapse occurs has been developed. This paper establishes a unified theory and proposes an architectural solution, validated across 39 experimental conditions spanning 200+ MoE configurations.

[Methods] We synthesize three previously disconnected lines of evidence: an 18-month experimental campaign across 200+ MoE configurations on CIFAR-100 continual learning and text generation; phase-transition dynamics in lifecycle-managed MoE systems; and the proof by Klindt, LeCun & Balestriero (2026) that the Gaussian is the unique identifiable latent distribution under alignment objectives. The framework is validated through: (1) a six-quadrant data-dependency matrix (cross-dataset, partial-class, overlapping-class, and fine-grained tasks); (2) the Router Freedom Theorem, an exhaustive eight-strategy ablation across router capacities from 16 to 4112 parameters; and (3) a three-version iteration of the Mosaic Tile architecture spanning 13 experimental conditions across 3 random seeds.

[Results] Key findings: (1) The expert protocol (soft router + AR loss + freeze) fails in five of six conditions, succeeding only in the degenerate case of same-domain, complete-class, non-overlapping tasks (Full=64.5%, 4-seed mean). (2) Any non-zero trainable freedom in the soft router during continual learning (whether frozen, trainable, per-task reinitialized, or weakly AR-protected) collapses to <27% Full within one task duration; the per-task CE alignment slope is invariant to router capacity across [16, 4112] parameters. Only two configurations survive: router frozen and router eliminated (zero-parameter hard routing). (3) The Mosaic Tile architecture replaces 4 large experts (Linear(256,100)) with 20 small tiles (Linear(256,5)) using hard routing by class range, achieving 82.1-93.0% Full accuracy across 5/10/20/33-task configurations (3 seeds, 39 data points) with negligible forgetting (<3pp), zero AR loss, and zero expert deaths. (4) The overlapping-class experiment provides the definitive controlled comparison: Tile v3=83.2% vs. Expert=16.8% (a 66pp gap). (5) A random mutually exclusive grouping experiment (2 seeds) establishes that tile exclusivity, not class contiguity, is the necessary and sufficient condition for zero forgetting (Full=76.9%, only -1.0pp T0 forgetting).

[Limitations] Validation is restricted to CIFAR-100 image classification and task-incremental continual learning. Generalization to language, multimodal domains, and larger-scale architectures remains untested. Hard routing requires a task-boundary oracle, which is naturally given in continual learning but requires extension for single-task settings.

[Conclusions] A hierarchy of anti-Gaussian operator strength emerges: architectural output-space isolation > boundary-condition freezing > training-time gradient perturbation. When experts do not share output dimensions, the Gaussian attractor has no gradient pathway to operate, which shifts anti-collapse defense from training-time interventions to structural design. This opens a new research direction: what other structural properties can serve as built-in anti-Gaussian operators, eliminating auxiliary losses entirely?
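The claim that disjoint output dimensions leave no gradient pathway can be illustrated with the standard softmax cross-entropy gradient. The derivation below is a sketch under assumed notation, not the paper's proof: the symbols h, g_k, W_k, W_t, and the choice to mix experts at the logit level are all assumptions introduced here for illustration.

```latex
% Assumed setup (illustrative, not taken from the paper):
%   h \in \mathbb{R}^{256}               shared-encoder feature
%   g_k(h) \ge 0, \ \sum_k g_k(h) = 1    soft gate of expert k
%   W_k \in \mathbb{R}^{100 \times 256}  expert heads, all writing to the same 100 outputs
%   W_t \in \mathbb{R}^{5 \times 256}    tile heads, each writing to its own 5 outputs
%   e_y                                  one-hot vector of the true class y
\begin{align}
  \text{soft routing:}\quad
  z &= \sum_{k} g_k(h)\, W_k h, \qquad
  p = \operatorname{softmax}(z), \qquad
  \mathcal{L} = -\log p_y, \\
  \frac{\partial \mathcal{L}}{\partial W_k}
  &= g_k(h)\,(p - e_y)\, h^{\top}
  \quad \text{for every expert } k, \\
  \text{hard-routed tiles:}\quad
  z^{(\tau)} &= W_\tau h, \qquad
  p^{(\tau)} = \operatorname{softmax}\bigl(z^{(\tau)}\bigr), \qquad
  \mathcal{L} = -\log p^{(\tau)}_{y}, \\
  \frac{\partial \mathcal{L}}{\partial W_t}
  &= \delta_{t\tau}\,\bigl(p^{(\tau)} - e_y\bigr)\, h^{\top}
  \quad \Longrightarrow \quad
  \frac{\partial \mathcal{L}}{\partial W_t} = 0 \ \text{ for } t \neq \tau .
\end{align}
```

In the soft-routed case every expert with a non-zero gate receives the same gradient direction, scaled only by its gate value, which is one way to read the homogenizing pull described above. In the tiled case the heads of inactive tiles receive exactly zero gradient. The shared encoder still receives gradient from every task, so this isolation argument applies to the output heads only.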

Keywords

Mixture-of-Experts, expert collapse, Gaussian attractor, continual learning, anti-redundancy loss, Mosaic Tile, hard routing, catastrophic forgetting, anti-Gaussian operators, Router Freedom Theorem

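Cite this article

Zhang Qingjun. The Gaussian Attractor in Mixture-of-Experts: Why Soft Routers Collapse and How Mosaic Tiles Solve It: Evidence from Vision MoE with Shared Encoder on Continual Learning [EB/OL]. (2026-06-10) [2026-06-11]. https://sinoxiv.napstic.cn/article/25960315.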

Subject classification

Computing technology, computer technology
First published: 2026-06-10 17:15:49