National Preprint Platform


DOI: 10.12383/202606100001V1


The Gaussian Attractor in Mixture-of-Experts: Why Soft Routers Collapse and How Mosaic Tiles Solve It. Evidence from Vision MoE with Shared Encoder on Continual Learning

张庆君¹

Author information

1. Wuxi Taihu University

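Abstract (translated from the Chinese)

[Objective] Expert collapse, the convergence of experts toward homogeneous representations, is the core pathology of Mixture-of-Experts (MoE) models. Despite extensive mitigation work, no unified theory explains why collapse occurs. This paper aims to establish such a theory and to propose an architecture-level solution.

[Methods] We integrate three previously independent lines of research: (1) an 18-month experimental campaign covering more than 200 MoE configurations on tasks including CIFAR-100 continual learning and text generation; (2) phase-transition dynamics in lifecycle-managed MoE systems; (3) the recent proof by Klindt, LeCun, and Balestriero (2026) that the Gaussian is the only identifiable latent distribution under alignment objectives. We build a theoretical framework that treats the cross-entropy loss as an alignment objective and validate it across 39 experimental conditions (3 random seeds). The core experiments are a six-quadrant data-dependency matrix (cross-dataset, partial-class, overlapping-class, and fine-grained tasks), the Router Freedom Theorem (an exhaustive comparison of 8 strategies, with router parameter counts from 16 to 4112), and three iterations of the Mosaic Tile architecture.

[Results] The key findings are: (1) The expert protocol (soft routing + anti-redundancy (AR) loss + freezing) fails in all five non-degenerate conditions and succeeds only in the degenerate condition of same-domain, complete-class-distribution, non-overlapping tasks (Full=64.5%, 4-seed mean). (2) Any non-zero trainable degree of freedom in the soft router (whether frozen, trainable, re-initialized per task, or combined with weak AR) is pulled toward collapse by the cross-entropy alignment force within 120 epochs; the Router Freedom Theorem proves that zero-parameter hard routing is the only way out. (3) The Mosaic Tile architecture (20 small Linear(256,5) tiles replacing 4 large Linear(256,100) experts, with hard routing in Phase 2) reaches 82.1-93.0% Full accuracy under 5/10/20/33-task configurations (3 seeds, 39 data points), with near-zero forgetting (<3pp), zero AR loss, and zero expert deaths. (4) The overlapping-class experiment (Tile v3=83.2% vs. Expert=16.8%, a 66pp gap) provides a decisive controlled comparison. (5) The random mutually exclusive grouping experiment shows that tile exclusivity, rather than class contiguity, is the necessary and sufficient condition for zero forgetting (Full=76.9%, only -1.0pp T0 forgetting).

[Limitations] Validation is limited to CIFAR-100 image classification and task-incremental continual learning; language and multimodal domains have not yet been tested. Hard routing relies on a task-boundary oracle (naturally known under the definition of continual learning, but single-task settings require an extension).

[Conclusions] Anti-Gaussian operator strength follows a clear hierarchy: architectural output-space isolation > boundary-condition freezing > training-time gradient perturbation. When experts do not share output dimensions, the Gaussian attractor has no gradient pathway to follow. This moves anti-collapse defense from training-time intervention to structural design and opens a new direction for MoE stability research.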

Abstract

[Objective] Expert collapse, the convergence of experts to homogeneous representations, is the central pathology of Mixture-of-Experts (MoE) architectures. Despite extensive empirical mitigation, no unified account of why collapse occurs has been developed. This paper establishes a unified theory and proposes an architectural solution, validated across 39 experimental conditions spanning 200+ MoE configurations.

[Methods] We synthesize three previously disconnected lines of evidence: 18 months of experiments across 200+ MoE configurations on CIFAR-100 continual learning and text generation; phase-transition dynamics in lifecycle-managed MoE systems; and the proof by Klindt, LeCun & Balestriero (2026) that the Gaussian is the unique identifiable latent distribution under alignment objectives. The framework is validated through: (1) a six-quadrant data dependency matrix (cross-dataset, partial-class, overlapping-class, and fine-grained tasks); (2) the Router Freedom Theorem, an exhaustive eight-strategy ablation across router capacities from 16 to 4112 parameters; and (3) a three-version iteration of the Mosaic Tile architecture spanning 13 experimental conditions across 3 random seeds.

[Results] Key findings: (1) The expert protocol (soft router + AR loss + freeze) fails in five of six conditions, succeeding only in the degenerate case of same-domain, complete-class, non-overlapping tasks (Full=64.5%, 4-seed mean). (2) Any non-zero trainable freedom in the soft router during continual learning (whether frozen, trainable, per-task reinitialized, or weakly AR-protected) collapses to <27% Full within one task duration; the per-task CE alignment slope is invariant to router capacity across [16, 4112] parameters. Only two configurations survive: router frozen and router eliminated (zero-parameter hard routing). (3) The Mosaic Tile architecture replaces 4 large experts (Linear(256,100)) with 20 small tiles (Linear(256,5)) using hard routing by class range, achieving 82.1-93.0% Full accuracy across 5/10/20/33-task configurations (3 seeds, 39 data points) with negligible forgetting (<3pp), zero AR loss, and zero expert deaths. (4) The overlapping-class experiment provides the definitive controlled comparison: Tile v3=83.2% vs. Expert=16.8% (a 66pp gap). (5) A random mutually exclusive grouping experiment (2 seeds) establishes that tile exclusivity, not class contiguity, is the necessary and sufficient condition for zero forgetting (Full=76.9%, only -1.0pp T0 forgetting).

[Limitations] Validation is restricted to CIFAR-100 image classification and task-incremental continual learning. Generalization to language, multimodal domains, and larger-scale architectures remains untested. Hard routing requires a task-boundary oracle, which is naturally given in continual learning but requires extension for single-task settings.

[Conclusions] A hierarchy of anti-Gaussian operator strength emerges: architectural output-space isolation > boundary-condition freezing > training-time gradient perturbation. When experts do not share output dimensions, the Gaussian attractor has no gradient pathway through which to operate, which shifts anti-collapse defense from training-time interventions to structural design. This opens a new research direction: what other structural properties can serve as built-in anti-Gaussian operators, eliminating auxiliary losses entirely?
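The output-space isolation argument can be illustrated with a minimal standard-library sketch. It is not the author's implementation: the shared encoder is replaced by a random 256-dimensional feature vector, the contiguous mapping `label // 5` is an assumed reading of "hard routing by class range", the softmax is assumed to be taken over the routed tile's five logits only, and every function name here is hypothetical. Parameter counts assume each Linear layer carries a bias term.

```python
import math
import random

NUM_CLASSES = 100
NUM_TILES = 20
TILE_WIDTH = NUM_CLASSES // NUM_TILES  # 5 outputs per tile, as in Linear(256, 5)
FEATURE_DIM = 256


def tile_for_class(label):
    """Hard routing by class range (assumed contiguous): no router parameters."""
    return label // TILE_WIDTH


def linear_param_count(in_dim, out_dim):
    """Weights plus bias of one Linear(in_dim, out_dim) layer (bias assumed)."""
    return in_dim * out_dim + out_dim


def make_tile(rng):
    """One tile: a TILE_WIDTH x FEATURE_DIM weight matrix and a bias vector."""
    weights = [[rng.gauss(0.0, 0.01) for _ in range(FEATURE_DIM)]
               for _ in range(TILE_WIDTH)]
    bias = [0.0] * TILE_WIDTH
    return weights, bias


def tile_logits(tile, features):
    weights, bias = tile
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sgd_step(tiles, features, label, lr=0.1):
    """One cross-entropy SGD step on the routed tile's own logits only."""
    idx = tile_for_class(label)
    weights, bias = tiles[idx]
    probs = softmax(tile_logits(tiles[idx], features))
    local = label % TILE_WIDTH  # position of the class inside its tile
    for k in range(TILE_WIDTH):
        grad_logit = probs[k] - (1.0 if k == local else 0.0)
        bias[k] -= lr * grad_logit
        row = weights[k]
        for j in range(FEATURE_DIM):
            row[j] -= lr * grad_logit * features[j]
    return idx


rng = random.Random(0)
tiles = [make_tile(rng) for _ in range(NUM_TILES)]
snapshot = [([row[:] for row in w], b[:]) for w, b in tiles]

features = [rng.gauss(0.0, 1.0) for _ in range(FEATURE_DIM)]  # stand-in for encoder output
updated = sgd_step(tiles, features, label=42)

# Tiles whose parameters are bit-for-bit unchanged after the step.
untouched = [i for i in range(NUM_TILES) if tiles[i] == snapshot[i]]

expert_params = 4 * linear_param_count(FEATURE_DIM, NUM_CLASSES)
tile_params = NUM_TILES * linear_param_count(FEATURE_DIM, TILE_WIDTH)
print(updated, len(untouched), expert_params, tile_params)
```

Under these assumptions, a training step for class 42 changes only tile 8, and the other 19 tiles receive no gradient at all because they share no output dimension with it. With bias terms counted, the four Linear(256,100) heads hold 102,800 parameters in total, against 25,700 for the twenty Linear(256,5) tiles.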

Key words

Mixture-of-Experts / expert collapse / Gaussian attractor / continual learning / anti-redundancy loss / Mosaic Tile / hard routing / catastrophic forgetting / anti-Gaussian operators / Router Freedom Theorem

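Cite this article

张庆君. The Gaussian Attractor in Mixture-of-Experts: Why Soft Routers Collapse and How Mosaic Tiles Solve It. Evidence from Vision MoE with Shared Encoder on Continual Learning [EB/OL]. (2026-06-10)[2026-06-11]. https://sinoxiv.napstic.cn/article/25960315.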

Subject classification

Computing technology, computer technology

First published: 2026-06-10 17:15:49
