The Gaussian Attractor in Mixture-of-Experts: Why Soft Routers Collapse and How Mosaic Tiles Solve It
Evidence from Vision MoE with Shared Encoder on Continual Learning
[Objective] Expert collapse—the convergence of experts to homogeneous representations—is the central pathology of Mixture-of-Experts (MoE) architectures. Despite extensive empirical mitigation, no unified account of why collapse occurs has been developed. This paper establishes a unified theory and proposes an architectural solution, validated across 39 experimental conditions spanning 200+ MoE configurations.
[Methods] We synthesize three previously disconnected lines of evidence: 18 months of experiments across 200+ MoE configurations on CIFAR-100 continual learning and text generation; phase-transition dynamics in lifecycle-managed MoE systems; and the proof by Klindt, LeCun & Balestriero (2026) that the Gaussian is the unique identifiable latent distribution under alignment objectives. The framework is validated through: (1) a six-quadrant data dependency matrix (cross-dataset, partial-class, overlapping-class, and fine-grained tasks); (2) the Router Freedom Theorem (an exhaustive eight-strategy ablation across router capacities from 16 to 4112 parameters); and (3) a three-version iteration of the Mosaic Tile architecture spanning 13 experimental conditions across 3 random seeds.
[Results] Key findings: (1) The expert protocol (soft router + AR loss + freeze) fails in five of six conditions, succeeding only in the degenerate case of same-domain, complete-class, non-overlapping tasks (Full=64.5%, 4-seed mean). (2) With any non-zero trainable freedom in the soft router during continual learning (whether frozen, trainable, per-task reinitialized, or weakly AR-protected), accuracy collapses to <27% Full within the duration of one task; the per-task CE alignment slope is invariant to router capacity across [16, 4112] parameters. Only two configurations survive: router frozen and router eliminated (zero-parameter hard routing). (3) The Mosaic Tile architecture replaces 4 large experts (Linear(256,100)) with 20 small tiles (Linear(256,5)) using hard routing by class range, achieving 82.1–93.0% Full accuracy across 5/10/20/33-task configurations (3 seeds, 39 data points) with negligible forgetting (<3pp), zero AR loss, and zero deaths. (4) The overlapping-class experiment provides the definitive controlled comparison: Tile v3=83.2% vs. Expert=16.8% (a 66pp gap). (5) A random mutually-exclusive grouping experiment (2 seeds) establishes that tile exclusivity, not class contiguity, is the necessary and sufficient condition for zero-forgetting (Full=76.9%, only -1.0pp T0 forgetting).
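The tile layout in finding (3) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the helper names (`make_tile`, `route`, `sgd_step`), the within-tile softmax cross-entropy, the learning rate, and the initialization scale are all assumptions made for illustration. Only the shapes (20 tiles mapping 256 features to 5 classes each) and the zero-parameter routing by class range come from the text above. The sketch shows that an update routed to one tile leaves the logits of every other tile bit-for-bit unchanged.

```python
import math
import random

NUM_CLASSES = 100                       # CIFAR-100
TILE_WIDTH = 5                          # classes per tile, as in Linear(256, 5)
FEATURE_DIM = 256                       # feature size, as in Linear(256, ...)
NUM_TILES = NUM_CLASSES // TILE_WIDTH   # 20 tiles


def make_tile(rng):
    """One tile: a TILE_WIDTH x FEATURE_DIM weight matrix plus a bias vector."""
    weights = [[rng.gauss(0.0, 0.01) for _ in range(FEATURE_DIM)]
               for _ in range(TILE_WIDTH)]
    return {"w": weights, "b": [0.0] * TILE_WIDTH}


def route(class_id):
    """Zero-parameter hard router: contiguous class ranges map to tiles."""
    if not 0 <= class_id < NUM_CLASSES:
        raise ValueError(f"class_id out of range: {class_id}")
    return class_id // TILE_WIDTH


def tile_logits(tile, features):
    """Affine map of the shared features onto this tile's 5 output dimensions."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(tile["w"], tile["b"])]


def full_logits(tiles, features):
    """Concatenate all tile outputs into one 100-way logit vector."""
    out = []
    for tile in tiles:
        out.extend(tile_logits(tile, features))
    return out


def sgd_step(tiles, features, class_id, lr=0.1):
    """One SGD step on the routed tile only (assumed within-tile softmax CE)."""
    k = route(class_id)
    tile = tiles[k]
    logits = tile_logits(tile, features)
    peak = max(logits)                  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    target = class_id % TILE_WIDTH      # index of the label inside the tile
    for i in range(TILE_WIDTH):
        grad = exps[i] / total - (1.0 if i == target else 0.0)
        row = tile["w"][i]
        for j in range(FEATURE_DIM):
            row[j] -= lr * grad * features[j]
        tile["b"][i] -= lr * grad
    return k


if __name__ == "__main__":
    rng = random.Random(0)
    tiles = [make_tile(rng) for _ in range(NUM_TILES)]
    x = [rng.uniform(-1.0, 1.0) for _ in range(FEATURE_DIM)]
    before = full_logits(tiles, x)
    k = sgd_step(tiles, x, class_id=42)
    after = full_logits(tiles, x)
    changed = [i for i in range(NUM_CLASSES) if before[i] != after[i]]
    print(f"updated tile {k}; changed logit indices: {changed}")
```

Because each tile owns a disjoint slice of the output space, the update never touches parameters that produce another tile's logits. The same holds if `route` is replaced by any fixed one-to-one assignment of classes to tiles, which is the situation probed by the random mutually-exclusive grouping in finding (5).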
[Limitations] Validation is restricted to CIFAR-100 image classification and task-incremental continual learning. Generalization to language, multimodal domains, and larger-scale architectures remains untested. Hard routing requires a task-boundary oracle, which is naturally given in continual learning but requires extension for single-task settings.
[Conclusions] A hierarchy of anti-Gaussian operator strength emerges: architectural output-space isolation > boundary-condition freezing > training-time gradient perturbation. When experts do not share output dimensions, the Gaussian attractor has no gradient pathway through which to operate, which shifts anti-collapse defense from training-time interventions to structural design. This opens a new research direction: what other structural properties can serve as built-in anti-Gaussian operators, eliminating auxiliary losses entirely?
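The gradient-pathway claim can be made concrete with a short derivation, offered as a sketch under assumptions not stated in the abstract. The notation is hypothetical: $h$ denotes the shared-encoder features, tile $j$ computes $z^{(j)} = W_j h + b_j$, and a sample with label $y$ routed to tile $k$ is assumed to be trained with a cross-entropy over that tile's outputs only. The soft-routed case uses the standard gated sum with gates $g_j(h)$ and expert outputs $E_j(h)$.

```latex
% Hard-routed tiles: the loss depends only on the routed tile's logits.
\mathcal{L}_{\text{tile}}
  = -\log \frac{\exp z^{(k)}_{y}}{\sum_{c} \exp z^{(k)}_{c}},
\qquad
\frac{\partial \mathcal{L}_{\text{tile}}}{\partial W_j}
  = \sum_{c} \frac{\partial \mathcal{L}_{\text{tile}}}{\partial z^{(j)}_{c}}
             \frac{\partial z^{(j)}_{c}}{\partial W_j}
  = 0
  \quad \text{for all } j \neq k,
% because \mathcal{L}_{\text{tile}} contains no term in z^{(j)} when j \neq k.

% Soft-routed experts sharing all output dimensions:
z = \sum_{j} g_j(h)\, E_j(h),
\qquad
\frac{\partial \mathcal{L}}{\partial E_j(h)}
  = g_j(h)\, \frac{\partial \mathcal{L}}{\partial z},
% which is non-zero for every expert with g_j(h) > 0.
```

Under these assumptions, every expert with a non-zero gate receives a scaled copy of the same output-space gradient, whereas a non-routed tile receives none. The derivation covers the output heads only and says nothing about gradients reaching the shared encoder.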