[2602.12587v1] Multi-Head Attention as a Source of Catastrophic Forgetting ...
well-balanced expert utilization. We attribute this gap to a pre-routing bottleneck: multi-head attention concatenates head-specific signals into a single post-attention router input, forcing routing to act o...
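A minimal sketch of the bottleneck described above, assuming a toy setup: per-head outputs are concatenated into one vector, and a linear router followed by a softmax scores experts from that single vector. The names `concat_heads` and `route`, the dimensions, and the random weights are all hypothetical and chosen for illustration; the usual attention output projection is omitted for brevity. This is not the paper's implementation.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    # Plain matrix-vector product; matrix is a list of rows.
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def concat_heads(head_outputs):
    # Flatten per-head vectors into one post-attention vector.
    # After this step the head boundaries are no longer explicit.
    return [x for head in head_outputs for x in head]


def route(router_weights, router_input):
    # Hypothetical linear router: one logit per expert, then softmax.
    return softmax(matvec(router_weights, router_input))


rng = random.Random(0)
num_heads, head_dim, num_experts = 4, 3, 2  # illustrative sizes only

# Stand-in for the per-head attention outputs of one token.
head_outputs = [[rng.gauss(0, 1) for _ in range(head_dim)]
                for _ in range(num_heads)]

# The router sees only the single concatenated vector.
router_input = concat_heads(head_outputs)
router_weights = [[rng.gauss(0, 1) for _ in range(num_heads * head_dim)]
                  for _ in range(num_experts)]
expert_probs = route(router_weights, router_input)
```

In this toy setup the router receives one vector of length `num_heads * head_dim`, so every routing decision is made on the mixed signal rather than on any individual head's output.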