BeClaude
Research · 2026-05-12

Hierarchical Mixture-of-Experts with Two-Stage Optimization

Source: arXiv cs.AI

arXiv:2605.08292v1 | Announce Type: cross

Abstract: Sparse Mixture-of-Experts (MoE) models scale capacity by routing each token to a small subset of experts. However, their routers exhibit a fundamental trade-off: strong load balancing can suppress expert specialization, while aggressive diversity...
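For context, the sketch below illustrates the routing mechanism and the load-balancing pressure the abstract refers to: a top-k router paired with a Switch-Transformer-style auxiliary balance loss. This is not the paper's two-stage method; the class name, hyperparameters, and loss weighting are illustrative assumptions.

```python
# Minimal sketch of top-k MoE routing with a load-balancing auxiliary loss.
# Illustrative only; the paper's hierarchical, two-stage approach is not shown here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    def __init__(self, d_model: int, n_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts, bias=False)
        self.n_experts = n_experts
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: [tokens, d_model] -> routing probabilities over experts
        logits = self.gate(x)                        # [tokens, n_experts]
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)

        # Load-balancing auxiliary loss (Switch Transformer style):
        # fraction of tokens dispatched to each expert times the mean
        # router probability for that expert. Weighting this term more
        # heavily equalizes load but pushes the router toward uniform
        # probabilities, which is the specialization trade-off the
        # abstract describes.
        dispatch = F.one_hot(topk_idx[..., 0], self.n_experts).float()
        load_fraction = dispatch.mean(dim=0)         # [n_experts]
        prob_fraction = probs.mean(dim=0)            # [n_experts]
        aux_loss = self.n_experts * (load_fraction * prob_fraction).sum()

        return topk_probs, topk_idx, aux_loss

# Usage: route a batch of token embeddings and inspect the balance loss.
router = TopKRouter(d_model=64, n_experts=8, k=2)
tokens = torch.randn(512, 64)
weights, experts, aux = router(tokens)
print(weights.shape, experts.shape, aux.item())
```

In a full MoE layer, `aux_loss` would be added to the task loss with a small coefficient; tuning that coefficient is where the balancing-versus-specialization tension plays out.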
