Highlights:
- New ProMoE framework enhances Mixture-of-Experts (MoE) routing in Diffusion Transformers.
- Outperforms existing diffusion MoE methods on ImageNet benchmarks through explicit routing guidance.
- Introduces a two-step router combining conditional and prototypical routing for visual tokens.
- Incorporates routing contrastive loss to improve expert specialization and diversity.
TLDR:
Researchers have introduced ProMoE, a Mixture-of-Experts framework with explicit routing guidance for Diffusion Transformers. The approach tackles the redundancy of visual tokens and improves the performance and scalability of diffusion models for image generation.
A research team led by Yujie Wei (https://arxiv.org/search/cs?searchtype=author&query=Wei,+Y), along with Shiwei Zhang (https://arxiv.org/search/cs?searchtype=author&query=Zhang,+S), Hangjie Yuan (https://arxiv.org/search/cs?searchtype=author&query=Yuan,+H), Yujin Han (https://arxiv.org/search/cs?searchtype=author&query=Han,+Y), Zhekai Chen (https://arxiv.org/search/cs?searchtype=author&query=Chen,+Z), Jiayu Wang (https://arxiv.org/search/cs?searchtype=author&query=Wang,+J), Difan Zou (https://arxiv.org/search/cs?searchtype=author&query=Zou,+D), Xihui Liu (https://arxiv.org/search/cs?searchtype=author&query=Liu,+X), Yingya Zhang (https://arxiv.org/search/cs?searchtype=author&query=Zhang,+Y), Yu Liu (https://arxiv.org/search/cs?searchtype=author&query=Liu,+Y), and Hongming Shan (https://arxiv.org/search/cs?searchtype=author&query=Shan,+H), has presented a pioneering approach to scaling Diffusion Transformers (DiTs) using Mixture-of-Experts (MoE) architectures. The paper, titled ‘Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance,’ tackles a long-standing challenge: while MoE frameworks have excelled in large language models (LLMs), their adaptation to vision transformers and diffusion models has been less successful due to the inherent redundancy and heterogeneity of image tokens.
The newly proposed framework, called ProMoE, introduces a two-step router that explicitly guides token-expert assignments, enabling more coherent expert specialization. Traditional MoE routers rely on statistical token separation, which works well for semantically rich linguistic tokens but poorly for redundant visual tokens in image synthesis. ProMoE mitigates this by first partitioning image tokens into conditional and unconditional groups according to their functional roles. The conditional tokens then undergo a prototypical routing step, in which learnable semantic prototypes in latent space align tokens with appropriate experts by content similarity. This forms semantically meaningful clusters, so that each expert specializes in distinct visual or functional features.
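To make the two-step design concrete, here is a minimal PyTorch sketch under stated assumptions: the module name `TwoStepRouter`, the sigmoid gate with a 0.5 threshold for the conditional/unconditional split, and the top-k prototype matching are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStepRouter(nn.Module):
    """Hypothetical two-step router: conditional then prototypical routing."""

    def __init__(self, dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # One learnable semantic prototype per expert in latent space.
        self.prototypes = nn.Parameter(torch.randn(num_experts, dim))
        # Gate that separates conditional from unconditional tokens
        # (assumption: a simple linear scorer with a 0.5 threshold).
        self.cond_gate = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, seq_len, dim) visual tokens from a DiT block.
        # Step 1, conditional routing: split tokens by functional role.
        # (Unconditional tokens could be sent to a shared expert; omitted here.)
        cond_mask = torch.sigmoid(self.cond_gate(tokens)).squeeze(-1) > 0.5

        # Step 2, prototypical routing: match each token to experts by
        # cosine similarity with the learnable semantic prototypes.
        sims = F.normalize(tokens, dim=-1) @ F.normalize(self.prototypes, dim=-1).T
        weights, expert_idx = sims.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)  # renormalize over selected experts
        return cond_mask, expert_idx, weights
```

Routing by prototype similarity rather than a plain learned gate is what lets the clusters stay semantically coherent: tokens with similar content land on the same expert by construction.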
In addition, the authors propose a routing contrastive loss that reinforces intra-expert coherence and inter-expert diversity. By applying a contrastive objective within the routing mechanism, ProMoE keeps each expert’s activations consistent while discouraging overlap among experts’ domains. The results are significant: across ImageNet benchmarks and under both Rectified Flow and DDPM training objectives, ProMoE consistently outperforms existing methods for diffusion-based image generation. These findings highlight the importance of explicit routing design and semantic guidance in scaling diffusion transformers, offering a new direction for efficient, large-scale generative modeling in computer vision.
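As a rough illustration of such an objective, the sketch below uses an InfoNCE-style cross-entropy over the expert prototypes; the function name, the temperature value, and the exact formulation are assumptions and may differ from the paper’s loss.

```python
import torch
import torch.nn.functional as F

def routing_contrastive_loss(tokens: torch.Tensor,
                             prototypes: torch.Tensor,
                             expert_idx: torch.Tensor,
                             temperature: float = 0.1) -> torch.Tensor:
    # tokens: (n, dim); prototypes: (num_experts, dim); expert_idx: (n,)
    t = F.normalize(tokens, dim=-1)
    p = F.normalize(prototypes, dim=-1)
    logits = t @ p.T / temperature  # token-to-prototype similarities
    # Cross-entropy against the assigned expert pulls each token toward its
    # own prototype (intra-expert coherence) and pushes it away from every
    # other prototype (inter-expert diversity).
    return F.cross_entropy(logits, expert_idx)
```

In this form the two desiderata fall out of a single term: the numerator of the softmax rewards coherence within an expert, while the denominator penalizes similarity to competing experts.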
This breakthrough demonstrates how the ProMoE paradigm bridges the scalability gap between language and vision models, enabling more efficient training and inference without sacrificing quality. As diffusion-based image generation continues to grow across creative and scientific applications, frameworks like ProMoE could define the next generation of high-performance, semantically aware AI models.
Source:
Wei, Yujie et al. (2025). ‘Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance.’ arXiv:2510.24711 [cs.CV]. DOI: https://doi.org/10.48550/arXiv.2510.24711
