Dynamic Mixture of Experts for Adaptive Computation in Character-Level Transformers
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-06-01 |
| Series: | Information |
| Subjects: | |
| Online Access: | https://www.mdpi.com/2078-2489/16/6/483 |
| Summary: | This paper challenges the prevailing assumption that Mixture of Experts (MoE) consistently improves computational efficiency through a systematic evaluation of MoE variants in Transformer models. We implement and compare three approaches: basic MoE, top-*k* routing, and capacity-factored routing, each progressively addressing load-balancing challenges. Our experiments reveal critical trade-offs between performance and efficiency: while MoE models maintain validation performance comparable to baselines, they require significantly longer training times (a 50% increase) and demonstrate reduced inference speeds (up to 56% slower). Analysis of routing behavior shows that even with load-balancing techniques, expert utilization remains unevenly distributed. These findings provide empirical evidence that MoE’s computational benefits are highly dependent on model scale and task characteristics, challenging common assumptions about sparse architectures and offering crucial guidance for adaptive neural architecture design across different computational constraints. |
| ISSN: | 2078-2489 |
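
The abstract refers to top-*k* routing as one of the three MoE variants evaluated. The paper's own implementation is not reproduced in this record, so the following is only a minimal illustrative sketch of a top-*k* routed MoE feed-forward layer in PyTorch; the class name `TopKMoE` and all hyperparameters (`d_model`, `d_ff`, `num_experts`, `k`) are assumptions for illustration, not the authors' code.

```python
# Hypothetical sketch of top-k MoE routing (not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    """Minimal top-k routed Mixture-of-Experts feed-forward layer."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.num_experts = num_experts
        # Router produces one logit per expert for every token.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Experts are independent two-layer feed-forward networks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); flatten tokens for routing.
        batch, seq_len, d_model = x.shape
        tokens = x.reshape(-1, d_model)                      # (T, d_model)

        logits = self.router(tokens)                         # (T, num_experts)
        probs = F.softmax(logits, dim=-1)
        top_p, top_idx = probs.topk(self.k, dim=-1)          # (T, k)
        # Renormalize the selected gates so they sum to 1 per token.
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(tokens)
        for e in range(self.num_experts):
            # Find (token, slot) pairs routed to expert e.
            token_ids, slot_ids = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # expert received no tokens (uneven utilization)
            expert_out = self.experts[e](tokens[token_ids])
            # Accumulate gate-weighted expert outputs per token.
            out.index_add_(0, token_ids, expert_out * top_p[token_ids, slot_ids, None])

        return out.reshape(batch, seq_len, d_model)


# Example usage with arbitrary shapes:
layer = TopKMoE(d_model=256, d_ff=1024, num_experts=8, k=2)
y = layer(torch.randn(4, 128, 256))  # -> (4, 128, 256)
```

In this sketch, only *k* of the experts run per token, which is the source of MoE's nominal sparsity; the per-expert gather/scatter loop also hints at why the paper observes routing overhead and uneven expert utilization in practice. A load-balancing auxiliary loss or a capacity factor (the other variants named in the abstract) would be added on top of this routing step.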