Dynamic Mixture of Experts for Adaptive Computation in Character-Level Transformers


Bibliographic Details
Main Authors: Zhigao Huang, Musheng Chen, Shiyan Zheng
Format: Article
Language: English
Published: MDPI AG 2025-06-01
Series: Information
Subjects:
Online Access: https://www.mdpi.com/2078-2489/16/6/483
Description
Summary: This paper challenges the prevailing assumption that Mixture of Experts (MoE) consistently improves computational efficiency through a systematic evaluation of MoE variants in Transformer models. We implement and compare three approaches: basic MoE, top-k routing, and capacity-factored routing, each progressively addressing load-balancing challenges. Our experiments reveal critical trade-offs between performance and efficiency: while MoE models maintain validation performance comparable to baselines, they require significantly longer training times (a 50% increase) and demonstrate reduced inference speeds (up to 56% slower). Analysis of routing behavior shows that even with load-balancing techniques, expert utilization remains unevenly distributed. These findings provide empirical evidence that MoE's computational benefits are highly dependent on model scale and task characteristics, challenging common assumptions about sparse architectures and offering crucial guidance for adaptive neural architecture design across different computational constraints.
ISSN: 2078-2489
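
To make the top-k routing idea described in the summary concrete, the sketch below shows a minimal MoE feed-forward layer in PyTorch. It is an illustrative example under assumed names (TopKMoE, d_model, n_experts, k), not the authors' implementation; it uses dense dispatch for clarity, whereas practical systems gather tokens per expert (and, in the paper's capacity-factored variant, cap how many tokens each expert may receive) to realize sparsity.

```python
# Minimal top-k MoE layer (illustrative sketch, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    def __init__(self, d_model: int = 256, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        # Each expert is a small position-wise feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )
        # The gate scores every expert for every token.
        self.gate = nn.Linear(d_model, n_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        scores = self.gate(x)                                # (B, S, E)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)  # best k experts per token
        weights = F.softmax(topk_scores, dim=-1)             # renormalize over the chosen k

        out = torch.zeros_like(x)
        # Dense dispatch for clarity: every expert processes every token, and a
        # mask zeroes out tokens that were not routed to it. Real implementations
        # gather tokens per expert so only the selected k experts run per token.
        for e, expert in enumerate(self.experts):
            expert_out = expert(x)                           # (B, S, d_model)
            for slot in range(self.k):
                mask = (topk_idx[..., slot] == e).unsqueeze(-1).to(x.dtype)
                w = weights[..., slot].unsqueeze(-1)
                out = out + mask * w * expert_out
        return out


# Usage: route a batch of character embeddings through the layer.
layer = TopKMoE(d_model=256, n_experts=8, k=2)
tokens = torch.randn(4, 128, 256)
print(layer(tokens).shape)  # torch.Size([4, 128, 256])
```

An auxiliary load-balancing loss on the gate's routing distribution is a common companion to such layers; as the summary notes, even with these techniques the paper observes that expert utilization remains unevenly distributed.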