Strategy-Switch: From All-Reduce to Parameter Server for Faster Efficient Training

Bibliographic Details
Main Authors: Nikodimos Provatas, Iasonas Chalas, Ioannis Konstantinou, Nectarios Koziris
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10836684/
Description
Summary: Deep learning plays a pivotal role in numerous big data applications by enhancing the accuracy of models. However, the abundance of available data presents a challenge when training neural networks on a single node. Consequently, various distributed training methods have emerged. Among these, two prevalent approaches are All-Reduce and Parameter Server. All-Reduce, operating synchronously, faces synchronization-related bottlenecks, while the Parameter Server, often used asynchronously, can compromise the model’s performance. To harness the strengths of both setups, we introduce Strategy-Switch, a hybrid approach that offers the best of both worlds: training speed together with high-quality results. The method initiates training under the All-Reduce system and, guided by an empirical rule, transitions to asynchronous Parameter Server training once the model stabilizes. Our experimental analysis demonstrates that this approach achieves accuracy comparable to All-Reduce training while completing training significantly faster.
ISSN:2169-3536
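
The summary above describes the method only at a high level, and the paper's actual empirical switching rule is not given in this record. The Python sketch below is therefore only an illustration under assumed details: the helper names (has_stabilized, allreduce_epoch, parameter_server_epoch) are hypothetical, the loss curves are toy simulations rather than real distributed training, and the plateau-based criterion stands in for whatever rule the authors actually use. It shows the control flow the summary describes: train synchronously under All-Reduce first, monitor for stabilization, then finish training asynchronously under a Parameter Server.

import numpy as np

def has_stabilized(loss_history, window=5, tol=0.01):
    # Hypothetical stand-in for the paper's empirical rule: declare the model
    # stable once the relative improvement of the windowed mean loss drops
    # below `tol`.
    if len(loss_history) < 2 * window:
        return False
    prev = float(np.mean(loss_history[-2 * window:-window]))
    curr = float(np.mean(loss_history[-window:]))
    return (prev - curr) / max(prev, 1e-12) < tol

def allreduce_epoch(state):
    # Toy stand-in for a synchronous All-Reduce epoch: every worker applies
    # the same averaged update, so the loss moves smoothly toward a plateau.
    state["loss"] = 0.2 + 0.7 * (state["loss"] - 0.2)
    return state["loss"]

def parameter_server_epoch(state, rng):
    # Toy stand-in for an asynchronous Parameter Server epoch: stale updates
    # add a little noise, but each epoch is assumed to finish faster because
    # workers never wait for one another.
    state["loss"] = 0.2 + 0.7 * (state["loss"] - 0.2) + 0.01 * rng.standard_normal()
    return state["loss"]

def strategy_switch_training(epochs=30, seed=0):
    rng = np.random.default_rng(seed)
    state = {"loss": 1.0}
    losses = []
    strategy = "all_reduce"
    for epoch in range(epochs):
        if strategy == "all_reduce":
            losses.append(allreduce_epoch(state))
            # The (assumed) rule is checked only during the synchronous phase;
            # once it fires, the rest of training stays asynchronous.
            if has_stabilized(losses):
                strategy = "parameter_server"
        else:
            losses.append(parameter_server_epoch(state, rng))
        print(f"epoch {epoch:02d}  strategy={strategy:<16}  loss={losses[-1]:.4f}")
    return losses

if __name__ == "__main__":
    strategy_switch_training()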