Strategy-Switch: From All-Reduce to Parameter Server for Faster Efficient Training

Bibliographic Details
Main Authors: Nikodimos Provatas, Iasonas Chalas, Ioannis Konstantinou, Nectarios Koziris
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10836684/
Description
Summary: Deep learning plays a pivotal role in numerous big data applications by enhancing the accuracy of models. However, the abundance of available data presents a challenge when training neural networks on a single node. Consequently, various distributed training methods have emerged. Among these, two prevalent approaches are All-Reduce and Parameter Server. All-Reduce, operating synchronously, faces synchronization-related bottlenecks, while the Parameter Server, often used asynchronously, can compromise the model’s performance. To harness the strengths of both setups, we introduce Strategy-Switch, a hybrid approach that offers the best of both worlds: training speed together with high-quality results. The method initiates training under the All-Reduce system and, guided by an empirical rule, transitions to asynchronous Parameter Server training once the model stabilizes. Our experimental analysis demonstrates that this approach achieves accuracy comparable to All-Reduce training while completing training significantly faster.
ISSN:2169-3536
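
The summary above describes the method only at a high level, and the paper's actual empirical switching rule is not given in this record. The Python sketch below is therefore only an illustration under assumed details: the helper names (has_stabilized, allreduce_epoch, parameter_server_epoch) are hypothetical, the loss curves are toy simulations rather than real distributed training, and the plateau-based criterion stands in for whatever rule the authors actually use. It shows the control flow the summary describes: train synchronously under All-Reduce first, monitor for stabilization, then finish training asynchronously under a Parameter Server.

import numpy as np

def has_stabilized(loss_history, window=5, tol=0.01):
    # Hypothetical stand-in for the paper's empirical rule: declare the model
    # stable once the relative improvement of the windowed mean loss drops
    # below `tol`.
    if len(loss_history) < 2 * window:
        return False
    prev = float(np.mean(loss_history[-2 * window:-window]))
    curr = float(np.mean(loss_history[-window:]))
    return (prev - curr) / max(prev, 1e-12) < tol

def allreduce_epoch(state):
    # Toy stand-in for a synchronous All-Reduce epoch: every worker applies
    # the same averaged update, so the loss moves smoothly toward a plateau.
    state["loss"] = 0.2 + 0.7 * (state["loss"] - 0.2)
    return state["loss"]

def parameter_server_epoch(state, rng):
    # Toy stand-in for an asynchronous Parameter Server epoch: stale updates
    # add a little noise, but each epoch is assumed to finish faster because
    # workers never wait for one another.
    state["loss"] = 0.2 + 0.7 * (state["loss"] - 0.2) + 0.01 * rng.standard_normal()
    return state["loss"]

def strategy_switch_training(epochs=30, seed=0):
    rng = np.random.default_rng(seed)
    state = {"loss": 1.0}
    losses = []
    strategy = "all_reduce"
    for epoch in range(epochs):
        if strategy == "all_reduce":
            losses.append(allreduce_epoch(state))
            # The (assumed) rule is checked only during the synchronous phase;
            # once it fires, the rest of training stays asynchronous.
            if has_stabilized(losses):
                strategy = "parameter_server"
        else:
            losses.append(parameter_server_epoch(state, rng))
        print(f"epoch {epoch:02d}  strategy={strategy:<16}  loss={losses[-1]:.4f}")
    return losses

if __name__ == "__main__":
    strategy_switch_training()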