Training Large Models on Heterogeneous and Geo-Distributed Resource with Constricted Networks

Bibliographic Details
Main Authors: Zan Zong, Minkun Guo, Mingshu Zhai, Yinan Tang, Jianjiang Li, Jidong Zhai
Format: Article
Language: English
Published: Tsinghua University Press, 2025-06-01
Series: Big Data Mining and Analytics
Online Access: https://www.sciopen.com/article/10.26599/BDMA.2025.9020031
Summary: As the computational demands driven by large-model technologies continue to grow rapidly, leveraging GPU hardware to accelerate parallel training has become a commonly used strategy. When the computational resources within a single cluster are insufficient for large-model training, hybrid use of heterogeneous acceleration hardware has emerged as a promising solution, and the scheduling of diverse cloud resources has attracted considerable interest. However, these computing resources are often geographically distributed. Lacking awareness of heterogeneous devices and network topologies, existing parallel training frameworks struggle to use mixed GPU resources effectively across constrained networks. To boost the computing capability of connected heterogeneous clusters, we propose HGTrainer, an optimizer that plans heterogeneous parallel strategies across distributed clusters for large-model training. HGTrainer adaptively saturates heterogeneous clusters by expanding the tunable parallelism space for heterogeneous accelerators while remaining aware of the relatively low inter-cluster bandwidth. To achieve this, we formulate the model-partitioning problem among heterogeneous hardware and introduce a hierarchical search algorithm to solve the resulting optimization problem. In addition, a mixed-precision pipeline method reduces the cost of inter-cluster communication. We evaluate HGTrainer on heterogeneous connected clusters with popular large language models. Experimental results show that HGTrainer improves training throughput by 1.49× on average on the mixed heterogeneous cluster compared with the state-of-the-art Metis.
ISSN: 2096-0654, 2097-406X
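
The abstract describes two mechanisms concretely enough to illustrate: a model-partition search that is aware of slow inter-cluster links, and a mixed-precision pipeline that shrinks cross-cluster traffic. The Python sketch below is a minimal illustration, not the authors' implementation: the cost model, cluster names, per-layer times, the 100 MB/s bandwidth figure, and the exhaustive split-point loop (standing in for HGTrainer's hierarchical search) are all assumptions.

# Illustrative sketch (not the authors' code): a toy bandwidth-aware
# partition planner in the spirit of HGTrainer. All names, constants,
# and the cost model are hypothetical assumptions for illustration.

from dataclasses import dataclass
from itertools import combinations

@dataclass
class Cluster:
    name: str
    layer_time_s: float  # assumed compute time per model layer

def iteration_time(layers_per_cluster, clusters, activation_mb,
                   inter_bw_mb_s, fp16=True):
    # Assumed pipeline cost model: the slowest stage bounds throughput,
    # and every cluster boundary pays one activation transfer over the
    # constricted inter-cluster link.
    compute = max(n * c.layer_time_s
                  for n, c in zip(layers_per_cluster, clusters))
    payload_mb = activation_mb * (0.5 if fp16 else 1.0)  # FP16 halves bytes
    comm = (len(clusters) - 1) * payload_mb / inter_bw_mb_s
    return compute + comm

def plan_partition(num_layers, clusters, activation_mb, inter_bw_mb_s):
    # Exhaustive search over layer split points; HGTrainer's hierarchical
    # search would prune this space level by level instead.
    best_splits, best_t = None, float("inf")
    for splits in combinations(range(1, num_layers), len(clusters) - 1):
        bounds = (0, *splits, num_layers)
        sizes = [b - a for a, b in zip(bounds, bounds[1:])]
        t = iteration_time(sizes, clusters, activation_mb, inter_bw_mb_s)
        if t < best_t:
            best_splits, best_t = splits, t
    return best_splits, best_t

if __name__ == "__main__":
    # Hypothetical setup: a fast and a slow cluster joined by a
    # constricted 100 MB/s wide-area link.
    clusters = [Cluster("fast-cluster", 0.010), Cluster("slow-cluster", 0.025)]
    splits, t = plan_partition(num_layers=48, clusters=clusters,
                               activation_mb=64, inter_bw_mb_s=100)
    print(f"split after layer {splits[0]}; est. iteration time {t:.3f}s")

Under these toy numbers the planner assigns more layers to the faster cluster to balance pipeline-stage compute times, while the FP16 option halves the fixed cost of each boundary activation transfer, mirroring the abstract's mixed-precision pipeline idea.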