Training Large Models on Heterogeneous and Geo-Distributed Resource with Constricted Networks

Bibliographic Details
Main Authors: Zan Zong, Minkun Guo, Mingshu Zhai, Yinan Tang, Jianjiang Li, Jidong Zhai
Format: Article
Language: English
Published: Tsinghua University Press, 2025-06-01
Series: Big Data Mining and Analytics
ISSN: 2096-0654, 2097-406X
DOI: 10.26599/BDMA.2025.9020031
Volume/Issue/Pages: Volume 8, Issue 4, Pages 966-980
Subjects: deep learning system; large model training; heterogeneous; geo-distributed clusters
Online Access: https://www.sciopen.com/article/10.26599/BDMA.2025.9020031
Description: As the computational demands of large models continue to grow rapidly, leveraging GPU hardware to accelerate parallel training has become a common strategy. When the computational resources within a single cluster are insufficient for large-model training, the hybrid use of heterogeneous acceleration hardware is a promising solution, and the scheduling of such diverse cloud resources has attracted considerable interest. However, these computing resources are often geographically distributed, and because existing parallel training frameworks are unaware of heterogeneous devices and network topologies, they struggle to use mixed GPU resources effectively across constrained networks. To boost the computing capability of connected heterogeneous clusters, we propose HGTrainer, an optimizer that plans heterogeneous parallel strategies across distributed clusters for large model training. HGTrainer adaptively saturates heterogeneous clusters by expanding the tunable parallelism space for heterogeneous accelerators while remaining aware of the relatively low inter-cluster bandwidth. To achieve this goal, we formulate the model partitioning problem among heterogeneous hardware and introduce a hierarchical search algorithm to solve it. In addition, a mixed-precision pipeline method reduces the cost of inter-cluster communication. We evaluate HGTrainer on connected heterogeneous clusters with popular large language models. Experimental results show that HGTrainer improves training throughput by 1.49× on average on the mixed heterogeneous cluster compared with the state-of-the-art Metis.
Author Affiliations:
Zan Zong, Minkun Guo, Mingshu Zhai, Jidong Zhai: Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Yinan Tang: IEIT SYSTEMS Co., Ltd., Jinan 250014, China
Jianjiang Li: School of Computer & Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China