Training Large Models on Heterogeneous and Geo-Distributed Resource with Constricted Networks

Bibliographic Details
Main Authors: Zan Zong, Minkun Guo, Mingshu Zhai, Yinan Tang, Jianjiang Li, Jidong Zhai
Format: Article
Language: English
Published: Tsinghua University Press, 2025-06-01
Series: Big Data Mining and Analytics
ISSN: 2096-0654, 2097-406X
DOI: 10.26599/BDMA.2025.9020031
Volume/Issue/Pages: Volume 8, Issue 4, Pages 966-980
Subjects: deep learning system; large model training; heterogeneous; geo-distributed clusters
Online Access: https://www.sciopen.com/article/10.26599/BDMA.2025.9020031
Description: As the computational demands of large models continue to grow rapidly, leveraging GPU hardware to accelerate parallel training has become a common strategy. When the computational resources within a single cluster are insufficient for large-model training, the hybrid use of heterogeneous acceleration hardware is a promising solution, and the scheduling of such diverse cloud resources has attracted considerable interest. However, these computing resources are often geographically distributed, and because existing parallel training frameworks are unaware of heterogeneous devices and network topologies, they struggle to use mixed GPU resources effectively across constrained networks. To boost the computing capability of connected heterogeneous clusters, we propose HGTrainer, an optimizer that plans heterogeneous parallel strategies across distributed clusters for large model training. HGTrainer adaptively saturates heterogeneous clusters by expanding the tunable parallelism space for heterogeneous accelerators while remaining aware of the relatively low inter-cluster bandwidth. To achieve this goal, we formulate the model partitioning problem among heterogeneous hardware and introduce a hierarchical search algorithm to solve it. In addition, a mixed-precision pipeline method reduces the cost of inter-cluster communication. We evaluate HGTrainer on connected heterogeneous clusters with popular large language models. Experimental results show that HGTrainer improves training throughput by 1.49× on average on the mixed heterogeneous cluster compared with the state-of-the-art Metis.
Author Affiliations:
Zan Zong, Minkun Guo, Mingshu Zhai, Jidong Zhai: Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Yinan Tang: IEIT SYSTEMS Co., Ltd., Jinan 250014, China
Jianjiang Li: School of Computer & Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China