Cloud-based intelligent computing center ten-thousand card cluster innovation and practice

To address issues such as low availability of computing power in ultra-large scale computing clusters of intelligent computing centers, low maturity of domestically produced technologies, bottlenecks in large-scale networking efficiency, and complex operations and maintenance, a system based on clou...

Bibliographic Details
Main Authors: DING Hongqing, ZHANG Pengfei, NIU Hongweihua, LI Zhiyong, ZHOU Danyuan, DING Guoqiang, LI Panpan, LI Daotong, ZHANG Jiuxian
Format: Article
Language: zho
Published: Beijing Xintong Media Co., Ltd 2024-12-01
Series: Dianxin kexue
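Subjects: supercomputer cluster; intelligent computing center; ten-thousand card cluster; artificial intelligence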
Online Access: http://www.telecomsci.com/zh/article/doi/10.11959/j.issn.1000-0801.2024262/
collection DOAJ
description To address low computing-power availability in the ultra-large-scale clusters of intelligent computing centers, the limited maturity of domestically produced technologies, bottlenecks in large-scale networking efficiency, and complex operations and maintenance, a cloud-computing-based system for constructing a ten-thousand card cluster in an intelligent computing center was proposed. The cluster was built from 18,432 NPUs and an optimized RDMA network. A multi-plane network architecture was adopted, combined with SDN technology to isolate tenants on the RDMA network. The network load-balancing strategy was optimized, yielding a link load-balancing error below 10% and an All-Reduce bandwidth above 35 GB/s. An optimized distributed storage protocol cut the model's breakpoint recovery time to half of its original duration. The validation results show that the domestic NPU ten-thousand card cluster, through collaborative optimization of software and hardware, meets the training needs of large models with hundreds of billions of parameters and can also support training of large models with trillions of parameters.
format Article
id doaj-art-6f843256080a4e2bb68957a716cf128d
institution Kabale University
issn 1000-0801
language zho
publishDate 2024-12-01
publisher Beijing Xintong Media Co., Ltd
record_format Article
series Dianxin kexue
title Cloud-based intelligent computing center ten-thousand card cluster innovation and practice
topic supercomputer cluster
intelligent computing center
ten-thousand card cluster
artificial intelligence