Cloud-based intelligent computing center ten-thousand card cluster innovation and practice
To address issues such as low availability of computing power in ultra-large scale computing clusters of intelligent computing centers, low maturity of domestically produced technologies, bottlenecks in large-scale networking efficiency, and complex operations and maintenance, a system based on clou...
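Main Authors: | DING Hongqing, ZHANG Pengfei, NIU Hongweihua, LI Zhiyong, ZHOU Danyuan, DING Guoqiang, LI Panpan, LI Daotong, ZHANG Jiuxian |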
---|---|
Format: | Article |
Language: | zho |
Published: | Beijing Xintong Media Co., Ltd, 2024-12-01 |
Series: | Dianxin kexue |
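Subjects: | supercomputer cluster; intelligent computing center; ten-thousand card cluster; artificial intelligence |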
Online Access: | http://www.telecomsci.com/zh/article/doi/10.11959/j.issn.1000-0801.2024262/ |
_version_ | 1841528756302249984 |
---|---|
author | DING Hongqing ZHANG Pengfei NIU Hongweihua LI Zhiyong ZHOU Danyuan DING Guoqiang LI Panpan LI Daotong ZHANG Jiuxian |
collection | DOAJ |
description | To address issues such as low availability of computing power in ultra-large scale computing clusters of intelligent computing centers, low maturity of domestically produced technologies, bottlenecks in large-scale networking efficiency, and complex operations and maintenance, a system based on cloud computing technology for constructing a ten-thousand card cluster in an intelligent computing center was proposed. A ten-thousand card cluster was constructed using 18 432 NPU units and an optimized RDMA network. A multi-plane network architecture was adopted, in conjunction with SDN technology to achieve RDMA network tenant isolation. The network load balancing strategy was optimized, resulting in a link load balancing error of less than 10% and an All-Reduce bandwidth of over 35 GB/s. By employing the optimized distributed storage protocol, the model’s breakpoint recovery time was reduced to half of its original duration. The validation results demonstrate that the domestic NPU ten-thousand card cluster, with the collaborative optimization of software and hardware, can not only meet the training needs of large models with hundreds of billions of parameters but also support the training tasks of large models with trillions of parameters. |
format | Article |
id | doaj-art-6f843256080a4e2bb68957a716cf128d |
institution | Kabale University |
issn | 1000-0801 |
language | zho |
publishDate | 2024-12-01 |
publisher | Beijing Xintong Media Co., Ltd |
record_format | Article |
series | Dianxin kexue |
title | Cloud-based intelligent computing center ten-thousand card cluster innovation and practice |
topic | supercomputer cluster intelligent computing center ten-thousand card cluster artificial intelligence |
url | http://www.telecomsci.com/zh/article/doi/10.11959/j.issn.1000-0801.2024262/ |