Cloud-based intelligent computing center ten-thousand card cluster innovation and practice
Main Authors: | , , , , , , , , |
---|---|
Format: | Article |
Language: | zho |
Published: | Beijing Xintong Media Co., Ltd, 2024-12-01 |
Series: | Dianxin kexue |
Online Access: | http://www.telecomsci.com/zh/article/doi/10.11959/j.issn.1000-0801.2024262/ |
Summary: | To address issues such as low availability of computing power in ultra-large scale computing clusters of intelligent computing centers, low maturity of domestically produced technologies, bottlenecks in large-scale networking efficiency, and complex operations and maintenance, a system based on cloud computing technology for constructing a ten-thousand card cluster in an intelligent computing center was proposed. A ten-thousand card cluster was constructed using 18 432 NPU units and an optimized RDMA network. A multi-plane network architecture was adopted, in conjunction with SDN technology to achieve RDMA network tenant isolation. The network load balancing strategy was optimized, resulting in a link load balancing error of less than 10% and an All-Reduce bandwidth of over 35 GB/s. By employing the optimized distributed storage protocol, the model’s breakpoint recovery time was reduced to half of its original duration. The validation results demonstrate that the domestic NPU ten-thousand card cluster, with the collaborative optimization of software and hardware, can not only meet the training needs of large models with hundreds of billions of parameters but also support the training tasks of large models with trillions of parameters. |
---|---|
ISSN: | 1000-0801 |