Cloud-based intelligent computing center ten-thousand card cluster innovation and practice

Bibliographic Details
Main Authors: DING Hongqing, ZHANG Pengfei, NIU Hongweihua, LI Zhiyong, ZHOU Danyuan, DING Guoqiang, LI Panpan, LI Daotong, ZHANG Jiuxian
Format: Article
Language: Chinese
Published: Beijing Xintong Media Co., Ltd., 2024-12-01
Series: Dianxin kexue
Online Access: http://www.telecomsci.com/zh/article/doi/10.11959/j.issn.1000-0801.2024262/
Description
Summary: To address issues such as low computing-power availability in the ultra-large-scale clusters of intelligent computing centers, the limited maturity of domestically produced technologies, efficiency bottlenecks in large-scale networking, and complex operations and maintenance, a scheme based on cloud computing technology for building a ten-thousand card cluster in an intelligent computing center was proposed. The cluster was constructed from 18 432 NPUs over an optimized RDMA network. A multi-plane network architecture was adopted, combined with SDN technology to achieve tenant isolation on the RDMA network. The network load-balancing strategy was optimized, bringing the link load-balancing error below 10% and the All-Reduce bandwidth above 35 GB/s. With an optimized distributed storage protocol, the time to resume model training from a checkpoint (breakpoint recovery) was cut in half. The validation results demonstrate that, with collaborative software and hardware optimization, the domestic NPU ten-thousand card cluster can not only meet the training needs of large models with hundreds of billions of parameters but also support training tasks for large models with trillions of parameters.
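
The All-Reduce bandwidth figure reported in the abstract can be checked in spirit with a standard collective-bandwidth benchmark. The sketch below is illustrative only, not the authors' method: it assumes a PyTorch torch.distributed environment launched with torchrun and an NCCL-compatible backend, whereas the paper's NPU cluster would register its own collective backend (e.g. an HCCL-style plugin); the function name allreduce_bus_bandwidth is hypothetical.

import time
import torch
import torch.distributed as dist

def allreduce_bus_bandwidth(size_bytes=1 << 30, iters=20, warmup=5):
    # Time all_reduce over a large float32 tensor and report bus bandwidth
    # using the nccl-tests convention: bus_bw = algo_bw * 2 * (n - 1) / n.
    rank = dist.get_rank()
    world = dist.get_world_size()
    device = torch.device("cuda", rank % torch.cuda.device_count())
    # Zeros avoid numeric overflow as repeated sums accumulate across iterations.
    x = torch.zeros(size_bytes // 4, dtype=torch.float32, device=device)

    for _ in range(warmup):
        dist.all_reduce(x)
    torch.cuda.synchronize(device)

    t0 = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize(device)
    elapsed = (time.perf_counter() - t0) / iters

    algo_bw = size_bytes / elapsed / 1e9          # GB/s seen by each rank
    bus_bw = algo_bw * 2 * (world - 1) / world    # ring All-Reduce correction
    if rank == 0:
        print(f"world={world} algo={algo_bw:.1f} GB/s bus={bus_bw:.1f} GB/s")

if __name__ == "__main__":
    dist.init_process_group("nccl")  # an NPU cluster would use its own backend here
    allreduce_bus_bandwidth()
    dist.destroy_process_group()

A launch such as torchrun --nnodes=N --nproc_per_node=8 bench.py runs one measurement per rank and prints the aggregate on rank 0; the abstract does not specify whether the 35 GB/s figure refers to algorithm or bus bandwidth.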
ISSN: 1000-0801