Practice of large language model training optimization based on large-scale AI cluster with more than 10 000 domestic NPU
To solve the problems of low compute utilization, poor stability, difficult training optimization, and an immature domestic accelerator ecosystem in AI cluster model training at the scale of more than 10 000 NPUs, a large language model training optimization solution based on a fully domestic AI cluster was proposed. Through automatic distributed strategy recommendation, pipeline parallel optimization, overlap optimization, and full-link profiling, the model FLOPS utilization (MFU) reached 45.13% when training a 405B large language model on 16 384 domestic NPUs, more than 10% higher than the baseline. In addition, a stability assurance mechanism covering the entire training process was built, providing real-time monitoring of key indicators before and during training and rapid fault diagnosis after a training task is interrupted. The experimental results show that the proposed solution effectively improves compute utilization and offers important guidance for the future construction of domestic AI clusters and for large language model training.
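The headline metric, model FLOPS utilization (MFU), is conventionally defined as the sustained model FLOPS of the training run divided by the theoretical peak FLOPS of the cluster. The sketch below shows that calculation for the setup described in the abstract; the record does not give the per-NPU peak throughput or the measured token rate, so `PEAK_TFLOPS` and `TOKENS_PER_SEC` are illustrative assumptions chosen only to land near the reported 45.13%.

```python
# Illustrative MFU calculation for dense transformer training.
# Hardware and throughput numbers marked ASSUMED are not published
# in the article; they are placeholders for demonstration.

N_PARAMS = 405e9          # 405B-parameter model (from the abstract)
NUM_NPUS = 16_384         # cluster size (from the abstract)
PEAK_TFLOPS = 256.0       # ASSUMED peak dense BF16 TFLOPS per NPU
TOKENS_PER_SEC = 780_000  # ASSUMED aggregate training throughput

# Common approximation: training a dense transformer costs about
# 6 FLOPs per parameter per token (forward pass + backward pass).
flops_per_token = 6 * N_PARAMS

achieved_flops = flops_per_token * TOKENS_PER_SEC   # sustained FLOPS
peak_flops = NUM_NPUS * PEAK_TFLOPS * 1e12          # theoretical peak

mfu = achieved_flops / peak_flops
print(f"MFU = {mfu:.2%}")   # ~45.2% with the assumed numbers
```

Under this definition, the reported gain of more than 10% over baseline can come from any combination of higher sustained throughput (better overlap of communication and computation, fewer pipeline bubbles) at fixed cluster peak.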
| Main Authors: | LOU Tao, NIU Hongweihua, ZHANG Pengfei, DONG Jiangfan, LI Panpan, LI Daotong, XU Weidong, YAO Chenghui, XUE Lianhao, TANG Ting, XIANG Jie |
|---|---|
| Format: | Article |
| Language: | Chinese (zho) |
| Published: | Beijing Xintong Media Co., Ltd, 2025-07-01 |
| Series: | Dianxin kexue |
| Subjects: | AI cluster with more than 10 000 cards; domestic NPU accelerator card; model training optimization |
| Online Access: | http://www.telecomsci.com/zh/article/doi/10.11959/j.issn.1000-0801.2025166/ |
| author | LOU Tao, NIU Hongweihua, ZHANG Pengfei, DONG Jiangfan, LI Panpan, LI Daotong, XU Weidong, YAO Chenghui, XUE Lianhao, TANG Ting, XIANG Jie |
|---|---|
| collection | DOAJ |
| description | To solve the problems of low compute utilization, poor stability, difficult training optimization, and an immature domestic accelerator ecosystem in AI cluster model training at the scale of more than 10 000 NPUs, a large language model training optimization solution based on a fully domestic AI cluster was proposed. Through automatic distributed strategy recommendation, pipeline parallel optimization, overlap optimization, and full-link profiling, the model FLOPS utilization (MFU) reached 45.13% when training a 405B large language model on 16 384 domestic NPUs, more than 10% higher than the baseline. In addition, a stability assurance mechanism covering the entire training process was built, providing real-time monitoring of key indicators before and during training and rapid fault diagnosis after a training task is interrupted. The experimental results show that the proposed solution effectively improves compute utilization and offers important guidance for the future construction of domestic AI clusters and for large language model training. |
| format | Article |
| id | doaj-art-a21d2da6839d40f3bede00d082af6803 |
| institution | DOAJ |
| issn | 1000-0801 |
| language | zho |
| publishDate | 2025-07-01 |
| publisher | Beijing Xintong Media Co., Ltd |
| record_format | Article |
| series | Dianxin kexue |
| title | Practice of large language model training optimization based on large-scale AI cluster with more than 10 000 domestic NPU |
| topic | AI cluster with more than 10 000 cards; domestic NPU accelerator card; model training optimization |
| url | http://www.telecomsci.com/zh/article/doi/10.11959/j.issn.1000-0801.2025166/ |