Practice of large language model training optimization based on large-scale AI cluster with more than 10 000 domestic NPU

To address the low compute utilization, poor stability, high training-optimization difficulty, and immature domestic accelerator ecosystem encountered when training models on AI clusters with more than 10 000 NPUs, a large language model training optimization solution based on a fully domestic AI cluster was proposed. Through automatic distributed strategy recommendation, pipeline parallel optimization, overlap optimization, and full-link profiling, the model FLOPS utilization (MFU) reached 45.13% when training a 405B large language model on 16 384 domestic NPUs, more than 10% above the baseline. In addition, a stability assurance mechanism covering the entire training process was built, providing real-time monitoring of key indicators before and during training and rapid fault diagnosis after a training task is interrupted. Experimental results show that the proposed solution effectively improves compute utilization and offers important guidance for the future construction of domestic AI clusters and large language model training.
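
The abstract reports an MFU of 45.13% for a 405B model on 16 384 NPUs but does not spell out the metric. As a reference point, the sketch below shows the standard MFU calculation for dense transformer training; the per-NPU peak FLOPS and throughput values are illustrative assumptions, not figures from the paper.

```python
# Minimal sketch of the standard MFU calculation (not from the paper):
# MFU = achieved model FLOPS / aggregate peak hardware FLOPS, using the
# common ~6*N FLOPs-per-token estimate for dense transformer training.

def model_flops_utilization(
    n_params: float,               # model size in parameters, e.g. 405e9
    tokens_per_second: float,      # cluster-wide training throughput
    num_devices: int,              # accelerator count, e.g. 16_384
    peak_flops_per_device: float,  # vendor peak FLOPS per accelerator
) -> float:
    # Forward + backward pass costs roughly 6 FLOPs per parameter per
    # token for a dense transformer (more with activation recomputation).
    achieved_flops = 6.0 * n_params * tokens_per_second
    peak_flops = peak_flops_per_device * num_devices
    return achieved_flops / peak_flops

# Hypothetical inputs: the per-NPU peak and the throughput below are
# illustrative assumptions, not values reported in the paper.
mfu = model_flops_utilization(
    n_params=405e9,                # the 405B model from the abstract
    tokens_per_second=9.1e5,       # assumed aggregate throughput
    num_devices=16_384,            # NPU count from the abstract
    peak_flops_per_device=300e12,  # assumed 300 TFLOPS peak per NPU
)
print(f"MFU = {mfu:.2%}")          # ~45% with these assumed inputs
```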

Bibliographic Details
Main Authors: LOU Tao, NIU Hongweihua, ZHANG Pengfei, DONG Jiangfan, LI Panpan, LI Daotong, XU Weidong, YAO Chenghui, XUE Lianhao, TANG Ting, XIANG Jie
Format: Article
Language: Chinese (zho)
Published: Beijing Xintong Media Co., Ltd, 2025-07-01
Series: Dianxin kexue (Telecommunications Science)
ISSN: 1000-0801
Subjects: AI cluster with more than 10 000 cards; domestic NPU accelerator card; model training optimization
Online Access:http://www.telecomsci.com/zh/article/doi/10.11959/j.issn.1000-0801.2025166/