Practice of large language model training optimization based on large-scale AI cluster with more than 10 000 domestic NPU

To address the low compute utilization, poor stability, high training-optimization difficulty, and immature domestic accelerator ecosystem encountered when training models on AI clusters with more than 10 000 NPUs, a large language model training optimization solution based on a fully domestic AI cluster was proposed. Through automatic distributed strategy recommendation, pipeline parallel optimization, overlap optimization, and full-link profiling, the model FLOPS utilization (MFU) reached 45.13% when training a 405B large language model on 16 384 domestic NPUs, more than 10% above the baseline. In addition, a stability assurance mechanism covering the entire training process was built, providing real-time monitoring of key indicators before and during training and rapid fault diagnosis after a training task is interrupted. Experimental results show that the proposed solution effectively improves compute utilization and offers important guidance for the future construction of domestic AI clusters and large language model training.
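
The abstract reports an MFU of 45.13% for a 405B model on 16 384 NPUs but does not spell out the metric. As a reference point, the sketch below shows the standard MFU calculation for dense transformer training; the per-NPU peak FLOPS and throughput values are illustrative assumptions, not figures from the paper.

```python
# Minimal sketch of the standard MFU calculation (not from the paper):
# MFU = achieved model FLOPS / aggregate peak hardware FLOPS, using the
# common ~6*N FLOPs-per-token estimate for dense transformer training.

def model_flops_utilization(
    n_params: float,               # model size in parameters, e.g. 405e9
    tokens_per_second: float,      # cluster-wide training throughput
    num_devices: int,              # accelerator count, e.g. 16_384
    peak_flops_per_device: float,  # vendor peak FLOPS per accelerator
) -> float:
    # Forward + backward pass costs roughly 6 FLOPs per parameter per
    # token for a dense transformer (more with activation recomputation).
    achieved_flops = 6.0 * n_params * tokens_per_second
    peak_flops = peak_flops_per_device * num_devices
    return achieved_flops / peak_flops

# Hypothetical inputs: the per-NPU peak and the throughput below are
# illustrative assumptions, not values reported in the paper.
mfu = model_flops_utilization(
    n_params=405e9,                # the 405B model from the abstract
    tokens_per_second=9.1e5,       # assumed aggregate throughput
    num_devices=16_384,            # NPU count from the abstract
    peak_flops_per_device=300e12,  # assumed 300 TFLOPS peak per NPU
)
print(f"MFU = {mfu:.2%}")          # ~45% with these assumed inputs
```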

Bibliographic Details
Main Authors: LOU Tao, NIU Hongweihua, ZHANG Pengfei, DONG Jiangfan, LI Panpan, LI Daotong, XU Weidong, YAO Chenghui, XUE Lianhao, TANG Ting, XIANG Jie
Format: Article
Language: Chinese (zho)
Published: Beijing Xintong Media Co., Ltd, 2025-07-01
Series: Dianxin kexue (Telecommunications Science)
ISSN: 1000-0801
Subjects: AI cluster with more than 10 000 cards; domestic NPU accelerator card; model training optimization
Online Access:http://www.telecomsci.com/zh/article/doi/10.11959/j.issn.1000-0801.2025166/