A data and knowledge-driven practice for ensuring stability in ultra-large intelligent computing clusters

A data and knowledge-driven stability assurance scheme was proposed to address frequent hardware failures, persistently high training-task failure rates, and difficulties in cross-domain problem localization within ultra-large intelligent computing clusters with over...

Bibliographic Details
Main Authors: NIU Hongweihua, HUANG Yongbao, DING Guoqiang, HUANG Bao, ZHAO Zhiwen, XU Yang, WANG Tao, ZHANG Ruiling, WANG Xuan, ZHANG Yixiang
Format: Article
Language: Chinese (zho)
Published: Beijing Xintong Media Co., Ltd 2025-07-01
Series: Dianxin kexue
Subjects: intelligent computing cluster; fault diagnosis; SA-BiLSTM; knowledge graph
Online Access: http://www.telecomsci.com/zh/article/doi/10.11959/j.issn.1000-0801.2025151/
collection DOAJ
description A data- and knowledge-driven stability assurance scheme is proposed for ultra-large intelligent computing clusters with more than ten thousand computing cards, addressing frequent hardware failures, persistently high training-task failure rates, and difficult cross-domain problem localization. Cluster performance data is collected using integrated heterogeneous-resource collection technology and distributed real-time big data ETL techniques. Faults are diagnosed with an enhanced SA-BiLSTM deep learning model, and the explainability of the model's outputs is improved through knowledge graph analysis and matching, which is used to generate fault diagnosis reports. When the deep learning model extracts time-series features, the features extracted at different scales are fused by weighting, which improves the accuracy of the fault diagnosis model. In fault diagnosis simulation experiments on an 18,000-card cluster, the loss value gradually converged and stabilized at 0.047, and the model reached an accuracy of 98.4%. Practice has shown that the proposed stability assurance scheme can effectively support large-scale model training and enhance the reliability of intelligent computing clusters, providing a solid foundation for building larger intelligent computing clusters and training large models in the future.
id doaj-art-ae0d66fca8de48a4903e3dc9c1938b1b
institution Kabale University
issn 1000-0801
topic intelligent computing cluster
fault diagnosis
SA-BiLSTM
knowledge graph