Analyzing the capability description of testing institution in Chinese phrase using a joint approach of semi-supervised K-Means clustering and BERT

Abstract The capability parameters of third-party testing institutions not only serve as a critical reflection of their technical and quality management capabilities but also form the key basis for categorizing their testing abilities. However, current Chinese phrase-based descriptions of these capa...

Bibliographic Details
Main Authors: Gaoqing Xu, Qun Chen, Shuhang Jiang, Xiaohang Fu, Yiwei Wang, Qingchun Jiao
Format: Article
Language: English
Published: Nature Portfolio 2025-04-01
Series: Scientific Reports
Subjects: Detection Capabilities; Chinese phrases; Semi-supervised; K-Means; BERT
Online Access: https://doi.org/10.1038/s41598-025-92296-y
_version_ 1850265194494689280
author Gaoqing Xu
Qun Chen
Shuhang Jiang
Xiaohang Fu
Yiwei Wang
Qingchun Jiao
collection DOAJ
description The capability parameters of third-party testing institutions reflect their technical and quality-management capabilities and form the key basis for categorizing their testing abilities. However, the Chinese phrases currently used to describe these parameters vary with expression style and with each institution's internal standards, which makes it difficult to establish consistent criteria for classifying testing capabilities. This inconsistency creates notable difficulties for clients and regulatory bodies. Clustering can uncover the intrinsic relationships and latent information between testing capabilities and their corresponding parameters, and is therefore an important route to a scientific and reasonable classification of testing capabilities. Traditional text feature extraction methods, however, yield sparse, high-dimensional features that lack semantic information, which complicates the classification and analysis of testing capability descriptions. To address this, the study takes the “products and testing objects” within the capability parameters of Chinese testing institutions as its research subject and proposes a joint model based on BERT and semi-supervised K-Means clustering. The model uses BERT to extract textual features from the Chinese descriptive phrases and combines them with a small number of labeled samples for semi-supervised K-Means clustering. The clustering results are then used to train a multi-output Softmax classifier, which classifies the testing capabilities of third-party institutions. Experiments on Chinese description datasets from testing institutions show that the proposed method outperforms traditional methods such as TF-IDF and one-hot encoding, specifically in reducing the dimensionality of textual features and improving clustering performance. The method achieves its best clustering results when labeled samples make up 10% of the total sample size, with an average classifier accuracy of 89.8%.
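The semi-supervised clustering step described in the abstract can be illustrated with a minimal sketch. This record does not say which semi-supervised K-Means variant the authors use, so the code below assumes one common form, seeded K-Means, in which the labeled samples initialise the centroids and keep their cluster throughout. The function name `seeded_kmeans` is hypothetical, and synthetic 2-D points stand in for the BERT phrase embeddings, which are not reproduced here.

```python
import random


def seeded_kmeans(points, seed_labels, k, n_iter=50):
    """Cluster `points` (equal-length lists of floats) into k groups.

    `seed_labels` maps a point index to its known cluster id (0..k-1).
    Assumption: labeled points initialise the centroids and never change
    cluster; the paper's actual variant may differ.
    """
    dim = len(points[0])

    def mean(indices):
        return [sum(points[i][d] for i in indices) / len(indices)
                for d in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Initialise each centroid from the labeled seeds of that cluster.
    centroids = []
    for c in range(k):
        members = [i for i, lab in seed_labels.items() if lab == c]
        if not members:
            raise ValueError(f"cluster {c} has no labeled seed")
        centroids.append(mean(members))

    assign = [None] * len(points)
    for _ in range(n_iter):
        new_assign = []
        for i, p in enumerate(points):
            if i in seed_labels:
                # Labeled samples stay in their known cluster.
                new_assign.append(seed_labels[i])
            else:
                # Unlabeled samples go to the nearest centroid.
                new_assign.append(
                    min(range(k), key=lambda c: sq_dist(p, centroids[c])))
        if new_assign == assign:
            break  # assignments stopped changing
        assign = new_assign
        for c in range(k):
            # Never empty: every cluster keeps at least its seed points.
            members = [i for i, a in enumerate(assign) if a == c]
            centroids[c] = mean(members)
    return assign, centroids


# Synthetic stand-in for phrase embeddings: three well-separated groups.
rng = random.Random(0)
centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points, truth = [], []
for c, (cx, cy) in enumerate(centers):
    for _ in range(20):
        points.append([cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)])
        truth.append(c)

# Label 10% of the samples (two per cluster), echoing the reported ratio.
seed_labels = {i: truth[i] for i in (0, 1, 20, 21, 40, 41)}
assign, centroids = seeded_kmeans(points, seed_labels, k=3)
accuracy = sum(a == t for a, t in zip(assign, truth)) / len(truth)
print(f"agreement with ground truth: {accuracy:.2f}")
```

In the pipeline the abstract describes, the resulting cluster assignments would then serve as training labels for the Softmax classifier; that stage is omitted from this sketch.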
format Article
id doaj-art-36b827fb85af44da8cf2c8e08e5379ac
institution OA Journals
issn 2045-2322
language English
publishDate 2025-04-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-36b827fb85af44da8cf2c8e08e5379ac
2025-08-20T01:54:30Z
eng
Nature Portfolio
Scientific Reports
2045-2322
2025-04-01
Volume 15, issue 1, pages 1-17
10.1038/s41598-025-92296-y
Analyzing the capability description of testing institution in Chinese phrase using a joint approach of semi-supervised K-Means clustering and BERT
Gaoqing Xu (Zhejiang Institute of Standardization, Zhijiang Standardization Think Tank)
Qun Chen (Zhejiang Institute of Standardization, Technology Innovation Center of State Market Regulation Management (Research and Application of Digital Market Regulation))
Shuhang Jiang (Zhejiang Jinhui Digital Technology Co. Ltd)
Xiaohang Fu (Zhejiang Jinhui Digital Technology Co. Ltd)
Yiwei Wang (Zhejiang Light Industrial Products Inspection and Research Institute)
Qingchun Jiao (Zhejiang University of Science and Technology, Professor)
title Analyzing the capability description of testing institution in Chinese phrase using a joint approach of semi-supervised K-Means clustering and BERT
topic Detection Capabilities
Chinese phrases
Semi-supervised
K-Means
BERT
url https://doi.org/10.1038/s41598-025-92296-y