Analyzing the capability description of testing institution in Chinese phrase using a joint approach of semi-supervised K-Means clustering and BERT
Abstract The capability parameters of third-party testing institutions not only serve as a critical reflection of their technical and quality management capabilities but also form the key basis for categorizing their testing abilities. However, current Chinese phrase-based descriptions of these capa...
| Main Authors: | Gaoqing Xu, Qun Chen, Shuhang Jiang, Xiaohang Fu, Yiwei Wang, Qingchun Jiao |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-04-01 |
| Series: | Scientific Reports |
| Subjects: | Detection Capabilities; Chinese phrases; Semi-supervised; K-Means; BERT |
| Online Access: | https://doi.org/10.1038/s41598-025-92296-y |
| _version_ | 1850265194494689280 |
|---|---|
| author | Gaoqing Xu, Qun Chen, Shuhang Jiang, Xiaohang Fu, Yiwei Wang, Qingchun Jiao |
| author_sort | Gaoqing Xu |
| collection | DOAJ |
| description | Abstract: The capability parameters of third-party testing institutions reflect their technical and quality management capabilities and form the key basis for categorizing their testing abilities. However, current Chinese phrase-based descriptions of these capability parameters vary in expression style and follow differing internal standards, which makes it difficult to establish consistent criteria for classifying testing capabilities. This inconsistency creates notable difficulties for clients and regulatory bodies. Clustering techniques that uncover the intrinsic relationships and latent information between testing capabilities and their corresponding parameters are therefore one important route to a scientific and reasonable classification of testing capabilities. Traditional text feature extraction methods suffer from several limitations, including sparse, high-dimensional features and a lack of semantic information, which complicate the classification and analysis of testing capability descriptions. To address this issue, this study takes the “products and testing objects” within the capability parameters of Chinese testing institutions as its research subject and proposes a joint model based on BERT and semi-supervised K-Means clustering. The model uses BERT to extract textual features from Chinese descriptive phrases and combines them with a small number of labeled samples for semi-supervised K-Means clustering. The clustering results are then used to train a multi-output Softmax classifier, enabling the classification of testing capabilities for third-party institutions. Experimental results show that the proposed method outperforms traditional methods such as TF-IDF and one-hot encoding on the Chinese description datasets of testing institutions. Specifically, it reduces the dimensionality of the textual features and improves clustering performance. When labeled samples account for 10% of the total sample size, the method achieves its best clustering results, with an average classifier accuracy of 89.8%. |
| format | Article |
| id | doaj-art-36b827fb85af44da8cf2c8e08e5379ac |
| institution | OA Journals |
| issn | 2045-2322 |
| language | English |
| publishDate | 2025-04-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Scientific Reports |
| spelling | doaj-art-36b827fb85af44da8cf2c8e08e5379ac; 2025-08-20T01:54:30Z; eng; Nature Portfolio; Scientific Reports; 2045-2322; 2025-04-01; 151117; 10.1038/s41598-025-92296-y; Gaoqing Xu (Zhejiang Institute of Standardization, Zhijiang Standardization Think Tank); Qun Chen (Zhejiang Institute of Standardization, Technology Innovation Center of State Market Regulation Management (Research and Application of Digital Market Regulation)); Shuhang Jiang (Zhejiang Jinhui Digital Technology Co. Ltd); Xiaohang Fu (Zhejiang Jinhui Digital Technology Co. Ltd); Yiwei Wang (Zhejiang Light Industrial Products Inspection and Research Institute); Qingchun Jiao (Zhejiang University of Science and Technology, Professor); https://doi.org/10.1038/s41598-025-92296-y; Detection Capabilities; Chinese phrases; Semi-supervised; K-Means; BERT |
| title | Analyzing the capability description of testing institution in Chinese phrase using a joint approach of semi-supervised K-Means clustering and BERT |
| topic | Detection Capabilities; Chinese phrases; Semi-supervised; K-Means; BERT |
| url | https://doi.org/10.1038/s41598-025-92296-y |
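
The abstract describes semi-supervised K-Means clustering guided by a small number of labeled samples. The record does not say which variant the authors use, so the following is only a minimal sketch of one common form (seeded, constrained K-Means), in which labeled samples initialise the centroids and keep their labels throughout. The function name `seeded_kmeans`, its parameters, and the toy 2-D vectors standing in for BERT phrase embeddings are all hypothetical; the Softmax classifier step is not shown.

```python
import math


def seeded_kmeans(points, seed_labels, k, n_iter=20):
    """Seeded K-Means sketch (an assumed variant, not the paper's code).

    points      -- list of equal-length feature vectors (stand-ins for embeddings)
    seed_labels -- dict mapping point index -> known cluster id in 0..k-1
    Returns (assignments, centroids).
    """
    dim = len(points[0])

    def mean(indices):
        # Component-wise mean of the selected points.
        return [sum(points[i][d] for i in indices) / len(indices) for d in range(dim)]

    # Initialise each centroid from the labeled samples of that cluster.
    centroids = []
    for c in range(k):
        members = [i for i, lab in seed_labels.items() if lab == c]
        if not members:
            raise ValueError(f"cluster {c} has no labeled seed")
        centroids.append(mean(members))

    assign = [None] * len(points)
    for _ in range(n_iter):
        new_assign = []
        for i, p in enumerate(points):
            if i in seed_labels:
                # Labeled samples never change cluster.
                new_assign.append(seed_labels[i])
            else:
                # Unlabeled samples go to the nearest centroid (Euclidean distance).
                new_assign.append(min(range(k), key=lambda c: math.dist(p, centroids[c])))
        if new_assign == assign:
            break  # converged: no assignment changed
        assign = new_assign
        # Recompute centroids; each cluster keeps at least its seeds.
        for c in range(k):
            members = [i for i, a in enumerate(assign) if a == c]
            centroids[c] = mean(members)
    return assign, centroids


# Toy data: two well-separated groups, one labeled sample per group.
points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
seed_labels = {0: 0, 3: 1}
labels, centroids = seeded_kmeans(points, seed_labels, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In a real pipeline the toy vectors would be replaced by sentence-level features extracted from the Chinese phrases, and the resulting cluster assignments could serve as training targets for a classifier, as the abstract outlines.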