Improved self-training-based distant label denoising method for cybersecurity entity extractions.

Bibliographic Details
Main Authors:	Ke Zhang, Yunpeng Wang, Ou Li, Sirui Hao, Junjiang He, Xiaolong Lan, Jinneng Yang, Yang Ye
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2024-01-01
Series:	PLoS ONE
Online Access:	https://doi.org/10.1371/journal.pone.0315479

Collection:	DOAJ
Description:	Named entity recognition (NER) plays a crucial role in extracting cybersecurity-related information. Existing approaches to cybersecurity entity extraction rely predominantly on manually labelled data, which makes them labour-intensive because no cybersecurity-specific corpus is available. In this paper, we propose an improved self-training-based distant label denoising method for cybersecurity entity extraction. First, we create two cybersecurity domain dictionaries. Then, we propose an algorithm that combines reverse maximum matching with part-of-speech tagging restrictions to generate distant labels for the cybersecurity domain corpus. Lastly, we propose a high-confidence text selection method and an improved self-training algorithm that incorporates a teacher-student model and weight update constraints. A model trained on high-confidence text is used to explore the true labels of low-confidence text, thereby reducing the noise in the distantly annotated data. Experimental results demonstrate that the distantly labelled cybersecurity data we obtained is of high quality. Additionally, the proposed constrained self-training algorithm improves the F1 score of several state-of-the-art NER models on this dataset, yielding a 3.5% improvement for the Vendor class and a 3.35% improvement for the Product class.
ISSN:	1932-6203
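
The dictionary-based distant labelling step mentioned in the description can be illustrated with a minimal sketch of reverse maximum matching. This is not the authors' implementation: the function name, the BIO tagging scheme, the `max_len` window, the toy dictionary, and the way the part-of-speech restriction is applied (checking only the tag of the last token in a candidate span) are all assumptions made for illustration. Only the Vendor and Product class names come from the record.

```python
from typing import Dict, List, Optional, Sequence, Set


def reverse_maximum_match(
    tokens: Sequence[str],
    dictionary: Dict[str, str],
    max_len: int = 4,
    pos_tags: Optional[Sequence[str]] = None,
    allowed_pos: Optional[Set[str]] = None,
) -> List[str]:
    """Assign BIO labels by scanning right to left, longest match first.

    `dictionary` maps a lower-cased phrase to an entity class.
    If `pos_tags` is given, a match is accepted only when the POS tag of
    its last token is in `allowed_pos` (a hypothetical restriction).
    """
    if allowed_pos is None:
        allowed_pos = {"NOUN", "PROPN", "NUM"}
    labels = ["O"] * len(tokens)
    end = len(tokens)
    while end > 0:
        matched = False
        # Try the longest window ending at `end`, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            start = end - size
            phrase = " ".join(tokens[start:end]).lower()
            label = dictionary.get(phrase)
            if label is None:
                continue
            if pos_tags is not None and pos_tags[end - 1] not in allowed_pos:
                continue
            labels[start] = f"B-{label}"
            for i in range(start + 1, end):
                labels[i] = f"I-{label}"
            end = start  # continue scanning to the left of the match
            matched = True
            break
        if not matched:
            end -= 1  # no entry ends here; move one token to the left
    return labels


# Toy dictionary (hypothetical entries, not taken from the paper).
toy_dictionary = {
    "microsoft": "VENDOR",
    "windows": "PRODUCT",
    "windows 10": "PRODUCT",
    "patch": "PRODUCT",
}

sentence = ["Microsoft", "Windows", "10", "is", "affected"]
distant_labels = reverse_maximum_match(sentence, toy_dictionary)
```

In this sketch the longer entry "windows 10" wins over "windows" because longer windows are tried first, and passing `pos_tags` lets a verb such as "patch" in "they patch Windows" be skipped even though the string is in the dictionary.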

