Improved self-training-based distant label denoising method for cybersecurity entity extractions.

Bibliographic Details
Main Authors: Ke Zhang, Yunpeng Wang, Ou Li, Sirui Hao, Junjiang He, Xiaolong Lan, Jinneng Yang, Yang Ye
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2024-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0315479
collection DOAJ
description Named entity recognition (NER) is central to extracting cybersecurity-related information. Existing approaches to cybersecurity entity extraction rely predominantly on manually labelled data, which is labour-intensive to produce because of the lack of a cybersecurity-specific corpus. In this paper, we propose an improved self-training-based distant label denoising method for cybersecurity entity extraction. First, we create two cybersecurity domain dictionaries. Next, we propose an algorithm that combines reverse maximum matching with part-of-speech tagging restrictions to generate distant labels for the cybersecurity domain corpus. Finally, we propose a high-confidence text selection method and an improved self-training algorithm that incorporates a teacher-student model and weight update constraints; a model trained on high-confidence text is used to explore the true labels of low-confidence text, thereby reducing the noise in the distantly annotated data. Experimental results show that the resulting distantly labelled cybersecurity data is of high quality. In addition, the proposed constrained self-training algorithm improves the F1 score of several state-of-the-art NER models on this dataset, by 3.5% for the Vendor class and 3.35% for the Product class.
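The distant-labelling step named in the description (reverse maximum matching with a part-of-speech restriction) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `reverse_max_match`, the dictionary layout, the `max_len` window, the BIO label scheme, and the rule that a match is kept only when its last token carries an allowed POS tag are all assumptions made for illustration. POS tags are passed in as plain strings rather than produced by a tagger.

```python
from typing import Dict, List, Optional, Set, Tuple


def reverse_max_match(
    tokens: List[str],
    dictionary: Dict[Tuple[str, ...], str],
    pos_tags: Optional[List[str]] = None,
    allowed_pos: Optional[Set[str]] = None,
    max_len: int = 4,
) -> List[str]:
    """Assign BIO labels by scanning right to left, longest match first.

    `dictionary` maps lower-cased token tuples to an entity type
    (hypothetical layout). If `pos_tags` and `allowed_pos` are given,
    a match is accepted only when its last token has an allowed tag
    (an assumed form of the POS restriction).
    """
    labels = ["O"] * len(tokens)
    end = len(tokens)
    while end > 0:
        matched = False
        # Try the longest window ending at `end`, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            start = end - size
            key = tuple(t.lower() for t in tokens[start:end])
            entity_type = dictionary.get(key)
            if entity_type is None:
                continue
            if pos_tags is not None and allowed_pos is not None:
                if pos_tags[end - 1] not in allowed_pos:
                    continue  # rejected by the POS restriction
            labels[start] = "B-" + entity_type
            for i in range(start + 1, end):
                labels[i] = "I-" + entity_type
            end = start  # continue to the left of the match
            matched = True
            break
        if not matched:
            end -= 1  # no entry ends here; move one token left
    return labels


# Hypothetical dictionary entries for the Vendor and Product classes.
demo_dictionary = {
    ("microsoft",): "Vendor",
    ("internet", "explorer"): "Product",
    ("windows",): "Product",
}
demo_tokens = ["Microsoft", "released", "a", "patch", "for", "Internet", "Explorer"]
demo_labels = reverse_max_match(demo_tokens, demo_dictionary)
```

In this sketch the POS restriction filters dictionary hits on ordinary words: with `allowed_pos={"NNP", "NNPS"}`, the token "windows" tagged `NNS` in "Open the windows" is left unlabelled even though it appears in the dictionary.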
id doaj-art-2993c10c3ed14f2fa196bcfe8f002341
institution OA Journals
issn 1932-6203
citation PLoS ONE 19(12): e0315479 (2024)