Improved self-training-based distant label denoising method for cybersecurity entity extractions.
The task of named entity recognition (NER) plays a crucial role in extracting cybersecurity-related information. Existing approaches for cybersecurity entity extraction predominantly rely on manual labelling data, resulting in labour-intensive processes due to the lack of a cybersecurity-specific co...
| Main Authors: | Ke Zhang, Yunpeng Wang, Ou Li, Sirui Hao, Junjiang He, Xiaolong Lan, Jinneng Yang, Yang Ye |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Public Library of Science (PLoS), 2024-01-01 |
| Series: | PLoS ONE |
| Online Access: | https://doi.org/10.1371/journal.pone.0315479 |
| _version_ | 1850120506753155072 |
|---|---|
| author | Ke Zhang, Yunpeng Wang, Ou Li, Sirui Hao, Junjiang He, Xiaolong Lan, Jinneng Yang, Yang Ye |
| author_sort | Ke Zhang |
| collection | DOAJ |
| description | The task of named entity recognition (NER) plays a crucial role in extracting cybersecurity-related information. Existing approaches for cybersecurity entity extraction predominantly rely on manually labelled data, which makes them labour-intensive given the lack of a cybersecurity-specific corpus. In this paper, we propose an improved self-training-based distant label denoising method for cybersecurity entity extraction. First, we create two cybersecurity domain dictionaries. Then, we propose an algorithm that combines reverse maximum matching with part-of-speech tagging restrictions to generate distant labels for the cybersecurity domain corpus. Lastly, we propose a high-confidence text selection method and an improved self-training algorithm that incorporates a teacher-student model and weight update constraints; a model trained on high-confidence text is used to explore the true labels of low-confidence text, thereby reducing the noise in the distantly annotated data. Experimental results demonstrate that the distantly labelled cybersecurity data we obtained is of high quality. Additionally, the proposed constrained self-training algorithm improves the F1 score of several state-of-the-art NER models on this dataset, yielding a 3.5% improvement for the Vendor class and a 3.35% improvement for the Product class. |
| format | Article |
| id | doaj-art-2993c10c3ed14f2fa196bcfe8f002341 |
| institution | OA Journals |
| issn | 1932-6203 |
| language | English |
| publishDate | 2024-01-01 |
| publisher | Public Library of Science (PLoS) |
| record_format | Article |
| series | PLoS ONE |
| title | Improved self-training-based distant label denoising method for cybersecurity entity extractions. |
| url | https://doi.org/10.1371/journal.pone.0315479 |
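
The description above mentions generating distant labels by reverse maximum matching against domain dictionaries. The record gives no implementation details, so the following is only a minimal sketch of that general technique, not the authors' algorithm: the function name `reverse_max_match`, the `lexicon` layout, the `max_len` window, and the BIO tag scheme are all illustrative assumptions, and the part-of-speech restriction is omitted because the record does not specify it.

```python
from typing import Dict, List, Tuple


def reverse_max_match(
    tokens: List[str],
    lexicon: Dict[Tuple[str, ...], str],
    max_len: int = 4,
) -> List[str]:
    """Assign BIO tags by scanning right to left and preferring the longest match.

    `lexicon` maps a lowercased token tuple to an entity label (hypothetical layout).
    """
    tags = ["O"] * len(tokens)
    end = len(tokens)
    while end > 0:
        matched = False
        # Try the longest window ending at `end` first, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            start = end - size
            key = tuple(t.lower() for t in tokens[start:end])
            label = lexicon.get(key)
            if label is not None:
                tags[start] = "B-" + label
                for i in range(start + 1, end):
                    tags[i] = "I-" + label
                end = start  # continue to the left of the matched span
                matched = True
                break
        if not matched:
            end -= 1  # no dictionary entry ends here; move one token left
    return tags


if __name__ == "__main__":
    # Toy dictionary entries, invented for illustration only.
    lexicon = {
        ("microsoft",): "VENDOR",
        ("windows",): "PRODUCT",
        ("windows", "10"): "PRODUCT",
    }
    tokens = ["Microsoft", "Windows", "10", "is", "affected"]
    print(list(zip(tokens, reverse_max_match(tokens, lexicon))))
```

In this toy run the two-token entry "windows 10" wins over the shorter "windows" entry because longer windows are tried first. Labels produced this way are noisy, which is the problem the self-training denoising step in the description is meant to address.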