Leveraging Cross-Project Similarity for Data Augmentation and Security Bug Report Prediction

Accurately identifying security bug reports remains a key challenge in software development. Due to the varying expertise of bug reporters, many security bug reports are incorrectly labeled as non-security bug reports, which increases the security risk of the software and the workload of developers t...

Bibliographic Details
Main Authors:	Jinfeng Ji, Geunseok Yang
Format:	Article
Language:	English
Published:	IEEE 2025-01-01
Series:	IEEE Access
Subjects:	Bug security prediction; stop words; text similarity; software bug report analysis
Online Access:	https://ieeexplore.ieee.org/document/10978022/

_version_	1849322333936812032
author	Jinfeng Ji Geunseok Yang
author_facet	Jinfeng Ji Geunseok Yang
author_sort	Jinfeng Ji
collection	DOAJ
description	Accurately identifying security bug reports remains a key challenge in software development. Because bug reporters vary in expertise, many security bug reports are incorrectly labeled as non-security bug reports. This increases the security risk of the software and the workload of developers, who must pick out the mislabeled reports from among all bug reports. This study aims to improve the prediction of security bug reports by addressing the class imbalance problem and enhancing the generalization ability of the model across projects. To achieve this goal, we propose a deep learning-based prediction method combined with a novel data augmentation method based on cross-project text similarity. The bug report data is collected from four open-source projects: Ambari, Camel, Derby, and Wicket, which contain 56, 74, 179, and 47 security bug reports, respectively, and a significantly higher number of non-security bug reports. To alleviate the imbalance and leverage cross-project knowledge, we augment the dataset by identifying and merging semantically similar security bug reports from other projects. We evaluate five deep learning models: CNN, LSTM, GRU, Transformer, and BERT. Our approach achieved F1 scores between 0.60 and 0.98, with the best performance from the LSTM and GRU models; in particular, LSTM on Ambari and GRU on Camel and Ambari each achieved an F1 score of 0.98. The overall average F1 score is 0.77, a significant improvement over the baseline classification. The results show that data augmentation based on cross-project similarity is an effective strategy for improving security bug report prediction, especially on imbalanced datasets. This approach can help developers detect security-related issues more effectively, reduce the risk of misclassification, and enhance overall software security.
format	Article
id	doaj-art-8852e616df484a34a95c7e7f9c91f341
institution	Kabale University
issn	2169-3536
language	English
publishDate	2025-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj-art-8852e616df484a34a95c7e7f9c91f341 | 2025-08-20T03:49:23Z | eng | IEEE | IEEE Access | 2169-3536 | 2025-01-01 | 13 | 80416 | 80428 | 10.1109/ACCESS.2025.3564818 | 10978022 | Leveraging Cross-Project Similarity for Data Augmentation and Security Bug Report Prediction | Jinfeng Ji (https://orcid.org/0009-0005-4627-9304), Department of Computer Applied Mathematics, Hankyong National University, Anseong, South Korea | Geunseok Yang (https://orcid.org/0000-0001-5677-5129), Department of Computer Applied Mathematics, Computer System Institute, Hankyong National University, Anseong, South Korea | https://ieeexplore.ieee.org/document/10978022/ | Bug security prediction; stop words; text similarity; software bug report analysis
spellingShingle	Jinfeng Ji Geunseok Yang Leveraging Cross-Project Similarity for Data Augmentation and Security Bug Report Prediction IEEE Access Bug security prediction stop words text similarity software bug report analysis
title	Leveraging Cross-Project Similarity for Data Augmentation and Security Bug Report Prediction
title_sort	leveraging cross project similarity for data augmentation and security bug report prediction
topic	Bug security prediction; stop words; text similarity; software bug report analysis
url	https://ieeexplore.ieee.org/document/10978022/
work_keys_str_mv	AT jinfengji leveragingcrossprojectsimilarityfordataaugmentationandsecuritybugreportprediction AT geunseokyang leveragingcrossprojectsimilarityfordataaugmentationandsecuritybugreportprediction
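
The description above says the dataset is augmented by identifying and merging similar security bug reports from other projects. The following is a minimal sketch of that idea, not the authors' implementation: it assumes a plain bag-of-words cosine similarity and a fixed threshold, neither of which is stated in the record. The names tokenize, cosine, augment, and the threshold value 0.3 are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def augment(target_reports, other_reports, threshold=0.3):
    """Return the target project's security reports plus every report from
    other projects whose similarity to at least one target report reaches
    the threshold. The threshold is an assumed, illustrative value."""
    target_vecs = [Counter(tokenize(r)) for r in target_reports]
    borrowed = []
    for report in other_reports:
        vec = Counter(tokenize(report))
        if any(cosine(vec, t) >= threshold for t in target_vecs):
            borrowed.append(report)
    return list(target_reports) + borrowed


# Hypothetical example data, not taken from the Ambari, Camel, Derby, or Wicket datasets.
target = ["SQL injection in login form allows authentication bypass"]
others = [
    "Authentication bypass through SQL injection in admin login",
    "Update documentation for release notes",
]
augmented = augment(target, others)
```

In this sketch only the first report from the other projects shares enough vocabulary with the target report to be merged. A real pipeline would likely use stop-word removal and a stronger text representation before training any of the listed models.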

Similar Items