Leveraging Cross-Project Similarity for Data Augmentation and Security Bug Report Prediction

Bibliographic Details
Main Authors: Jinfeng Ji, Geunseok Yang
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10978022/
collection DOAJ
description Accurately identifying security bug reports remains a key challenge in software development. Because bug reporters vary in expertise, many security bug reports are incorrectly labeled as non-security bug reports. This mislabeling increases the security risk of the software and the workload of developers, who must find the mislabeled reports among all bug reports. This study aims to improve the prediction of security bug reports by addressing the class imbalance problem and enhancing the model's ability to generalize across projects. To achieve this goal, we propose a deep learning-based prediction method combined with a novel data augmentation method based on cross-project text similarity. The bug report data is collected from four open-source projects: Ambari, Camel, Derby, and Wicket, which contain 56, 74, 179, and 47 security bug reports, respectively, and a significantly larger number of non-security bug reports. To alleviate the imbalance and leverage cross-project knowledge, we augment the dataset by identifying and merging semantically similar security bug reports from other projects. We evaluate five deep learning models: CNN, LSTM, GRU, Transformer, and BERT. Our approach achieved F1 scores between 0.60 and 0.98. The LSTM and GRU models performed best: LSTM on Ambari and GRU on Camel and Ambari each achieved an F1 score of 0.98. The overall average F1 score is 0.77, a significant improvement over the baseline classification. The results show that data augmentation based on cross-project similarities is an effective strategy to improve security bug report prediction, especially in imbalanced datasets. This approach can help developers detect security-related issues more effectively, reduce the risk of misclassification, and enhance overall software security.
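The augmentation idea in the description can be illustrated with a minimal sketch. The abstract does not state which similarity measure, threshold, or stop-word list the authors used, so everything below is an assumption made for illustration: bag-of-words cosine similarity stands in for the paper's text-similarity step, and the names `STOP_WORDS`, `tokenize`, `cosine`, `augment`, and the `threshold` value are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the abstract does not give the one actually used.
STOP_WORDS = {"a", "an", "the", "in", "on", "of", "to", "is", "and", "for"}


def tokenize(text):
    """Lowercase the text, split on non-alphanumerics, and drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def cosine(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def augment(target_security, other_security, threshold=0.3):
    """Add security reports from other projects that resemble a target report.

    The threshold is an assumed value, not one taken from the paper.
    """
    target_vecs = [Counter(tokenize(r)) for r in target_security]
    borrowed = []
    for report in other_security:
        vec = Counter(tokenize(report))
        if any(cosine(vec, tv) >= threshold for tv in target_vecs):
            borrowed.append(report)
    return list(target_security) + borrowed


# Made-up example reports, not data from the study.
target = ["SQL injection in login form allows authentication bypass"]
others = [
    "Authentication bypass through SQL injection in admin login",
    "Button color is wrong on settings page",
]
augmented = augment(target, others)
```

In this toy run only the first report from the other project shares enough vocabulary with the target report to be merged into the minority (security) class; the unrelated UI report is left out. A real pipeline would likely use a stronger text representation before training the classifiers named in the description.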
id doaj-art-8852e616df484a34a95c7e7f9c91f341
institution Kabale University
issn 2169-3536
volume 13
pages 80416-80428
doi 10.1109/ACCESS.2025.3564818
article_number 10978022
orcid Jinfeng Ji: https://orcid.org/0009-0005-4627-9304
Geunseok Yang: https://orcid.org/0000-0001-5677-5129
affiliation Jinfeng Ji: Department of Computer Applied Mathematics, Hankyong National University, Anseong, South Korea
Geunseok Yang: Department of Computer Applied Mathematics, Computer System Institute, Hankyong National University, Anseong, South Korea
keywords Bug security prediction
stop words
text similarity
software bug report analysis