Noisy Token Removal for Bug Localization: The Impact of Semantically Confusing Misguiding Terms

A bug report is a technical document describing bugs that have occurred in the software. Finding the source code files to resolve a reported bug is a laborious task. To automate this process, information retrieval-based bug localization (IRBL) techniques have been proposed. These techniques assess t...

Bibliographic Details
Main Authors: Youngkyoung Kim, Misoo Kim, Eunseok Lee
Format: Article
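Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
Subjects: Automated debugging; bug localization; bug report; deep learning; explainable AI; information retrieval
Online Access: https://ieeexplore.ieee.org/document/10755074/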
author Youngkyoung Kim
Misoo Kim
Eunseok Lee
collection DOAJ
description A bug report is a technical document describing bugs that have occurred in the software. Finding the source code files to resolve a reported bug is a laborious task. To automate this process, information retrieval-based bug localization (IRBL) techniques have been proposed. These techniques assess the relevance between the bug report and source files, providing developers with a ranked list of source files. They rely heavily on text tokens, making it essential to remove noisy tokens from the input. To address the problem of prevalent noisy tokens degrading IRBL performance, we define impactful noisy words as misguiding terms and investigate their prevalence and impact. We employed a deep learning model combined with explainable AI techniques to detect misguiding terms, leveraging their semantic embedding capabilities. We conducted extensive experiments on 24 open-source software projects and three IRBL models. Removing misguiding terms improved the mean reciprocal rank of bug localization by 19%, 17%, and 27% on average for the three models, and by up to 120%. The proposed approach effectively distinguishes between beneficial terms and noise, leading to superior IRBL performance compared to existing noise detection approaches, with consistent improvements observed across the 24 projects.
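The mean reciprocal rank (MRR) figure quoted in the description can be illustrated with a short sketch. The function names, file names, and rankings below are hypothetical and are not taken from the paper; the code shows only the standard MRR definition and how a relative improvement between two rankings is computed.

```python
from typing import Iterable, Sequence, Set, Tuple


def reciprocal_rank(ranked_files: Sequence[str], buggy_files: Set[str]) -> float:
    """Return 1/rank of the first truly buggy file, or 0.0 if none is ranked."""
    for position, path in enumerate(ranked_files, start=1):
        if path in buggy_files:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(results: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average the reciprocal rank over all bug reports (0.0 for an empty set)."""
    pairs = list(results)
    if not pairs:
        return 0.0
    return sum(reciprocal_rank(ranked, buggy) for ranked, buggy in pairs) / len(pairs)


# Hypothetical rankings for two bug reports, before and after token removal.
before = [
    (["A.java", "B.java", "C.java"], {"B.java"}),            # first hit at rank 2
    (["D.java", "E.java", "F.java", "G.java"], {"G.java"}),  # first hit at rank 4
]
after = [
    (["B.java", "A.java", "C.java"], {"B.java"}),            # first hit at rank 1
    (["D.java", "G.java", "E.java", "F.java"], {"G.java"}),  # first hit at rank 2
]

mrr_before = mean_reciprocal_rank(before)
mrr_after = mean_reciprocal_rank(after)

# Relative improvement, the kind of quantity reported as a percentage gain.
relative_gain = (mrr_after - mrr_before) / mrr_before

print(f"MRR before: {mrr_before:.3f}, after: {mrr_after:.3f}, gain: {relative_gain:.0%}")
```

In this toy data the buggy file moves from rank 2 to 1 and from rank 4 to 2, so the MRR doubles; the percentages in the description are relative gains of this kind, measured on the authors' own data.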
format Article
id doaj-art-eb58e899f5d947169cf18d703e7eceac
institution OA Journals
issn 2169-3536
language English
publishDate 2024-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-eb58e899f5d947169cf18d703e7eceac
2025-08-20T02:01:54Z
eng
IEEE
IEEE Access
ISSN 2169-3536
2024-01-01
Volume 12, pages 172396-172409
DOI 10.1109/ACCESS.2024.3500367
IEEE Xplore document 10755074
Noisy Token Removal for Bug Localization: The Impact of Semantically Confusing Misguiding Terms
Youngkyoung Kim (https://orcid.org/0000-0001-5457-7997)
Misoo Kim (https://orcid.org/0000-0002-8274-5457)
Eunseok Lee (https://orcid.org/0000-0002-6557-8087)
Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon-si, Gyeonggi-do, South Korea
Department of Artificial Intelligence Convergence, Chonnam National University, Gwangju, South Korea
College of Computing and Informatics, Sungkyunkwan University, Suwon-si, Gyeonggi-do, South Korea
title Noisy Token Removal for Bug Localization: The Impact of Semantically Confusing Misguiding Terms
topic Automated debugging
bug localization
bug report
deep learning
explainable AI
information retrieval
url https://ieeexplore.ieee.org/document/10755074/