Data augmentation for dense passage retrieval using corpus-passage frequency-based token deletion

Abstract This paper proposes a novel data augmentation method to address class imbalance in large-scale information retrieval systems. In particular, a corpus-passage frequency-based token deletion technique is introduced to improve the accuracy of Dense Passage Retrieval, which is a dense vector-ba...
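The abstract contrasts uniform random token deletion with deletion guided by passage-level and corpus-level token frequencies, but this record does not give the scoring formula. The sketch below is only an illustration under a stated assumption: importance is taken to be a TF-IDF-like product (passage term frequency times smoothed inverse document frequency), and deletion probability is made inversely proportional to it. The function names, the `rate` parameter, and the scoring rule are hypothetical and are not taken from the paper or its repository.

```python
import math
import random
from collections import Counter


def deletion_probs(passage, corpus, rate=0.1):
    """Return one deletion probability per token of `passage`.

    ASSUMPTION: importance = passage term frequency * smoothed IDF.
    The catalogued paper may define importance differently.
    """
    if not passage:
        return []
    n_docs = len(corpus)
    # Document frequency: number of corpus passages containing each token.
    df = Counter(tok for doc in corpus for tok in set(doc))
    tf = Counter(passage)
    importance = [
        (tf[t] / len(passage)) * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
        for t in passage
    ]
    # Less important tokens get a larger share of the deletion budget.
    inverse = [1.0 / score for score in importance]
    mean_inverse = sum(inverse) / len(inverse)
    # Scale so the average deletion probability equals `rate` (before clipping).
    return [min(1.0, rate * w / mean_inverse) for w in inverse]


def delete_tokens(passage, corpus, rate=0.1, rng=None):
    """Sample an augmented copy of `passage` with some tokens removed."""
    rng = rng or random.Random()
    probs = deletion_probs(passage, corpus, rate)
    kept = [t for t, p in zip(passage, probs) if rng.random() >= p]
    # Never return an empty passage; fall back to the original tokens.
    return kept or list(passage)


corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "bird", "flew"]]
passage = corpus[0]
print(deletion_probs(passage, corpus, rate=0.1))
print(delete_tokens(passage, corpus, rate=0.3, rng=random.Random(0)))
```

With this toy corpus, "the" occurs in every passage, so it receives a higher deletion probability than "cat" or "sat", while the average deletion rate stays at `rate`. Uniform random deletion would instead assign every token the same probability.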


Bibliographic Details
Main Authors: A-Seong Moon, Kyumin Kim, Jaesung Lee
Format: Article
Language: English
Published: SpringerOpen 2025-08-01
Series: Journal of Big Data
Subjects: Information retrieval; Data augmentation; Natural language processing; Class imbalance
Online Access: https://doi.org/10.1186/s40537-025-01257-9
_version_ 1849764763309965312
author A-Seong Moon
Kyumin Kim
Jaesung Lee
author_facet A-Seong Moon
Kyumin Kim
Jaesung Lee
author_sort A-Seong Moon
collection DOAJ
description Abstract This paper proposes a novel data augmentation method to address class imbalance in large-scale information retrieval systems. In particular, a corpus-passage frequency-based token deletion technique is introduced to improve the accuracy of Dense Passage Retrieval, which is a dense vector-based information retrieval model. Unlike traditional random token deletion methods that delete tokens with equal probability, the proposed method calculates token importance by considering both passage and corpus-level frequencies, leading to more effective token deletion. Experimental results demonstrate that the proposed approach significantly improves Top-k accuracy on smaller datasets compared to conventional augmentation techniques. While maintaining competitive performance on larger-scale datasets, its relative effectiveness is particularly notable in scenarios characterized by limited training data and severe class imbalance. This confirms its potential to improve the generalizability of information retrieval models. The source code is publicly available at https://github.com/asmoon002/DPR_TD .
format Article
id doaj-art-cae1d25d4ba642ed81522114b2a46462
institution DOAJ
issn 2196-1115
language English
publishDate 2025-08-01
publisher SpringerOpen
record_format Article
series Journal of Big Data
spelling doaj-art-cae1d25d4ba642ed81522114b2a46462
2025-08-20T03:05:03Z
eng
SpringerOpen
Journal of Big Data
2196-1115
2025-08-01
121128
10.1186/s40537-025-01257-9
Data augmentation for dense passage retrieval using corpus-passage frequency-based token deletion
A-Seong Moon, Department of Artificial Intelligence, Chung-Ang University
Kyumin Kim, Department of Artificial Intelligence, Chung-Ang University
Jaesung Lee, Department of Artificial Intelligence, Chung-Ang University
Abstract This paper proposes a novel data augmentation method to address class imbalance in large-scale information retrieval systems. In particular, a corpus-passage frequency-based token deletion technique is introduced to improve the accuracy of Dense Passage Retrieval, which is a dense vector-based information retrieval model. Unlike traditional random token deletion methods that delete tokens with equal probability, the proposed method calculates token importance by considering both passage and corpus-level frequencies, leading to more effective token deletion. Experimental results demonstrate that the proposed approach significantly improves Top-k accuracy on smaller datasets compared to conventional augmentation techniques. While maintaining competitive performance on larger-scale datasets, its relative effectiveness is particularly notable in scenarios characterized by limited training data and severe class imbalance. This confirms its potential to improve the generalizability of information retrieval models. The source code is publicly available at https://github.com/asmoon002/DPR_TD .
https://doi.org/10.1186/s40537-025-01257-9
Information retrieval
Data augmentation
Natural language processing
Class imbalance
spellingShingle A-Seong Moon
Kyumin Kim
Jaesung Lee
Data augmentation for dense passage retrieval using corpus-passage frequency-based token deletion
Journal of Big Data
Information retrieval
Data augmentation
Natural language processing
Class imbalance
title Data augmentation for dense passage retrieval using corpus-passage frequency-based token deletion
title_sort data augmentation for dense passage retrieval using corpus passage frequency based token deletion
topic Information retrieval
Data augmentation
Natural language processing
Class imbalance
url https://doi.org/10.1186/s40537-025-01257-9