Data augmentation for dense passage retrieval using corpus-passage frequency-based token deletion
Abstract This paper proposes a novel data augmentation method to address class imbalance in large-scale information retrieval systems. In particular, a corpus-passage frequency-based token deletion technique is introduced to improve the accuracy of Dense Passage Retrieval, which is a dense vector-ba...
| Main Authors: | A-Seong Moon, Kyumin Kim, Jaesung Lee |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | SpringerOpen, 2025-08-01 |
| Series: | Journal of Big Data |
| Subjects: | Information retrieval; Data augmentation; Natural language processing; Class imbalance |
| Online Access: | https://doi.org/10.1186/s40537-025-01257-9 |
| _version_ | 1849764763309965312 |
|---|---|
| author | A-Seong Moon Kyumin Kim Jaesung Lee |
| author_facet | A-Seong Moon Kyumin Kim Jaesung Lee |
| author_sort | A-Seong Moon |
| collection | DOAJ |
| description | Abstract This paper proposes a novel data augmentation method to address class imbalance in large-scale information retrieval systems. In particular, a corpus-passage frequency-based token deletion technique is introduced to improve the accuracy of Dense Passage Retrieval, which is a dense vector-based information retrieval model. Unlike traditional random token deletion methods that delete tokens with equal probability, the proposed method calculates token importance by considering both passage and corpus-level frequencies, leading to more effective token deletion. Experimental results demonstrate that the proposed approach significantly improves Top-k accuracy on smaller datasets compared to conventional augmentation techniques. While maintaining competitive performance on larger-scale datasets, its relative effectiveness is particularly notable in scenarios characterized by limited training data and severe class imbalance. This confirms its potential to improve the generalizability of information retrieval models. The source code is publicly available at https://github.com/asmoon002/DPR_TD . |
| format | Article |
| id | doaj-art-cae1d25d4ba642ed81522114b2a46462 |
| institution | DOAJ |
| issn | 2196-1115 |
| language | English |
| publishDate | 2025-08-01 |
| publisher | SpringerOpen |
| record_format | Article |
| series | Journal of Big Data |
| spelling | doaj-art-cae1d25d4ba642ed81522114b2a46462; 2025-08-20T03:05:03Z; eng; SpringerOpen; Journal of Big Data; 2196-1115; 2025-08-01; 121128; 10.1186/s40537-025-01257-9; Data augmentation for dense passage retrieval using corpus-passage frequency-based token deletion; A-Seong Moon [0]; Kyumin Kim [1]; Jaesung Lee [2]; Department of Artificial Intelligence, Chung-Ang University; Department of Artificial Intelligence, Chung-Ang University; Department of Artificial Intelligence, Chung-Ang University; Abstract This paper proposes a novel data augmentation method to address class imbalance in large-scale information retrieval systems. In particular, a corpus-passage frequency-based token deletion technique is introduced to improve the accuracy of Dense Passage Retrieval, which is a dense vector-based information retrieval model. Unlike traditional random token deletion methods that delete tokens with equal probability, the proposed method calculates token importance by considering both passage and corpus-level frequencies, leading to more effective token deletion. Experimental results demonstrate that the proposed approach significantly improves Top-k accuracy on smaller datasets compared to conventional augmentation techniques. While maintaining competitive performance on larger-scale datasets, its relative effectiveness is particularly notable in scenarios characterized by limited training data and severe class imbalance. This confirms its potential to improve the generalizability of information retrieval models. The source code is publicly available at https://github.com/asmoon002/DPR_TD .; https://doi.org/10.1186/s40537-025-01257-9; Information retrieval; Data augmentation; Natural language processing; Class imbalance |
| spellingShingle | A-Seong Moon Kyumin Kim Jaesung Lee Data augmentation for dense passage retrieval using corpus-passage frequency-based token deletion Journal of Big Data Information retrieval Data augmentation Natural language processing Class imbalance |
| title | Data augmentation for dense passage retrieval using corpus-passage frequency-based token deletion |
| title_sort | data augmentation for dense passage retrieval using corpus passage frequency based token deletion |
| topic | Information retrieval Data augmentation Natural language processing Class imbalance |
| url | https://doi.org/10.1186/s40537-025-01257-9 |
| work_keys_str_mv | AT aseongmoon dataaugmentationfordensepassageretrievalusingcorpuspassagefrequencybasedtokendeletion AT kyuminkim dataaugmentationfordensepassageretrievalusingcorpuspassagefrequencybasedtokendeletion AT jaesunglee dataaugmentationfordensepassageretrievalusingcorpuspassagefrequencybasedtokendeletion |
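
The abstract contrasts uniform random token deletion with deletion guided by passage-level and corpus-level token frequencies. This record does not give the paper's scoring formula, so the following is only a minimal Python sketch of the general idea under stated assumptions: importance is taken to be a TF-IDF-style product (passage frequency times a smoothed inverse corpus-passage frequency), and tokens are deleted with probability inversely proportional to that score. The function names `token_importance` and `delete_tokens` and the `rate` parameter are hypothetical and are not taken from the authors' repository.

```python
import math
import random
from collections import Counter


def token_importance(passage, corpus):
    """Score each distinct token in `passage` with a TF-IDF-style weight.

    Assumption: the paper's exact formula is not stated in this record, so
    this uses passage frequency times a smoothed inverse corpus-passage
    frequency as a stand-in.
    """
    n_passages = len(corpus)
    # Number of corpus passages that contain each token at least once.
    passage_freq = Counter(tok for p in corpus for tok in set(p))
    term_freq = Counter(passage)
    return {
        tok: (count / len(passage))
        * (math.log((1 + n_passages) / (1 + passage_freq[tok])) + 1.0)
        for tok, count in term_freq.items()
    }


def delete_tokens(passage, corpus, rate=0.1, seed=None):
    """Return a copy of `passage` with roughly `rate` of its tokens removed.

    Positions holding low-importance tokens are more likely to be deleted.
    At least one token is always kept.
    """
    if not passage:
        return []
    rng = random.Random(seed)
    scores = token_importance(passage, corpus)
    n_delete = min(len(passage) - 1, round(rate * len(passage)))
    # Deletion weight is the inverse of importance (scores are always > 0).
    weights = [1.0 / scores[tok] for tok in passage]
    positions = list(range(len(passage)))
    to_delete = set()
    # Draw weighted positions until enough distinct ones are selected.
    while len(to_delete) < n_delete:
        to_delete.add(rng.choices(positions, weights=weights, k=1)[0])
    return [tok for i, tok in enumerate(passage) if i not in to_delete]


if __name__ == "__main__":
    corpus = [
        ["the", "cat", "sat"],
        ["the", "dog", "ran"],
        ["the", "bird", "flew"],
    ]
    print(token_importance(corpus[0], corpus))
    print(delete_tokens(corpus[0], corpus, rate=0.34, seed=0))
```

In this toy corpus, "the" occurs in every passage and therefore receives a lower score than "cat" or "sat", which makes it the most likely deletion candidate. The uniform baseline mentioned in the abstract corresponds to setting all weights equal.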