Data augmentation for dense passage retrieval using corpus-passage frequency-based token deletion
| Main Authors: | , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | SpringerOpen, 2025-08-01 |
| Series: | Journal of Big Data |
| Subjects: | |
| Online Access: | https://doi.org/10.1186/s40537-025-01257-9 |
| Summary: | This paper proposes a novel data augmentation method to address class imbalance in large-scale information retrieval systems. Specifically, it introduces a corpus-passage frequency-based token deletion technique to improve the accuracy of Dense Passage Retrieval, a dense vector-based information retrieval model. Unlike traditional random token deletion, which deletes every token with equal probability, the proposed method calculates token importance from both passage-level and corpus-level frequencies, leading to more effective token deletion. Experimental results show that the proposed approach significantly improves Top-k accuracy on smaller datasets compared with conventional augmentation techniques. On larger-scale datasets it maintains competitive performance, and its relative effectiveness is most notable when training data are limited and class imbalance is severe. These results confirm its potential to improve the generalizability of information retrieval models. The source code is publicly available at https://github.com/asmoon002/DPR_TD . |
|---|---|
| ISSN: | 2196-1115 |
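
The summary contrasts uniform random token deletion with deletion guided by passage-level and corpus-level token frequencies. The sketch below illustrates that general idea only. This record does not give the paper's scoring formula, so the TF-IDF-style importance score, the inverse-importance deletion weights, and the names `token_importance`, `frequency_based_deletion`, and `delete_ratio` are all assumptions made for illustration; the authors' actual implementation is in the repository linked in the summary.

```python
import math
import random
from collections import Counter


def token_importance(passage, corpus):
    """Score each distinct token in `passage`.

    Assumption: a TF-IDF-style score stands in for the paper's
    (unstated here) importance formula.
    """
    n_passages = len(corpus)
    # Corpus-level frequency: number of passages containing each token.
    doc_freq = Counter(tok for p in corpus for tok in set(p))
    # Passage-level frequency: occurrences inside this passage.
    passage_freq = Counter(passage)
    scores = {}
    for tok, tf in passage_freq.items():
        idf = math.log((1 + n_passages) / (1 + doc_freq[tok])) + 1.0
        scores[tok] = (tf / len(passage)) * idf
    return scores


def frequency_based_deletion(passage, corpus, delete_ratio=0.2, seed=None):
    """Return a copy of `passage` with low-importance tokens more likely removed."""
    rng = random.Random(seed)
    scores = token_importance(passage, corpus)
    n_delete = int(len(passage) * delete_ratio)
    if n_delete == 0:
        return list(passage)
    # Hypothetical weighting: lower importance means higher deletion weight.
    weights = [1.0 / (scores[tok] + 1e-9) for tok in passage]
    # Weighted sampling without replacement: draw key u**(1/w) per position
    # and delete the positions with the largest keys.
    keys = [rng.random() ** (1.0 / w) for w in weights]
    ranked = sorted(range(len(passage)), key=lambda i: keys[i], reverse=True)
    delete_idx = set(ranked[:n_delete])
    return [tok for i, tok in enumerate(passage) if i not in delete_idx]


if __name__ == "__main__":
    toy_corpus = [
        ["the", "cat", "sat"],
        ["the", "dog", "ran"],
        ["the", "bird", "flew"],
    ]
    print(token_importance(toy_corpus[0], toy_corpus))
    print(frequency_based_deletion(toy_corpus[0], toy_corpus, delete_ratio=0.34, seed=0))
```

Setting every weight to the same value would reduce this to the uniform random deletion baseline that the summary compares against.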