Data augmentation for dense passage retrieval using corpus-passage frequency-based token deletion

Bibliographic Details
Main Authors: A-Seong Moon, Kyumin Kim, Jaesung Lee
Format: Article
Language: English
Published: SpringerOpen 2025-08-01
Series: Journal of Big Data
Online Access: https://doi.org/10.1186/s40537-025-01257-9
Description
Summary: This paper proposes a novel data augmentation method to address class imbalance in large-scale information retrieval systems. Specifically, it introduces a corpus-passage frequency-based token deletion technique to improve the accuracy of Dense Passage Retrieval, a dense vector-based information retrieval model. Unlike traditional random token deletion, which deletes every token with equal probability, the proposed method calculates token importance from both passage-level and corpus-level frequencies, leading to more effective token deletion. Experimental results demonstrate that the proposed approach significantly improves Top-k accuracy on smaller datasets compared with conventional augmentation techniques. It maintains competitive performance on larger-scale datasets, and its relative effectiveness is most notable when training data are limited and class imbalance is severe. This supports its potential to improve the generalizability of information retrieval models. The source code is publicly available at https://github.com/asmoon002/DPR_TD.
ISSN: 2196-1115
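
Illustrative sketch: The summary states that token importance is calculated from both passage-level and corpus-level frequencies, but this record does not give the scoring formula. The Python sketch below is therefore only one plausible reading and not the authors' implementation (see the linked repository for that). It assumes a TF-IDF-style importance score and assumes that less important tokens should be deleted more often. The names `deletion_probabilities`, `augment`, and `base_rate` are hypothetical.

```python
import math
import random
from collections import Counter


def deletion_probabilities(passage, corpus, base_rate=0.1):
    """Return one deletion probability per token of a tokenized passage."""
    if not passage:
        return []
    n_passages = len(corpus)
    # Corpus-level frequency: number of passages containing each token.
    doc_freq = Counter(tok for p in corpus for tok in set(p))
    # Passage-level frequency: occurrences of each token in this passage.
    term_freq = Counter(passage)

    # ASSUMPTION: TF-IDF-style importance; the paper's formula may differ.
    importance = {
        tok: (term_freq[tok] / len(passage))
        * (math.log((1 + n_passages) / (1 + doc_freq[tok])) + 1.0)
        for tok in term_freq
    }
    # ASSUMPTION: deletion probability is inversely related to importance,
    # rescaled so the expected deleted fraction is about base_rate.
    inverse = {tok: 1.0 / imp for tok, imp in importance.items()}
    mean_inverse = sum(inverse[tok] for tok in passage) / len(passage)
    return [min(1.0, base_rate * inverse[tok] / mean_inverse) for tok in passage]


def augment(passage, corpus, base_rate=0.1, rng=None):
    """Create an augmented copy of the passage by sampling token deletions."""
    rng = rng or random.Random()
    probs = deletion_probabilities(passage, corpus, base_rate)
    kept = [tok for tok, p in zip(passage, probs) if rng.random() >= p]
    # Fall back to the original passage if every token was deleted.
    return kept or list(passage)


corpus = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "bird", "flew"]]
probs = deletion_probabilities(corpus[0], corpus, base_rate=0.1)
augmented = augment(corpus[0], corpus, base_rate=0.1, rng=random.Random(0))
```

In this toy corpus, "the" appears in every passage, so under the stated assumptions it receives a higher deletion probability than "cat" or "sat". Uniform random deletion, the baseline named in the summary, corresponds to giving every token the same probability `base_rate`.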