RMixDA: A Random Compound Data Augmentation Operation Framework for Neural Machine Translation

Bibliographic Details
Main Authors: Huijun Wang, Xiaojing Du, Xinkun Hao
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/11037679/
Description
Summary:In Neural Machine Translation (NMT), data augmentation is an effective method for improving model robustness by generating diverse augmented data from existing datasets. Typically, the quality of augmented data is evaluated by its similarity and diversity with respect to the training data. Current approaches often rely on a single type of operation, limiting their ability to fully capture the range of possible variations in real-world data; this limitation affects the robustness and generalization of the model. To address this, we propose RMixDA: a Random Mixed Algorithm Framework for Data Augmentation, which dynamically integrates multiple NMT-oriented data augmentation methods. RMixDA defines an extensible collection of augmentation operations using a Backus-Naur Form (BNF) grammar, which categorizes data augmentation methods into pre-embedding (pre_embed_op), embedding (embed_op), and post-embedding (post_embed_op) levels. These operations are combined using a random walk algorithm to generate augmented data that maintains both similarity and diversity. We further incorporate contrastive learning, labeling augmented data as positive or negative samples to evaluate similarity more effectively. Additionally, we introduce a novel evaluation method for augmented-data quality, regularized by the NMT model's loss function, to strengthen the positive correlation between augmented-data quality and model training. Extensive experimental studies demonstrate the effectiveness of RMixDA and its scalability for integrating diverse data augmentation operations.
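The level-grouped operations combined by a random walk can be sketched as follows. This is a minimal, model-free illustration of the idea described in the summary, not the paper's actual implementation: all function names, the operation table, and the two example pre-embedding operations (adjacent-token swap and word dropout) are assumptions, and the embedding-level operations are omitted because they would require a real NMT model.

```python
import random

def swap_adjacent(tokens, rng):
    """pre_embed_op (illustrative): swap two adjacent tokens in the raw sentence."""
    if len(tokens) < 2:
        return tokens
    i = rng.randrange(len(tokens) - 1)
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out

def drop_token(tokens, rng):
    """pre_embed_op (illustrative): delete one random token (word dropout)."""
    if len(tokens) < 2:
        return tokens
    i = rng.randrange(len(tokens))
    return tokens[:i] + tokens[i + 1:]

# Extensible operation table, one entry per level of the BNF grammar.
# embed_op / post_embed_op would perturb embedding vectors or hidden
# states, so they are left out of this model-free sketch.
OPS = {
    "pre_embed_op": [swap_adjacent, drop_token],
}

def random_walk_augment(tokens, steps=2, seed=0):
    """Apply `steps` randomly chosen operations in sequence (a simple
    random walk over the operation table)."""
    rng = random.Random(seed)
    for _ in range(steps):
        level = rng.choice(sorted(OPS))   # pick an operation level
        op = rng.choice(OPS[level])       # pick an operation at that level
        tokens = op(tokens, rng)
    return tokens

src = "wir gehen heute ins kino".split()
aug = random_walk_augment(src, steps=2, seed=1)
```

Because each walk samples a different operation sequence, repeated calls with different seeds yield diverse augmented variants of the same sentence while each variant stays close to the original, which is the similarity/diversity trade-off the framework targets.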
On the IWSLT14 German-English and WMT14 English-German benchmarks, RMixDA achieves BLEU scores of 37.87 and 29.13, respectively, outperforming state-of-the-art methods by up to 3.44 and 1.83 BLEU points. The framework shows practical utility in real-world NMT tasks, particularly for improving translation quality in low-resource language scenarios.
ISSN:2169-3536