Multimodal retrieval-augmented generation framework for machine translation


Bibliographic Details
Main Author: Shijian Li
Format: Article
Language:English
Published: Electronics and Telecommunications Research Institute (ETRI) 2025-08-01
Series:ETRI Journal
Subjects:
Online Access:https://doi.org/10.4218/etrij.2024-0196
Description
Summary:The development of multimodal machine translation (MMT) systems has attracted significant interest due to their potential to enhance translation accuracy with visual information. However, existing approaches face two limitations: (i) large-scale corpus data in the form of (text, image, text) triplets is scarce, and (ii) the semantic information learned during pre-training does not transfer to multilingual translation tasks. To address these challenges, we propose a novel multimodal retrieval-augmented generation framework for machine translation, abbreviated as MRF-MT. Specifically, using the source text as a query, we retrieve relevant (image, text) pairs to guide image generation and feed the generated images into the image encoder of Multilingual Contrastive Language-Image Pre-training (M-CLIP) to learn visual information. We then employ a projection network to transfer the visual information learned by M-CLIP, as a decoder prefix, to Multilingual Bidirectional and Auto-Regressive Transformers (mBART), and train the mBART decoder with a two-stage pre-training pipeline. First, the mBART decoder is trained for image captioning with a visual–textual decoder prefix produced by the projection network from M-CLIP's image encoder. It is then trained for caption translation, using prefixes derived from M-CLIP's text encoder. Extensive experiments show that MRF-MT achieves promising performance compared with baselines.
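The core coupling step in the summary, mapping an M-CLIP embedding into mBART's hidden space as a decoder prefix, can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the dimensions (`CLIP_DIM`, `MBART_DIM`, `PREFIX_LEN`), the single linear layer, and all variable names are assumptions for illustration only.

```python
import numpy as np

# Hypothetical dimensions (not specified in the abstract):
# M-CLIP embedding size, mBART hidden size, and number of prefix tokens.
CLIP_DIM, MBART_DIM, PREFIX_LEN = 512, 1024, 4

rng = np.random.default_rng(0)

# A simple linear projection network: one M-CLIP embedding is mapped to
# PREFIX_LEN vectors in mBART's hidden space, which would be prepended to
# the decoder input as a prefix.
W = rng.standard_normal((CLIP_DIM, PREFIX_LEN * MBART_DIM)) * 0.02
b = np.zeros(PREFIX_LEN * MBART_DIM)

def project_to_prefix(clip_embedding: np.ndarray) -> np.ndarray:
    """Map a (CLIP_DIM,) M-CLIP embedding to a (PREFIX_LEN, MBART_DIM) prefix."""
    flat = clip_embedding @ W + b
    return flat.reshape(PREFIX_LEN, MBART_DIM)

# Stand-in for an M-CLIP image- or text-encoder output (stage 1 vs. stage 2
# of the two-stage pipeline differ only in which encoder supplies it).
embedding = rng.standard_normal(CLIP_DIM)
prefix = project_to_prefix(embedding)
print(prefix.shape)  # (4, 1024)
```

In the two-stage pipeline the summary describes, the same projection idea is applied twice: stage 1 feeds prefixes from the image encoder (image captioning), stage 2 feeds prefixes from the text encoder (caption translation).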
ISSN:1225-6463
2233-7326