Integrating visual memory for image captioning
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer, 2025-05-01 |
| Series: | Discover Applied Sciences |
| Subjects: | |
| Online Access: | https://doi.org/10.1007/s42452-025-07045-7 |
| Summary: | Abstract Most existing image captioning models take region-level features extracted by object detectors as input and achieve strong performance. However, although region features provide high-level semantic information, they are limited by their local nature and by detector quality, and inevitably lack context and fine-grained detail. In this paper, we propose a novel memory mechanism for storing visual prior information and design a Transformer-based image captioner with a Memory Enhanced Encoder that exploits this memory. Specifically, we divide visual memory into two forms: long-term and short-term. Long-term memory preserves visual consensus; it is shared across all input data and maintained over time. In contrast, short-term memory provides detailed information about images of the same type as the input samples; it is not shared across inputs and is discarded after use. To obtain meaningful long- and short-term memory vectors, we design dedicated learning methods for each. We also design a Memory Enhanced Encoder (MEE), built on the Transformer encoder, that attends to long- and short-term memory vectors at each layer to learn robust visual representations. Extensive experiments on the MSCOCO dataset demonstrate the method's effectiveness, and the results show that our model outperforms many state-of-the-art approaches. |
| ISSN: | 3004-9261 |
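The summary describes an encoder whose self-attention draws on memory vectors alongside region features. The sketch below illustrates one plausible reading of that idea: single-head attention where the keys and values are the region features augmented with long-term (shared, persistent) and short-term (per-sample, discarded after use) memory slots. The function name, shapes, and single-head simplification are assumptions for illustration; the paper's actual MEE uses multi-head Transformer layers with learned memory.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_enhanced_attention(regions, long_mem, short_mem, d):
    """Hypothetical single-head sketch of memory-enhanced attention:
    queries come from region features, while keys/values are the
    regions concatenated with long- and short-term memory slots."""
    kv = np.concatenate([regions, long_mem, short_mem], axis=0)  # (N+Ml+Ms, d)
    scores = regions @ kv.T / np.sqrt(d)                          # (N, N+Ml+Ms)
    return softmax(scores, axis=-1) @ kv                          # (N, d)

# Toy example: 5 region features, 3 long-term and 2 short-term memory slots.
rng = np.random.default_rng(0)
d = 16
regions = rng.standard_normal((5, d))
long_mem = rng.standard_normal((3, d))   # shared across all inputs
short_mem = rng.standard_normal((2, d))  # specific to this sample's image type
out = memory_enhanced_attention(regions, long_mem, short_mem, d)
print(out.shape)  # (5, 16)
```

Each region's output is now a mixture of other regions and the memory slots, which is one way the encoder could inject context the detector's local features lack.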