Integrating visual memory for image captioning
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer, 2025-05-01 |
| Series: | Discover Applied Sciences |
| Subjects: | |
| Online Access: | https://doi.org/10.1007/s42452-025-07045-7 |
| Summary: | Abstract Most existing image captioning models take region-level features extracted by object detectors as input and achieve strong performance. However, although region features provide high-level semantic information, they are limited by their local nature and by detector quality, and inevitably lack context and fine-grained detail. In this paper, we propose a novel memory mechanism for storing visual prior information and design a Transformer-based image captioner with a Memory Enhanced Encoder that exploits this memory. Specifically, we divide visual memory into two forms: long-term and short-term. Long-term memory preserves visual consensus; it is shared across all input data and maintained over time. In contrast, short-term memory provides detailed information about images of the same type as the input samples; it is not shared across inputs and is discarded after use. To obtain meaningful long- and short-term memory vectors, we design dedicated learning methods for each. We also design a Memory Enhanced Encoder (MEE), built on the Transformer encoder, that attends to long- and short-term memory vectors at each layer to learn robust visual representations. Extensive experiments on the MSCOCO dataset demonstrate the method's effectiveness, and the results show that our model outperforms many state-of-the-art approaches. |
| ISSN: | 3004-9261 |
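The summary describes an encoder whose self-attention draws on memory vectors alongside region features. The sketch below illustrates one plausible reading of that idea: single-head attention where the keys and values are the region features augmented with long-term (shared, persistent) and short-term (per-sample, discarded after use) memory slots. The function name, shapes, and single-head simplification are assumptions for illustration; the paper's actual MEE uses multi-head Transformer layers with learned memory.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def memory_enhanced_attention(regions, long_mem, short_mem, d):
    """Hypothetical single-head sketch of memory-enhanced attention:
    queries come from region features, while keys/values are the
    regions concatenated with long- and short-term memory slots."""
    kv = np.concatenate([regions, long_mem, short_mem], axis=0)  # (N+Ml+Ms, d)
    scores = regions @ kv.T / np.sqrt(d)                          # (N, N+Ml+Ms)
    return softmax(scores, axis=-1) @ kv                          # (N, d)

# Toy example: 5 region features, 3 long-term and 2 short-term memory slots.
rng = np.random.default_rng(0)
d = 16
regions = rng.standard_normal((5, d))
long_mem = rng.standard_normal((3, d))   # shared across all inputs
short_mem = rng.standard_normal((2, d))  # specific to this sample's image type
out = memory_enhanced_attention(regions, long_mem, short_mem, d)
print(out.shape)  # (5, 16)
```

Each region's output is now a mixture of other regions and the memory slots, which is one way the encoder could inject context the detector's local features lack.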