Improved IEC performance via emotional stimuli-aware captioning

Bibliographic Details
Main Authors: Zibo Zhou, Zhengjun Zhai, Xin Gao, Jiaqi Zhu
Format: Article
Language: English
Published: Nature Portfolio 2025-07-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-025-06094-7
Description
Summary: Image emotion classification (IEC), a crucial task in computer vision, aims to infer the emotional state of subjects in images. Existing techniques have focused on using semantic information to support visual features. However, a significant affective gap persists between low-level pixel information and high-level emotions, owing to the abstract and complex nature of cognitive processes. This gap limits the corresponding semantic representations and hinders model performance. In this study, we draw inspiration from psychological findings and advances in natural language processing. Specifically, we explore the use of image captions as auxiliary information, combined with visual features, for enhanced emotional discernment. We introduce the emotional stimuli-aware captioning network (ESCNet), which leverages generative captions to augment visual representations. An affective captioning dataset, based on emotional attributes, is also developed to generate emotion-related captions and pre-train the image captioning model. Visual features related to the captions are then generated to highlight emotionally charged words, and a fusion module combining cross-attention with self-attention is introduced to learn correlations between images and captions. We also introduce a variable-weight loss function that emphasizes hard-to-classify samples. Extensive validation experiments on multiple public datasets demonstrate that our approach outperforms state-of-the-art models. Ablation studies and visualization results further confirm the effectiveness of the proposed network and its modules.
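The record gives no implementation details, but the two mechanisms the abstract names — a cross-attention-plus-self-attention fusion module and a variable-weight loss that up-weights hard samples — can be sketched in a generic form. The NumPy sketch below is an illustrative assumption, not the authors' ESCNet code: `fuse` treats image features as queries attending over caption-token features (cross-attention), then applies self-attention to the fused result, and `variable_weight_loss` uses a focal-loss-style factor `(1 - p_true)^gamma` so poorly classified samples contribute more to the loss. All function names and the `gamma` parameter are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention: (n_q, d) x (n_k, d) -> (n_q, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def fuse(img_feats, cap_feats):
    # cross-attention: image features query the caption tokens,
    # then self-attention refines the cross-attended features
    cross = attention(img_feats, cap_feats, cap_feats)
    return attention(cross, cross, cross)

def variable_weight_loss(probs, labels, gamma=2.0):
    # focal-loss-style weighting: low p_true (hard samples) -> larger weight
    p_true = probs[np.arange(len(labels)), labels]
    return float(np.mean(-((1.0 - p_true) ** gamma) * np.log(p_true)))
```

With this weighting, a sample predicted at 0.9 for its true class contributes far less than one predicted at 0.55, which matches the abstract's stated goal of emphasizing hard-to-classify samples.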
ISSN:2045-2322