Caption Generation Based on Emotions Using CSPDenseNet and BiLSTM with Self-Attention

Automatic image caption generation is the intricate task of describing an image in natural language using insights gained from the image. Incorporating facial expressions into a conventional image captioning system opens new prospects for generating pertinent descriptions that reveal the emotional aspects of the image. The proposed work encapsulates facial emotional features to produce expressive captions closer to human-annotated ones, with the help of a Cross Stage Partial Dense Network (CSPDenseNet) and a self-attentive Bidirectional Long Short-Term Memory (BiLSTM) network. The encoding unit captures facial expressions and dense image features using a Facial Expression Recognition (FER) model and the CSPDense neural network, respectively. Word embedding vectors of the ground-truth image captions are created and learned using the Word2Vec embedding technique. The extracted image feature vectors and word vectors are then fused to form an encoding vector representing the rich image content. The decoding unit employs a self-attention mechanism combined with a BiLSTM to create more descriptive and relevant captions in natural language. The Flickr11k dataset, a subset of the Flickr30k dataset, is used to train, test, and evaluate the model against five benchmark image captioning metrics: BiLingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation with Explicit Ordering (METEOR), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Consensus-based Image Description Evaluation (CIDEr), and Semantic Propositional Image Caption Evaluation (SPICE). The experimental analysis indicates that the proposed model enhances caption quality, scoring 0.6012 (BLEU-1), 0.3992 (BLEU-2), 0.2703 (BLEU-3), 0.1921 (BLEU-4), 0.1932 (METEOR), 0.2617 (CIDEr), 0.4793 (ROUGE-L), and 0.1260 (SPICE), by adding the emotional characteristics and behavioral components of the objects present in the image.
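The abstract describes an encoder-decoder pipeline: a CSPDenseNet backbone and an FER model supply image and emotion features, Word2Vec supplies word embeddings, the two streams are fused into a single encoding, and a self-attentive BiLSTM decodes the caption. The PyTorch sketch below is one minimal, hypothetical reading of that pipeline, not the authors' implementation: the CSPDenseNet and FER components are replaced by precomputed feature tensors, and every dimension (1024-d image features, 7 emotion classes, 300-d embeddings, vocabulary size) is an assumed placeholder.

```python
# Hedged sketch of an emotion-aware captioner: feature fusion in the encoder,
# self-attention over BiLSTM states in the decoder. All sizes are assumptions.
import torch
import torch.nn as nn

class SelfAttentiveBiLSTMDecoder(nn.Module):
    def __init__(self, embed_dim=300, hidden_dim=256, vocab_size=8000):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        # Single-head self-attention applied to the BiLSTM output sequence.
        self.attn = nn.MultiheadAttention(2 * hidden_dim, num_heads=1,
                                          batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, token_embeddings):
        h, _ = self.bilstm(token_embeddings)   # (B, T, 2H)
        ctx, _ = self.attn(h, h, h)            # self-attention: Q = K = V = h
        return self.out(ctx)                   # (B, T, vocab) logits

class EmotionAwareCaptioner(nn.Module):
    def __init__(self, img_dim=1024, emo_dim=7, embed_dim=300, vocab_size=8000):
        super().__init__()
        # Stand-ins for the CSPDenseNet and FER outputs: here they arrive as
        # precomputed feature vectors and are fused by a linear projection.
        self.fuse = nn.Linear(img_dim + emo_dim, embed_dim)
        self.word_emb = nn.Embedding(vocab_size, embed_dim)  # init from Word2Vec
        self.decoder = SelfAttentiveBiLSTMDecoder(embed_dim, 256, vocab_size)

    def forward(self, img_feats, emo_feats, tokens):
        # Project the fused image+emotion vector and prepend it to the caption
        # token embeddings, so the decoder conditions on the image content.
        v = self.fuse(torch.cat([img_feats, emo_feats], dim=-1)).unsqueeze(1)
        e = self.word_emb(tokens)              # (B, T, E)
        return self.decoder(torch.cat([v, e], dim=1))

# Smoke test with random tensors in place of real CSPDenseNet/FER features.
model = EmotionAwareCaptioner()
logits = model(torch.randn(2, 1024), torch.randn(2, 7),
               torch.randint(0, 8000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 8000])
```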

Bibliographic Details
Main Authors: Kavi Priya S, Pon Karthika K, Jayakumar Kaliappan, Senthil Kumaran Selvaraj, Nagalakshmi R, Baye Molla
Format: Article
Language: English
Published: Wiley 2022-01-01
Series: Applied Computational Intelligence and Soft Computing
Online Access: http://dx.doi.org/10.1155/2022/2756396
author Kavi Priya S
Pon Karthika K
Jayakumar Kaliappan
Senthil Kumaran Selvaraj
Nagalakshmi R
Baye Molla
collection DOAJ
id doaj-art-68aa06a81d4d4f4caf2d1d33eb2fd207
institution Kabale University
issn 1687-9732
affiliations Kavi Priya S (Department of Computer Science and Engineering); Pon Karthika K (Department of Computer Science and Engineering); Jayakumar Kaliappan (Department of Analytics); Senthil Kumaran Selvaraj (Department of Manufacturing Engineering); Nagalakshmi R (Department of Computer Science and Engineering); Baye Molla (School of Mechanical Engineering)
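For the scores quoted in the abstract, the BLEU-1 through BLEU-4 values can be computed on any model's output with NLTK's corpus_bleu; the snippet below is a toy, hedged example on made-up caption tokens rather than Flickr11k data. METEOR, ROUGE-L, CIDEr, and SPICE have their own reference implementations, for example in the coco-caption evaluation toolkit.

```python
# Toy example: corpus-level BLEU-1..BLEU-4 with NLTK. The reference and
# hypothesis sentences are invented for illustration, not from the paper.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["a", "smiling", "girl", "plays", "in", "the", "park"]]]
hypotheses = [["a", "happy", "girl", "plays", "in", "a", "park"]]

smooth = SmoothingFunction().method1  # avoids zero scores on short texts
for n in range(1, 5):
    # BLEU-n: uniform weights over 1..n-gram precisions, zero beyond n.
    weights = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)
    score = corpus_bleu(references, hypotheses, weights=weights,
                        smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.4f}")
```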