Caption Generation Based on Emotions Using CSPDenseNet and BiLSTM with Self-Attention

Automatic image caption generation is the intricate task of describing an image in natural language using insights gained from the image. Incorporating facial expressions into a conventional image captioning system opens new prospects for generating pertinent descriptions that reveal the emotional aspects of the image. The proposed work encapsulates facial emotional features to produce expressive captions closer to human-annotated ones, with the help of a Cross Stage Partial Dense Network (CSPDenseNet) and a self-attentive Bidirectional Long Short-Term Memory (BiLSTM) network. The encoding unit captures facial expressions and dense image features using a Facial Expression Recognition (FER) model and the CSPDense neural network, respectively. Word embedding vectors of the ground-truth image captions are created and learned using the Word2Vec embedding technique. The extracted image feature vectors and word vectors are then fused to form an encoding vector representing the rich image content. The decoding unit employs a self-attention mechanism combined with a BiLSTM to create more descriptive and relevant captions in natural language. The Flickr11k dataset, a subset of the Flickr30k dataset, is used to train, test, and evaluate the model against five benchmark image captioning metrics: BiLingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation with Explicit Ordering (METEOR), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Consensus-based Image Description Evaluation (CIDEr), and Semantic Propositional Image Caption Evaluation (SPICE). The experimental analysis indicates that the proposed model enhances caption quality, scoring 0.6012 (BLEU-1), 0.3992 (BLEU-2), 0.2703 (BLEU-3), 0.1921 (BLEU-4), 0.1932 (METEOR), 0.2617 (CIDEr), 0.4793 (ROUGE-L), and 0.1260 (SPICE), by adding the emotional characteristics and behavioral components of the objects present in the image.
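The abstract describes an encoder-decoder pipeline: a CSPDenseNet backbone and an FER model supply image and emotion features, Word2Vec supplies word embeddings, the two streams are fused into a single encoding, and a self-attentive BiLSTM decodes the caption. The PyTorch sketch below is one minimal, hypothetical reading of that pipeline, not the authors' implementation: the CSPDenseNet and FER components are replaced by precomputed feature tensors, and every dimension (1024-d image features, 7 emotion classes, 300-d embeddings, vocabulary size) is an assumed placeholder.

```python
# Hedged sketch of an emotion-aware captioner: feature fusion in the encoder,
# self-attention over BiLSTM states in the decoder. All sizes are assumptions.
import torch
import torch.nn as nn

class SelfAttentiveBiLSTMDecoder(nn.Module):
    def __init__(self, embed_dim=300, hidden_dim=256, vocab_size=8000):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        # Single-head self-attention applied to the BiLSTM output sequence.
        self.attn = nn.MultiheadAttention(2 * hidden_dim, num_heads=1,
                                          batch_first=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, token_embeddings):
        h, _ = self.bilstm(token_embeddings)   # (B, T, 2H)
        ctx, _ = self.attn(h, h, h)            # self-attention: Q = K = V = h
        return self.out(ctx)                   # (B, T, vocab) logits

class EmotionAwareCaptioner(nn.Module):
    def __init__(self, img_dim=1024, emo_dim=7, embed_dim=300, vocab_size=8000):
        super().__init__()
        # Stand-ins for the CSPDenseNet and FER outputs: here they arrive as
        # precomputed feature vectors and are fused by a linear projection.
        self.fuse = nn.Linear(img_dim + emo_dim, embed_dim)
        self.word_emb = nn.Embedding(vocab_size, embed_dim)  # init from Word2Vec
        self.decoder = SelfAttentiveBiLSTMDecoder(embed_dim, 256, vocab_size)

    def forward(self, img_feats, emo_feats, tokens):
        # Project the fused image+emotion vector and prepend it to the caption
        # token embeddings, so the decoder conditions on the image content.
        v = self.fuse(torch.cat([img_feats, emo_feats], dim=-1)).unsqueeze(1)
        e = self.word_emb(tokens)              # (B, T, E)
        return self.decoder(torch.cat([v, e], dim=1))

# Smoke test with random tensors in place of real CSPDenseNet/FER features.
model = EmotionAwareCaptioner()
logits = model(torch.randn(2, 1024), torch.randn(2, 7),
               torch.randint(0, 8000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 8000])
```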

Bibliographic Details
Main Authors: Kavi Priya S, Pon Karthika K, Jayakumar Kaliappan, Senthil Kumaran Selvaraj, Nagalakshmi R, Baye Molla
Format: Article
Language: English
Published: Wiley 2022-01-01
Series: Applied Computational Intelligence and Soft Computing
Online Access: http://dx.doi.org/10.1155/2022/2756396
author Kavi Priya S
Pon Karthika K
Jayakumar Kaliappan
Senthil Kumaran Selvaraj
Nagalakshmi R
Baye Molla
collection DOAJ
id doaj-art-68aa06a81d4d4f4caf2d1d33eb2fd207
institution Kabale University
issn 1687-9732
affiliations Kavi Priya S (Department of Computer Science and Engineering); Pon Karthika K (Department of Computer Science and Engineering); Jayakumar Kaliappan (Department of Analytics); Senthil Kumaran Selvaraj (Department of Manufacturing Engineering); Nagalakshmi R (Department of Computer Science and Engineering); Baye Molla (School of Mechanical Engineering)
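For the scores quoted in the abstract, the BLEU-1 through BLEU-4 values can be computed on any model's output with NLTK's corpus_bleu; the snippet below is a toy, hedged example on made-up caption tokens rather than Flickr11k data. METEOR, ROUGE-L, CIDEr, and SPICE have their own reference implementations, for example in the coco-caption evaluation toolkit.

```python
# Toy example: corpus-level BLEU-1..BLEU-4 with NLTK. The reference and
# hypothesis sentences are invented for illustration, not from the paper.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["a", "smiling", "girl", "plays", "in", "the", "park"]]]
hypotheses = [["a", "happy", "girl", "plays", "in", "a", "park"]]

smooth = SmoothingFunction().method1  # avoids zero scores on short texts
for n in range(1, 5):
    # BLEU-n: uniform weights over 1..n-gram precisions, zero beyond n.
    weights = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)
    score = corpus_bleu(references, hypotheses, weights=weights,
                        smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.4f}")
```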