Caption Generation Based on Emotions Using CSPDenseNet and BiLSTM with Self-Attention
Main Authors: | Kavi Priya S, Pon Karthika K, Jayakumar Kaliappan, Senthil Kumaran Selvaraj, Nagalakshmi R, Baye Molla |
---|---|
Format: | Article |
Language: | English |
Published: | Wiley, 2022-01-01 |
Series: | Applied Computational Intelligence and Soft Computing |
Online Access: | http://dx.doi.org/10.1155/2022/2756396 |
author | Kavi Priya S, Pon Karthika K, Jayakumar Kaliappan, Senthil Kumaran Selvaraj, Nagalakshmi R, Baye Molla |
collection | DOAJ |
description | Automatic image caption generation is the intricate task of describing an image in natural language by extracting the insights present in it. Incorporating facial expressions into a conventional image captioning system opens new prospects for generating pertinent descriptions that reveal the emotional aspects of the image. The proposed work encapsulates facial emotional features to produce more expressive captions, closer to human-annotated ones, with the help of a Cross Stage Partial Dense Network (CSPDenseNet) and a self-attentive Bidirectional Long Short-Term Memory (BiLSTM) network. The encoding unit captures the facial expressions and dense image features using a Facial Expression Recognition (FER) model and a CSPDense neural network, respectively. Word embedding vectors of the ground-truth image captions are then created and learned using the Word2Vec technique, and the extracted image feature vectors and word vectors are fused to form an encoding vector representing the rich image content. The decoding unit employs a self-attention mechanism combined with the BiLSTM to create more descriptive and relevant captions in natural language. The Flickr11k dataset, a subset of the Flickr30k dataset, is used to train, test, and evaluate the model on five benchmark image captioning metrics: BiLingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation with Explicit Ordering (METEOR), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Consensus-based Image Description Evaluation (CIDEr), and Semantic Propositional Image Caption Evaluation (SPICE). The experimental analysis indicates that the proposed model enhances caption quality, scoring 0.6012 (BLEU-1), 0.3992 (BLEU-2), 0.2703 (BLEU-3), 0.1921 (BLEU-4), 0.1932 (METEOR), 0.2617 (CIDEr), 0.4793 (ROUGE-L), and 0.1260 (SPICE), by adding the emotional characteristics and behavioral components of the objects present in the image. |
format | Article |
id | doaj-art-68aa06a81d4d4f4caf2d1d33eb2fd207 |
institution | Kabale University |
issn | 1687-9732 |
language | English |
publishDate | 2022-01-01 |
publisher | Wiley |
record_format | Article |
series | Applied Computational Intelligence and Soft Computing |
spelling | Kavi Priya S (Department of Computer Science and Engineering), Pon Karthika K (Department of Computer Science and Engineering), Jayakumar Kaliappan (Department of Analytics), Senthil Kumaran Selvaraj (Department of Manufacturing Engineering), Nagalakshmi R (Department of Computer Science and Engineering), Baye Molla (School of Mechanical Engineering) |
title | Caption Generation Based on Emotions Using CSPDenseNet and BiLSTM with Self-Attention |
url | http://dx.doi.org/10.1155/2022/2756396 |
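The description above outlines an encoder that fuses CSPDenseNet image features and FER-derived emotion features with Word2Vec caption embeddings, decoded by a self-attentive BiLSTM. The following is a minimal PyTorch sketch of that fusion-and-decoding idea, not the authors' implementation: the feature dimensions, the placeholder linear projections standing in for the CSPDenseNet and FER backbones, and the additive self-attention layout are illustrative assumptions.

```python
# Minimal sketch (assumed shapes and module sizes, not the paper's configuration):
# image features + facial-expression features are fused, prepended to the word
# embeddings, and decoded by a BiLSTM whose outputs are re-weighted by additive
# self-attention before word prediction.
import torch
import torch.nn as nn


class EmotionAwareCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, img_dim=1024, fer_dim=7, hidden=512):
        super().__init__()
        # Placeholder projections standing in for CSPDenseNet and FER backbones.
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.fer_proj = nn.Linear(fer_dim, embed_dim)
        # Word embeddings; in practice initialized from Word2Vec vectors
        # learned on the ground-truth captions.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        # Additive self-attention over the BiLSTM outputs.
        self.attn_score = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, img_feats, fer_feats, captions):
        # Fuse image and facial-expression features into one "visual token"
        # prepended to the word-embedding sequence.
        visual = (self.img_proj(img_feats) + self.fer_proj(fer_feats)).unsqueeze(1)
        words = self.embed(captions)                     # (B, T, E)
        seq = torch.cat([visual, words], dim=1)          # (B, T+1, E)
        enc, _ = self.bilstm(seq)                        # (B, T+1, 2H)
        weights = torch.softmax(self.attn_score(enc), dim=1)
        context = (weights * enc).sum(dim=1, keepdim=True)
        # Attention context augments every time step before word prediction.
        return self.out(enc + context)                   # (B, T+1, V)


if __name__ == "__main__":
    model = EmotionAwareCaptioner(vocab_size=5000)
    logits = model(torch.randn(2, 1024), torch.randn(2, 7), torch.randint(0, 5000, (2, 12)))
    print(logits.shape)  # torch.Size([2, 13, 5000])
```

In the actual system the visual and emotional inputs would come from the pretrained CSPDenseNet and FER models, and generated captions would be scored against references with the BLEU, METEOR, ROUGE-L, CIDEr, and SPICE metrics listed in the description.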