Image captioning deep learning model using ResNet50 encoder and hybrid LSTM–GRU decoder optimized with beam search

Image captioning is a fast-evolving research area that combines two domains: Natural Language Processing and Computer Vision. Generating appropriate captions is difficult because an image and its background can portray many activities. To address this difficulty, the proposed...

Bibliographic Details
Main Authors: P. V. Kavitha, V. Karpagam
Format: Article
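Language: English
Published: Taylor & Francis Group 2025-07-01
Series: Automatika
Subjects: Deep learning, greedy search and beam search, convolutional neural network, ResNet50, Hybrid LSTM–GRU
Online Access: https://www.tandfonline.com/doi/10.1080/00051144.2025.2485695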
author P. V. Kavitha
V. Karpagam
collection DOAJ
description Image captioning is a fast-evolving research area that combines two domains: Natural Language Processing and Computer Vision. Generating appropriate captions is difficult because an image and its background can portray many activities. To address this difficulty, the proposed work employs a ResNet50 encoder for image feature extraction and a hybrid LSTM–GRU decoder optimized with beam search to produce text descriptions. Beam search is a search technique that improves caption quality and consistency by exploring several candidate paths in the search space and choosing the most likely one based on a score or probability. The experiments compare the CNN encoders VGG16, InceptionV3, ResNet50 and DenseNet121, each paired with an LSTM language model, in terms of loss and accuracy on the Flickr8k dataset. To further improve caption quality, the proposed method uses ResNet50 + hybrid LSTM–GRU with beam search, which achieves an accuracy of 0.8932 and a loss of 0.4013 on the Flickr8k dataset. With a BLEU score of 0.6034, the proposed method outperforms the aforementioned encoder–decoder models decoded with greedy search.
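The difference between greedy search and beam search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `beam_search` and `toy_next_log_probs`, the `<start>`/`<end>` tokens, and the hand-written probability table are all assumptions made for illustration. In the paper's setting the next-word scores would presumably come from the hybrid LSTM–GRU decoder conditioned on ResNet50 features; here a tiny lookup table stands in for that model, and no length normalization is applied.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence = Tuple[str, ...]


def toy_next_log_probs(prefix: Sequence) -> Dict[str, float]:
    """Hypothetical stand-in for a caption decoder: maps a prefix to
    log-probabilities of the next token (values are made up)."""
    table = {
        ("<start>",): {"a": 0.6, "the": 0.4},
        ("<start>", "a"): {"dog": 0.55, "cat": 0.45},
        ("<start>", "the"): {"dog": 0.9, "cat": 0.1},
    }
    # Any prefix not in the table is assumed to end the caption.
    probs = table.get(prefix, {"<end>": 1.0})
    return {token: math.log(p) for token, p in probs.items()}


def beam_search(
    next_log_probs: Callable[[Sequence], Dict[str, float]],
    beam_width: int,
    max_len: int,
    start: str = "<start>",
    end: str = "<end>",
) -> Tuple[Sequence, float]:
    """Return the highest-scoring sequence and its summed log-probability.

    With beam_width=1 this reduces to greedy search.
    """
    beams: List[Tuple[Sequence, float]] = [((start,), 0.0)]
    finished: List[Tuple[Sequence, float]] = []

    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = []
        for seq, score in beams:
            for token, log_p in next_log_probs(seq).items():
                candidates.append((seq + (token,), score + log_p))

        # Keep only the beam_width best expansions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == end:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break

    # Hypotheses cut off by max_len are kept as fallbacks.
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])


greedy_seq, greedy_score = beam_search(toy_next_log_probs, beam_width=1, max_len=5)
beam_seq, beam_score = beam_search(toy_next_log_probs, beam_width=2, max_len=5)
print("greedy:", " ".join(greedy_seq), round(math.exp(greedy_score), 2))
print("beam:  ", " ".join(beam_seq), round(math.exp(beam_score), 2))
```

In this toy table, greedy search commits to the locally most likely first word ("a") and ends with a caption of probability 0.33, while a beam of width 2 keeps "the" alive long enough to find a caption of probability 0.36. Real decoders often also normalize scores by caption length, which this sketch omits.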
format Article
id doaj-art-2f1c857ee67045e39270ee3d870eb550
institution DOAJ
issn 0005-1144
1848-3380
language English
publishDate 2025-07-01
publisher Taylor & Francis Group
record_format Article
series Automatika
spelling doaj-art-2f1c857ee67045e39270ee3d870eb550
2025-08-20T03:17:43Z
eng
Taylor & Francis Group
Automatika
0005-1144
1848-3380
2025-07-01
66
3
394
410
10.1080/00051144.2025.2485695
Image captioning deep learning model using ResNet50 encoder and hybrid LSTM–GRU decoder optimized with beam search
P. V. Kavitha 0
V. Karpagam 1
Department of Artificial Intelligence and Data Science, Sri Ramakrishna Engineering College, Coimbatore, India
Department of Artificial Intelligence and Data Science, Sri Ramakrishna Engineering College, Coimbatore, India
title Image captioning deep learning model using ResNet50 encoder and hybrid LSTM–GRU decoder optimized with beam search
topic Deep learning
greedy search and beam search
convolutional neural network
ResNet50
Hybrid LSTM–GRU
url https://www.tandfonline.com/doi/10.1080/00051144.2025.2485695