Image captioning deep learning model using ResNet50 encoder and hybrid LSTM–GRU decoder optimized with beam search
Image captioning is a fast-evolving research area that integrates two domains: Natural Language Processing and Computer Vision. Generating appropriate captions is difficult because of the many activities that can be portrayed in an image and its background. To mitigate these drawbacks, the envisioned...
| Main Authors: | P. V. Kavitha, V. Karpagam |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Taylor & Francis Group, 2025-07-01 |
| Series: | Automatika |
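| Subjects: | Deep learning; greedy search and beam search; convolutional neural network; Resnet50; Hybrid LSTM–GRU |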
| Online Access: | https://www.tandfonline.com/doi/10.1080/00051144.2025.2485695 |
| _version_ | 1849702240398344192 |
|---|---|
| author | P. V. Kavitha; V. Karpagam |
| author_facet | P. V. Kavitha; V. Karpagam |
| author_sort | P. V. Kavitha |
| collection | DOAJ |
| description | Image captioning is a fast-evolving research area that integrates two domains: Natural Language Processing and Computer Vision. Generating appropriate captions is difficult because of the many activities that can be portrayed in an image and its background. To address this difficulty, the proposed work employs a ResNet50 encoder for image feature extraction and a hybrid LSTM–GRU decoder optimized with beam search to produce text descriptions. Beam search is a search technique that improves caption quality and consistency by exploring several candidate paths in the search space and choosing the most likely one according to a score or probability. The study compares the CNN models VGG16, InceptionV3, ResNet50 and DenseNet121, each paired with an LSTM language model, in terms of loss and accuracy on the Flickr8k dataset. To further improve caption quality, the proposed method uses ResNet50 + hybrid LSTM–GRU with beam search, which achieves an accuracy of 0.8932 and a lower loss of 0.4013 on the Flickr8k dataset. The proposed method outperforms the aforementioned encoder–decoder models with greedy search, reaching a BLEU score of 0.6034. |
| format | Article |
| id | doaj-art-2f1c857ee67045e39270ee3d870eb550 |
| institution | DOAJ |
| issn | 0005-1144; 1848-3380 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | Taylor & Francis Group |
| record_format | Article |
| series | Automatika |
| spelling | doaj-art-2f1c857ee67045e39270ee3d870eb550; 2025-08-20T03:17:43Z; eng; Taylor & Francis Group; Automatika; ISSN 0005-1144, 1848-3380; 2025-07-01; vol. 66, no. 3, pp. 394–410; doi:10.1080/00051144.2025.2485695; P. V. Kavitha and V. Karpagam, Department of Artificial Intelligence and Data Science, Sri Ramakrishna Engineering College, Coimbatore, India; https://www.tandfonline.com/doi/10.1080/00051144.2025.2485695; keywords: Deep learning; greedy search and beam search; convolutional neural network; Resnet50; Hybrid LSTM–GRU |
| title | Image captioning deep learning model using ResNet50 encoder and hybrid LSTM–GRU decoder optimized with beam search |
| topic | Deep learning; greedy search and beam search; convolutional neural network; Resnet50; Hybrid LSTM–GRU |
| url | https://www.tandfonline.com/doi/10.1080/00051144.2025.2485695 |
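
The description above contrasts greedy search with beam search, which keeps several candidate captions alive and picks the most probable one. The following is a minimal sketch of that idea only, not the authors' implementation: `TOY_MODEL`, `next_token_probs` and `beam_search` are hypothetical names, and the hand-written probability table stands in for a trained decoder conditioned on image features. Scores are summed log-probabilities with no length normalization, which is an assumption of this sketch.

```python
import math

# Hypothetical toy "decoder": maps the tokens generated so far to a
# probability distribution over the next token. A real captioning model
# would also condition on image features (e.g. from a ResNet50 encoder).
TOY_MODEL = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"dog": 0.4, "cat": 0.35, "<end>": 0.25},
    ("the",): {"dog": 0.9, "<end>": 0.1},
    ("a", "dog"): {"<end>": 1.0},
    ("a", "cat"): {"<end>": 1.0},
    ("the", "dog"): {"<end>": 1.0},
}


def next_token_probs(prefix):
    """Return the next-token distribution for a prefix (toy lookup)."""
    return TOY_MODEL.get(tuple(prefix), {"<end>": 1.0})


def beam_search(beam_width, max_len=5):
    """Return the highest-scoring token sequence found with the given width.

    beam_width=1 reduces to greedy search: only the single best
    continuation is kept at every step.
    """
    beams = [(0.0, [])]  # each beam is (summed log-probability, tokens)
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            # Finished captions are carried over unchanged.
            if tokens and tokens[-1] == "<end>":
                candidates.append((score, tokens))
                continue
            # Expand an unfinished caption by every possible next token.
            for token, prob in next_token_probs(tokens).items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the beam_width most probable candidates.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
        if all(tokens[-1] == "<end>" for _, tokens in beams):
            break
    return beams[0][1]


greedy_caption = beam_search(beam_width=1)
beam_caption = beam_search(beam_width=2)
print(greedy_caption, beam_caption)
```

In this toy table, greedy search commits to "a" because it is the most probable first token and ends with a caption of total probability 0.24, while a beam of width 2 also keeps "the" and finds a caption of total probability 0.36. This illustrates why exploring several paths can yield a more likely caption than the step-by-step best choice.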